RDF Industries
SOHO AI Build · Part 5 of 9 · By Josh Rogers

Troubleshooting Docker & Linux Services on a Small Business Server

Debugging Docker and Linux services on a small-business server for the case that fools everyone: the container is "up" while the app inside it is broken.

Part 5 of a 9-part series teaching CompTIA A+ (Core 1 / 220-1101 and Core 2 / 220-1102) through a real build: a private, local AI workstation/server for a small business.


The job: a container that won't stay up

Monday morning. Someone rebooted the server over the weekend (power work in the building) and now the AI workstation isn't responding. You SSH in, check the application, and get nothing. The frontend is dead. The backend log just says "connection refused."

You run docker compose ps and watch a container cycle through "starting" → "unhealthy" → "restarting" in a loop every thirty seconds. The restart counter in the STATUS column is climbing. The container keeps dying and coming back up broken.

This is a real failure pattern. The path to the fix teaches you three distinct things: how to read Linux logs, how Docker's service model works, and how the host's service manager relates to the containers it owns. By the time this container is healthy again you'll have a repeatable troubleshooting workflow that applies to any Linux service, containerized or not.


📘 Objectives covered (220-1102)

This article maps to the following CompTIA A+ exam objectives. If you're studying for the exam, these are your anchors; if you're here for the build, skim past; the scenario explains itself.

Core 2 (220-1102)

  • 1.x, Operating Systems (Linux): Linux command-line tools (ps, ss, df, systemctl, journalctl); service management and unit files; reading logs to diagnose failures.
  • 3.x, Software Troubleshooting: application won't start; service restart loops; reading error output to form a theory of probable cause; testing that theory without changing multiple variables at once.

Concepts taught below: systemd unit files and dependency ordering, Docker images / containers / volumes / Compose, reading journalctl and container logs, the difference between Compose-managed and systemd-managed services, and a structured log-first troubleshooting approach.

Concepts: two service managers, one stack

Linux service management: systemd (1102 1.x)

On any modern Linux server, systemd is the init system, process 1, the ancestor of every other process. When the machine boots, systemd reads a set of unit files (typically in /etc/systemd/system/) and starts the services described there in dependency order.

A unit file has three sections:

[Unit]
Description=App backend service
Requires=docker.service network-online.target
After=docker.service network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
WorkingDirectory=/opt/app
ExecStart=/usr/bin/docker compose up -d backend
ExecStop=/usr/bin/docker compose stop backend

[Install]
WantedBy=multi-user.target

[Unit] declares dependencies. Requires=docker.service means "if Docker can't be started, don't start me either." After= controls ordering: this unit won't start until the listed units have finished starting. Requires= alone does not imply ordering, which is why both lines appear. Getting Requires/After wrong is one of the most common sources of startup failures on a real box.

[Service] controls the process. Type=oneshot + RemainAfterExit=yes is the right pattern when ExecStart launches something and exits, like docker compose up -d, which starts containers in the background then returns. Without RemainAfterExit=yes, systemd marks the unit inactive (dead) as soon as the command exits, even though the containers it started are still running. Restart= is omitted here: it is of little use on a Type=oneshot unit (systemd rejects Restart=always and Restart=on-success for that type), and container-level restarts are Docker Compose's job (via restart: unless-stopped in the compose file). The systemd unit just brings the stack up at boot.

The essential systemctl commands:

systemctl status app-backend.service    # current state + last log lines
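systemctl start app-backend.service     # run it now
systemctl stop app-backend.service      # stop it now
systemctl restart app-backend.service   # stop, then start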
systemctl enable app-backend.service    # auto-start on boot (enable ≠ start)
systemctl disable app-backend.service   # remove boot symlink

Reading logs with journalctl (1102 1.x)

systemd captures every service's stdout/stderr in the journal, which journalctl queries:

journalctl -u app-backend.service          # all logs for this unit
journalctl -u app-backend.service -n 50    # last 50 lines
journalctl -u app-backend.service -f       # follow (like tail -f)
journalctl -p err -u app-backend.service   # priority "err" and more severe only

The -u <unit> filter is the most important habit to form. When a service fails to start, the journal almost always contains the actual error one or two lines before the "Failed to start" message systemd adds on top.
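When the journal is long, the same filtering can be scripted. The following is a minimal sketch, assuming the output of journalctl -u app-backend.service -o json (one JSON object per line, with MESSAGE and PRIORITY fields) has already been captured as text. The function name and the sample entries are hypothetical and only illustrate the shape of the data.

```python
import json
from typing import List


def error_messages(journal_json_lines: str, max_priority: int = 3) -> List[str]:
    """Return MESSAGE text for entries at `max_priority` or more severe.

    syslog priorities run from 0 (emerg) to 7 (debug); 3 is "err".
    """
    messages = []
    for line in journal_json_lines.splitlines():
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # journalctl emits PRIORITY as a string; entries without one are skipped.
        priority = entry.get("PRIORITY")
        if priority is None or int(priority) > max_priority:
            continue
        message = entry.get("MESSAGE")
        # Non-UTF-8 messages are exported as byte arrays; this sketch ignores them.
        if isinstance(message, str):
            messages.append(message)
    return messages


# Made-up sample shaped like `journalctl -o json` output.
sample = "\n".join([
    json.dumps({"PRIORITY": "6", "MESSAGE": "Starting App backend service..."}),
    json.dumps({"PRIORITY": "3", "MESSAGE": "example error line from the unit"}),
    json.dumps({"PRIORITY": "3", "MESSAGE": "Failed to start App backend service."}),
])
print(error_messages(sample))
```

The entry just before the "Failed to start" message is usually the one worth reading first.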

Docker: images, containers, volumes, and Compose (1102 1.x / 3.x)

Four terms the exam and real life both use:

  • Image: a read-only snapshot (filesystem plus metadata). A template. You don't run an image; you create a container from one.
  • Container: a running instance of an image. Lightweight, isolated via Linux kernel namespaces. Containers are ephemeral: delete one and the data written inside it disappears.
  • Volume: persistent storage outside the container filesystem. Named volumes (Docker-managed) and bind mounts (a host path mounted into the container) both survive container deletion.
  • Compose: defines and runs a multi-container application from a single docker-compose.yml. One file describes the whole service graph: containers, ports, volumes, healthchecks, and start order.

The container lifecycle commands:

docker ps                          # running containers
docker ps -a                       # all containers, including stopped ones
docker logs <container>            # stdout/stderr from the container
docker logs <container> --tail 50 -f   # follow last 50 lines
docker inspect <container>         # full metadata (mounts, env, network)
docker exec -it <container> bash   # shell inside a running container

The Compose commands:

docker compose ps              # status of all services in the compose file
docker compose up -d           # start everything in background
docker compose down            # stop and remove containers (NOT volumes)
docker compose down -v         # also remove named volumes (destructive)
docker compose logs -f backend # follow logs for one service
docker compose restart backend # restart one service

Key distinction: docker compose down removes containers but leaves named volumes. docker compose down -v removes volumes too, so you lose the data. Bind-mounted host paths are never touched by either command.

Compose vs. systemd: two models in one stack

A well-structured service stack uses both, and they play different roles:

  • Docker Compose manages the containers: images, ports, dependencies, healthchecks, environment variables.
  • systemd manages the host services, including units that wrap docker compose up. On a production install, systemd brings the compose services up at boot; restarting a crashed container remains Compose's job (restart: unless-stopped).

The unit file above is exactly this: ExecStart=/usr/bin/docker compose up -d backend. systemd owns "should this service be running?"; Compose owns "how do the containers look?".

The trap is when both try to own the same container. If a systemd unit launches a container via Compose and an operator also runs docker compose up directly, you can end up with duplicate containers, port collisions, or a confused state where systemd reports stopped but Docker shows running.


Hands-on walkthrough: log → root cause → fix

Step 1: Read the container logs first

The restart-looping container is app-backend. Before touching anything:

docker logs app-backend --tail 100

You'll see some startup output followed by the actual error before each crash. On this specific failure, the last few lines before exit read:

FATAL: database "appdb" does not exist

Or perhaps:

could not connect to server: Connection refused
Is the server running on host "postgres" and accepting
TCP/IP connections on port 5432?

Either of these points at the same root problem: the backend container is starting before the database container is ready, or the database volume was wiped and the initialization scripts haven't run. Log first; the error almost always tells you what happened.
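The "log first" habit can be sketched as a small lookup from error text to a first theory. This is illustrative only: the two patterns come from the log excerpts above, the theories restate the article's reading of them, and the names HYPOTHESES and first_hypothesis are hypothetical.

```python
import re
from typing import Optional

# Patterns from the two excerpts above, each paired with a first theory to test.
HYPOTHESES = [
    (re.compile(r'database ".*" does not exist'),
     "database missing: data directory wiped or init scripts have not run"),
    (re.compile(r"connection refused", re.IGNORECASE),
     "database not reachable: backend started too early, or wrong host/port"),
]


def first_hypothesis(log_text: str) -> Optional[str]:
    """Scan from the newest line backwards; return the first matching theory."""
    for line in reversed(log_text.splitlines()):
        for pattern, theory in HYPOTHESES:
            if pattern.search(line):
                return theory
    return None


print(first_hypothesis('starting up\nFATAL: database "appdb" does not exist'))
```

A theory from a lookup like this is a starting point to test, one variable at a time, not a diagnosis.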

Step 2: Check the dependency status with Compose

docker compose ps
NAME            IMAGE               STATUS
app-postgres    postgres:15-alpine  Up 2 minutes (healthy)
app-redis       redis:7-alpine      Up 2 minutes (healthy)
app-backend     app-backend:latest  Restarting (1) 30 seconds ago
app-frontend    app-frontend:latest Up 2 minutes

The database shows (healthy), so it's up. But is it actually listening?

docker compose exec postgres pg_isready -U appuser -d appdb   # exec takes the service name, not the container name
/var/run/postgresql:5432 - accepting connections

It's accepting connections. So the database is alive but the backend can't reach it. Check the backend's environment inside the container:

docker inspect app-backend | grep -A 5 '"Env"'

Look for POSTGRES_HOST and POSTGRES_PORT. If POSTGRES_HOST is set to localhost, that's the problem: inside the container network, the database isn't localhost; it's postgres (the Compose service name, which Docker resolves on the internal network). If it says postgres, dig deeper.
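Grepping the inspect output works, but the same check can be made against the JSON directly. A minimal sketch, assuming the text of docker inspect app-backend has been captured (it is a JSON array whose first element carries Config.Env as a list of KEY=VALUE strings). The helper names are hypothetical; the variable names POSTGRES_HOST and POSTGRES_PORT are the ones used above.

```python
import json
from typing import Dict, Optional


def env_from_inspect(inspect_json: str) -> Dict[str, str]:
    """Turn the Config.Env list from `docker inspect` into a dict."""
    containers = json.loads(inspect_json)
    env_list = containers[0]["Config"]["Env"] or []
    return dict(item.split("=", 1) for item in env_list if "=" in item)


def db_host_problem(env: Dict[str, str]) -> Optional[str]:
    """Return a description of the problem, or None if the host looks sane."""
    host = env.get("POSTGRES_HOST")
    if host is None:
        return "POSTGRES_HOST is not set"
    if host in ("localhost", "127.0.0.1", "::1"):
        return f"POSTGRES_HOST={host} points at the backend container itself"
    return None


# Made-up sample shaped like `docker inspect` output.
sample = json.dumps([{"Config": {"Env": ["POSTGRES_HOST=localhost", "POSTGRES_PORT=5432"]}}])
print(db_host_problem(env_from_inspect(sample)))
```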

Step 3: The bind-mount wipe case

If the postgres container shows healthy but the backend gets "database does not exist," check when the database directory was last populated. When a postgres container starts with an empty data directory, it runs its initialization scripts and creates the database; on a directory that already contains data, the init scripts are skipped. So a data directory that was wiped between runs forces postgres to initialize from scratch on the next start, which takes noticeably longer than a normal start.

Here's the actual problem that hit this build: postgres was using a bind mount (a host path, not a named volume) for its data. A bind-mounted path survives docker compose down, but not a manual sudo rm -rf /var/lib/app/pg/data/*, which a "clean reinstall" step ran earlier in the day. That wiped the data directory.

After the wipe, postgres started with an empty directory, ran its init scripts, and came up correctly. But the backend tried to connect while postgres was still in its starting state, failed, and crashed before the healthcheck passed.

depends_on: condition: service_healthy guards against this: Compose waits until the dependency's healthcheck passes before starting the dependent service. But it only controls startup order on the first docker compose up. If the backend restarts independently (via restart: unless-stopped) after a crash, Compose doesn't re-check the dependency. The real fix: make the application retry the database connection with a backoff rather than crashing on the first failure.
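The retry-with-backoff idea can be sketched as follows. This is a minimal illustration, not the build's actual code: connect stands in for whatever call opens the database connection, and catching OSError (which covers ConnectionRefusedError) is an assumption; a real database driver raises its own exception type, which would be caught instead. The sleep parameter exists only so the delay can be replaced in tests.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_backoff(
    connect: Callable[[], T],
    attempts: int = 6,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `connect` until it succeeds, doubling the wait after each failure."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except OSError as exc:  # ConnectionRefusedError is a subclass of OSError
            if attempt == attempts:
                raise  # out of retries: let the process fail loudly
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise ValueError("attempts must be at least 1")
```

With the defaults, the backend waits 1, 2, 4, 8, and 16 seconds between its six attempts, which rides out a slow database start instead of turning it into a restart loop.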

Step 4: Check the host systemd unit

While you're here, also check the systemd unit that's supposed to be managing this:

systemctl status app-backend.service

In the failure scenario from this build, the output showed:

● app-backend.service - App backend service
     Loaded: loaded (/etc/systemd/system/app-backend.service; enabled; ...)
     Active: active (exited) since Mon 2026-06-29 08:02:14 UTC; 2h ago

active (exited) for a Type=oneshot RemainAfterExit=yes unit means: the ExecStart command ran and exited zero, so systemd considers this service "running." But the container it started may have crashed and restarted multiple times since then. systemd isn't watching the container process, only whether the docker compose up -d command succeeded.

This is a common confusion point: systemd reports the unit healthy because the compose command succeeded at boot. Docker Compose is separately managing the container restart loop. Both tools think they're in control, and neither is wrong, but the operator has to look at both to understand actual state.

Step 5: Port collision diagnosis

A second failure pattern from this build: after a server rebuild on a machine running other services, the postgres container started but the backend couldn't connect on port 5432. docker compose ps showed postgres healthy. But:

ss -tlnp | grep 5432
LISTEN  0   128   0.0.0.0:5432   0.0.0.0:*   users:(("postgres",pid=8812,...))

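A native postgres process on the host was already holding 5432, not the container, so anything that connected through the host port reached the wrong server. Docker does not quietly move an explicit mapping like -p 5432:5432 to another port: a conflicting bind normally fails with an "address already in use" or "port is already allocated" error, and a random host port is assigned only when the host side of the mapping is left out. Check docker port app-postgres to see what was actually published.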

Confirm which process holds the port:

sudo ss -tlnp | grep 5432    # sudo so -p can show processes owned by other users
# or
sudo lsof -i :5432

ss ships with iproute2, which is installed by default on modern distributions; lsof may require apt install lsof on modern Ubuntu/Debian. If it's not the container, you have two choices: stop the conflicting host service, or shift the container's published port in the compose file:

ports:
  - "5442:5432"    # publish container's 5432 on host's 5442

Note that only the host port changes. Connections between containers on the same Compose network still use the container port (5432), because they're not going through the host's port mapping at all.
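Before picking a new host port, it helps to confirm nothing is already listening on it. A minimal sketch of that pre-check, using only the standard library; the function name is hypothetical. It only answers "is something accepting TCP connections on this port?"; identifying which process holds the port still takes ss or lsof.

```python
import socket


def something_listening(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port is accepted."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return probe.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for candidate in (5432, 5442):
        state = "in use" if something_listening(candidate) else "free"
        print(f"host port {candidate}: {state}")
```

A service bound to a specific non-loopback address would not be seen by the default host value, so treat a "free" result as a hint rather than proof.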

Step 6: Check disk and process state

df -h

Look for any filesystem near 100%. A full disk silently kills a database: postgres can't write WAL, and the symptom is "postgres is up but not accepting writes," showing as a transaction error rather than a connection error.
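The same disk check can be scripted so it runs before a deploy or from a cron job. A minimal sketch using the standard library; the 90% threshold and the function names are assumptions, and the percentage can differ slightly from df's Use% column because df accounts for reserved blocks.

```python
import shutil


def percent_used(path: str = "/") -> float:
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def nearly_full(path: str = "/", threshold: float = 90.0) -> bool:
    """True when usage is at or above `threshold` percent."""
    return percent_used(path) >= threshold


if __name__ == "__main__":
    # Point this at the filesystem that holds the database files.
    print(f"/ is {percent_used('/'):.1f}% used; nearly full: {nearly_full('/')}")
```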

ps aux | grep '[p]ostgres'   # the [p] keeps grep from matching its own process
ps aux | grep '[u]vicorn'    # or whatever the backend process is

If a process you expect isn't in the list, the container isn't running despite what docker compose ps says, or the process inside crashed before the container restarted.

Step 7: The fix and verification

The root cause in the bind-mount case: wiped data directory → slow postgres init → backend started before healthcheck passed → backend crashed → restarted immediately → repeated. The fix has two parts:

  1. Wait for postgres to be genuinely healthy before bringing the backend up:
   docker compose stop backend
   # wait for postgres healthcheck to show (healthy)
   docker compose ps postgres
   docker compose start backend
  2. Long-term: ensure the application handles a short DB connection delay at startup with a retry loop, so a race on boot doesn't cause a permanent restart loop.

Verification: confirm the stack is actually clean

After the fix, four levels of confirmation before calling it done:

1. All Compose services are up and healthy:

docker compose ps

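Every service that defines a healthcheck should show Up ... (healthy), not just Up. A service with no healthcheck, like the frontend in the earlier listing, can only ever show Up. A container that's running but whose app inside is broken shows Up without (healthy), typically (unhealthy) or (health: starting). The healthcheck distinction is the difference between "the container process is alive" and "the application inside is responding correctly."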
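This check can also be automated. The sketch below assumes the output of docker compose ps --format json has been captured as text, and that each entry carries Service, State, and Health fields; field names and the overall shape (a JSON array in some Compose versions, one object per line in others) should be verified against the installed version. The function names and the sample rows are hypothetical.

```python
import json
from typing import Any, Dict, List


def parse_compose_ps(output: str) -> List[Dict[str, Any]]:
    """Accept either a JSON array or newline-delimited JSON objects."""
    text = output.strip()
    if not text:
        return []
    if text.startswith("["):
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def unhealthy_services(output: str) -> List[str]:
    """List services that are not running, or running but not healthy."""
    problems = []
    for container in parse_compose_ps(output):
        state = container.get("State", "")
        health = container.get("Health", "")  # empty when no healthcheck is defined
        if state != "running" or health not in ("", "healthy"):
            name = container.get("Service", "?")
            problems.append(f"{name}: state={state}, health={health or 'none'}")
    return problems


# Made-up sample mirroring the listing from Step 2.
sample = "\n".join(json.dumps(row) for row in [
    {"Service": "postgres", "State": "running", "Health": "healthy"},
    {"Service": "backend", "State": "restarting", "Health": ""},
    {"Service": "frontend", "State": "running", "Health": ""},
])
print(unhealthy_services(sample))
```

An empty list means every service is running and none of the defined healthchecks is failing.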

2. Systemd units match expectations:

A full stack has a systemd unit per Compose-wrapped service group; check them together.

systemctl status app-backend.service app-frontend.service app-model-serving.service

On a production install, all three should show active. On a development box where Compose is run directly (not through systemd), you may intentionally see some units as inactive (dead). That's expected when Compose is the live process manager and systemd is not. Know which mode your box is in.

3. Logs are clean:

journalctl -u app-backend.service -n 20
docker compose logs --tail 20 backend

No repeating errors, no "starting" → "failed" cycles. If docker compose logs still shows a crash loop, the fix didn't take.

4. Application endpoint responds:

curl -s http://localhost:8000/health/live
# {"status": "ok"}

HTTP 200 from the health endpoint confirms the application is alive and connected to its dependencies. Connection refused = container isn't listening. 503 = app is up but a dependency is unhealthy.
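The three outcomes described above can be told apart in a short probe. A minimal sketch using the standard library's http.client; the function name and the wording of the returned strings are hypothetical, and the reading of 503 follows the paragraph above.

```python
import http.client
from urllib.parse import urlsplit


def probe(url: str, timeout: float = 3.0) -> str:
    """GET a plain-HTTP health URL and describe the result in one line."""
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(parts.hostname, parts.port or 80, timeout=timeout)
    try:
        conn.request("GET", parts.path or "/")
        status = conn.getresponse().status
    except ConnectionRefusedError:
        return "connection refused: nothing is listening on that port"
    except (OSError, http.client.HTTPException) as exc:
        return f"unreachable: {exc}"
    finally:
        conn.close()
    if status == 200:
        return "200: application is answering"
    if status == 503:
        return "503: app is up but reports an unhealthy dependency"
    return f"{status}: unexpected status"


if __name__ == "__main__":
    print(probe("http://127.0.0.1:8000/health/live"))
```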


🎯 What the exam asks

CompTIA frames this material across the Linux and software-troubleshooting domains. Know these cold:

  • Linux service commands: systemctl status, start, stop, restart, enable, disable are all testable. Know that enable ≠ start: enable sets the service to start at boot; start runs it now.
  • Reading logs: journalctl -u <unit> is the canonical way to read a systemd service's output. The exam may describe a failing service and ask which command shows why: the answer is journalctl.
  • Service troubleshooting methodology: the exam favors a systematic approach: identify the symptom → check the status → read the logs → form a theory → test → verify. Never change two things at once before retesting.
  • Common Linux CLI commands: ps aux (list processes), ss -tlnp (listening TCP sockets), and df -h (disk usage) all appear in the troubleshooting domain. ps and df are the most commonly tested.
  • Docker container vs. image vs. volume: the exam tests whether you know that deleting a container does not delete its volume, and that images are read-only templates. docker ps -a shows all containers; docker ps shows only running ones.
  • Application won't start (methodology): the exam expects you to check error logs, verify dependencies are running, confirm ports aren't already in use, and check for permission or resource issues (disk full), in roughly that order.
  • Port conflicts: ss -tlnp | grep <port> or netstat -tlnp | grep <port> identifies what's holding a port. The exam tests both commands. netstat is legacy (net-tools, often absent on modern Ubuntu/Debian); ss is the current replacement and is installed by default. Know both for the exam.

Common pitfalls (most of these are from the real build)

Bind-mount data wiped on "clean" reinstall. Named Docker volumes survive docker compose down; a bind-mounted host path does too, but not a manual rm -rf. On this build, postgres data was bind-mounted to an NVMe path for performance and got wiped during a reinstall step. Named volumes are the safer default for critical data; if you use bind mounts, document the path and protect it explicitly.

Port collisions on shared hosts. A container's published port conflicts with a native process. The symptom looks like an application-layer error, and you can waste twenty minutes on app config before running ss -tlnp. Always confirm the host port is free before chasing application-level causes.

Compose and systemd both owning the same service. If systemd units are driving the stack via docker compose up, don't also run docker compose up manually. You'll get duplicate containers, name collisions, or two processes competing for the same port. Production box: use systemd. Dev box: use Compose directly. Pick one and be consistent.

Missing dependency ordering. depends_on: condition: service_healthy only works if the dependency has a healthcheck defined. A plain depends_on with no condition is satisfied the moment the container process starts, before the database has loaded its data files and is accepting connections. Always pair them:

postgres:
  healthcheck:
    test: ["CMD-SHELL", "pg_isready -U appuser -d appdb"]
    interval: 10s
    retries: 5

backend:
  depends_on:
    postgres:
      condition: service_healthy   # wait for the healthcheck above to pass

systemd says active (exited) when the container is restart-looping. A Type=oneshot RemainAfterExit=yes unit reports healthy after its ExecStart returns zero, even if the container it launched later crashes and loops. systemctl status is not the right tool for checking container health; use docker compose ps.

Not reading the logs at all. The most common mistake. Ten seconds of docker logs <container> --tail 50 before anything else resolves most of the above in under a minute. Restarting first and reading later is how a five-minute fix becomes an hour of trial and error.


Recap + what's next

You walked a container restart loop from symptom to root cause without touching anything you didn't understand first. The tools: docker compose ps for container state, docker logs for container output, journalctl -u for systemd unit output, systemctl status for unit state, ss -tlnp for port conflicts, df -h for disk pressure, ps aux for process confirmation. The failures: a bind-mount wiped on reinstall, a port collision with a host service, and a compose/systemd confusion that looked healthy at the host level while broken at the container level.

The underlying discipline is the same six-step troubleshooting methodology the A+ exam tests (identify the symptom, form a theory, test it, fix it, verify, document), applied to a concrete real-world scenario. We'll make that framework explicit in the series capstone.

But before that, there's a more pressing concern. The stack is running locally, the ports are up, the logs are clean, and the box is reachable from the LAN. Every service that listens on a network port is a potential entry point. A server that runs useful AI tools for the office is also a target.

Next up: Part 6: "Endpoint Security Basics for Local AI Tools." The AI server is powerful and local: what stops an unauthorized user from reaching it? We'll cover the 1102 security domain's core concepts (least privilege, host firewall, authentication factors, encryption at rest) by hardening the workstation we just got running. See you there.