Symptom → Root Cause: A Real Troubleshooting Workflow

Part 9 of a 9-part series teaching CompTIA A+ (Core 1 / 220-1101 and Core 2 / 220-1102) through a real build: a private, local AI workstation/server for a small business.

The job: we've been here before

You've already watched it happen twice in this series. In Part 4, the inference server died with "out of memory" on a card that had gigabytes to spare. In Part 5, a container sat in a restart loop for twenty minutes while the database it needed was (as far as Docker was concerned) healthy. Both times the fix arrived, the service came up, and we moved on. What we didn't do was name the framework behind the work.

That's this article.

The CompTIA A+ exam calls it the six-step troubleshooting methodology (1101 5.1). It's the single most testable concept in the exam's troubleshooting domain, and it earns its place: it's what keeps a five-minute fix from becoming a three-hour hole when you're tired, on-site, and the service is down in front of a client.

We'll name all six steps, then walk the restart-loop case from Part 5 through them explicitly, showing how the structured approach gets there faster. Along the way you'll see how the installer's automated diagnostic tool encodes the same logic as code.

📘 Objectives covered (220-1101 / 220-1102)

> Core 1 (220-1101)
> - **5.1, Given a scenario, apply the best practice methodology to resolve problems:** the six-step troubleshooting process in order: identify the problem, establish a theory, test the theory, establish a plan and implement, verify full functionality and implement preventive measures, document findings.
>
> Concepts taught below: all six steps (definitions and how they connect), the standing "consider corporate policies first" rule, how automated preflight checks encode Step 3 as software, and how to close the loop with documentation.

Concepts: the six-step methodology

The A+ methodology is a sequence, not a personality. The sequence matters because it prevents the most common failure modes: jumping to a fix before you've identified the cause, changing multiple things at once, or stopping the moment the symptom disappears without verifying the system is actually healthy.
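
The ordering can be made concrete with a small model. This is an illustrative sketch only: the names `Step` and `next_step` are invented for this example and come from neither the exam nor the installer. It encodes the six steps defined below and the single backward edge in the sequence.

```python
from enum import IntEnum
from typing import Optional


class Step(IntEnum):
    """The six A+ troubleshooting steps, numbered in exam order."""
    IDENTIFY_PROBLEM = 1
    ESTABLISH_THEORY = 2
    TEST_THEORY = 3
    PLAN_AND_IMPLEMENT = 4
    VERIFY_AND_PREVENT = 5
    DOCUMENT = 6


def next_step(current: Step, theory_confirmed: bool = True) -> Optional[Step]:
    """Return the step that follows `current`, or None after documentation.

    The only backward edge is from Step 3: an unconfirmed theory sends
    you back to Step 2 instead of forward to a fix.
    """
    if current is Step.TEST_THEORY and not theory_confirmed:
        return Step.ESTABLISH_THEORY
    if current is Step.DOCUMENT:
        return None
    return Step(current + 1)
```

There is deliberately no path from Step 1 or Step 2 straight to Step 4: the model has no way to express "skip to the fix."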

Step 1, Identify the problem

Before touching anything, gather information:

Question the user: what is failing, when did it start, what does "broken" look like exactly?
Identify all symptoms. One symptom often masks others.
Determine what changed recently. A system that worked yesterday and doesn't today had something happen to it. "What changed" often contains the answer.
Back up data and config before making changes. A troubleshooting session that corrupts good data is worse than the original problem.

The output of Step 1 is a concrete symptom statement, not a guess about cause. "The backend container exits within 5 seconds of starting, log says 'connection refused' to the database, started after a host reboot Saturday" is a Step 1 output. "I think the database is broken" is not.
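
One way to keep a symptom statement honest is to give it a fixed shape. The sketch below is hypothetical (the `SymptomStatement` class and its field names are made up for illustration); the point is that it records observations and has no field for a cause.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SymptomStatement:
    """What was observed in Step 1. Deliberately has no 'cause' field."""
    what_fails: str
    observed_evidence: List[str] = field(default_factory=list)
    started_when: str = ""
    recent_changes: List[str] = field(default_factory=list)

    def is_concrete(self) -> bool:
        """A usable statement names the failure, some evidence, and a start time."""
        return bool(self.what_fails and self.observed_evidence and self.started_when)


# The Step 1 output quoted above, broken into its parts.
statement = SymptomStatement(
    what_fails="backend container exits within 5 seconds of starting",
    observed_evidence=["log says 'connection refused' to the database"],
    started_when="after a host reboot Saturday",
    recent_changes=["host reboot"],
)

# A guess about cause carries no evidence and no timeline.
guess = SymptomStatement(what_fails="I think the database is broken")
```

Here `statement.is_concrete()` is true and `guess.is_concrete()` is false, which is the distinction Step 1 is asking for.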

Step 2, Establish a theory of probable cause

Form a specific, testable hypothesis. Good theories are:

Based on the most common causes first. The exam calls this "question the obvious." A service that fails after a reboot because a dependency didn't come up in time is far more common than a code bug.
One thing at a time. A theory that says "it's either the database or the network or the configuration" is a list, not a theory.
Informed by research when needed. For an unfamiliar error, look it up before guessing.

Consider both internal causes (misconfiguration, ordering, software state) and external causes (network, dependencies, hardware).

Step 3, Test the theory to determine the cause

Take the smallest possible action that confirms or refutes your theory, without making the problem worse.

If confirmed: proceed to Step 4 with a plan.
If not confirmed: go back to Step 2 with a new theory. Do not start applying fixes for an unconfirmed cause.
If you can't confirm or refute: escalate.

The key discipline: change one variable at a time. If you change three things and the problem goes away, you don't know which change fixed it, and you've introduced two changes you don't understand into a running system.
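
The three outcomes can be sketched as a loop over theories, tested one at a time. This is a minimal illustration, assuming each theory comes with a read-only check; the function name `work_through_theories` and the outcome labels are invented for this example.

```python
from typing import Callable, Iterable, Optional, Tuple

# A theory pairs a one-sentence hypothesis with a read-only check.
# The check returns True (confirmed), False (refuted), or None (can't tell).
Theory = Tuple[str, Callable[[], Optional[bool]]]


def work_through_theories(theories: Iterable[Theory]) -> Tuple[str, Optional[str]]:
    """Test theories one at a time, most likely first.

    Returns ("confirmed", hypothesis) when a check confirms its theory,
    ("escalate", hypothesis) when a check is inconclusive, or
    ("new_theory_needed", None) when every theory was refuted.
    """
    for hypothesis, check in theories:
        outcome = check()
        if outcome is True:
            return ("confirmed", hypothesis)   # proceed to Step 4
        if outcome is None:
            return ("escalate", hypothesis)    # cannot confirm or refute
        # outcome is False: refuted, move on to the next theory (back to Step 2)
    return ("new_theory_needed", None)
```

Nothing in the loop applies a fix. Checks only observe, so a refuted theory leaves the system exactly as it was.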

Step 4, Establish a plan of action and implement the solution

With the root cause confirmed, plan before acting. The plan states what you're changing and why, considers the impact on other services, and accounts for how to reverse the change if it makes things worse.

Then implement, and here the standing rule applies: always consider corporate policies, procedures, and impacts before implementing changes. This phrasing appears verbatim in the A+ objectives. A change that requires a maintenance window, approval, or a ticket must not be made without those, even when the fix is obvious and the service is down.

Step 5, Verify full system functionality and implement preventive measures

After the fix, verify the system is actually healthy, not just that the immediate symptom is gone. Test from the user's perspective: does the thing they were trying to do now work?

Then implement preventive measures so the same failure is harder to repeat: a startup-ordering guard, a healthcheck, a monitoring alert, or a runbook entry. A fix that leaves the system in the same state that allowed the failure is half done.

Step 6, Document findings, actions, and outcomes

Write it down. This is the step people skip when they're tired and the service is back up, and it's the one that costs the most over time. Document: the initial symptom, the theories you formed and ruled out, the confirmed root cause, the exact changes made, the verification performed, and any preventive measures added.

The A+ exam treats this step as mandatory and tests it directly: "document findings" is always last, never optional.
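
If it helps to make the checklist mechanical, the items to document can be captured as a record that reports its own gaps. This is a hypothetical sketch: `IncidentRecord` and its field names are assumptions for illustration, not a format the exam or any ticketing system prescribes.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class IncidentRecord:
    """One field per item Step 6 asks you to write down."""
    symptom: str
    theories_considered: List[str]
    root_cause: str
    changes_made: List[str]
    verification: str
    preventive_measures: List[str]

    def missing_sections(self) -> List[str]:
        """Names of sections left empty, so gaps are visible before filing."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]
```

A record whose `missing_sections()` is non-empty is a reminder that the loop is not closed yet.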

The standing rule: consider corporate policies first

This applies across the entire process, not only at Step 4: always consider corporate policies, procedures, and impacts before implementing changes. On a homelab it feels bureaucratic. On a multi-user production system it's what distinguishes a competent technician from a liability.

Hands-on walkthrough: the restart loop through all six steps

The failure from Part 5: it's Monday morning, the server was rebooted over the weekend, and the backend container is cycling "starting" → "unhealthy" → "restarting" every thirty seconds.

Step 1: Identify the problem

docker logs app-backend --tail 50
# ...
# FATAL: could not connect to server: Connection refused
# Is the server running on host "postgres" and accepting connections on port 5432?

docker compose ps
# app-postgres    postgres:15-alpine    Up 2 minutes (healthy)
# app-backend     app-backend:latest    Restarting (7) 20 seconds ago

Symptom statement: Backend container exits on startup with a database connection error. Database container reports healthy. This started after a host reboot. No application code changed.

Back up the compose file before touching anything:

cp docker-compose.yml docker-compose.yml.bak

Step 2: Establish a theory

The most common cause: startup ordering. Theory: the backend started while the database was still initializing its data files. The connection failed; restart: unless-stopped caused the backend to restart immediately; the loop repeats because each restart attempt hits the same narrow window before postgres is ready.

Hold in reserve: a port conflict grabbed port 5432 before the container's mapping took effect.

Step 3: Test the theory

Without changing anything:

# Is postgres accepting connections right now?
docker compose exec app-postgres pg_isready -U appuser -d appdb
# /var/run/postgresql:5432 - accepting connections   ← it's ready now

# When did each container start?
docker inspect app-postgres --format '{{.State.StartedAt}}'
# 2026-06-29T08:01:47Z
docker inspect app-backend --format '{{.State.StartedAt}}'
# 2026-06-29T08:01:49Z   ← 2 seconds after postgres

# What does the dependency declaration say?
grep -A 3 "depends_on" docker-compose.yml

depends_on:
  app-postgres:
    condition: service_started   # ← not service_healthy

Found it. service_started means Docker only waits for the container process to start, not for the application inside to be ready. The backend started 2 seconds after the postgres process started, before postgres had loaded its data files and begun accepting connections.
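
The two-second gap was read off by eye, but the comparison is easy to script when you are checking several containers. A minimal sketch, assuming the timestamps are UTC strings ending in `Z` as shown in the output above; the function names are invented for this example.

```python
from datetime import datetime, timezone


def parse_started_at(value: str) -> datetime:
    """Parse a timestamp such as '2026-06-29T08:01:47Z' into an aware datetime.

    Fractional seconds, if present, are dropped: whole-second resolution
    is enough for spotting a startup race.
    """
    stamp = value.rstrip("Z").split(".")[0]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)


def start_gap_seconds(dependency_started: str, dependent_started: str) -> float:
    """Seconds between the dependency starting and the dependent starting."""
    gap = parse_started_at(dependent_started) - parse_started_at(dependency_started)
    return gap.total_seconds()


gap = start_gap_seconds("2026-06-29T08:01:47Z", "2026-06-29T08:01:49Z")
print(f"backend started {gap:.0f}s after postgres")
# prints: backend started 2s after postgres
```

A small gap does not prove a race on its own; it only shows the dependent had very little time to wait, which is consistent with the theory under test.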

Quick check on the secondary theory:

ss -tlnp | grep 5432
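# LISTEN  0  128  0.0.0.0:5432  ...  users:(("docker-proxy",...))

One listener, and it belongs to Docker's proxy for the container's published port, not to a postgres process running directly on the host. Port conflict ruled out. Primary theory confirmed.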

Step 4: Plan and implement

Plan: stop the backend to end the loop; wait for postgres to show (healthy); restart backend; update the compose file to condition: service_healthy.

Impact: the backend and frontend are already offline due to the loop, so there is no additional user impact. The compose file change requires a service restart.

Policy check: in a production environment with change control, a compose file edit is a change that needs a ticket. Emergency or expedited path: check your org's process.

Rollback: the compose change is a single line, and docker-compose.yml.bak preserves the original. If the service_healthy condition causes an unexpected issue, reverting means changing that line back to service_started and running docker compose up -d. Keep the .bak in place until a reboot confirms the fix holds.

docker compose stop app-backend

# Watch until postgres shows (healthy):
watch -n 2 'docker compose ps app-postgres'

docker compose start app-backend
docker compose logs -f app-backend   # confirm clean startup

Update the compose file:

depends_on:
  app-postgres:
    condition: service_healthy   # was: service_started

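This only works if postgres has a healthcheck defined, which it does in the compose file from Part 1. Without a healthcheck there is no fallback to service_started: Compose cannot evaluate the condition, and the dependent service fails to start with an error saying the dependency has no healthcheck configured.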
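
Because service_healthy depends on the target actually defining a healthcheck, it can be worth checking that pairing before relying on it. The sketch below assumes the compose file has already been parsed into a dictionary (for example with a YAML library, which is not shown); `unhealthy_dependencies` is a hypothetical helper, and it only inspects the compose file, not healthchecks baked into an image.

```python
from typing import Dict, List, Tuple


def unhealthy_dependencies(compose: Dict) -> List[Tuple[str, str]]:
    """Find (service, dependency) pairs that wait on service_healthy
    while the dependency defines no healthcheck in this compose file."""
    services = compose.get("services", {})
    problems = []
    for name, spec in services.items():
        depends_on = spec.get("depends_on", {})
        if not isinstance(depends_on, dict):
            continue  # the short list form carries no condition
        for dep, options in depends_on.items():
            wants_healthy = (options or {}).get("condition") == "service_healthy"
            has_check = "healthcheck" in services.get(dep, {})
            if wants_healthy and not has_check:
                problems.append((name, dep))
    return problems


# Illustrative input; the healthcheck command here is an assumption,
# not the one from the Part 1 compose file.
compose = {
    "services": {
        "app-postgres": {
            "image": "postgres:15-alpine",
            "healthcheck": {"test": ["CMD-SHELL", "pg_isready -U appuser -d appdb"]},
        },
        "app-backend": {
            "image": "app-backend:latest",
            "depends_on": {"app-postgres": {"condition": "service_healthy"}},
        },
    }
}
print(unhealthy_dependencies(compose))   # prints: []
```

An empty list means every service_healthy condition points at a service that can actually report health.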

Step 5: Verify and prevent

docker compose ps
# All rows: Up ... (healthy)

curl -s http://localhost:8000/health/ready
# {"status": "ok"}   ← checks DB connectivity, not just process liveness

/health/ready is the real verification: it confirms the application is up and connected to its dependencies. /health/live only confirms the process exists.
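
The difference between the two endpoints is easiest to see side by side. This is a framework-free sketch, not the application's actual handlers: the function names and the `check_database` callable are assumptions, and each function returns a status code and body the way an HTTP handler would.

```python
from typing import Callable, Dict, Tuple


def health_live() -> Tuple[int, Dict[str, str]]:
    """Liveness: if this code runs at all, the process exists."""
    return 200, {"status": "ok"}


def health_ready(check_database: Callable[[], None]) -> Tuple[int, Dict[str, str]]:
    """Readiness: ok only if a dependency check succeeds right now.

    `check_database` is a hypothetical callable that raises on failure,
    for example by running a trivial query against the database.
    """
    try:
        check_database()
    except Exception as exc:
        return 503, {"status": "unavailable", "detail": str(exc)}
    return 200, {"status": "ok"}
```

During the restart loop, a liveness check would have reported success every time the process came up, while a readiness check would have kept failing on the database connection.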

Preventive measures:

The service_healthy fix enforces correct startup order on future reboots.
Add a startup retry loop in the application itself: Compose guards the first start but not a container-level restart after a crash. An application that retries the database connection with exponential backoff survives the race without looping.
Verify the systemd unit that wraps the compose stack logs clearly if the bring-up fails, so future failures surface at the host level too.
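
The retry loop mentioned above might look like the following. Treat it as a minimal sketch under stated assumptions: `connect_with_backoff` and its parameters are invented for this example, it retries only on `OSError`, and a real database driver would raise its own exception types instead. It also omits jitter, which a production version would likely add.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_backoff(
    connect: Callable[[], T],
    attempts: int = 6,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `connect` until it succeeds, doubling the wait after each failure.

    Retries only on OSError (ConnectionRefusedError is a subclass).
    Re-raises the last error once `attempts` is exhausted.
    """
    for attempt in range(attempts):
        try:
            return connect()
        except OSError:
            if attempt == attempts - 1:
                raise
            # Waits of 0.5s, 1s, 2s, ... capped at max_delay.
            sleep(min(base_delay * (2 ** attempt), max_delay))
    raise ValueError("attempts must be at least 1")
```

With the defaults, the application waits up to 15.5 seconds in total across five sleeps before giving up, instead of exiting on the first refused connection and handing the problem to the restart policy.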

Step 6: Document

Incident: backend restart loop after host reboot, 2026-06-29

> Symptom: Backend restart-looped after host reboot. Database reported healthy; backend exited on each start with "Connection refused" to postgres.
>
> Root cause: depends_on used condition: service_started instead of condition: service_healthy. Backend started 2 seconds after the postgres process, before postgres had initialized its data directory and begun accepting connections.
>
> Fix: Stopped backend, waited for postgres to show healthy, restarted backend. Updated compose dependency to condition: service_healthy.
>
> Verification: All services show (healthy). /health/ready returns 200.
>
> Preventive: Compose dependency corrected. Application retry loop tracked as follow-up.

Five minutes to write. Prevents the next engineer from spending thirty minutes re-diagnosing the same failure.

How the installer automates Step 3

In the VRAM/driver case from Part 4, the Step 3 work (run nvidia-smi, check the CUDA version, check the inference server's memory budget) was manual. On a production install, you can make that systematic by encoding the checks as automated probes that run in a fixed order and report failures with fix hints.

That's what the installer's diagnostic tool does. It runs three layers of checks, bottom to top:

System layer (host-level state): GPU driver present? Container runtime wired to Docker? Docker daemon running? Persistent journal configured? Disk paths above their minimum free-space thresholds?

Services layer (unit health): expected systemd units active? Expected ports listening?

Endpoints layer (one HTTP round-trip): does /health/ready return 200?

Layer     Check                     Status  Detail
system    gpu_driver                PASS    nvidia: 1 × RTX 4090
system    nvidia_container_runtime  PASS    nvidia runtime registered
system    docker_daemon             PASS    Docker version 26.1.3
system    disk.data_path            WARN    18.3 GiB free, want ≥ 20
services  systemd.app-backend       FAIL    failed (rc=3)
services  listen.backend_loopback   FAIL    127.0.0.1:8000 not listening
endpoints health_ready              FAIL    Connection refused

3 passed, 1 warning, 3 errors

The ordering is not arbitrary. A FAIL in the system layer (no GPU driver) explains a FAIL in the services layer (inference service can't start) which explains a FAIL in the endpoints layer (no health response). Fix the lowest-layer failure first: the upper-layer failures may resolve automatically.
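
"Fix the lowest-layer failure first" is a small sorting problem. The sketch below is not the installer's code; `ProbeResult`, `LAYER_ORDER`, and `first_failure` are names assumed for illustration, and the data is the sample report shown above.

```python
from typing import List, NamedTuple, Optional

LAYER_ORDER = ["system", "services", "endpoints"]  # bottom to top


class ProbeResult(NamedTuple):
    layer: str
    check: str
    status: str   # "PASS", "WARN", or "FAIL"
    detail: str


def first_failure(results: List[ProbeResult]) -> Optional[ProbeResult]:
    """Return the FAIL from the lowest layer, or None if nothing failed.

    Among failures in the same layer, the one reported first wins.
    An unknown layer name raises ValueError rather than being guessed at.
    """
    failures = [r for r in results if r.status == "FAIL"]
    if not failures:
        return None
    return min(failures, key=lambda r: LAYER_ORDER.index(r.layer))


results = [
    ProbeResult("system", "gpu_driver", "PASS", "nvidia: 1 × RTX 4090"),
    ProbeResult("system", "nvidia_container_runtime", "PASS", "nvidia runtime registered"),
    ProbeResult("system", "docker_daemon", "PASS", "Docker version 26.1.3"),
    ProbeResult("system", "disk.data_path", "WARN", "18.3 GiB free, want ≥ 20"),
    ProbeResult("services", "systemd.app-backend", "FAIL", "failed (rc=3)"),
    ProbeResult("services", "listen.backend_loopback", "FAIL", "127.0.0.1:8000 not listening"),
    ProbeResult("endpoints", "health_ready", "FAIL", "Connection refused"),
]
print(first_failure(results).check)   # prints: systemd.app-backend
```

For the sample report, the place to start is the failed systemd unit; the unlistened port and the refused health request are most likely consequences of it.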

This is Step 3 encoded as software: each probe is a single-variable check that confirms or refutes one specific hypothesis. The tool changes nothing; it only reads, reports, and suggests. That's the discipline of Step 3.
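
A single probe in that style can be very small. The following sketch is an assumption about shape, not the tool's implementation: the function names are invented, the 20 GiB target mirrors the sample report, and the 5 GiB critical floor is a made-up value for illustration.

```python
import shutil
from typing import Tuple

GIB = 1024 ** 3


def classify_free_space(free_gib: float, want_gib: float, critical_gib: float) -> str:
    """Map a free-space measurement to PASS, WARN, or FAIL."""
    if free_gib < critical_gib:
        return "FAIL"
    if free_gib < want_gib:
        return "WARN"
    return "PASS"


def probe_disk(path: str, want_gib: float = 20.0, critical_gib: float = 5.0) -> Tuple[str, str]:
    """Read-only probe: measures free space on `path` and changes nothing."""
    free_gib = shutil.disk_usage(path).free / GIB
    status = classify_free_space(free_gib, want_gib, critical_gib)
    return status, f"{free_gib:.1f} GiB free, want ≥ {want_gib:g}"


print(classify_free_space(18.3, want_gib=20.0, critical_gib=5.0))   # prints: WARN
```

Separating the measurement from the classification keeps the probe testable without touching a real disk, and keeps it honest: one input, one verdict.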

The preflight stage runs the same logic before install begins: it probes the system and refuses to proceed if any ERROR-severity check fails ("required port already in use," "disk below the critical floor"). That's Step 3 encoded as a pre-condition: testing the theory before applying anything.
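
The "refuses to proceed" behavior amounts to a gate in front of the install. A minimal sketch, assuming findings carry a severity label; `Finding`, `PreflightError`, and `require_clean_preflight` are hypothetical names, not the installer's API.

```python
from typing import List, NamedTuple


class Finding(NamedTuple):
    check: str
    severity: str   # "ERROR" blocks the install; "WARNING" does not
    message: str


class PreflightError(RuntimeError):
    """Raised when at least one ERROR-severity finding is present."""


def require_clean_preflight(findings: List[Finding]) -> List[Finding]:
    """Refuse to proceed on any ERROR; otherwise return the warnings."""
    errors = [f for f in findings if f.severity == "ERROR"]
    if errors:
        lines = "; ".join(f"{f.check}: {f.message}" for f in errors)
        raise PreflightError(f"preflight blocked install: {lines}")
    return [f for f in findings if f.severity == "WARNING"]
```

Raising instead of returning a flag means the caller cannot forget to look at the result: the install code after the gate simply never runs.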

Verification: the loop is closed

The methodology is complete when both Step 5 and Step 6 are done: the system works from the user's perspective, and the incident is documented.

Step 5's rule is verify full functionality, not just the component that broke. For this fix that means simulating the original failure condition (a reboot) and confirming the stack comes up clean on its own:

# Simulate the failure trigger: full stack restart
docker compose down && docker compose up -d

# Confirm all services are up and healthy (no "Restarting")
docker compose ps

# Confirm the application endpoint is reachable (not just process-alive)
curl -s http://localhost:8000/health/ready
# {"status": "ok"}

# Optional: reboot the host, then verify the systemd unit and containers
# recovered automatically without manual intervention

"No longer looping" is not the bar: docker compose ps showing every service Up (healthy) and /health/ready returning 200 is. Without that check, you've confirmed the symptom is gone but not that the system is actually sound.

A common failure in real support is stopping when the symptom disappears. The container stopped looping; the engineer went to lunch. But the compose fix is what prevents the next reboot from reproducing the failure. Preventive measures and documentation are what turn a one-time fix into a permanent improvement.
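
The verification bar can be written down as a check that returns reasons rather than a bare yes or no. This is a sketch under assumptions: `verification_failures` is an invented helper, and the service-state mapping is a simplified stand-in for whatever your tooling extracts from docker compose ps.

```python
from typing import Dict, List


def verification_failures(service_states: Dict[str, str], ready_status: int) -> List[str]:
    """Return the reasons the stack fails Step 5 (empty list means it passes).

    `service_states` maps service name to its reported health, and
    `ready_status` is the HTTP status returned by /health/ready.
    """
    reasons = [
        f"{name} is {state!r}, expected 'healthy'"
        for name, state in sorted(service_states.items())
        if state != "healthy"
    ]
    if ready_status != 200:
        reasons.append(f"/health/ready returned {ready_status}, expected 200")
    return reasons
```

A backend that is merely running, with no health verdict yet, still shows up in the list. That is the difference between "no longer looping" and "sound."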

🎯 What the exam asks

> - The six steps in order will appear in scenarios: either "what step comes next?" or a shuffled list to sequence. Know them cold: identify → establish theory → test theory → plan and implement → verify full functionality → document.
> - "Document findings" is always last and always mandatory. The exam treats skipping documentation as an error, not a time-saving shortcut.
> - **"Consider corporate policies, procedures, and impacts before implementing changes"** appears verbatim in the objectives. A scenario where a technician skips a ticket or change window is a wrong-answer trap even when the fix is technically correct.
> - Test one theory at a time. A technician who changes three things simultaneously and can't reproduce the fix chose the wrong approach.
> - Identify the problem before acting. Scenarios where the technician immediately replaces a component without gathering information are testing whether you know Step 1 comes before Step 4.
> - Verify full functionality, not just the broken component. Confirming the GPU works but not checking whether the application resumed is an incomplete loop.

Common pitfalls

Skipping straight to the fix. "I've seen this before" is not Step 1. Familiar symptoms have unfamiliar causes more often than you expect. Gather the information first, every time.

Changing multiple variables at once. If you restart the container, update the driver, and change configuration in the same step, you can't know which change fixed the problem, and you may have introduced drift that causes a different failure later.

Not testing the theory before acting. Step 2 produces a theory; Step 3 confirms or refutes it. Jumping to Step 4 means implementing a fix for an unconfirmed cause. Fixing the wrong thing is worse than not fixing it: it changes system state in ways that may mask the real cause.

Stopping when the symptom disappears. You're not done until you've verified full functionality (Step 5) and documented what happened (Step 6). A partially-closed loop is a future incident.

Skipping documentation. The cost is five minutes now; the cost of the next engineer re-diagnosing from scratch is an hour. Document while the incident is fresh.

Ignoring corporate policy. On a homelab there are no policies. On a production system there are, and bypassing them, even for an obvious fix, is a professional risk.

Recap: the arc of the series

Nine articles, one build, two exams.

Part	Topic	Domain
1	Workstation spec, WSL2, containers vs. VMs	1101 3.x hardware, 1101 4.x virtualization, 1102 1.x OS
2	NVMe/SSD/HDD tiers, RAID, NAS backup	1101 3.3 storage, 1101 5.3 RAID troubleshooting
3	Networking: IP/DNS, TLS, mTLS, WireGuard	1101 2.x networking
4	GPU driver stack, VRAM budgeting, Device Manager	1101 3.4/3.5 expansion/power, 1101 5.x hardware, 1102 1.x drivers
5	Docker service model, restart loops, log reading	1102 3.x software troubleshooting, 1102 1.x Linux
6	Endpoint hardening, RBAC, firewall, encryption	1102 2.x security
7	FastAPI secure deploy: TLS, JWT, MFA, lockout	1102 2.x security, 1102 4.x operational
8	Backups, WAL archiving, WORM, change management	1102 4.x operational procedures
9	Six-step troubleshooting methodology	1101 5.1 methodology

One honest gap: the 1101 Mobile Devices domain (1.x) is not covered. This is a workstation-and-server build; there are no phones, tablets, or laptops-as-the-subject in the architecture. Cover mobile devices from a dedicated resource; every other domain across both exams is represented here, in context.

What the series tried to demonstrate: A+ material is not separate from real IT practice; it's the vocabulary for describing what practitioners already do. PCIe lanes explain the CPU selection. RAID levels explain the storage layout. The six-step methodology explains why the engineer in Part 4 didn't reinstall the OS when the inference server reported "out of memory." The exam tests whether you know the terms; the build shows why they're worth knowing.

The series is complete. Good luck on the exam.