Backups, Recovery & Change Management for a Self-Hosted AI Stack

Part 8 of a 9-part series teaching CompTIA A+ (Core 1 / 220-1101 and Core 2 / 220-1102) through a real build: a private, local AI workstation/server for a small business.

The question you want to answer before the migration runs

The schema migration looks clean in staging. You run it on the production database at 11 p.m. to avoid business hours. Forty seconds later, the application is returning 500 errors on every request: the migration silently dropped a column that three API routes depend on. You roll back the migration code, but the column is gone. The data that was in it is gone.

The only question that matters now is: when was the last good backup, and can you actually restore it?

Not "do we have backups?" Most shops think they do. The real question is whether those backups were tested, encrypted, stored somewhere the corruption event can't reach, and documented well enough for someone who wasn't in the room when they were set up to execute a restore at midnight under pressure.

This article is about building an answer you can trust before the question gets asked. The concepts come from CompTIA A+ Domain 1102 4.x (operational procedures), the domain that candidates most often skip because it "isn't technical enough." It is. A bad restore at the wrong moment is one of the more expensive lessons in IT.

📘 Objectives covered (220-1102)
> This article maps to the following CompTIA A+ exam objectives. If you're studying for the exam, these are your anchors; if you're here for the build, skim past.
>
> Core 2 (220-1102)
> - 4.x, Operational Procedures: backup methods (full, incremental, differential); backup rotation schemes (3-2-1 rule); disaster recovery and data restoration concepts; change-management documentation and rollback planning; regulatory and compliance data-handling considerations.
>
> Concepts taught below: full vs. incremental vs. differential backups; snapshots vs. continuous WAL shipping; the 3-2-1 backup rule; recovery testing; WORM and immutable retention; the change-management cycle (request → approve → document → rollback plan); backup encryption at rest.

Concepts: backup types, recovery objectives, and change control

Full, incremental, and differential (the A+ trio)

The exam defines three backup types, and the differences matter operationally:

Full backup: a complete copy of all selected data at a point in time. Slowest to produce; fastest to restore from, because everything you need is in one place. Storage cost is proportional to data size every run. For a self-hosted stack, this is the nightly pg_dump of the database: one file, one restore command, nothing to chain together.

Incremental backup: only the data that changed since the last backup of any kind (full or incremental). Fast to produce; storage efficient; but to restore you have to replay the full backup plus every incremental in sequence. Miss one link in the chain and the restore fails. The WAL (write-ahead log) shipping approach in this build is incremental in spirit: each run ships only WAL segments newer than the last marker.

Differential backup: only the data that changed since the last full backup. Each differential grows as more changes accumulate, but restore is simpler: full backup + the latest differential, nothing to chain. Slower to produce than incremental over time; faster to restore. A useful middle ground when your data changes a lot intra-week but you want restore simplicity.

In practice, most production systems layer all three: a weekly or monthly full, a nightly incremental or differential, and continuous transaction-log shipping for tight recovery-point objectives (RPO).
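The restore-complexity difference between the three types is easier to see as code. The following is a minimal sketch, not part of the build: the `Backup` record and `restore_chain` function are hypothetical names, and the only logic is the definitions above (an incremental depends on everything since the last full; a differential replaces everything since the last full).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Backup:
    name: str
    kind: str  # "full", "incremental", or "differential"


def restore_chain(history: List[Backup]) -> List[str]:
    """Return the backups needed to restore to the newest state, oldest first.

    `history` must be ordered oldest to newest.
    """
    full_positions = [i for i, b in enumerate(history) if b.kind == "full"]
    if not full_positions:
        raise ValueError("no full backup in history; nothing to restore from")
    last_full = full_positions[-1]

    chain = [history[last_full]]
    for backup in history[last_full + 1:]:
        if backup.kind == "incremental":
            # Depends on every backup before it, so the chain grows.
            chain.append(backup)
        elif backup.kind == "differential":
            # Holds all changes since the last full, so it replaces
            # whatever was queued after that full.
            chain = [history[last_full], backup]
    return [b.name for b in chain]


incremental_week = [
    Backup("sun-full", "full"),
    Backup("mon-inc", "incremental"),
    Backup("tue-inc", "incremental"),
    Backup("wed-inc", "incremental"),
]
differential_week = [
    Backup("sun-full", "full"),
    Backup("mon-diff", "differential"),
    Backup("tue-diff", "differential"),
    Backup("wed-diff", "differential"),
]
```

For the incremental week the chain is all four backups; for the differential week it is just `sun-full` and `wed-diff`. That is the "N pieces vs. 2 pieces" trade-off in practice.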

RPO and RTO: what you're actually promising

Two numbers define a backup strategy. Recovery Point Objective (RPO) is how much data loss is acceptable: a 1-hour RPO means you can lose at most the last hour of work, which demands hourly backups or continuous shipping. Recovery Time Objective (RTO) is how long the service can stay down during recovery, which drives restore-path choices. A single full backup is the simplest restore path but recovers only to the moment it was taken; WAL shipping plus PITR gives the lowest RPO but the most complex procedure. Knowing both before designing the strategy is the difference between "we have backups" and "we have a recovery plan."
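The arithmetic behind those two numbers is simple enough to write down. This is a rough sketch with made-up figures: `check_objectives` is a hypothetical helper, and the restore-time estimates are placeholders you would replace with timings from your own test restores.

```python
from datetime import timedelta


def check_objectives(backup_interval, estimated_restore, rpo, rto):
    """Compare a backup schedule against RPO/RTO targets.

    Worst-case data loss equals the backup interval: a failure just
    before the next run loses everything since the previous one.
    """
    return {
        "worst_case_loss": backup_interval,
        "rpo_ok": backup_interval <= rpo,
        "rto_ok": estimated_restore <= rto,
    }


# Hypothetical targets: lose at most 1 hour, be back within 2 hours.
RPO = timedelta(hours=1)
RTO = timedelta(hours=2)

nightly_dump_only = check_objectives(
    backup_interval=timedelta(hours=24),
    estimated_restore=timedelta(minutes=20),  # assumed, not measured
    rpo=RPO,
    rto=RTO,
)
hourly_wal_shipping = check_objectives(
    backup_interval=timedelta(hours=1),
    estimated_restore=timedelta(minutes=90),  # assumed, not measured
    rpo=RPO,
    rto=RTO,
)
```

With these numbers the nightly dump meets the RTO but misses the RPO by 23 hours, while hourly WAL shipping meets both at the cost of a longer, more involved restore.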

The 3-2-1 rule

The canonical backup rule is simple and survives most failure modes:

3 copies of the data
2 different storage media types
1 copy offsite (or on a network-isolated target)

For a single-workstation build this translates to: primary data on NVMe (copy 1); nightly snapshot to a NAS on the same network (copy 2, different media); and a periodic copy to a physically separate location or an encrypted cloud target (copy 3, offsite). The NAS alone satisfies the "different media" requirement but fails the "offsite" test: a fire or power surge that kills the workstation can kill a co-located NAS too.
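The rule is mechanical enough to check in a few lines. Here is a small sketch under the assumption that you keep an inventory of where each copy lives; the `Copy` record and `check_3_2_1` function are illustrative names, not part of the build.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Copy:
    location: str
    media: str
    offsite: bool


def check_3_2_1(copies: List[Copy]) -> List[str]:
    """Return the list of 3-2-1 violations (empty means compliant)."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than 3 copies")
    if len({c.media for c in copies}) < 2:
        problems.append("fewer than 2 media types")
    if not any(c.offsite for c in copies):
        problems.append("no offsite copy")
    return problems


workstation_and_nas = [
    Copy("workstation", "nvme", offsite=False),
    Copy("nas", "hdd-array", offsite=False),
]
with_cloud = workstation_and_nas + [
    Copy("cloud-bucket", "object-storage", offsite=True),
]
```

The workstation-plus-NAS setup reports two violations (too few copies, nothing offsite); adding the encrypted cloud target clears both.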

Snapshots vs. WAL shipping

A snapshot is a point-in-time copy of a storage volume or directory. For files and bulk data it's the right tool: rsync the bulk tier to a dated directory on the NAS, and you have a recoverable state for that moment. The limitation is granularity: you can only recover to a snapshot boundary.

WAL (Write-Ahead Log) shipping is how PostgreSQL handles continuous durability. Every committed transaction is written to sequential WAL segment files before the write is acknowledged. Shipping those files to a safe location as they're produced lets you replay the log forward from any base backup to a specific moment: PITR (point-in-time recovery). The transaction log is the incremental backup; PostgreSQL's own recovery engine does the replay.

WORM and immutable retention

WORM (Write Once, Read Many) prevents overwrites and deletes for a defined retention window. It exists because ransomware and malicious insiders target backup stores before triggering their payload. A target you can write to, you can also delete from.

Immutable retention removes delete permission from the backup target for the retention period. In this build, audit log segments are sealed (cryptographically signed), moved to a WORM-designated storage tier, and then the source is unlinked: the audit record can't be altered after the seal. The same principle applies to backup archives: once written to the immutable tier, they stay there until the retention window expires.
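The retention-window logic itself is a one-line comparison. The sketch below only illustrates that rule; `may_delete` is a hypothetical helper, and a check like this in application code is not real immutability. The storage target has to enforce it, because an attacker with write access can simply skip your check.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def may_delete(written_at: datetime, retention: timedelta,
               now: Optional[datetime] = None) -> bool:
    """True only once the retention window for an archive has expired."""
    now = now or datetime.now(timezone.utc)
    return now >= written_at + retention


sealed_at = datetime(2026, 1, 1, tzinfo=timezone.utc)
ninety_days = timedelta(days=90)

inside_window = may_delete(
    sealed_at, ninety_days, now=datetime(2026, 2, 1, tzinfo=timezone.utc)
)
after_window = may_delete(
    sealed_at, ninety_days, now=datetime(2026, 4, 2, tzinfo=timezone.utc)
)
```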

Backup encryption

An encrypted database with an unencrypted backup on the NAS gains nothing from the live-data encryption. The backup key must be stored separately from the backup data; in this build that means a secrets manager queried at runtime, not a key file adjacent to the archive. AES-256-GCM is the practical standard: authenticated encryption where the auth tag means a corrupted or tampered backup fails to decrypt, serving as an integrity check at the same time.

Change management

The migration scenario at the top of this article isn't a backup failure: it's a change-management failure. The backup is the safety net; change management is what should have kept you from needing it.

The A+ change-management cycle is:

Request: someone proposes a change. Written, not verbal.
Review and approve: the change is evaluated by someone other than the requester. For high-risk changes, approval is explicit.
Document: the change, its expected outcome, and, critically, the rollback plan are written down before the change runs.
Implement: the change runs in a controlled window.
Verify: the expected outcome is confirmed. Unexpected outcomes → rollback.
Document the outcome: what actually happened, including any deviations from the plan.

The rollback plan is the step most often skipped under time pressure, and the one you most need. "We'll figure it out" is not a rollback plan. For a database migration, a real one reads: "take a full backup immediately before the change; if verification fails within N minutes, restore from it." It is written down and pre-tested.
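One way to make the cycle hard to skip is to encode the gates. The following is a hypothetical sketch, not a tool from the build: `ChangeRequest` and its fields are invented for illustration, and it only checks the three things the cycle demands before implementation (independent approval, a documented expected outcome, a documented rollback plan).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ChangeRequest:
    summary: str
    requester: str
    approver: Optional[str] = None
    expected_outcome: str = ""
    rollback_plan: str = ""

    def blockers(self) -> List[str]:
        """Reasons this change must not run yet (empty means go)."""
        problems = []
        if self.approver is None:
            problems.append("not approved")
        elif self.approver == self.requester:
            problems.append("approver must not be the requester")
        if not self.expected_outcome.strip():
            problems.append("expected outcome not documented")
        if not self.rollback_plan.strip():
            problems.append("rollback plan not documented")
        return problems

    def ready_to_implement(self) -> bool:
        return not self.blockers()


migration = ChangeRequest(summary="Drop legacy column", requester="alice")
blocked_on = migration.blockers()

migration.approver = "bob"
migration.expected_outcome = "All API routes return 200 after migration"
migration.rollback_plan = (
    "Full backup taken immediately before; if verification fails, restore from it"
)
```

Before the fields are filled in, `blocked_on` lists all three gaps; afterwards `migration.ready_to_implement()` is true. The opening scenario would have stopped at the second gate.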

Hands-on walkthrough: the backup cadence in practice

The timer cadence

Three systemd timers drive the automated backup pipeline. They are shown here in generalized form; the real units follow the same structure:

app-backup-wal-ship.timer, runs hourly at :15, picks up any new PostgreSQL WAL segments since the last run, and ships them to the NAS archive tier:

[Timer]
# Hourly at :15 — offset from other IO-heavy timers.
OnCalendar=*-*-* *:15:00
OnBootSec=5min
RandomizedDelaySec=120
Persistent=true

Persistent=true is load-bearing: if the host was down during a scheduled run, the timer fires as soon as the host comes back up, rather than waiting until the next scheduled slot. For a backup timer, skipping a run silently is the failure mode you're defending against.

app-backup-snapshot.timer, runs daily at 03:00 UTC, rsyncs the bulk data tier to a date-stamped NAS directory:

[Timer]
# Daily at 03:00 UTC — offset from the cert-renewal timer (04:00)
# to avoid two IO-heavy jobs colliding.
OnCalendar=*-*-* 03:00:00
RandomizedDelaySec=600
Persistent=true

RandomizedDelaySec jitters the fire time by a random amount up to the stated window. On a multi-node cluster, this staggers the same timer across nodes so they don't all hit the NAS simultaneously.

app-audit-seal.timer, runs every 15 minutes, seals any rolled audit log segments with a cryptographic signature, moves them to the WORM tier, and verifies the copy before unlinking the source:

[Timer]
OnBootSec=2min
OnUnitActiveSec=15min
RandomizedDelaySec=120
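# Persistent= only affects OnCalendar= timers; with the monotonic
# triggers above it is a no-op, and OnBootSec= covers the post-boot run.
Persistent=true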

The corresponding service unit demonstrates the defense-in-depth approach to service hardening:

[Service]
Type=oneshot
User=root
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=read-only
ReadWritePaths=/var/lib/app /var/log/app
PrivateTmp=true

Even a root-owned service can be locked down: ProtectSystem=strict makes most of the filesystem read-only; ReadWritePaths re-grants write access only to the directories the service needs. If compromised, its blast radius is bounded to those two paths.
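The seal step that the audit-seal timer drives (sign, move to the WORM tier, verify the copy, then unlink the source) can be sketched as follows. This is an illustration only: the article does not name the signature scheme, so HMAC-SHA256 is a stand-in, and `seal_segment`, the `.seal` sidecar file, and the directory layout are all assumptions.

```python
import hashlib
import hmac
import shutil
from pathlib import Path


def seal_segment(src: Path, worm_dir: Path, key: bytes) -> Path:
    """Seal one rolled audit segment and move it to the WORM tier.

    The source is unlinked only after the copy has been verified.
    """
    dest = worm_dir / src.name
    if dest.exists():
        # Write-once: never overwrite an already sealed segment.
        raise FileExistsError(f"{dest} is already sealed")

    # HMAC-SHA256 stands in for the build's real signature scheme.
    seal = hmac.new(key, src.read_bytes(), hashlib.sha256).hexdigest()

    shutil.copy2(src, dest)
    (worm_dir / (src.name + ".seal")).write_text(seal)

    # Re-read the copy and recompute: verify before deleting anything.
    copied = hmac.new(key, dest.read_bytes(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(seal, copied):
        raise RuntimeError(f"copy of {src.name} failed verification")

    src.unlink()
    return dest
```

The ordering is the point: if the copy or the verification fails, the function raises before `src.unlink()`, so the original segment is still there for the next timer run.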

The WAL shipping worker

The WAL shipping worker is the clearest example of incremental backup logic:

# Postgres WAL segment names are 24 hex chars — lexical sort is
# chronological sort. Ship only files newer than the last marker.
candidates = sorted(p for p in wal_dir.iterdir() if p.is_file())
new_files = [
    p for p in candidates
    if last_shipped is None or p.name > last_shipped
]

The worker persists a state file (wal-ship.state) with the filename of the last shipped segment. State is written atomically (write to .tmp, then rename into place), so a crash mid-write can't corrupt the marker.
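The write-then-rename pattern looks roughly like this. It is a minimal sketch rather than the worker's actual code: `write_marker` and `read_marker` are hypothetical names, and the explicit `fsync` is an extra safety step not mentioned above.

```python
import os
from pathlib import Path
from typing import Optional


def write_marker(state_path: Path, last_shipped: str) -> None:
    """Atomically record the name of the last shipped WAL segment."""
    tmp_path = state_path.with_name(state_path.name + ".tmp")
    with open(tmp_path, "w", encoding="utf-8") as f:
        f.write(last_shipped)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes are on disk first
    # os.replace is an atomic rename on the same filesystem: readers see
    # either the old marker or the new one, never a half-written file.
    os.replace(tmp_path, state_path)


def read_marker(state_path: Path) -> Optional[str]:
    """Return the last shipped segment name, or None on first run."""
    try:
        return state_path.read_text(encoding="utf-8").strip() or None
    except FileNotFoundError:
        return None
```

On a first run `read_marker` returns `None`, which matches the `last_shipped is None` branch in the worker snippet: every segment counts as new.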

The backup manager pipeline

Every backup created by the application-level manager passes through the same pipeline: compress then encrypt, in that order. The ordering matters: you can't meaningfully compress already-encrypted data (encryption destroys the patterns compression exploits). After upload, existence is verified before the operation is marked complete:

import gzip
import secrets

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Compress first (gzip, level 6: balanced speed vs ratio)
compressed = gzip.compress(original_data, compresslevel=6)

# Then encrypt: AES-256-GCM with a per-backup nonce
# Output format: nonce (12 bytes) || ciphertext || auth tag (16 bytes)
# AESGCM.encrypt returns the ciphertext with the tag already appended.
nonce = secrets.token_bytes(12)
ciphertext = AESGCM(key).encrypt(nonce, compressed, None)
payload = nonce + ciphertext

# (payload is uploaded to the storage backend here)

# Verify existence after upload; don't trust the upload call's return value
if not await backend.exists(backup_id):
    raise RuntimeError("Backup verification failed: file not found")

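The checksum of the original data is stored in backup metadata. On restore, after decrypt and decompress, it's recomputed and compared. A wrong key or a tampered archive already fails earlier, at the GCM auth-tag check during decryption; a checksum mismatch after that points to corruption somewhere else in the pipeline. Either way, the restore fails loudly rather than handing back bad data silently.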
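The restore-side check can be sketched with the standard library alone. This is a simplified illustration: it leaves out the decryption step, assumes SHA-256 (the article does not say which checksum the manager uses), and `make_backup` and `restore` are hypothetical names.

```python
import gzip
import hashlib
from typing import Tuple


def make_backup(original: bytes) -> Tuple[bytes, str]:
    """Compress the data and record a checksum of the ORIGINAL bytes."""
    checksum = hashlib.sha256(original).hexdigest()
    return gzip.compress(original, compresslevel=6), checksum


def restore(compressed: bytes, expected_checksum: str) -> bytes:
    """Decompress, recompute the checksum, and fail loudly on mismatch."""
    data = gzip.decompress(compressed)
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_checksum:
        raise ValueError(
            f"checksum mismatch: expected {expected_checksum}, got {actual}"
        )
    return data


blob, recorded = make_backup(b"users table dump")
restored = restore(blob, recorded)
```

Checksumming the original rather than the compressed or encrypted bytes means the comparison covers the whole round trip, not just the storage leg.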

The encryption key is fetched from a secrets manager at runtime, never stored adjacent to the backup data. The fallback, an ephemeral key when the secrets manager is unavailable, is a deliberate dev-environment trade-off, with an explicit warning in the logs:

Backup encryption key not loaded from Vault; generating ephemeral key —
backups produced this run won't decrypt after a restart.

That warning makes "ephemeral key in prod" loud rather than silent.

WAL archive recovery procedure (PITR steps)

When you need to recover to a specific point in time from a WAL archive, the PostgreSQL PITR procedure is:

1. Restore the base backup. PITR requires a physical base backup (a binary copy of the data directory), not a logical dump: pg_dump output is a logical export and cannot serve as the WAL-replay starting point. The nightly base backup is taken with pg_basebackup:

pg_basebackup -h localhost -U replication_user -D /nas/backups/db/base-$(date +%Y%m%d) \
  --wal-method=stream --checkpoint=fast

To restore: stop PostgreSQL, replace the data directory with the base backup, then continue to step 2.

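systemctl stop postgresql
# Move the old data directory aside rather than deleting it: it may
# still hold WAL segments that never reached the archive.
mv /var/lib/postgresql/data /var/lib/postgresql/data.pre-restore
cp -a /nas/backups/db/base-20260630 /var/lib/postgresql/data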

2. Configure recovery parameters. Create a recovery.signal file in the data directory (PostgreSQL 12+; its presence activates recovery mode) and set the target in postgresql.auto.conf:

restore_command = 'cp /nas/wal-archive/pg-wal/%f %p'
recovery_target_time = '2026-06-30 22:45:00+00'
recovery_target_action = 'promote'

Without recovery_target_action = 'promote', the server pauses in read-only mode at the recovery target, an easy-to-miss gotcha.

3. Start PostgreSQL. The postmaster replays archived WAL until it hits recovery_target_time, then promotes to a writable primary. Confirm the "recovery stopping" log line before pointing traffic at the instance.

4. Verify application health. Run the health check and spot-check query results against known-good data.

This procedure is why the runbook exists and why it must be tested in advance, not read for the first time during an incident.
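One decision in step 1 is worth making explicit: which base backup to start from. WAL replay only moves forward, so the base backup must have finished before the recovery target. Here is a small sketch assuming you keep a catalog of completion times; `pick_base_backup` and the times shown are hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict


def pick_base_backup(catalog: Dict[str, datetime], target: datetime) -> str:
    """Pick the newest base backup that completed at or before the target."""
    usable = {name: done for name, done in catalog.items() if done <= target}
    if not usable:
        raise LookupError("no base backup predates the recovery target")
    return max(usable, key=usable.get)


UTC = timezone.utc
# Completion times are illustrative, not taken from the build.
catalog = {
    "base-20260629": datetime(2026, 6, 29, 2, 10, tzinfo=UTC),
    "base-20260630": datetime(2026, 6, 30, 2, 10, tzinfo=UTC),
    "base-20260701": datetime(2026, 7, 1, 2, 10, tzinfo=UTC),
}
chosen = pick_base_backup(catalog, datetime(2026, 6, 30, 22, 45, tzinfo=UTC))
```

For a target of 2026-06-30 22:45 UTC this selects `base-20260630`: the July 1 backup is newer but already contains changes from after the target, so it cannot be replayed backward.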

Verification: proving the backup is real

A backup you've never restored from is a hypothesis, not a fact. Three checks:

1. Snapshot exists and is encrypted

ls -lh /nas/backups/bulk-snapshots/$(date +%Y-%m-%d)/
python -m backup.manager list --source-type database --source-name appdb | grep encrypted
# expect: "encrypted": true

2. Test restore into a scratch target (different port, not production)

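# If the archive came from the backup manager, decrypt and decompress it
# first; pg_restore cannot read the encrypted payload directly.
pg_restore -h localhost -p 5433 -U dbuser -d appdb_restore \
  /nas/backups/db/<latest-backup-id>
psql -h localhost -p 5433 -U dbuser -d appdb_restore \
  -c "SELECT COUNT(*) FROM users;"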

The point is not to restore everything; it's to confirm the decrypt-decompress-restore pipeline works end to end before you need it at midnight.

3. Confirm the audit-seal timer is active

systemctl list-timers app-audit-seal.timer
journalctl -u app-audit-seal.service --since "24 hours ago" | grep -E "sealed|FAILED"

A FAILED seal exits non-zero; systemd records it in the journal. Know about it before auditors ask.

🎯 What the exam asks
> CompTIA frames this material in the operational-procedures domain. Know these:
>
> - Backup type identification is the most-tested concept. Know the definitions cold: full = complete copy every run; incremental = changes since the last backup of any kind; differential = changes since the last full backup. The exam distinguishes them by restore complexity: differential restore = full + latest differential (2 pieces); incremental restore = full + every incremental in sequence (N pieces). Wrong answer on this means losing easy points.
> - 3-2-1 rule: 3 copies, 2 media types, 1 offsite. The exam tests this by presenting a backup scenario and asking what's missing. The most common missing piece is the offsite copy: "we back up to a second disk in the same machine" fails 3-2-1.
> - Change management steps: the exam wants the cycle in order: identify the change, plan it, get approval, document the rollback plan, implement, verify, document the outcome. Expect a scenario where someone skips the rollback plan or doesn't document the result; the question asks what they did wrong.
> - Documentation and rollback plans: these appear as their own exam items, not just steps inside change management. Know that a rollback plan must exist before a change is implemented, not devised after it goes wrong.
> - Grandfather-Father-Son (GFS) rotation is a traditional backup rotation scheme the exam still tests: daily (Son), weekly (Father), monthly (Grandfather) tapes/sets, rotated on schedule. It's the formal name for the "keep some old and some new" intuition behind retention policies.
> - WORM / immutable storage may appear in questions about regulatory compliance or audit log retention. Know that the point is preventing deletion during the retention window: not just backup, but tamper-evident preservation.
> - Archive bit: full and incremental backups clear it (marking files as "backed up"); differential backups do NOT clear it. That's how differentials keep accumulating every change since the last full.
> - A backup you can't restore from isn't a backup. The exam phrases this as "which practice verifies backup integrity"; the answer is scheduled test restores, not just confirming the backup file exists.

Common pitfalls (from the real build)

Never testing restores. The most common failure. Backups run for months; the first restore attempt happens during an incident. The key has rotated, the target is the wrong database version, or pg_restore flags a format mismatch. Test on a schedule: quarterly at minimum, monthly if you can.

Backups on the same host or disk as the primary data. A disk failure or ransomware payload can destroy both simultaneously. The NAS is a different physical device; the offsite copy is on a different network. Both matter.

No immutability on the backup target. A target you can write to, you can delete from. Ransomware specifically targets backup directories before triggering. "The backup exists" and "the backup is untouchable" are different properties.

Unencrypted backups undermining encrypted primaries. A database encrypted at rest with its backup sitting in plaintext on the NAS gives an attacker with NAS access everything. Encryption follows the data, not just the primary store.

Ephemeral encryption key in production. The fallback path is for development. If the secrets manager is unreachable and the process continues with an ephemeral key, that backup is unrestorable after a restart. Alert on that log warning; don't just let it scroll by.

Changes without rollback plans. A migration, a config change, a dependency upgrade: any can go wrong. The rollback plan must be concrete: "if X happens within N minutes, execute these steps." Vague plans fail under pressure.

Skipping outcome verification. A change that ran without throwing an error is not a change that succeeded. Verify expected behavior before declaring done: that step would have caught the opening migration failure before the error log did.

Retention by age only. A quiet period with few writes can expire all backups simultaneously. Keep a minimum count (retention_count) as a floor regardless of age: "delete only if older than N days AND more than M copies exist."
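The "older than N days AND more than M copies exist" rule can be sketched like this. It is an illustration of the policy, not the manager's code: `select_for_deletion` is a hypothetical function, and only the `retention_count` name comes from the text above.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def select_for_deletion(backups: Dict[str, datetime], now: datetime,
                        max_age: timedelta, retention_count: int) -> List[str]:
    """Return backups that are expired AND not among the newest N."""
    newest_first = sorted(backups, key=backups.get, reverse=True)
    # The newest `retention_count` backups are protected regardless of age.
    protected = set(newest_first[:retention_count])
    return sorted(
        name for name in newest_first
        if name not in protected and now - backups[name] > max_age
    )


# Every backup here is older than 30 days, so an age-only policy
# would delete all four and leave nothing to restore from.
backups = {
    "a": datetime(2026, 1, 1),
    "b": datetime(2026, 2, 1),
    "c": datetime(2026, 3, 1),
    "d": datetime(2026, 4, 1),
}
to_delete = select_for_deletion(
    backups, now=datetime(2026, 7, 1),
    max_age=timedelta(days=30), retention_count=2,
)
```

With `retention_count=2`, only `a` and `b` are selected; `c` and `d` survive as the floor even though both are past the age limit.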

Recap + what's next

You built a recovery story you can defend: a three-tier timer cadence (WAL shipping at :15 past every hour, daily bulk snapshot at 03:00, audit seal every 15 minutes); an application-level backup manager that compresses then encrypts every backup with AES-256-GCM, checksums both ends, and verifies existence after upload; a PITR-capable PostgreSQL archive for sub-hour RPO; and a change-management cycle that puts the rollback plan on paper before any change runs.

The backup and the restore are different things: test both before you need them. Change management enforces the same discipline: write down the plan, the success criteria, and the rollback, then execute, then confirm.

But notice what's been absent from this series: a name for what we've been doing when something breaks. In articles #4 and #5 we watched GPU driver failures and container restart loops get diagnosed and resolved. We've been doing structured troubleshooting all along: we just haven't named the framework.

Next up: Part 9: "Symptom → Root Cause: A Real Troubleshooting Workflow." The capstone. We'll name the CompTIA six-step methodology, map it onto real cases from earlier in the series, and show how encoding it into automated diagnostics is just the methodology written in Python. You've already seen all six steps: you just haven't had the formal map yet.