

Steps to recover the homelab from various failure modes. Assumes you have access to AWS account 389565021081, the GitHub repo amarcin/homelab, and physical access (or SSH via an existing working node) to the remaining hardware.

Recovery Keys You Need

All stored outside the homelab (1Password, hardware token, etc.):
  • AWS admin credentials for account 389565021081
  • Restic repo password (also in scripts/.restic-password on each node — but you need it to bootstrap a node)
  • GitHub SSH key (for cloning the repo to a replacement node)
  • The .env-backup.tar.gz restic snapshot, which holds the env files for every service

Scenario 1: Single Service Corrupted

Fastest case. Restore from S3.
restic -r s3:s3.us-west-2.amazonaws.com/augustin-backups/<node> snapshots
restic -r s3:s3.us-west-2.amazonaws.com/augustin-backups/<node> restore <id> --target /tmp/restore --include /home/user/apps/<service>
Stop the service, replace its data/ dir, bring it back up.
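The "stop, replace data/, bring it back up" step can be sketched as a small script. The service name and the data/ path are assumptions (adjust to the actual stack layout); it defaults to a dry run so the plan can be checked before anything is stopped:

```shell
# Sketch of the stop/replace/up cycle for one corrupted service.
# SERVICE and the data/ path are assumptions; set DRYRUN=0 to actually execute.
SERVICE="${SERVICE:-uptime-kuma}"   # example name from the Pentium stack list
SRC="/tmp/restore/home/user/apps/$SERVICE/data/"
DST="$HOME/apps/$SERVICE/data/"
echo "restore plan: $SRC -> $DST"
if [ "${DRYRUN:-1}" = "1" ]; then
  echo "dry run only; set DRYRUN=0 to execute"
else
  docker stack rm "$SERVICE"        # stop the stack
  sleep 15                          # let tasks drain
  rsync -a --delete "$SRC" "$DST"   # swap in the restored data dir
  ( cd "$HOME/apps/$SERVICE" || exit 1
    set -a; . ./.env 2>/dev/null; set +a
    docker stack deploy -c compose.yaml "$SERVICE" )
fi
```

Run with DRYRUN=0 once the restore under /tmp/restore looks right.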

Scenario 2: Pentium Total Loss

Pentium hosts: matrix, reader, stalwart, n8n, yamtrack, searxng, uptime-kuma, draw, pocket-id. All DBs are on Pentium.
  1. Install Debian 13 on new hardware (or reinstalled Pentium)
  2. Create user (UID 1000), add to sudo, enable passwordless sudo
  3. apt install docker.io git rsync restic
  4. Configure ~/.ssh/config aliases for i3 (192.168.1.100) and ssh-copy-id so SSH works both directions
  5. Clone repo: git clone git@github.com:amarcin/homelab.git ~/apps
  6. bash ~/apps/scripts/systemd/install.sh — enable the 5-min auto-pull
  7. Put /etc/docker/daemon.json in place:
    {"insecure-registries": ["192.168.1.100:5000"]}
    
    Then restart Docker: sudo systemctl restart docker
  8. Restore Pentium’s /home/user/apps from S3:
    export RESTIC_PASSWORD_FILE=~/apps/scripts/.restic-password
    source ~/apps/scripts/.aws-env
    restic -r s3:s3.us-west-2.amazonaws.com/augustin-backups/pentium snapshots
    restic -r s3:s3.us-west-2.amazonaws.com/augustin-backups/pentium restore latest --target /tmp/restore
    rsync -a /tmp/restore/home/user/apps/ ~/apps/
    
  9. Restore DBs from the dumps inside the restored repo:
    • matrix/matrix-db-dump.sql → once the matrix-db container is up, cat matrix-db-dump.sql | docker exec -i matrix_matrix-db.N psql -U synapse
    • Same pattern for reader/miniflux-db-dump.sql
  10. On i3, join pentium back to the swarm: get a token with docker swarm join-token manager and run the printed join command on pentium (if i3 lost quorum when Pentium went down, recover it first; see Scenario 4)
  11. Redeploy all Pentium stacks from i3:
    for s in matrix reader stalwart n8n yamtrack searxng uptime-kuma draw pocket-id; do
      cd ~/apps/$s && set -a; source .env 2>/dev/null; set +a
      docker stack deploy -c compose.yaml $s
    done
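Step 11's loop can also be wrapped as a dry-run-first helper, a sketch assuming the same repo layout (prints the plan by default; pass --run to execute):

```shell
# Stack list copied from this doc; deploys assume ~/apps/<stack>/compose.yaml.
deploy_pentium_stacks() {
  for s in matrix reader stalwart n8n yamtrack searxng uptime-kuma draw pocket-id; do
    if [ "$1" = "--run" ]; then
      ( cd ~/apps/"$s" || exit 1
        set -a; . ./.env 2>/dev/null; set +a   # load the service's env file
        docker stack deploy -c compose.yaml "$s" )
    else
      echo "would deploy: $s"
    fi
  done
}
deploy_pentium_stacks            # dry run: list the nine stacks first
```

Review the dry-run output, then call deploy_pentium_stacks --run.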

Scenario 3: i3 Total Loss

More painful. i3 hosts media data on mergerfs SanDisks. Those are physical drives — recover them to the new box before anything else.
  1. If SanDisks are intact: plug them into the new i3. Verify labels (sandisk-2tb, sandisk-1tb-lg, sandisk-1tb-dg) match /etc/fstab.
  2. If SanDisks are lost: your media is gone. Repopulate via arr stack / torrent indexers over time. Jellyfin library metadata will rebuild, but watch progress/history is gone unless backed up separately.
  3. Steps 1-6 from Scenario 2, substituting i3 for pentium (restic repo augustin-backups/i3; the ~/.ssh/config alias points at Pentium instead)
  4. Additional for i3:
    • Install mergerfs: apt install mergerfs
    • Verify /mnt/main/media mounts correctly on boot
    • Set up CDI for Jellyfin (see docs/archive/media/jellyfin.md)
  5. Restore /home/user/apps from S3 (i3 repo)
  6. If the swarm survived on Pentium: remove the dead i3 entry with docker node demote <old-i3> followed by docker node rm <old-i3> (docker swarm leave runs on the leaving node, which no longer exists), then docker swarm join-token manager to get a new join command. With only two managers, losing i3 usually breaks quorum; in that case run Scenario 4's --force-new-cluster on Pentium first.
  7. If the swarm is totally dead (both nodes lost): docker swarm init --advertise-addr 192.168.1.100 on the new i3, then join Pentium. A fresh init carries no Raft state, so every stack must be redeployed from its compose file in the repo, and only after the data restores are complete.
  8. Traefik + cloudflared must come up first on i3 so everything else is reachable
  9. Redeploy all stacks
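Because /etc/fstab is not in git (see Known Gaps), here is a sketch of what the i3 entries plausibly look like, using the disk labels above. The branch mount points, filesystem types, and mergerfs options are assumptions to verify against docs/infrastructure/hardware.md:

```
# /etc/fstab sketch (illustrative; confirm against docs/infrastructure/hardware.md)
LABEL=sandisk-2tb     /mnt/disks/sandisk-2tb     ext4  defaults,nofail  0 2
LABEL=sandisk-1tb-lg  /mnt/disks/sandisk-1tb-lg  ext4  defaults,nofail  0 2
LABEL=sandisk-1tb-dg  /mnt/disks/sandisk-1tb-dg  ext4  defaults,nofail  0 2
# pool the branches into /mnt/main so /mnt/main/media is the merged view
/mnt/disks/*  /mnt/main  fuse.mergerfs  defaults,allow_other,nofail  0 0
```

The branches must mount before the pool; nofail keeps a missing disk from blocking boot.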

Scenario 4: Swarm Quorum Lost (both managers down simultaneously but hardware OK)

Likely cause: docker daemon crash on both nodes, or one crashed and the other rebooted. Swarm Raft needs a manager majority to function.
  1. On one node, docker swarm init --force-new-cluster --advertise-addr <node-ip> — forces a new single-manager cluster
  2. Rejoin the other node with the join command printed by docker swarm join-token manager on the forced manager, then confirm both managers appear in docker node ls
  3. All services auto-recover from the stack definitions (they’re stored in Raft, which the forced-new-cluster preserves)

Scenario 5: Restic Repo Corruption

Detected by the weekly restic check cron. Symptoms: check fails, snapshots can’t be read.
  1. Don’t run restic forget --prune until fixed — it’ll make it worse
  2. S3 versioning has your back: find the last-known-good pack files via aws s3api list-object-versions --bucket augustin-backups --prefix <node>/data/, restore the prior versions
  3. Worst case: repo is unrecoverable. Start a new repo, accept data loss back to the last local snapshot. Keep old repo as read-only reference until sure new one is healthy.
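The versioning recovery can be sketched as a helper that prints the commands for review before anything is run. The bucket and node names come from this doc; <pack-id> and <good-version-id> are placeholders to fill in from the listing:

```shell
# Print (not run) the S3-versioning recovery commands for one pack file.
print_s3_recovery() {
  bucket=augustin-backups
  node="$1"
  # 1. find the last-known-good version of the pack:
  echo "aws s3api list-object-versions --bucket $bucket --prefix $node/data/"
  # 2. copy the good version back over the corrupt current one:
  echo "aws s3api copy-object --bucket $bucket --key $node/data/<pack-id> --copy-source '$bucket/$node/data/<pack-id>?versionId=<good-version-id>'"
  # 3. re-verify the repo:
  echo "restic -r s3:s3.us-west-2.amazonaws.com/$bucket/$node check"
}
print_s3_recovery pentium
```

Deleting the corrupt current version (aws s3api delete-object --version-id) also works: the prior version resurfaces as latest.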

Scenario 6: Cloudflared Tunnel Dead

All public URLs and ssh.augustin.ai stop working. Check in this order:
  1. docker service ps cloudflared_cloudflared — task running?
  2. docker service logs cloudflared_cloudflared — registration errors?
  3. Cloudflare dashboard → Zero Trust → Networks → Tunnels — is the tunnel showing “healthy”?
  4. If tunnel credentials are compromised/rotated: regenerate TUNNEL_TOKEN at Cloudflare, update cloudflared/.env, docker service update --force cloudflared_cloudflared
If you need out-of-band access with cloudflared down: SSH into the node on its LAN IP (192.168.1.100 or .16) from a machine on the LAN.

Known Gaps

  • Config for mergerfs mount order is in /etc/fstab only, not in git. If /etc/fstab is lost, reconstruct from docs/infrastructure/hardware.md.
  • Cron jobs are not in git either. Known jobs to recreate: i3 has backup-i3.sh, stalwart backup, aiclient2api kiro sync, cwa-ingest-cron, and torrent-to-cwa; Pentium has backup-pentium.sh and drive-check.sh (via other agent).
  • Docker daemon.json insecure-registries list is machine-local. Must be restored on any node that pulls from the local registry.
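One way to close the cron gap above is to snapshot each node's crontab into the repo on a schedule (a sketch; the scripts/cron/ destination is an assumption, not an existing path):

```shell
# Snapshot this node's crontab so cron jobs survive a rebuild; commit the
# resulting file. The scripts/cron/ path is an assumption, not an existing dir.
mkdir -p "$HOME/apps/scripts/cron"
crontab -l > "$HOME/apps/scripts/cron/$(uname -n).crontab" 2>/dev/null || true
ls "$HOME/apps/scripts/cron"
```

Adding this to the backup scripts would put the crontabs inside the existing restic snapshots as well.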