Steps to recover the homelab from various failure modes. Assumes you have access to AWS account
389565021081, the GitHub repo amarcin/homelab, and physical access (or SSH via an existing working node) to the remaining hardware.
## Recovery Keys You Need
All stored outside the homelab (1Password, hardware token, etc.):

- AWS admin credentials for account 389565021081
- Restic repo password (also in `scripts/.restic-password` on each node — but you need it to bootstrap a node)
- GitHub SSH key (for cloning the repo to a replacement node)
- `.env-backup.tar.gz` restic snapshot — the env files for every service
## Scenario 1: Single Service Corrupted
Fastest case. Restore the service's `data/` dir from S3, bring it back up.
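A minimal sketch of that restore. The bucket name `augustin-backups` comes from Scenario 5 and a per-node repo is suggested by Scenario 3's "(i3 repo)" note, but the exact repo URL and the `~/apps/<service>/data` layout are assumptions to verify first:

```bash
# Assumes AWS credentials are already in the environment (the backup cron
# needs them too) and this node still has its restic password file.
export RESTIC_PASSWORD_FILE=~/apps/scripts/.restic-password
export RESTIC_REPOSITORY="s3:s3.amazonaws.com/augustin-backups/pentium"  # assumed layout

SERVICE=matrix                                   # example: one corrupted service

docker stack rm "$SERVICE"                       # stop the stack before touching data/
restic snapshots                                 # pick a known-good snapshot first
restic restore latest --target / --include "/home/user/apps/$SERVICE/data"

cd ~/apps/"$SERVICE" && set -a; source .env 2>/dev/null; set +a
docker stack deploy -c compose.yaml "$SERVICE"   # bring it back up
```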
## Scenario 2: Pentium Total Loss
Pentium hosts: matrix, reader, stalwart, n8n, yamtrack, searxng, uptime-kuma, draw, pocket-id. All DBs are on Pentium.

1. Install Debian 13 on the new hardware (or the reinstalled Pentium)
2. Create `user` (UID 1000), add it to `sudo`, enable passwordless sudo
3. `apt install docker.io git rsync restic`
4. Configure `~/.ssh/config` aliases for `i3` (192.168.1.100) and `ssh-copy-id` so SSH works in both directions
5. Clone the repo: `git clone git@github.com:amarcin/homelab.git ~/apps`
6. `bash ~/apps/scripts/systemd/install.sh` — enable the 5-min auto-pull
7. Put `/etc/docker/daemon.json` in place (a hedged sketch follows this list), then restart docker
8. Restore Pentium's `/home/user/apps` from S3 (also sketched below)
9. Restore the DBs from the dumps inside the restored repo:
    - `matrix/matrix-db-dump.sql` → once the `matrix-db` container is up: `cat matrix-db-dump.sql | docker exec -i matrix_matrix-db.N psql -U synapse`
    - Same pattern for `reader/miniflux-db-dump.sql`
10. On i3, join Pentium back to the swarm: get the join token with `docker swarm join-token manager`, then run the join command on Pentium
11. Redeploy all Pentium stacks from i3: `for s in matrix reader stalwart n8n yamtrack searxng uptime-kuma draw pocket-id; do cd ~/apps/$s && set -a; source .env 2>/dev/null; set +a; docker stack deploy -c compose.yaml $s; done`
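For steps 7 and 8, a minimal sketch. The `insecure-registries` value is an assumption (Known Gaps notes the list is machine-local, so copy the real file from a surviving node if you can), and the restic repo URL follows the same assumed layout as in Scenario 1:

```bash
# Step 7: /etc/docker/daemon.json -- the registry address is an ASSUMPTION.
sudo tee /etc/docker/daemon.json >/dev/null <<'EOF'
{
  "insecure-registries": ["192.168.1.100:5000"]
}
EOF
sudo systemctl restart docker

# Step 8: restore the whole apps tree. On a fresh node the repo password is
# not on disk yet; it comes from 1Password (see Recovery Keys).
export AWS_ACCESS_KEY_ID=...          # admin creds for account 389565021081
export AWS_SECRET_ACCESS_KEY=...
export RESTIC_PASSWORD='<from 1Password>'
export RESTIC_REPOSITORY="s3:s3.amazonaws.com/augustin-backups/pentium"  # assumed layout
restic restore latest --target / --include /home/user/apps
```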
## Scenario 3: i3 Total Loss
More painful. i3 hosts media data on mergerfs SanDisks. Those are physical drives — recover them to the new box before anything else.

- If the SanDisks are intact: plug them into the new i3. Verify the labels (`sandisk-2tb`, `sandisk-1tb-lg`, `sandisk-1tb-dg`) match `/etc/fstab` (a hedged fstab sketch follows these bullets).
- If the SanDisks are lost: your media is gone. Repopulate via the arr stack / torrent indexers over time. Jellyfin library metadata will rebuild, but watch progress/history is gone unless backed up separately.
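Known Gaps says the mergerfs mount order lives only in `/etc/fstab`. A sketch of what it plausibly contains, assuming ext4 branches mounted by label and pooled at `/mnt/main` — the branch mount points and mergerfs options are assumptions; reconstruct the real values from `docs/infrastructure/hardware.md`:

```
# Branch mounts first, pool last (mergerfs needs its branches up before it).
LABEL=sandisk-2tb     /mnt/sandisk-2tb     ext4  defaults,nofail  0 2
LABEL=sandisk-1tb-lg  /mnt/sandisk-1tb-lg  ext4  defaults,nofail  0 2
LABEL=sandisk-1tb-dg  /mnt/sandisk-1tb-dg  ext4  defaults,nofail  0 2
/mnt/sandisk-*  /mnt/main  fuse.mergerfs  defaults,allow_other,category.create=mfs,nofail  0 0
```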
1. Steps 1-6 from Scenario 2, substituting `pentium` → `i3`
2. Additional for i3:
    - Install mergerfs: `apt install mergerfs`
    - Verify `/mnt/main/media` mounts correctly on boot
    - Set up CDI for Jellyfin (see `docs/archive/media/jellyfin.md`)
3. Restore `/home/user/apps` from S3 (i3 repo)
4. On a still-running manager (if the swarm survived on Pentium): remove the old dead i3 node (`docker node rm --force <node-id>`), then `docker swarm join-token manager` to get a new join command
5. If the swarm is totally dead (both nodes lost): `docker swarm init --advertise-addr 192.168.1.100` on the new i3, then join Pentium. Stacks auto-recover from the declarative compose files, but data restores must be complete first.
6. Traefik + cloudflared must come up first on i3 so everything else is reachable
7. Redeploy all stacks (see the deploy-order sketch after this list)
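For steps 6-7, a sketch of the redeploy order, assuming every stack follows the `~/apps/<name>/compose.yaml` pattern used throughout this doc. The stack list after the ingress pair is an assumption; adjust it to the directories actually present in `~/apps` on i3:

```bash
# Ingress first (traefik + cloudflared) so everything behind them is reachable.
for s in traefik cloudflared jellyfin; do
  cd ~/apps/$s && set -a; source .env 2>/dev/null; set +a
  docker stack deploy -c compose.yaml $s
done
```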
## Scenario 4: Swarm Quorum Lost (both managers down simultaneously but hardware OK)
Likely cause: docker daemon crash on both nodes, or one crashed and the other rebooted. Swarm Raft needs a manager majority to function.

- On one node, `docker swarm init --force-new-cluster --advertise-addr <node-ip>` — forces a new single-manager cluster
- Rejoin the other node with the join command from `docker swarm join-token manager` on the forced manager
- All services auto-recover from the stack definitions (they're stored in Raft, which `--force-new-cluster` preserves)
## Scenario 5: Restic Repo Corruption
Detected by the weekly `restic check` cron. Symptoms: the check fails, snapshots can't be read.
- Don’t run
restic forget --pruneuntil fixed — it’ll make it worse - S3 versioning has your back: find the last-known-good pack files via
aws s3api list-object-versions --bucket augustin-backups --prefix <node>/data/, restore the prior versions - Worst case: repo is unrecoverable. Start a new repo, accept data loss back to the last local snapshot. Keep old repo as read-only reference until sure new one is healthy.
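A sketch of rolling one damaged pack file back to its previous version, assuming versioning is enabled on `augustin-backups` (the key is a placeholder; take real keys and version IDs from the `list-object-versions` output):

```bash
BUCKET=augustin-backups
KEY="pentium/data/aa/aabb..."   # placeholder: one damaged pack file

# Find the VersionId of the last good copy.
aws s3api list-object-versions --bucket "$BUCKET" --prefix "$KEY"

# Copy that version back over the current (corrupt) object.
aws s3api copy-object \
  --bucket "$BUCKET" --key "$KEY" \
  --copy-source "$BUCKET/$KEY?versionId=GOOD_VERSION_ID"

# Then re-check the repo (RESTIC_* env as in Scenario 1).
restic check --read-data-subset=5%
```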
## Scenario 6: Cloudflared Tunnel Dead
All public URLs and `ssh.augustin.ai` stop working. Check in this order:
1. `docker service ps cloudflared_cloudflared` — is the task running?
2. `docker service logs cloudflared_cloudflared` — any registration errors?
3. Cloudflare dashboard → Zero Trust → Networks → Tunnels — is the tunnel showing "healthy"?
4. If the tunnel credentials are compromised/rotated: regenerate `TUNNEL_TOKEN` at Cloudflare, update `cloudflared/.env`, then `docker service update --force cloudflared_cloudflared` (sketched below)
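A sketch of step 4, assuming `cloudflared/.env` holds a `TUNNEL_TOKEN=...` line that the compose file injects at deploy time (the sed pattern is an assumption about the .env format):

```bash
cd ~/apps/cloudflared

# Swap in the token regenerated in the Cloudflare dashboard.
sed -i 's/^TUNNEL_TOKEN=.*/TUNNEL_TOKEN=NEW_TOKEN_FROM_CLOUDFLARE/' .env

# Redeploy so the new value reaches the service, then force a restart.
set -a; source .env; set +a
docker stack deploy -c compose.yaml cloudflared
docker service update --force cloudflared_cloudflared
```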
## Known Gaps
- The mergerfs mount order is in `/etc/fstab` only, not in git. If `/etc/fstab` is lost, reconstruct it from `docs/infrastructure/hardware.md`.
- Cron jobs are not in git either; documented here instead: i3 has backup-i3.sh, the stalwart backup, the aiclient2api kiro sync, cwa-ingest-cron, and torrent-to-cwa. Pentium has backup-pentium.sh and drive-check.sh (via other agent).
- The docker daemon.json `insecure-registries` list is machine-local. It must be restored on any node that pulls from the local registry.