

Steps to recover the homelab from various failure modes. Assumes you have access to AWS account 389565021081, the GitHub repo amarcin/homelab, and physical access (or SSH via an existing working node) to the remaining hardware.

Recovery Keys You Need

All stored outside the homelab (1Password, hardware token, etc.):
  • AWS admin credentials for account 389565021081
  • Restic repo password (also in scripts/.restic-password on each node — but you need it to bootstrap a node)
  • GitHub SSH key (for cloning the repo to a replacement node)
  • The .env-backup.tar.gz restic snapshot, which holds the env files for every service

Scenario 1: Single Service Corrupted

Fastest case. Restore from S3.
restic -r s3:s3.us-west-2.amazonaws.com/augustin-backups/<node> snapshots
restic -r s3:s3.us-west-2.amazonaws.com/augustin-backups/<node> restore <id> --target /tmp/restore --include /home/user/apps/<service>
Stop the service, replace its data/ dir, bring it back up.
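The "stop, replace data/, bring it back up" step can be sketched as a small script. The service name and the data/ path are assumptions (adjust to the actual stack layout); it defaults to a dry run so the plan can be checked before anything is stopped:

```shell
# Sketch of the stop/replace/up cycle for one corrupted service.
# SERVICE and the data/ path are assumptions; set DRYRUN=0 to actually execute.
SERVICE="${SERVICE:-uptime-kuma}"   # example name from the Pentium stack list
SRC="/tmp/restore/home/user/apps/$SERVICE/data/"
DST="$HOME/apps/$SERVICE/data/"
echo "restore plan: $SRC -> $DST"
if [ "${DRYRUN:-1}" = "1" ]; then
  echo "dry run only; set DRYRUN=0 to execute"
else
  docker stack rm "$SERVICE"        # stop the stack
  sleep 15                          # let tasks drain
  rsync -a --delete "$SRC" "$DST"   # swap in the restored data dir
  ( cd "$HOME/apps/$SERVICE" || exit 1
    set -a; . ./.env 2>/dev/null; set +a
    docker stack deploy -c compose.yaml "$SERVICE" )
fi
```

Run with DRYRUN=0 once the restore under /tmp/restore looks right.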

Scenario 2: Pentium Total Loss

Pentium hosts: matrix, reader, stalwart, n8n, yamtrack, searxng, uptime-kuma, draw, pocket-id. All DBs are on Pentium.
  1. Install Debian 13 on new hardware (or reinstalled Pentium)
  2. Create user (UID 1000), add to sudo, enable passwordless sudo
  3. apt install docker.io git rsync restic
  4. Configure ~/.ssh/config aliases for i3 (192.168.1.100) and ssh-copy-id so SSH works both directions
  5. Clone repo: git clone git@github.com:amarcin/homelab.git ~/apps
  6. bash ~/apps/scripts/systemd/install.sh — enable the 5-min auto-pull
  7. Put /etc/docker/daemon.json in place:
    {"insecure-registries": ["192.168.1.100:5000"]}
    
    Then restart Docker: sudo systemctl restart docker
  8. Restore Pentium’s /home/user/apps from S3:
    export RESTIC_PASSWORD_FILE=~/apps/scripts/.restic-password
    source ~/apps/scripts/.aws-env
    restic -r s3:s3.us-west-2.amazonaws.com/augustin-backups/pentium snapshots
    restic -r s3:s3.us-west-2.amazonaws.com/augustin-backups/pentium restore latest --target /tmp/restore
    rsync -a /tmp/restore/home/user/apps/ ~/apps/
    
  9. Restore DBs from the dumps inside the restored repo:
    • matrix/matrix-db-dump.sql → once the matrix-db container is up, cat matrix-db-dump.sql | docker exec -i matrix_matrix-db.N psql -U synapse
    • Same pattern for reader/miniflux-db-dump.sql
  10. On i3, join pentium back to the swarm: get a token with docker swarm join-token manager and run the printed join command on pentium (if i3 lost quorum when Pentium went down, recover it first; see Scenario 4)
  11. Redeploy all Pentium stacks from i3:
    for s in matrix reader stalwart n8n yamtrack searxng uptime-kuma draw pocket-id; do
      cd ~/apps/$s && set -a; source .env 2>/dev/null; set +a
      docker stack deploy -c compose.yaml $s
    done
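Step 11's loop can also be wrapped as a dry-run-first helper, a sketch assuming the same repo layout (prints the plan by default; pass --run to execute):

```shell
# Stack list copied from this doc; deploys assume ~/apps/<stack>/compose.yaml.
deploy_pentium_stacks() {
  for s in matrix reader stalwart n8n yamtrack searxng uptime-kuma draw pocket-id; do
    if [ "$1" = "--run" ]; then
      ( cd ~/apps/"$s" || exit 1
        set -a; . ./.env 2>/dev/null; set +a   # load the service's env file
        docker stack deploy -c compose.yaml "$s" )
    else
      echo "would deploy: $s"
    fi
  done
}
deploy_pentium_stacks            # dry run: list the nine stacks first
```

Review the dry-run output, then call deploy_pentium_stacks --run.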

Scenario 3: i3 Total Loss

More painful. i3 hosts media data on mergerfs SanDisks. Those are physical drives — recover them to the new box before anything else.
  1. If SanDisks are intact: plug them into the new i3. Verify labels (sandisk-2tb, sandisk-1tb-lg, sandisk-1tb-dg) match /etc/fstab.
  2. If SanDisks are lost: your media is gone. Repopulate via arr stack / torrent indexers over time. Jellyfin library metadata will rebuild, but watch progress/history is gone unless backed up separately.
  3. Steps 1-6 from Scenario 2, substituting i3 for pentium (restic repo augustin-backups/i3; the ~/.ssh/config alias points at Pentium instead)
  4. Additional for i3:
    • Install mergerfs: apt install mergerfs
    • Verify /mnt/main/media mounts correctly on boot
    • Set up CDI for Jellyfin (see docs/archive/media/jellyfin.md)
  5. Restore /home/user/apps from S3 (i3 repo)
  6. If the swarm survived on Pentium: remove the dead i3 entry with docker node demote <old-i3> followed by docker node rm <old-i3> (docker swarm leave runs on the leaving node, which no longer exists), then docker swarm join-token manager to get a new join command. With only two managers, losing i3 usually breaks quorum; in that case run Scenario 4's --force-new-cluster on Pentium first.
  7. If the swarm is totally dead (both nodes lost): docker swarm init --advertise-addr 192.168.1.100 on the new i3, then join Pentium. A fresh init carries no Raft state, so every stack must be redeployed from its compose file in the repo, and only after the data restores are complete.
  8. Traefik + cloudflared must come up first on i3 so everything else is reachable
  9. Redeploy all stacks
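Because /etc/fstab is not in git (see Known Gaps), here is a sketch of what the i3 entries plausibly look like, using the disk labels above. The branch mount points, filesystem types, and mergerfs options are assumptions to verify against docs/infrastructure/hardware.md:

```
# /etc/fstab sketch (illustrative; confirm against docs/infrastructure/hardware.md)
LABEL=sandisk-2tb     /mnt/disks/sandisk-2tb     ext4  defaults,nofail  0 2
LABEL=sandisk-1tb-lg  /mnt/disks/sandisk-1tb-lg  ext4  defaults,nofail  0 2
LABEL=sandisk-1tb-dg  /mnt/disks/sandisk-1tb-dg  ext4  defaults,nofail  0 2
# pool the branches into /mnt/main so /mnt/main/media is the merged view
/mnt/disks/*  /mnt/main  fuse.mergerfs  defaults,allow_other,nofail  0 0
```

The branches must mount before the pool; nofail keeps a missing disk from blocking boot.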

Scenario 4: Swarm Quorum Lost (both managers down simultaneously but hardware OK)

Likely cause: docker daemon crash on both nodes, or one crashed and the other rebooted. Swarm Raft needs a manager majority to function.
  1. On one node, docker swarm init --force-new-cluster --advertise-addr <node-ip> — forces a new single-manager cluster
  2. Rejoin the other node with the join command printed by docker swarm join-token manager on the forced manager, then confirm both managers appear in docker node ls
  3. All services auto-recover from the stack definitions (they’re stored in Raft, which the forced-new-cluster preserves)

Scenario 5: Restic Repo Corruption

Detected by the weekly restic check cron. Symptoms: check fails, snapshots can’t be read.
  1. Don’t run restic forget --prune until fixed — it’ll make it worse
  2. S3 versioning has your back: find the last-known-good pack files via aws s3api list-object-versions --bucket augustin-backups --prefix <node>/data/, restore the prior versions
  3. Worst case: repo is unrecoverable. Start a new repo, accept data loss back to the last local snapshot. Keep old repo as read-only reference until sure new one is healthy.
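The versioning recovery can be sketched as a helper that prints the commands for review before anything is run. The bucket and node names come from this doc; <pack-id> and <good-version-id> are placeholders to fill in from the listing:

```shell
# Print (not run) the S3-versioning recovery commands for one pack file.
print_s3_recovery() {
  bucket=augustin-backups
  node="$1"
  # 1. find the last-known-good version of the pack:
  echo "aws s3api list-object-versions --bucket $bucket --prefix $node/data/"
  # 2. copy the good version back over the corrupt current one:
  echo "aws s3api copy-object --bucket $bucket --key $node/data/<pack-id> --copy-source '$bucket/$node/data/<pack-id>?versionId=<good-version-id>'"
  # 3. re-verify the repo:
  echo "restic -r s3:s3.us-west-2.amazonaws.com/$bucket/$node check"
}
print_s3_recovery pentium
```

Deleting the corrupt current version (aws s3api delete-object --version-id) also works: the prior version resurfaces as latest.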

Scenario 6: Cloudflared Tunnel Dead

All public URLs and ssh.augustin.ai stop working. Check in this order:
  1. docker service ps cloudflared_cloudflared — task running?
  2. docker service logs cloudflared_cloudflared — registration errors?
  3. Cloudflare dashboard → Zero Trust → Networks → Tunnels — is the tunnel showing “healthy”?
  4. If tunnel credentials are compromised/rotated: regenerate TUNNEL_TOKEN at Cloudflare, update cloudflared/.env, docker service update --force cloudflared_cloudflared
If you need out-of-band access with cloudflared down: SSH into the node on its LAN IP (192.168.1.100 or .16) from a machine on the LAN.

Known Gaps

  • Config for mergerfs mount order is in /etc/fstab only, not in git. If /etc/fstab is lost, reconstruct from docs/infrastructure/hardware.md.
  • Cron jobs are not in git either. Known jobs to recreate: i3 has backup-i3.sh, stalwart backup, aiclient2api kiro sync, cwa-ingest-cron, and torrent-to-cwa; Pentium has backup-pentium.sh and drive-check.sh (via other agent).
  • Docker daemon.json insecure-registries list is machine-local. Must be restored on any node that pulls from the local registry.
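One way to close the cron gap above is to snapshot each node's crontab into the repo on a schedule (a sketch; the scripts/cron/ destination is an assumption, not an existing path):

```shell
# Snapshot this node's crontab so cron jobs survive a rebuild; commit the
# resulting file. The scripts/cron/ path is an assumption, not an existing dir.
mkdir -p "$HOME/apps/scripts/cron"
crontab -l > "$HOME/apps/scripts/cron/$(uname -n).crontab" 2>/dev/null || true
ls "$HOME/apps/scripts/cron"
```

Adding this to the backup scripts would put the crontabs inside the existing restic snapshots as well.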