Future State Architecture
A comprehensive description of the target architecture for the homelab, AI orchestration, state management, and service model. Produced from a deep planning session on 2026-03-08.
Guiding Principles
- Declarative over imperative. The system should be defined in files that describe the desired state, not scripts that perform steps.
- Machines are dumb compute. No machine is special. Any machine should be replaceable by installing the OS and converging it from an external definition.
- The control path cannot depend on what it controls. The systems used to manage infrastructure must not run on that same infrastructure.
- One brain, many hands. A single conversational AI orchestrator dispatches work to local executors on each machine. The orchestrator is the interface; the executors are the tools.
- Markdown is the universal format. All knowledge, memory, decisions, and documentation are stored as plain text files — human-readable, machine-readable, version-controllable, vendor-agnostic.
- Clear separation between services and tools. If something has its own web UI, its own users, or runs 24/7 independently, it’s a containerized service. If it’s something the AI uses, queries, or runs, it lives in the AI workspace.
Infrastructure
Machine Model
Machines (currently two Dell OptiPlex 3020s, potentially more in the future) run a minimal OS with Docker and SSH. They receive their configuration from an external source and can be rebuilt from scratch without losing any state that matters. The target OS model leans toward either an immutable Linux distribution (e.g., Talos), where the OS itself is managed declaratively via API with no SSH/shell, or the current Debian setup with external provisioning (e.g., Ansible) that converges machines to a declared state. The immutable OS approach is the more architecturally pure option; Debian + Ansible is the pragmatic fallback. A self-hosted PaaS layer (e.g., Coolify) is also under consideration for the deployment and visibility layer: a centralized dashboard showing all machines and services, git-push deployments for rapid prototyping, and built-in Traefik management. This could coexist with or replace manual compose file management.
Provisioning
Machine provisioning definitions live outside the machines — in a git repository, a cloud service, or a CI system. Adding a new machine to the fleet is: install the OS, run one bootstrap command (or PXE boot), and the machine joins and receives its workload assignments. If a machine dies, the recovery process is: install the OS on new/repaired hardware, re-run provisioning, restore service data from backups. The provisioning layer is never lost because it doesn’t live on the machines.
Scaling
The architecture supports N machines without re-architecture. Services are assigned to machines based on hardware capabilities (storage, RAM) and role, but the assignment mechanism is centralized and declarative. Adding a third, fourth, or fifth machine should not require rethinking how services are distributed — only updating the assignment.
Networking
Traefik remains the reverse proxy. Cloudflare tunnel remains the ingress path from the internet. These are foundational infrastructure components that live on the control plane (see below), not on workload machines.
Control Plane Separation
The Problem
Currently, the entire control path — Synapse (Matrix messaging), OpenClaw (AI gateway), Traefik (routing), Cloudflare tunnel (ingress), and backup orchestration — runs on i3. If i3 goes down, the ability to communicate with and manage the infrastructure is lost. The control plane controls itself, creating a circular dependency.
The Solution
The control path lives on a separate VPS (or equivalent external compute) that is not part of the homelab hardware. This VPS runs:
- Synapse (Matrix homeserver) — the messaging backbone
- OpenClaw gateway — the AI orchestrator
- Cloudflare tunnel or direct ingress — public routing into the system
- Kiro CLI — for managing the VPS itself
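As a sketch, the control-plane stack above could be expressed as a single compose file on the VPS. Service names, image tags (the OpenClaw gateway image in particular), volumes, and network wiring are illustrative assumptions, not a tested configuration:

```yaml
# Hypothetical control-plane compose file for the VPS.
services:
  traefik:
    image: traefik:v3.0
    command: ["--providers.docker=true"]
    ports: ["443:443"]
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro

  synapse:
    image: matrixdotorg/synapse:latest   # the messaging backbone
    volumes:
      - synapse-data:/data

  cloudflared:
    image: cloudflare/cloudflared:latest # public ingress into the system
    command: tunnel run
    environment:
      - TUNNEL_TOKEN=${TUNNEL_TOKEN}     # secret injected, never committed

  openclaw:
    image: openclaw/gateway:latest       # assumed image name
    volumes:
      - ./workspace:/workspace           # the AI's single source of truth

volumes:
  synapse-data:
```

Because the whole stack is declared in one file, rebuilding the VPS reduces to: provision, restore volumes from backup, `docker compose up -d`.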
Break-Glass Path
The VPS is the one component that cannot be fully self-managed without circular dependency. OpenClaw manages the VPS for routine operations (updates, config changes, restarts). Direct SSH into the VPS is the escape hatch for when OpenClaw itself is broken. This is an accepted tradeoff — the same pattern cloud providers use for out-of-band management.
AI Orchestration
Architecture
- OpenClaw is the brain — the always-on conversational AI agent. It receives messages, reasons about them, maintains memory and context, and dispatches work.
- Kiro CLI is the hands — installed on each machine (including the VPS). It has shell access, Docker access, file access, and the AGENTS.md context for every service. It executes what OpenClaw tells it to.
- Matrix is the nervous system — the messaging protocol that connects the user to OpenClaw and potentially OpenClaw to other services.
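The exact OpenClaw-to-Kiro dispatch mechanism is still an open question (see Open Questions below). A minimal sketch of the simplest candidate — SSH plus subprocess — with a hypothetical `kiro run` invocation on the remote side; host, user, and subcommand names are assumptions:

```python
import shlex

def build_dispatch(host: str, task: str, user: str = "ops") -> list[str]:
    """Build the argv the orchestrator would hand to subprocess.run().

    Assumes the SSH + subprocess option and a hypothetical `kiro run <task>`
    entry point on the remote machine; quoting keeps the task a single
    argument on the remote shell.
    """
    return ["ssh", f"{user}@{host}", "kiro", "run", shlex.quote(task)]

# Example: ask the i3 machine's local executor to check on a service.
cmd = build_dispatch("i3.lan", "docker ps --filter name=jellyfin")
```

Keeping the dispatch a pure argv-builder means the same logic works whether the transport ends up being SSH, a Matrix bridge, or something else.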
Messaging Layer
Matrix is the messaging protocol. It provides:
- Self-hosting (on the VPS, not dependent on third-party services)
- Open protocol with multiple client options
- Room-based context separation (equivalent to Discord channels)
- Native OpenClaw integration via the Matrix plugin
- Bridge support for other platforms (Discord already bridged via mautrix-discord, Telegram/WhatsApp possible via mautrix bridges)
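Under the hood, "send a message to a room" is one call in the Matrix client-server API: `PUT /_matrix/client/v3/rooms/{roomId}/send/{eventType}/{txnId}`. A sketch of just the request shape — in practice a client SDK (or the OpenClaw Matrix plugin) handles auth tokens and transaction-ID bookkeeping; the homeserver and room here are placeholders:

```python
from urllib.parse import quote

def build_send_request(homeserver: str, room_id: str, text: str, txn_id: str):
    """Build (method, url, payload) for posting a plain-text message to a
    Matrix room via the client-server API. Room IDs contain '!' and ':',
    so they must be percent-encoded in the path."""
    url = (f"{homeserver}/_matrix/client/v3/rooms/{quote(room_id, safe='')}"
           f"/send/m.room.message/{txn_id}")
    return "PUT", url, {"msgtype": "m.text", "body": text}

method, url, payload = build_send_request(
    "https://matrix.example.org", "!infra:example.org", "backup OK", "txn1")
```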
Context Separation
Different concerns get different Matrix rooms (or equivalent channels). This prevents context pollution — research doesn’t bleed into infrastructure management, bookmarks don’t pollute daily tasks. Each room can potentially use different models or thinking levels for cost optimization (expensive models for deep reasoning, cheap models for routine checks).
State Management
Taxonomy of State
All state in the system falls into one of these categories, each with a defined home:
| State Type | What It Is | Where It Lives | Format |
|---|---|---|---|
| Infrastructure definitions | Compose files, Traefik configs, provisioning playbooks, AGENTS.md per service | ~/apps git repo (on machines) + provisioning repo (external) | YAML, Markdown |
| Service data | Application databases, media libraries, email stores | Docker volumes on each machine | Service-specific |
| Secrets | API keys, passwords, tokens, certificates | Encrypted storage (SOPS in git, Vault, or Bitwarden Secrets) — not plaintext .env files | Encrypted |
| Agent identity | Personality, behavior rules, communication style | OpenClaw workspace: SOUL.md, IDENTITY.md, USER.md, AGENTS.md | Markdown |
| Agent memory | Conversation history, learned preferences, distilled knowledge | OpenClaw workspace: memory/YYYY-MM-DD.md (daily logs), MEMORY.md (long-term), main.sqlite (vector search) | Markdown + SQLite |
| Personal knowledge base | Research, articles, video notes, links, ideas, reference material | OpenClaw workspace: knowledge/ directory with subdirectories by type, plus SQLite with vector embeddings for semantic search | Markdown + SQLite |
| Decision log | Architectural decisions with context and rationale | OpenClaw workspace: decisions/YYYY-MM-DD-title.md (ADR format) | Markdown |
| Operational history | What ran, what failed, health over time | OpenClaw workspace: ops/snapshots/ (daily), ops/cron-log.sqlite (automation history) | Markdown + SQLite |
| Task/project tracking | What’s in progress, planned, blocked | OpenClaw workspace: tasks/ (SQLite or markdown) | SQLite or Markdown |
| Automation state | Cron job definitions, heartbeat state, pipeline configs | OpenClaw workspace: cron system + HEARTBEAT.md | JSON + Markdown |
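Read as a directory layout, the workspace paths in the table above imply something like the following (the nesting and the example filenames are assumptions drawn from the table, not a confirmed structure):

```
workspace/
├── SOUL.md, IDENTITY.md, USER.md, AGENTS.md  # agent identity
├── MEMORY.md                                 # long-term memory
├── memory/2026-03-08.md                      # daily logs
├── main.sqlite                               # vector search index
├── knowledge/                                # KB, subdirectories by type
├── decisions/2026-03-08-example-title.md     # ADRs
├── ops/snapshots/, ops/cron-log.sqlite       # operational history
├── tasks/                                    # task tracking
└── HEARTBEAT.md                              # automation state
```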
OpenClaw Workspace Structure
The OpenClaw workspace is the single source of truth for everything the AI knows, remembers, and manages (excluding infrastructure definitions and service data, which have their own homes).
The Workspace/Container Line
The rule for what lives in the OpenClaw workspace vs. what gets its own container:
- Container: has its own web UI, has its own users, needs to run 24/7 independently of the AI, is a third-party application. Examples: Jellyfin, Synapse, Sonarr, Stalwart, Miniflux, Uptime Kuma.
- Workspace: is a tool the AI uses, a database the AI queries, automation the AI runs, or a quick app the AI built. Examples: CRM (SQLite + skill), knowledge base (markdown + embeddings), daily briefing (cron + prompt), bookmark manager (skill + SQLite), health tracker (markdown + analysis).
Services
Current Services (Remain as Containers)
These are stable, long-running services that stay as Docker containers managed via compose files.
Workload machines (i3 / Pentium / future machines):
- Media: Jellyfin, Audiobookshelf, Calibre Web Automated
- Media management: Sonarr, Radarr, Bazarr, Prowlarr, Chaptarr, Seerr
- Downloads: qBittorrent, Gluetun (VPN), FlareSolverr, qSticky
- Reading: Miniflux, Nextflux, RSSHub, Feed Scraper
- Productivity: Excalidraw (Draw), Chromium
- Tracking: Yamtrack
- Auth: Pocket-ID
- Email: Stalwart
- Monitoring: Uptime Kuma, Homepage
Control plane (VPS):
- Synapse + Element (Matrix)
- OpenClaw gateway
- Cloudflare tunnel (or direct ingress)
- Traefik
Future AI-Built Tools (Live in Workspace)
These are automations and personal tools that OpenClaw builds and runs inside its workspace, not as separate containers:
- Personal CRM (SQLite + natural language queries)
- Knowledge base with semantic search (markdown + vector embeddings)
- Daily briefing system (cron + data aggregation + messaging delivery)
- Bookmark/link manager (replaces paid services like Raindrop)
- Operational monitoring summaries (daily snapshots of infrastructure health)
- Task tracking (local task database managed conversationally)
- Any rapid prototypes, quick web apps, or experimental tools
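The knowledge-base idea above (markdown plus vector embeddings in SQLite) fits in a few dozen lines. This is a minimal sketch: `embed()` is a toy bag-of-characters stand-in for a real embedding model, embeddings are stored as float BLOBs, and cosine similarity is computed in Python rather than via a vector extension; table and file names are illustrative:

```python
import sqlite3, struct, math

def embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: a 26-dim letter-count vector."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - 97] += 1.0
    return vec

def pack(vec):   return struct.pack(f"{len(vec)}f", *vec)
def unpack(blob): return list(struct.unpack(f"{len(blob) // 4}f", blob))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

db = sqlite3.connect(":memory:")  # real use: a file inside the workspace
db.execute("CREATE TABLE notes (path TEXT, body TEXT, embedding BLOB)")
for path, body in [("knowledge/zfs.md", "zfs snapshot and replication notes"),
                   ("knowledge/matrix.md", "matrix rooms and bridges")]:
    db.execute("INSERT INTO notes VALUES (?, ?, ?)",
               (path, body, pack(embed(body))))

def search(query: str, k: int = 1):
    """Rank all notes by cosine similarity to the query embedding."""
    q = embed(query)
    rows = db.execute("SELECT path, embedding FROM notes").fetchall()
    return sorted(rows, key=lambda r: -cosine(q, unpack(r[1])))[:k]
```

Swapping `embed()` for a real model and the linear scan for a vector extension (one of the open questions below) changes nothing about the storage shape.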
Backups
Current System (Remains)
Nightly restic backups — encrypted, deduplicated, incremental — with cross-machine replication and offsite copies to Backblaze B2. Database dumps run before each backup. Retention: 7 daily, 4 weekly, 3 monthly.
Additions for Future State
- OpenClaw workspace must be included in backups. It contains the AI’s entire brain — memory, knowledge base, decisions, task state, automation history. This is the most important new backup target.
- VPS control plane needs its own backup strategy. Synapse database, OpenClaw workspace, and configuration should back up to B2 (or equivalent offsite storage). The VPS itself is rebuildable from provisioning definitions, but the data on it is not.
- Secrets migration — moving from plaintext `.env` files to encrypted storage means the backup strategy for secrets changes. Encrypted secrets in git are self-backing-up. Secrets in Vault or Bitwarden need their own backup consideration.
- Automation run history (`cron-log.sqlite`) and knowledge base (`knowledge.sqlite`) are SQLite databases that should be included in the standard backup sweep. The existing restic backup of `~/apps` would cover this if the workspace is mounted from within that tree.
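The retention policy above (7 daily, 4 weekly, 3 monthly) maps directly onto restic's `forget` flags. A sketch of just the policy portion; repository URL, credentials, and the cron wiring around the command are deliberately left out:

```python
def restic_retention_args(daily: int = 7, weekly: int = 4, monthly: int = 3):
    """Build the `restic forget` invocation matching the retention policy.
    `--prune` removes the data made unreferenced by forgetting snapshots."""
    return ["restic", "forget", "--prune",
            "--keep-daily", str(daily),
            "--keep-weekly", str(weekly),
            "--keep-monthly", str(monthly)]

args = restic_retention_args()
```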
Cost Model
- VPS for control plane: ~$5-10/month for a small instance running Synapse, OpenClaw, Cloudflare tunnel, Traefik
- LLM API tokens for OpenClaw: variable, depends on usage. Cost optimization via model routing — expensive models (Opus) for deep reasoning, cheap models for routine heartbeats, cron jobs, and simple queries
- Kiro CLI: covered by work (Kiro Power plan), no additional cost
- Backblaze B2: existing, minimal cost
- Local model hardware (future): under consideration for reducing API token costs. Would run on homelab hardware or dedicated GPU machine.
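The model-routing idea in the cost model reduces to a lookup with a cheap default. The tier names and model labels here are illustrative assumptions, not actual OpenClaw configuration:

```python
# Hypothetical routing table: expensive models for deep reasoning, cheap
# ones for heartbeats, cron jobs, and simple queries.
ROUTES = {
    "deep-reasoning": "opus",         # architecture, planning, research
    "routine":        "cheap-small",  # heartbeats, cron, simple queries
}

def pick_model(task_kind: str) -> str:
    """Default to the cheap tier so an unrecognised task kind never burns
    expensive tokens by accident."""
    return ROUTES.get(task_kind, ROUTES["routine"])
```

The same table could later grow a "local" tier once local model hosting lands.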
Open Questions
These are acknowledged but not yet decided:
- Exact provisioning tool: Ansible, Talos, NixOS, or something else for machine convergence
- PaaS layer: whether Coolify (or similar) adds enough value for deployment visibility and rapid prototyping to justify the added platform complexity
- Matrix client: Element, custom client, or alternative — depends on UX needs for AI interaction
- Secrets management: SOPS, Vault, Bitwarden Secrets, or another solution
- Local model hosting: hardware requirements, which models, how it integrates with OpenClaw’s model routing
- OpenClaw↔Kiro interface: the exact mechanism by which OpenClaw dispatches work to Kiro CLI on each machine (SSH + subprocess, Matrix bridge, or another approach). The `kiro-bridge` design doc in `kiro-bridge/AGENTS.md` captures the exploration so far.
- Knowledge base search tooling: QMD, SQLite with vector extensions, or another semantic search solution
- Rapid prototyping workflow: whether AI-built quick apps are served directly from the workspace, deployed via a PaaS, or handled some other way