Future State Architecture
A comprehensive description of the target architecture for the homelab, AI orchestration, state management, and service model. Produced from a deep planning session on 2026-03-08.
Guiding Principles
- Declarative over imperative. The system should be defined in files that describe the desired state, not scripts that perform steps.
- Machines are dumb compute. No machine is special. Any machine should be replaceable by installing the OS and converging it from an external definition.
- The control path cannot depend on what it controls. The systems used to manage infrastructure must not run on that same infrastructure.
- One brain, many hands. A single conversational AI orchestrator dispatches work to local executors on each machine. The orchestrator is the interface; the executors are the tools.
- Markdown is the universal format. All knowledge, memory, decisions, and documentation are stored as plain text files — human-readable, machine-readable, version-controllable, vendor-agnostic.
- Clear separation between services and tools. If something has its own web UI, its own users, or runs 24/7 independently, it’s a containerized service. If it’s something the AI uses, queries, or runs, it lives in the AI workspace.
Infrastructure
Machine Model
Machines (currently two Dell OptiPlex 3020s, potentially more in the future) run a minimal OS with Docker and SSH. They receive their configuration from an external source and can be rebuilt from scratch without losing any state that matters. The target OS model leans toward either an immutable Linux distribution (e.g., Talos), where the OS itself is managed declaratively via API with no SSH/shell, or the current Debian setup with external provisioning (e.g., Ansible) that converges machines to a declared state. The immutable OS approach is the more architecturally pure option; Debian + Ansible is the pragmatic fallback. A self-hosted PaaS layer (e.g., Coolify) is also under consideration for the deployment and visibility layer: a centralized dashboard showing all machines and services, git-push deployments for rapid prototyping, and built-in Traefik management. This could coexist with or replace manual compose file management.
Provisioning
Machine provisioning definitions live outside the machines — in a git repository, a cloud service, or a CI system. Adding a new machine to the fleet is: install the OS, run one bootstrap command (or PXE boot), and the machine joins and receives its workload assignments. If a machine dies, the recovery process is: install the OS on new/repaired hardware, re-run provisioning, restore service data from backups. The provisioning layer is never lost because it doesn’t live on the machines.
Scaling
The architecture supports N machines without re-architecture. Services are assigned to machines based on hardware capabilities (storage, RAM) and role, but the assignment mechanism is centralized and declarative. Adding a third, fourth, or fifth machine should not require rethinking how services are distributed — only updating the assignment.
Networking
Traefik remains the reverse proxy. Cloudflare tunnel remains the ingress path from the internet. These are foundational infrastructure components that live on the control plane (see below), not on workload machines.
Control Plane Separation
The Problem
Currently, the entire control path — Synapse (Matrix messaging), OpenClaw (AI gateway), Traefik (routing), Cloudflare tunnel (ingress), and backup orchestration — runs on i3. If i3 goes down, the ability to communicate with and manage the infrastructure is lost. The control plane controls itself, creating a circular dependency.
The Solution
The control path lives on a separate VPS (or equivalent external compute) that is not part of the homelab hardware. This VPS runs:
- Synapse (Matrix homeserver) — the messaging backbone
- OpenClaw gateway — the AI orchestrator
- Cloudflare tunnel or direct ingress — public routing into the system
- Kiro CLI — for managing the VPS itself
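As a sketch, the control-plane stack above could be expressed as a single compose file on the VPS. Service names, image tags (the OpenClaw gateway image in particular), volumes, and network wiring are illustrative assumptions, not a tested configuration:

```yaml
# Hypothetical control-plane compose file for the VPS.
services:
  traefik:
    image: traefik:v3.0
    command: ["--providers.docker=true"]
    ports: ["443:443"]
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro

  synapse:
    image: matrixdotorg/synapse:latest   # the messaging backbone
    volumes:
      - synapse-data:/data

  cloudflared:
    image: cloudflare/cloudflared:latest # public ingress into the system
    command: tunnel run
    environment:
      - TUNNEL_TOKEN=${TUNNEL_TOKEN}     # secret injected, never committed

  openclaw:
    image: openclaw/gateway:latest       # assumed image name
    volumes:
      - ./workspace:/workspace           # the AI's single source of truth

volumes:
  synapse-data:
```

Because the whole stack is declared in one file, rebuilding the VPS reduces to: provision, restore volumes from backup, `docker compose up -d`.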
Break-Glass Path
The VPS is the one component that cannot be fully self-managed without circular dependency. OpenClaw manages the VPS for routine operations (updates, config changes, restarts). Direct SSH into the VPS is the escape hatch for when OpenClaw itself is broken. This is an accepted tradeoff — the same pattern cloud providers use for out-of-band management.
AI Orchestration
Architecture
- OpenClaw is the brain — the always-on conversational AI agent. It receives messages, reasons about them, maintains memory and context, and dispatches work.
- Kiro CLI is the hands — installed on each machine (including the VPS). It has shell access, Docker access, file access, and the AGENTS.md context for every service. It executes what OpenClaw tells it to.
- Matrix is the nervous system — the messaging protocol that connects the user to OpenClaw and potentially OpenClaw to other services.
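The exact OpenClaw-to-Kiro dispatch mechanism is still an open question (see Open Questions below). A minimal sketch of the simplest candidate — SSH plus subprocess — with a hypothetical `kiro run` invocation on the remote side; host, user, and subcommand names are assumptions:

```python
import shlex

def build_dispatch(host: str, task: str, user: str = "ops") -> list[str]:
    """Build the argv the orchestrator would hand to subprocess.run().

    Assumes the SSH + subprocess option and a hypothetical `kiro run <task>`
    entry point on the remote machine; quoting keeps the task a single
    argument on the remote shell.
    """
    return ["ssh", f"{user}@{host}", "kiro", "run", shlex.quote(task)]

# Example: ask the i3 machine's local executor to check on a service.
cmd = build_dispatch("i3.lan", "docker ps --filter name=jellyfin")
```

Keeping the dispatch a pure argv-builder means the same logic works whether the transport ends up being SSH, a Matrix bridge, or something else.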
Messaging Layer
Matrix is the messaging protocol. It provides:
- Self-hosting (on the VPS, not dependent on third-party services)
- Open protocol with multiple client options
- Room-based context separation (equivalent to Discord channels)
- Native OpenClaw integration via the Matrix plugin
- Bridge support for other platforms (Discord already bridged via mautrix-discord, Telegram/WhatsApp possible via mautrix bridges)
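Under the hood, "send a message to a room" is one call in the Matrix client-server API: `PUT /_matrix/client/v3/rooms/{roomId}/send/{eventType}/{txnId}`. A sketch of just the request shape — in practice a client SDK (or the OpenClaw Matrix plugin) handles auth tokens and transaction-ID bookkeeping; the homeserver and room here are placeholders:

```python
from urllib.parse import quote

def build_send_request(homeserver: str, room_id: str, text: str, txn_id: str):
    """Build (method, url, payload) for posting a plain-text message to a
    Matrix room via the client-server API. Room IDs contain '!' and ':',
    so they must be percent-encoded in the path."""
    url = (f"{homeserver}/_matrix/client/v3/rooms/{quote(room_id, safe='')}"
           f"/send/m.room.message/{txn_id}")
    return "PUT", url, {"msgtype": "m.text", "body": text}

method, url, payload = build_send_request(
    "https://matrix.example.org", "!infra:example.org", "backup OK", "txn1")
```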
Context Separation
Different concerns get different Matrix rooms (or equivalent channels). This prevents context pollution — research doesn’t bleed into infrastructure management, bookmarks don’t pollute daily tasks. Each room can potentially use different models or thinking levels for cost optimization (expensive models for deep reasoning, cheap models for routine checks).
State Management
Taxonomy of State
All state in the system falls into one of these categories, each with a defined home:
| State Type | What It Is | Where It Lives | Format |
|---|---|---|---|
| Infrastructure definitions | Compose files, Traefik configs, provisioning playbooks, AGENTS.md per service | ~/apps git repo (on machines) + provisioning repo (external) | YAML, Markdown |
| Service data | Application databases, media libraries, email stores | Docker volumes on each machine | Service-specific |
| Secrets | API keys, passwords, tokens, certificates | Encrypted storage (SOPS in git, Vault, or Bitwarden Secrets) — not plaintext .env files | Encrypted |
| Agent identity | Personality, behavior rules, communication style | OpenClaw workspace: SOUL.md, IDENTITY.md, USER.md, AGENTS.md | Markdown |
| Agent memory | Conversation history, learned preferences, distilled knowledge | OpenClaw workspace: memory/YYYY-MM-DD.md (daily logs), MEMORY.md (long-term), main.sqlite (vector search) | Markdown + SQLite |
| Personal knowledge base | Research, articles, video notes, links, ideas, reference material | OpenClaw workspace: knowledge/ directory with subdirectories by type, plus SQLite with vector embeddings for semantic search | Markdown + SQLite |
| Decision log | Architectural decisions with context and rationale | OpenClaw workspace: decisions/YYYY-MM-DD-title.md (ADR format) | Markdown |
| Operational history | What ran, what failed, health over time | OpenClaw workspace: ops/snapshots/ (daily), ops/cron-log.sqlite (automation history) | Markdown + SQLite |
| Task/project tracking | What’s in progress, planned, blocked | OpenClaw workspace: tasks/ (SQLite or markdown) | SQLite or Markdown |
| Automation state | Cron job definitions, heartbeat state, pipeline configs | OpenClaw workspace: cron system + HEARTBEAT.md | JSON + Markdown |
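Read as a directory layout, the workspace paths in the table above imply something like the following (the nesting and the example filenames are assumptions drawn from the table, not a confirmed structure):

```
workspace/
├── SOUL.md, IDENTITY.md, USER.md, AGENTS.md  # agent identity
├── MEMORY.md                                 # long-term memory
├── memory/2026-03-08.md                      # daily logs
├── main.sqlite                               # vector search index
├── knowledge/                                # KB, subdirectories by type
├── decisions/2026-03-08-example-title.md     # ADRs
├── ops/snapshots/, ops/cron-log.sqlite       # operational history
├── tasks/                                    # task tracking
└── HEARTBEAT.md                              # automation state
```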
OpenClaw Workspace Structure
The OpenClaw workspace is the single source of truth for everything the AI knows, remembers, and manages (excluding infrastructure definitions and service data, which have their own homes).
The Workspace/Container Line
The rule for what lives in the OpenClaw workspace vs. what gets its own container:
- Container: has its own web UI, has its own users, needs to run 24/7 independently of the AI, is a third-party application. Examples: Jellyfin, Synapse, Sonarr, Stalwart, Miniflux, Uptime Kuma.
- Workspace: is a tool the AI uses, a database the AI queries, automation the AI runs, or a quick app the AI built. Examples: CRM (SQLite + skill), knowledge base (markdown + embeddings), daily briefing (cron + prompt), bookmark manager (skill + SQLite), health tracker (markdown + analysis).
Services
Current Services (Remain as Containers)
These are stable, long-running services that stay as Docker containers managed via compose files.
Workload machines (i3 / Pentium / future machines):
- Media: Jellyfin, Audiobookshelf, Calibre Web Automated
- Media management: Sonarr, Radarr, Bazarr, Prowlarr, Chaptarr, Seerr
- Downloads: qBittorrent, Gluetun (VPN), FlareSolverr, qSticky
- Reading: Miniflux, Nextflux, RSSHub, Feed Scraper
- Productivity: Excalidraw (Draw), Chromium
- Tracking: Yamtrack
- Auth: Pocket-ID
- Email: Stalwart
- Monitoring: Uptime Kuma, Homepage
Control plane (VPS):
- Synapse + Element (Matrix)
- OpenClaw gateway
- Cloudflare tunnel (or direct ingress)
- Traefik
Future AI-Built Tools (Live in Workspace)
These are automations and personal tools that OpenClaw builds and runs inside its workspace, not as separate containers:
- Personal CRM (SQLite + natural language queries)
- Knowledge base with semantic search (markdown + vector embeddings)
- Daily briefing system (cron + data aggregation + messaging delivery)
- Bookmark/link manager (replaces paid services like Raindrop)
- Operational monitoring summaries (daily snapshots of infrastructure health)
- Task tracking (local task database managed conversationally)
- Any rapid prototypes, quick web apps, or experimental tools
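The knowledge-base idea above (markdown plus vector embeddings in SQLite) fits in a few dozen lines. This is a minimal sketch: `embed()` is a toy bag-of-characters stand-in for a real embedding model, embeddings are stored as float BLOBs, and cosine similarity is computed in Python rather than via a vector extension; table and file names are illustrative:

```python
import sqlite3, struct, math

def embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: a 26-dim letter-count vector."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - 97] += 1.0
    return vec

def pack(vec):   return struct.pack(f"{len(vec)}f", *vec)
def unpack(blob): return list(struct.unpack(f"{len(blob) // 4}f", blob))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

db = sqlite3.connect(":memory:")  # real use: a file inside the workspace
db.execute("CREATE TABLE notes (path TEXT, body TEXT, embedding BLOB)")
for path, body in [("knowledge/zfs.md", "zfs snapshot and replication notes"),
                   ("knowledge/matrix.md", "matrix rooms and bridges")]:
    db.execute("INSERT INTO notes VALUES (?, ?, ?)",
               (path, body, pack(embed(body))))

def search(query: str, k: int = 1):
    """Rank all notes by cosine similarity to the query embedding."""
    q = embed(query)
    rows = db.execute("SELECT path, embedding FROM notes").fetchall()
    return sorted(rows, key=lambda r: -cosine(q, unpack(r[1])))[:k]
```

Swapping `embed()` for a real model and the linear scan for a vector extension (one of the open questions below) changes nothing about the storage shape.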
Backups
Current System (Remains)
Nightly restic backups — encrypted, deduplicated, incremental — with cross-machine replication and offsite copies to Backblaze B2. Database dumps run before each backup. Retention: 7 daily, 4 weekly, 3 monthly.
Additions for Future State
- OpenClaw workspace must be included in backups. It contains the AI’s entire brain — memory, knowledge base, decisions, task state, automation history. This is the most important new backup target.
- VPS control plane needs its own backup strategy. Synapse database, OpenClaw workspace, and configuration should back up to B2 (or equivalent offsite storage). The VPS itself is rebuildable from provisioning definitions, but the data on it is not.
- Secrets migration — moving from plaintext `.env` files to encrypted storage means the backup strategy for secrets changes. Encrypted secrets in git are self-backing-up. Secrets in Vault or Bitwarden need their own backup consideration.
- Automation run history (`cron-log.sqlite`) and knowledge base (`knowledge.sqlite`) are SQLite databases that should be included in the standard backup sweep. The existing restic backup of `~/apps` would cover this if the workspace is mounted from within that tree.
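The retention policy above (7 daily, 4 weekly, 3 monthly) maps directly onto restic's `forget` flags. A sketch of just the policy portion; repository URL, credentials, and the cron wiring around the command are deliberately left out:

```python
def restic_retention_args(daily: int = 7, weekly: int = 4, monthly: int = 3):
    """Build the `restic forget` invocation matching the retention policy.
    `--prune` removes the data made unreferenced by forgetting snapshots."""
    return ["restic", "forget", "--prune",
            "--keep-daily", str(daily),
            "--keep-weekly", str(weekly),
            "--keep-monthly", str(monthly)]

args = restic_retention_args()
```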
Cost Model
- VPS for control plane: ~$5-10/month for a small instance running Synapse, OpenClaw, Cloudflare tunnel, Traefik
- LLM API tokens for OpenClaw: variable, depends on usage. Cost optimization via model routing — expensive models (Opus) for deep reasoning, cheap models for routine heartbeats, cron jobs, and simple queries
- Kiro CLI: covered by work (Kiro Power plan), no additional cost
- Backblaze B2: existing, minimal cost
- Local model hardware (future): under consideration for reducing API token costs. Would run on homelab hardware or dedicated GPU machine.
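The model-routing idea in the cost model reduces to a lookup with a cheap default. The tier names and model labels here are illustrative assumptions, not actual OpenClaw configuration:

```python
# Hypothetical routing table: expensive models for deep reasoning, cheap
# ones for heartbeats, cron jobs, and simple queries.
ROUTES = {
    "deep-reasoning": "opus",         # architecture, planning, research
    "routine":        "cheap-small",  # heartbeats, cron, simple queries
}

def pick_model(task_kind: str) -> str:
    """Default to the cheap tier so an unrecognised task kind never burns
    expensive tokens by accident."""
    return ROUTES.get(task_kind, ROUTES["routine"])
```

The same table could later grow a "local" tier once local model hosting lands.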
Open Questions
These are acknowledged but not yet decided:
- Exact provisioning tool: Ansible, Talos, NixOS, or something else for machine convergence
- PaaS layer: whether Coolify (or similar) adds enough value for deployment visibility and rapid prototyping to justify the added platform complexity
- Matrix client: Element, custom client, or alternative — depends on UX needs for AI interaction
- Secrets management: SOPS, Vault, Bitwarden Secrets, or another solution
- Local model hosting: hardware requirements, which models, how it integrates with OpenClaw’s model routing
- OpenClaw↔Kiro interface: the exact mechanism by which OpenClaw dispatches work to Kiro CLI on each machine (SSH + subprocess, Matrix bridge, or another approach). The `kiro-bridge` design doc in `kiro-bridge/AGENTS.md` captures the exploration so far.
- Knowledge base search tooling: QMD, SQLite with vector extensions, or another semantic search solution
- Rapid prototyping workflow: whether AI-built quick apps are served directly from the workspace, deployed via a PaaS, or handled some other way