Runtime topology
How a workflow run actually executes end-to-end. Sagewai is a control plane plus a worker fleet: the control plane plans, persists, queries, and dispatches; the worker fleet executes. Every page in this section reduces to claims about this topology — the security-tier model rests on where credentials live in this picture, the execution-mode model rests on which boxes are involved per step, and the sandbox-backend choice swaps out one specific component without changing the rest. This page fixes the components, where they live, what they own, and what they refuse to do.
Three invariants
Three statements that the entire platform is structured to keep true. Each is non-negotiable; everything else is consequence.
1. The worker is the only executor
The control plane (admin server, CLI, autopilot, registry) plans, persists, queries, and dispatches. It never runs a workflow step. Conversely, no scheduling or persistence happens inside a worker — workers consume work, they do not allocate it. This split is what makes Sagewai's operational characteristics tractable: planning is cheap, so the control plane can plan a million runs without saturating execution capacity, and execution is bounded, so the worker count caps concurrency independently from planning throughput.
In practice this means the admin server writes workflow_runs rows and exposes audit/query APIs, while a separate fleet of worker processes pulls those rows, claims them atomically, and runs the agent loop. A worker pool can be scaled, drained, or replaced via Kubernetes without ever touching the admin server. A saturated fleet does not stall planning, and a stalled control plane does not corrupt in-flight execution. If you ever find yourself writing a step executor inside sagewai/admin/, that is the signal you have crossed the line — redirect to the worker.
2. The execution mode is per-step, not per-deployment
A single workflow run can execute step 1 inline on the worker (Mode 0), step 2 in an isolated sandbox with no credentials (Mode 1), step 3 in a sandbox with a customer's identity (Mode 2), and step 4 with a CLI agent like Claude Code in Mode 3. Each step's mode is selected by the workflow author, the autopilot, or runtime escalation logic — independently.
The practical consequence is that you do not pay sandbox cost for a cheap planning step and you do not lose isolation for a sensitive one. Treat mode as a property of the step, not the deployment. A "build a portfolio site, then summarise what you built" workflow is wasteful if the summarise step also runs in Mode 3 with a full CLI agent and customer credentials in scope. The same workflow is unsafe if the build step runs in Mode 0 with no isolation. The correct shape is mixed, and the topology is built to make mixed the cheap default.
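The "mode is a property of the step" rule can be shown with a toy workflow. Step names and the `needs_sandbox` helper below are hypothetical; the point is only that mode is carried per step and isolation cost is paid only where a step needs it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    name: str
    mode: int  # 0..3, chosen per step, not per deployment

# Hypothetical mixed-mode workflow: the build step needs full isolation
# and credentials; planning and summarising touch no customer data.
workflow = [
    Step("plan", mode=0),        # cheap, inline on the worker
    Step("build_site", mode=3),  # CLI agent + identity, sandboxed
    Step("summarise", mode=0),   # no sandbox cost for a cheap step
]

def needs_sandbox(step: Step) -> bool:
    # Modes 1+ acquire a sandbox; Mode 0 runs inline.
    return step.mode >= 1

sandboxed = [s.name for s in workflow if needs_sandbox(s)]
```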
3. The sandbox is the trust boundary (when present)
When a step runs in Mode 1+, the sandbox container is the security boundary. Secret values exist inside it, env-injected from the Sealed Identity backend at sandbox-start time. The worker host knows secret KEY NAMES (persisted as effective_secret_keys on the run row) but never holds plaintext values. The control plane has access to neither plaintext secrets nor the running sandbox.
This is the contract that lets one worker fleet serve many tenants safely. A bug or compromise inside a sandbox is bounded by the sandbox; it cannot reach the worker process, the control plane, or another tenant's sandbox (modulo backend-escape vulnerabilities, which are the backend vendor's problem, not Sagewai's). The per-run secret cleanup hook enforces the back end of this contract: when a sandbox is released to the warm-pool, the cleanup hook scrubs Tier-2 env, and the pool discards the sandbox if cleanup fails. The full credential model is in Security tiers.
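The release-side of the contract can be sketched as follows. Class names and the pool API here are hypothetical; what is grounded in the source is the behavior: on release, the cleanup hook scrubs Tier-2 env, and a sandbox whose cleanup fails is discarded rather than returned to the warm-pool.

```python
class Sandbox:
    """Hypothetical stand-in for a pooled sandbox container."""
    def __init__(self) -> None:
        self.env: dict[str, str] = {}

    def cleanup_run(self) -> bool:
        # Scrub the Tier-2 env injected for the run that just finished.
        self.env.clear()
        return True  # the real hook would report False on a partial scrub

class WarmPool:
    """Hypothetical warm-pool: containers are re-used to amortise
    start-up cost, but only when the per-run cleanup succeeded."""
    def __init__(self) -> None:
        self.idle: list[Sandbox] = []

    def release(self, sb: Sandbox) -> None:
        if sb.cleanup_run() and not sb.env:
            self.idle.append(sb)  # clean: safe to hand to the next run
        # else: discard; a dirty sandbox never re-enters the pool

pool = WarmPool()
sb = Sandbox()
sb.env["ACME_API_KEY"] = "secret-value"  # Tier-2 env during the run
pool.release(sb)
```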
Topology
┌────────────────────────────────────────────────────────────────────┐
│ CONTROL PLANE │
│ (admin server) │
│ │
│ postgres │
│ ├── workflow_runs queue │
│ ├── sealed_revocations │
│ ├── sealed_audit_events │
│ └── projects, agents, tokens, … │
│ │
│ REST API: /api/v1/admin/* (Next.js admin UI consumes this) │
│ CLI: sagewai admin status / sealed / profiles / … │
│ Autopilot: plans + dispatches; never executes itself │
│ │
│ RULE: this plane never runs a workflow step. │
└────────────────────────────────┬───────────────────────────────────┘
│ enqueue(run, mode={0|1|2|3|3b}, …)
│ claim(run) ← fleet pulls
▼
┌────────────────────────────────────────────────────────────────────┐
│ WORKER FLEET │
│ │
│ • The data plane. The only place execution happens. │
│ • At least 1 active worker is required for Sagewai to do work. │
│ • Workers register with the fleet registry, get approved, │
│ advertise capability labels (sandbox.backend, models_supported, │
│ project_id, …), and pull runs whose requirements they match. │
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ WORKER PROCESS (host / k8s pod / VM) │ │
│ │ │ │
│ │ Tier-1 LLM keys live HERE in process env. │ │
│ │ ORCHESTRATION_OPENAI_KEY=… (or local Ollama URL etc.) │ │
│ │ These are the operator's keys for the orchestration brain. │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────┐ │ │
│ │ │ SAGEWAI AGENT (in worker process) │ │ │
│ │ │ │ │ │
│ │ │ 1. Claims run; reads execution mode from row │ │ │
│ │ │ 2. Resolves Security Identity (Sealed cascade) IF │ │ │
│ │ │ mode ≥ 2. Persists effective_*_keys (NAMES) │ │ │
│ │ │ on the run row. Never sees plaintext values. │ │ │
│ │ │ 3. Acquires sandbox from pool IF mode ≥ 1 │ │ │
│ │ │ 4. Runs the agent loop appropriate to the mode: │ │ │
│ │ │ • Mode 0: inline on worker │ │ │
│ │ │ • Mode 1: via tool runner in sandbox │ │ │
│ │ │ • Mode 2: + identity in sandbox env │ │ │
│ │ │ • Mode 3: + CLI agent (Claude Code, Codex, …) │ │ │
│ │ │ • Mode 3b: + bidirectional callback for JIT │ │ │
│ │ │ credentials (Sealed-iv) │ │ │
│ │ │ 5. Persists audit + step state │ │ │
│ │ │ 6. Releases sandbox (cleanup_run scrubs env) │ │ │
│ │ │ │ │ │
│ │ │ LLM calls Sagewai Agent makes for ITSELF (e.g. │ │ │
│ │ │ step planning) use Tier-1 keys from worker env. │ │ │
│ │ │ Sagewai Agent does NOT call user-task LLMs — that │ │ │
│ │ │ happens inside the sandbox using Tier-2 keys. │ │ │
│ │ └─────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ◀──────────── Mode 0 stops here ────────────────────────▶ │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────┐ │ │
│ │ │ SANDBOX (Mode 1+) │ │ │
│ │ │ Backend: Docker | Kubernetes | Lambda | Null │ │ │
│ │ │ See execution-backends.md. │ │ │
│ │ │ │ │ │
│ │ │ Mode 1: empty env, isolation only │ │ │
│ │ │ Mode 2: + identity (Tier-2 keys, behavior knobs) │ │ │
│ │ │ Mode 3: + CLI agent runtime + artifact creds │ │ │
│ │ │ │ │ │
│ │ │ ┌───────────────────────────────────────────────┐ │ │ │
│ │ │ │ TOOL RUNNER (daemon) │ │ │ │
│ │ │ │ • RPC server; accepts dispatches from │ │ │ │
│ │ │ │ Sagewai Agent on host │ │ │ │
│ │ │ │ • Spawns tools / CLI agent subprocesses │ │ │ │
│ │ │ │ • Streams stdout/stderr back │ │ │ │
│ │ │ │ • In Mode 3b only: also serves callback │ │ │ │
│ │ │ │ requests FROM CLI agent TO host │ │ │ │
│ │ │ └───────────────────────────────────────────────┘ │ │ │
│ │ │ │ │ │
│ │ │ ┌───────────────────────────────────────────────┐ │ │ │
│ │ │ │ CLI AGENT(S) — Mode 3+ only │ │ │ │
│ │ │ │ Claude Code, Codex, Gemini, custom │ │ │ │
│ │ │ │ • Read TIER-2 keys from os.environ │ │ │ │
│ │ │ │ • Call LLM inference points directly │ │ │ │
│ │ │ │ • Produce artifacts in /workspace │ │ │ │
│ │ │ └───────────────────────────────────────────────┘ │ │ │
│ │ │ │ │ │
│ │ │ ┌───────────────────────────────────────────────┐ │ │ │
│ │ │ │ /workspace volume │ │ │ │
│ │ │ │ CLI artifacts staged here, then pushed to │ │ │ │
│ │ │ │ artifact destination (Mode 3+: GitHub repo, │ │ │ │
│ │ │ │ S3 bucket, mounted folder) │ │ │ │
│ │ │ └───────────────────────────────────────────────┘ │ │ │
│ │ └─────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────┘
│ network egress
│ (subject to NetworkPolicy)
▼
┌─────────────────────────┐
│ LLM inference point │
│ Anthropic / OpenAI / │
│ Ollama / vLLM / etc. │
│ External, OpenAI- │
│ compatible API. │
└─────────────────────────┘
Component definitions
| Component | Where | Process boundary | Owns | Does NOT do |
|---|---|---|---|---|
| Control plane (admin server) | host process; the operator's "console" | postgres + REST + admin UI | persistence, planning, dispatch, audit query | execute workflow steps |
| Worker fleet | logical: a set of Worker processes | n/a — it's a registry | capacity scheduling, capability advertisement, run dispatch | hold creds itself |
| Worker | host (k8s pod / VM / bare metal) | one OS process | claims runs, resolves Identity, manages sandbox lifecycle, hosts Sagewai Agent | hold Tier-2 plaintext secrets |
| Sagewai Agent | inside the worker process | thread of the worker | reads task + mode, dispatches accordingly; orchestration brain; uses Tier-1 keys for its OWN LLM calls | run user-task LLM calls (those are Tier-2, inside sandbox) |
| Sandbox | Docker container / k8s pod / Lambda invocation | depends on backend; optional (Modes 1+) | isolation: net policy, fs, env, identity | exist in Mode 0 |
| Tool runner | inside sandbox; daemon | OS process inside the sandbox | RPC dispatch, CLI subprocess management, stdout/stderr streaming | hold persistent state across runs (the sandbox warm-pool, which amortises container start-up cost across runs, wipes env between runs) |
| CLI Agent | inside sandbox; subprocess of tool runner | OS subprocess; Mode 3+ only | the actual user-task work — code gen, editing, deployment | persist beyond the run |
| Security Identity | data, env-injected into sandbox | n/a (it's data, not code) | per-customer/per-workflow Tier-2 keys + behaviour knobs | exist outside the sandbox after injection |
| Artifact destination | external (GitHub repo / S3 bucket / mounted folder) | the destination's own runtime | receive CLI agent outputs | be readable by the worker host (creds are sandbox-side) |
| LLM inference point | external | external | model weights | make tool calls (the agent is the orchestrator; the LLM is the model) |
Per-mode data flow
The flow varies by mode, but the run-level lifecycle is constant. A caller enqueues a run; the control plane writes a workflow_runs row with status pending, the security profile reference (Mode 2+), effective key names from the Sealed cascade, sandbox mode, image, and network policy. A worker matching the run's requirements claims the row, the Sagewai Agent dispatches by mode, and on completion the row moves to completed or failed while audit events flow into sealed_audit_events.
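A plausible shape for that row, with hypothetical field and profile names (the real schema is owned by the admin server), makes the key-names-only rule concrete:

```python
# Hypothetical workflow_runs row as written by the control plane.
# Note: effective_secret_keys carries NAMES only, never plaintext values.
run_row = {
    "run_id": "run-7f3a",
    "status": "pending",
    "mode": 2,
    "security_profile": "acme/default",              # Mode 2+ only
    "effective_secret_keys": ["ACME_GITHUB_TOKEN"],  # names, not values
    "sandbox_image": "sagewai/sandbox-base",         # hypothetical image
    "network_policy": "egress-llm-only",
}

def is_dispatchable(row: dict) -> bool:
    # A worker only claims rows that are still pending.
    return row["status"] == "pending"
```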
In Mode 0 (Bare), no sandbox is acquired and no Identity is resolved. The Sagewai Agent runs the task inline on the worker using Tier-1 keys for any of its own LLM calls. Step results write straight to postgres. This is the cheap, fast mode for planning and other steps that touch no customer data.
In Mode 1 (Sandboxed, no Identity), the agent acquires a sandbox from the warm-pool with empty env, dispatches tool calls via the tool runner's RPC, and pipes outputs back to postgres. The sandbox provides isolation only — there are no Tier-2 keys, so the step cannot reach customer systems. This is the right shape for untrusted code or network-isolated computation.
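The tool-runner dispatch at the heart of Mode 1 can be sketched as a subprocess spawn. The function name is hypothetical, and the real runner streams stdout/stderr back over RPC rather than buffering to completion as this sketch does:

```python
import subprocess
import sys

def dispatch_tool(argv: list[str]) -> tuple[int, str]:
    """Sketch of one tool-runner dispatch: spawn the tool as a
    subprocess inside the sandbox and hand its output back."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return proc.returncode, proc.stdout

# Use the Python interpreter itself as a portable stand-in for a tool.
code, out = dispatch_tool(
    [sys.executable, "-c", "print('hello from the sandbox')"]
)
```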
In Mode 2 (Sandboxed plus Identity), the agent re-resolves the Sealed cascade at sandbox-start time so any rotation that happened between enqueue and dispatch is picked up. Tier-2 env is injected when the container starts; tools read os.environ for credentials. The sandbox is now able to call the customer's systems with their keys, and the worker host still never sees plaintext.
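A minimal sketch of the `env_for` contract mentioned in the anti-patterns, assuming a dict-backed store (the constructor and store shape are assumptions; only the method name comes from the source). The important property is where the return value goes: straight into the sandbox backend's container-create call, never into the worker's own environment.

```python
class SecretProvider:
    """Sketch of Sealed Identity resolution at sandbox-start time.
    Resolving at dispatch (not enqueue) picks up key rotations."""
    def __init__(self, store: dict[str, str]) -> None:
        self._store = store  # stand-in for the Sealed backend

    def env_for(self, key_names: list[str]) -> dict[str, str]:
        # Plaintext is materialised here and handed directly to the
        # sandbox container env; the worker persists only the names.
        return {name: self._store[name] for name in key_names}

provider = SecretProvider({"ACME_GITHUB_TOKEN": "ghp_rotated_value"})
sandbox_env = provider.env_for(["ACME_GITHUB_TOKEN"])
```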
In Mode 3 (Full plus CLI agent), the topology is the same as Mode 2, except that the tool runner additionally spawns a CLI agent (Claude Code, Codex, Gemini, or a custom variant) as a subprocess inside the sandbox. The CLI agent reads its LLM key from sandbox env, calls the LLM inference point directly, and writes artifacts to /workspace; the artifact-destination upload (`git push`, `aws s3 sync`, `cp` to a mounted folder) runs at the end with credentials that are also sandbox-side.
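The end-of-run artifact step can be sketched as: collect everything under the workspace, then push. The function name is hypothetical, and the push itself is elided here because its credentials exist only inside the sandbox:

```python
import pathlib
import tempfile

def collect_artifacts(workspace: str) -> list[str]:
    """Sketch: everything under the workspace is the artifact set.
    The real push (git push / aws s3 sync / cp) runs inside the
    sandbox, with sandbox-side credentials."""
    root = pathlib.Path(workspace)
    return sorted(
        str(p.relative_to(root)) for p in root.rglob("*") if p.is_file()
    )

# Temp dir as a stand-in for the sandbox's /workspace volume.
with tempfile.TemporaryDirectory() as ws:
    pathlib.Path(ws, "index.html").write_text("<h1>portfolio</h1>")
    artifacts = collect_artifacts(ws)
```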
In Mode 3b (Full plus JIT credential callback), the topology adds a bidirectional channel back to the host. The CLI agent or tool runner can request a credential it does not have ("I need write access to repo X"); the Sagewai Agent on the host evaluates the request against policy, auto-approves, denies, or surfaces a HITL gate, and approved credentials are env-injected at runtime. See Just-in-time credential callback (Mode 3b).
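The host-side decision described above (approve, deny, or surface a HITL gate) can be sketched as a small policy function. All names here are hypothetical; the real policy evaluation is documented on the Mode 3b page.

```python
from enum import Enum

class Decision(Enum):
    APPROVE = "approve"  # env-inject the credential at runtime
    DENY = "deny"
    HITL = "hitl"        # surface a human-in-the-loop gate

def evaluate_credential_request(
    resource: str, allowed: set[str], needs_review: set[str]
) -> Decision:
    """Hypothetical host-side policy check for a Mode 3b callback,
    e.g. resource = 'repo:acme/site' for 'write access to repo X'."""
    if resource in allowed:
        return Decision.APPROVE
    if resource in needs_review:
        return Decision.HITL
    return Decision.DENY

decision = evaluate_credential_request(
    "repo:acme/site", allowed={"repo:acme/site"}, needs_review=set()
)
```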
The run-level lifecycle (enqueue, claim, agent dispatches by mode, audit + step state persists, sandbox returns to pool, status flips to terminal) is identical across all five modes — only the body of step 4 changes. For the per-mode walk-through, see Execution modes.
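The mode-independent lifecycle amounts to a small state machine; only the work done between "claimed" and a terminal status differs by mode. The transition table below is a sketch of the statuses named in this section, not the full production schema:

```python
# Status transitions named in this section; the same table applies to
# every mode (0 through 3b).
TRANSITIONS = {
    "pending": {"claimed"},
    "claimed": {"completed", "failed"},
}

def advance(status: str, new_status: str) -> str:
    if new_status not in TRANSITIONS.get(status, set()):
        raise ValueError(f"illegal transition {status} -> {new_status}")
    return new_status

final = advance(advance("pending", "claimed"), "completed")
```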
Anti-patterns
These are the violations to call out in code review or design review. They map directly onto the invariants above; each one breaks at least one.
- Tool execution on the LLM inference point. The inference point is just a model. It cannot execute tools. Anything that talks to the LLM and calls itself a "tool runner" or "function executor" is wrong by name — the model returns tool-call requests; the tool runner inside the sandbox actually executes them.
- Workflow step execution on the control plane. The admin server, autopilot, and CLI never run workflow steps. They write rows; workers read them. If you find yourself writing a step executor in `sagewai/admin/`, redirect to `sagewai/core/worker.py`.
- Secrets on the worker host (when a sandbox is in scope). In Mode 1+, Tier-2 secrets must never touch the worker process env. They flow from the Sealed Identity backend into the sandbox container env via the SecretProvider's `env_for`. The worker only sees key names on `workflow_runs.effective_secret_keys`.
- Single mode for an entire workflow when steps differ in cost or risk. A "build a portfolio site, then summarise what you built" workflow is wasteful if it runs the summarise step in Mode 3. Steps have independent modes; pick the cheapest mode that still satisfies the isolation each step actually needs.
- Skipping the worker entirely. There is no `Workflow.run_inline()` API on the control plane. Even quick tasks go through the queue plus a worker (which may execute Mode 0, but it is still a worker). Skipping the queue silently breaks audit, replay, and capacity accounting.
- Long-running state inside the tool runner. The tool runner is per-run, or pooled with reset. Nothing persists across runs except artifacts written to the destination. State that must outlive a run lives in postgres, not in tool-runner memory or on the sandbox filesystem.
Why this topology
In short: decoupling planning from execution gives Sagewai its operational characteristics.
- Planning is cheap. The control plane can plan a million workflow runs without saturating execution capacity. Autopilot, batch enqueues, and scheduled jobs all flow through the queue.
- Execution is bounded. Worker count caps concurrency, capability labels route work to capable executors, and a worker pool can scale via Kubernetes without touching the admin server.
- Security is bounded. The sandbox is where blast radius lives — secrets, tool execution, and CLI agents all run inside it, so a bug or compromise inside the sandbox cannot reach the worker host or the control plane.
- Observability is uniform. Every step emits the same audit event shape regardless of mode, and logs and metrics flow through one OTel pipeline regardless of backend.
- Replay is decidable. Step inputs, mode, and Identity at enqueue time are persisted, so replay reproduces what was, not what is now — the replay-safety contract.
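The replay point can be made concrete with a sketch. The function and field names are hypothetical; what is grounded in the source is that replay consumes the persisted enqueue-time snapshot (mode, key names, image) rather than live configuration:

```python
def replay_inputs(run_row: dict) -> dict:
    """Replay-safety sketch: replay reads only what was persisted at
    enqueue time, so it reproduces 'what was', not 'what is now'."""
    return {
        "mode": run_row["mode"],
        "secret_key_names": run_row["effective_secret_keys"],  # names only
        "image": run_row["sandbox_image"],
    }

# Persisted snapshot from the original run (hypothetical values).
snapshot = {
    "mode": 2,
    "effective_secret_keys": ["ACME_GITHUB_TOKEN"],
    "sandbox_image": "sagewai/sandbox-base",
}
inputs = replay_inputs(snapshot)
```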
What this topology does NOT specify
This document fixes the runtime structure. It does not fix:
- Which LLM Tier-1 uses. The operator picks. Local Ollama for cheap planning is common.
- Which sandbox backend. Each deployment picks one (or runs a heterogeneous fleet with capability-routed dispatch). See Sandbox backends.
- Which Identity backend. The builtin file-based store is the default; Vault, 1Password, AWS Secrets Manager, SOPS, and Bitwarden are pluggable.
- Workflow definition syntax. Workflows are Python-defined today (`DurableWorkflow` plus steps). YAML-defined workflows are a possible future API; the topology is unchanged.
- What CLI agents are available. The image variant catalog (`sagewai/sandbox-claude-code`, `sagewai/sandbox-codex`, …) is operator-curated. New CLIs are added by extending the catalog, not the topology.
Cross-references
- Security tiers
- Execution modes
- Sandbox backends
- `docs/architecture/runtime-topology.md` (canonical reference)