Runtime topology
How a workflow run actually executes end-to-end. Sagewai is a control plane plus a worker fleet: the control plane plans, persists, queries, and dispatches; the worker fleet executes. Every page in this section reduces to claims about this topology — the security-tier model rests on where credentials live in this picture, the execution-mode model rests on which boxes are involved per step, and the sandbox-backend choice swaps out one specific component without changing the rest. This page fixes the components, where they live, what they own, and what they refuse to do.
Three invariants
Three statements that the entire platform is structured to keep true. Each is non-negotiable; everything else is consequence.
1. The worker is the only executor
The control plane (admin server, CLI, autopilot, registry) plans, persists, queries, and dispatches. It never runs a workflow step. Conversely, no scheduling or persistence happens inside a worker — workers consume work, they do not allocate it. This split is what makes Sagewai's operational characteristics tractable: planning is cheap, so the control plane can plan a million runs without saturating execution capacity, and execution is bounded, so the worker count caps concurrency independently from planning throughput.
In practice this means the admin server writes workflow_runs rows and exposes audit/query APIs, while a separate fleet of worker processes pulls those rows, claims them atomically, and runs the agent loop. A worker pool can be scaled, drained, or replaced via Kubernetes without ever touching the admin server. A saturated fleet does not stall planning, and a stalled control plane does not corrupt in-flight execution. If you ever find yourself writing a step executor inside sagewai/admin/, that is the signal you have crossed the line — redirect to the worker.
2. The execution mode is per-step, not per-deployment
A single workflow run can execute step 1 inline on the worker (Mode 0), step 2 in an isolated sandbox with no credentials (Mode 1), step 3 in a sandbox with a customer's identity (Mode 2), and step 4 with a CLI agent like Claude Code in Mode 3. Each step's mode is selected by the workflow author, the autopilot, or runtime escalation logic — independently.
The practical consequence is that you do not pay sandbox cost for a cheap planning step and you do not lose isolation for a sensitive one. Treat mode as a property of the step, not the deployment. A "build a portfolio site, then summarise what you built" workflow is wasteful if the summarise step also runs in Mode 3 with a full CLI agent and customer credentials in scope. The same workflow is unsafe if the build step runs in Mode 0 with no isolation. The correct shape is mixed, and the topology is built to make mixed the cheap default.
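The "mode is a property of the step" rule can be shown with a toy workflow. Step names and the `needs_sandbox` helper below are hypothetical; the point is only that mode is carried per step and isolation cost is paid only where a step needs it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    name: str
    mode: int  # 0..3, chosen per step, not per deployment

# Hypothetical mixed-mode workflow: the build step needs full isolation
# and credentials; planning and summarising touch no customer data.
workflow = [
    Step("plan", mode=0),        # cheap, inline on the worker
    Step("build_site", mode=3),  # CLI agent + identity, sandboxed
    Step("summarise", mode=0),   # no sandbox cost for a cheap step
]

def needs_sandbox(step: Step) -> bool:
    # Modes 1+ acquire a sandbox; Mode 0 runs inline.
    return step.mode >= 1

sandboxed = [s.name for s in workflow if needs_sandbox(s)]
```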
3. The sandbox is the trust boundary (when present)
When a step runs in Mode 1+, the sandbox container is the security boundary. Secret values exist inside it, env-injected from the Sealed Identity backend at sandbox-start time. The worker host knows secret KEY NAMES (persisted as effective_secret_keys on the run row) but never holds plaintext values. The control plane has access to neither plaintext secrets nor the running sandbox.
This is the contract that lets one worker fleet serve many tenants safely. A bug or compromise inside a sandbox is bounded by the sandbox; it cannot reach the worker process, the control plane, or another tenant's sandbox (modulo backend-escape vulnerabilities, which are the backend vendor's problem, not Sagewai's). The per-run secret cleanup hook enforces the back end of this contract: when a sandbox is released to the warm-pool, the cleanup hook scrubs Tier-2 env, and the pool discards the sandbox if cleanup fails. The full credential model is in Security tiers.
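The release-side of the contract can be sketched as follows. Class names and the pool API here are hypothetical; what is grounded in the source is the behavior: on release, the cleanup hook scrubs Tier-2 env, and a sandbox whose cleanup fails is discarded rather than returned to the warm-pool.

```python
class Sandbox:
    """Hypothetical stand-in for a pooled sandbox container."""
    def __init__(self) -> None:
        self.env: dict[str, str] = {}

    def cleanup_run(self) -> bool:
        # Scrub the Tier-2 env injected for the run that just finished.
        self.env.clear()
        return True  # the real hook would report False on a partial scrub

class WarmPool:
    """Hypothetical warm-pool: containers are re-used to amortise
    start-up cost, but only when the per-run cleanup succeeded."""
    def __init__(self) -> None:
        self.idle: list[Sandbox] = []

    def release(self, sb: Sandbox) -> None:
        if sb.cleanup_run() and not sb.env:
            self.idle.append(sb)  # clean: safe to hand to the next run
        # else: discard; a dirty sandbox never re-enters the pool

pool = WarmPool()
sb = Sandbox()
sb.env["ACME_API_KEY"] = "secret-value"  # Tier-2 env during the run
pool.release(sb)
```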
Topology
┌────────────────────────────────────────────────────────────────────┐
│ CONTROL PLANE │
│ (admin server) │
│ │
│ postgres │
│ ├── workflow_runs queue │
│ ├── sealed_revocations │
│ ├── sealed_audit_events │
│ └── projects, agents, tokens, … │
│ │
│ REST API: /api/v1/admin/* (Next.js admin UI consumes this) │
│ CLI: sagewai admin status / sealed / profiles / … │
│ Autopilot: plans + dispatches; never executes itself │
│ │
│ RULE: this plane never runs a workflow step. │
└────────────────────────────────┬───────────────────────────────────┘
│ enqueue(run, mode={0|1|2|3|3b}, …)
│ claim(run) ← fleet pulls
▼
┌────────────────────────────────────────────────────────────────────┐
│ WORKER FLEET │
│ │
│ • The data plane. The only place execution happens. │
│ • At least 1 active worker is required for Sagewai to do work. │
│ • Workers register with the fleet registry, get approved, │
│ advertise capability labels (sandbox.backend, models_supported, │
│ project_id, …), and pull runs whose requirements they match. │
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ WORKER PROCESS (host / k8s pod / VM) │ │
│ │ │ │
│ │ Tier-1 LLM keys live HERE in process env. │ │
│ │ ORCHESTRATION_OPENAI_KEY=… (or local Ollama URL etc.) │ │
│ │ These are the operator's keys for the orchestration brain. │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────┐ │ │
│ │ │ SAGEWAI AGENT (in worker process) │ │ │
│ │ │ │ │ │
│ │ │ 1. Claims run; reads execution mode from row │ │ │
│ │ │ 2. Resolves Security Identity (Sealed cascade) IF │ │ │
│ │ │ mode ≥ 2. Persists effective_*_keys (NAMES) │ │ │
│ │ │ on the run row. Never sees plaintext values. │ │ │
│ │ │ 3. Acquires sandbox from pool IF mode ≥ 1 │ │ │
│ │ │ 4. Runs the agent loop appropriate to the mode: │ │ │
│ │ │ • Mode 0: inline on worker │ │ │
│ │ │ • Mode 1: via tool runner in sandbox │ │ │
│ │ │ • Mode 2: + identity in sandbox env │ │ │
│ │ │ • Mode 3: + CLI agent (Claude Code, Codex, …) │ │ │
│ │ │ • Mode 3b: + bidirectional callback for JIT │ │ │
│ │ │ credentials (Sealed-iv) │ │ │
│ │ │ 5. Persists audit + step state │ │ │
│ │ │ 6. Releases sandbox (cleanup_run scrubs env) │ │ │
│ │ │ │ │ │
│ │ │ LLM calls Sagewai Agent makes for ITSELF (e.g. │ │ │
│ │ │ step planning) use Tier-1 keys from worker env. │ │ │
│ │ │ Sagewai Agent does NOT call user-task LLMs — that │ │ │
│ │ │ happens inside the sandbox using Tier-2 keys. │ │ │
│ │ └─────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ◀──────────── Mode 0 stops here ────────────────────────▶ │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────┐ │ │
│ │ │ SANDBOX (Mode 1+) │ │ │
│ │ │ Backend: Docker | Kubernetes | Lambda | Null │ │ │
│ │ │ See execution-backends.md. │ │ │
│ │ │ │ │ │
│ │ │ Mode 1: empty env, isolation only │ │ │
│ │ │ Mode 2: + identity (Tier-2 keys, behavior knobs) │ │ │
│ │ │ Mode 3: + CLI agent runtime + artifact creds │ │ │
│ │ │ │ │ │
│ │ │ ┌───────────────────────────────────────────────┐ │ │ │
│ │ │ │ TOOL RUNNER (daemon) │ │ │ │
│ │ │ │ • RPC server; accepts dispatches from │ │ │ │
│ │ │ │ Sagewai Agent on host │ │ │ │
│ │ │ │ • Spawns tools / CLI agent subprocesses │ │ │ │
│ │ │ │ • Streams stdout/stderr back │ │ │ │
│ │ │ │ • In Mode 3b only: also serves callback │ │ │ │
│ │ │ │ requests FROM CLI agent TO host │ │ │ │
│ │ │ └───────────────────────────────────────────────┘ │ │ │
│ │ │ │ │ │
│ │ │ ┌───────────────────────────────────────────────┐ │ │ │
│ │ │ │ CLI AGENT(S) — Mode 3+ only │ │ │ │
│ │ │ │ Claude Code, Codex, Gemini, custom │ │ │ │
│ │ │ │ • Read TIER-2 keys from os.environ │ │ │ │
│ │ │ │ • Call LLM inference points directly │ │ │ │
│ │ │ │ • Produce artifacts in /workspace │ │ │ │
│ │ │ └───────────────────────────────────────────────┘ │ │ │
│ │ │ │ │ │
│ │ │ ┌───────────────────────────────────────────────┐ │ │ │
│ │ │ │ /workspace volume │ │ │ │
│ │ │ │ CLI artifacts staged here, then pushed to │ │ │ │
│ │ │ │ artifact destination (Mode 3+: GitHub repo, │ │ │ │
│ │ │ │ S3 bucket, mounted folder) │ │ │ │
│ │ │ └───────────────────────────────────────────────┘ │ │ │
│ │ └─────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────┘
│ network egress
│ (subject to NetworkPolicy)
▼
┌─────────────────────────┐
│ LLM inference point │
│ Anthropic / OpenAI / │
│ Ollama / vLLM / etc. │
│ External, OpenAI- │
│ compatible API. │
└─────────────────────────┘
Component definitions
| Component | Where | Process boundary | Owns | Does NOT do |
|---|---|---|---|---|
| Control plane (admin server) | host process; the operator's "console" | postgres + REST + admin UI | persistence, planning, dispatch, audit query | execute workflow steps |
| Worker fleet | logical: a set of Worker processes | n/a — it's a registry | capacity scheduling, capability advertisement, run dispatch | hold creds itself |
| Worker | host (k8s pod / VM / bare metal) | one OS process | claims runs, resolves Identity, manages sandbox lifecycle, hosts Sagewai Agent | hold Tier-2 plaintext secrets |
| Sagewai Agent | inside the worker process | thread of the worker | reads task + mode, dispatches accordingly; orchestration brain; uses Tier-1 keys for its OWN LLM calls | run user-task LLM calls (those are Tier-2, inside sandbox) |
| Sandbox | Docker container / k8s pod / Lambda invocation | depends on backend; optional (Modes 1+) | isolation: net policy, fs, env, identity | exist in Mode 0 |
| Tool runner | inside sandbox; daemon | OS process inside the sandbox | RPC dispatch, CLI subprocess management, stdout/stderr streaming | hold persistent state across runs (the sandbox warm-pool, which amortises container start-up cost across runs, wipes env between runs) |
| CLI Agent | inside sandbox; subprocess of tool runner | OS subprocess; Mode 3+ only | the actual user-task work — code gen, editing, deployment | persist beyond the run |
| Security Identity | data, env-injected into sandbox | n/a (it's data, not code) | per-customer/per-workflow Tier-2 keys + behaviour knobs | exist outside the sandbox after injection |
| Artifact destination | external (GitHub repo / S3 bucket / mounted folder) | the destination's own runtime | receive CLI agent outputs | be readable by the worker host (creds are sandbox-side) |
| LLM inference point | external | external | model weights | make tool calls (the agent is the orchestrator; the LLM is the model) |
Per-mode data flow
The flow varies by mode, but the run-level lifecycle is constant. A caller enqueues a run; the control plane writes a workflow_runs row with status pending, the security profile reference (Mode 2+), effective key names from the Sealed cascade, sandbox mode, image, and network policy. A worker matching the run's requirements claims the row, the Sagewai Agent dispatches by mode, and on completion the row moves to completed or failed while audit events flow into sealed_audit_events.
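A plausible shape for that row, with hypothetical field and profile names (the real schema is owned by the admin server), makes the key-names-only rule concrete:

```python
# Hypothetical workflow_runs row as written by the control plane.
# Note: effective_secret_keys carries NAMES only, never plaintext values.
run_row = {
    "run_id": "run-7f3a",
    "status": "pending",
    "mode": 2,
    "security_profile": "acme/default",              # Mode 2+ only
    "effective_secret_keys": ["ACME_GITHUB_TOKEN"],  # names, not values
    "sandbox_image": "sagewai/sandbox-base",         # hypothetical image
    "network_policy": "egress-llm-only",
}

def is_dispatchable(row: dict) -> bool:
    # A worker only claims rows that are still pending.
    return row["status"] == "pending"
```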
In Mode 0 (Bare), no sandbox is acquired and no Identity is resolved. The Sagewai Agent runs the task inline on the worker using Tier-1 keys for any of its own LLM calls. Step results write straight to postgres. This is the cheap, fast mode for planning and other steps that touch no customer data.
In Mode 1 (Sandboxed, no Identity), the agent acquires a sandbox from the warm-pool with empty env, dispatches tool calls via the tool runner's RPC, and pipes outputs back to postgres. The sandbox provides isolation only — there are no Tier-2 keys, so the step cannot reach customer systems. This is the right shape for untrusted code or network-isolated computation.
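The tool-runner dispatch at the heart of Mode 1 can be sketched as a subprocess spawn. The function name is hypothetical, and the real runner streams stdout/stderr back over RPC rather than buffering to completion as this sketch does:

```python
import subprocess
import sys

def dispatch_tool(argv: list[str]) -> tuple[int, str]:
    """Sketch of one tool-runner dispatch: spawn the tool as a
    subprocess inside the sandbox and hand its output back."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return proc.returncode, proc.stdout

# Use the Python interpreter itself as a portable stand-in for a tool.
code, out = dispatch_tool(
    [sys.executable, "-c", "print('hello from the sandbox')"]
)
```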
In Mode 2 (Sandboxed plus Identity), the agent re-resolves the Sealed cascade at sandbox-start time so any rotation that happened between enqueue and dispatch is picked up. Tier-2 env is injected when the container starts; tools read os.environ for credentials. The sandbox is now able to call the customer's systems with their keys, and the worker host still never sees plaintext.
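A minimal sketch of the `env_for` contract mentioned in the anti-patterns, assuming a dict-backed store (the constructor and store shape are assumptions; only the method name comes from the source). The important property is where the return value goes: straight into the sandbox backend's container-create call, never into the worker's own environment.

```python
class SecretProvider:
    """Sketch of Sealed Identity resolution at sandbox-start time.
    Resolving at dispatch (not enqueue) picks up key rotations."""
    def __init__(self, store: dict[str, str]) -> None:
        self._store = store  # stand-in for the Sealed backend

    def env_for(self, key_names: list[str]) -> dict[str, str]:
        # Plaintext is materialised here and handed directly to the
        # sandbox container env; the worker persists only the names.
        return {name: self._store[name] for name in key_names}

provider = SecretProvider({"ACME_GITHUB_TOKEN": "ghp_rotated_value"})
sandbox_env = provider.env_for(["ACME_GITHUB_TOKEN"])
```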
In Mode 3 (Full plus CLI agent), the topology is the same as Mode 2, except that the tool runner additionally spawns a CLI agent (Claude Code, Codex, Gemini, or a custom variant) as a subprocess inside the sandbox. The CLI agent reads its LLM key from sandbox env, calls the LLM inference point directly, and writes artifacts to /workspace; the artifact-destination upload (`git push`, `aws s3 sync`, `cp` to a mounted folder) runs at the end with credentials that are also sandbox-side.
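The end-of-run artifact step can be sketched as: collect everything under the workspace, then push. The function name is hypothetical, and the push itself is elided here because its credentials exist only inside the sandbox:

```python
import pathlib
import tempfile

def collect_artifacts(workspace: str) -> list[str]:
    """Sketch: everything under the workspace is the artifact set.
    The real push (git push / aws s3 sync / cp) runs inside the
    sandbox, with sandbox-side credentials."""
    root = pathlib.Path(workspace)
    return sorted(
        str(p.relative_to(root)) for p in root.rglob("*") if p.is_file()
    )

# Temp dir as a stand-in for the sandbox's /workspace volume.
with tempfile.TemporaryDirectory() as ws:
    pathlib.Path(ws, "index.html").write_text("<h1>portfolio</h1>")
    artifacts = collect_artifacts(ws)
```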
In Mode 3b (Full plus JIT credential callback), the topology adds a bidirectional channel back to the host. The CLI agent or tool runner can request a credential it does not have ("I need write access to repo X"); the Sagewai Agent on the host evaluates the request against policy, auto-approves, denies, or surfaces a HITL gate, and approved credentials are env-injected at runtime. See Just-in-time credential callback (Mode 3b).
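The host-side decision described above (approve, deny, or surface a HITL gate) can be sketched as a small policy function. All names here are hypothetical; the real policy evaluation is documented on the Mode 3b page.

```python
from enum import Enum

class Decision(Enum):
    APPROVE = "approve"  # env-inject the credential at runtime
    DENY = "deny"
    HITL = "hitl"        # surface a human-in-the-loop gate

def evaluate_credential_request(
    resource: str, allowed: set[str], needs_review: set[str]
) -> Decision:
    """Hypothetical host-side policy check for a Mode 3b callback,
    e.g. resource = 'repo:acme/site' for 'write access to repo X'."""
    if resource in allowed:
        return Decision.APPROVE
    if resource in needs_review:
        return Decision.HITL
    return Decision.DENY

decision = evaluate_credential_request(
    "repo:acme/site", allowed={"repo:acme/site"}, needs_review=set()
)
```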
The run-level lifecycle (enqueue, claim, agent dispatches by mode, audit + step state persists, sandbox returns to pool, status flips to terminal) is identical across all five modes — only the body of step 4 changes. For the per-mode walk-through, see Execution modes.
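The mode-independent lifecycle amounts to a small state machine; only the work done between "claimed" and a terminal status differs by mode. The transition table below is a sketch of the statuses named in this section, not the full production schema:

```python
# Status transitions named in this section; the same table applies to
# every mode (0 through 3b).
TRANSITIONS = {
    "pending": {"claimed"},
    "claimed": {"completed", "failed"},
}

def advance(status: str, new_status: str) -> str:
    if new_status not in TRANSITIONS.get(status, set()):
        raise ValueError(f"illegal transition {status} -> {new_status}")
    return new_status

final = advance(advance("pending", "claimed"), "completed")
```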
Anti-patterns
These are the violations to call out in code review or design review. They map directly onto the invariants above; each one breaks at least one.
- Tool execution on the LLM inference point. The inference point is just a model. It cannot execute tools. Anything that talks to the LLM and calls itself a "tool runner" or "function executor" is wrong by name — the model returns tool-call requests; the tool runner inside the sandbox actually executes them.
- Workflow step execution on the control plane. The admin server, autopilot, and CLI never run workflow steps. They write rows; workers read them. If you find yourself writing a step executor in `sagewai/admin/`, redirect to `sagewai/core/worker.py`.
- Secrets on the worker host (when a sandbox is in scope). In Mode 1+, Tier-2 secrets must never touch the worker process env. They flow from the Sealed Identity backend into the sandbox container env via the SecretProvider's `env_for`. The worker only sees key names on `workflow_runs.effective_secret_keys`.
- Single mode for an entire workflow when steps differ in cost or risk. A "build a portfolio site, then summarise what you built" workflow is wasteful if it runs the summarise step in Mode 3. Steps have independent modes; pick the cheapest mode that still satisfies the isolation each step actually needs.
- Skipping the worker entirely. There is no `Workflow.run_inline()` API on the control plane. Even quick tasks go through the queue plus a worker (which may execute Mode 0, but it is still a worker). Skipping the queue silently breaks audit, replay, and capacity accounting.
- Long-running state inside the tool runner. The tool runner is per-run, or pooled with reset. Nothing persists across runs except artifacts written to the destination. State that must outlive a run lives in postgres, not in tool-runner memory or on the sandbox filesystem.
Why this topology
In short: decoupling planning from execution gives Sagewai its operational characteristics.
- Planning is cheap. The control plane can plan a million workflow runs without saturating execution capacity. Autopilot, batch enqueues, and scheduled jobs all flow through the queue.
- Execution is bounded. Worker count caps concurrency, capability labels route work to capable executors, and a worker pool can scale via Kubernetes without touching the admin server.
- Security is bounded. The sandbox is where blast radius lives — secrets, tool execution, and CLI agents all run inside it, so a bug or compromise inside the sandbox cannot reach the worker host or the control plane.
- Observability is uniform. Every step emits the same audit event shape regardless of mode, and logs and metrics flow through one OTel pipeline regardless of backend.
- Replay is decidable. Step inputs, mode, and Identity at enqueue time are persisted, so replay reproduces what was, not what is now — the replay-safety contract.
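The replay point can be made concrete with a sketch. The function and field names are hypothetical; what is grounded in the source is that replay consumes the persisted enqueue-time snapshot (mode, key names, image) rather than live configuration:

```python
def replay_inputs(run_row: dict) -> dict:
    """Replay-safety sketch: replay reads only what was persisted at
    enqueue time, so it reproduces 'what was', not 'what is now'."""
    return {
        "mode": run_row["mode"],
        "secret_key_names": run_row["effective_secret_keys"],  # names only
        "image": run_row["sandbox_image"],
    }

# Persisted snapshot from the original run (hypothetical values).
snapshot = {
    "mode": 2,
    "effective_secret_keys": ["ACME_GITHUB_TOKEN"],
    "sandbox_image": "sagewai/sandbox-base",
}
inputs = replay_inputs(snapshot)
```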
What this topology does NOT specify
This document fixes the runtime structure. It does not fix:
- Which LLM Tier-1 uses. The operator picks. Local Ollama for cheap planning is common.
- Which sandbox backend. Each deployment picks one (or runs a heterogeneous fleet with capability-routed dispatch). See Sandbox backends.
- Which Identity backend. The builtin file-based store is the default; Vault, 1Password, AWS Secrets Manager, SOPS, and Bitwarden are pluggable.
- Workflow definition syntax. Workflows are Python-defined today (`DurableWorkflow` plus steps). YAML-defined workflows are a possible future API; the topology is unchanged.
- What CLI agents are available. The image variant catalog (`sagewai/sandbox-claude-code`, `sagewai/sandbox-codex`, …) is operator-curated. New CLIs are added by extending the catalog, not the topology.
Cross-references
- Security tiers
- Execution modes
- Sandbox backends
- `docs/architecture/runtime-topology.md` (canonical reference)