Fleet Deployment

Run Sagewai workflow workers on hardware you control — your servers, your GPUs, your private network — while the Sagewai gateway handles task routing and monitoring.

Tier: Premium and Enterprise plans. Free tier is limited to one local worker.

For a conceptual overview of how the fleet system works, see Fleet. For dispatch internals, isolation model, and security layers, see Fleet Architecture.


How enrollment works

1. Admin generates an enrollment key
2. Worker binary reads the key on startup
3. Worker registers itself via the fleet API
4. Admin approves the worker (admin panel or CLI)
5. Worker is now eligible to receive tasks
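
The five steps above amount to a small state machine: a worker registers as pending and becomes eligible only after approval. The following is a minimal sketch of that lifecycle, not Sagewai's implementation; `FleetRegistry`, `WorkerRecord`, and `WorkerStatus` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class WorkerStatus(Enum):
    PENDING = "pending"    # registered, waiting for an admin
    APPROVED = "approved"  # eligible to receive tasks


@dataclass
class WorkerRecord:
    worker_id: str
    status: WorkerStatus = WorkerStatus.PENDING


class FleetRegistry:
    """Hypothetical in-memory stand-in for the fleet API's worker table."""

    def __init__(self) -> None:
        self._workers: dict[str, WorkerRecord] = {}

    def register(self, worker_id: str) -> WorkerRecord:
        # Step 3: a newly registered worker always starts as PENDING.
        record = WorkerRecord(worker_id)
        self._workers[worker_id] = record
        return record

    def approve(self, worker_id: str) -> None:
        # Step 4: an admin flips the status.
        self._workers[worker_id].status = WorkerStatus.APPROVED

    def can_receive_tasks(self, worker_id: str) -> bool:
        # Step 5: only approved workers are eligible.
        record = self._workers.get(worker_id)
        return record is not None and record.status is WorkerStatus.APPROVED
```

The point of the two-phase design is that holding a valid enrollment key is not enough on its own: a human approval sits between registration and the first task.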

Step 1: Generate an enrollment key

From the admin panel at Fleet → Enrollment Keys → New Key, or via the CLI:

sagewai fleet create-key \
  --org acme \
  --name gpu-onboarding \
  --pool gpu-workers \
  --label env=production \
  --label gpu=true \
  --max-uses 10 \
  --expires-in 30d

The command prints the key once, so store it securely. Only its SHA-256 hash is saved server-side, which means a lost key cannot be recovered and has to be replaced with a new one.
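
To illustrate why the key can only be shown once, here is a minimal sketch of hash-only key storage with a use limit and an expiry, mirroring the `--max-uses` and `--expires-in` options. It is an assumption-laden illustration: the real key format (`wrt-1.…`) and server-side checks may differ, and `StoredKey`, `create_key`, and `redeem` are hypothetical names.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class StoredKey:
    key_hash: str        # hex SHA-256 digest; the plaintext is never stored
    max_uses: int
    expires_at: datetime
    uses: int = 0


def create_key(max_uses: int, expires_in: timedelta) -> tuple[str, StoredKey]:
    """Return (plaintext shown once, record persisted server-side)."""
    plaintext = secrets.token_urlsafe(32)  # random placeholder, not the real format
    digest = hashlib.sha256(plaintext.encode()).hexdigest()
    expires_at = datetime.now(timezone.utc) + expires_in
    return plaintext, StoredKey(digest, max_uses, expires_at)


def redeem(presented: str, stored: StoredKey) -> bool:
    """Accept the key if it matches, is unexpired, and has uses left."""
    digest = hashlib.sha256(presented.encode()).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(digest, stored.key_hash):
        return False
    if datetime.now(timezone.utc) >= stored.expires_at:
        return False
    if stored.uses >= stored.max_uses:
        return False
    stored.uses += 1
    return True
```

Because the server keeps only the digest, it can verify a presented key but cannot display it again.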

Step 2: Start a worker

import asyncio

from sagewai import WorkflowWorker


async def main() -> None:
    worker = WorkflowWorker(
        project_id="my-project",
        pool="gpu-workers",
        labels={"env": "production", "gpu": "true"},
        models=["llama3:70b", "mistral:7b"],   # models available on this machine
        gateway_url="https://gateway.sagewai.ai",
        enrollment_key="wrt-1.eyJ...",          # from step 1
    )
    # start() is awaited, so it has to run inside an event loop
    await worker.start()


asyncio.run(main())

The worker registers itself, starts a heartbeat loop (every 30 seconds by default, configurable through HEARTBEAT_INTERVAL), and begins polling for tasks.
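
For readers who want a mental model of the heartbeat loop, the following is a rough sketch using plain asyncio. It is not the worker's actual code: `heartbeat_loop` and its `send` callback are hypothetical, and the real worker may handle errors and backoff differently.

```python
import asyncio
from typing import Awaitable, Callable


async def heartbeat_loop(
    send: Callable[[], Awaitable[None]],
    stop: asyncio.Event,
    interval: float = 30.0,
) -> int:
    """Call `send` every `interval` seconds until `stop` is set.

    Returns the number of heartbeats sent.
    """
    sent = 0
    while not stop.is_set():
        await send()
        sent += 1
        try:
            # Sleep for one interval, but wake early if asked to stop.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass  # interval elapsed; send the next heartbeat
    return sent
```

Waiting on the stop event instead of calling `asyncio.sleep` lets the worker shut down promptly rather than waiting out a full interval.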

Step 3: Approve the worker

New workers appear in the admin panel under Fleet → Workers with status PENDING. Click Approve, or use the CLI:

sagewai fleet list-workers --status pending
sagewai fleet approve-worker <worker-id>

Once approved, the worker can claim matching tasks immediately.


Deployment options

Run a single worker as a container with Docker:

docker run --rm \
  -e ENROLLMENT_KEY="wrt-1.eyJ..." \
  -e FLEET_GATEWAY_URL="https://gateway.sagewai.ai" \
  -e WORKER_POOL="default" \
  -e WORKER_MODELS="gpt-4o,claude-sonnet-4" \
  sagewai/worker:latest

Scale workers with Compose (this assumes a Compose file that defines a service named worker):

docker compose up -d --scale worker=5
# or with Podman
podman-compose up -d --scale worker=5

Worker environment variables

| Variable | Required | Default | Purpose |
|---|---|---|---|
| FLEET_GATEWAY_URL | Yes | | Server endpoint URL |
| ENROLLMENT_KEY | Yes* | | Auto-registration key (*or WRT_TOKEN) |
| WRT_TOKEN | Yes* | | Pre-issued JWT auth (*or ENROLLMENT_KEY) |
| WORKER_POOL | No | default | Pool assignment |
| WORKER_LABELS | No | | Comma-separated key=value pairs |
| WORKER_MODELS | No | | Comma-separated model names |
| OPENAI_API_KEY | No | | For cloud model access |
| ANTHROPIC_API_KEY | No | | For Anthropic Claude access |
| OLLAMA_HOST | No | localhost:11434 | For local Ollama inference |
| HEARTBEAT_INTERVAL | No | 30 | Seconds between heartbeats |
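
The variables in this table could be read and validated along the following lines. This is a sketch under stated assumptions, not the worker's real startup code: `WorkerConfig` and `load_config` are hypothetical names, and preferring WRT_TOKEN when both credentials are set is an arbitrary choice made for the example.

```python
from dataclasses import dataclass, field
from typing import Mapping


@dataclass
class WorkerConfig:
    gateway_url: str
    credential: str                      # ENROLLMENT_KEY or WRT_TOKEN
    pool: str = "default"
    labels: dict[str, str] = field(default_factory=dict)
    models: list[str] = field(default_factory=list)
    heartbeat_interval: int = 30


def load_config(env: Mapping[str, str]) -> WorkerConfig:
    gateway_url = env.get("FLEET_GATEWAY_URL")
    if not gateway_url:
        raise ValueError("FLEET_GATEWAY_URL is required")

    # Either credential satisfies the requirement (precedence is an assumption).
    credential = env.get("WRT_TOKEN") or env.get("ENROLLMENT_KEY")
    if not credential:
        raise ValueError("set ENROLLMENT_KEY or WRT_TOKEN")

    labels: dict[str, str] = {}
    for pair in env.get("WORKER_LABELS", "").split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty input and trailing commas
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"bad label {pair!r}, expected key=value")
        labels[key.strip()] = value.strip()

    models = [m.strip() for m in env.get("WORKER_MODELS", "").split(",") if m.strip()]

    return WorkerConfig(
        gateway_url=gateway_url,
        credential=credential,
        pool=env.get("WORKER_POOL", "default"),
        labels=labels,
        models=models,
        heartbeat_interval=int(env.get("HEARTBEAT_INTERVAL", "30")),
    )
```

In a real process you would pass `os.environ` as the mapping; taking a mapping as a parameter keeps the parser easy to test.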

Bare metal / systemd

uv pip install sagewai
sagewai worker start \
  --pool gpu-workers \
  --labels gpu=true,env=staging \
  --enrollment-key "wrt-1.eyJ..."

Kubernetes

CPU workers with autoscaling:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sagewai-cpu-workers
spec:
  replicas: 3
  selector:
    matchLabels: { app: sagewai-worker, pool: cpu }
  template:
    metadata:
      labels: { app: sagewai-worker, pool: cpu }
    spec:
      containers:
      - name: worker
        image: sagewai/worker:latest
        env:
        - { name: FLEET_GATEWAY_URL, value: "http://sagewai-server:8000" }
        - { name: WORKER_POOL, value: "cpu" }
        - { name: WORKER_MODELS, value: "gpt-4o,claude-sonnet-4" }
        envFrom:
        - secretRef: { name: sagewai-worker-secrets }
        resources:
          requests: { memory: "512Mi", cpu: "500m" }
          limits: { memory: "2Gi", cpu: "2000m" }
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sagewai-cpu-workers-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sagewai-cpu-workers
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: External
    external:
      metric: { name: sagewai_pending_runs }
      target: { type: AverageValue, averageValue: "5" }
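
As a rough guide to how the autoscaler above behaves, the replica count for an External metric with an AverageValue target is approximately the metric total divided by the target, rounded up and clamped to the configured bounds. The sketch below shows that arithmetic only; it ignores the HPA's tolerance band and stabilization windows, and `desired_replicas` is a hypothetical helper.

```python
import math


def desired_replicas(
    pending_runs: int,
    target_average: float = 5.0,   # averageValue from the HPA spec
    min_replicas: int = 2,
    max_replicas: int = 20,
) -> int:
    """Approximate replica count for a given number of pending runs."""
    raw = math.ceil(pending_runs / target_average)
    return max(min_replicas, min(max_replicas, raw))
```

For example, 37 pending runs with a target of 5 per worker gives ceil(7.4) = 8 replicas, while an empty queue still keeps the 2-replica minimum.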

GPU workers (one per GPU node):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: sagewai-gpu-workers
spec:
  selector:
    matchLabels: { app: sagewai-worker, pool: gpu }
  template:
    metadata:
      labels: { app: sagewai-worker, pool: gpu }
    spec:
      nodeSelector:
        nvidia.com/gpu: "true"
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: worker
        image: sagewai/worker:latest
        env:
        - { name: FLEET_GATEWAY_URL, value: "http://sagewai-server:8000" }
        - { name: WORKER_POOL, value: "gpu" }
        - { name: WORKER_MODELS, value: "ollama/llama3.1:70b" }
        envFrom:
        - secretRef: { name: sagewai-worker-secrets }
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "48Gi"   # sizing guide: Local Ollama 70B needs about 48 GB

Worker pools and labels

Use pools to segment workers by environment or purpose; use labels for fine-grained routing within a pool.

from sagewai import WorkflowWorker

# Development pool
dev_worker = WorkflowWorker(pool="dev", labels={"region": "us-west"})

# Production GPU pool
gpu_worker = WorkflowWorker(pool="gpu-prod", labels={"gpu": "true", "vram": "80gb"})
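
One plausible way to read pool and label routing is "same pool, and every required label matches". The sketch below encodes that reading; the subset semantics are an assumption, not confirmed dispatcher behavior, and `WorkerInfo` and `is_eligible` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class WorkerInfo:
    pool: str
    labels: dict[str, str] = field(default_factory=dict)


def is_eligible(worker: WorkerInfo, pool: str, required_labels: dict[str, str]) -> bool:
    """True if the worker is in the pool and carries every required label."""
    if worker.pool != pool:
        return False
    # Extra labels on the worker are fine; missing or different ones are not.
    return all(worker.labels.get(k) == v for k, v in required_labels.items())
```

Under this reading, a worker labeled `gpu=true, vram=80gb` satisfies a request for `gpu=true` but not one for `region=us-west`.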

LLM-aware routing

Specify which model a workflow step requires and the dispatcher routes to a worker that has it:

from sagewai import DurableWorkflow, RoutingConstraints, UniversalAgent

# `store` is the workflow store you have already configured elsewhere
workflow = DurableWorkflow(name="inference-pipeline", store=store)

# The routing constraint and the agent should name the same model
@workflow.step("heavy-inference", routing=RoutingConstraints(target_model="llama3:70b"))
async def heavy_inference(prompt: str) -> str:
    agent = UniversalAgent(name="llm", model="llama3:70b")
    return await agent.chat(prompt)

The target_model filter is evaluated at claim time. If no worker with the required model is available, the task queues until one becomes available.
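
Claim-time filtering can be pictured as each worker scanning the queue for the first task it is able to run, leaving the rest in place. This is an illustrative sketch, not the dispatcher's implementation; `Task` and `claim_next` are hypothetical, and a real queue would live in a database rather than a `deque`.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    task_id: str
    target_model: Optional[str] = None   # None means any worker can run it


def claim_next(queue: "deque[Task]", worker_models: set[str]) -> Optional[Task]:
    """Remove and return the first queued task this worker can run."""
    for index, task in enumerate(queue):
        if task.target_model is None or task.target_model in worker_models:
            del queue[index]  # safe: we return before iterating further
            return task
    return None  # nothing runnable; every task stays queued
```

A task that needs `llama3:70b` simply stays at the head of the queue while smaller workers claim around it, and is picked up as soon as a worker declaring that model polls.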

Load balancing strategies

| Strategy | Behavior | Best For |
|---|---|---|
| DIRECT | No pre-assignment; workers self-select | Simple setups |
| ROUND_ROBIN | Rotate through eligible workers | Even distribution |
| LEAST_LOADED | Pick worker with lowest active_runs / max_concurrent ratio | Optimal utilization |
| THRESHOLD | Like LEAST_LOADED but skip workers above 90% capacity | Headroom for spikes |

from sagewai import RoutingConstraints

routing = RoutingConstraints(
    worker_pool="gpu",
    worker_labels={"region": "eu-west"},
    strategy="LEAST_LOADED",
    capacity_threshold=0.9,
)
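
The four strategies can be compared side by side with a small selection function. The sketch below follows the table's descriptions but is not the dispatcher's code: `WorkerLoad` and `Dispatcher` are hypothetical, and returning `None` to mean "no pre-assignment" is a convention chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkerLoad:
    worker_id: str
    active_runs: int
    max_concurrent: int

    @property
    def load(self) -> float:
        return self.active_runs / self.max_concurrent


class Dispatcher:
    def __init__(self, strategy: str, capacity_threshold: float = 0.9) -> None:
        self.strategy = strategy
        self.capacity_threshold = capacity_threshold
        self._next = 0  # rotation cursor for ROUND_ROBIN

    def pick(self, workers: list[WorkerLoad]) -> Optional[WorkerLoad]:
        """Return the pre-assigned worker, or None if none is assigned."""
        if self.strategy == "DIRECT" or not workers:
            return None  # workers self-select at claim time
        if self.strategy == "ROUND_ROBIN":
            chosen = workers[self._next % len(workers)]
            self._next += 1
            return chosen
        if self.strategy == "THRESHOLD":
            # Skip workers above the capacity threshold, then fall through.
            workers = [w for w in workers if w.load <= self.capacity_threshold]
            if not workers:
                return None
        elif self.strategy != "LEAST_LOADED":
            raise ValueError(f"unknown strategy {self.strategy!r}")
        return min(workers, key=lambda w: w.load)
```

The difference between LEAST_LOADED and THRESHOLD only shows up when every worker is busy: LEAST_LOADED still assigns to the least-busy worker, while THRESHOLD leaves the task unassigned so that spare capacity is preserved.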

Pool design patterns

| Pool | Purpose | Models | Cost |
|---|---|---|---|
| cpu-fast | High-volume, low-complexity | GPT-4o-mini, Haiku | Low |
| cpu-smart | Complex reasoning | GPT-4o, Sonnet, Opus | Medium-High |
| gpu-local | On-prem GPU inference | Ollama (Llama, Mistral) | $0/token |
| gpu-finetune | Dedicated fine-tuning | Unsloth | $0/token |
| regional-eu | EU data residency | Any | Compliance-driven |
| regional-us | US data residency | Any | Compliance-driven |

Worker sizing guide

| Workload | Container Memory | CPU | GPU | Max Concurrent |
|---|---|---|---|---|
| Cloud API agents (GPT-4o, Claude) | 512 MB | 0.5 core | None | 8–16 |
| Local Ollama 7B | 6 GB | 2 cores | 1x RTX 3060 | 2–4 |
| Local Ollama 70B | 48 GB | 4 cores | 1x A100 | 1–2 |
| RAG + embeddings | 2 GB | 1 core | Optional | 4–8 |
| Fine-tuning (Unsloth) | 16 GB | 4 cores | 1x RTX 3090+ | 1 |
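
The sizing table can be turned into a quick capacity estimate: divide the peak number of concurrent tasks by a worker's concurrency, round up, and multiply by the per-worker memory. The sketch below uses the lower bound of each Max Concurrent range as a conservative assumption; `SIZING` keys and `plan` are hypothetical names.

```python
import math

# workload -> (container memory in GB, conservative max concurrent per worker)
SIZING = {
    "cloud-api": (0.5, 8),
    "ollama-7b": (6, 2),
    "ollama-70b": (48, 1),
    "rag": (2, 4),
    "finetune": (16, 1),
}


def plan(workload: str, peak_concurrent_tasks: int) -> tuple[int, float]:
    """Return (workers needed, total container memory in GB)."""
    memory_gb, max_concurrent = SIZING[workload]
    workers = math.ceil(peak_concurrent_tasks / max_concurrent)
    return workers, workers * memory_gb
```

For instance, 7 concurrent 7B inference tasks at 2 per worker need 4 workers and roughly 24 GB of container memory in total.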

Monitoring

Fleet worker status is visible in the admin panel under Fleet → Workers. Each worker shows:

  • Current status (online / offline / pending approval)
  • Heartbeat timestamp
  • Tasks claimed / completed / failed
  • Declared models and labels
  • Any active anomaly alerts

The CLI equivalent:

sagewai fleet list-workers
sagewai fleet list-workers --pool gpu-prod --status online

Prometheus metrics for autoscaling or Grafana dashboards:

sagewai_pending_runs    — tasks waiting for workers
sagewai_worker_load     — per-worker utilization
sagewai_run_duration    — workflow execution time

See also

  • Fleet — conceptual overview: what the fleet system is and when to use it
  • Fleet Architecture — dispatch internals, isolation model, security layers, database schema