LLM Harness

This guide shows you how to deploy and configure the LLM Harness — a proxy and SDK middleware that routes each request to the cheapest model that can handle it.

Prerequisites: Sagewai SDK installed (pip install sagewai). For proxy mode, uvicorn is also required.

How it works

The Harness sits between AI coding tools (Claude Code, Cursor, Copilot) and your LLM providers. It scores each incoming request against 8 heuristic signals and assigns a complexity tier — SIMPLE, MEDIUM, or COMPLEX — then forwards the request to the corresponding model. A simple typo fix goes to Haiku ($0.80/M tokens); an architecture review stays on Opus ($15/M tokens). No LLM call is needed to make the routing decision.

Two deployment modes:

  • Proxy mode — a standalone HTTP server. Tools connect to it by pointing their ANTHROPIC_BASE_URL (or equivalent) at the proxy.
  • SDK mode — middleware that intercepts _call_llm() inside Sagewai agents. No separate server needed.

Proxy mode

Start the Harness as a standalone server:

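from sagewai.harness.app import create_harness_app
import uvicorn

# Provider keys the proxy uses when forwarding requests upstream.
app = create_harness_app(
    anthropic_api_key="sk-ant-...",
    openai_api_key="sk-...",
)
# host="0.0.0.0" listens on every interface; use "127.0.0.1" for local-only access.
uvicorn.run(app, host="0.0.0.0", port=8100)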

Point Claude Code at the proxy:

export ANTHROPIC_BASE_URL=http://localhost:8100/v1
export ANTHROPIC_API_KEY=sk-harness-<your-key>

For Cursor or Copilot, set the API base URL to http://localhost:8100/v1 in the tool's settings.

SDK mode

Wrap an existing agent without running a proxy:

from sagewai.harness import harness_wrap, ModelTierConfig

harness_wrap(agent, tier_config=ModelTierConfig(
    simple="claude-haiku-4-5-20251001",
    medium="claude-sonnet-4-5-20250929",
    complex="claude-opus-4-6",
))

Use HarnessingAgent to apply the same routing policy across a group of agents:

from sagewai.harness import HarnessingAgent, ModelTierConfig

# planner, coder, and reviewer are existing Sagewai agents.
harness = HarnessingAgent(
    agents=[planner, coder, reviewer],
    tier_config=ModelTierConfig(
        simple="claude-haiku-4-5-20251001",
        medium="claude-sonnet-4-5-20250929",
        complex="claude-opus-4-6",
    ),
)

Request classification

Every request is scored using 8 signals. The scoring is pure heuristics — no additional LLM call:

| Signal | Weight | Example |
| --- | --- | --- |
| Total tokens | High | >4000 tokens → +20 |
| Message count | Medium | >10 messages → +10 |
| System prompt size | Medium | >2000 tokens → +10 |
| Last user message length | High | fewer than 50 chars → -20 |
| Code blocks | Medium | 3+ blocks → +10 |
| Tool count | High | >5 tools → +15 |
| Complexity keywords | Medium | "architect", "design" → +8 each |
| Simplicity keywords | Medium | "fix", "typo", "rename" → -8 each |

Scores map to three tiers: SIMPLE (0–29), MEDIUM (30–69), COMPLEX (70–100).
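
To make the table and tier ranges concrete, here is a minimal sketch of how such a scorer could look. It is not the Harness implementation: the `RequestStats` and `Tier` types, the zero baseline, the clamping to 0–100, and the plain substring keyword matching are all assumptions made for illustration. Only the thresholds, adjustments, and tier ranges come from the text above.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    """Illustrative stand-in for the SDK's tier type."""
    SIMPLE = "simple"
    MEDIUM = "medium"
    COMPLEX = "complex"


@dataclass
class RequestStats:
    """Hypothetical summary of a request; field names are assumptions."""
    total_tokens: int = 0
    message_count: int = 0
    system_prompt_tokens: int = 0
    last_user_message: str = ""
    code_blocks: int = 0
    tool_count: int = 0


COMPLEXITY_KEYWORDS = ("architect", "design")
SIMPLICITY_KEYWORDS = ("fix", "typo", "rename")


def score_request(stats: RequestStats, base: int = 0) -> int:
    """Apply the eight signals from the table and clamp the result to 0-100."""
    score = base  # assumed baseline; the guide does not state one
    if stats.total_tokens > 4000:
        score += 20
    if stats.message_count > 10:
        score += 10
    if stats.system_prompt_tokens > 2000:
        score += 10
    if len(stats.last_user_message) < 50:
        score -= 20
    if stats.code_blocks >= 3:
        score += 10
    if stats.tool_count > 5:
        score += 15
    text = stats.last_user_message.lower()
    # Simplification: each keyword counts once if it appears as a substring.
    score += 8 * sum(kw in text for kw in COMPLEXITY_KEYWORDS)
    score -= 8 * sum(kw in text for kw in SIMPLICITY_KEYWORDS)
    return max(0, min(100, score))


def tier_for(score: int) -> Tier:
    """Map a 0-100 score to SIMPLE (0-29), MEDIUM (30-69), or COMPLEX (70-100)."""
    if score < 30:
        return Tier.SIMPLE
    if score < 70:
        return Tier.MEDIUM
    return Tier.COMPLEX
```

Under these assumptions, a short "fix typo" message scores at the bottom of the range and lands in SIMPLE, while a long, tool-heavy design request accumulates enough points to reach COMPLEX.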

Policies

Define routing rules per org, team, project, or user:

# ComplexityTier is assumed to be exported from the same module as PolicyRule.
from sagewai.harness.models import ComplexityTier, PolicyRule, PolicyScope

PolicyRule(
    name="intern-cap",
    scope=PolicyScope(org_id="acme", user_id="bob"),
    max_tier=ComplexityTier.MEDIUM,       # No Opus access
    blocked_models=["claude-opus-4-6"],   # Explicit block
    allow_override=False,                  # Cannot be bypassed
)

When multiple policies match a request, the most specific one wins. Resolution order: user > project > team > org.
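
The resolution order can be sketched as a small selection function. This is an illustration, not the Harness's resolver: `Scope` and `Rule` are simplified stand-ins for `PolicyScope` and `PolicyRule`, and the `team_id` and `project_id` field names are assumptions (the example above only shows `org_id` and `user_id`).

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Scope:
    """Stand-in for PolicyScope; team_id and project_id are assumed names."""
    org_id: Optional[str] = None
    team_id: Optional[str] = None
    project_id: Optional[str] = None
    user_id: Optional[str] = None


@dataclass(frozen=True)
class Rule:
    """Stand-in for PolicyRule, reduced to a name and a scope."""
    name: str
    scope: Scope


# Most specific level first: user > project > team > org.
PRECEDENCE = ("user_id", "project_id", "team_id", "org_id")


def matches(rule_scope: Scope, request: Scope) -> bool:
    """A rule matches when every field it sets equals the request's value."""
    return all(
        getattr(rule_scope, field) in (None, getattr(request, field))
        for field in PRECEDENCE
    )


def specificity(scope: Scope) -> int:
    """Rank of the most specific field a scope sets; lower means more specific."""
    for rank, field in enumerate(PRECEDENCE):
        if getattr(scope, field) is not None:
            return rank
    return len(PRECEDENCE)


def resolve(rules: Iterable[Rule], request: Scope) -> Optional[Rule]:
    """Return the most specific matching rule, or None if nothing matches."""
    candidates = [r for r in rules if matches(r.scope, request)]
    return min(candidates, key=lambda r: specificity(r.scope), default=None)
```

With an org-wide rule for "acme" and a user-level rule for "bob", a request from bob resolves to the user-level rule, while any other acme user falls back to the org-wide rule.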

Budget enforcement

Set spend limits per user, team, or project:

from sagewai.harness.budget import HarnessBudgetManager
# BudgetManager must also be imported; its module path is not shown in this guide.

budget = HarnessBudgetManager(BudgetManager())
budget.configure_user_budget("alice", max_daily_usd=5.0, max_monthly_usd=50.0, action="downgrade")

When a budget is exceeded, the action field controls what happens: warn (log only), downgrade (force the cheapest available model), stop (reject with HTTP 429).
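
The three actions can be pictured as a small decision function. The sketch below is a hypothetical illustration of the behaviour described above, not code from the SDK: `BudgetDecision` and `enforce_budget` are invented names, and treating "exceeded" as strictly greater than the limit is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetDecision:
    """Hypothetical result of a budget check."""
    allowed: bool
    model: str
    status: int
    warning: str = ""


def enforce_budget(
    spent_usd: float,
    limit_usd: float,
    action: str,
    requested_model: str,
    cheapest_model: str,
) -> BudgetDecision:
    """Decide what happens to a request given current spend and the budget action."""
    if spent_usd <= limit_usd:
        # Within budget: route as requested.
        return BudgetDecision(True, requested_model, 200)
    if action == "warn":
        # Log only: the request proceeds unchanged.
        message = f"budget exceeded: ${spent_usd:.2f} of ${limit_usd:.2f}"
        return BudgetDecision(True, requested_model, 200, warning=message)
    if action == "downgrade":
        # Force the cheapest available model.
        return BudgetDecision(True, cheapest_model, 200)
    if action == "stop":
        # Reject the request with HTTP 429.
        return BudgetDecision(False, requested_model, 429)
    raise ValueError(f"unknown budget action: {action!r}")
```

For example, with alice's $5.00 daily limit and the `downgrade` action, a request for Opus made after $5.20 of spend would be served by the cheapest configured model instead.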

Custom directives

Register Harness directives with the Directive Engine to override routing per request:

  • @route:simple / @route:medium / @route:complex — force a specific tier
  • @cost:estimate — report pricing for the current model without changing routing

The #model:name meta-directive always takes precedence over Harness tier decisions. If a user specifies a model explicitly, that choice is honored.
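
The precedence between directives and the classifier can be sketched as follows. This is an illustrative parser, not the Directive Engine: the regular expressions, the `choose_model` function, and the plain dictionary of tier names to models are assumptions. The model IDs are the ones used in the SDK mode example.

```python
import re

# Tier-to-model mapping, mirroring the ModelTierConfig example in this guide.
TIER_MODELS = {
    "simple": "claude-haiku-4-5-20251001",
    "medium": "claude-sonnet-4-5-20250929",
    "complex": "claude-opus-4-6",
}

# Assumed directive syntax: "@route:<tier>" and "#model:<name>" anywhere in the message.
ROUTE_RE = re.compile(r"@route:(simple|medium|complex)\b")
MODEL_RE = re.compile(r"#model:([\w.\-]+)")


def choose_model(message: str, classified_tier: str) -> str:
    """Pick a model: #model wins, then @route, then the classifier's tier."""
    explicit = MODEL_RE.search(message)
    if explicit:
        return explicit.group(1)
    forced = ROUTE_RE.search(message)
    tier = forced.group(1) if forced else classified_tier
    return TIER_MODELS[tier]
```

A message containing both `#model:claude-opus-4-6` and `@route:simple` resolves to Opus under this sketch, because the explicit model is checked first.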

Admin API

The Harness exposes its admin endpoints under /api/v1/harness/. The table below lists 14 method and path combinations:

| Method | Path | Description |
| --- | --- | --- |
| GET | /policies | List policies |
| POST | /policies | Create policy |
| GET/PUT/DELETE | /policies/:id | Manage a policy |
| GET | /keys | List keys |
| POST | /keys | Create key |
| DELETE | /keys/:id | Revoke key |
| GET | /spend | Spend summary |
| GET | /spend/breakdown | Spend broken down by model |
| GET | /audit | Audit events |
| GET/PUT | /config | Global config |
| POST | /test-classify | Dry-run classification without routing |
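
As a rough illustration of calling these endpoints, the sketch below builds requests with the standard library only. It assumes the admin API is served by the same proxy on localhost:8100 as in the Proxy mode section. Authentication headers and request body schemas are not described in this guide, so none are shown, and `build_request` is a hypothetical helper.

```python
import json
import urllib.request
from typing import Optional

# Assumption: the admin API lives on the proxy started in "Proxy mode".
BASE_URL = "http://localhost:8100/api/v1/harness"


def build_request(
    method: str, path: str, body: Optional[dict] = None
) -> urllib.request.Request:
    """Build (but do not send) a request for an admin endpoint such as /policies."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


list_policies = build_request("GET", "/policies")
spend_by_model = build_request("GET", "/spend/breakdown")

# To send one against a running Harness:
# with urllib.request.urlopen(list_policies) as resp:
#     print(json.load(resp))
```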

Local model discovery

When the Harness starts, it probes common local inference ports and registers any running servers as $0/token backends:

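import asyncio
from sagewai.harness.discovery import discover_local_backends

# Probes: Ollama (11434), LM Studio (1234), Unsloth (8001), vLLM (8000), LocalAI (8080)
# discover_local_backends() is a coroutine, so a plain script runs it with asyncio.run().
backends = asyncio.run(discover_local_backends())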

No configuration is required — start your local server and the Harness picks it up automatically.
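
To check by hand which of these ports has a server listening, a plain TCP probe is enough. The sketch below is not how `discover_local_backends` is implemented (real discovery presumably talks to each server's HTTP API); it only tests whether a port accepts connections. The port numbers come from the list above, and the function names are hypothetical.

```python
import socket

# Ports listed in this guide; the mapping itself is just for illustration.
LOCAL_PORTS = {
    "Ollama": 11434,
    "LM Studio": 1234,
    "Unsloth": 8001,
    "vLLM": 8000,
    "LocalAI": 8080,
}


def is_port_open(port: int, host: str = "127.0.0.1", timeout: float = 0.25) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections and timeouts alike.
        return False


def probe_local_servers() -> dict[str, int]:
    """Return the subset of LOCAL_PORTS that currently accepts connections."""
    return {name: port for name, port in LOCAL_PORTS.items() if is_port_open(port)}
```

An open port only shows that some process is listening there, not that it is the named inference server, so treat the result as a hint.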

Cost example

| Request type | Without Harness | With Harness | Savings |
| --- | --- | --- | --- |
| Fix typo (1K tokens) | Opus: $0.015 | Haiku: $0.0008 | 95% |
| Refactor function (5K) | Opus: $0.075 | Sonnet: $0.015 | 80% |
| Architecture design (20K) | Opus: $0.300 | Opus: $0.300 | 0% |
| Daily (1000 requests, 70% simple) | $18.90 | $4.32 | 77% |
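
The per-request rows follow from simple arithmetic: tokens times the per-million price, divided by one million. The sketch below reproduces them. The Haiku and Opus prices are the ones quoted in this guide. The Sonnet price of $3.00/M is inferred from the table ($0.015 for 5K tokens) and should be treated as an assumption. The daily row depends on a request mix the guide does not spell out, so it is not recomputed here.

```python
# USD per million tokens. Haiku and Opus are from this guide; Sonnet is inferred
# from the cost table and is an assumption.
PRICE_PER_M = {"haiku": 0.80, "sonnet": 3.00, "opus": 15.00}


def request_cost(tokens: int, model: str) -> float:
    """Cost in USD of processing `tokens` tokens on `model`."""
    return tokens * PRICE_PER_M[model] / 1_000_000


def savings_pct(baseline_usd: float, routed_usd: float) -> float:
    """Percentage saved by routing, relative to the baseline cost."""
    if baseline_usd == 0:
        return 0.0
    return 100 * (baseline_usd - routed_usd) / baseline_usd


for label, tokens, routed in [
    ("Fix typo", 1_000, "haiku"),
    ("Refactor function", 5_000, "sonnet"),
    ("Architecture design", 20_000, "opus"),
]:
    baseline = request_cost(tokens, "opus")
    cost = request_cost(tokens, routed)
    print(f"{label}: ${baseline:.4f} -> ${cost:.4f} ({savings_pct(baseline, cost):.0f}% saved)")
```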

For a full working example, see Example 23 (packages/sdk/sagewai/examples/23_harness_proxy.py) and Example 24 (packages/sdk/sagewai/examples/24_harness_agent.py).


See also

  • Cost Management — budget limits, model fallback chains, token controls, and spend monitoring
  • LiteLLM Integration — deploy a shared LLM gateway with per-project virtual keys and spend limits
  • Observatory — dashboards for agent metrics, costs, and latency across all runs