LLM Harness

This guide shows you how to deploy and configure the LLM Harness — a proxy and SDK middleware that routes each request to the cheapest model that can handle it.

Prerequisites: Sagewai SDK installed (pip install sagewai). For proxy mode, uvicorn is also required.

How it works

The Harness sits between AI coding tools (Claude Code, Cursor, Copilot) and your LLM providers. It scores each incoming request against 8 heuristic signals and assigns a complexity tier — SIMPLE, MEDIUM, or COMPLEX — then forwards the request to the corresponding model. A simple typo fix goes to Haiku ($0.80/M tokens); an architecture review stays on Opus ($15/M tokens). No LLM call is needed to make the routing decision.

Two deployment modes:

Proxy mode — a standalone HTTP server. Tools connect to it by pointing their ANTHROPIC_BASE_URL (or equivalent) at the proxy.
SDK mode — middleware that intercepts _call_llm() inside Sagewai agents. No separate server needed.

Proxy mode

Start the Harness as a standalone server:

from sagewai.harness.app import create_harness_app
import uvicorn

app = create_harness_app(
    anthropic_api_key="sk-ant-...",
    openai_api_key="sk-...",
)
# 0.0.0.0 listens on all network interfaces; use 127.0.0.1 to accept local connections only.
uvicorn.run(app, host="0.0.0.0", port=8100)

Point Claude Code at the proxy:

export ANTHROPIC_BASE_URL=http://localhost:8100/v1
export ANTHROPIC_API_KEY=sk-harness-<your-key>

For Cursor or Copilot, set the API base URL to http://localhost:8100/v1 in the tool's settings.

SDK mode

Wrap an existing agent without running a proxy:

from sagewai.harness import harness_wrap, ModelTierConfig

# `agent` is an existing Sagewai agent instance created elsewhere.
harness_wrap(agent, tier_config=ModelTierConfig(
    simple="claude-haiku-4-5-20251001",
    medium="claude-sonnet-4-5-20250929",
    complex="claude-opus-4-6",
))

Use HarnessingAgent to apply the same routing policy across a group of agents:

from sagewai.harness import HarnessingAgent, ModelTierConfig

harness = HarnessingAgent(
    agents=[planner, coder, reviewer],
    tier_config=ModelTierConfig(
        simple="claude-haiku-4-5-20251001",
        medium="claude-sonnet-4-5-20250929",
        complex="claude-opus-4-6",
    ),
)

Request classification

Every request is scored using 8 signals. The scoring is pure heuristics — no additional LLM call:

Signal	Weight	Example
Total tokens	High	>4000 tokens → +20
Message count	Medium	>10 messages → +10
System prompt size	Medium	>2000 tokens → +10
Last user message length	High	fewer than 50 chars → -20
Code blocks	Medium	3+ blocks → +10
Tool count	High	>5 tools → +15
Complexity keywords	Medium	"architect", "design" → +8 each
Simplicity keywords	Medium	"fix", "typo", "rename" → -8 each

Scores map to three tiers: SIMPLE (0–29), MEDIUM (30–69), COMPLEX (70–100).
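
The scoring table and tier ranges above can be sketched roughly as follows. This is a minimal illustration, not the Harness's actual classifier: the `RequestStats` container, the zero baseline, the clamping to 0-100, and the naive substring keyword matching are all assumptions, and the keyword lists contain only the examples shown in the table.

```python
from dataclasses import dataclass

# Keyword lists hold only the table's examples; the real lists are not shown in this guide.
COMPLEXITY_KEYWORDS = ("architect", "design")
SIMPLICITY_KEYWORDS = ("fix", "typo", "rename")


@dataclass
class RequestStats:
    """Hypothetical container for the measurements the eight signals need."""
    total_tokens: int
    message_count: int
    system_prompt_tokens: int
    last_user_message: str
    code_block_count: int
    tool_count: int


def score_request(stats: RequestStats) -> int:
    """Sum the table's example adjustments, then clamp to 0..100 (assumed)."""
    score = 0  # assumed baseline
    if stats.total_tokens > 4000:
        score += 20
    if stats.message_count > 10:
        score += 10
    if stats.system_prompt_tokens > 2000:
        score += 10
    if len(stats.last_user_message) < 50:
        score -= 20
    if stats.code_block_count >= 3:
        score += 10
    if stats.tool_count > 5:
        score += 15
    text = stats.last_user_message.lower()
    # Naive substring matching: "prefix" would also count as "fix".
    score += 8 * sum(kw in text for kw in COMPLEXITY_KEYWORDS)
    score -= 8 * sum(kw in text for kw in SIMPLICITY_KEYWORDS)
    return max(0, min(100, score))


def tier_for(score: int) -> str:
    """Map a clamped score to a tier using the documented ranges."""
    if score <= 29:
        return "SIMPLE"
    if score <= 69:
        return "MEDIUM"
    return "COMPLEX"
```

Under these assumptions, a short "fix typo" message in a small conversation clamps to 0 and lands in SIMPLE, while a long architecture request with many tools and code blocks accumulates enough points to reach COMPLEX.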

Policies

Define routing rules per org, team, project, or user:

# ComplexityTier is assumed to live alongside the policy models.
from sagewai.harness.models import ComplexityTier, PolicyRule, PolicyScope

PolicyRule(
    name="intern-cap",
    scope=PolicyScope(org_id="acme", user_id="bob"),
    max_tier=ComplexityTier.MEDIUM,       # No Opus access
    blocked_models=["claude-opus-4-6"],   # Explicit block
    allow_override=False,                  # Cannot be bypassed
)

When multiple policies match a request, the most specific one wins. Resolution order: user > project > team > org.
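
One way to picture "most specific wins" is the sketch below. The `Scope` and `Rule` classes are hypothetical stand-ins for `PolicyScope` and `PolicyRule`, and the `team_id` and `project_id` field names are assumed; only `org_id` and `user_id` appear in the example above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Scope:
    """Hypothetical stand-in for PolicyScope; unset fields match anything."""
    org_id: Optional[str] = None
    team_id: Optional[str] = None
    project_id: Optional[str] = None
    user_id: Optional[str] = None


@dataclass(frozen=True)
class Rule:
    """Hypothetical stand-in for PolicyRule."""
    name: str
    scope: Scope


# Most specific level first: user > project > team > org.
LEVELS = ("user_id", "project_id", "team_id", "org_id")


def matches(rule: Rule, request: Scope) -> bool:
    """A rule matches when every field it sets equals the request's value."""
    return all(
        getattr(rule.scope, field) in (None, getattr(request, field))
        for field in LEVELS
    )


def specificity(rule: Rule) -> int:
    """Rank a rule by the most specific field it sets (0 = user level)."""
    for rank, field in enumerate(LEVELS):
        if getattr(rule.scope, field) is not None:
            return rank
    return len(LEVELS)


def resolve(rules: List[Rule], request: Scope) -> Optional[Rule]:
    """Return the most specific matching rule, or None if nothing matches."""
    candidates = [rule for rule in rules if matches(rule, request)]
    return min(candidates, key=specificity, default=None)
```

With an org-wide rule for `acme` and a user rule for `bob`, a request from `bob` resolves to the user rule, while a request from any other `acme` user falls back to the org rule.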

Budget enforcement

Set spend limits per user, team, or project:

from sagewai.harness.budget import HarnessBudgetManager

# BudgetManager is the SDK's underlying budget manager; it must be imported as well.
budget = HarnessBudgetManager(BudgetManager())
budget.configure_user_budget("alice", max_daily_usd=5.0, max_monthly_usd=50.0, action="downgrade")

When a budget is exceeded, the action field controls what happens: warn (log only), downgrade (force the cheapest available model), stop (reject with HTTP 429).
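
The three actions could be dispatched along these lines. This is an illustrative sketch only: `BudgetDecision` and `apply_budget_action` are hypothetical names, and treating "exceeded" as strictly greater than the limit is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BudgetDecision:
    """Hypothetical result type: which model to call and what status to return."""
    model: Optional[str]
    status_code: int
    warning: Optional[str] = None


def apply_budget_action(
    action: str,
    requested_model: str,
    cheapest_model: str,
    spent_usd: float,
    limit_usd: float,
) -> BudgetDecision:
    """Decide what happens to a request given current spend and the configured action."""
    if spent_usd <= limit_usd:  # assumed: the budget is exceeded only above the limit
        return BudgetDecision(requested_model, 200)
    if action == "warn":
        # Log only: the request proceeds on the requested model.
        message = f"budget exceeded: ${spent_usd:.2f} of ${limit_usd:.2f}"
        return BudgetDecision(requested_model, 200, warning=message)
    if action == "downgrade":
        # Force the cheapest available model.
        return BudgetDecision(cheapest_model, 200)
    if action == "stop":
        # Reject the request with HTTP 429.
        return BudgetDecision(None, 429)
    raise ValueError(f"unknown budget action: {action!r}")
```

For example, a user at $5.25 of a $5.00 daily limit with `action="downgrade"` would be served by the cheapest configured model instead of the one the classifier picked.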

Custom directives

The Harness recognizes the following directives:

@route:simple / @route:medium / @route:complex: force a specific tier
@cost:estimate: report pricing for the current model without changing routing

The #model:name meta-directive always takes precedence over Harness tier decisions. If a user specifies a model explicitly, that choice is honored.
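
The precedence rule can be sketched as below. The guide does not specify where directives appear or how they are parsed, so this example assumes they are embedded in the message text and extracted with regular expressions; `parse_directives` and `choose_model` are hypothetical helpers.

```python
import re
from typing import Dict, NamedTuple, Optional

ROUTE_RE = re.compile(r"@route:(simple|medium|complex)\b")
MODEL_RE = re.compile(r"#model:([\w.\-]+)")
COST_RE = re.compile(r"@cost:estimate\b")


class Directives(NamedTuple):
    forced_tier: Optional[str]
    forced_model: Optional[str]
    cost_estimate: bool


def parse_directives(text: str) -> Directives:
    """Extract routing directives from message text (assumed location)."""
    route = ROUTE_RE.search(text)
    model = MODEL_RE.search(text)
    return Directives(
        forced_tier=route.group(1).upper() if route else None,
        forced_model=model.group(1) if model else None,
        cost_estimate=bool(COST_RE.search(text)),
    )


def choose_model(
    directives: Directives,
    classified_tier: str,
    tier_models: Dict[str, str],
) -> str:
    """Apply precedence: #model:name, then @route:tier, then the classifier's tier."""
    if directives.forced_model:
        return directives.forced_model
    tier = directives.forced_tier or classified_tier
    return tier_models[tier]
```

Here a message containing both `@route:simple` and `#model:claude-opus-4-6` resolves to `claude-opus-4-6`, because the explicit model choice overrides the tier.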

Admin API

The Harness exposes 14 endpoints (method and path combinations) under /api/v1/harness/:

Method	Path	Description
GET	/policies	List policies
POST	/policies	Create policy
GET/PUT/DELETE	/policies/:id	Manage a policy
GET	/keys	List keys
POST	/keys	Create key
DELETE	/keys/:id	Revoke key
GET	/spend	Spend summary
GET	/spend/breakdown	Spend broken down by model
GET	/audit	Audit events
GET/PUT	/config	Global config
POST	/test-classify	Dry-run classification without routing
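
As a rough illustration of calling the dry-run endpoint, the sketch below builds a request with the standard library. The JSON payload shape, the bearer-token header, and the assumption that the admin API is served from the same base URL as the proxy are not confirmed by this guide; check the actual API schema before relying on them.

```python
import json
import urllib.request
from typing import Dict, List, Optional


def build_test_classify_request(
    base_url: str,
    messages: List[Dict[str, str]],
    api_key: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST to /api/v1/harness/test-classify."""
    url = base_url.rstrip("/") + "/api/v1/harness/test-classify"
    # Assumed payload shape: a chat-style list of messages.
    body = json.dumps({"messages": messages}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Assumed auth scheme: bearer token.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

Passing the returned object to `urllib.request.urlopen` would send it to a running Harness; building the request separately makes it easy to inspect the URL and body first.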

Local model discovery

When the Harness starts, it probes common local inference ports and registers any running servers as $0/token backends:

import asyncio
from sagewai.harness.discovery import discover_local_backends

# Probes: Ollama (11434), LM Studio (1234), Unsloth (8001), vLLM (8000), LocalAI (8080)
# discover_local_backends() must be awaited, so run it inside an event loop.
backends = asyncio.run(discover_local_backends())

No configuration is required — start your local server and the Harness picks it up automatically.
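
To check by hand which of these ports have something listening, a plain TCP probe is enough. This sketch is not how `discover_local_backends` is implemented (that is not shown here); it only tests whether a port accepts connections, which does not prove an inference server is behind it.

```python
import socket
from typing import Dict, List

# Ports taken from the probe list above.
LOCAL_PORTS: Dict[str, int] = {
    "Ollama": 11434,
    "LM Studio": 1234,
    "Unsloth": 8001,
    "vLLM": 8000,
    "LocalAI": 8080,
}


def is_port_open(host: str, port: int, timeout: float = 0.25) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe_local_ports(host: str = "127.0.0.1", ports: Dict[str, int] = LOCAL_PORTS) -> List[str]:
    """Return the names of the servers whose ports accept connections."""
    return [name for name, port in ports.items() if is_port_open(host, port)]
```

Calling `probe_local_ports()` on a machine running only Ollama would be expected to return a list containing just that name, provided nothing else occupies the other ports.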

Cost example

Request type	Without Harness	With Harness	Savings
Fix typo (1K tokens)	Opus: $0.015	Haiku: $0.0008	95%
Refactor function (5K)	Opus: $0.075	Sonnet: $0.015	80%
Architecture design (20K)	Opus: $0.300	Opus: $0.300	0%
Daily (1000 requests, 70% simple)	$18.90	$4.32	77%
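
The per-request rows follow from tokens / 1,000,000 × price per million tokens. The short sketch below reproduces that arithmetic; the Sonnet price of $3/M is inferred from the table ($0.015 for 5K tokens) rather than stated in this guide, and a single flat rate per model is a simplification.

```python
# Flat prices in USD per million tokens. Haiku and Opus are quoted earlier in the guide;
# the Sonnet figure is inferred from the table row ($0.015 for 5K tokens).
PRICE_PER_MILLION_USD = {"opus": 15.00, "sonnet": 3.00, "haiku": 0.80}


def request_cost(tokens: int, model: str) -> float:
    """Cost of one request in USD at a flat per-token rate."""
    return tokens / 1_000_000 * PRICE_PER_MILLION_USD[model]


def savings_pct(baseline_usd: float, routed_usd: float) -> int:
    """Percentage saved relative to the baseline, rounded to a whole number."""
    if baseline_usd == 0:
        return 0
    return round((1 - routed_usd / baseline_usd) * 100)


if __name__ == "__main__":
    rows = [
        ("Fix typo", 1_000, "haiku"),
        ("Refactor function", 5_000, "sonnet"),
        ("Architecture design", 20_000, "opus"),
    ]
    for label, tokens, routed_model in rows:
        baseline = request_cost(tokens, "opus")
        routed = request_cost(tokens, routed_model)
        print(f"{label}: ${baseline:.4f} -> ${routed:.4f} ({savings_pct(baseline, routed)}% saved)")
```

The daily row depends on the request mix and token sizes, which are not broken down here, so it is not recomputed; only its stated totals ($18.90 versus $4.32) can be checked against the 77% figure.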

For full working examples, see Example 23 (packages/sdk/sagewai/examples/23_harness_proxy.py) and Example 24 (packages/sdk/sagewai/examples/24_harness_agent.py).