Cost Management

This guide covers the four main tools for controlling LLM spend: budget limits, model fallback chains, cost-aware routing, and token controls. Use the Admin Panel for a UI over the same data.

Model cost reference

Model              Cost per 1M input tokens   Cost per 1M output tokens
GPT-4o             $2.50                      $10.00
GPT-4o-mini        $0.15                      $0.60
Claude 3.5 Sonnet  $3.00                      $15.00
Gemini 2.0 Flash   $0.075                     $0.30

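A GPT-4o agent processing 100 requests/day at 2K tokens each uses about 200K tokens/day. At the prices above, that costs between $0.50/day (if all tokens are input) and $2.00/day (if all are output). In multi-agent workflows, costs compound across every step.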
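
As a rough check on estimates like this, per-request cost can be computed directly from the table. The sketch below hard-codes the listed prices; the estimate_cost helper and the split of 1,500 input and 500 output tokens per request are illustrative assumptions, not part of Sagewai.

```python
# Prices in USD per 1M tokens (input, output), copied from the table above.
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "claude-3.5-sonnet": (3.00, 15.00),
    "gemini-2.0-flash": (0.075, 0.30),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request (hypothetical helper)."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Assumed split: 1,500 input tokens and 500 output tokens per request.
per_request = estimate_cost("gpt-4o", input_tokens=1_500, output_tokens=500)
per_day = per_request * 100  # 100 requests/day
print(f"${per_request:.5f} per request, ${per_day:.3f} per day")
```

Because output tokens cost four times as much as input tokens for every model in the table, the input/output split moves the estimate more than the raw token total does.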


Budget Limits

Set daily and monthly spending limits per agent:

from sagewai.admin.budget import BudgetManager, BudgetLimit

budget = BudgetManager()

budget.add_limit(BudgetLimit(
    agent_name="researcher",
    max_daily_usd=5.0,
    max_monthly_usd=100.0,
    action="throttle",  # warn, throttle, or stop
))

budget.add_limit(BudgetLimit(
    agent_name="writer",
    max_daily_usd=10.0,
    max_monthly_usd=200.0,
    action="stop",
))

Recording spend

Call record_spend after each LLM call:

budget.record_spend(agent_name="researcher", cost_usd=0.015)

Checking budget before a run

result = budget.check_budget("researcher")

if result.allowed:
    response = await agent.chat(message)
else:
    print(f"Budget exceeded: {result.reason}")
    # Use a fallback model, queue the request, or reject it

Monitoring spend

status = budget.get_budget_status("researcher")
print(f"Daily spend: ${status['daily_spend_usd']:.2f} / ${status['max_daily_usd']:.2f}")
print(f"Monthly spend: ${status['monthly_spend_usd']:.2f} / ${status['max_monthly_usd']:.2f}")
print(f"Daily remaining: ${status['daily_remaining_usd']:.2f}")

Budget actions

Action    Behavior
warn      Logs a warning and allows the request
throttle  Falls back to a cheaper model
stop      Blocks the request
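
To make the three actions concrete, here is a minimal, self-contained sketch of the decision they describe. It is not Sagewai's implementation: decide, Decision, and the cheaper_model parameter are hypothetical names, only a single limit is modelled, and treating "spent equals limit" as exceeded is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    allowed: bool
    model: Optional[str]  # model to use, or None when the request is blocked
    reason: str = ""


def decide(spent_usd: float, limit_usd: float, action: str,
           model: str, cheaper_model: str) -> Decision:
    """Map a budget action to an outcome (illustrative only)."""
    # Assumption: the budget counts as exceeded once spend reaches the limit.
    if spent_usd < limit_usd:
        return Decision(True, model)
    reason = f"spent ${spent_usd:.2f} of ${limit_usd:.2f}"
    if action == "warn":
        print(f"WARNING: budget exceeded ({reason})")
        return Decision(True, model, reason)
    if action == "throttle":
        return Decision(True, cheaper_model, reason)
    if action == "stop":
        return Decision(False, None, reason)
    raise ValueError(f"unknown action: {action!r}")


throttled = decide(5.25, 5.0, "throttle", "gpt-4o", "gpt-4o-mini")
blocked = decide(5.25, 5.0, "stop", "gpt-4o", "gpt-4o-mini")
```

Here throttled still allows the request but on the cheaper model, while blocked carries a reason the caller can log or return to the user.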

Model Fallback Chains

Pair action="throttle" with a fallback_chain to automatically step down to cheaper models when a budget is exceeded:

budget.add_limit(BudgetLimit(
    agent_name="researcher",
    max_daily_usd=5.0,
    max_monthly_usd=100.0,
    action="throttle",
    fallback_chain=["gpt-4o-mini", "gemini/gemini-2.0-flash"],
))

When the daily budget is exceeded:

  1. Try gpt-4o-mini (about 17x cheaper than GPT-4o).
  2. If that is also over budget, try gemini/gemini-2.0-flash (about 33x cheaper than GPT-4o).
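
The step-down described above can be sketched as a walk over the chain that returns the first model still within budget. This is standalone Python written under assumptions: pick_fallback and the over_budget set are hypothetical and stand in for whatever bookkeeping the library does internally.

```python
from typing import Iterable, Optional, Set


def pick_fallback(chain: Iterable[str], over_budget: Set[str]) -> Optional[str]:
    """Return the first model in the chain that is not over budget."""
    for model in chain:
        if model not in over_budget:
            return model
    return None  # chain exhausted; the caller decides whether to queue or reject


chain = ["gpt-4o-mini", "gemini/gemini-2.0-flash"]
first = pick_fallback(chain, over_budget={"gpt-4o"})
second = pick_fallback(chain, over_budget={"gpt-4o", "gpt-4o-mini"})
exhausted = pick_fallback(chain, over_budget=set(chain))
```

Ordering the chain from most to least capable means each step trades some quality for a lower price, and an exhausted chain is an explicit signal rather than a silent failure.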

Applying the fallback

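from sagewai.engines.universal import UniversalAgent

# Ask the budget manager whether a cheaper model should replace gpt-4o
fallback = budget.get_fallback_model("researcher", current_model="gpt-4o")

# Use the fallback if one was returned; otherwise keep the original model
agent = UniversalAgent(name="researcher", model=fallback or "gpt-4o")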

Model Router

Use ModelRouter to pick the cheapest model that fits each request, based on query characteristics:

from sagewai.core.model_router import ModelRouter, short_query_rule, tool_heavy_rule

router = ModelRouter(
    rules=[
        # Short queries go to the cheap model
        short_query_rule(threshold=100, model="gpt-4o-mini"),

        # Tool-heavy queries need a more capable model
        tool_heavy_rule(model="gpt-4o", min_tools=3),
    ],
    default_model="gpt-4o",
)

model = router.select_model(
    "What is 2+2?",
    context={"tool_count": 0},
)
# -> "gpt-4o-mini" (short query, no tools)
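
Conceptually, a rule-based router evaluates its rules in order and returns the first match, falling back to the default model. The standalone sketch below illustrates that idea only. SimpleRouter, short_query, and tool_heavy are hypothetical names, and both the first-match-wins ordering and the reading of threshold as a character count are assumptions rather than documented ModelRouter behavior.

```python
from typing import Callable, Dict, List, Optional

# A rule inspects the query and context, returning a model name or None.
Rule = Callable[[str, Dict[str, int]], Optional[str]]


def short_query(threshold: int, model: str) -> Rule:
    """Match queries shorter than `threshold` characters (assumed unit)."""
    return lambda query, ctx: model if len(query) < threshold else None


def tool_heavy(model: str, min_tools: int) -> Rule:
    """Match requests that expose at least `min_tools` tools."""
    return lambda query, ctx: model if ctx.get("tool_count", 0) >= min_tools else None


class SimpleRouter:
    def __init__(self, rules: List[Rule], default_model: str) -> None:
        self.rules = rules
        self.default_model = default_model

    def select_model(self, query: str,
                     context: Optional[Dict[str, int]] = None) -> str:
        ctx = context or {}
        for rule in self.rules:  # first matching rule wins
            model = rule(query, ctx)
            if model is not None:
                return model
        return self.default_model


router = SimpleRouter(
    rules=[short_query(100, "gpt-4o-mini"), tool_heavy("gpt-4o", 3)],
    default_model="gpt-4o",
)
choice = router.select_model("What is 2+2?", {"tool_count": 0})
```

With first-match ordering, rule order is part of the policy: in this sketch a short query that exposes five tools would still go to the cheap model, so the tool-heavy rule would need to come first if capability should take priority.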

Combining budget and router

from sagewai.admin.budget import cost_aware_rule

router = ModelRouter(
    rules=[
        cost_aware_rule(budget, agent_name="researcher"),
        short_query_rule(threshold=100, model="gpt-4o-mini"),
    ],
    default_model="gpt-4o",
)

When the researcher agent exceeds its budget, the router automatically switches to the first model in the fallback chain.


Choosing the Right Model

The biggest savings come from matching model to task:

Task Type               Recommended Model           Reason
Simple Q&A              gpt-4o-mini                 Fast, cheap, sufficient quality
Data extraction         gpt-4o-mini                 Pattern-matching, not creative
Research and retrieval  gpt-4o-mini                 Follows tool instructions well
Creative writing        claude-3-5-sonnet-20241022  Better prose quality
Complex reasoning       gpt-4o                      Strong multi-step logic
Code generation         gpt-4o                      Best code output
Summarization           gemini/gemini-2.0-flash     Fast, cheap, good at compression
Style checking          gpt-4o-mini                 Rule-based, grammar patterns

Multi-agent cost optimization

Use cheap models for intermediate steps, expensive models only where quality matters:

# Bulk research — cheap
researcher = UniversalAgent(name="researcher", model="gpt-4o-mini")

# Final writing — premium
writer = UniversalAgent(name="writer", model="claude-3-5-sonnet-20241022")

# Grammar check — cheap
proofreader = UniversalAgent(name="proofreader", model="gpt-4o-mini")

pipeline = SequentialAgent(
    name="article-pipeline",
    agents=[researcher, writer, proofreader],
)
# Only the writing step uses an expensive model
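
A back-of-the-envelope comparison shows why this split matters. The sketch uses the prices from the model cost reference; the per-step token counts are made-up illustrative values, and step_cost is a hypothetical helper.

```python
# USD per 1M tokens (input, output), from the model cost reference.
PRICES = {
    "gpt-4o-mini": (0.15, 0.60),
    "claude-3-5-sonnet-20241022": (3.00, 15.00),
}

# Hypothetical token usage per pipeline step: (input_tokens, output_tokens).
STEPS = {
    "researcher": (20_000, 2_000),
    "writer": (3_000, 1_500),
    "proofreader": (2_000, 1_500),
}


def step_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one pipeline step."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


mixed_models = {
    "researcher": "gpt-4o-mini",
    "writer": "claude-3-5-sonnet-20241022",
    "proofreader": "gpt-4o-mini",
}

mixed = sum(step_cost(mixed_models[s], *STEPS[s]) for s in STEPS)
all_premium = sum(step_cost("claude-3-5-sonnet-20241022", *STEPS[s]) for s in STEPS)
print(f"mixed: ${mixed:.4f}  all-premium: ${all_premium:.4f}")
```

Under these assumed token counts the research step dominates the all-premium total, which is why moving the high-volume step to a cheap model gives most of the saving.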

Token Controls

Limit output tokens

agent = UniversalAgent(
    name="concise-agent",
    model="gpt-4o",
    max_tokens=500,
    system_prompt="Be concise. Answer in 2-3 sentences maximum.",
)
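
Capping output tokens also bounds the output portion of each request's cost. A quick calculation, assuming the GPT-4o output price from the model cost reference ($10.00 per 1M tokens):

```python
GPT_4O_OUTPUT_USD_PER_1M = 10.00  # from the model cost reference
MAX_TOKENS = 500

# Worst case: the model uses the full output allowance on every request.
max_output_cost = MAX_TOKENS * GPT_4O_OUTPUT_USD_PER_1M / 1_000_000
print(f"At most ${max_output_cost:.4f} of output per request")
```

This bound covers output only. Input tokens (the prompt, history, and tool results) are billed separately and are not limited by max_tokens.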

Context compaction

Automatically compact the context window when it exceeds a threshold:

agent = UniversalAgent(
    name="efficient-agent",
    model="gpt-4o",
    max_context_tokens=4000,
)
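
How compaction is performed is not specified here. One common strategy is to keep the system prompt and drop the oldest turns until the conversation fits. The sketch below shows that approach with a crude estimate of four characters per token. Both the strategy and the names (compact, rough_tokens) are assumptions for illustration, not a description of Sagewai's actual behavior.

```python
from typing import Dict, List

Message = Dict[str, str]  # {"role": ..., "content": ...}


def rough_tokens(message: Message) -> int:
    """Crude estimate: about 4 characters per token, minimum 1."""
    return max(1, len(message["content"]) // 4)


def compact(messages: List[Message], max_context_tokens: int) -> List[Message]:
    """Drop the oldest non-system messages until the total fits."""
    system = [m for m in messages if m["role"] == "system"]
    history = [m for m in messages if m["role"] != "system"]
    total = sum(rough_tokens(m) for m in system + history)
    # Always keep the most recent message, even if it alone is too large.
    while len(history) > 1 and total > max_context_tokens:
        total -= rough_tokens(history.pop(0))
    return system + history
```

Dropping turns is the cheapest option but loses information. Summarizing older turns keeps more context at the price of an extra model call.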

TokenBudgetGuard

Set a hard cap on per-request cost:

from sagewai.safety.guardrails import TokenBudgetGuard

agent = UniversalAgent(
    name="budget-capped-agent",
    model="gpt-4o",
    guardrails=[TokenBudgetGuard(max_usd=1.0)],
)
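
The idea behind a per-request cap can be illustrated without the library: estimate the worst-case cost of a request before sending it, and refuse if the estimate exceeds the limit. This sketch shows the general technique only and makes no claim about how TokenBudgetGuard works internally; check_request and BudgetExceeded are hypothetical names.

```python
class BudgetExceeded(Exception):
    """Raised when a request's worst-case cost exceeds the cap."""


def check_request(input_tokens: int, max_output_tokens: int,
                  input_usd_per_1m: float, output_usd_per_1m: float,
                  max_usd: float) -> float:
    """Return the worst-case cost, or raise if it is above `max_usd`."""
    worst_case = (input_tokens * input_usd_per_1m
                  + max_output_tokens * output_usd_per_1m) / 1_000_000
    if worst_case > max_usd:
        raise BudgetExceeded(f"worst case ${worst_case:.2f} > cap ${max_usd:.2f}")
    return worst_case


# GPT-4o prices from the model cost reference; token counts are illustrative.
cost = check_request(8_000, 1_000, 2.50, 10.00, max_usd=1.0)
```

Checking before the call means an oversized request costs nothing, whereas a check after the call can only stop the next one.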

Monitoring Costs

Analytics API

Record and query costs per agent and per model:

from sagewai.admin.analytics import AnalyticsStore

store = AnalyticsStore()

store.record_cost(
    agent_name="researcher",
    model="gpt-4o",
    cost_usd=0.015,
    tokens=1500,
)

costs = store.get_costs()
print(f"Total cost: ${costs['total_cost_usd']:.2f}")
print(f"By model: {costs['by_model']}")
print(f"By agent: {costs['by_agent']}")

models = store.get_model_analytics()
for m in models:
    print(f"{m['model']}: ${m['cost_per_1k_tokens']:.4f}/1K tokens, "
          f"{m['request_count']} requests")
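
For intuition about what these aggregates contain, here is a tiny in-memory equivalent. MiniStore is a hypothetical stand-in, not part of Sagewai. It mirrors the field names used above and assumes cost_per_1k_tokens means total cost divided by total tokens, times 1,000.

```python
from collections import defaultdict
from typing import Dict, List


class MiniStore:
    def __init__(self) -> None:
        self.records: List[dict] = []

    def record_cost(self, agent_name: str, model: str,
                    cost_usd: float, tokens: int) -> None:
        self.records.append({"agent_name": agent_name, "model": model,
                             "cost_usd": cost_usd, "tokens": tokens})

    def get_costs(self) -> dict:
        """Sum recorded costs overall, per model, and per agent."""
        by_model: Dict[str, float] = defaultdict(float)
        by_agent: Dict[str, float] = defaultdict(float)
        for r in self.records:
            by_model[r["model"]] += r["cost_usd"]
            by_agent[r["agent_name"]] += r["cost_usd"]
        return {"total_cost_usd": sum(by_model.values()),
                "by_model": dict(by_model), "by_agent": dict(by_agent)}

    def cost_per_1k_tokens(self, model: str) -> float:
        """Blended cost per 1,000 tokens for one model (0.0 if unused)."""
        rows = [r for r in self.records if r["model"] == model]
        tokens = sum(r["tokens"] for r in rows)
        cost = sum(r["cost_usd"] for r in rows)
        return 1000 * cost / tokens if tokens else 0.0


store = MiniStore()
store.record_cost("researcher", "gpt-4o", cost_usd=0.015, tokens=1500)
store.record_cost("writer", "gpt-4o", cost_usd=0.030, tokens=2500)
costs = store.get_costs()
```

A blended per-1K figure like this mixes input and output tokens, so it is best used to compare models on the same workload rather than against list prices.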

REST API

Mount the analytics router for HTTP access:

from sagewai.admin.analytics import create_analytics_router

router = create_analytics_router(store)
app.include_router(router, prefix="/analytics")

# GET /analytics/costs?agent_name=researcher
# GET /analytics/models
# GET /analytics/agents

Full Example

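A setup combining all four strategies. The snippet assumes it runs inside an async function and that user_message, tools, estimated_cost, and estimated_tokens are provided by your application: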

from sagewai.engines.universal import UniversalAgent
from sagewai.admin.budget import BudgetManager, BudgetLimit
from sagewai.admin.analytics import AnalyticsStore
from sagewai.core.model_router import ModelRouter, short_query_rule
from sagewai.admin.budget import cost_aware_rule
from sagewai.safety.guardrails import TokenBudgetGuard

analytics = AnalyticsStore()

budget = BudgetManager()
budget.add_limit(BudgetLimit(
    agent_name="assistant",
    max_daily_usd=10.0,
    max_monthly_usd=200.0,
    action="throttle",
    fallback_chain=["gpt-4o-mini", "gemini/gemini-2.0-flash"],
))

router = ModelRouter(
    rules=[
        cost_aware_rule(budget, agent_name="assistant"),
        short_query_rule(threshold=100, model="gpt-4o-mini"),
    ],
    default_model="gpt-4o",
)

model = router.select_model(user_message, context={"tool_count": len(tools)})

agent = UniversalAgent(
    name="assistant",
    model=model,
    guardrails=[TokenBudgetGuard(max_usd=0.50)],
    max_context_tokens=4000,
)

response = await agent.chat(user_message)
budget.record_spend(agent_name="assistant", cost_usd=estimated_cost)
analytics.record_cost(
    agent_name="assistant",
    model=model,
    cost_usd=estimated_cost,
    tokens=estimated_tokens,
)

Checklist

  1. Use gpt-4o-mini or gemini-2.0-flash for simple tasks.
  2. Set max_tokens to limit response length.
  3. Enable max_context_tokens for automatic compaction.
  4. Configure budget limits with fallback chains.
  5. Use the model router to select models automatically.
  6. Add TokenBudgetGuard for per-request cost caps.
  7. Record costs via the Analytics API.
  8. Review per-model analytics monthly and adjust model selection.