Cost Management
This guide covers the four main tools for controlling LLM spend: budget limits, model fallback chains, cost-aware routing, and token controls. Use the Admin Panel for a UI over the same data.
Model cost reference
| Model | Cost per 1M input tokens | Cost per 1M output tokens |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o-mini | $0.15 | $0.60 |
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Gemini 2.0 Flash | $0.075 | $0.30 |
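A GPT-4o agent processing 100 requests/day at 2K tokens each handles 200K tokens/day, which costs between $0.50 (if all tokens are input) and $2.00 (if all are output) at the prices above. In multi-agent workflows, costs compound across every step.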
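A daily estimate like this can be derived directly from the table. The sketch below is a minimal calculator, assuming the list prices above and a hypothetical split of each 2K-token request into 1,500 input and 500 output tokens. The names PRICES_PER_1M, request_cost, and daily_cost are illustrative and not part of sagewai.

```python
# Prices in USD per 1M tokens as (input, output), copied from the table above.
PRICES_PER_1M = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "claude-3-5-sonnet": (3.00, 15.00),
    "gemini-2.0-flash": (0.075, 0.30),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the table's list prices."""
    input_price, output_price = PRICES_PER_1M[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def daily_cost(model: str, requests_per_day: int,
               input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a day's traffic with a fixed per-request shape."""
    return requests_per_day * request_cost(model, input_tokens, output_tokens)


if __name__ == "__main__":
    # Assumed split: 1,500 input + 500 output tokens per request.
    for name in PRICES_PER_1M:
        print(f"{name}: ${daily_cost(name, 100, 1500, 500):.4f}/day")
```

Because output tokens cost several times more than input tokens for every model in the table, the input/output split matters as much as the total token count.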
Budget Limits
Set daily and monthly spending limits per agent:
from sagewai.admin.budget import BudgetManager, BudgetLimit

budget = BudgetManager()

budget.add_limit(BudgetLimit(
    agent_name="researcher",
    max_daily_usd=5.0,
    max_monthly_usd=100.0,
    action="throttle",  # warn, throttle, or stop
))

budget.add_limit(BudgetLimit(
    agent_name="writer",
    max_daily_usd=10.0,
    max_monthly_usd=200.0,
    action="stop",
))
Recording spend
Call record_spend after each LLM call:
budget.record_spend(agent_name="researcher", cost_usd=0.015)
Checking budget before a run
result = budget.check_budget("researcher")
if result.allowed:
    response = await agent.chat(message)
else:
    print(f"Budget exceeded: {result.reason}")
    # Use a fallback model, queue the request, or reject it
Monitoring spend
status = budget.get_budget_status("researcher")
print(f"Daily spend: ${status['daily_spend_usd']:.2f} / ${status['max_daily_usd']:.2f}")
print(f"Monthly spend: ${status['monthly_spend_usd']:.2f} / ${status['max_monthly_usd']:.2f}")
print(f"Daily remaining: ${status['daily_remaining_usd']:.2f}")
Budget actions
| Action | Behavior |
|---|---|
| warn | Logs a warning and allows the request |
| throttle | Falls back to a cheaper model |
| stop | Blocks the request |
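One way to picture the three actions is as a small decision function. The sketch below is standalone Python and not sagewai's implementation: Decision, decide, and the fallback_model parameter are hypothetical names, and treating "spent at or above the limit" as exceeded is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    allowed: bool          # may the request proceed at all?
    model: Optional[str]   # model to use, or None when blocked
    reason: str            # human-readable explanation


def decide(action: str, spent_usd: float, limit_usd: float,
           model: str, fallback_model: str) -> Decision:
    """Map a budget action to an outcome for one request."""
    if spent_usd < limit_usd:
        return Decision(True, model, "within budget")
    if action == "warn":
        # Over budget, but the request still runs on the original model.
        return Decision(True, model, "over budget (warning only)")
    if action == "throttle":
        # Over budget: keep serving, but on a cheaper model.
        return Decision(True, fallback_model, "over budget, using fallback")
    if action == "stop":
        return Decision(False, None, "over budget, request blocked")
    raise ValueError(f"unknown action: {action!r}")
```

Only stop ever rejects a request, so it is the only action that puts a hard ceiling on spend; warn and throttle trade that guarantee for availability.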
Model Fallback Chains
Pair action="throttle" with a fallback_chain to automatically step down to cheaper models when a budget is exceeded:
budget.add_limit(BudgetLimit(
    agent_name="researcher",
    max_daily_usd=5.0,
    max_monthly_usd=100.0,
    action="throttle",
    fallback_chain=["gpt-4o-mini", "gemini/gemini-2.0-flash"],
))
When the daily budget is exceeded:
- Try gpt-4o-mini (about 17x cheaper than GPT-4o at the prices above).
- If that is also over budget, try gemini/gemini-2.0-flash (about 33x cheaper than GPT-4o).
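The step-down order can be sketched as a walk along the chain. This is a standalone illustration under stated assumptions: next_fallback and the over_budget set are hypothetical, tracking "over budget" per model is an assumption made for the example, and the real get_fallback_model may behave differently.

```python
from typing import Iterable, Optional, Sequence


def next_fallback(chain: Sequence[str], over_budget: Iterable[str]) -> Optional[str]:
    """Return the first model in the chain that is not over budget, else None."""
    blocked = set(over_budget)
    for model in chain:
        if model not in blocked:
            return model
    return None  # chain exhausted: the caller must queue or reject the request


chain = ["gpt-4o-mini", "gemini/gemini-2.0-flash"]
print(next_fallback(chain, over_budget={"gpt-4o"}))
print(next_fallback(chain, over_budget={"gpt-4o", "gpt-4o-mini"}))
print(next_fallback(chain, over_budget=set(chain)))
```

Ordering the chain from most to least capable keeps quality as high as the budget allows at each step.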
Applying the fallback
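from sagewai.engines.universal import UniversalAgent

# Returns a cheaper model when the budget is exceeded, otherwise a falsy value
fallback = budget.get_fallback_model("researcher", current_model="gpt-4o")
if fallback:
    agent = UniversalAgent(name="researcher", model=fallback)
else:
    agent = UniversalAgent(name="researcher", model="gpt-4o")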
Model Router
Use ModelRouter to pick the cheapest model that fits each request, based on query characteristics:
from sagewai.core.model_router import ModelRouter, short_query_rule, tool_heavy_rule

router = ModelRouter(
    rules=[
        # Short queries go to the cheap model
        short_query_rule(threshold=100, model="gpt-4o-mini"),
        # Tool-heavy queries need a more capable model
        tool_heavy_rule(model="gpt-4o", min_tools=3),
    ],
    default_model="gpt-4o",
)

model = router.select_model(
    "What is 2+2?",
    context={"tool_count": 0},
)
# -> "gpt-4o-mini" (short query, no tools)
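To make the rule mechanics concrete, here is a minimal first-match-wins router in plain Python. It is a sketch of the general technique, not sagewai's ModelRouter: short_query, tool_heavy, and select_model are hypothetical names, and measuring the threshold in characters is an assumption.

```python
from typing import Any, Callable, Dict, List, Optional

# A rule inspects the query and context and returns a model name, or None to pass.
Rule = Callable[[str, Dict[str, Any]], Optional[str]]


def short_query(threshold: int, model: str) -> Rule:
    """Match queries shorter than `threshold` characters (an assumed unit)."""
    def rule(query: str, context: Dict[str, Any]) -> Optional[str]:
        return model if len(query) < threshold else None
    return rule


def tool_heavy(model: str, min_tools: int) -> Rule:
    """Match requests whose context reports at least `min_tools` tools."""
    def rule(query: str, context: Dict[str, Any]) -> Optional[str]:
        return model if context.get("tool_count", 0) >= min_tools else None
    return rule


def select_model(rules: List[Rule], default_model: str,
                 query: str, context: Dict[str, Any]) -> str:
    """Return the model from the first matching rule, else the default."""
    for rule in rules:
        choice = rule(query, context)
        if choice is not None:
            return choice
    return default_model
```

With first-match-wins evaluation, rule order is part of the policy: in this sketch a short query that also uses many tools goes to whichever rule is listed first.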
Combining budget and router
from sagewai.admin.budget import cost_aware_rule

router = ModelRouter(
    rules=[
        cost_aware_rule(budget, agent_name="researcher"),
        short_query_rule(threshold=100, model="gpt-4o-mini"),
    ],
    default_model="gpt-4o",
)
When the researcher agent exceeds its budget, the router automatically switches to the first model in the fallback chain.
Choosing the Right Model
The biggest savings come from matching model to task:
| Task Type | Recommended Model | Reason |
|---|---|---|
| Simple Q&A | gpt-4o-mini | Fast, cheap, sufficient quality |
| Data extraction | gpt-4o-mini | Pattern-matching, not creative |
| Research and retrieval | gpt-4o-mini | Follows tool instructions well |
| Creative writing | claude-3-5-sonnet-20241022 | Better prose quality |
| Complex reasoning | gpt-4o | Strong multi-step logic |
| Code generation | gpt-4o | Best code output |
| Summarization | gemini/gemini-2.0-flash | Fast, cheap, good at compression |
| Style checking | gpt-4o-mini | Rule-based, grammar patterns |
Multi-agent cost optimization
Use cheap models for intermediate steps, expensive models only where quality matters:
# Bulk research: cheap model
researcher = UniversalAgent(name="researcher", model="gpt-4o-mini")

# Final writing: premium model
writer = UniversalAgent(name="writer", model="claude-3-5-sonnet-20241022")

# Grammar check: cheap model
proofreader = UniversalAgent(name="proofreader", model="gpt-4o-mini")

pipeline = SequentialAgent(
    name="article-pipeline",
    agents=[researcher, writer, proofreader],
)
# Only the writing step uses an expensive model
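The saving from mixing models can be estimated with a few lines of arithmetic. The sketch below uses the list prices from the model cost reference table; the per-step token counts in STEPS are hypothetical and chosen only to illustrate the calculation.

```python
# USD per 1M tokens as (input, output), from the model cost reference table.
PRICES = {
    "gpt-4o-mini": (0.15, 0.60),
    "claude-3-5-sonnet-20241022": (3.00, 15.00),
}

# Hypothetical per-step token usage: (input_tokens, output_tokens).
STEPS = {
    "researcher": (8000, 2000),
    "writer": (3000, 1500),
    "proofreader": (2000, 1500),
}


def step_cost(model, tokens):
    """USD cost of one step at list prices."""
    (in_price, out_price), (in_tok, out_tok) = PRICES[model], tokens
    return (in_tok * in_price + out_tok * out_price) / 1_000_000


def pipeline_cost(assignment):
    """Sum step costs for a {step_name: model} assignment."""
    return sum(step_cost(model, STEPS[step]) for step, model in assignment.items())


mixed = pipeline_cost({
    "researcher": "gpt-4o-mini",
    "writer": "claude-3-5-sonnet-20241022",
    "proofreader": "gpt-4o-mini",
})
all_premium = pipeline_cost({step: "claude-3-5-sonnet-20241022" for step in STEPS})
print(f"mixed: ${mixed:.4f}  all-premium: ${all_premium:.4f}")
```

Under these assumed token counts the mixed pipeline costs roughly a third of the all-premium one, and almost all of the remaining cost sits in the writing step.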
Token Controls
Limit output tokens
agent = UniversalAgent(
    name="concise-agent",
    model="gpt-4o",
    max_tokens=500,
    system_prompt="Be concise. Answer in 2-3 sentences maximum.",
)
Context compaction
Automatically compact the context window when it exceeds a threshold:
agent = UniversalAgent(
    name="efficient-agent",
    model="gpt-4o",
    max_context_tokens=4000,
)
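The source does not say how compaction is performed. As one possible strategy, the sketch below drops the oldest non-system turns until the history fits the budget. Everything here is an assumption for illustration: estimate_tokens uses a rough 4-characters-per-token heuristic, and compact is a hypothetical helper, not sagewai's algorithm (which may summarize rather than truncate).

```python
from typing import Dict, List

Message = Dict[str, str]  # {"role": ..., "content": ...}


def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token."""
    return max(1, len(text) // 4)


def compact(messages: List[Message], max_context_tokens: int) -> List[Message]:
    """Keep system messages plus the newest turns that fit the token budget."""
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]
    budget = max_context_tokens - sum(estimate_tokens(m["content"]) for m in system)

    kept: List[Message] = []
    for message in reversed(others):      # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if cost > budget:
            break                         # everything older is dropped too
        kept.append(message)
        budget -= cost
    return system + list(reversed(kept))  # restore chronological order
```

Dropping old turns is cheap but loses information; a summarizing strategy keeps more context at the price of an extra model call.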
TokenBudgetGuard
Hard cap on per-request cost:
from sagewai.safety.guardrails import TokenBudgetGuard

agent = UniversalAgent(
    name="budget-capped-agent",
    model="gpt-4o",
    guardrails=[TokenBudgetGuard(max_usd=1.0)],
)
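A per-request cap can be reasoned about as a worst-case bound: input tokens are known before the call, and output tokens are bounded by max_tokens. The sketch below shows that check in isolation. worst_case_cost and within_cap are hypothetical helpers, the default prices are the GPT-4o figures from the cost table, and this is not a description of how TokenBudgetGuard works internally.

```python
def worst_case_cost(input_tokens: int, max_output_tokens: int,
                    input_price_per_1m: float, output_price_per_1m: float) -> float:
    """Upper bound on one request's USD cost if every output token is used."""
    return (input_tokens * input_price_per_1m
            + max_output_tokens * output_price_per_1m) / 1_000_000


def within_cap(input_tokens: int, max_output_tokens: int, max_usd: float,
               input_price_per_1m: float = 2.50,
               output_price_per_1m: float = 10.00) -> bool:
    """Return True when the worst-case cost stays at or below the cap."""
    cost = worst_case_cost(input_tokens, max_output_tokens,
                           input_price_per_1m, output_price_per_1m)
    return cost <= max_usd


if __name__ == "__main__":
    print(within_cap(4_000, 500, max_usd=1.0))       # a typical small request
    print(within_cap(300_000, 50_000, max_usd=1.0))  # a very large request
```

Checking the bound before the call avoids paying for a request that would only be rejected afterwards.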
Monitoring Costs
Analytics API
Record and query costs per agent and per model:
from sagewai.admin.analytics import AnalyticsStore

store = AnalyticsStore()

store.record_cost(
    agent_name="researcher",
    model="gpt-4o",
    cost_usd=0.015,
    tokens=1500,
)

costs = store.get_costs()
print(f"Total cost: ${costs['total_cost_usd']:.2f}")
print(f"By model: {costs['by_model']}")
print(f"By agent: {costs['by_agent']}")

models = store.get_model_analytics()
for m in models:
    print(f"{m['model']}: ${m['cost_per_1k_tokens']:.4f}/1K tokens, "
          f"{m['request_count']} requests")
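For intuition about what per-model analytics contain, the aggregation can be sketched in plain Python. The records list and the summarize helper are hypothetical and independent of AnalyticsStore; the field names simply mirror those used in the calls above.

```python
from collections import defaultdict
from typing import Dict, List

records = [  # hypothetical cost records
    {"agent_name": "researcher", "model": "gpt-4o", "cost_usd": 0.015, "tokens": 1500},
    {"agent_name": "researcher", "model": "gpt-4o-mini", "cost_usd": 0.001, "tokens": 2000},
    {"agent_name": "writer", "model": "gpt-4o", "cost_usd": 0.030, "tokens": 3000},
]


def summarize(rows: List[dict]) -> Dict[str, dict]:
    """Aggregate cost, tokens, and request count per model."""
    totals: Dict[str, dict] = defaultdict(
        lambda: {"cost_usd": 0.0, "tokens": 0, "request_count": 0})
    for row in rows:
        entry = totals[row["model"]]
        entry["cost_usd"] += row["cost_usd"]
        entry["tokens"] += row["tokens"]
        entry["request_count"] += 1
    for entry in totals.values():
        # Blended cost per 1K tokens across input and output; 0.0 if no tokens.
        tokens = entry["tokens"]
        entry["cost_per_1k_tokens"] = entry["cost_usd"] / tokens * 1000 if tokens else 0.0
    return dict(totals)


if __name__ == "__main__":
    for model, stats in summarize(records).items():
        print(f"{model}: ${stats['cost_per_1k_tokens']:.4f}/1K tokens, "
              f"{stats['request_count']} requests")
```

A blended per-1K figure hides the input/output split, so it is best used to compare models on similar workloads rather than to predict the cost of a single request.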
REST API
Mount the analytics router for HTTP access:
from sagewai.admin.analytics import create_analytics_router
router = create_analytics_router(store)
app.include_router(router, prefix="/analytics")
# GET /analytics/costs?agent_name=researcher
# GET /analytics/models
# GET /analytics/agents
Full Example
A complete setup combining all four strategies. The snippet assumes it runs inside an async function and that user_message, tools, estimated_cost, and estimated_tokens are supplied by your application:
from sagewai.engines.universal import UniversalAgent
from sagewai.admin.budget import BudgetManager, BudgetLimit, cost_aware_rule
from sagewai.admin.analytics import AnalyticsStore
from sagewai.core.model_router import ModelRouter, short_query_rule
from sagewai.safety.guardrails import TokenBudgetGuard

analytics = AnalyticsStore()
budget = BudgetManager()

# 1. Budget limit with a fallback chain
budget.add_limit(BudgetLimit(
    agent_name="assistant",
    max_daily_usd=10.0,
    max_monthly_usd=200.0,
    action="throttle",
    fallback_chain=["gpt-4o-mini", "gemini/gemini-2.0-flash"],
))

# 2. Cost-aware routing: the budget rule is listed first
router = ModelRouter(
    rules=[
        cost_aware_rule(budget, agent_name="assistant"),
        short_query_rule(threshold=100, model="gpt-4o-mini"),
    ],
    default_model="gpt-4o",
)

model = router.select_model(user_message, context={"tool_count": len(tools)})

# 3. Token controls: per-request cap and context compaction
agent = UniversalAgent(
    name="assistant",
    model=model,
    guardrails=[TokenBudgetGuard(max_usd=0.50)],
    max_context_tokens=4000,
)

response = await agent.chat(user_message)

# 4. Record spend for budgets and analytics
budget.record_spend(agent_name="assistant", cost_usd=estimated_cost)
analytics.record_cost(
    agent_name="assistant",
    model=model,
    cost_usd=estimated_cost,
    tokens=estimated_tokens,
)
Checklist
- Use gpt-4o-mini or gemini-2.0-flash for simple tasks.
- Set max_tokens to limit response length.
- Enable max_context_tokens for automatic compaction.
- Configure budget limits with fallback chains.
- Use the model router to select models automatically.
- Add TokenBudgetGuard for per-request cost caps.
- Record costs via the Analytics API.
- Review per-model analytics monthly and adjust model selection.