Observability and cost — see what your AI feature is doing and what it costs
This tutorial walks you through bringing up Sagewai's observability stack and reading the two dashboards that ship with it. After following it you will have live metrics and logs flowing from the admin backend into Grafana, a real-time mission view in the Sagewai HUD, and per-customer cost rollups your finance team can use without engineering help.
Before you start
- Docker (or a compatible runtime) and
docker compose. - A working Sagewai SDK install —
pip install sagewaior a local checkout. - The admin backend reachable on
localhost. If you have not run it before, see the Admin Panel guide. - About three minutes for the dashboards to fill once you start driving load.
What you will have running
Sagewai exposes telemetry through a single OpenTelemetry pipeline that feeds two surfaces:
- The Sagewai HUD — a mission-control view of agents on a graph, missions in flight, and fleet posture. Use it for demos, all-hands, and at-a-glance status.
- A Grafana board — request rates, p95 latencies, status-code distributions, OpenTelemetry pipeline health, and structured logs. Use it for SRE work and alerting.
Both read the same telemetry stream, so they cannot disagree.
Turn on observability
Start the bundled stack and a load-driver example. The dashboards fill within three minutes.
docker compose -f docker-compose.observability.yml up -d --build
python 43_observatory_live.py
The example (source) fires a mixed-tenant workload at the admin backend — real HTTP traffic, real OpenTelemetry spans. Open Grafana at http://localhost:3000 (login admin/admin) and the HUD at http://localhost:3001/hud.
Trace cost per customer
Every span Sagewai emits is tagged with sagewai.project_id, so you can break the bill down by tenant without manual reconciliation. To see the rollup end to end:
python 34_observatory_cost_tracking.py
The script (source) emits cost-tagged spans across two tenants and prints a per-project rollup at the end. The per-project breakdown is also built into the Grafana board — once you have run any workload, the dashboard URL is the deliverable.
Watch the fleet under load
For a heavier demo — 20+ workers, mixed workload, the HUD live across the full graph:
python 40_fleet_under_load.py
This example (source) produces a screen-recording-quality run: heavy enough to stress the dispatcher, sparse enough to stay readable on the graph.
Cap spend with budgets
Cost dashboards tell you what happened. Budgets stop runaway spend before it does. Per-user, per-team, and per-project caps are wired into the SDK:
python 12_budget_enforcement.py
The script (source) shows the foundation API. See the Cost management guide for the full configuration surface.
What the Grafana board shows
The board has five rows. Top to bottom, the story flows from "is the platform healthy" through "where is the latency" to "show me the actual log lines."
| Row | Panels | Use it when |
|---|---|---|
| 1 — System Health | Request Rate · Error Rate (5xx) · Active Requests · Avg Latency | "Is anything red right now?" |
| 2 — HTTP Performance | Per-route request rate · per-route p95 latency | Hunting a slow route |
| 3 — Requests by Status | Stacked status-code distribution · response-size p95 | Tracking 4xx/5xx mix |
| 4 — OTel Pipeline Health | Spans processed · log records sent · log queue size | Diagnosing missing telemetry |
| 5 — Application Logs | Live tail of structured admin events | Reading the audit trail |
Row 5 doubles as an audit feed because the admin backend emits business events as structured logs: agent.created, agent.run.*, auth.login.*, setup.completed, provider.test.*. Filter by project_id in the panel query box to scope to one tenant.
The metrics in rows 1–4 are standard OTel HTTP server instrumentation: http_server_duration_milliseconds (histogram), http_server_active_requests (gauge), http_server_response_size_bytes (histogram). Each is labelled with http_target (the route), http_status_code, and service_name="sagewai-admin".
Use your existing observability stack
The bundled compose file is a self-contained setup, but the OpenTelemetry collector is the only fixed seam. If you already run Datadog, Grafana Cloud, or your own Prometheus + Loki stack, point your collector at Sagewai's http_server_* metrics on :8889 and the dashboards work unchanged. The provisioning files in observability/grafana/ import directly into Grafana Cloud.
Sagewai does not host an observability tier. Your telemetry stays on your infrastructure unless you route it elsewhere.
How Sagewai keeps the dashboards honest
- One pipeline, two surfaces. The HUD and Grafana read the same OTel stream. There is no shadow data path or demo-only feed.
- Health-check noise filtered. Liveness probes are excluded from the logs pipeline so operator-readable signal stays signal.
- Histograms preserved end-to-end. Sagewai exports metrics through the OTel collector's Prometheus exporter on
:8889, scraped by VictoriaMetrics every 10 seconds. Histograms and counters arrive intact. Do not swap the exporter toprometheusremotewrite— it silently drops both.
What you're responsible for
- Retention and storage sizing. VictoriaMetrics and VictoriaLogs default to local volumes. Configure retention and disk for your traffic.
- Authentication on Grafana. The bundled compose file enables anonymous read access for convenience. Lock it down before exposing externally.
- Alerting rules. Sagewai ships dashboards, not paging policy. Build alerts on the panels that matter for your SLOs.
- Tagging hygiene. Per-tenant cost rollups depend on
sagewai.project_idbeing set on every request. Use the project selector or set theX-Project-IDheader on API calls.
Common issues
- Metrics missing in Grafana but the OTel collector logs look fine. If you swapped the exporter to
prometheusremotewrite, switch back. That exporter silently drops histograms and counters; Sagewai's pipeline relies on the Prometheus exporter on:8889plus VictoriaMetrics scraping. - HUD shows no agents but Grafana has traffic. The HUD reads the admin REST API directly, not the OTel stream. Confirm the admin backend is reachable from your browser.
- Per-project numbers all roll up under "unscoped". Spans are not being tagged. Set
X-Project-IDon the API client or pick a project in the admin sidebar before driving traffic. - Custom metrics do not appear in the dashboards. Emit them through OTLP, not a sidecar Prometheus. The collector is the only seam the dashboards read.
See also
- Observatory overview — the HUD and Grafana surfaces in detail.
- Cost management guide — budget caps, meters, and per-team attribution.
- Production multitenancy — the per-project tagging that powers cost rollups.
- Train your own model — once cost is visible, it is a target you can shrink.
- Admin Panel — where Audit, Runs, and Profiles live.