Inference
Sagewai is the platform for the senior engineer who got the "add AI to the product this quarter" email and now has an Anthropic bill that doubled twice. The first quarter you ship on Opus or GPT-5 because they work and your deadline is real. By Q3 your CFO asks why the API bill quadrupled and you need a credible answer that does not require a research team.
This section is the answer. It walks the inference spectrum — five tiers of where to run model inference, from a free Colab T4 to a production-grade Modal serverless endpoint — and tells you exactly when each tier wins.
The arc
Bootstrap with the juggernauts. Capture their answers. Train your own model. Deploy locally. Never pay per-token again.
That is the Sagewai Training Loop pillar in one sentence. Every page in this section makes one part of it concrete:
| Page | What it answers |
|---|---|
| This page | Which tier wins for which job, with prices |
| Start with juggernauts | Why Opus and GPT-5 are the right starting point in Q1 |
| Free CUDA via Colab | How to train your first SLM on a free Tesla T4 |
| Rent when you grow | RunPod, Vast.ai, Modal — when to pick which |
| Deploy locally | Ollama and LiteLLM, "never pay per-token again" explained |
The five tiers
We split the GPU-provisioning landscape into five tiers because no single provider wins every job. Pricing models matter as much as hardware: a per-second serverless function and a 24-hour bare-metal rental are different shapes for different problems.
Tier 0 — local (your laptop)
Ollama or LM Studio running on the developer's own machine. Free. Adequate for development, small-model inference, and "make me forget I'm not on Anthropic" work. Limited by the laptop's GPU (or lack of one). Already shipped — see Example 18 — local LLM routing.
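For a concrete taste of this tier, the sketch below queries a local Ollama server through its OpenAI-compatible API. The model name is an assumption; substitute whatever you have pulled.

```python
# Minimal sketch: querying a local Ollama server through its
# OpenAI-compatible endpoint. Assumes `ollama serve` is running and a
# small model (here llama3.2, an illustrative choice) has been pulled.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible API
    api_key="ollama",                      # any non-empty string; Ollama ignores it
)

resp = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Summarise this ticket in one line."}],
)
print(resp.choices[0].message.content)
```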
Tier 1 — free CUDA via Google Colab
Anyone with a Google account gets a Tesla T4 (16GB) for free, with sessions capped at roughly 12 hours. Sagewai's Example 44 orchestrates Colab via Drive-sync: it ships a notebook template, you open it once, the notebook runs the fine-tune on the free T4, and the orchestrator picks the LoRA up from Drive. Zero dollars.
The most important page in this section is the one that walks you through this — see Free CUDA via Colab.
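The handoff pattern is simple enough to sketch. The snippet below is not Example 44, just a minimal illustration of the orchestrator side: poll the locally synced Drive folder until the notebook drops its LoRA adapter. The folder and file names are assumptions.

```python
# Sketch of the Drive-sync handoff, assuming a local Drive-sync mount.
# The Colab notebook writes its LoRA adapter into a synced folder; the
# orchestrator on your machine polls that folder until the file lands.
import time
from pathlib import Path

DRIVE_SYNC = Path.home() / "GoogleDrive" / "sagewai-runs"   # assumed mount point
ADAPTER = DRIVE_SYNC / "run-001" / "adapter_model.safetensors"  # assumed name

def wait_for_lora(poll_seconds: int = 60, timeout_hours: int = 12) -> Path:
    """Block until the Colab fine-tune drops its LoRA adapter into Drive."""
    deadline = time.time() + timeout_hours * 3600
    while time.time() < deadline:
        if ADAPTER.exists():
            return ADAPTER
        time.sleep(poll_seconds)
    raise TimeoutError("Colab session likely expired before the fine-tune finished")

print(f"LoRA ready at {wait_for_lora()}")
```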
Tier 2 — Vast.ai (the budget aggregator)
Marketplace tier. Mature Python CLI (vastai), per-host reliability scoring (uptime history, internet speed, dlperf benchmarks), wide GPU selection from consumer RTX 3090s to datacentre A100s.
- 24GB GPUs: $0.20-$0.45/hr (RTX 3090 / 4090, varies by host)
- A100 80GB: $0.80-$1.60/hr
- Best for: the cheapest sustained cost for batch fine-tunes; per-host reliability scoring lets you avoid flaky machines
See Example 45 — Vast.ai marketplace bidding.
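For a sense of what that automation looks like, here is a hedged sketch of driving the vastai CLI from Python. The query filters, ordering flag, and output fields follow the CLI's search syntax as we understand it; verify them against `vastai search offers --help` before relying on them.

```python
# Illustrative sketch only: find the cheapest reliable RTX 3090 offer.
# Flags and field names are assumptions to check against the vastai CLI docs.
import json
import subprocess

def cheapest_3090(min_reliability: float = 0.98) -> dict:
    """Return the cheapest RTX 3090 offer above a reliability floor."""
    out = subprocess.run(
        [
            "vastai", "search", "offers",
            f"gpu_name=RTX_3090 reliability>{min_reliability} num_gpus=1",
            "-o", "dph",   # order by dollars-per-hour, ascending
            "--raw",       # JSON output instead of the table view
        ],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)[0]

offer = cheapest_3090()
print(f"offer {offer['id']}: ${offer['dph_total']:.3f}/hr")
```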
Tier 3 — RunPod (the gold standard for CLI automation)
The reliability tier. Bare-metal access, root permission, mature runpodctl CLI plus Python SDK.
- RTX 4090 (24GB): ~$0.34/hr
- RTX 5090 (32GB): ~$0.69/hr
- Best for: the default "this just works" tier; the example most operators copy-paste
See Example 47 — RunPod fine-tune orchestration.
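A rough sketch of the same idea with the runpod Python SDK (`pip install runpod`). The image and gpu_type_id strings are assumptions to check against your account; `create_pod` and `terminate_pod` are real SDK calls.

```python
# Hedged sketch: spin up a 4090 pod, run the job, stop the meter.
# Image tag and gpu_type_id below are illustrative assumptions.
import os
import runpod

runpod.api_key = os.environ["RUNPOD_API_KEY"]

pod = runpod.create_pod(
    name="sagewai-finetune",
    image_name="runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04",
    gpu_type_id="NVIDIA GeForce RTX 4090",
    gpu_count=1,
    volume_in_gb=50,   # persistent disk for dataset + checkpoints
)
print(f"pod {pod['id']} starting at ~$0.34/hr")

# ... run the fine-tune via SSH or a startup script, then:
runpod.terminate_pod(pod["id"])   # stop billing the moment training ends
```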
Tier 4 — Modal (serverless inference)
Fundamentally different. You do not provision a server — you decorate a Python function and Modal provisions on demand. Per-second billing, cold-starts measured in seconds, not minutes.
- A10G 24GB: ~$0.0006/sec when warm
- Best for: production inference, not training. Sub-second autoscaling without idle-GPU spend.
See Example 48 — Modal serverless inference.
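A minimal sketch of that shape, not Example 48 itself; the model and generation settings are placeholders.

```python
# Decorate a function; Modal provisions the A10G on demand and bills
# per second while it runs. Model choice below is illustrative.
import modal

app = modal.App("sagewai-inference")
image = modal.Image.debian_slim().pip_install("transformers", "torch", "accelerate")

@app.function(gpu="A10G", image=image)
def generate(prompt: str) -> str:
    from transformers import pipeline   # imported inside the container
    pipe = pipeline(
        "text-generation", model="Qwen/Qwen2.5-0.5B-Instruct", device_map="auto"
    )
    return pipe(prompt, max_new_tokens=64)[0]["generated_text"]

@app.local_entrypoint()
def main():
    # Nothing is billed while the function sits idle.
    print(generate.remote("Summarise the five inference tiers."))
```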
Plus: bring your own endpoint
Already running inference on a vendor we do not ship? Wrap any OpenAI-compatible HTTP endpoint as a Sagewai tool or LLM backend with Example 46. The framework is bring-your-own; we do not lock you in.
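The pattern is the standard OpenAI client with a base_url override; the endpoint URL and model name below are placeholders, and how Sagewai registers the backend is Example 46's job.

```python
# Any OpenAI-compatible HTTP endpoint works through the stock openai
# client. URL, env var name, and model name are placeholders.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.your-vendor.example/v1",  # placeholder endpoint
    api_key=os.environ["YOUR_VENDOR_API_KEY"],            # placeholder key name
)

resp = client.chat.completions.create(
    model="your-deployed-model",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```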
The five-tier comparison table
| Provider | Best for | 24GB GPU price | CLI maturity | Example |
|---|---|---|---|---|
| Local (Ollama) | Dev, small-model inference | $0 | mature | Ex 18 |
| Colab + Drive-sync | Free CUDA, education, first SLM fine-tune | $0 | drive-sync wrapper | Ex 44 |
| Vast.ai | Lowest cost, long soaks, per-host reliability scoring | $0.20-$0.45/hr | mature | Ex 45 |
| RunPod | Reliable automation, the default tier | ~$0.34/hr (RTX 4090) | gold standard | Ex 47 |
| Modal | Production inference, serverless | ~$0.0006/sec (A10G, per-second billing) | gold standard | Ex 48 |
| Custom endpoint | Bring your own — vendor-agnostic plugin | varies | n/a | Ex 46 |
The under-$25 launch demo budget
A senior engineer with a corporate card and one weekend can walk this entire arc for under $25. That is not a marketing claim; it is the sum of the realistic spend on each tier:
| Step | Tier | Cost |
|---|---|---|
| Hobby experiment | Tier 1 (Colab free) | $0 |
| First production fine-tune | Tier 3 (RunPod RTX 4090, 4 hours) | $1.36 |
| Big production fine-tune | Tier 2 (Vast.ai RTX 3090, 8 hours) | ~$1.60-$3.20 |
| Production serving | Tier 4 (Modal A10G, ~10K inferences/month) | ~$5-15/month |
| Full demo | All four tiers | under $25 |
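Summing the top of each range ($0 + $1.36 + $3.20 + $15) puts the worst case at $19.56, comfortably under the $25 bar.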
Every example linked from these pages is runnable. None require a paid LLM key to walk through in stub mode. Live mode needs a vendor key in ~/.sagewai/.env; the example tells you which one and what it costs.
Where to start
If you have never fine-tuned a model before: read Free CUDA via Colab. It is the page that removes the last excuse — you do not need a corporate card, you do not need a GPU, you need a Google account and 60 seconds.
If you already know fine-tuning works and you want the cheapest production path: read Rent when you grow.
If you want to skip provisioning altogether and just plug your existing inference endpoint into Sagewai: see Example 46.
Anti-patterns
- Treating one tier as the answer for every workload. The tiers are different shapes for different problems. A 12-hour batch fine-tune does not belong on Modal (per-second billing compounds); a 200ms inference call does not belong on a Vast.ai bare-metal rental (provisioning latency dominates).
- Skipping the bootstrap step. Starting with Opus or GPT-5 in Q1 is the right call, not a compromise. Capturing their answers via the Curator is what makes the Q3 cost-down credible. Read Start with juggernauts.
- Renting before measuring. The Sagewai Observatory cost dashboard tells you which workflow is the actual API-bill driver. Fine-tune that first.
- Locking yourself to one provider. Every example here is a swappable orchestration script, not a vendor SDK wrapper. Optionality is the brand.
Cross-references
- Training Loop pillar — the SDK surface this section uses
- Observatory — where the before-and-after cost numbers come from
- Self-Learning Agents — the broader conceptual frame