Inference

Choose where to run model inference for a Sagewai agent — from a free Colab T4 to a serverless production endpoint. This page maps workload shapes to tiers, gives you realistic price estimates, and points you to the runnable example for each path.

Before you start

  • pip install sagewai and a Python 3.10+ environment.
  • A ~/.sagewai/.env file for vendor keys when you move to live mode.
  • A rough sense of your workload: dev loop, batch fine-tune, or production serving.

Pick a tier

The right provider depends on the shape of the job, not just the headline price. Use this table to match workload to tier:

WorkloadPickWhy
Dev loop, small-model inference on your laptopLocal (Ollama / LM Studio)Free, instant, no network
First fine-tune, no budget, no GPUColab free T4Free CUDA via a Google account
Long batch fine-tune at lowest costVast.aiMarketplace pricing with per-host reliability scores
Reliable fine-tune, root accessRunPodMature runpodctl, predictable per-hour pricing
Production inference with autoscalingModalPer-second billing, no idle GPU spend
You already have an inference endpointCustomWrap any OpenAI-compatible URL as a Sagewai backend

Tiers in detail

Local — your laptop

Ollama or LM Studio running on your own machine. Free. Adequate for development and small-model inference during the dev loop. Limited to the laptop's GPU (or CPU fallback). See Example 18 — local LLM routing.

Free CUDA via Google Colab

A Google account gets you a Tesla T4 (16 GB) for free, with a session limit of about 12 hours. Example 44 orchestrates Colab via Drive-sync: upload a notebook template, open it once in your browser, the notebook runs the fine-tune on the free T4, the orchestrator picks up the LoRA from Drive. Cost: $0. This is the fastest path to a first fine-tune if you don't have hardware. See Free CUDA via Colab for the walkthrough.

Vast.ai — GPU marketplace

A GPU marketplace with a mature Python CLI (vastai) and per-host reliability scoring (uptime history, internet speed, dlperf benchmarks). Wide GPU selection from consumer RTX 3090s to datacentre A100s.

  • 24 GB GPUs: $0.20–$0.45/hr (RTX 3090 / 4090, varies by host)
  • A100 80 GB: $0.80–$1.60/hr
  • Best for: sustained batch fine-tunes where you want the lowest cost-per-hour and are willing to filter for reliable hosts

See Example 45 — Vast.ai marketplace bidding.

RunPod — the default rental tier

Bare-metal access, root permission, mature runpodctl CLI plus Python SDK.

  • RTX 4090 (24 GB): ~$0.34/hr
  • RTX 5090 (32 GB): ~$0.69/hr
  • Best for: the default rental option for fine-tunes; most operators copy the RunPod example first

See Example 47 — RunPod fine-tune orchestration.

You don't provision a server — you decorate a Python function and Modal provisions on demand. Per-second billing; cold-starts measured in seconds rather than minutes.

  • A10G 24 GB: ~$0.0006/sec when warm
  • Best for: production inference, not training. Autoscales without leaving idle GPU spend.

See Example 48 — Modal serverless inference.

Bring your own endpoint

Running inference on a vendor Sagewai doesn't ship with? Wrap any OpenAI-compatible HTTP endpoint as a Sagewai LLM backend using Example 46. The framework is vendor-agnostic.

Comparison at a glance

ProviderBest for24 GB GPU priceCLI maturityExample
Local (Ollama)Dev, small-model inference$0matureEx 18
Colab + Drive-syncFree CUDA, first SLM fine-tune$0drive-sync wrapperEx 44
Vast.aiLowest cost, long batch runs$0.20–$0.45/hrmatureEx 45
RunPodReliable automation, default rental$0.34–$0.70/hrgold standardEx 47
ModalProduction inference, serverlessper-secondgold standardEx 48
Custom endpointBring your own — vendor-agnosticvariesn/aEx 46

Realistic budget for the full arc

Walking every tier — free experiment, first fine-tune, larger fine-tune, production serving — stays under $25:

StepTierCost
Hobby experimentColab free T4$0
First production fine-tuneRunPod RTX 4090, 4 hours$1.36
Big production fine-tuneVast.ai RTX 3090, 8 hours~$1.60–$3.20
Production servingModal A10G, ~10K inferences/month~$5–15/month
Full demoAll four tiersunder $25

Every example linked from these pages is runnable. None require a paid LLM key in stub mode. Live mode needs a vendor key in ~/.sagewai/.env; each example tells you which key it needs and what it costs.

Suggested progression

Most teams move through four stages:

  1. Start on a frontier model (Opus, GPT-5, Sonnet) to ship in Q1. See Start with the big providers.
  2. Capture the answers. Sagewai records prompts and responses you can curate into a training dataset.
  3. Train your own small model on free Colab or a cheap rental. See Free CUDA via Colab.
  4. Deploy it locally and stop paying per token. See Deploy locally.

The shorthand: start on a frontier model, capture its answers, train your own model, deploy locally.

Where to start

Anti-patterns

  1. Using one tier for every workload. A 12-hour batch fine-tune doesn't belong on Modal (per-second billing compounds); a 200ms inference call doesn't belong on a Vast.ai bare-metal rental (provisioning latency dominates).

  2. Skipping the frontier-model bootstrap. Starting on a capable frontier model in Q1 is the right call, not a compromise. Capturing those answers into a training dataset is what makes cost reduction credible later. Read Start with the big providers.

  3. Renting before measuring. The Observatory cost dashboard shows you which workflow is driving your API bill. Fine-tune that workload first.

  4. Locking yourself to one provider. Every example here is a swappable orchestration script, not a vendor SDK wrapper. Keep it that way.

See also