Inference deployment — pick a GPU path that fits your job
This tutorial helps you choose where to run a Sagewai-trained model in production. After reading it you will know which GPU path matches your workload — free, bid-priced, reliable rental, serverless, bring-your-own, or local — and you will have a runnable example for each one.
This page covers the deploy half of the loop. For the full capture-to-deploy sequence, start with Train your own model.
Before you start
- A working Sagewai install (`pip install sagewai`).
- A captured training dataset under `~/.sagewai/training/` if you plan to fine-tune. See Train your own model.
- Vendor accounts only for the paths you actually use (Google for Colab, Vast.ai, RunPod, or Modal). The local and bring-your-own paths need none.
How the paths fit together
Every path produces the same artifact — a LoRA adapter — and feeds the same Sagewai SDK surface (UniversalAgent). Pick a path by job shape, not by vendor brand.
Pick a path
| Path | When it wins | Cost | Companion example |
|---|---|---|---|
| Local laptop | Development, local-mode inference, identical SDK surface to cloud | Free (existing hardware) | 18 — local LLM routing |
| Free Colab T4 | First fine-tune, no GPU at home, no vendor account | $0 | 44 — Colab free CUDA |
| Vast.ai bid | Cost-sensitive iterations, willing to handle interruptions | ~$0.20–0.45/hr | 45 — Vast.ai marketplace bid |
| RunPod rental | Production fine-tunes, SLA and cleanup-on-failure required | ~$0.34–0.70/hr | 47 — RunPod orchestration |
| Modal serverless | Bursty inference traffic, pay only when hot | Per-second (~$0.0006/s A10G) | 48 — Modal serverless inference |
| Bring your own | On-prem, air-gapped, customer-hosted, vendor lock-out | Varies | 46 — custom inference as tool |
| Local Ollama | Production serving on commodity hardware | Free (existing VPS) | 18 — local LLM routing |
| Local mlx-lm | Production serving on Apple Silicon Mac shops | Free (existing hardware) | 38a — mlx_lm.server deploy |
For longer-form vendor analysis, see the Inference education section — a five-page walkthrough of the full spectrum.
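The table can be read as a small decision procedure. The sketch below is one possible encoding of it, not part of the Sagewai SDK: the `Job` dataclass, its field names, and `pick_path` are all hypothetical, and the ordering of the checks is an assumption about which concern should win when several apply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical description of a workload; not a Sagewai type."""
    kind: str                             # "train" or "serve"
    must_stay_on_prem: bool = False       # air-gapped or customer-hosted
    has_budget: bool = True               # False means the path must cost $0
    tolerates_interruptions: bool = False
    bursty_traffic: bool = False
    apple_silicon: bool = False


def pick_path(job: Job) -> str:
    """Return the name of the table row that best matches the job shape."""
    # Data-residency constraints override every cost consideration.
    if job.must_stay_on_prem:
        return "Bring your own"
    if job.kind == "train":
        if not job.has_budget:
            return "Free Colab T4"
        # Bid pricing is cheaper, but only if a lost host is acceptable.
        return "Vast.ai bid" if job.tolerates_interruptions else "RunPod rental"
    # Everything below is a serving path.
    if job.bursty_traffic:
        return "Modal serverless"
    return "Local mlx-lm" if job.apple_silicon else "Local Ollama"


print(pick_path(Job(kind="train", has_budget=False)))
print(pick_path(Job(kind="serve", bursty_traffic=True)))
```

Encoding the choice as data keeps the decision reviewable: if your team weighs interruptions or budget differently, change the order of the checks rather than the vendor list.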
Run a path end-to-end
Each example below runs standalone. Each has a stub mode that prints the orchestration plan without provisioning anything, so you can sanity-check the wiring before spending money.
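To make the stub-mode idea concrete, here is a minimal sketch of how a script might switch between printing a plan and provisioning, based on whether the vendor key is set. The function names and the step list are hypothetical; the companion examples may structure this differently.

```python
import os


def plan_steps(vendor: str) -> list[str]:
    """Hypothetical orchestration plan; the real examples may differ."""
    return [
        f"provision a GPU on {vendor}",
        "upload the training dataset",
        "run the fine-tune",
        "download the LoRA adapter",
        f"tear down the {vendor} GPU",
    ]


def run(vendor: str, key_var: str, env=None) -> str:
    """Print the plan and return 'stub' when no API key is set."""
    env = os.environ if env is None else env
    steps = plan_steps(vendor)
    if not env.get(key_var):
        # Stub mode: nothing is provisioned, so nothing is billed.
        for number, step in enumerate(steps, start=1):
            print(f"[stub] {number}. {step}")
        return "stub"
    # Live mode would call the vendor API here; omitted in this sketch.
    print(f"[live] would execute {len(steps)} steps on {vendor}")
    return "live"


# Passing an empty mapping simulates a shell with no key exported.
run("RunPod", "RUNPOD_API_KEY", env={})
```

Keying stub mode off the missing credential means a first run on a fresh machine cannot spend money by accident.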
Free path — Colab T4
python 44_colab_free_cuda.py
The script Drive-syncs the dataset to a Colab notebook, fine-tunes on the free T4, and Drive-syncs the LoRA back. No vendor account beyond a Google login. Sessions disconnect at ~12 hours, and T4 memory caps the model size at ~7B.
Cheapest path — Vast.ai bid
export VASTAI_API_KEY=...
python 45_vastai_marketplace_bid.py
The script bids the cheapest A10G with a reliability filter and prints the preempted-host risk. Always filter reliability>0.9 before renting — the marketplace includes hosts that disappear mid-job.
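The reliability filter amounts to "drop unreliable hosts first, then take the cheapest". The sketch below shows that ordering on plain dictionaries. The offer fields (`host`, `usd_per_hour`, `reliability`) and the sample values are made up for illustration and are not the Vast.ai API response shape.

```python
from typing import Optional


def cheapest_reliable(offers: list[dict], min_reliability: float = 0.9) -> Optional[dict]:
    """Return the cheapest offer above the reliability floor, or None."""
    # Filter before sorting by price: the cheapest host overall is often
    # cheap precisely because it drops jobs.
    eligible = [o for o in offers if o["reliability"] > min_reliability]
    return min(eligible, key=lambda o: o["usd_per_hour"], default=None)


# Illustrative offers only; prices sit inside the range quoted in the table.
offers = [
    {"host": "a", "usd_per_hour": 0.20, "reliability": 0.82},
    {"host": "b", "usd_per_hour": 0.27, "reliability": 0.97},
    {"host": "c", "usd_per_hour": 0.45, "reliability": 0.99},
]

best = cheapest_reliable(offers)
print(best["host"] if best else "no reliable host available")
```

Returning `None` when nothing passes the filter lets the caller decide whether to wait, raise the price ceiling, or fall back to a rental path.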
Reliable path — RunPod
export RUNPOD_API_KEY=...
python 47_runpod_finetune_orchestration.py
24 GB GPU, cleanup-on-failure, SLA-grade. The script brings up the pod, runs the fine-tune, downloads the LoRA, and tears the pod down, all idempotently. A typical 4-hour fine-tune on a 4090 costs about $1.36 (4 hours at roughly $0.34/hr).
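Cleanup-on-failure is usually a `try`/`finally` around the rental. The sketch below wraps that pattern in a context manager. The `create` and `destroy` callables are stand-ins for whatever the vendor client provides; they are hypothetical and not RunPod API calls.

```python
from contextlib import contextmanager


@contextmanager
def rented_pod(create, destroy):
    """Yield a pod id and always destroy the pod afterwards."""
    pod_id = create()
    try:
        yield pod_id
    finally:
        # Runs on success, on exceptions, and on KeyboardInterrupt,
        # so a crashed fine-tune cannot leave a pod billing.
        destroy(pod_id)


# Fake vendor calls that only record what happened.
log = []


def fake_create():
    log.append("create")
    return "pod-1"


def fake_destroy(pod_id):
    log.append(f"destroy {pod_id}")


try:
    with rented_pod(fake_create, fake_destroy) as pod:
        raise RuntimeError("fine-tune crashed")
except RuntimeError:
    pass

print(log)  # ['create', 'destroy pod-1']
```

For the teardown to be safe to repeat, the real `destroy` call should treat "pod already gone" as success rather than as an error.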
Serverless path — Modal
modal token new
python 48_modal_serverless_inference.py
Per-second billing on an A10G. Cold start is around 9 s on a T4 once the function image is cached; warm calls take around 281 ms. A real run logged 10.10 s for $0.002147 on a T4. Use Modal for serving, not for long batch fine-tunes: at the ~$0.0006/s A10G rate, a 12-hour batch bills all 43,200 of its seconds (about $26), so you pay the serverless premium with no idle time to scale away.
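The arithmetic behind that advice can be checked directly from the rates quoted on this page. The sketch below is a rough comparison only: it assumes the table's approximate prices, ignores storage and egress, and treats the two GPUs as interchangeable for the job.

```python
# Rates quoted on this page; treat them as approximate list prices.
MODAL_A10G_USD_PER_SECOND = 0.0006
RUNPOD_USD_PER_HOUR_LOW = 0.34
RUNPOD_USD_PER_HOUR_HIGH = 0.70


def per_second_cost(seconds: float, usd_per_second: float) -> float:
    """Cost of a job billed by the second."""
    return seconds * usd_per_second


def hourly_cost(hours: float, usd_per_hour: float) -> float:
    """Cost of a job billed by the hour."""
    return hours * usd_per_hour


batch_hours = 12
modal_batch = per_second_cost(batch_hours * 3600, MODAL_A10G_USD_PER_SECOND)
runpod_low = hourly_cost(batch_hours, RUNPOD_USD_PER_HOUR_LOW)
runpod_high = hourly_cost(batch_hours, RUNPOD_USD_PER_HOUR_HIGH)

print(f"Modal, 12 h batch:  ${modal_batch:.2f}")
print(f"RunPod, 12 h batch: ${runpod_low:.2f} to ${runpod_high:.2f}")

# Implied per-second rate of the logged T4 run (10.10 s for $0.002147).
t4_rate = 0.002147 / 10.10
print(f"Logged T4 rate:     ${t4_rate:.6f}/s")
```

Under these assumptions the 12-hour batch comes to about $25.92 on per-second billing against $4.08 to $8.40 on an hourly rental, which is why the serverless path is reserved for serving.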
Bring-your-own — custom endpoint
python 46_custom_inference_as_tool.py
Wraps any HTTP completion endpoint — your Modal deploy, your on-prem TGI, your customer's vLLM cluster — as a Sagewai-callable tool. The endpoint URL is a config value; revoke it and the agent fails closed.
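"Fails closed" means that a missing URL stops the call instead of falling back to some default endpoint. The sketch below shows that behaviour with the standard library only. The `CUSTOM_INFERENCE_URL` variable name, the `EndpointRevoked` exception, and the JSON payload shape are assumptions made for illustration; Example 46 may use different names and a different request format.

```python
import json
import os
import urllib.request

ENDPOINT_VAR = "CUSTOM_INFERENCE_URL"  # hypothetical config key


class EndpointRevoked(RuntimeError):
    """Raised when no endpoint URL is configured."""


def build_request(prompt: str, env=None) -> urllib.request.Request:
    """Build the HTTP request, refusing to proceed without a configured URL."""
    env = os.environ if env is None else env
    url = env.get(ENDPOINT_VAR, "").strip()
    if not url:
        # Fail closed: no default URL, no silent fallback to another provider.
        raise EndpointRevoked(f"{ENDPOINT_VAR} is not set; refusing to call out")
    # Assumed payload shape; adapt it to the endpoint you are wrapping.
    body = json.dumps({"prompt": prompt, "max_tokens": 128}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, env=None, timeout: float = 30.0) -> str:
    """Send the prompt to the configured endpoint and return the raw body."""
    request = build_request(prompt, env)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Separating `build_request` from `complete` lets you test the revocation behaviour without a network connection.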
Apple Silicon deploy — mlx-lm
pip install mlx-lm
python 38a_mlx_lm_server_deploy.py
Runs mlx_lm.server for Mac shops on Apple Silicon. It serves the LoRA from Example 38 over an OpenAI-compatible endpoint that LiteLLM can call. Do not try to dockerize this on macOS: Docker on macOS has no Metal access and the server silently falls back to CPU.
Where you'd use this
Solo founder fine-tuning their first model
You are pre-Series A. You need to prove the loop works before the VP of Eng greenlights a real GPU budget.
| Concern | What to use |
|---|---|
| No vendor account and cannot get one this week | Example 44 — needs only a Google login |
| You need a real LoRA at the end, not a demo | The example saves a working LoRA you can deploy locally |
| Laptop has no GPU | Colab T4 has 16 GB; that is plenty for a 3B fine-tune |
Mid-size SaaS productionising a Q1 prototype
Your AI feature shipped on Opus in Q1. Now in Q3 you need to cut costs without breaking it.
| Concern | What to use |
|---|---|
| SLA-grade infrastructure for the production fine-tune | Example 47 (RunPod) — cleanup-on-failure, reliable rental |
| Compare the fine-tuned model against the cloud baseline | Example 38 prints per-call cost delta against the original Opus baseline |
| Your finance team wants a vendor relationship | RunPod, Modal, and Vast.ai all offer enterprise contracts |
On-prem-first SaaS in regulated industries
Your customers run on-prem. They want fine-tuning to happen on their hardware, not yours.
| Concern | What to use |
|---|---|
| Customer's training data must never leave their VPC | Example 46 — their on-prem TGI fronts the model; your code calls it |
| You do not want to certify each customer's hardware | Same SDK surface; if the endpoint is OpenAI-compatible, Sagewai talks to it |
| Customer's IT wants to revoke at any time | The endpoint URL is a config value; revoke it and the agent fails closed |
Bursty B2C app with unpredictable load
Peak traffic on your app's AI feature is 100x the trough. Reserved capacity means paying for idle GPUs through every trough.
| Concern | What to use |
|---|---|
| No idle GPU spend at trough | Example 48 (Modal serverless) — per-second, pay only when hot |
| Cold-start must be tolerable | T4 cold-start is around 9s; warm calls around 281ms |
| Scale to zero overnight | Modal scales to zero by default |
Apple-shop developer team
Your team is all on M-series Macs. No CUDA hardware in the office.
| Concern | What to use |
|---|---|
| Fine-tuning | Use Example 44 (Colab) — the LoRA is base-architecture-agnostic |
| Deploy locally on Apple Silicon | Example 38a uses mlx_lm.server with Metal acceleration |
| Same model on dev Macs and Linux CI | The Ollama path (Example 18) runs on both — CI uses Ollama, devs use mlx-lm |
How Sagewai protects your deploy
- Same SDK surface across every path. A `UniversalAgent` configured for a Modal endpoint, a RunPod-trained Ollama deploy, or a customer's on-prem cluster looks identical in your code. You can swap paths without touching agent logic.
- Cleanup-on-failure on rental paths. The RunPod and Vast.ai examples tear the pod down when the script exits, even on error. A stuck pod cannot silently drain your budget.
- Stub-mode for every cloud path. Run any cloud example with no API key set and it prints the orchestration plan instead of provisioning. Use this to verify the wiring before spending.
- Endpoint URLs are config, not code. Bring-your-own and customer-hosted paths revoke at the config layer — remove the URL and the agent fails closed.
What you're responsible for
- Set vendor API keys as environment variables or in `~/.sagewai/.env`. The examples read from there.
- Filter Vast.ai by reliability (`reliability>0.9` or higher) before renting. The marketplace includes hosts that disappear mid-job.
- Do not dockerize MLX on macOS. Docker on macOS has no Metal access. Use `mlx_lm.server` directly via launchd or `brew services`.
- Match the path to the workload. Modal for serving, RunPod or Vast.ai for training. Per-second billing on a 12-hour batch fine-tune is the worst of both worlds.
- Check the cost dashboard. Verify spend matches the estimate after the first live run — see Observability and cost.
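For the first item in the list, the sketch below shows one way a script could resolve a key from the environment or from `~/.sagewai/.env`. It assumes the file holds plain `KEY=VALUE` lines and that an exported variable takes precedence over the file; both are assumptions, and the helper names are hypothetical rather than Sagewai functions.

```python
import os
from pathlib import Path
from typing import Optional

DEFAULT_ENV_FILE = Path.home() / ".sagewai" / ".env"


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_key(name: str, env_file: Path = DEFAULT_ENV_FILE, env=None) -> Optional[str]:
    """Return the key from the environment first, then from the .env file."""
    env = os.environ if env is None else env
    if env.get(name):
        return env[name]
    if env_file.is_file():
        return parse_env_text(env_file.read_text(encoding="utf-8")).get(name)
    return None


sample = "# vendor keys\nRUNPOD_API_KEY=abc\n\nVASTAI_API_KEY='xyz'\n"
print(sorted(parse_env_text(sample)))
```

A `None` result is the natural trigger for stub mode: with no key resolved, a script can print its plan instead of provisioning.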
Companion examples
| # | Example | What it adds |
|---|---|---|
| 18 | local_llm_routing | Foundation — Ollama + LM Studio swap |
| 38 | unsloth_finetune | Real Unsloth fine-tune with cost delta |
| 38a | mlx_lm_server_deploy | Apple Silicon deploy via mlx_lm.server |
| 44 | colab_free_cuda | Free Tesla T4 via Drive-sync |
| 45 | vastai_marketplace_bid | Bid-cheapest aggregator with reliability filter |
| 46 | custom_inference_as_tool | Bring-your-own endpoint |
| 47 | runpod_finetune_orchestration | RunPod reliable rental, cleanup-on-failure |
| 48 | modal_serverless_inference | Modal per-second serverless |
Next steps
- Train your own model — the full capture-to-deploy loop. This page is the deploy zoom; that page is the loop.
- Rent when you grow — long-form vendor comparison for RunPod, Vast.ai, and Modal.
- Free CUDA via Colab — the free path in detail.
- Deploy locally — Ollama and mlx-lm deploy paths.
- Observability and cost — verify your deploy spend matches the estimate.