Inference deployment — pick a GPU path that fits your job

This tutorial helps you choose where to run a Sagewai-trained model in production. After reading it you will know which GPU path matches your workload — free, bid-priced, reliable rental, serverless, bring-your-own, or local — and you will have a runnable example for each one.

This page covers the deploy half of the loop. For the full capture-to-deploy sequence, start with Train your own model.

Before you start

  • A working Sagewai install (pip install sagewai).
  • A captured training dataset under ~/.sagewai/training/ if you plan to fine-tune. See Train your own model.
  • Vendor accounts only for the paths you actually use (Google for Colab, Vast.ai, RunPod, or Modal). The local and bring-your-own paths need none.

How the paths fit together

Every path produces the same artifact — a LoRA adapter — and feeds the same Sagewai SDK surface (UniversalAgent). Pick a path by job shape, not by vendor brand.

Loading diagram...

Pick a path

PathWhen it winsCostCompanion example
Local laptopDevelopment, local-mode inference, identical SDK surface to cloudFree (existing hardware)18 — local LLM routing
Free Colab T4First fine-tune, no GPU at home, no vendor account$044 — Colab free CUDA
Vast.ai bidCost-sensitive iterations, willing to handle interruptions~$0.20–0.45/hr45 — Vast.ai marketplace bid
RunPod rentalProduction fine-tunes, SLA and cleanup-on-failure required~$0.34–0.70/hr47 — RunPod orchestration
Modal serverlessBursty inference traffic, pay only when hotPer-second (~$0.0006/s A10G)48 — Modal serverless inference
Bring your ownOn-prem, air-gapped, customer-hosted, vendor lock-outVaries46 — custom inference as tool
Local OllamaProduction serving on commodity hardwareFree (existing VPS)18 — local LLM routing
Local mlx-lmProduction serving on Apple Silicon Mac shopsFree (existing hardware)38a — mlx_lm.server deploy

For longer-form vendor analysis, see the Inference education section — a five-page walkthrough of the full spectrum.

Run a path end-to-end

Each example below runs standalone. Each has a stub mode that prints the orchestration plan without provisioning anything, so you can sanity-check the wiring before spending money.

Free path — Colab T4

python 44_colab_free_cuda.py

The script Drive-syncs the dataset to a Colab notebook, fine-tunes on the free T4, and Drive-syncs the LoRA back. No vendor account beyond a Google login. Sessions disconnect at ~12 hours, and T4 memory caps the model size at ~7B.

Cheapest path — Vast.ai bid

export VASTAI_API_KEY=...
python 45_vastai_marketplace_bid.py

The script bids the cheapest A10G with a reliability filter and prints the preempted-host risk. Always filter reliability>0.9 before renting — the marketplace includes hosts that disappear mid-job.

Reliable path — RunPod

export RUNPOD_API_KEY=...
python 47_runpod_finetune_orchestration.py

24 GB GPU, cleanup-on-failure, SLA-grade. The script brings up the pod, runs the fine-tune, downloads the LoRA, and tears the pod down — all idempotently. A typical 4-hour fine-tune on a 4090 costs about $1.36.

Serverless path — Modal

modal token new
python 48_modal_serverless_inference.py

Per-second billing on an A10G. Cold-start is around 9s on T4 once the function image caches; warm calls are around 281ms. A real run logged 10.10s for $0.002147 on a T4. Use Modal for serving, not for long batch fine-tunes — per-second billing on a 12-hour batch is the worst of both worlds.

Bring-your-own — custom endpoint

python 46_custom_inference_as_tool.py

Wraps any HTTP completion endpoint — your Modal deploy, your on-prem TGI, your customer's vLLM cluster — as a Sagewai-callable tool. The endpoint URL is a config value; revoke it and the agent fails closed.

Apple Silicon deploy — mlx-lm

pip install mlx-lm
python 38a_mlx_lm_server_deploy.py

mlx_lm.server for Mac shops with Apple Silicon. Serves the LoRA from Example 38 over a LiteLLM-compatible OpenAI endpoint. Do not try to dockerize this on macOS — Docker on macOS has no Metal access and the server silently falls back to CPU.

Where you'd use this

Solo founder fine-tuning their first model

You are pre-Series A. You need to prove the loop works before the VP of Eng greenlights a real GPU budget.

ConcernWhat to use
No vendor account and cannot get one this weekExample 44 — needs only a Google login
You need a real LoRA at the end, not a demoThe example saves a working LoRA you can deploy locally
Laptop has no GPUColab T4 has 16 GB; that is plenty for a 3B fine-tune

Mid-size SaaS productionising a Q1 prototype

Your AI feature shipped on Opus in Q1. Now in Q3 you need to cut costs without breaking it.

ConcernWhat to use
SLA-grade infrastructure for the production fine-tuneExample 47 (RunPod) — cleanup-on-failure, reliable rental
Compare the fine-tuned model against the cloud baselineExample 38 prints per-call cost delta against the original Opus baseline
Your finance team wants a vendor relationshipRunPod, Modal, and Vast.ai all offer enterprise contracts

On-prem-first SaaS in regulated industries

Your customers run on-prem. They want fine-tuning to happen on their hardware, not yours.

ConcernWhat to use
Customer's training data must never leave their VPCExample 46 — their on-prem TGI fronts the model; your code calls it
You do not want to certify each customer's hardwareSame SDK surface; if the endpoint is OpenAI-compatible, Sagewai talks to it
Customer's IT wants to revoke at any timeThe endpoint URL is a config value; revoke it and the agent fails closed

Bursty B2C app with unpredictable load

Your app's AI feature handles 100x peak vs trough traffic. Reserved capacity means paying for trough.

ConcernWhat to use
No idle GPU spend at troughExample 48 (Modal serverless) — per-second, pay only when hot
Cold-start must be tolerableT4 cold-start is around 9s; warm calls around 281ms
Scale to zero overnightModal scales to zero by default

Apple-shop developer team

Your team is all on M-series Macs. No CUDA hardware in the office.

ConcernWhat to use
Fine-tuningUse Example 44 (Colab) — the LoRA is base-architecture-agnostic
Deploy locally on Apple SiliconExample 38a uses mlx_lm.server with Metal acceleration
Same model on dev Macs and Linux CIThe Ollama path (Example 18) runs on both — CI uses Ollama, devs use mlx-lm

How Sagewai protects your deploy

  • Same SDK surface across every path. A UniversalAgent configured for a Modal endpoint, a RunPod-trained Ollama deploy, or a customer's on-prem cluster looks identical in your code. You can swap paths without touching agent logic.
  • Cleanup-on-failure on rental paths. The RunPod and Vast.ai examples tear the pod down when the script exits, even on error. A stuck pod cannot silently drain your budget.
  • Stub-mode for every cloud path. Run any cloud example with no API key set and it prints the orchestration plan instead of provisioning. Use this to verify the wiring before spending.
  • Endpoint URLs are config, not code. Bring-your-own and customer-hosted paths revoke at the config layer — remove the URL and the agent fails closed.

What you're responsible for

  • Set vendor API keys as environment variables or in ~/.sagewai/.env. The examples read from there.
  • Filter Vast.ai by reliability (reliability>0.9 or higher) before renting. The marketplace includes hosts that disappear mid-job.
  • Do not dockerize MLX on macOS. Docker on macOS has no Metal access. Use mlx_lm.server directly via launchd or brew services.
  • Match the path to the workload. Modal for serving, RunPod or Vast.ai for training. Per-second billing on a 12-hour batch fine-tune is the worst of both worlds.
  • Check the cost dashboard. Verify spend matches the estimate after the first live run — see Observability and cost.

Companion examples

#ExampleWhat it adds
18local_llm_routingFoundation — Ollama + LM Studio swap
38unsloth_finetuneReal Unsloth fine-tune with cost delta
38amlx_lm_server_deployApple Silicon deploy via mlx_lm.server
44colab_free_cudaFree Tesla T4 via Drive-sync
45vastai_marketplace_bidBid-cheapest aggregator with reliability filter
46custom_inference_as_toolBring-your-own endpoint
47runpod_finetune_orchestrationRunPod reliable rental, cleanup-on-failure
48modal_serverless_inferenceModal per-second serverless

Next steps