Deploy locally
This is where the cost-down arc closes. You started with Opus in Q1, captured answers via the Curator, fine-tuned a LoRA on free Colab CUDA or rented RunPod, and now the LoRA sits on your laptop or your inference server. "Never pay per-token again" is the tagline. This page is what "never" actually means in operational terms.
What's the situation?
You have a .gguf file (or a Hugging Face directory, or an MLX model) that contains your fine-tuned weights. You have an application that today calls claude-opus-4-7 for $15/1M output tokens. You want that application to call your local model instead, with no application-code change, no per-call dollars, and no vendor in the loop.
Two pieces of infrastructure carry this:
- Ollama (or LM Studio, or mlx_lm.server) serves the model as a local OpenAI-compatible HTTP endpoint
- LiteLLM routes any OpenAI-compatible model name to that endpoint, so application code keeps using the same model=... string surface
Sagewai uses LiteLLM internally for every LLM call. "OpenAI compatibility" is not a marketing phrase here — it is the contract that lets your fine-tuned LoRA drop into the same application code that called Opus yesterday.
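Concretely, that contract looks like this from the application side. A minimal sketch, assuming Ollama is already serving the fine-tuned model per the steps below (the model name my-finetune is illustrative):

import litellm

# LiteLLM's "ollama/" prefix routes this call to the local Ollama server;
# api_base is Ollama's default address. Nothing else in the call changes.
response = litellm.completion(
    model="ollama/my-finetune",
    api_base="http://127.0.0.1:11434",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)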
What's the recommended path?
The deployment is one ollama create invocation. The application swap is changing one string. Total wall-clock between "LoRA exists" and "production traffic on local model" is under 30 minutes for a workload that already has a clean model-string swap.
Show me a runnable thing
# 1. Install Ollama (if not already)
curl -fsSL https://ollama.com/install.sh | sh
# 2. Create an Ollama model from your LoRA's .gguf file
cat > Modelfile <<'EOF'
FROM ./my-finetune.gguf
TEMPLATE """{{ .Prompt }}"""
PARAMETER temperature 0.2
EOF
ollama create my-finetune -f Modelfile
# 3. Verify it serves
curl http://127.0.0.1:11434/api/generate \
-d '{"model": "my-finetune", "prompt": "Hello"}'
# 4. Swap the model string in your Sagewai code
# Before:
# agent = Agent(name="triage", model="claude-opus-4-7", ...)
# After:
# agent = Agent(name="triage", model="ollama/my-finetune:latest", ...)
That is the cost-down endpoint in four steps. Sagewai's Example 38 is the canonical end-to-end demo of the Unsloth fine-tune followed by the Ollama deploy step. The same example also covers the before-vs-after cost numbers using the Observatory cost dashboard.
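If you want a Python-side check on top of the curl in step 3, Ollama also exposes an OpenAI-compatible route under /v1. A minimal sketch; the api_key value is a placeholder that Ollama ignores but the client library requires:

from openai import OpenAI

# Point the stock OpenAI client at Ollama's OpenAI-compatible endpoint.
client = OpenAI(base_url="http://127.0.0.1:11434/v1", api_key="ollama")
reply = client.chat.completions.create(
    model="my-finetune",
    messages=[{"role": "user", "content": "Hello"}],
)
print(reply.choices[0].message.content)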
For Apple Silicon laptops where Ollama's Metal backend isn't yet optimal for your model shape, swap Ollama for mlx_lm.server — see Example 38a for the same deployment shape with MLX.
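On the client side the MLX swap is only the base URL; the serving command is what changes. A sketch assuming mlx_lm.server's defaults (port 8080); the model path is illustrative, and Example 38a has the exact invocation:

# Serve the MLX model from a shell, e.g.:
#   python -m mlx_lm.server --model ./my-finetune-mlx --port 8080
# Then point the same OpenAI-compatible client at the new port:
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="mlx")  # key is unused locally
reply = client.chat.completions.create(
    model="my-finetune-mlx",
    messages=[{"role": "user", "content": "Hello"}],
)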
What "never pay per-token again" actually means
Honest version of the tagline:
| Line item | Before (Opus) | After (local Ollama) |
|---|---|---|
| Per-call | $0.005-0.020 (varies by tokens) | $0 |
| Per-month inference compute | $500-2000 (typical SaaS) | $0 (your laptop / server) |
| Hardware cost (one-time) | $0 | $0 if laptop; $200-2000 if dedicated server |
| Maintenance time | None | A few hours per quarter for retraining |
You did not eliminate cost; you converted per-call API spend into fixed compute spend. That is the entire point. Per-call spend grows with usage; fixed compute does not. At any meaningful scale, fixed wins.
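A back-of-the-envelope sketch of that break-even, using illustrative mid-range numbers from the table above rather than measurements from any particular workload:

cost_per_call = 0.01           # mid-range of the $0.005-0.020 per-call estimate
calls_per_month = 50_000       # illustrative traffic
api_spend_per_month = cost_per_call * calls_per_month       # $500/month, and it grows with calls

server_one_time = 1_000        # mid-range of the $200-2000 dedicated-server estimate
break_even_months = server_one_time / api_spend_per_month   # 2.0 months at these numbers
print(f"${api_spend_per_month:,.0f}/month in API spend; "
      f"hardware pays for itself in {break_even_months:.1f} months")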
The other thing you eliminated: vendor risk. Anthropic raising prices 10×, deprecating a model, throttling your account, or having an outage no longer hurts your product. That is the optionality Sagewai sells.
Bring your own endpoint
If you already run inference on a vendor we do not ship — Together, Replicate, Anyscale, your in-house vLLM cluster — wire it as a custom endpoint via Example 46. The pattern is the same: any OpenAI-compatible HTTP endpoint plugs in as a Sagewai LLM backend or as a tool. The framework is bring-your-own; we do not lock you in.
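If that endpoint speaks the OpenAI protocol, the LiteLLM side is a sketch like the following. The URL, model name, and key are placeholders; Example 46 shows the actual Sagewai wiring:

import litellm

# The "openai/" prefix tells LiteLLM to speak the OpenAI protocol to whatever
# api_base you point it at (vLLM, Together, Anyscale, an in-house cluster).
response = litellm.completion(
    model="openai/my-hosted-model",
    api_base="https://inference.internal.example/v1",
    api_key="YOUR_ENDPOINT_KEY",
    messages=[{"role": "user", "content": "Hello"}],
)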
What would I do next?
- Run Example 38 to see the full Unsloth → Ollama loop.
- Take your own LoRA from Free CUDA via Colab or Rent when you grow and run the four-step deploy above.
- Swap the model string in one application code path. Run end-to-end.
- Open the Observatory cost dashboard and watch the per-call API spend on that path drop to zero.
- Repeat for the next workload.
Anti-patterns
- Deploying to Ollama before measuring quality. A LoRA that hits 70% of Opus's accuracy on your task is a downgrade you should not ship. Run Example 36's eval pass first; deploy only when the LoRA clears your quality bar.
- Running Ollama inside Docker on macOS. macOS Docker has no Metal access; the model silently falls back to CPU. Run Ollama natively on the laptop (via brew install ollama or the installer). Docker is the right answer on Linux + CUDA, not Apple Silicon.
- Treating local deploy as the end of the story. The Curator keeps capturing fresh training data after deploy. Plan for a re-fine-tune every few weeks (hours of work, not weeks) so the local model tracks new failure modes as they appear.
- Hiding vendor switching behind feature flags. The Sagewai model-string surface is the swap point. Application code does not need to know which model is in flight. Wire one Observatory panel per model string, not a feature-flag matrix.
Cross-references
- Example 38 — Unsloth fine-tune — the canonical full Unsloth → Ollama loop
- Example 38a — mlx_lm.server deploy — Apple Silicon path
- Example 46 — custom inference as tool — bring-your-own endpoint
- Observatory — where the before-and-after numbers come from
- Start with juggernauts — the start of this arc