Train your own model — the weekend training loop

By the end of this tutorial you will have captured production traffic from a running agent, fine-tuned a 3B open-source model on it, and deployed the result on your own hardware. The fine-tune itself is a short GPU job — free on Colab, or a few dollars of rental time elsewhere; what decides whether the model is any good is the data you capture and how you evaluate it. A model served locally has no per-token API bill.

Before you start

  • A Sagewai project with at least one running agent — the first agent guide gets you there in ten minutes.
  • A Hugging Face account (free) — Unsloth pulls the base model from HF.
  • Access to one GPU tier:
    • A Google account for free Colab T4, or
    • A Vast.ai, RunPod, or Modal account (~$0.20–0.70/hr), or
    • An Apple Silicon Mac with 16 GB+ RAM for the mlx-lm path.
  • Python 3.10+ and pip install "sagewai[training]".

How the loop works

Your agent calls a cloud LLM in production, and Sagewai records each call as a structured JSONL record. Once you have around 1,000 records, run an Unsloth fine-tune on that JSONL. On the Colab or rented GPU tiers listed above, the job takes 30–60 minutes and saves a LoRA adapter (~50 MB) to models/. Convert that adapter for Ollama, or serve it with mlx_lm.server, and point your agent at the local endpoint. Calls served locally no longer add to the cloud bill.

Step 1 — Capture production traffic

Run Example 25 — training data pipeline. It fires a few agent calls and writes each one to ~/.sagewai/training/<agent-name>.jsonl. Confirm the output looks right:

head -1 ~/.sagewai/training/your-agent.jsonl | jq .

Each record contains the instruction, input, output, model name, and timestamp. Let the agent run in production for a week or two until you have at least 1,000 records — that is the minimum for a fine-tune worth deploying.
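Before counting towards the 1,000-record target, it can help to check that every line parses and carries the fields described above. The following is a minimal sketch, not part of Sagewai: the key names in REQUIRED_FIELDS are assumptions inferred from the description, so compare them with the jq output first.

```python
import json
from pathlib import Path

# Assumed key names; confirm them against a real record before relying on this.
REQUIRED_FIELDS = ("instruction", "input", "output", "model", "timestamp")


def load_records(path):
    """Return (valid_records, problems) for a JSONL capture file."""
    valid, problems = [], []
    with Path(path).expanduser().open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {lineno}: expected a JSON object")
                continue
            missing = [name for name in REQUIRED_FIELDS if name not in record]
            if missing:
                problems.append(f"line {lineno}: missing {', '.join(missing)}")
            else:
                valid.append(record)
    return valid, problems
```

Calling load_records("~/.sagewai/training/your-agent.jsonl") returns the usable records plus a list of human-readable problems, so malformed lines are reported rather than silently dropped.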

Step 2 — Close the loop (optional)

To trigger fine-tuning automatically once the dataset reaches a threshold, run Example 36 — autopilot training loop. It watches the dataset and kicks off a fine-tune job when the threshold is crossed. Skip this step if you prefer to trigger fine-tunes manually.
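The threshold check at the heart of such a loop can be sketched in a few lines. This is an illustration of the idea only, not the code in Example 36; count_records, should_trigger, and the last_trained_at parameter are hypothetical names.

```python
from pathlib import Path


def count_records(path):
    """Count non-empty lines in a JSONL file; a missing file counts as zero."""
    p = Path(path).expanduser()
    if not p.exists():
        return 0
    with p.open(encoding="utf-8") as fh:
        return sum(1 for line in fh if line.strip())


def should_trigger(path, threshold=1000, last_trained_at=0):
    """True once at least `threshold` records have arrived since the last run.

    `last_trained_at` is the record count at the previous fine-tune
    (hypothetical bookkeeping that the caller must persist).
    """
    return count_records(path) - last_trained_at >= threshold
```

Tracking the count at the previous run avoids retriggering on every poll once the dataset is past the threshold.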

Step 3 — Pick a GPU tier

| Tier | When it wins | Cost | Example |
| --- | --- | --- | --- |
| Free Colab T4 | First fine-tune, no GPU at home, OK with 12-hour session limit | $0 | 44 |
| Vast.ai bid | Cost-sensitive iterations, OK handling interruptions | ~$0.20–0.45/hr | 45 |
| RunPod reliable rental | Need 24 GB VRAM and cleanup-on-failure | ~$0.34–0.70/hr | 47 |
| Modal serverless | Pay-per-second, only when the model is hot | ~$0.0002/call on T4 | 48 |

Each example provisions the GPU, syncs the dataset, and tears down on completion or failure.
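To get a feel for what one run costs on the hourly tiers, multiply the rate by the expected wall-clock time. The sketch below uses the lowest and highest hourly rates quoted in this tutorial and the 30–60 minute estimate; it assumes flat hourly billing and ignores provisioning and idle time, which real rentals also charge for.

```python
def job_cost(hourly_rate_usd, minutes):
    """Rental cost of a job billed at a flat hourly rate."""
    return hourly_rate_usd * minutes / 60.0


# Cheapest case: $0.20/hr for 30 minutes. Most expensive: $0.70/hr for 60 minutes.
low = job_cost(0.20, 30)
high = job_cost(0.70, 60)
print(f"${low:.2f} to ${high:.2f} per fine-tune")  # $0.10 to $0.70 per fine-tune
```

A handful of iterations at these rates stays within a few dollars, provided each pod is torn down when the job ends.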

Step 4 — Run the fine-tune

Run Example 38 — Unsloth fine-tune. It fine-tunes Qwen2.5-3B-Instruct on your captured dataset, prints the loss curve, and saves a LoRA adapter to models/.

Expected wall-clock on the GPU tiers above: 30–60 minutes for ~1,000 examples, 3 epochs.
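Because the quality of the result depends on how you evaluate it, it is worth holding back part of the captured data before training so the candidate can be graded on examples it has not seen. The following is a generic sketch of a deterministic split; it is not taken from Example 38, and the 10% fraction and the function name are assumptions.

```python
import random


def split_records(records, eval_fraction=0.1, seed=0):
    """Shuffle deterministically and return (train, eval) lists."""
    if not 0.0 < eval_fraction < 1.0:
        raise ValueError("eval_fraction must be strictly between 0 and 1")
    shuffled = list(records)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]
```

With 1,000 records and the default fraction this yields 900 training and 100 evaluation examples, and the fixed seed keeps the split identical across runs so successive fine-tunes are comparable.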

Step 5 — Deploy locally

Pick your deploy target:

  • Anywhere — Ollama. Convert the LoRA to GGUF, write a Modelfile, and run ollama create. Sagewai's harness auto-discovers Ollama on startup.
  • Apple Silicon — mlx-lm. Run Example 38a. mlx_lm.server uses Metal directly; no Docker layer needed.
  • Custom endpoint. Run Example 46 to wrap any HTTP completion endpoint as a Sagewai-callable tool — on-prem, customer-hosted, or air-gapped.
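For the Ollama path, the Modelfile only needs to name the base model and the converted adapter. The sketch below generates one with the FROM and ADAPTER instructions; the base tag qwen2.5:3b and the adapter path are assumptions for illustration, and the base must be the same model the adapter was trained on.

```python
from pathlib import Path


def render_modelfile(base_model, adapter_path):
    """Return Modelfile text that applies a LoRA adapter on top of a base model."""
    return f"FROM {base_model}\nADAPTER {adapter_path}\n"


def write_modelfile(directory, base_model, adapter_path):
    """Write the Modelfile into `directory` and return its path."""
    target = Path(directory) / "Modelfile"
    target.write_text(render_modelfile(base_model, adapter_path), encoding="utf-8")
    return target
```

After write_modelfile(".", "qwen2.5:3b", "./models/adapter.gguf") (a hypothetical adapter filename), running ollama create legal-llm -f Modelfile registers the model under the name used in the agent snippet that follows.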

Once the local model is running, point your agent at it:

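import asyncio

from sagewai import UniversalAgent, providers

agent = UniversalAgent(
    name="legal-reviewer-v2",
    # "legal-llm" is assumed to be the model name you registered in Ollama
    **providers.ollama("legal-llm"),
)


async def main() -> None:
    # agent.chat is a coroutine, so it has to run inside an event loop
    response = await agent.chat("Review this non-compete clause...")
    print(response)


asyncio.run(main())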

A typical split keeps the cloud model for novel queries and sends the patterns the local model has seen before to the local endpoint.

Where you'd use this

The capture-accumulate-fine-tune-deploy loop suits four common situations.

SaaS support triage

You shipped a triage agent on Claude Haiku. By Q2 you're triaging 12,000 emails a month at $0.0007 per call.

| Concern | What to do |
| --- | --- |
| Cut the API bill in half | Fine-tune Qwen2.5-3B-Instruct on the 12K captured triage decisions; deploy via Ollama on a $40/month VPS; the marginal per-call cost drops to near zero, leaving the VPS as a fixed cost |
| Quality must not regress on P0/P1 cases | The fine-tune trains on real production data; the soak harness in _soaks/directives_soak.py grades the candidate before promotion |
| Prove the cost delta | Example 38 prints the loss curve, eval-set accuracy, and $/call delta |
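Whether self-hosting actually cuts the bill depends on volume, since the VPS is a fixed monthly cost. The arithmetic below uses the figures quoted in this scenario ($0.0007 per call, 12,000 calls a month, a $40/month VPS); the function names are hypothetical. At those figures the API bill is about $8.40 a month, so the fixed VPS cost only pays off at a higher volume or a higher per-call price.

```python
import math


def monthly_api_cost(calls_per_month, usd_per_call):
    """Cloud spend for a month of calls at a flat per-call price."""
    return calls_per_month * usd_per_call


def break_even_calls(fixed_monthly_usd, usd_per_call):
    """Smallest monthly call count at which the API bill reaches the fixed cost."""
    return math.ceil(fixed_monthly_usd / usd_per_call)


print(f"{monthly_api_cost(12_000, 0.0007):.2f}")  # 8.40
print(break_even_calls(40, 0.0007))  # 57143
```

Running the same calculation with your own call volume and per-call price shows where the crossover sits before you commit to a VPS.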

E-commerce product description generation

Your catalogue has 50K SKUs. You're generating descriptions on GPT-4o at $0.0024 per SKU — $120/month and climbing.

| Concern | What to do |
| --- | --- |
| Match the brand voice | Capture 1K human-edited descriptions; fine-tune Mistral-7B on them; the LoRA learns the voice |
| Add categories without re-fine-tuning | The dataset accumulates in ~/.sagewai/training/; the next fine-tune trains on the merged corpus |
| No flaky third-party dependency | Ollama runs on the same machine as the ingestion job; no outbound network call |

Healthcare-compliant note summarisation

Your scribe app summarises clinical notes. Under HIPAA you cannot send PHI to a third-party LLM provider such as OpenAI without a business associate agreement (BAA).

| Concern | What to do |
| --- | --- |
| PHI must not leave the boundary | The fine-tuned model runs on a HIPAA-eligible Modal endpoint or on-prem; PHI never reaches a third-party LLM |
| Produce a model card | Example 38 emits the training-set hash, eval-set accuracy, and the LoRA SHA |
| Upgrade the base model when a new one ships | Re-run Example 38 against Llama-3.2-3B and compare; the captured dataset is base-model-agnostic |
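If you want to verify the hashes on a model card independently, the training set and the adapter can both be hashed with the standard library. This is a generic sketch; choosing SHA-256 is an assumption, since the tutorial does not say which algorithm Example 38 uses.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Recording the digest of the JSONL file alongside the digest of the adapter ties a deployed model to the exact data it was trained on.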

Internal knowledge-base Q&A on engineering wikis

Your platform team answers "why is X failing?" questions across 200 services and a 5K-page Confluence. Cloud LLM via RAG costs $300/month.

| Concern | What to do |
| --- | --- |
| Internal tooling cost is hard to justify | Fine-tune a 3B model on captured Q&A pairs; deploy on a $40/month VPS; the $300/month cloud line is replaced by the fixed VPS cost |
| New runbooks land every week | Captured Q&A from the live tool feeds the next fine-tune automatically |
| Keep internal data self-hosted | Ollama on-prem; no IP leaves the boundary |

How Sagewai protects your training data

  • Capture writes to ~/.sagewai/training/ on the host you control. Captured conversations stay on that host until you export them for a fine-tune.
  • The fine-tune runs on a GPU you rent or own. The dataset is uploaded to that GPU and deleted when the instance is torn down.
  • Per-project isolation: each project's captured dataset is scoped to its project ID. Agents in project A cannot read training data from project B.
  • The deployed model is served from your own Ollama, mlx-lm, or custom endpoint. No data crosses to a model provider after deployment.

What you're responsible for

  • Redact PII before fine-tuning if your dataset contains regulated data. The export step has a strip_pii=True flag — verify it removed what you expect.
  • Watch the GPU rental meter. Vast.ai and RunPod bill until you tear the pod down. The example scripts include cleanup-on-failure, but a hung process will still accumulate charges.
  • Check the base model licence. Llama, Mistral, and Qwen have different commercial-use terms.
  • Back up the LoRA file (~50 MB). Together with the base model it was trained on, it is the only artifact you need to redeploy.
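One way to spot-check a redacted export is to scan it for patterns that should no longer be present. The sketch below looks only for email addresses and US-style social security numbers; the patterns and function names are illustrative assumptions, and a clean scan is not evidence that the data is fully de-identified.

```python
import re

# Deliberately simple patterns; extend them for the identifiers your data contains.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_pii(text):
    """Return (label, match) pairs for every suspicious span in `text`."""
    hits = []
    for label, pattern in (("email", EMAIL), ("ssn", US_SSN)):
        hits.extend((label, m.group(0)) for m in pattern.finditer(text))
    return hits


def scan_lines(lines):
    """Return (line_number, label, match) for each hit across an iterable of lines."""
    return [
        (lineno, label, match)
        for lineno, line in enumerate(lines, start=1)
        for label, match in find_pii(line)
    ]
```

Passing an open file handle for the exported JSONL to scan_lines gives the line numbers to inspect by hand; an empty result means only that these two patterns were not found.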

Companion examples

| # | Example | What it adds |
| --- | --- | --- |
| 25 | training_data_pipeline | Capture surface for production traffic |
| 36 | autopilot_training_loop | Auto-trigger fine-tune when dataset crosses threshold |
| 38 | unsloth_finetune | Real Unsloth fine-tune, real loss curves |
| 38a | mlx_lm_server_deploy | Apple Silicon deploy via mlx_lm.server |
| 44 | colab_free_cuda | Free Tesla T4 via Colab Drive-sync |
| 45 | vastai_marketplace_bid | Bid-cheapest aggregator with reliability scoring |
| 46 | custom_inference_as_tool | Bring-your-own HTTP endpoint |
| 47 | runpod_finetune_orchestration | RunPod reliable rental with cleanup-on-failure |
| 48 | modal_serverless_inference | Per-second serverless inference |

See also