Rent when you grow
Colab is the right starting point, but it has real limits: the T4's 16 GB caps model size at roughly 7B parameters, sessions disconnect after ~12 hours, and there is no support for scheduled or automated runs. Once any one of those constraints bites, you need a rented GPU.
This page covers three rented-GPU tiers, each suited to a different job, at prices that fit a realistic budget.
When to graduate from Colab
You're ready to move once you need one of these:
- A larger model (13B, 32B, 70B) — Colab's 16 GB won't hold it
- A longer training run (24 hours, multi-day) — Colab will disconnect
- A reproducible, scheduled fine-tune in CI — Colab is interactive
- A production inference endpoint with autoscaling — Colab is not a server
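A rough way to see where the 16 GB ceiling bites is to count weight memory alone. The sketch below assumes 16-bit weights (2 bytes per parameter) and ignores activations, optimizer state, and quantization, all of which shift the real number. `fp16_weight_gb` and `fits_on_t4` are illustrative helpers, not part of any library.

```python
BYTES_PER_PARAM_FP16 = 2  # assumption: 16-bit weights; quantized formats need less
COLAB_T4_GB = 16          # Colab T4 memory, as stated on this page


def fp16_weight_gb(params_billion: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes) for 16-bit weights."""
    return params_billion * BYTES_PER_PARAM_FP16


def fits_on_t4(params_billion: float) -> bool:
    """True if the weights alone fit; real workloads need extra headroom."""
    return fp16_weight_gb(params_billion) <= COLAB_T4_GB


for size in (7, 13, 32, 70):
    print(f"{size}B: ~{fp16_weight_gb(size):.0f} GB of weights, "
          f"fits on T4: {fits_on_t4(size)}")
```

Under these assumptions a 7B model just fits and everything from 13B up does not, which matches the list above.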
Pick a tier
| Situation | Pick | Why |
|---|---|---|
| Default — reliable fine-tune | RunPod | Most-mature CLI, root access, predictable per-hour pricing |
| Long batch fine-tune at lowest cost | Vast.ai | Marketplace pricing on RTX 3090s/4090s, per-host reliability scoring |
| Trained the model; need to serve it | Modal | Per-second billing, autoscaling from zero, no idle GPU spend |
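The table reduces to two questions: are you serving or training, and does cost matter more than reliability? A minimal sketch of that decision follows; the `Tier` enum and `pick_tier` function are hypothetical names used only for illustration.

```python
from enum import Enum


class Tier(Enum):
    RUNPOD = "RunPod"
    VAST_AI = "Vast.ai"
    MODAL = "Modal"


def pick_tier(serving: bool, cost_over_reliability: bool = False) -> Tier:
    """Mirror the table: Modal for serving, Vast.ai for cheapest batch, else RunPod."""
    if serving:
        return Tier.MODAL
    if cost_over_reliability:
        return Tier.VAST_AI
    return Tier.RUNPOD


print(pick_tier(serving=False).value)
print(pick_tier(serving=False, cost_over_reliability=True).value)
print(pick_tier(serving=True).value)
```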
RunPod — the default tier
RunPod Community Cloud is the default option. An RTX 4090 (24 GB) runs
~$0.34/hr and an RTX 5090 (32 GB) ~$0.69/hr. You get bare-metal access
with root permission, and the runpodctl CLI is the most polished of the three.
runpodctl create pod \
  --imageName unsloth/unsloth:latest \
  --gpuType "NVIDIA GeForce RTX 5090" \
  --gpuCount 1 \
  --containerDiskSize 20
Sagewai's Example 47 wraps this end-to-end: provision the pod, upload the Curator's JSONL, run the Unsloth fine-tune, download the LoRA, tear the pod down. Cleanup-on-failure is included so a stuck pod doesn't silently drain your budget.
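The cleanup-on-failure idea can be sketched with a plain `try`/`finally`. This is not Example 47's actual code: `run_with_cleanup` and the stub callables are hypothetical stand-ins for whatever provisions, trains on, and destroys the pod.

```python
from typing import Callable, List


def run_with_cleanup(
    provision: Callable[[], str],
    train: Callable[[str], None],
    teardown: Callable[[str], None],
) -> None:
    """Provision a pod, train on it, and always tear it down afterwards."""
    pod_id = provision()
    try:
        train(pod_id)
    finally:
        # Runs on success, on exceptions, and on Ctrl-C, so the pod stops billing.
        teardown(pod_id)


# Stub callables standing in for real provider calls.
torn_down: List[str] = []


def fake_provision() -> str:
    return "pod-123"


def failing_train(pod_id: str) -> None:
    raise RuntimeError("simulated training failure")


try:
    run_with_cleanup(fake_provision, failing_train, torn_down.append)
except RuntimeError:
    pass

print(torn_down)  # the pod was torn down despite the failure
```

The exception still propagates to the caller, so a failed run is visible in logs or CI while the pod is already gone.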
A typical 4-hour fine-tune on a 4090 costs $1.36.
Vast.ai — GPU marketplace
Vast.ai is a marketplace where independent hosts list their GPUs.
The vastai Python CLI is mature, and every host carries a
reliability score: max_perf, dlperf, internet_speed, downtime
history. You can query offers, sort by cost-per-hour, and filter to
hosts above a reliability threshold before renting.
- 24 GB GPUs: $0.20–$0.45/hr (RTX 3090 / 4090)
- A100 80 GB: $0.80–$1.60/hr
Sagewai's Example 45 wraps vastai for batch fine-tunes where
cost-per-hour matters more than provisioning latency. The example sets
a budget cap and filters by reliability score so a flaky host doesn't
waste your training run.
vastai search offers \
'gpu_name=RTX_3090 reliability>0.95 dph<0.30' \
--order dph
A typical 8-hour batch fine-tune on a 24 GB card costs $1.60–$3.60 at the rates above. The low end undercuts RunPod for the same work ($2.72 for 8 hours on a 4090), at the cost of longer provisioning and a slightly higher rate of host issues.
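The reliability filter and budget cap can also be applied in code once you have a list of offers. The sketch below is not Example 45's implementation: the `Offer` dataclass, `pick_offer`, and the sample offers are made up for illustration, with field names borrowed from the `reliability` and `dph` terms in the CLI query above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Offer:
    host: str
    gpu_name: str
    reliability: float  # 0.0 to 1.0
    dph: float          # dollars per hour


def pick_offer(
    offers: List[Offer], min_reliability: float, hours: float, budget_usd: float
) -> Optional[Offer]:
    """Cheapest offer that clears the reliability floor and fits the budget."""
    eligible = [
        o for o in offers
        if o.reliability >= min_reliability and o.dph * hours <= budget_usd
    ]
    return min(eligible, key=lambda o: o.dph, default=None)


# Hypothetical marketplace snapshot.
offers = [
    Offer("host-a", "RTX_3090", 0.99, 0.28),
    Offer("host-b", "RTX_3090", 0.80, 0.15),  # cheapest, but below the floor
    Offer("host-c", "RTX_3090", 0.97, 0.22),
    Offer("host-d", "RTX_4090", 0.98, 0.45),  # over budget for 8 hours
]

best = pick_offer(offers, min_reliability=0.95, hours=8, budget_usd=3.00)
if best is not None:
    print(best.host, f"${best.dph * 8:.2f} for 8 hours")
```

Returning `None` when nothing qualifies lets the caller wait or relax the filter explicitly instead of silently renting a flaky host.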
Modal — serverless inference
Modal is the tier for production inference. You don't provision a server: you decorate a Python function or class and Modal provisions on demand:
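import modal

app = modal.App("my-fine-tuned-model")
image = modal.Image.debian_slim(python_version="3.11").pip_install("vllm")

@app.cls(gpu="A10G", image=image)
class Model:
    @modal.enter()  # runs once per container start, so the model loads once
    def load(self):
        from vllm import LLM  # imported inside the container, where vllm is installed
        self.llm = LLM(model="/models/my-fine-tuned-model")  # placeholder path
    @modal.method()
    def serve(self, prompt: str) -> str:
        return self.llm.generate([prompt])[0].outputs[0].text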
Per-second billing on an A10G is ~$0.0006/sec when warm (about $2.16/hr). Cold starts take 8–10 seconds with the debian_slim image; larger images take longer. Sagewai's Example 48 takes the LoRA produced by Example 47's RunPod fine-tune and wraps it in a Modal serverless function. The full lifecycle: train on RunPod, serve on Modal, integrate into the agent loop.
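A serving endpoint handling ~10K inferences/month typically costs $5–15 in Modal compute. A dedicated rental bills for every hour whether or not it serves traffic: a RunPod 4090 at ~$0.34/hr, for example, comes to about $245/month, most of it spent on an endpoint that sits idle 90% of the time.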
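The arithmetic behind that comparison is worth spelling out. The sketch below uses the rates quoted on this page and assumes 1.5 GPU-seconds per inference and a 720-hour month; both are illustrative assumptions, and cold-start time and idle container time are ignored.

```python
MODAL_A10G_PER_SEC = 0.0006  # warm per-second rate quoted on this page
RUNPOD_4090_PER_HR = 0.34    # RunPod rate quoted on this page
HOURS_PER_MONTH = 720        # assumption: 30-day month


def serverless_monthly(inferences: int, gpu_seconds_each: float) -> float:
    """Monthly cost when you pay only for GPU-seconds actually used."""
    return inferences * gpu_seconds_each * MODAL_A10G_PER_SEC


def dedicated_monthly() -> float:
    """Monthly cost of a GPU rented around the clock."""
    return RUNPOD_4090_PER_HR * HOURS_PER_MONTH


serverless = serverless_monthly(10_000, gpu_seconds_each=1.5)  # assumed latency
dedicated = dedicated_monthly()
print(f"serverless: ${serverless:.2f}/month")
print(f"dedicated:  ${dedicated:.2f}/month")
```

At these rates the $5–15 range corresponds to roughly 0.8 to 2.5 GPU-seconds per inference.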
Cost summary
| Tier | Best for | 24 GB GPU price | Example |
|---|---|---|---|
| RunPod | Reliable fine-tunes, default | ~$0.34/hr (RTX 4090) | Ex 47 |
| Vast.ai | Cheapest sustained cost | $0.20–$0.45/hr | Ex 45 |
| Modal | Production inference | per-second (~$0.0006/s A10G) | Ex 48 |
Run it — RunPod path end-to-end
# 1. Set RUNPOD_API_KEY in ~/.sagewai/.env
echo "RUNPOD_API_KEY=your_key_here" >> ~/.sagewai/.env
# 2. Run the example
pip install sagewai python-dotenv
python packages/sdk/sagewai/examples/47_runpod_finetune_orchestration.py
The example provisions, trains, downloads the LoRA, and tears the pod down. Stub mode (no API key) prints the orchestration plan without provisioning anything, so you can verify the flow first.
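The stub-mode behaviour can be pictured as a simple branch on whether the key is present. This is a hypothetical sketch, not the example's real code: `orchestrate` and `PLAN` are illustrative names, and the live branch is left out.

```python
import os
from typing import Mapping

# Steps taken from the description of the example above.
PLAN = [
    "provision pod",
    "upload JSONL",
    "run fine-tune",
    "download LoRA",
    "tear down pod",
]


def orchestrate(env: Mapping[str, str] = os.environ) -> str:
    """Print the plan and return 'stub' when no API key is set, else 'live'."""
    if not env.get("RUNPOD_API_KEY"):
        for number, step in enumerate(PLAN, start=1):
            print(f"[stub] {number}. {step}")
        return "stub"
    # A real implementation would call the provider here; omitted in this sketch.
    return "live"


mode = orchestrate({})  # empty environment forces stub mode
print("mode:", mode)
```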
Steps to follow
- Pick the tier that matches your job (RunPod for fine-tunes, Modal for serving).
- Add the vendor key to ~/.sagewai/.env.
- Run the example's stub mode to verify the wiring.
- Run live. Check the Observatory cost dashboard to confirm the spend matches your estimate.
- Once the LoRA is ready, deploy locally — see Deploy locally.
Anti-patterns
- Renting a Modal A10G for a 12-hour batch fine-tune. Per-second billing on a long batch job costs more than RunPod or Vast.ai for the same work. Use RunPod or Vast.ai for training; use Modal for serving.
- Using Vast.ai without a reliability filter. The marketplace includes hosts with months of uptime and hosts that disappear mid-job. Always filter reliability>0.9 or higher before renting.
- Removing the cleanup hook. A stuck pod on RunPod or Vast.ai bills until you notice. The example wrappers include cleanup-on-failure for exactly this reason, so keep it.
- Picking by landing-page polish. All three vendors have rough edges. Pick the one that matches your workload shape.
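To put numbers on the first anti-pattern, the sketch below prices a 12-hour job on each tier using only the rates quoted on this page. It converts Modal's per-second rate to an hourly one and assumes the GPU is busy for the whole job; `job_cost` is an illustrative helper.

```python
HOURS = 12

# Hourly rates derived from the figures on this page.
rates_per_hour = {
    "Modal A10G": 0.0006 * 3600,  # per-second rate converted to hourly
    "RunPod RTX 4090": 0.34,
    "Vast.ai 24 GB (low)": 0.20,
    "Vast.ai 24 GB (high)": 0.45,
}


def job_cost(rate_per_hour: float, hours: float) -> float:
    """Total cost in dollars, rounded to cents."""
    return round(rate_per_hour * hours, 2)


for name, rate in rates_per_hour.items():
    print(f"{name}: ${job_cost(rate, HOURS):.2f} for {HOURS} hours")
```

At these rates the Modal run costs several times what the same 12 hours cost on either training tier.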