Local Inference & LLM Providers
Sagewai supports 100+ LLM models across cloud and local providers. This guide covers how to configure each option, what keys you need, and how the intelligence layer falls back to local models automatically.
Cloud API providers
| Provider | Models | Free tier | Env variable | Signup |
|---|---|---|---|---|
| OpenAI | GPT-4o, GPT-4o-mini, o1, o3 | $5 credit | OPENAI_API_KEY | platform.openai.com |
| Anthropic | Claude Opus, Sonnet, Haiku | Pay-as-you-go | ANTHROPIC_API_KEY | console.anthropic.com |
| Gemini 2.5 Flash, Gemini 2.5 Pro | Free tier available | GOOGLE_API_KEY | aistudio.google.com | |
| Groq | Llama 3.1, Mixtral | Free tier | GROQ_API_KEY | console.groq.com |
| Together AI | Open-source models | $5 credit | TOGETHER_API_KEY | api.together.ai |
| Fireworks | Open-source models | $1 credit | FIREWORKS_API_KEY | fireworks.ai |
| Cerebras | Fast inference | Free tier | CEREBRAS_API_KEY | cloud.cerebras.ai |
Which API keys do you need?
| Use case | Keys required |
|---|---|
| Admin panel (monitoring, fleet management) | None |
| Running agents | At least one provider key (any from the table above) |
| Context Engine (RAG) | None — falls back to local SentenceTransformer |
| Harness proxy | Keys for each backend you want to route to |
| Intelligence layer (entity extraction, summarization) | None — falls back to local models |
| Fine-tuning with Unsloth | None — runs locally |
Local inference providers
Ollama
The most straightforward option. Runs on macOS, Linux, and Windows.
Install:
# macOS / Linux
curl -fsSL https://ollama.ai/install.sh | sh
# Windows — download from https://ollama.ai/download
Pull a model and start:
ollama pull llama3.1:8b
ollama serve # listens on localhost:11434
Use with Sagewai:
from sagewai import UniversalAgent, providers
agent = UniversalAgent(
name="local-bot",
system_prompt="You are a helpful assistant.",
**providers.ollama("llama3.1:8b"),
)
response = await agent.chat("Explain quantum computing")
Environment variable: OLLAMA_HOST (default: http://localhost:11434)
GPU passthrough in containers:
# Docker
docker run -d --gpus all -p 11434:11434 ollama/ollama
# Podman
podman run -d --device nvidia.com/gpu=all -p 11434:11434 ollama/ollama
vLLM
High-throughput serving with continuous batching and tensor parallelism. Use this for production serving on GPU hardware.
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
Use with Sagewai:
from sagewai import UniversalAgent, providers
agent = UniversalAgent(
name="vllm-bot",
**providers.custom(
model="meta-llama/Llama-3.1-8B-Instruct",
api_base="http://localhost:8000/v1",
),
)
vLLM requires an NVIDIA GPU with enough VRAM for the model you choose.
LM Studio
Desktop application with a model browser and a built-in local inference server.
- Download from lmstudio.ai
- Browse and download a model (for example, Mistral 7B)
- Toggle on the local server from within the app
Use with Sagewai:
from sagewai import UniversalAgent, providers
agent = UniversalAgent(
name="lmstudio-bot",
**providers.lm_studio("mistral-7b-instruct"),
)
Environment variable: LM_STUDIO_HOST (default: http://localhost:1234)
llama.cpp
Minimal resource usage. Runs on CPU and works well on machines without a GPU. Uses GGUF model format.
# macOS
brew install llama.cpp
# Start the server with a GGUF model
llama-server -m model.gguf --port 8080
Use with Sagewai:
from sagewai import UniversalAgent, providers
agent = UniversalAgent(
name="llama-cpp-bot",
**providers.llama_cpp(),
)
Environment variable: LLAMA_CPP_HOST (default: http://localhost:8080)
Unsloth (fine-tuning + serving)
Fine-tune a domain-specific model on your own data, then serve it locally at $0/token.
pip install unsloth
See Example 38 for the full pipeline:
- An agent generates Q&A training pairs from your domain data
- Export in Alpaca or ChatML format
- Unsloth fine-tunes a base model using 4-bit QLoRA
- Export to GGUF format
- Serve via Ollama or
llama-server - The Harness auto-discovers the model on port 8001
Harness auto-discovery
When the LLM Harness starts, it probes these ports and registers any running server as a $0/token backend:
| Port | Server | Probe endpoint |
|---|---|---|
| 11434 | Ollama | GET /api/tags |
| 1234 | LM Studio | GET /v1/models |
| 8001 | Unsloth | GET /v1/models |
| 8000 | vLLM | GET /v1/models |
| 8080 | LocalAI / llama.cpp | GET /v1/models |
No configuration is needed. Start your local server and the Harness picks it up.
Source: sagewai/harness/discovery.py — discover_local_backends()
Intelligence layer local models
The intelligence layer uses a tiered fallback. Local models are attempted first, with no API keys required:
| Capability | Local (primary) | API (fallback) | Hash (final fallback) |
|---|---|---|---|
| Embeddings | SentenceTransformer (all-MiniLM-L6-v2, 384-dim) | API Embedder (API key required) | HashEmbedder (deterministic, always works) |
| Entity extraction | GLiNER (urchade/gliner_medium-v2.1, ~50 MB) | LLM Extractor (any model) | — |
| Transcription | faster-whisper (CTranslate2, ~150 MB) | OpenAI Whisper API | — |
| Summarization | SemanticSummarizer (embedding-scored extractive) | BART abstractive | — |
Configure the providers explicitly:
from sagewai.intelligence.config import IntelligenceConfig
config = IntelligenceConfig(
embedding_provider="local", # or "api", "hash", "auto"
extraction_provider="local", # or "llm", "auto"
transcription_provider="local", # or "api", "disabled", "auto"
)
Local-only .env template
To run Sagewai with no cloud API keys at all:
# .env
OLLAMA_HOST=http://localhost:11434
# Intelligence layer defaults to local models automatically.
# Set explicitly if you want to be certain:
# SAGEWAI_INTELLIGENCE_EMBEDDING=local
# SAGEWAI_INTELLIGENCE_EXTRACTION=local
Provider factory reference
All factory functions from sagewai/providers.py:
| Function | Provider | Default port | Env variable |
|---|---|---|---|
providers.ollama(model) | Ollama | 11434 | OLLAMA_HOST |
providers.lm_studio(model) | LM Studio | 1234 | LM_STUDIO_HOST |
providers.llama_cpp(model) | llama.cpp | 8080 | LLAMA_CPP_HOST |
providers.openai(model) | OpenAI | — | OPENAI_API_KEY |
providers.anthropic(model) | Anthropic | — | ANTHROPIC_API_KEY |
providers.gemini(model) | Google Gemini | — | GOOGLE_API_KEY |
providers.groq(model) | Groq | — | GROQ_API_KEY |
providers.together(model) | Together AI | — | TOGETHER_API_KEY |
providers.fireworks(model) | Fireworks | — | FIREWORKS_API_KEY |
providers.cerebras(model) | Cerebras | — | CEREBRAS_API_KEY |
providers.huggingface(model) | HuggingFace | — | HF_TOKEN |
providers.custom(model, api_base) | Any OpenAI-compatible server | — | — |
Each function returns a dict you can unpack directly: UniversalAgent(name="bot", **providers.ollama("llama3.1:8b")).