Local Inference & LLM Providers

Sagewai supports 100+ LLM models across cloud and local providers. This guide covers how to configure each option, what keys you need, and how the intelligence layer falls back to local models automatically.

Cloud API providers

Provider	Models	Free tier	Env variable	Signup
OpenAI	GPT-4o, GPT-4o-mini, o1, o3	$5 credit	`OPENAI_API_KEY`	platform.openai.com
Anthropic	Claude Opus, Sonnet, Haiku	Pay-as-you-go	`ANTHROPIC_API_KEY`	console.anthropic.com
Google	Gemini 2.5 Flash, Gemini 2.5 Pro	Free tier available	`GOOGLE_API_KEY`	aistudio.google.com
Groq	Llama 3.1, Mixtral	Free tier	`GROQ_API_KEY`	console.groq.com
Together AI	Open-source models	$5 credit	`TOGETHER_API_KEY`	api.together.ai
Fireworks	Open-source models	$1 credit	`FIREWORKS_API_KEY`	fireworks.ai
Cerebras	Fast inference	Free tier	`CEREBRAS_API_KEY`	cloud.cerebras.ai

Which API keys do you need?

Use case	Keys required
Admin panel (monitoring, fleet management)	None
Running agents	At least one provider key (any from the table above)
Context Engine (RAG)	None — falls back to local SentenceTransformer
Harness proxy	Keys for each backend you want to route to
Intelligence layer (entity extraction, summarization)	None — falls back to local models
Fine-tuning with Unsloth	None — runs locally

Local inference providers

Ollama

The most straightforward option. Runs on macOS, Linux, and Windows.

Install:

# macOS / Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Windows — download from https://ollama.ai/download

Pull a model and start:

ollama pull llama3.1:8b
ollama serve  # listens on localhost:11434

Use with Sagewai:

from sagewai import UniversalAgent, providers

agent = UniversalAgent(
    name="local-bot",
    system_prompt="You are a helpful assistant.",
    **providers.ollama("llama3.1:8b"),
)
response = await agent.chat("Explain quantum computing")

Environment variable: OLLAMA_HOST (default: http://localhost:11434)

GPU passthrough in containers:

# Docker
docker run -d --gpus all -p 11434:11434 ollama/ollama

# Podman
podman run -d --device nvidia.com/gpu=all -p 11434:11434 ollama/ollama

vLLM

High-throughput serving with continuous batching and tensor parallelism. Use this for production serving on GPU hardware.

pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

Use with Sagewai:

from sagewai import UniversalAgent, providers

agent = UniversalAgent(
    name="vllm-bot",
    **providers.custom(
        model="meta-llama/Llama-3.1-8B-Instruct",
        api_base="http://localhost:8000/v1",
    ),
)

vLLM requires an NVIDIA GPU with enough VRAM for the model you choose.

LM Studio

Desktop application with a model browser and a built-in local inference server.

Download from lmstudio.ai
Browse and download a model (for example, Mistral 7B)
Toggle on the local server from within the app

Use with Sagewai:

from sagewai import UniversalAgent, providers

agent = UniversalAgent(
    name="lmstudio-bot",
    **providers.lm_studio("mistral-7b-instruct"),
)

Environment variable: LM_STUDIO_HOST (default: http://localhost:1234)

llama.cpp

Minimal resource usage. Runs on CPU and works well on machines without a GPU. Uses GGUF model format.

# macOS
brew install llama.cpp

# Start the server with a GGUF model
llama-server -m model.gguf --port 8080

Use with Sagewai:

from sagewai import UniversalAgent, providers

agent = UniversalAgent(
    name="llama-cpp-bot",
    **providers.llama_cpp(),
)

Environment variable: LLAMA_CPP_HOST (default: http://localhost:8080)

Unsloth (fine-tuning + serving)

Fine-tune a domain-specific model on your own data, then serve it locally at $0/token.

pip install unsloth

See Example 38 for the full pipeline:

An agent generates Q&A training pairs from your domain data
Export in Alpaca or ChatML format
Unsloth fine-tunes a base model using 4-bit QLoRA
Export to GGUF format
Serve via Ollama or llama-server
The Harness auto-discovers the model on port 8001

Harness auto-discovery

When the LLM Harness starts, it probes these ports and registers any running server as a $0/token backend:

Port	Server	Probe endpoint
11434	Ollama	`GET /api/tags`
1234	LM Studio	`GET /v1/models`
8001	Unsloth	`GET /v1/models`
8000	vLLM	`GET /v1/models`
8080	LocalAI / llama.cpp	`GET /v1/models`

No configuration is needed. Start your local server and the Harness picks it up.

Source: sagewai/harness/discovery.py — discover_local_backends()

Intelligence layer local models

The intelligence layer uses a tiered fallback. Local models are attempted first, with no API keys required:

Capability	Local (primary)	API (fallback)	Hash (final fallback)
Embeddings	SentenceTransformer (`all-MiniLM-L6-v2`, 384-dim)	API Embedder (API key required)	HashEmbedder (deterministic, always works)
Entity extraction	GLiNER (`urchade/gliner_medium-v2.1`, ~50 MB)	LLM Extractor (any model)	—
Transcription	faster-whisper (CTranslate2, ~150 MB)	OpenAI Whisper API	—
Summarization	SemanticSummarizer (embedding-scored extractive)	BART abstractive	—

Configure the providers explicitly:

from sagewai.intelligence.config import IntelligenceConfig

config = IntelligenceConfig(
    embedding_provider="local",      # or "api", "hash", "auto"
    extraction_provider="local",     # or "llm", "auto"
    transcription_provider="local",  # or "api", "disabled", "auto"
)

Local-only .env template

To run Sagewai with no cloud API keys at all:

# .env
OLLAMA_HOST=http://localhost:11434

# Intelligence layer defaults to local models automatically.
# Set explicitly if you want to be certain:
# SAGEWAI_INTELLIGENCE_EMBEDDING=local
# SAGEWAI_INTELLIGENCE_EXTRACTION=local

Provider factory reference

All factory functions from sagewai/providers.py:

Function	Provider	Default port	Env variable
`providers.ollama(model)`	Ollama	11434	`OLLAMA_HOST`
`providers.lm_studio(model)`	LM Studio	1234	`LM_STUDIO_HOST`
`providers.llama_cpp(model)`	llama.cpp	8080	`LLAMA_CPP_HOST`
`providers.openai(model)`	OpenAI	—	`OPENAI_API_KEY`
`providers.anthropic(model)`	Anthropic	—	`ANTHROPIC_API_KEY`
`providers.gemini(model)`	Google Gemini	—	`GOOGLE_API_KEY`
`providers.groq(model)`	Groq	—	`GROQ_API_KEY`
`providers.together(model)`	Together AI	—	`TOGETHER_API_KEY`
`providers.fireworks(model)`	Fireworks	—	`FIREWORKS_API_KEY`
`providers.cerebras(model)`	Cerebras	—	`CEREBRAS_API_KEY`
`providers.huggingface(model)`	HuggingFace	—	`HF_TOKEN`
`providers.custom(model, api_base)`	Any OpenAI-compatible server	—	—

Each function returns a dict you can unpack directly: UniversalAgent(name="bot", **providers.ollama("llama3.1:8b")).