Local Inference & LLM Providers

Sagewai supports 100+ LLM models across cloud and local providers. This guide covers how to configure each option, what keys you need, and how the intelligence layer falls back to local models automatically.

Cloud API providers

ProviderModelsFree tierEnv variableSignup
OpenAIGPT-4o, GPT-4o-mini, o1, o3$5 creditOPENAI_API_KEYplatform.openai.com
AnthropicClaude Opus, Sonnet, HaikuPay-as-you-goANTHROPIC_API_KEYconsole.anthropic.com
GoogleGemini 2.5 Flash, Gemini 2.5 ProFree tier availableGOOGLE_API_KEYaistudio.google.com
GroqLlama 3.1, MixtralFree tierGROQ_API_KEYconsole.groq.com
Together AIOpen-source models$5 creditTOGETHER_API_KEYapi.together.ai
FireworksOpen-source models$1 creditFIREWORKS_API_KEYfireworks.ai
CerebrasFast inferenceFree tierCEREBRAS_API_KEYcloud.cerebras.ai

Which API keys do you need?

Use caseKeys required
Admin panel (monitoring, fleet management)None
Running agentsAt least one provider key (any from the table above)
Context Engine (RAG)None — falls back to local SentenceTransformer
Harness proxyKeys for each backend you want to route to
Intelligence layer (entity extraction, summarization)None — falls back to local models
Fine-tuning with UnslothNone — runs locally

Local inference providers

Ollama

The most straightforward option. Runs on macOS, Linux, and Windows.

Install:

# macOS / Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Windows — download from https://ollama.ai/download

Pull a model and start:

ollama pull llama3.1:8b
ollama serve  # listens on localhost:11434

Use with Sagewai:

from sagewai import UniversalAgent, providers

agent = UniversalAgent(
    name="local-bot",
    system_prompt="You are a helpful assistant.",
    **providers.ollama("llama3.1:8b"),
)
response = await agent.chat("Explain quantum computing")

Environment variable: OLLAMA_HOST (default: http://localhost:11434)

GPU passthrough in containers:

# Docker
docker run -d --gpus all -p 11434:11434 ollama/ollama

# Podman
podman run -d --device nvidia.com/gpu=all -p 11434:11434 ollama/ollama

vLLM

High-throughput serving with continuous batching and tensor parallelism. Use this for production serving on GPU hardware.

pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

Use with Sagewai:

from sagewai import UniversalAgent, providers

agent = UniversalAgent(
    name="vllm-bot",
    **providers.custom(
        model="meta-llama/Llama-3.1-8B-Instruct",
        api_base="http://localhost:8000/v1",
    ),
)

vLLM requires an NVIDIA GPU with enough VRAM for the model you choose.

LM Studio

Desktop application with a model browser and a built-in local inference server.

  1. Download from lmstudio.ai
  2. Browse and download a model (for example, Mistral 7B)
  3. Toggle on the local server from within the app

Use with Sagewai:

from sagewai import UniversalAgent, providers

agent = UniversalAgent(
    name="lmstudio-bot",
    **providers.lm_studio("mistral-7b-instruct"),
)

Environment variable: LM_STUDIO_HOST (default: http://localhost:1234)

llama.cpp

Minimal resource usage. Runs on CPU and works well on machines without a GPU. Uses GGUF model format.

# macOS
brew install llama.cpp

# Start the server with a GGUF model
llama-server -m model.gguf --port 8080

Use with Sagewai:

from sagewai import UniversalAgent, providers

agent = UniversalAgent(
    name="llama-cpp-bot",
    **providers.llama_cpp(),
)

Environment variable: LLAMA_CPP_HOST (default: http://localhost:8080)

Unsloth (fine-tuning + serving)

Fine-tune a domain-specific model on your own data, then serve it locally at $0/token.

pip install unsloth

See Example 38 for the full pipeline:

  1. An agent generates Q&A training pairs from your domain data
  2. Export in Alpaca or ChatML format
  3. Unsloth fine-tunes a base model using 4-bit QLoRA
  4. Export to GGUF format
  5. Serve via Ollama or llama-server
  6. The Harness auto-discovers the model on port 8001

Harness auto-discovery

When the LLM Harness starts, it probes these ports and registers any running server as a $0/token backend:

PortServerProbe endpoint
11434OllamaGET /api/tags
1234LM StudioGET /v1/models
8001UnslothGET /v1/models
8000vLLMGET /v1/models
8080LocalAI / llama.cppGET /v1/models

No configuration is needed. Start your local server and the Harness picks it up.

Source: sagewai/harness/discovery.pydiscover_local_backends()

Intelligence layer local models

The intelligence layer uses a tiered fallback. Local models are attempted first, with no API keys required:

CapabilityLocal (primary)API (fallback)Hash (final fallback)
EmbeddingsSentenceTransformer (all-MiniLM-L6-v2, 384-dim)API Embedder (API key required)HashEmbedder (deterministic, always works)
Entity extractionGLiNER (urchade/gliner_medium-v2.1, ~50 MB)LLM Extractor (any model)
Transcriptionfaster-whisper (CTranslate2, ~150 MB)OpenAI Whisper API
SummarizationSemanticSummarizer (embedding-scored extractive)BART abstractive

Configure the providers explicitly:

from sagewai.intelligence.config import IntelligenceConfig

config = IntelligenceConfig(
    embedding_provider="local",      # or "api", "hash", "auto"
    extraction_provider="local",     # or "llm", "auto"
    transcription_provider="local",  # or "api", "disabled", "auto"
)

Local-only .env template

To run Sagewai with no cloud API keys at all:

# .env
OLLAMA_HOST=http://localhost:11434

# Intelligence layer defaults to local models automatically.
# Set explicitly if you want to be certain:
# SAGEWAI_INTELLIGENCE_EMBEDDING=local
# SAGEWAI_INTELLIGENCE_EXTRACTION=local

Provider factory reference

All factory functions from sagewai/providers.py:

FunctionProviderDefault portEnv variable
providers.ollama(model)Ollama11434OLLAMA_HOST
providers.lm_studio(model)LM Studio1234LM_STUDIO_HOST
providers.llama_cpp(model)llama.cpp8080LLAMA_CPP_HOST
providers.openai(model)OpenAIOPENAI_API_KEY
providers.anthropic(model)AnthropicANTHROPIC_API_KEY
providers.gemini(model)Google GeminiGOOGLE_API_KEY
providers.groq(model)GroqGROQ_API_KEY
providers.together(model)Together AITOGETHER_API_KEY
providers.fireworks(model)FireworksFIREWORKS_API_KEY
providers.cerebras(model)CerebrasCEREBRAS_API_KEY
providers.huggingface(model)HuggingFaceHF_TOKEN
providers.custom(model, api_base)Any OpenAI-compatible server

Each function returns a dict you can unpack directly: UniversalAgent(name="bot", **providers.ollama("llama3.1:8b")).