Local Inference & LLM Providers

Sagewai supports 100+ LLM models through LiteLLM. Use cloud APIs or managed inference services, or run models locally at zero API cost.

Cloud API Providers

| Provider | Models | Free Tier | Env Variable | Signup |
|---|---|---|---|---|
| OpenAI | GPT-4o, GPT-4o-mini, o1, o3 | $5 credit | OPENAI_API_KEY | platform.openai.com |
| Anthropic | Claude Opus, Sonnet, Haiku | Pay-as-you-go | ANTHROPIC_API_KEY | console.anthropic.com |
| Google | Gemini 2.5 Flash, Gemini 2.5 Pro | Free tier available | GOOGLE_API_KEY | aistudio.google.com |
| Groq | Llama 3.1, Mixtral | Free tier | GROQ_API_KEY | console.groq.com |
| Together AI | Open-source models | $5 credit | TOGETHER_API_KEY | api.together.ai |
| Fireworks | Open-source models | $1 credit | FIREWORKS_API_KEY | fireworks.ai |
| Cerebras | Fast inference | Free tier | CEREBRAS_API_KEY | cloud.cerebras.ai |
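
A cloud-backed agent looks exactly like the local examples below; only the provider factory changes. A minimal sketch, assuming OPENAI_API_KEY is set (gpt-4o-mini stands in for any model from the table):

from sagewai import UniversalAgent, providers

# Reads OPENAI_API_KEY from the environment; swap in providers.groq(...),
# providers.anthropic(...), etc. for any other row of the table.
agent = UniversalAgent(
    name="cloud-bot",
    **providers.openai("gpt-4o-mini"),
)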

Which API Keys Do You Actually Need?

  • Admin panel (monitoring, fleet management): No LLM key needed
  • Running agents: At least 1 provider key (any from the table above; a quick check is sketched after this list)
  • Context Engine (RAG): Embedding model — falls back to local SentenceTransformer automatically (no key needed)
  • Harness proxy: Keys for each backend you want to route to
  • Intelligence layer (entity extraction, summarization): Auto-falls-back to local models (no key needed)
  • Fine-tuning with Unsloth: No API key needed (runs locally)
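
If you are unsure what your environment already has, a quick check against the table above (illustrative snippet, not a Sagewai API):

import os

# Provider keys from the table above; any one of these is enough to run agents.
PROVIDER_KEYS = [
    "OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY", "GROQ_API_KEY",
    "TOGETHER_API_KEY", "FIREWORKS_API_KEY", "CEREBRAS_API_KEY",
]
available = [k for k in PROVIDER_KEYS if os.environ.get(k)]
print("Usable provider keys:", available or "none (use a local backend instead)")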

Local Inference Providers

Ollama (Recommended for Beginners)

The easiest way to run models locally. Works on macOS, Linux, and Windows.

Install:

# macOS / Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Windows — download from https://ollama.ai/download

Pull a model and start:

ollama pull llama3.1:8b
ollama serve  # starts on localhost:11434

Use with Sagewai:

import asyncio

from sagewai import UniversalAgent, providers

agent = UniversalAgent(
    name="local-bot",
    system_prompt="You are a helpful assistant.",
    **providers.ollama("llama3.1:8b"),
)

async def main():
    response = await agent.chat("Explain quantum computing")
    print(response)

asyncio.run(main())

Environment variable: OLLAMA_HOST (default: http://localhost:11434)

GPU passthrough (containers):

# Docker
docker run -d --gpus all -p 11434:11434 ollama/ollama

# Podman
podman run -d --device nvidia.com/gpu=all -p 11434:11434 ollama/ollama

vLLM (Recommended for Production Serving)

High-throughput serving with continuous batching and tensor parallelism.

pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

Use with Sagewai:

from sagewai import UniversalAgent, providers

agent = UniversalAgent(
    name="vllm-bot",
    **providers.custom(
        model="meta-llama/Llama-3.1-8B-Instruct",
        api_base="http://localhost:8000/v1",
    ),
)

vLLM requires an NVIDIA GPU with sufficient VRAM for the model.
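
To verify the server is up before pointing an agent at it, you can hit vLLM's OpenAI-compatible model listing (plain requests, nothing Sagewai-specific):

import requests

# Lists the models the vLLM server is currently serving.
resp = requests.get("http://localhost:8000/v1/models", timeout=5)
print(resp.json())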

LM Studio (Recommended for GUI Users)

Desktop app with a model browser and one-click local server.

  1. Download from lmstudio.ai
  2. Browse and download a model (e.g., Mistral 7B)
  3. Start the local server (toggle in the app)

Use with Sagewai:

from sagewai import UniversalAgent, providers

agent = UniversalAgent(
    name="lmstudio-bot",
    **providers.lm_studio("mistral-7b-instruct"),
)

Environment variable: LM_STUDIO_HOST (default: http://localhost:1234)

llama.cpp (Lightweight C++ Inference)

Minimal resource usage, runs well on CPU. Uses GGUF model format.

# macOS
brew install llama.cpp

# Start server with a GGUF model
llama-server -m model.gguf --port 8080

Use with Sagewai:

from sagewai import UniversalAgent, providers

agent = UniversalAgent(
    name="llama-cpp-bot",
    **providers.llama_cpp(),
)

Environment variable: LLAMA_CPP_HOST (default: http://localhost:8080)

Unsloth (Fine-Tuning + Serving)

Fine-tune domain-specific models, then serve them locally at $0/token.

pip install unsloth

See Example 17 (sagewai/examples/17_unsloth_finetune.py) for the full pipeline:

  1. Agent generates Q&A training pairs from your domain data
  2. Export in Alpaca/ChatML format (a minimal export sketch follows this list)
  3. Unsloth fine-tunes a base model (4-bit QLoRA)
  4. Export to GGUF format
  5. Serve via Ollama or llama-server
  6. Harness auto-discovers on port 8001
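
Step 2 is plain JSON manipulation. A minimal sketch, assuming the generated pairs are question/answer dicts (those field names are hypothetical; Alpaca format itself uses instruction/input/output keys):

import json

pairs = [{"question": "What does the harness do?", "answer": "Routes LLM calls."}]

# Write one Alpaca-format row per line (JSONL).
with open("train.jsonl", "w") as f:
    for p in pairs:
        row = {"instruction": p["question"], "input": "", "output": p["answer"]}
        f.write(json.dumps(row) + "\n")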

Harness Auto-Discovery

When you start the LLM Harness, it automatically probes common local inference ports:

| Port | Server | Probe Endpoint |
|---|---|---|
| 11434 | Ollama | GET /api/tags |
| 1234 | LM Studio | GET /v1/models |
| 8001 | Unsloth | GET /v1/models |
| 8000 | vLLM | GET /v1/models |
| 8080 | LocalAI / llama.cpp | GET /v1/models |

Discovered servers are automatically registered as backends with $0/token cost tracking. No configuration needed — just start your local server and the harness finds it.

Source: discover_local_backends() in sagewai/harness/discovery.py
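
Conceptually, the probe loop is just a timeout-bounded GET per port. A rough sketch of the idea (not the actual discovery.py code):

import requests

# Port -> (backend name, probe path), mirroring the table above.
PROBES = {
    11434: ("ollama", "/api/tags"),
    1234: ("lm_studio", "/v1/models"),
    8001: ("unsloth", "/v1/models"),
    8000: ("vllm", "/v1/models"),
    8080: ("llama_cpp_or_localai", "/v1/models"),
}

def probe_local_ports():
    found = []
    for port, (name, path) in PROBES.items():
        try:
            r = requests.get(f"http://localhost:{port}{path}", timeout=0.5)
            if r.ok:
                found.append((name, port))
        except requests.RequestException:
            pass  # nothing listening on this port
    return found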

Intelligence Layer Local Models

The intelligence layer uses a tiered fallback system. Local models are tried first — no API keys required:

| Capability | Local (Primary) | API (Fallback) | Hash (Final Fallback) |
|---|---|---|---|
| Embeddings | SentenceTransformer (all-MiniLM-L6-v2, 384-dim) | LiteLLM Embedder (API key required) | HashEmbedder (deterministic, always works) |
| Entity Extraction | GLiNER (urchade/gliner_medium-v2.1, ~50 MB) | LLM Extractor (any model) | — |
| Transcription | faster-whisper (CTranslate2, ~150 MB) | OpenAI Whisper API | — |
| Summarization | SemanticSummarizer (embedding-scored extractive) | BART abstractive | — |

Configure explicitly:

from sagewai.intelligence.config import IntelligenceConfig

config = IntelligenceConfig(
    embedding_provider="local",      # or "api", "hash", "auto"
    extraction_provider="local",     # or "llm", "auto"
    transcription_provider="local",  # or "api", "disabled", "auto"
)
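
The "auto" behavior amounts to trying the local model first and degrading gracefully. A toy illustration of the embeddings tier, not Sagewai's actual classes (the API tier is omitted for brevity):

def embed(texts):
    try:
        # Primary: local SentenceTransformer, 384-dim vectors, no API key.
        from sentence_transformers import SentenceTransformer
        return SentenceTransformer("all-MiniLM-L6-v2").encode(texts)
    except ImportError:
        # Final fallback: deterministic hash embedding; always works offline.
        import hashlib
        return [
            [b / 255 for b in hashlib.sha256(t.encode()).digest()]
            for t in texts
        ]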

Local-Only .env Template

Run Sagewai with zero cloud API keys:

# .env — No cloud API keys needed!
OLLAMA_HOST=http://localhost:11434

# Intelligence layer defaults to local models automatically.
# Explicitly set if you want to be sure:
# SAGEWAI_INTELLIGENCE_EMBEDDING=local
# SAGEWAI_INTELLIGENCE_EXTRACTION=local
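
With that .env in place, a fully offline agent needs no keys at all. A sketch assuming python-dotenv for loading (Sagewai may also read .env itself; this is the generic pattern):

from dotenv import load_dotenv  # pip install python-dotenv

from sagewai import UniversalAgent, providers

load_dotenv()  # picks up OLLAMA_HOST from the .env above

agent = UniversalAgent(
    name="offline-bot",
    **providers.ollama("llama3.1:8b"),
)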

Provider Factory Reference

All factory functions from sagewai/providers.py:

| Function | Provider | Default Port | Env Variable |
|---|---|---|---|
| providers.ollama(model) | Ollama | 11434 | OLLAMA_HOST |
| providers.lm_studio(model) | LM Studio | 1234 | LM_STUDIO_HOST |
| providers.llama_cpp(model) | llama.cpp | 8080 | LLAMA_CPP_HOST |
| providers.openai(model) | OpenAI | — | OPENAI_API_KEY |
| providers.anthropic(model) | Anthropic | — | ANTHROPIC_API_KEY |
| providers.gemini(model) | Google Gemini | — | GOOGLE_API_KEY |
| providers.groq(model) | Groq | — | GROQ_API_KEY |
| providers.together(model) | Together AI | — | TOGETHER_API_KEY |
| providers.fireworks(model) | Fireworks | — | FIREWORKS_API_KEY |
| providers.cerebras(model) | Cerebras | — | CEREBRAS_API_KEY |
| providers.huggingface(model) | HuggingFace | — | HF_TOKEN |
| providers.custom(model, api_base) | Any OpenAI-compatible | — | — |

All return a dict for direct unpacking: UniversalAgent(name="bot", **providers.ollama("llama3.1:8b"))
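
For orientation, the returned dict is roughly a LiteLLM routing config. The exact keys are defined in sagewai/providers.py; the values shown in the comment below are an illustrative guess, not confirmed output:

from sagewai import providers

cfg = providers.ollama("llama3.1:8b")
# Likely a provider-prefixed LiteLLM model string plus a base URL, e.g.
# {"model": "ollama/llama3.1:8b", "api_base": "http://localhost:11434"}
print(cfg)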