Memory and retrieval — make a cheap LLM hold the thread

This tutorial shows you how to give a small or local LLM the same conversational memory and multi-hop reasoning that a frontier model handles natively. After reading it you will know which retrieval pattern to reach for — semantic checkpoint recall, graph-over-vector, or both — and you will have runnable examples for each.

Before you start

  • A working Sagewai install (pip install sagewai).
  • One LLM you can call. Any of these works: an Anthropic key, an OpenAI key, or a local Ollama install with llama3.2 pulled.
  • For the graph-retrieval example, optionally a NebulaGraph instance if you want to exercise the production backend rather than the in-memory one.

Pick a pattern

Choose by the question your application needs to answer.

| Pattern | When it wins | Companion example |
| --- | --- | --- |
| Semantic checkpoint recall | Long multi-turn conversations where vague references like "back to that earlier point" must resolve to the right history slice. Lets a 7B local model behave like a frontier model on long sessions. | 37: semantic checkpoint recall |
| Graph-over-vector retrieval | Multi-hop questions over typed relations ("what depends on X, and what broke last time any of those did?"). Vector chunking finds similar text; graph traversal finds the actual chain. | 41: graph memory incident dependency |

Most production apps end up wanting both: vector for "what did we say about X?", graph for "what depends on X?".

How semantic checkpoint recall works

You embed each new turn and semantically search the conversation history. Only the focused slice — typically the top three most-relevant turns — goes to the LLM. The model never sees the full thread, so a small model can keep up with a long one.
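
The recall step described above can be sketched in a few lines. This is a minimal illustration, not the Sagewai implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, and `recall_slice` and the sample history are hypothetical names and data chosen for the example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def recall_slice(history: list[str], query: str, k: int = 3) -> list[str]:
    """Return the k most relevant past turns, kept in conversation order."""
    q = embed(query)
    ranked = sorted(
        range(len(history)),
        key=lambda i: cosine(embed(history[i]), q),
        reverse=True,
    )
    return [history[i] for i in sorted(ranked[:k])]


# Made-up conversation history for illustration.
history = [
    "We agreed the launch date is March 3.",
    "Lunch options near the office are limited.",
    "Pricing for the launch tier is 20 dollars per seat.",
    "My cat knocked over the router again.",
    "The launch blog post still needs a reviewer.",
]

# Only this focused slice would be sent to the LLM, not the full thread.
focused = recall_slice(history, "what did we decide about the launch?", k=3)
```

A real embedding model would also match paraphrases that share no words with the query, which this word-overlap stand-in cannot do.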

The same code runs on claude-opus, gpt-4o-mini, and ollama/llama3:8b. The example cycles through all three and prints the per-LLM agreement on the recalled slice.

Run it

pip install sagewai
ollama pull llama3.2
python 37_semantic_checkpoint_recall.py

The script builds a 12-turn conversation about a fictional product launch, fires a deliberately vague "let's get back to our real business" turn, retrieves the focused slice, and compares per-LLM responses. Real numbers from a clean-machine run print at the end.

How graph-over-vector retrieval works

For questions that walk typed relations — "which services depend on Service A, and which of those caused incidents in the last 90 days?" — vector retrieval returns chunks that look similar to the query, but it cannot follow transitive structure across them. Graph traversal walks the edges and returns the actual chain.
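
To make the difference concrete, here is a small sketch of a multi-hop walk over typed edges. It uses plain Python triples rather than GraphMemory, the service and incident names are invented, and the 90-day time filter is left out for brevity.

```python
from collections import deque

# (source, relation, target) triples; all names are made up for illustration.
EDGES = [
    ("service-b", "depends_on", "service-a"),
    ("service-c", "depends_on", "service-b"),
    ("service-d", "depends_on", "service-x"),
    ("INC-1", "caused_by", "service-c"),
    ("INC-2", "caused_by", "service-d"),
]


def dependents(edges, root):
    """All services that depend on root, directly or transitively."""
    seen, queue = set(), deque([root])
    while queue:
        node = queue.popleft()
        for src, rel, dst in edges:
            if rel == "depends_on" and dst == node and src not in seen:
                seen.add(src)
                queue.append(src)
    return seen


def incidents_caused_by(edges, services):
    """Incidents whose caused_by edge points at one of the given services."""
    return sorted(
        src for src, rel, dst in edges if rel == "caused_by" and dst in services
    )


affected = dependents(EDGES, "service-a")          # two hops: b, then c
incidents = incidents_caused_by(EDGES, affected)   # third hop: incidents
```

The incident is linked to `service-c`, which never mentions `service-a` directly; a similarity search on "service-a" has no reason to surface it, while the traversal reaches it by following edges.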

GraphMemory ships with both an in-memory backend and a NebulaGraph backend behind one API. QueryRouter auto-classifies a query as relational vs lexical and dispatches to the right store.
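
The routing idea can be illustrated with a deliberately simple heuristic. This sketch is an assumption about the general shape of such a router, not a description of how QueryRouter classifies queries: the cue list, `classify`, and `route` are all hypothetical.

```python
import re

# Hypothetical cue phrases that suggest a question walks relations.
RELATIONAL_CUES = re.compile(
    r"\b(depends? on|caused by|affects?|downstream|upstream|root cause)\b",
    re.IGNORECASE,
)


def classify(query: str) -> str:
    """Label a query 'relational' if it contains a relation cue, else 'lexical'."""
    return "relational" if RELATIONAL_CUES.search(query) else "lexical"


def route(query, graph_store, vector_store):
    """Dispatch the query to whichever store callable matches its label."""
    store = graph_store if classify(query) == "relational" else vector_store
    return store(query)
```

A production router would likely use a learned classifier or an LLM call instead of keywords; the point here is only that one entry point can sit in front of both stores.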

Run it

python 41_graph_memory_incident_dependency.py

The script seeds 15–20 incidents and 5–10 services with realistic root-cause edges, issues four query types (single-hop, multi-hop, temporal, constraint propagation), and prints a side-by-side against vector retrieval. It reports average traversal depth, answer completeness vs vector, and p50/p99 latency.
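
If you want to compute the same kind of summary for your own queries, the sketch below shows one common way to do it. The script's exact metric definitions are not shown here, so treat this as an assumption: it uses the nearest-rank percentile and a simple recall ratio for completeness, and every sample value is made up.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def completeness(found, expected):
    """Fraction of the expected answer items that a retriever returned."""
    expected = set(expected)
    return len(set(found) & expected) / len(expected) if expected else 1.0


# Made-up measurements for illustration only.
latencies_ms = [12, 15, 11, 14, 90, 13, 16, 12, 14, 13]
depths = [1, 3, 2, 2]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
avg_depth = sum(depths) / len(depths)
graph_score = completeness(["INC-1", "INC-2"], ["INC-1", "INC-2"])
vector_score = completeness(["INC-1"], ["INC-1", "INC-2"])
```

With only ten samples the p99 is simply the slowest query, which is why tail latency needs a larger run to be meaningful.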

To exercise the production NebulaGraph backend:

python 41_graph_memory_incident_dependency.py --backend nebula

You need a reachable NebulaGraph instance. The example reads its connection settings from environment variables — see the script header.

Where you'd use this

These patterns come up when your first agent starts losing the thread, or when your RAG pipeline returns chunks that look similar but do not answer the question.

Customer-support chatbot for long sessions

A support bot handles multi-turn conversations where customers walk it through reproducing a bug. Sessions hit 30 turns. Opus is fine but expensive at scale; Haiku loses context.

| Concern | How this pattern solves it |
| --- | --- |
| A cheap model must hold a 30-turn thread | Embed each turn, retrieve the top-3 most-relevant history turns, pass only that slice |
| The customer asks "can you re-check what I said about the staging environment?" | Semantic search resolves the vague reference to the staging-environment turn from 18 messages back |
| You need to swap from Opus to Haiku to local without rewriting | Same code; the example demonstrates it works on all three |
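
One way to keep the code identical across models is to build the prompt from the retrieved slice and pass the model in as a plain callable. The sketch below assumes that shape; `answer_with_slice`, the prompt wording, and `fake_model` are hypothetical and stand in for whatever client you actually call.

```python
from typing import Callable


def answer_with_slice(
    question: str, slice_turns: list[str], complete: Callable[[str], str]
) -> str:
    """Build a prompt from the focused slice only and hand it to any model callable."""
    context = "\n".join(f"- {turn}" for turn in slice_turns)
    prompt = (
        f"Relevant earlier turns:\n{context}\n\n"
        f"Customer: {question}\nAssistant:"
    )
    return complete(prompt)


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call; reports how much text it was given."""
    return f"[model saw {len(prompt)} characters]"


reply = answer_with_slice(
    "can you re-check what I said about the staging environment?",
    ["Staging uses the old database image.", "The bug only appears on staging."],
    fake_model,
)
```

Swapping models then means swapping the `complete` argument; the retrieval and prompt-building code does not change.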

On-call and incident response with cross-incident reasoning

A platform team needs to answer "did this break before?" questions. The on-call tool RAGs over the incident wiki and returns chunks; the engineer still has to follow the dependency chain.

| Concern | How this pattern solves it |
| --- | --- |
| Vector retrieval finds similar-sounding incidents but misses transitive dependencies | Graph traversal walks affects_service and caused_by edges and finds the actual root cause across hops |
| You need explainability: why did the bot suggest this is the same root cause? | Graph retrieval emits the path: "Service A → caused_by → Service B → previous_incident → INC-1234" |
| You already run NebulaGraph in production for service maps | The same example code switches to --backend nebula; no new infrastructure |
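
The explainability row relies on the traversal remembering how it reached its answer. A minimal sketch of that idea, assuming edges stored as plain triples (the `explain_path` helper and the extra `Service C` edge are invented for illustration):

```python
from collections import deque

# (source, relation, target) triples mirroring the path quoted in the table.
EDGES = [
    ("Service A", "caused_by", "Service B"),
    ("Service B", "previous_incident", "INC-1234"),
    ("Service A", "affects_service", "Service C"),
]


def explain_path(edges, start, goal):
    """Breadth-first search that returns the rendered hop list, or None if unreachable."""
    queue = deque([(start, [start])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return " → ".join(path)
        for src, rel, dst in edges:
            if src == node and dst not in visited:
                visited.add(dst)
                # Carry the relation name along so the path explains itself.
                queue.append((dst, path + [rel, dst]))
    return None


explanation = explain_path(EDGES, "Service A", "INC-1234")
```

Because the search is breadth-first, the path it returns is one with the fewest hops, which tends to be the easiest one for an engineer to verify.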

Compliance-document Q&A with multi-hop reasoning

A RAG bot over a 5K-page compliance corpus returns chunks. Auditors ask "if Section 5.2 changes, which downstream policies need a review?" and the bot cannot answer.

| Concern | How this pattern solves it |
| --- | --- |
| Multi-hop questions like "which Y depend on X, and what is the latest update to each?" fail with vector | Graph retrieval over typed edges (depends_on, references, superseded_by) walks the chain |
| You need to explain answers if a compliance review comes up | The graph path is the audit trail; print it next to the answer |
| The corpus updates weekly | Re-extract the graph nightly; embedding-based vector RAG alone is not sufficient |
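
The auditor's question combines two walks: backwards along depends_on and references edges to find what needs review, then forwards along superseded_by edges to find the current version of each hit. The sketch below shows that combination on invented policy names; it is an illustration of the idea, not the GraphMemory API.

```python
# (source, relation, target) triples; document names are made up.
EDGES = [
    ("Policy P-7", "depends_on", "Section 5.2"),
    ("Policy P-9", "references", "Policy P-7"),
    ("Policy P-7", "superseded_by", "Policy P-7 v2"),
    ("Policy P-3", "depends_on", "Section 2.1"),
]
UPSTREAM_RELATIONS = {"depends_on", "references"}


def needs_review(edges, changed):
    """Everything that reaches `changed` through depends_on or references edges."""
    found, frontier = set(), [changed]
    while frontier:
        target = frontier.pop()
        for src, rel, dst in edges:
            if rel in UPSTREAM_RELATIONS and dst == target and src not in found:
                found.add(src)
                frontier.append(src)
    return found


def latest_version(edges, doc):
    """Follow superseded_by edges until none remain (guards against cycles)."""
    successors = {src: dst for src, rel, dst in edges if rel == "superseded_by"}
    seen = {doc}
    while doc in successors and successors[doc] not in seen:
        doc = successors[doc]
        seen.add(doc)
    return doc


# Map each document needing review to its current version.
review = {
    doc: latest_version(EDGES, doc) for doc in needs_review(EDGES, "Section 5.2")
}
```

`Policy P-9` never mentions Section 5.2, yet it lands in the review set because it references a policy that does.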

Scribe app for primary-care physicians

A scribe summarises a 45-minute consultation. The doctor says "go back to what they said about the rash" halfway through writing the summary.

| Concern | How this pattern solves it |
| --- | --- |
| HIPAA forbids sending full transcripts to a frontier API without a BAA | Local model + semantic-checkpoint pattern: only the rash-relevant slice goes to the LLM; the transcript stays on-prem |
| 7B local models lose the thread on long consultations | They do not, if they only see the relevant slice; that is the point |
| Doctor wants to verify the bot's recall | Print the slice next to the summary; the doctor reads it and signs off |

SDK basics — memory examples

Start here if you want to understand the storage and retrieval primitives before running the examples above.

| # | Example | What it adds |
| --- | --- | --- |
| 04 | memory_agent | Basic agent memory |
| 29 | memory_strategies | Strategy-based extraction (semantic facts, preferences, summaries) |
| 31 | grounded_multi_model | Multi-LLM grounded retrieval |
| 32 | global_shared_memory | Cross-agent shared knowledge |

See also

  • Concept page: Memory and RAG — the API surface for RAGEngine, VectorMemory, GraphMemory, and MemoryBranch.
  • SDK overview: SDK — the substrate these examples exercise.
  • Related tutorial: Train your own model — pairs naturally with semantic-checkpoint recall (cheap LLM plus smart retrieval).
  • Related tutorial: Production multitenancy — the on-call agent's full Sealed boundary.