Memory and retrieval — make a cheap LLM hold the thread
This tutorial shows you how to give a small or local LLM the same conversational memory and multi-hop reasoning that a frontier model handles natively. After reading it you will know which retrieval pattern to reach for — semantic checkpoint recall, graph-over-vector, or both — and you will have runnable examples for each.
Before you start
- A working Sagewai install (`pip install sagewai`).
- One LLM you can call. Any of these works: an Anthropic key, an OpenAI key, or a local Ollama install with `llama3.2` pulled.
- For the graph-retrieval example, optionally a NebulaGraph instance if you want to exercise the production backend rather than the in-memory one.
Pick a pattern
Choose by the question your application needs to answer.
| Pattern | When it wins | Companion example |
|---|---|---|
| Semantic checkpoint recall | Long multi-turn conversations where vague references like "back to that earlier point" must resolve to the right history slice. Lets a 7B local model behave like a frontier model on long sessions. | 37 — semantic checkpoint recall |
| Graph-over-vector retrieval | Multi-hop questions over typed relations ("what depends on X, and what broke last time any of those did?"). Vector chunking finds similar text; graph traversal finds the actual chain. | 41 — graph memory incident dependency |
Most production apps end up wanting both: vector for "what did we say about X?", graph for "what depends on X?".
How semantic checkpoint recall works
You embed each new turn and semantically search the conversation history. Only the focused slice — typically the top three most-relevant turns — goes to the LLM. The model never sees the full thread, so a small model can keep up with a long one.
The same code runs on claude-opus, gpt-4o-mini, and ollama/llama3:8b. The example cycles through all three and prints the per-LLM agreement on the recalled slice.
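The recall step described above can be sketched without any SDK. This is a minimal illustration, assuming a toy bag-of-words vector in place of a real embedding model; `embed`, `cosine`, and `recall_slice` are hypothetical names for this sketch, not Sagewai APIs, and the sample history is made up.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall_slice(history: list[str], query: str, k: int = 3) -> list[tuple[int, str]]:
    """Return the k turns most similar to the query, restored to conversation order."""
    q = embed(query)
    scored = [(cosine(embed(turn), q), idx) for idx, turn in enumerate(history)]
    top = sorted(scored, key=lambda s: (-s[0], s[1]))[:k]
    return [(idx, history[idx]) for idx in sorted(idx for _, idx in top)]


# Hypothetical conversation history: three on-topic turns, three distractors.
history = [
    "The launch date for the product is March 3.",
    "I had pasta for lunch today.",
    "Pricing for the launch tier is 20 dollars a month.",
    "My cat knocked over a plant.",
    "The launch blog post still needs a legal review.",
    "It rained all morning.",
]

focused = recall_slice(history, "what did we decide about the launch?", k=3)
prompt = "\n".join(f"[turn {i}] {text}" for i, text in focused)
print(prompt)  # only this slice, plus the new user turn, would go to the LLM
```

Restoring the selected turns to conversation order keeps the slice readable for the model. With a real embedding model the scoring function changes, but the select-then-slice shape stays the same.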
Run it
pip install sagewai
ollama pull llama3.2
python 37_semantic_checkpoint_recall.py
The script builds a 12-turn conversation about a fictional product launch, fires a deliberately vague "let's get back to our real business" turn, retrieves the focused slice, and compares per-LLM responses. Real numbers from a clean-machine run print at the end.
How graph-over-vector retrieval works
For questions that walk typed relations — "which services depend on Service A, and which of those caused incidents in the last 90 days?" — vector retrieval returns chunks that look similar to the query, but it cannot follow transitive structure across them. Graph traversal walks the edges and returns the actual chain.
GraphMemory ships with both an in-memory backend and a NebulaGraph backend behind one API. QueryRouter auto-classifies a query as relational vs lexical and dispatches to the right store.
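To make the multi-hop idea concrete, here is a minimal sketch of the "which services depend on Service A, and which of those caused incidents in the last 90 days?" question as a plain breadth-first walk over typed edges. It does not use GraphMemory or QueryRouter; the edge list, dates, and function names are hypothetical sample data for illustration only.

```python
from collections import deque
from datetime import date, timedelta

# Typed edges as (source, relation, target). Hypothetical sample data.
EDGES = [
    ("service-b", "depends_on", "service-a"),
    ("service-c", "depends_on", "service-b"),
    ("service-d", "depends_on", "service-x"),
    ("INC-1", "caused_by", "service-c"),
    ("INC-2", "caused_by", "service-b"),
    ("INC-3", "caused_by", "service-d"),
]
INCIDENT_DATES = {
    "INC-1": date(2024, 5, 20),
    "INC-2": date(2023, 11, 2),
    "INC-3": date(2024, 5, 25),
}


def transitive_dependents(root: str) -> set[str]:
    """Breadth-first walk over reversed depends_on edges: everything that needs root."""
    seen, queue = set(), deque([root])
    while queue:
        node = queue.popleft()
        for src, rel, dst in EDGES:
            if rel == "depends_on" and dst == node and src not in seen:
                seen.add(src)
                queue.append(src)
    return seen


def recent_incidents(services: set[str], today: date, days: int = 90) -> list[str]:
    """Incidents caused by any of the services within the look-back window."""
    cutoff = today - timedelta(days=days)
    return sorted(
        src
        for src, rel, dst in EDGES
        if rel == "caused_by" and dst in services and INCIDENT_DATES[src] >= cutoff
    )


dependents = transitive_dependents("service-a")
incidents = recent_incidents(dependents, today=date(2024, 6, 1))
print(sorted(dependents), incidents)
```

`service-c` never mentions `service-a` directly, so a similarity search on "Service A" has no reason to surface it. The walk reaches it through `service-b`.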
Run it
python 41_graph_memory_incident_dependency.py
The script seeds 15–20 incidents and 5–10 services with realistic root-cause edges, issues four query types (single-hop, multi-hop, temporal, constraint propagation), and prints a side-by-side against vector retrieval. It reports average traversal depth, answer completeness vs vector, and p50/p99 latency.
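The text does not spell out how the script computes its metrics. As a rough guide to reading them, the sketch below uses two common definitions, both assumptions: completeness as the fraction of expected answer items that were retrieved, and nearest-rank percentiles for latency. All names and sample values are hypothetical.

```python
import math


def completeness(retrieved: set[str], expected: set[str]) -> float:
    """Fraction of the expected answer items that the retriever returned."""
    return len(retrieved & expected) / len(expected) if expected else 1.0


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical results for one multi-hop query.
expected = {"INC-1", "INC-7", "INC-9"}
graph_hits = {"INC-1", "INC-7", "INC-9"}
vector_hits = {"INC-1", "INC-4"}
latencies_ms = [12.0, 15.0, 11.0, 90.0, 14.0]

graph_score = completeness(graph_hits, expected)
vector_score = completeness(vector_hits, expected)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"graph={graph_score:.2f} vector={vector_score:.2f} p50={p50}ms p99={p99}ms")
```

Reading p50 and p99 together shows why both are reported: a single slow traversal barely moves the median but dominates the tail.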
To exercise the production NebulaGraph backend:
python 41_graph_memory_incident_dependency.py --backend nebula
You need a reachable NebulaGraph instance. The example reads its connection settings from environment variables — see the script header.
Where you'd use this
These patterns come up when your first agent starts losing the thread, or when your RAG pipeline returns chunks that look similar but do not answer the question.
Customer-support chatbot for long sessions
A support bot handles multi-turn conversations where customers walk it through reproducing a bug. Sessions hit 30 turns. Opus is fine but expensive at scale; Haiku loses context.
| Concern | How this pattern solves it |
|---|---|
| A cheap model must hold a 30-turn thread | Embed each turn, retrieve the top-3 most-relevant history turns, pass only that slice |
| The customer asks "can you re-check what I said about the staging environment?" | Semantic search resolves the vague reference to the staging-environment turn from 18 messages back |
| You need to swap from Opus to Haiku to local without rewriting | Same code; the example demonstrates it works on all three |
On-call and incident response with cross-incident reasoning
A platform team needs to answer "did this break before?" questions. The on-call tool RAGs over the incident wiki and returns chunks; the engineer still has to follow the dependency chain.
| Concern | How this pattern solves it |
|---|---|
| Vector retrieval finds similar-sounding incidents but misses transitive dependencies | Graph traversal walks affects_service and caused_by edges and finds the actual root cause across hops |
| You need explainability — why did the bot suggest this is the same root cause? | Graph retrieval emits the path: "Service A → caused_by → Service B → previous_incident → INC-1234" |
| You already run NebulaGraph in production for service maps | The same example code switches to --backend nebula; no new infrastructure |
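The explainability row can be illustrated with a small sketch: a breadth-first search that remembers which edge it used to reach each node, then renders the path with its edge labels. This assumes a hand-built adjacency list and a hypothetical `explain_path` helper; it is not the output format of any Sagewai API.

```python
from collections import deque
from typing import Optional

# Adjacency list: node -> [(relation, neighbour)]. Hypothetical sample graph.
GRAPH = {
    "Service A": [("caused_by", "Service B"), ("depends_on", "Service C")],
    "Service B": [("previous_incident", "INC-1234")],
    "Service C": [],
    "INC-1234": [],
}


def explain_path(start: str, goal: str) -> Optional[str]:
    """Shortest path from start to goal, rendered with the edge labels it crossed."""
    parents = {start: None}  # node -> (previous node, relation), or None for start
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for relation, neighbour in GRAPH.get(node, []):
            if neighbour not in parents:
                parents[neighbour] = (node, relation)
                queue.append(neighbour)
    if goal not in parents:
        return None  # no chain of edges connects the two nodes
    parts, node = [goal], goal
    while parents[node] is not None:
        prev, relation = parents[node]
        parts += [relation, prev]
        node = prev
    return " → ".join(reversed(parts))


path = explain_path("Service A", "INC-1234")
print(path)
```

Because the path is assembled from the edges actually walked, the explanation cannot drift from the retrieval result.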
Compliance-document Q&A with multi-hop reasoning
A RAG bot over a 5K-page compliance corpus returns chunks. Auditors ask "if Section 5.2 changes, which downstream policies need a review?" and the bot cannot answer.
| Concern | How this pattern solves it |
|---|---|
| Multi-hop questions like "which Y depend on X, and what is the latest update to each?" fail with vector | Graph retrieval over typed edges (depends_on, references, superseded_by) walks the chain |
| You need to explain answers if a compliance review comes up | The graph path is the audit trail; print it next to the answer |
| The corpus updates weekly | Re-extract the graph nightly; embedding-based vector RAG alone is not sufficient |
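The auditor's question combines two walks: collect everything downstream of the changed section, then follow `superseded_by` to the newest revision of each hit. The sketch below shows that shape over plain dictionaries. The document names and both helpers are hypothetical, and a real pipeline would first have to extract these edges from the corpus.

```python
# document -> documents it depends on (hypothetical sample data)
DEPENDS_ON = {
    "Policy-12": ["Section-5.2"],
    "Policy-30": ["Policy-12"],
    "Policy-44": ["Section-9.1"],
}
# document -> the revision that replaced it (hypothetical sample data)
SUPERSEDED_BY = {"Policy-12": "Policy-12-v2", "Policy-12-v2": "Policy-12-v3"}


def downstream(doc: str) -> list[str]:
    """All documents that depend on doc, directly or through other documents."""
    found, stack = [], [doc]
    while stack:
        current = stack.pop()
        for candidate, deps in DEPENDS_ON.items():
            if current in deps and candidate not in found:
                found.append(candidate)
                stack.append(candidate)
    return found


def latest(doc: str) -> str:
    """Follow superseded_by edges until reaching the newest revision."""
    seen = {doc}
    while doc in SUPERSEDED_BY and SUPERSEDED_BY[doc] not in seen:
        doc = SUPERSEDED_BY[doc]
        seen.add(doc)  # guards against an accidental cycle in the data
    return doc


review = {doc: latest(doc) for doc in downstream("Section-5.2")}
print(review)
```

`Policy-30` never cites Section 5.2 itself; it appears only because the walk passes through `Policy-12`.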
Scribe app for primary-care physicians
A scribe summarises a 45-minute consultation. The doctor says "go back to what they said about the rash" halfway through writing the summary.
| Concern | How this pattern solves it |
|---|---|
| HIPAA forbids sending full transcripts to a frontier API without a BAA | Local model + semantic-checkpoint pattern: only the rash-relevant slice goes to the LLM; the transcript stays on-prem |
| 7B local models lose the thread on long consultations | They do not, if they only see the relevant slice — that is the point |
| Doctor wants to verify the bot's recall | Print the slice next to the summary; the doctor reads it and signs off |
SDK basics — memory examples
Start here if you want to understand the storage and retrieval primitives before running the examples above.
| # | Example | What it adds |
|---|---|---|
| 04 | memory_agent | Basic agent memory |
| 29 | memory_strategies | Strategy-based extraction (semantic facts, preferences, summaries) |
| 31 | grounded_multi_model | Multi-LLM grounded retrieval |
| 32 | global_shared_memory | Cross-agent shared knowledge |
See also
- Concept page: Memory and RAG, the API surface for `RAGEngine`, `VectorMemory`, `GraphMemory`, and `MemoryBranch`.
- SDK overview: SDK, the substrate these examples exercise.
- Related tutorial: Train your own model — pairs naturally with semantic-checkpoint recall (cheap LLM plus smart retrieval).
- Related tutorial: Production multitenancy — the on-call agent's full Sealed boundary.