Memory and retrieval — make a cheap LLM hold the thread

This tutorial shows you how to give a small or local LLM the same conversational memory and multi-hop reasoning that a frontier model handles natively. After reading it you will know which retrieval pattern to reach for — semantic checkpoint recall, graph-over-vector, or both — and you will have runnable examples for each.

Before you start

A working Sagewai install (pip install sagewai).
One LLM you can call. Any of these works: an Anthropic key, an OpenAI key, or a local Ollama install with llama3.2 pulled.
For the graph-retrieval example, optionally a NebulaGraph instance if you want to exercise the production backend rather than the in-memory one.

Pick a pattern

Choose by the question your application needs to answer.

Pattern	When it wins	Companion example
Semantic checkpoint recall	Long multi-turn conversations where vague references like "back to that earlier point" must resolve to the right history slice. Lets a 7B local model behave like a frontier model on long sessions.	37 — semantic checkpoint recall
Graph-over-vector retrieval	Multi-hop questions over typed relations ("what depends on X, and what broke last time any of those did?"). Vector chunking finds similar text; graph traversal finds the actual chain.	41 — graph memory incident dependency

Most production apps end up wanting both: vector for "what did we say about X?", graph for "what depends on X?".

How semantic checkpoint recall works

You embed each new turn and run a semantic search over the conversation history. Only the focused slice (typically the three most relevant turns) goes to the LLM. The model never sees the full thread, so a small model can keep up with a long conversation.
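
The retrieval step can be sketched in plain Python. This is a minimal illustration, not Sagewai's API: `embed`, `cosine`, and `recall_slice` are hypothetical names, and the bag-of-words "embedding" is a stand-in that only matches shared words. Resolving a truly vague reference needs a real embedding model in its place.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (assumption: stands in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def recall_slice(history: list[str], query: str, k: int = 3) -> list[str]:
    """Return the k history turns most similar to the query, in original order."""
    q = embed(query)
    ranked = sorted(range(len(history)), key=lambda i: cosine(embed(history[i]), q), reverse=True)
    return [history[i] for i in sorted(ranked[:k])]


# Illustrative conversation; the off-topic turns should be filtered out.
history = [
    "The launch date for the product is set for March.",
    "My cat knocked over the coffee this morning.",
    "Pricing for the launch will be tiered by seat count.",
    "Did you watch the football game last night?",
    "Marketing needs the launch pricing page by Friday.",
]
focused = recall_slice(history, "back to the launch pricing", k=3)
for turn in focused:
    print(turn)
```

Returning the slice in original order keeps the chronology intact for the model, which tends to matter more for small models than ranking order does.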

The same code runs on claude-opus, gpt-4o-mini, and ollama/llama3:8b. The example cycles through all three and prints the per-LLM agreement on the recalled slice.

Run it

pip install sagewai
ollama pull llama3.2
python 37_semantic_checkpoint_recall.py

The script builds a 12-turn conversation about a fictional product launch, fires a deliberately vague "let's get back to our real business" turn, retrieves the focused slice, and compares per-LLM responses. Real numbers from a clean-machine run print at the end.

How graph-over-vector retrieval works

For questions that walk typed relations — "which services depend on Service A, and which of those caused incidents in the last 90 days?" — vector retrieval returns chunks that look similar to the query, but it cannot follow transitive structure across them. Graph traversal walks the edges and returns the actual chain.
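
To make the difference concrete, here is a small sketch of the two-step question above over typed edges. It uses only the standard library and does not reflect GraphMemory's API; the edge list, incident dates, and function names are all made up for illustration.

```python
from collections import deque
from datetime import date, timedelta

# (source, relation, target) triples; every name and date here is illustrative.
EDGES = [
    ("service-b", "depends_on", "service-a"),
    ("service-c", "depends_on", "service-b"),
    ("service-d", "depends_on", "service-x"),
    ("INC-1", "caused_by", "service-c"),
    ("INC-2", "caused_by", "service-b"),
    ("INC-3", "caused_by", "service-d"),
]
INCIDENT_DATES = {
    "INC-1": date(2024, 5, 20),
    "INC-2": date(2023, 11, 2),
    "INC-3": date(2024, 5, 25),
}


def dependents_of(service: str) -> set[str]:
    """All services that depend on `service`, directly or transitively."""
    found: set[str] = set()
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for src, rel, dst in EDGES:
            if rel == "depends_on" and dst == current and src not in found:
                found.add(src)
                queue.append(src)
    return found


def recent_incidents(services: set[str], today: date, days: int = 90) -> list[str]:
    """Incidents caused by any of `services` within the last `days` days."""
    cutoff = today - timedelta(days=days)
    return sorted(
        src for src, rel, dst in EDGES
        if rel == "caused_by" and dst in services and INCIDENT_DATES[src] >= cutoff
    )


deps = dependents_of("service-a")
hits = recent_incidents(deps, today=date(2024, 6, 1))
print(sorted(deps), hits)
```

Note that `service-c` never mentions `service-a` directly, so a similarity search on "Service A" has no reason to surface it; the traversal reaches it through `service-b`.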

GraphMemory ships with both an in-memory backend and a NebulaGraph backend behind one API. QueryRouter auto-classifies a query as relational vs lexical and dispatches to the right store.
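
The source does not describe how QueryRouter classifies queries, so the following is only a guess at the general idea: a keyword heuristic that sends relation-shaped questions to a graph store and everything else to a vector store. The cue list, `classify`, and `route` are hypothetical, and a production router would more plausibly use a trained classifier or an LLM call.

```python
import re

# Hypothetical cue phrases that suggest the query walks a typed relation.
RELATIONAL_CUES = re.compile(
    r"\b(depends? on|caused by|affects?|upstream|downstream|linked to|superseded by)\b",
    re.IGNORECASE,
)


def classify(query: str) -> str:
    """Label a query 'relational' if it contains a relation cue, else 'lexical'."""
    return "relational" if RELATIONAL_CUES.search(query) else "lexical"


def route(query: str, graph_search, vector_search):
    """Dispatch to the graph handler for relational queries, the vector handler otherwise."""
    handler = graph_search if classify(query) == "relational" else vector_search
    return handler(query)


# Stand-in handlers so the sketch runs without any store behind it.
print(route("What depends on Service A?", lambda q: "graph", lambda q: "vector"))
print(route("What did we say about pricing?", lambda q: "graph", lambda q: "vector"))
```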

Run it

python 41_graph_memory_incident_dependency.py

The script seeds 15–20 incidents and 5–10 services with realistic root-cause edges, issues four query types (single-hop, multi-hop, temporal, constraint propagation), and prints a side-by-side comparison against vector retrieval. It reports average traversal depth, answer completeness relative to vector retrieval, and p50/p99 latency.
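
If you want to reproduce this kind of report for your own data, the metrics are simple to compute. The sketch below is an assumption about how such numbers might be derived, not the example script's own code; `latency_percentiles` and `completeness` are hypothetical helper names.

```python
import statistics


def latency_percentiles(samples_ms: list[float]) -> tuple[float, float]:
    """Return (p50, p99) from latency samples; needs at least two samples."""
    # quantiles(n=100) yields 99 cut points; index 49 is p50, index 98 is p99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[98]


def completeness(retrieved: set[str], expected: set[str]) -> float:
    """Fraction of the expected answer items that a retriever actually returned."""
    return len(retrieved & expected) / len(expected) if expected else 1.0


# Illustrative inputs only; substitute your own measurements.
samples = [float(ms) for ms in range(1, 102)]
p50, p99 = latency_percentiles(samples)
graph_score = completeness({"INC-1", "INC-2", "INC-3"}, {"INC-1", "INC-2", "INC-3", "INC-4"})
vector_score = completeness({"INC-1"}, {"INC-1", "INC-2", "INC-3", "INC-4"})
print(p50, p99, graph_score, vector_score)
```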

To exercise the production NebulaGraph backend:

python 41_graph_memory_incident_dependency.py --backend nebula

You need a reachable NebulaGraph instance. The example reads its connection settings from environment variables — see the script header.

Where you'd use this

These patterns come up when your first agent starts losing the thread, or when your RAG pipeline returns chunks that look similar but do not answer the question.

Customer-support chatbot for long sessions

A support bot handles multi-turn conversations where customers walk it through reproducing a bug. Sessions hit 30 turns. Opus is fine but expensive at scale; Haiku loses context.

Concern	How this pattern solves it
A cheap model must hold a 30-turn thread	Embed each turn, retrieve the top-3 most-relevant history turns, pass only that slice
The customer asks "can you re-check what I said about the staging environment?"	Semantic search resolves the vague reference to the staging-environment turn from 18 messages back
You need to swap from Opus to Haiku to local without rewriting	Same code; the example demonstrates it works on all three
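
One way to keep the model swappable is to treat the backend as a plain function from prompt to completion and build the prompt from the recalled slice only. This is a sketch under that assumption; `Completer`, `build_prompt`, `answer`, and `echo_backend` are hypothetical names, not Sagewai's interface, and the stub backend stands in for a real client call.

```python
from typing import Callable

# Any function that maps a prompt string to a completion string.
Completer = Callable[[str], str]


def build_prompt(slice_turns: list[str], user_turn: str) -> str:
    """Assemble a prompt from the recalled slice only, not the full thread."""
    context = "\n".join(f"- {turn}" for turn in slice_turns)
    return f"Relevant earlier turns:\n{context}\n\nCustomer: {user_turn}\nAssistant:"


def answer(complete: Completer, slice_turns: list[str], user_turn: str) -> dict:
    """Call whichever backend was passed in; return the slice too, for inspection."""
    prompt = build_prompt(slice_turns, user_turn)
    return {"reply": complete(prompt), "slice": slice_turns}


def echo_backend(prompt: str) -> str:
    """Stand-in backend for offline runs; replace with a real model call."""
    return f"(stub reply to {len(prompt)} chars of prompt)"


result = answer(
    echo_backend,
    ["Staging returns 502 after deploy."],
    "Can you re-check what I said about the staging environment?",
)
print(result["reply"])
```

Swapping models then means passing a different function to `answer`; the retrieval and prompt-building code stay the same.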

On-call and incident response with cross-incident reasoning

A platform team needs to answer "did this break before?" questions. The on-call tool runs RAG over the incident wiki and returns chunks; the engineer still has to follow the dependency chain by hand.

Concern	How this pattern solves it
Vector retrieval finds similar-sounding incidents but misses transitive dependencies	Graph traversal walks `affects_service` and `caused_by` edges and finds the actual root cause across hops
You need explainability — why did the bot suggest this is the same root cause?	Graph retrieval emits the path: "Service A → caused_by → Service B → previous_incident → INC-1234"
You already run NebulaGraph in production for service maps	The same example code switches to `--backend nebula`; no new infrastructure
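
The explainability row above can be illustrated with a breadth-first search that records edge labels as it goes. This is a standalone sketch, not the library's traversal code; `explain_path` is a hypothetical helper and the edge list simply mirrors the path quoted in the table.

```python
from collections import deque

# Illustrative typed edges: (source, relation, target).
EDGES = [
    ("Service A", "caused_by", "Service B"),
    ("Service B", "previous_incident", "INC-1234"),
    ("Service A", "affects_service", "Service C"),
]


def explain_path(start: str, goal: str):
    """Breadth-first search that returns the shortest labelled path, or None."""
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return " → ".join(path)
        for src, rel, dst in EDGES:
            if src == node and dst not in seen:
                seen.add(dst)
                # Keep the relation name in the path so the answer is auditable.
                queue.append((dst, path + [rel, dst]))
    return None


print(explain_path("Service A", "INC-1234"))
```

Because the path is built from the same edges the answer came from, it can be printed next to the bot's suggestion without any extra bookkeeping.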

Compliance-document Q&A with multi-hop reasoning

A RAG bot over a 5K-page compliance corpus returns chunks. Auditors ask "if Section 5.2 changes, which downstream policies need a review?" and the bot cannot answer.

Concern	How this pattern solves it
Multi-hop questions like "which Y depend on X, and what is the latest update to each?" fail with vector retrieval alone	Graph retrieval over typed edges (`depends_on`, `references`, `superseded_by`) walks the chain
You need to explain answers if a compliance review comes up	The graph path is the audit trail; print it next to the answer
The corpus updates weekly	Re-extract the graph nightly; embedding-based vector RAG alone is not sufficient

Scribe app for primary-care physicians

A scribe summarises a 45-minute consultation. The doctor says "go back to what they said about the rash" halfway through writing the summary.

Concern	How this pattern solves it
HIPAA forbids sending full transcripts to a frontier API without a BAA	Local model + semantic-checkpoint pattern: only the rash-relevant slice goes to the LLM; the transcript stays on-prem
7B local models lose the thread on long consultations	They do not, if they only see the relevant slice — that is the point
Doctor wants to verify the bot's recall	Print the slice next to the summary; the doctor reads it and signs off

SDK basics — memory examples

Start here if you want to understand the storage and retrieval primitives before running the examples above.

#	Example	What it adds
04	memory_agent	Basic agent memory
29	memory_strategies	Strategy-based extraction (semantic facts, preferences, summaries)
31	grounded_multi_model	Multi-LLM grounded retrieval
32	global_shared_memory	Cross-agent shared knowledge
