Agent Long-Term Memory
Your agent doesn’t forget because its context is full. It forgets because nobody wrote anything down. The gap between a chatbot and a colleague is a memory system — and in 2026 that system is finally its own design problem, separate from retrieval and separate from the prompt.
§ 00 · THE GOLDFISH PROBLEM
A full context window is not a memory
Ask a demo agent to “remember that I’m allergic to peanuts,” and it will — for exactly as long as that sentence stays inside the context window. Close the tab, start a new session, and the agent is a stranger again. This is the single most common gap between something that demos well and something people actually rely on.
The confusion is understandable: a 400,000-token context window looks like a lot of memory. But a context window is working memory, not long-term memory — it is scoped to a single conversation and cleared when that conversation ends. Even within one long session it degrades: models use the middle of a long context far worse than the ends, so “just put everything in the prompt” quietly loses information long before it runs out of tokens.
The fix is to treat memory as its own subsystem: something that decides what to write down, where to keep it, and what to pull back when it’s relevant. That subsystem is what turns a stateless tool into something that accumulates a working relationship with you.
§ 01 · THREE PLACES TO PUT MEMORY
Context, retrieval, and an explicit store
Every memory design is some mix of three substrates. They are not rivals so much as layers — but each has a distinct failure mode, and knowing which one you’re leaning on tells you which failure to expect.
- The context window. Keep the conversation in the prompt. Zero infrastructure, perfect fidelity — until the window fills and the oldest turns fall off the front. Bounded and per-session by definition.
- A retrieval store. Embed past turns and facts into a vector index; fetch the most similar ones each turn. Durable and cheap, and the natural extension of RAG to conversation history. Its blind spot is time: nearest-neighbour search will happily return a fact that used to be true.
- An explicit memory module. Store facts as structured records that the agent can query, update, and invalidate. Increasingly this is a small temporal knowledge graph: a knowledge graph whose edges carry validity intervals, so a new fact (“moved to Berlin”) can invalidate an old one (“lives in Toronto”) rather than sitting alongside it. More machinery, but the only substrate that handles a fact changing.
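The first substrate's failure mode is easy to reproduce. The sketch below is a minimal illustration, not any framework's API: `ContextBuffer` and `count_tokens` are hypothetical names, and the word-count tokenizer is a crude stand-in for a real one.

```python
from collections import deque


def count_tokens(text: str) -> int:
    # Assumption: one token per whitespace-separated word. A real tokenizer
    # would give different counts, but the eviction behaviour is the same.
    return len(text.split())


class ContextBuffer:
    """Keeps the most recent turns that fit inside a fixed token budget."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.turns = deque()  # oldest turn on the left
        self.used = 0

    def add(self, turn: str) -> None:
        self.turns.append(turn)
        self.used += count_tokens(turn)
        # Evict from the front (oldest first) until the budget is met again.
        while self.used > self.max_tokens and len(self.turns) > 1:
            self.used -= count_tokens(self.turns.popleft())

    def prompt(self) -> str:
        return "\n".join(self.turns)


buf = ContextBuffer(max_tokens=12)
buf.add("I am allergic to peanuts")           # 5 tokens
buf.add("Suggest a snack for my hike")        # 6 tokens, 11 in total
buf.add("Make it something high in protein")  # 6 more, so the oldest turn is evicted
print(buf.prompt())
```

After the third turn the allergy statement is no longer in the prompt, even though nothing marked it as unimportant. Eviction here is purely positional.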
§ 02 · THE SESSION-15 TEST
Tell it six things. Change one. Wait.
Here is a test that separates the strategies cleanly. In session 1, a user tells the agent six things about herself. In session 8, one of them changes — she moves from Toronto to Berlin. In session 15, you ask two questions: what do you remember about me? and where do I live? The first measures raw recall; the second measures whether the memory can handle an update, which is where naive designs fall over.
Lab, “No memory” mode: the agent is a goldfish. Everything from session 1 is gone by session 2, which is cheap and useless for anything longitudinal. Figures are illustrative of the failure modes, not a benchmark.
Three things to feel in the lab. “No memory” fails immediately: it’s the demo default and it’s a goldfish. “Full transcript” fails slowly and expensively: perfect until it truncates, and the bill grows every session. “Retrieval” fails on the update: it keeps cost flat but serves the stale “Toronto” because similarity has no notion of which fact is current. Only the strategies that write structured facts and invalidate old ones pass both questions at a fixed cost.
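The retrieval failure can be reproduced in a few lines. This is a toy sketch under stated assumptions: `AppendOnlyStore` and `overlap` are hypothetical names, and word overlap stands in for embedding similarity. A real vector index would score differently, but it has the same blind spot, because nothing in the ranking knows which record is current.

```python
from __future__ import annotations


def overlap(query: str, text: str) -> int:
    # Toy similarity: number of shared lowercase words (assumption: this
    # replaces cosine similarity over embeddings for illustration only).
    return len(set(query.lower().split()) & set(text.lower().split()))


class AppendOnlyStore:
    """Naive retrieval memory: every fact is appended, nothing is retired."""

    def __init__(self) -> None:
        self.records: list[tuple[int, str]] = []  # (session, text)

    def write(self, session: int, text: str) -> None:
        self.records.append((session, text))

    def search(self, query: str, k: int = 1) -> list[str]:
        ranked = sorted(self.records, key=lambda r: overlap(query, r[1]), reverse=True)
        return [text for _, text in ranked[:k]]


store = AppendOnlyStore()
store.write(1, "the user lives in Toronto")
store.write(1, "the user is a backend engineer")
store.write(8, "the user moved to Berlin")  # the session-8 update

top = store.search("which city does the user live in", k=1)
print(top)  # ['the user lives in Toronto']
```

The stale record wins because its wording is closer to the question than the update's wording is. Raising `k` returns both cities, which leaves the model to guess which one still holds.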
§ 03 · WRITING IT DOWN
Summaries, facts, and temporal graphs
The interesting engineering is on the write side — deciding what is worth keeping and in what shape. Production systems in 2026 tend to run three write paths at once, from cheapest to richest:
- Rolling summaries. Periodically compress the conversation into a running summary that rides along in the prompt. Cheap and durable, but lossy: re-summarizing erodes the least-salient details first, as the lab’s “rolling summary” mode shows.
- Extracted facts. Run a small model over each turn to pull out durable statements (“user is a backend engineer”) and store them as records. Retrieval over facts beats retrieval over raw turns because the unit is already distilled.
- Temporal edges. Store facts as graph edges with validity intervals. When a new fact contradicts an old one, you invalidate rather than append: the move to Berlin closes the Toronto edge. This is the piece that passes the update test, and it is the direction that recent work, such as RaMem’s contextual reinstatement and temporal-graph memory frameworks, has pushed in.
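The temporal-edge write path can be sketched as follows. This is a minimal illustration, not the design of any cited framework: `Edge` and `TemporalStore` are hypothetical names, time is measured in session numbers, and the store assumes each (subject, relation) pair holds a single value at a time.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Edge:
    subject: str
    relation: str
    obj: str
    valid_from: int                 # session in which the fact became true
    valid_to: Optional[int] = None  # None means the edge is still open


class TemporalStore:
    def __init__(self) -> None:
        self.edges: list[Edge] = []

    def assert_fact(self, subject: str, relation: str, obj: str, session: int) -> None:
        # Assumption: one value per (subject, relation), so a new value
        # closes the open edge instead of sitting alongside it.
        for edge in self.edges:
            if (edge.subject, edge.relation) == (subject, relation) and edge.valid_to is None:
                if edge.obj == obj:
                    return  # already known, nothing to write
                edge.valid_to = session
        self.edges.append(Edge(subject, relation, obj, valid_from=session))

    def current(self, subject: str, relation: str) -> Optional[str]:
        for edge in self.edges:
            if (edge.subject, edge.relation) == (subject, relation) and edge.valid_to is None:
                return edge.obj
        return None


graph = TemporalStore()
graph.assert_fact("user", "lives_in", "Toronto", session=1)
graph.assert_fact("user", "lives_in", "Berlin", session=8)
print(graph.current("user", "lives_in"))  # Berlin
```

The Toronto edge is kept with `valid_to=8` rather than deleted, so the store can still answer questions about the past while serving only Berlin as the current value.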
The read side then becomes a small retrieval problem over a much cleaner corpus: instead of searching thousands of raw turns, you query a few hundred structured facts, filtered to the ones currently valid. Smaller, cleaner, and time-aware — which is exactly what the raw-transcript and naive-retrieval strategies lack.
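A time-aware read might look like the sketch below: filter to facts valid at the moment of the query, then rank what remains. The names `Fact`, `is_valid`, and `recall` are hypothetical, validity is expressed in session numbers, and word overlap again stands in for a real similarity score.

```python
from __future__ import annotations

from typing import NamedTuple, Optional


class Fact(NamedTuple):
    text: str
    valid_from: int
    valid_to: Optional[int]  # None means still valid


def is_valid(fact: Fact, as_of: int) -> bool:
    # Half-open interval: valid from valid_from up to, but not including, valid_to.
    return fact.valid_from <= as_of and (fact.valid_to is None or as_of < fact.valid_to)


def recall(facts: list[Fact], query: str, as_of: int, k: int = 3) -> list[str]:
    words = set(query.lower().split())
    live = [f for f in facts if is_valid(f, as_of)]           # time filter first
    scored = [(len(words & set(f.text.lower().split())), f) for f in live]
    scored = [pair for pair in scored if pair[0] > 0]         # drop irrelevant facts
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [f.text for _, f in scored[:k]]


facts = [
    Fact("user lives in Toronto", valid_from=1, valid_to=8),
    Fact("user lives in Berlin", valid_from=8, valid_to=None),
    Fact("user is a backend engineer", valid_from=1, valid_to=None),
]

print(recall(facts, "which city does the user live in", as_of=15, k=1))
# ['user lives in Berlin']
```

Running the same query with `as_of=5` returns the Toronto fact instead, which is the point of filtering on validity before ranking: the stale record never competes with the current one.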
§ 04 · LEARNING WHAT TO REMEMBER
Memory as a policy, not a dump
Writing everything down is its own failure: a memory store that keeps every utterance is just a slower, more expensive transcript. The 2026 shift is to treat what-to-remember as a decision the agent learns, not a fixed rule. Frameworks like MemAgent train the memory operations themselves — when to write, what to keep, what to overwrite — with reinforcement learning against long-horizon tasks, rather than hand-coding a heuristic.
You don’t need RL to benefit from the framing. Even a hand-written policy improves sharply once you make the four operations explicit: write (is this durable enough to keep?), select (which stored facts are relevant now?), compress (can these three facts become one?), and invalidate (does this new fact retire an old one?). Those are the same operations agentic context engineering applies within a single loop — long-term memory is the same discipline, extended across sessions and grounded in a durable store.
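Making the four operations explicit can be as simple as giving each one its own function. The sketch below is a hand-written policy under loud assumptions: memory is a plain dict of slot names to values, the `DURABLE_MARKERS` keyword list is an invented heuristic, and a production system would likely replace `write` and `select` with model calls.

```python
from __future__ import annotations

# Hypothetical heuristic: phrases that tend to introduce durable facts.
DURABLE_MARKERS = ("i am", "i'm", "i live", "i work", "i prefer", "my ")


def write(utterance: str) -> bool:
    """write: is this durable enough to keep?"""
    text = utterance.lower()
    return any(marker in text for marker in DURABLE_MARKERS)


def select(memory: dict[str, str], query: str) -> dict[str, str]:
    """select: which stored facts are relevant now? Matches on slot name."""
    words = set(query.lower().replace("?", "").split())
    return {slot: value for slot, value in memory.items() if slot in words}


def compress(memory: dict[str, str], slots: list[str], merged_slot: str) -> None:
    """compress: fold several facts into one record."""
    parts = [memory.pop(slot) for slot in slots if slot in memory]
    if parts:
        memory[merged_slot] = "; ".join(parts)


def invalidate(memory: dict[str, str], archive: list[tuple[str, str]],
               slot: str, new_value: str) -> None:
    """invalidate: retire the old value instead of keeping both."""
    if slot in memory and memory[slot] != new_value:
        archive.append((slot, memory[slot]))
    memory[slot] = new_value


memory = {"city": "Toronto", "language": "Python", "editor": "Vim"}
archive: list[tuple[str, str]] = []

print(write("I live in Berlin now"))   # True
print(write("thanks, that works"))     # False
invalidate(memory, archive, "city", "Berlin")
compress(memory, ["language", "editor"], "tooling")
print(select(memory, "what city am I in?"))  # {'city': 'Berlin'}
```

Keeping the operations separate means each can be tested, logged, and later swapped for a learned component without touching the other three.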
§ 05 · A REFERENCE MEMORY STACK
How the layers fit together
The shape that keeps recurring: a thin memory manager between the agent and the stores. On the way in, it decides what to write and where. On the way out, it selects the handful of currently-valid facts worth spending context on. The agent itself stays simple; the intelligence is in the manager’s four operations and the temporal store behind them. Build that layer once and every session after the first starts warm.
§ · FURTHER READING
References & deeper sources
- (2026). Memory in the Age of AI Agents: A Survey (paper list) · GitHub · Agent-Memory-Paper-List
- (2026). RaMem: Contextual Reinstatement for Long-term Agentic Memory · arXiv:2606.22844
- (2025). MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent · OpenReview
- (2025). Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents · arXiv:2509.23040
- (2023). Lost in the Middle: How Language Models Use Long Contexts · arXiv:2307.03172 (TACL 2024)
- (2026). Blueprint: Give Your Agent Long-Term Memory · Brain Drip Blueprints
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.