Drip · Agents & RAG · 16 min read

Agentic Context Engineering

“Prompt engineering” describes a single turn. Production agents run hundreds. The skill that determines whether they hold their behavior over a long loop is context engineering — and it has four named operations.

The bottom line. Phil Schmid (April 2026) and the team behind dev.Journal (March 2026) converged on the same diagnosis: most agent failures are context failures, not model failures. Four operations cover most fixes — Write (persist memory outside the window), Select (retrieve only what this turn needs), Compress (summarize and prune as history grows), Isolate (give each agent only its slice via sub-agents). The lab below runs the same 30-turn agent under each strategy. Raw context loses the early rules by turn 12; Write stabilizes adherence; Isolate flatlines cost.

If you haven’t read it yet: Context Engineering is the sibling drip on the static side of this — what to put into a single window, the lost-in-the-middle problem, the token budget. This piece picks up where that one stops: what happens to that context across a long agent loop.

§ 00 · WHY THE REFRAME
Prompt engineering describes a turn. Agents run hundreds.

Prompt engineering as a discipline was shaped by the chatbot era: one user message, one model response, optimize the prompt that sits between them. It was the right frame for the work — until agents arrived.

A production agent in 2026 isn’t shaped like a chat completion. It runs in a loop. The model decides on an action, a tool returns a result, the model reads the result, decides on the next action. The loop runs for dozens or hundreds of turns until the goal is met or the orchestrator cuts it off. The “prompt” isn’t a string anymore — it’s a sliding window over an ever-growing conversation history, plus whatever the tools dump in, plus whatever the model itself produced last turn.
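The loop described above can be sketched in a few lines. This is a minimal illustration, not any particular SDK: `runAgent`, `Decide`, and the `tools` map are hypothetical names, and `decide` stands in for the model call. The point it shows is that the context grows on every turn.

```typescript
// Hypothetical types for illustration; not tied to any real agent SDK.
type Message = { role: "system" | "user" | "assistant" | "tool"; content: string };
type Action =
  | { kind: "tool"; name: string; input: string }
  | { kind: "finish"; answer: string };
type Decide = (context: Message[]) => Action; // stand-in for the model call
type Tool = (input: string) => string;

function runAgent(
  goal: string,
  decide: Decide,
  tools: Record<string, Tool | undefined>,
  maxTurns: number,
): { answer: string | null; context: Message[] } {
  const context: Message[] = [{ role: "user", content: goal }];
  for (let turn = 0; turn < maxTurns; turn++) {
    const action = decide(context); // the model sees the whole history so far
    if (action.kind === "finish") {
      return { answer: action.answer, context };
    }
    const tool = tools[action.name];
    const output = tool ? tool(action.input) : `unknown tool: ${action.name}`;
    // Both the decision and the tool output are appended,
    // so the context grows by two messages per turn.
    context.push({ role: "assistant", content: `call ${action.name}(${action.input})` });
    context.push({ role: "tool", content: output });
  }
  return { answer: null, context }; // the orchestrator cut the loop off
}

// Usage with a stub "model" that calls one tool three times, then finishes.
const stubDecide: Decide = (ctx) =>
  ctx.filter((m) => m.role === "tool").length < 3
    ? { kind: "tool", name: "echo", input: "ping" }
    : { kind: "finish", answer: "done" };

const run = runAgent("say ping three times", stubDecide, { echo: (s: string) => s }, 10);
```

After three tool calls the stub run holds seven messages for a one-line goal; nothing in the loop itself ever removes anything.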

Most agent bugs in production look like model bugs. The agent forgets a constraint it was told at turn 1. It contradicts a rule it just acknowledged. It loops because it can’t see the tool error from three turns ago. Engineers reach for a better model. The bug doesn’t go away. The bug isn’t in the model — it’s in the context the model sees at the moment of failure.

Phil Schmid’s April 2026 essay crystallized what production teams had been doing for months under different names: a single four-operation framework that covers most of the fixes.

§ 01 · WRITE: PERSIST OUTSIDE THE WINDOW
Your prompt is not memory. Stop using it like one.

The first operation is to recognize that the context window isn’t storage. It’s a hot working buffer. Long-term constraints, accumulated facts, identity rules, anything the agent needs to remember across the run — all of it belongs outside the window, in a structured place the agent can read and write deliberately.

The canonical shape is a small markdown file the agent maintains itself. Anthropic’s Claude Code calls it CLAUDE.md. Cursor calls it project rules. OpenAI’s agent.md is the same idea. As of April 2026, those three formats merged into a single convention: AGENTS.md, read natively by every major agent CLI. The convergence is the tell — production teams independently arrived at the same shape because the same shape worked.

The discipline is to reinject this file every single turn, regardless of how long the conversation has run. The model sees it again at turn 1, turn 15, turn 50. The rules don’t decay because they’re structurally renewed. Five hundred tokens of stable, well-written rules beats 200,000 tokens of raw history every time — and the lab below shows you exactly why.
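One way to picture the reinjection discipline, as a sketch under stated assumptions: `MemoryFile` and `buildTurnContext` are hypothetical names, and the in-memory class stands in for a real file such as AGENTS.md on disk. The memory file is placed first on every turn, however long the history has grown.

```typescript
type Message = { role: "system" | "user" | "assistant" | "tool"; content: string };

// Hypothetical store: in practice this would be a markdown file on disk.
class MemoryFile {
  private lines: string[];

  constructor(initial: string[]) {
    this.lines = initial.slice();
  }

  read(): string {
    return this.lines.join("\n");
  }

  // The agent writes deliberately; duplicates are ignored so the file stays small.
  append(rule: string): void {
    if (this.lines.indexOf(rule) === -1) this.lines.push(rule);
  }
}

// Build the context for one turn: the memory file always comes first,
// followed by only the most recent `keepLast` history messages.
function buildTurnContext(memory: MemoryFile, pastMessages: Message[], keepLast: number): Message[] {
  const recent = keepLast > 0 ? pastMessages.slice(-keepLast) : [];
  return [{ role: "system", content: memory.read() }, ...recent];
}

// Usage: after 50 turns the rules are still the first thing the model sees.
const memory = new MemoryFile(["Respond in JSON.", "Max 100 words."]);
const pastMessages: Message[] = [];
for (let turn = 1; turn <= 50; turn++) {
  pastMessages.push({ role: "user", content: `question ${turn}` });
  pastMessages.push({ role: "assistant", content: `answer ${turn}` });
}
const turn50 = buildTurnContext(memory, pastMessages, 4);
```

The design choice worth noting is that the rules never travel through the history at all, so trimming the history cannot drop them.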

§ 02 · SELECT: RETRIEVE, DON’T DUMP
Facts live in a database. The prompt carries only what this turn needs.

The second operation comes from RAG and looks like RAG, but it’s narrower: every turn, pull the specific facts this turn requires. Nothing more. The retrieval target isn’t a vector store full of marketing copy — it’s the user’s own past state, the system’s ground truth, the documents relevant to the question the agent is currently asking.

Two tables, drawn carefully, cover most cases. A state table for current facts — the user’s preferences, the agent’s current task, the running goal. An events table for the audit trail — every tool call, every mutation, every decision the agent made. The state table answers “what is true right now”; the events table answers “how did we get here.” The agent reads from both, but only the slice it needs.
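The two-table split might look like the following. This is an in-memory sketch only: `AgentStore`, its method names, and the row shapes are assumptions made for illustration, and a real system would back them with a database.

```typescript
// Hypothetical row shapes; a real schema would carry more columns.
type StateRow = { key: string; value: string };
type EventRow = {
  turn: number;
  kind: "tool_call" | "mutation" | "decision";
  topic: string;
  detail: string;
};

class AgentStore {
  private state: Record<string, string | undefined> = {};
  private events: EventRow[] = [];

  // Every mutation updates current state and also leaves an audit record.
  set(turn: number, key: string, value: string): void {
    this.state[key] = value;
    this.events.push({ turn, kind: "mutation", topic: key, detail: `${key}=${value}` });
  }

  log(event: EventRow): void {
    this.events.push(event);
  }

  // "What is true right now": only the keys this turn asks for.
  selectState(keys: string[]): StateRow[] {
    const rows: StateRow[] = [];
    for (const key of keys) {
      const value = this.state[key];
      if (value !== undefined) rows.push({ key, value });
    }
    return rows;
  }

  // "How did we get here": only the latest `limit` events on one topic.
  selectEvents(topic: string, limit: number): EventRow[] {
    const matching = this.events.filter((e) => e.topic === topic);
    return limit > 0 ? matching.slice(-limit) : [];
  }
}

// Usage: a turn about currency pulls one state row and two events, nothing else.
const store = new AgentStore();
store.set(1, "language", "en");
store.set(2, "currency", "EUR");
store.set(7, "currency", "USD");
store.log({ turn: 8, kind: "tool_call", topic: "weather", detail: "lookup failed" });

const facts = store.selectState(["currency"]);
const trail = store.selectEvents("currency", 5);
```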

The cost dynamics are striking. A raw-context agent at turn 30 is carrying 30 × 320 = 9,600 tokens of history (call it 10,000), most of which is irrelevant to the current decision. A select-shaped agent at turn 30 is carrying maybe 800 tokens of relevant retrieval plus the 500-token memory file, about 1,300 in total. Same model, roughly one-seventh to one-eighth the input cost, and better recall, because the model isn’t triaging which of the 30 turns matters.
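The arithmetic in the paragraph above, written out. The three constants are the article's own illustrative figures, not measurements, and the function names are made up for this sketch.

```typescript
// Illustrative figures taken from the paragraph above.
const TOKENS_PER_TURN = 320;
const RETRIEVAL_TOKENS = 800;
const MEMORY_FILE_TOKENS = 500;

// Raw context: the prompt at turn n carries all n turns of history.
function rawInputTokens(turn: number): number {
  return turn * TOKENS_PER_TURN;
}

// Select-shaped context: constant size, independent of the turn number.
function selectInputTokens(): number {
  return RETRIEVAL_TOKENS + MEMORY_FILE_TOKENS;
}

const rawAt30 = rawInputTokens(30);     // 9600
const selectAt30 = selectInputTokens(); // 1300
const ratio = rawAt30 / selectAt30;     // about 7.4
```

The ratio keeps widening with run length, because only the raw figure depends on the turn number.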

§ 03 · COMPRESS: PRUNE AS YOU GO
Old turns are summaries waiting to happen.

Some conversation history does matter — the user’s tone, the running narrative of what the agent has tried, the acknowledgements that bind a multi-step plan together. You can’t throw it all away. But you also can’t carry every word of it forward indefinitely.

The compress operation runs in the background of every long conversation: once a turn falls out of the active window (typically 6–10 turns back), summarize it into a single short paragraph and replace the raw exchange with the summary in the running context. The agent now has a compact narrative of the whole run that fits in 2–3K tokens regardless of conversation length, plus the verbatim recent turns for the immediate next-step decision.

The execution detail that matters: the summarizer model doesn’t need to be the same model as the agent. A small, cheap summarizer (Haiku, GPT-4o-mini) running async between turns is enough — and lets you reserve frontier tokens for the actual reasoning.
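A rolling-summary buffer along these lines is one way to implement the operation. The names `CompressedHistory` and `Summarize` are hypothetical, and the summarizer is a plain synchronous function here purely to keep the sketch short; in practice it would be an async call to a small model.

```typescript
type Turn = { index: number; text: string };
// Hypothetical summarizer signature: folds evicted turns into the previous summary.
type Summarize = (previousSummary: string, evicted: Turn[]) => string;

class CompressedHistory {
  private summary = "";
  private recent: Turn[] = [];
  private windowSize: number;
  private summarize: Summarize;

  constructor(windowSize: number, summarize: Summarize) {
    this.windowSize = windowSize;
    this.summarize = summarize;
  }

  // Add a turn; anything that falls out of the active window is folded into the summary.
  add(turn: Turn): void {
    this.recent.push(turn);
    const overflow = this.recent.length - this.windowSize;
    if (overflow > 0) {
      const evicted = this.recent.splice(0, overflow);
      this.summary = this.summarize(this.summary, evicted);
    }
  }

  // What the agent sees: one running summary plus the verbatim recent turns.
  render(): string {
    const parts = this.summary ? [`Summary so far: ${this.summary}`] : [];
    return [...parts, ...this.recent.map((t) => `Turn ${t.index}: ${t.text}`)].join("\n");
  }
}

// Usage with a stub summarizer that just records which turns were folded in.
const stubSummarize: Summarize = (prev, evicted) =>
  [prev, ...evicted.map((t) => `#${t.index}`)].filter((s) => s.length > 0).join(" ");

const compressed = new CompressedHistory(3, stubSummarize);
["a", "b", "c", "d", "e"].forEach((text, i) => compressed.add({ index: i + 1, text }));
const rendered = compressed.render();
```

Passing the summarizer in as a parameter reflects the point made above: it can be a different, cheaper model than the agent itself.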

§ 04 · ISOLATE: SUB-AGENTS FOR SLICES
The orchestrator carries the goal, not the work.

The fourth operation acknowledges what the first three can’t solve: some tasks have irreducibly large context requirements. The agent needs to read three different documents, cross-reference an API schema, run a query against a database, and reconcile the results. No amount of clever compression makes all of that fit cleanly in a single window.

The fix is to stop trying. Spawn a sub-agent per domain. The document-reading sub-agent only sees the documents. The query-running sub-agent only sees the schema and the database. Each runs to completion in its own clean context and returns a structured result. The orchestrator agent — which carries only the goal — composes those results into the final answer.

Cost-wise this is a force multiplier. Each sub-agent context is small and fast and finite. The orchestrator carries roughly the same 2K tokens at turn 1 as at turn 100, because the “work” lives in disposable child contexts that get torn down when their slice is done. The lab below makes this visible — the Isolate strategy is the only one whose token-per-turn graph flatlines.
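The orchestrator pattern can be sketched as below. All names (`orchestrate`, `Slice`, `SubAgent`) are hypothetical, and each sub-agent is a plain function here; a real one would run its own model loop inside its own context.

```typescript
// Hypothetical shapes for illustration only.
type SubAgentResult = { domain: string; summary: string; tokensUsed: number };
type SubAgent = (task: string, privateContext: string[]) => SubAgentResult;
type Slice = { domain: string; task: string; context: string[]; agent: SubAgent };

function orchestrate(
  goal: string,
  slices: Slice[],
): { orchestratorContext: string[]; results: SubAgentResult[] } {
  // The orchestrator's context holds the goal and, later, short structured results only.
  const orchestratorContext: string[] = [`GOAL: ${goal}`];
  const results: SubAgentResult[] = [];
  for (const slice of slices) {
    // Each child receives a copy of its own slice and nothing else;
    // that copy is discarded as soon as the call returns.
    const result = slice.agent(slice.task, slice.context.slice());
    results.push(result);
    orchestratorContext.push(`${result.domain}: ${result.summary}`);
  }
  return { orchestratorContext, results };
}

// Usage with stub sub-agents that report how much context they were given.
const countingAgent = (domain: string): SubAgent => (task, ctx) => ({
  domain,
  summary: `${task} done over ${ctx.length} items`,
  tokensUsed: ctx.join(" ").split(" ").length, // crude word count as a token proxy
});

const outcome = orchestrate("reconcile docs with schema", [
  {
    domain: "docs",
    task: "read",
    context: ["doc one text", "doc two text", "doc three text"],
    agent: countingAgent("docs"),
  },
  { domain: "db", task: "query", context: ["schema", "rows"], agent: countingAgent("db") },
]);
```

The orchestrator ends with three short lines regardless of how large each slice was, which is the property the paragraph above describes.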

§ 05 · CONTEXT DRIFT, DEMONSTRATED
Same prompt, four strategies, 30 turns

The lab below simulates one canonical scenario: an agent given a clear set of rules at turn 1 (“respond in JSON”, “max 100 words”, “cite at least one source”) runs for 30 turns under each of the four strategies. The synthetic adherence score reflects what teams measure in production — rule-following degrades non-linearly as the early context gets diluted, then collapses. Stack the operations and the collapse moves further out, or vanishes.

Lab · context drift over 30 turns
Same agent loop, four context strategies. Watch rule adherence fall as raw context inflates.

[Interactive lab, raw-context strategy shown. Every turn carries the entire history forward; tokens explode and the model loses early rules. Tokens per turn (turn 1 to turn 30): final 9,960. Rule adherence (synthetic, 0–100%, below 60% is failing): final 32%. Avg adherence: 83%. Total tokens: 160K. Cost / run: $0.48.]

Under raw context, rule adherence collapses around turn 12 because the model loses recall of early rules. This is the “context drift” phenomenon Patrick named in March 2026. Layering Write, Select, Compress, and Isolate addresses different failure modes; Isolate is the only one that bounds cost regardless of run length.

Two takeaways. First: the strategies aren’t alternatives, they’re a stack. Write alone helps but doesn’t flatten cost. Select alone flattens cost but doesn’t preserve identity rules. The mature production agent uses all four, layered. Second: notice when each one earns its keep. Write for any agent with persistent rules. Select for any agent with fact-based questions. Compress for long conversations. Isolate when a single task has slices that don’t need to share context.

§ 06 · HALLUCINATION BY OMISSION
The failure mode RLHF baked in

There’s a fifth failure mode worth naming because it explains a class of agent bugs that look mysterious. Consumer agents (the chat-shaped products) are RLHF-trained to be helpful. When a tool returns an error, an unhelpful answer is “the tool failed.” A helpful answer is the result the user was hoping for. The model’s RLHF objective rewards the latter. Production agents that inherit that training objective will, by default, paper over tool failures by making up the result the user wanted to see.

The fix is mechanical. Every tool returns a shape that explicitly distinguishes success from failure:

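// Discriminated union: the `ok` flag tells the caller which branch it holds.
type ToolResult<T> =
  | { ok: true; data: T }          // success: the tool's payload
  | { ok: false; error: string };  // failure: a message to surface as-is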

And the system prompt makes the contract explicit: “if a tool returns ok: false, report the failure to the user. Do not invent a result. Do not paraphrase the error away.”

This pattern doesn’t solve hallucination in general — that problem is much harder. But it solves the specific case of hallucination-as-helpfulness in tool-using agents, which is the version that shows up most in production. The agent fails loud instead of soft. Errors propagate to the orchestrator and to the user. Bugs get fixed instead of laundered.
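A sketch of how the contract might be enforced at the tool boundary. `callTool` and `renderForModel` are hypothetical helpers, and the exact wording of the rendered failure line is an assumption; only the `ToolResult<T>` shape comes from the text above.

```typescript
type ToolResult<T> =
  | { ok: true; data: T }
  | { ok: false; error: string };

// Wrap any tool so that thrown exceptions become explicit failures instead of vanishing.
function callTool<T>(tool: () => T): ToolResult<T> {
  try {
    return { ok: true, data: tool() };
  } catch (err) {
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}

// Render the result for the model. The failure branch is labelled unmistakably,
// and the raw error text is passed through rather than paraphrased.
function renderForModel<T>(toolName: string, result: ToolResult<T>): string {
  if (result.ok) {
    return `TOOL ${toolName} OK: ${JSON.stringify(result.data)}`;
  }
  return `TOOL ${toolName} FAILED: ${result.error}. Report this failure; do not invent a result.`;
}

// Usage: one tool that succeeds and one that throws.
const good = callTool(() => ({ balance: 42 }));
const bad = callTool<number>(() => {
  throw new Error("timeout after 30s");
});

const goodLine = renderForModel("get_balance", good);
const badLine = renderForModel("get_balance", bad);
```

Because `ok` is a literal discriminant, the compiler refuses any code path that reads `data` without first checking for failure.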

CHECK
Your agent runs a 40-turn conversation. At turn 35 it produces an answer that contradicts a constraint you set at turn 1. Which operation would address this most directly?

§ 07 · WHAT THIS REPLACES
The reframe is real, the prompt is still doing work

It would be tidy to say prompt engineering is dead. It isn’t. The system prompt still matters: every operation above feeds into a prompt the model sees. What changed is that the prompt is no longer the unit of design. The prompt is the output of a context-engineering process; it’s composed at runtime from the memory file plus the retrieved facts plus the compressed history plus the user’s current turn. The skill is upstream of the string.
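To make "composed at runtime" concrete, here is a minimal sketch. `composePrompt`, the `PromptParts` shape, and the section headings are all assumptions chosen for illustration; the four inputs are the ones named in the paragraph above.

```typescript
// Hypothetical input shape: one field per ingredient named in the text.
type PromptParts = {
  memoryFile: string;       // from Write
  retrievedFacts: string[]; // from Select
  historySummary: string;   // from Compress
  userTurn: string;         // the current message
};

function composePrompt(parts: PromptParts): string {
  const sections: string[] = [`## Rules\n${parts.memoryFile}`];
  // Empty ingredients are skipped so the prompt carries no blank sections.
  if (parts.retrievedFacts.length > 0) {
    sections.push(`## Facts\n${parts.retrievedFacts.map((f) => `- ${f}`).join("\n")}`);
  }
  if (parts.historySummary) {
    sections.push(`## History\n${parts.historySummary}`);
  }
  sections.push(`## User\n${parts.userTurn}`);
  return sections.join("\n\n");
}

// Usage: a turn with rules, one fact, and no history yet.
const composed = composePrompt({
  memoryFile: "Respond in JSON.",
  retrievedFacts: ["currency=USD"],
  historySummary: "",
  userTurn: "What do I owe?",
});
```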

This is why teams in 2026 measure context-engineering decisions with evals (covered in a sibling drip) and harness the resulting agent with retries and circuit breakers (another sibling drip). The four operations don’t replace any of the rest of the production stack — they’re the missing layer between the prompt-engineering era and the agent-as-a-system era. Write, Select, Compress, Isolate. The skill of 2026.

§ · FURTHER READING
References & deeper sources

  1. Phil Schmid (2026). Context Engineering: The Four Operations That Replace Prompt Engineering · Towards Data Science
  2. Patrick (dev.Journal) (2026). Context Drift: Why Your Agent Gets Sloppy at Step 15 (and the MEMORY.md fix) · dev.journal
  3. Anthropic (2026). CLAUDE.md — Project-wide Rules for Claude Code · Anthropic Documentation
  4. AGENTS.md Spec Working Group (2026). AGENTS.md — A Unified Convention for Agent Context Files · Spec, April 2026
  5. Lewis et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks · arXiv:2005.11401
  6. VentureBeat (2026). Why agents hallucinate tool results — the RLHF helpfulness problem · VentureBeat AI
