# Context rot breaks agents. 4 patterns fix it after 32K tokens
30%.
That is the accuracy drop for information located in the middle of a context window, according to Stanford’s "Lost in the Middle" findings, a phenomenon replicated and expanded by Chroma Research in July 2025.
Everyone is buying larger context windows. Almost no one is fixing the retrieval mechanics inside them.
## The 32K cliff is a feature, not a bug
The dominant narrative in late 2025 was that context length was a hardware problem. If Llama 4 Scout ships with a 10M-token window and Anthropic, Google, and OpenAI all offer 1M-token contexts, the reasoning went, we simply need to dump more data into the prompt and let the transformer attend to it.
This assumption is empirically false.
Llama 3.1 405B, despite its massive parameter count, sees correctness drop significantly at approximately 32K tokens, a quarter of its rated 128K context, according to Databricks evaluations. The model does not fail because it runs out of memory. It fails because the signal-to-noise ratio inside the attention mechanism degrades as the sequence grows without structured pruning.
Chroma Research’s July 2025 study provides the most damning evidence against raw context dumping. They analyzed 18 frontier models across 194,480 LLM calls, testing 8 input lengths and 11 needle positions. The result was consistent across all 18 models: performance was higher on shuffled haystacks than on coherent documents.
This counter-intuitive finding suggests that semantic coherence, which engineers assume helps the model "understand" the text, actually creates interference. When a document is coherent, semantically similar tokens cluster, and attention concentrates on those clusters rather than on the target span. Shuffling breaks the clusters, forcing the attention mechanism to scan more broadly.
**Context rot is not a storage issue; it is an attention allocation failure.**
Karpathy defined "context engineering" on June 25, 2025, as "the delicate art and science of filling the context window with just the right information for the next step." This definition shifts the burden from the model architecture to the pipeline architecture. If you are relying on the model to find the needle, you have already lost. You must ensure the needle is the only thing in the box.
The implication for senior engineers is stark. The 1M-token window is a trap. It invites laziness in data preprocessing. It encourages developers to skip the hard work of semantic chunking, recursive summarization, and dynamic eviction policies.
## The illusion of infinite memory
Steelmanning the pro-context argument requires acknowledging the utility of long-range dependencies. In a codebase, a constant defined at line 10 of `config.py` may be referenced at line 10,000 of `main.py`. Drop that definition from context, and the agent can no longer reason about the reference. Therefore, the argument goes, we need the full file tree in context to maintain referential integrity.
This view is correct regarding *necessity* but wrong regarding *method*.
The opposing side gets one thing right: static retrieval-augmented generation (RAG) is insufficient for agentic workflows. An agent needs to know what it *doesn't* know, and it needs to see the broader structure to plan its next move. A pure vector search approach often misses the structural relationships between files, leading to hallucinated imports or broken dependency chains.
However, the solution is not to inflate the context window to 1M tokens. The solution is to compress the structural knowledge while keeping the executable details local.
The error lies in treating all tokens as equal. A docstring has a different attention weight requirement than a function signature. A comment has a different retention priority than a test case. By dumping everything into a single flat sequence, we force the model to treat a copyright header with the same gravitational pull as the core business logic.
This is why the 30% accuracy drop on mid-context information matters. It is not random noise. It is systematic neglect of the "boring" middle sections where most business logic resides. The model attends to the beginning (primacy effect) and the end (recency effect), leaving the middle to decay.
## MCP and the shift to dynamic context
Anthropic’s release of the Model Context Protocol (MCP) in late 2024 signaled the industry’s pivot from static prompts to dynamic context servers. MCP allows agents to query external data sources on demand, rather than pre-loading them into the context window.
This is the first production pattern that fixes context rot: **lazy loading via protocol**.
Instead of injecting the entire PostgreSQL schema into the prompt, the agent connects to an MCP server that exposes the schema. The agent queries only the relevant tables when needed. This keeps the active context small and focused.
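Here is a minimal sketch of that lazy-loading pattern, assuming the FastMCP helper from the reference Python SDK (`pip install mcp`); the `SCHEMAS` dict is a stand-in for a live `information_schema` query.

```python
# Minimal MCP server exposing schema lookups on demand (lazy loading via protocol).
# Assumes the FastMCP helper from the reference Python SDK.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("schema-server")

# Stand-in for a live information_schema query against PostgreSQL.
SCHEMAS = {
    "orders": "orders(id bigint, customer_id bigint, total numeric, created_at timestamptz)",
    "customers": "customers(id bigint, email text, created_at timestamptz)",
}

@mcp.tool()
def list_tables() -> list[str]:
    """Table names only: a few dozen tokens instead of the whole schema."""
    return sorted(SCHEMAS)

@mcp.tool()
def describe_table(name: str) -> str:
    """Column definitions for one table, fetched only when the agent asks."""
    return SCHEMAS.get(name, f"unknown table: {name}")

if __name__ == "__main__":
    mcp.run()
```

The agent's active context now carries two short tool descriptions instead of every table definition in the database.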
Consider the case of Claude Code, released by Anthropic in February 2025. It does not read every file in a repository. It searches the file system agentically, with grep- and glob-style tools, to identify relevant files, then loads them into context only when executing a specific task.
This approach mirrors how operating systems manage memory. Virtual memory allows programs to address more memory than physically available by swapping pages in and out of disk. Context engineering must adopt a similar paging mechanism. Tokens are swapped in from the "disk" (vector store or file system) into "RAM" (the active context window) only when the attention mechanism requires them.
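A toy version of that pager fits in a class, assuming a `fetch` callable that pulls chunks from the vector store or file system and a crude whitespace token count; the LRU policy here is deliberately naive, which is exactly what the fourth pattern below replaces.

```python
from collections import OrderedDict
from typing import Callable

class ContextPager:
    """Toy virtual memory for context: page chunks in on demand, evict when over budget."""

    def __init__(self, fetch: Callable[[str], str], budget_tokens: int):
        self.fetch = fetch                                    # "disk": vector store or file system
        self.budget = budget_tokens
        self.resident: OrderedDict[str, str] = OrderedDict()  # "RAM": the active context

    def _tokens(self, text: str) -> int:
        return len(text.split())                              # crude proxy; swap in a real tokenizer

    def page_in(self, chunk_id: str) -> str:
        if chunk_id in self.resident:
            self.resident.move_to_end(chunk_id)               # mark as recently used
        else:
            self.resident[chunk_id] = self.fetch(chunk_id)    # page fault: load from "disk"
        while (len(self.resident) > 1 and
               sum(map(self._tokens, self.resident.values())) > self.budget):
            self.resident.popitem(last=False)                 # evict the least recently used page
        return self.resident[chunk_id]

    def render(self) -> str:
        return "\n\n".join(self.resident.values())            # what actually enters the prompt
```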
The second pattern is **recursive summarization with anchor points**. Instead of storing the full history of an agent’s thought process, the system maintains a running summary. However, this summary is not a flat text block. It is a structured object with links to specific token ranges in the original history.
If the agent needs to verify a claim made three turns ago, it does not re-read the entire conversation. It follows the anchor link to the specific tokens. This preserves the fidelity of the original data while keeping the working memory clean.
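One way to sketch that structure, with illustrative names; `history` is assumed to be the per-turn token log kept outside the prompt.

```python
from dataclasses import dataclass, field

@dataclass
class Anchor:
    """Pointer from a summary bullet back to the exact tokens it compresses."""
    turn: int
    start_token: int
    end_token: int

@dataclass
class RunningSummary:
    bullets: list[tuple[str, Anchor]] = field(default_factory=list)

    def add(self, bullet: str, anchor: Anchor) -> None:
        self.bullets.append((bullet, anchor))

    def render(self) -> str:
        # Only the compressed bullets enter the prompt; anchors stay out of band.
        return "\n".join(f"- {text}" for text, _ in self.bullets)

    def expand(self, index: int, history: list[list[int]]) -> list[int]:
        # Follow the anchor to the original tokens instead of re-reading the whole turn log.
        _, a = self.bullets[index]
        return history[a.turn][a.start_token:a.end_token]
```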
The third pattern is **semantic deduplication**. In large codebases, similar patterns repeat. Boilerplate code, standard imports, and common utility functions consume tokens without adding informational value. A pre-processing layer that identifies and collapses these duplicates into references can reduce context size by 20-40% without losing functionality.
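The simplest form of that pre-processing layer is exact-match deduplication after whitespace normalization; an embedding-similarity pass would catch near-duplicates as well. A sketch:

```python
import hashlib

def deduplicate(chunks: list[str]) -> list[str]:
    """Collapse repeated chunks (boilerplate, standard imports) into short references."""
    seen: dict[str, int] = {}     # normalized hash -> index of first occurrence
    out: list[str] = []
    for chunk in chunks:
        key = hashlib.sha256(" ".join(chunk.split()).encode()).hexdigest()
        if key in seen:
            out.append(f"[same as chunk {seen[key]}]")   # reference instead of repetition
        else:
            seen[key] = len(out)
            out.append(chunk)
    return out
```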
The fourth pattern is **active eviction based on relevance scoring**. As the agent works, it assigns a relevance score to each chunk of context. Chunks below a certain threshold are evicted. This is not a simple LRU (Least Recently Used) cache. It is a semantic cache that understands that a variable definition introduced 50 turns ago is still relevant if it is used in the current step, while a brainstorming session from 10 turns ago is irrelevant.
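A sketch of that eviction pass, assuming a `relevance` scorer of your choosing (cosine similarity between chunk and current-step embeddings, for example) and an illustrative threshold:

```python
from typing import Callable

def evict(context: list[dict], current_step: str,
          relevance: Callable[[str, str], float],
          threshold: float = 0.35) -> list[dict]:
    """Keep a chunk only if it scores above threshold against the *current* step.
    Age is irrelevant: a 50-turn-old definition still in use survives,
    a 10-turn-old brainstorm that is not gets dropped."""
    kept = []
    for chunk in context:
        score = relevance(chunk["text"], current_step)
        if score >= threshold or chunk.get("pinned"):   # e.g. pin active variable definitions
            kept.append({**chunk, "score": score})
    return kept
```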
## The SWE-bench discrepancy
The gap between benchmark performance and real-world reliability is best illustrated by Claude Opus 4.5. It scores 80.9% on SWE-bench Verified, a respected benchmark for software engineering tasks. However, on Scale AI's SWE-bench Pro, it scores only 45.9%.
This 35-point gap is not due to model stupidity. It is due to context contamination.
SWE-bench Verified provides clean, isolated problems with well-defined contexts. Scale AI's Pro benchmark includes noisy, real-world repositories with messy histories, broken tests, and ambiguous requirements. The model fails not because it cannot code, but because it cannot distinguish the signal from the noise in a large, unstructured context.
OpenAI’s internal audit revealed that 59.4% of audited Verified problems had flawed test cases, leading to the retirement of the Verified set in favor of Pro. This admission underscores the fragility of current evaluation methods. If the benchmarks are flawed, the models optimized for them will fail in production.
**Production reliability requires optimizing for noise, not just signal.**
Engineers who build agents for real-world use cases must assume that the context will be dirty. It will contain deprecated functions, conflicting documentation, and outdated comments. The agent’s ability to navigate this rot determines its success.
Using a 1M-token window to brute-force through this noise is inefficient and expensive. It increases latency and cost while decreasing accuracy. The four patterns described above—lazy loading, recursive summarization, semantic deduplication, and active eviction—are not optional optimizations. They are prerequisites for reliable agentic systems.
## The 90-day execution plan
Gartner Senior Director Anushree Verma predicts that 40%+ of agentic-AI projects will be cancelled by the end of 2027. The primary driver of these cancellations will not be model capability. It will be context management failure.
Agents that work in demos fail in production because demos use clean, small datasets. Production uses messy, large datasets.
For the next 90 days, senior engineers should stop chasing larger context windows. Instead, they should invest in context engineering infrastructure.
First, implement an MCP server for your primary data sources. Decouple data access from prompt construction.
Second, audit your agent’s context usage. Measure the token count of irrelevant data. If more than 20% of the context is boilerplate or history that is not directly referenced in the current step, you have a rot problem.
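A rough audit takes a few lines; this sketch assumes tiktoken for counting and leaves the "referenced" judgment to whatever tracing you already have.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def rot_ratio(chunks: list[dict]) -> float:
    """Fraction of context tokens spent on chunks the current step never touches.
    Each chunk is {"text": str, "referenced": bool}; set `referenced` from your
    own tracing (did the last action read this file, table, or turn?)."""
    total = wasted = 0
    for chunk in chunks:
        n = len(enc.encode(chunk["text"]))
        total += n
        if not chunk["referenced"]:
            wasted += n
    return wasted / total if total else 0.0

# rot_ratio(...) > 0.20 is the threshold described above.
```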
Third, replace static prompts with dynamic context assembly. Build a pipeline that selects, compresses, and injects data based on the specific task at hand.
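A minimal assembly pass might look like this, assuming `relevance` and `summarize` callables from your own stack and a crude whitespace token proxy:

```python
from typing import Callable

def assemble_context(task: str, candidates: list[str],
                     relevance: Callable[[str, str], float],
                     summarize: Callable[[str], str],
                     budget_tokens: int = 8_000) -> str:
    """Select, compress, inject: rebuild the context per task, not per session."""
    ranked = sorted(candidates, key=lambda c: relevance(c, task), reverse=True)
    selected: list[str] = []
    used = 0
    for chunk in ranked:
        cost = len(chunk.split())        # crude token proxy
        if used + cost > budget_tokens:
            chunk = summarize(chunk)     # compress before giving up on the chunk
            cost = len(chunk.split())
        if used + cost <= budget_tokens:
            selected.append(chunk)
            used += cost
    return f"# Task\n{task}\n\n# Context\n" + "\n\n".join(selected)
```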
The models are smart enough. The context windows are big enough. The bottleneck is now the engineering discipline required to feed them correctly.
Cliffs are the easier failure mode. Cliffs you can detect.
The middle is where agents go to die.
