Context Engineering Is All I’ve Been Writing About
I thought I was writing about different things:
| Post | What I Thought It Was About | What It Was Actually About |
|---|---|---|
| The 14K Token Debt | System prompt architecture | Layer 1 of the context window costs 14K tokens before you speak |
| The Terminal Was the First Agent Harness | Unix as agent pattern | How to namespace heterogeneous context sources |
| Your MCP Servers Are Costing You 10 Seconds | MCP performance overhead | Tool schemas consuming the context window silently |
| My AI Agent’s Memory Paid for Itself | Long-term agent memory | On-demand knowledge injection into context |
| Claude Code vs Gemini CLI | Tool comparison | Two different strategies for managing the same context window |
| Self-Improving AI Skills | Recursive improvement | How trajectory data in context compounds across sessions |
Every post answered one question: what occupies the context window before the model reasons, and what does it cost?
Andrej Karpathy defined it: “Context engineering is the delicate art and science of filling the context window with just the right information for the next step.” I just didn’t have the name.
What the Model Actually Sees
I measured my context window. Not estimated — measured. I ran my CLAUDE.md and memory files through tiktoken, counted my MCP tool definitions, and mapped what Claude Code loads before I type a single character.
Here’s the literal structure of a fresh session in this project:
[LAYER 1 — CONSTITUTION] ~3,800 tokens
Anthropic's system prompt: persona, safety rules,
coding standards, tool usage instructions.
(This is what "The 14K Token Debt" measured —
but 14K included tool schemas. This post
separates them into Layer 1 vs Layer 4.)
[LAYER 2 — IDENTITY] 1,160 tokens
~/.claude/CLAUDE.md 182 tokens ← measured
MEMORY.md (index) 163 tokens ← measured
Memory files (3 auto-loaded) 815 tokens ← measured
─────────────────────────────────
Subtotal: 1,160 tokens
[LAYER 3 — KNOWLEDGE] 0 tokens
Brain MCP results: loaded on-demand, not at boot.
File reads: loaded when the model calls Read tool.
[LAYER 4 — TOOLS] ~8,400 tokens
6 MCP servers, ~40 tools total.
After Tool Search deferral: ~1,800 tokens
(tool names + descriptions only, not full schemas).
[LAYER 5 — CONVERSATION] 23 tokens
My question: "fix the broken image path"
─────────────────────────────────────────────────────────
TOTAL AT FIRST TURN: ~13,383 tokens
TOTAL AFTER TOOL SEARCH: ~6,783 tokens
My question was 23 tokens. The infrastructure was 6,760 tokens (with Tool Search) or 13,360 tokens (without).
I call this the Signal Ratio — the percentage of the context window that’s actually your task. Mine was 0.17% at first turn without Tool Search. With Tool Search: 0.34%. Less than half a percent of my context window was the thing I wanted the agent to do.
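The calculation is simple enough to script. A minimal sketch using the measured figures from this session (`signal_ratio` is my own illustrative helper, not a Claude Code feature):

```python
# Signal Ratio: what fraction of the context window is the actual task.
def signal_ratio(task_tokens: int, layer_tokens: dict[str, int]) -> float:
    total = task_tokens + sum(layer_tokens.values())
    return task_tokens / total

# Measured first-turn values from this post (tokens).
infrastructure = {
    "constitution": 3_800,  # Layer 1
    "identity": 1_160,      # Layer 2
    "tools": 8_400,         # Layer 4, before Tool Search
}
print(f"{signal_ratio(23, infrastructure):.2%}")  # → 0.17%

infrastructure["tools"] = 1_800  # after Tool Search deferral
print(f"{signal_ratio(23, infrastructure):.2%}")  # → 0.34%
```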
Now, 13K tokens out of a 200K window is only 6.7% — plenty of headroom. But this is turn one. By turn ten, after file reads, tool calls, and assistant responses, the context has grown to 50,000-80,000 tokens. The infrastructure (Layers 1-4) stays fixed. The Signal Ratio climbs to 5-10%. But those fixed-cost tokens are still there — on every single turn — exerting gravitational pull on the model’s attention.
In The 14K Token Debt, I called this Prompt Gravity. In the MCP post, I called it Schema Gravity. They’re the same force operating at different layers. The unified name is Context Gravity — the cumulative attentional weight of all fixed-cost tokens across every layer of the stack. The heavier your context, the stronger the gravity, the harder it is for the model to attend to your actual question.
Prompt engineering asks “what should I put in the prompt?” That’s Layer 1 thinking — one-fifth of the stack. Context engineering asks: what should be in the entire window, at which layer, loaded when, at what cost?
The Context Engineering Stack
Call it the Context Engineering Stack: five layers that every agent system designs, whether explicitly or by accident.
| Layer | What Goes Here | When Loaded | My Setup (measured) | Key Risk |
|---|---|---|---|---|
| 1: Constitution | System prompt, safety rules, persona | Every turn (immutable) | ~3,800 tokens | Prompt Gravity — biases all downstream reasoning |
| 2: Identity | CLAUDE.md, memory files, project rules | Session start | 1,160 tokens | Stale instructions from old projects; lost in the middle |
| 3: Knowledge | RAG results, Brain MCP queries, file reads | On-demand | 0-50,000 tokens per query | Retrieval noise — wrong documents dilute reasoning |
| 4: Tools | MCP schemas, function definitions | Session start | 8,400 → 1,800 (after Tool Search) | Schema Gravity — unused tools waste budget |
| 5: Conversation | Messages, responses, tool results | Accumulates | Growing | Context rot — old turns degrade attention on recent ones |
The constraint is directional: layers below eat the budget of layers above. A bloated Layer 4 (55,000 tokens of MCP schemas from 5 servers, per Anthropic’s own measurement) directly starves Layer 5 and Layer 3. This is why I cut my global MCP servers from 9 to 2 — their Layer 4 cost was crushing my Layer 5 capacity.
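The zero-sum constraint can be made concrete. A sketch of the stack as a data structure, assuming a 200K window and the token figures from my audit (this is my own modeling, not Claude Code internals):

```python
from dataclasses import dataclass

@dataclass
class ContextStack:
    window: int = 200_000       # total context window (tokens)
    constitution: int = 3_800   # Layer 1: fixed, every turn
    identity: int = 1_160       # Layer 2: loaded at session start
    tools: int = 14_000         # Layer 4: schemas at session start
    # Layer 3 is on-demand; Layer 5 gets whatever is left.

    def layer5_budget(self) -> int:
        return self.window - (self.constitution + self.identity + self.tools)

before = ContextStack()                 # 9-server setup, estimated
after = ContextStack(tools=1_800)       # after Tool Search deferral
print(before.layer5_budget())           # → 181040
print(after.layer5_budget())            # → 193240
print(after.layer5_budget() - before.layer5_budget())  # → 12200 recovered
```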
Layer 5: The Neglected Layer
Every previous post in this series addressed Layers 1-4. Layer 5 — the conversation itself — is the one I never wrote about, and it’s where context engineering gets hardest.
Layer 5 grows with every turn. Each tool call adds the full result to the conversation history. A Read tool call on a 200-line file adds ~1,000 tokens. A Bash tool call returning build output adds 500-2,000 tokens. After 10 tool-heavy turns, Layer 5 can consume 40,000-60,000 tokens — dwarfing all other layers combined.
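The growth compounds mechanically. A back-of-envelope simulation under the per-call estimates above (the per-turn mix — three reads, one Bash call, a ~600-token exchange — is my own assumption, not a measurement):

```python
# Rough per-event token costs (from the estimates above).
READ_CALL = 1_000   # Read tool on a ~200-line file
BASH_CALL = 1_250   # midpoint of the 500-2,000 range for build output
EXCHANGE = 600      # user message + assistant response (assumed)

def layer5_after(turns: int, reads_per_turn: int = 3, bashes_per_turn: int = 1) -> int:
    """Estimated Layer 5 size after N tool-heavy turns."""
    per_turn = EXCHANGE + reads_per_turn * READ_CALL + bashes_per_turn * BASH_CALL
    return turns * per_turn

print(layer5_after(10))  # → 48500, inside the 40,000-60,000 range above
```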
The problem is context rot. In the 14K Token Debt post, I cited Liang et al. who found that instruction drift is universally measurable within eight conversation rounds. By turn eight, the model’s adherence to system prompt instructions starts degrading. By turn fifteen, it has collapsed into the statistical median of its pre-training distribution. Your carefully crafted Layer 1 constitution? It’s being diluted by the sheer volume of Layer 5 tokens.
The Lost in the Middle paper (Liu et al., Stanford) measured this precisely: on multi-document QA with 20 documents, moving the relevant document from position 1 to position 10 caused a ~20 percentage point accuracy drop on GPT-3.5-Turbo. The model attends to the beginning and end of context; everything in the middle is an attention valley. Your Layer 2 instructions (CLAUDE.md) live in that valley.
This is why the most effective Claude Code users put critical instructions in two places: Layer 1 (system prompt, which sits at the very beginning) and Layer 5 (the current message, which sits at the very end). The middle is unreliable territory.
Claude Code vs Gemini CLI: Two Context Stacks
The five layers exist in both tools, but the engineering choices at each layer differ:
| Layer | Claude Code | Gemini CLI |
|---|---|---|
| 1: Constitution | ~3,800 tokens. Detailed coding standards, tool usage rules, style preferences. | ~2,000 tokens. Lighter system prompt, more deference to user config. |
| 2: Identity | Hierarchical: ~/.claude/CLAUDE.md → project → subdirectory. Lazy loading — subdirectory files load only when agent reads files there. | ~/.gemini/GEMINI.md → project root + ancestors. Eager loading — all concatenated and sent every prompt. |
| 3: Knowledge | No built-in knowledge system. Extended via MCP (e.g., Brain MCP). | Built-in google_web_search and web_fetch tools for real-time grounding. |
| 4: Tools | Tool Search defers schemas — 89% reduction (77K → 8.7K tokens per Anthropic). | No schema deferral. All tool definitions injected every turn. |
| 5: Conversation | Adaptive auto-compact. 5-layer compression pipeline. | Fixed threshold compression (50%, changed from 70%). Known compression loop bug. |
The architectural bet is clear. Claude Code invests heavily in Layer 4 optimization (Tool Search) and Layer 5 management (multi-layer compression). Gemini CLI bets on a massive context window (1M tokens) to make optimization less critical — but without schema deferral, even 1M tokens fill when you connect enough MCP servers.
Context Mounting — the pattern from the Terminal post where heterogeneous context sources are projected into a uniform namespace — described the what. The Context Engineering Stack describes the when and the cost.
Audit Your Context Budget
Run this in a Claude Code session:
# Step 1: Count your MCP tools
/mcp
# Output shows: brain (4 tools), sequential-thinking (1 tool), etc.
# Count total tools across all servers.
# Step 2: Estimate Layer 4 cost
# Each tool schema ≈ 200-500 tokens of JSON.
# Rule of thumb: total_tools × 350 = Layer 4 budget
# My setup: 40 tools × 350 = ~14,000 tokens (before Tool Search)
# Step 3: Measure Layer 2 exactly
# Outside Claude Code:
pip install tiktoken
python3 -c "
import tiktoken
# Note: tiktoken approximates Claude's tokenizer; treat counts as estimates.
enc = tiktoken.encoding_for_model('gpt-4')
with open('$HOME/.claude/CLAUDE.md') as f:
    print(f'CLAUDE.md: {len(enc.encode(f.read()))} tokens')
"
Here’s my audit — before and after optimization:
| Layer | Before (9 MCP servers, 2025) | After (2 global + 4 project-scoped, 2026) | Delta |
|---|---|---|---|
| 1: Constitution | ~3,800 | ~3,800 | No change (Anthropic controls this) |
| 2: Identity | 1,160 | 1,160 | No change (my CLAUDE.md + memory) |
| 3: Knowledge | On-demand | On-demand | No change |
| 4: Tools | ~14,000 estimated (40 tools × 350 rule of thumb) | ~1,800 (after Tool Search) | -87% |
| 5: Available budget | 181,040 | 193,240 | +12,200 tokens recovered |
| Signal Ratio (turn 1) | 0.17% | 0.34% | 2x improvement |
Those 12,200 recovered tokens translate directly to more file reads, longer conversations, and fewer auto-compact triggers. The context budget is zero-sum: every token saved at one layer is a token available at every other layer.
The Cutting Edge: Context as a Callable Tool
Current context management is passive. Claude Code’s auto-compact triggers when the context grows too large. Gemini CLI triggers at a fixed 50% threshold. Both approaches share the same flaw: the agent doesn’t decide to manage its context. The harness decides for it.
The CAT (Context as a Tool) paper by Liu et al. proposes the fix: make context management a tool the agent can call, just like read_file or bash. The agent decides when to compress, what to retain, and how to structure its memory.
CAT organizes context into three zones that map to the stack:
| CAT Zone | Context Engineering Stack Layer | Function |
|---|---|---|
| Fixed Segment | Layers 1-2 (Constitution + Identity) | Stable anchor — never compressed |
| Long-Term Memory | Layer 3 (Knowledge) | Condensed high-fidelity summaries, evolving |
| Working Memory | Layer 5 (Conversation) | Last K ReAct steps, fine-grained details |
Their trained model, SWE-Compressor, hit 57.6% on SWE-Bench Verified — outperforming both ReAct agents and static compression baselines — while maintaining stable reasoning under a bounded context budget.
The insight: the agent that manages its own context outperforms the agent that lets the harness manage it. My Brain MCP (Layer 3) is a primitive version of this — the agent queries its own past on-demand rather than loading everything at startup. But it’s still me deciding when to query. The next step is the agent deciding when to compress, what to keep, and what to fold.
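What that next step could look like: exposing compression as a tool schema the model can call. This is a hypothetical sketch in the spirit of CAT, not the paper's actual interface — the tool name, parameters, and handler are all my invention:

```python
# Hypothetical "compress_context" tool, CAT-style: the agent, not the
# harness, decides when to compress and what must survive.
COMPRESS_CONTEXT_TOOL = {
    "name": "compress_context",
    "description": (
        "Summarize older conversation turns into long-term memory, "
        "keeping the last K steps verbatim as working memory."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "keep_last_steps": {
                "type": "integer",
                "description": "K recent steps kept verbatim",
            },
            "pin": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Decisions that must survive compression",
            },
        },
        "required": ["keep_last_steps"],
    },
}

def handle_compress(history: list[str], keep_last_steps: int,
                    pin: list[str] = ()) -> list[str]:
    """Toy handler: fold old turns into one summary line, keep pins + recent."""
    old, recent = history[:-keep_last_steps], history[-keep_last_steps:]
    summary = f"[summary of {len(old)} earlier turns]"
    return [summary, *pin, *recent]
```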
Related work pushes this further: LongLLMLingua achieves 4x prompt compression with ~95% performance retention by scoring tokens for relevance to the downstream question and reordering compressed chunks to mitigate the lost-in-the-middle effect. DSPy’s MIPROv2 optimizer automates context engineering entirely — jointly optimizing instructions, few-shot examples, and their combination across multi-stage pipelines, improving performance by 5-15 percentage points over hand-tuned prompts. These aren’t prompt tricks. They’re automated context engineering.
Where This Breaks
Lost in the middle is real, and Layer 2 lives in the valley. Liu et al. (2023) measured a ~20 percentage point accuracy drop when relevant information moves from positions 1 or 20 to position 10 in a 20-document context. Your CLAUDE.md instructions sit between the system prompt (beginning) and the conversation (end) — directly in the attention valley. The mitigation: put your most critical rules at the very top of CLAUDE.md (closer to Layer 1) and re-state them in your current message (Layer 5). Don’t bury important instructions at line 50 of a 100-line CLAUDE.md.
Compression is lossy, and the loss is invisible. Auto-compact summarizes your conversation history, but the summary may drop the specific decision you made in turn 3 that’s load-bearing for turn 15. I’ve seen this in practice: after auto-compact fires, Claude Code occasionally “forgets” a constraint I set earlier — not because the constraint was wrong, but because compression deemed it low-priority. The fix: checkpoint critical decisions explicitly. Say “Remember: we decided PostgreSQL over MongoDB because of X” in your current message. Put it in Layer 5 (recent, high-attention), not just in the history (compressed, low-attention).
Token counting is approximate. You never know your exact context budget until you hit the limit. Different models tokenize differently — the same text may be 1,000 tokens for Claude and 1,200 for Gemini. A “200K context window” isn’t 200K words; it’s roughly 150K words of English text, less for code. Build in margin: target 70% utilization, not 95%.
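A margin check under that rule fits in one function. The 70% target and the 1.2x padding factor are the rough figures from this section (tokenizers disagreeing by ~20%), not any standard:

```python
def within_budget(estimated_tokens: int, window: int = 200_000,
                  target_utilization: float = 0.70, fudge: float = 1.2) -> bool:
    """Is the (approximate) context safely inside the window?
    `fudge` pads the estimate because tokenizers disagree by ~20%."""
    return estimated_tokens * fudge <= window * target_utilization

print(within_budget(100_000))  # 120,000 <= 140,000 → True
print(within_budget(130_000))  # 156,000 >  140,000 → False
```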
Context engineering is model-specific. CLAUDE.md works for Claude Code. GEMINI.md works for Gemini CLI. But the optimal Layer 1 for Claude (direct, imperative: “Always use TypeScript”) differs from the optimal Layer 1 for Gemini (structured, hierarchical context). Claude’s lazy-loading Layer 2 and Tool Search Layer 4 have no equivalents in Gemini CLI. A context strategy doesn’t transfer cleanly across models — rewrite, don’t copy.
Context Gravity Is the Unifying Force
Prompt Gravity was the observation that system prompt tokens bias all downstream reasoning. Schema Gravity was the observation that tool definitions exert the same force. They’re the same phenomenon at different layers.
Context Gravity is the unified name: the cumulative attentional weight of all fixed-cost tokens across every layer of the stack. The heavier your context, the stronger the gravity, the harder it is for the model to attend to the signal — your actual question, buried at the bottom of 13,000 tokens of infrastructure.
The Context Engineering Stack is the map. The Signal Ratio is the metric. Context Gravity is the force you’re fighting. Every decision in agent architecture — how many MCP servers to load, how long to let conversations run, when to compress, what to put in CLAUDE.md — is a context engineering decision. The teams building reliable agents are the ones who treat it as one.
Run /mcp. Count the tools. Measure your CLAUDE.md. Calculate your Signal Ratio. If your agent’s context is 95% infrastructure and 5% signal, you’re not engineering context — you’re drowning in it.
Sharad Jain is an AI engineer and the author of The 14K Token Debt, The Terminal Was the First Agent Harness, Your MCP Servers Are Costing You 10 Seconds, My AI Agent’s Memory Paid for Itself, Claude Code vs Gemini CLI, and I Built an AI Skill That Started Improving Itself. He writes about agent architecture, system prompts, and the infrastructure decisions that compound across every session. This is the sixth post in a series on the hidden mechanics of agentic AI systems.