
Context Engineering Is All I've Been Writing About

I wrote six posts about system prompts, MCP servers, terminal harnesses, agent memory, CLI comparisons, and self-improving skills. They're all about the same thing: what occupies the context window before the model reasons.

I thought I was writing about different things:

| Post | What I Thought It Was About | What It Was Actually About |
|---|---|---|
| The 14K Token Debt | System prompt architecture | Layer 1 of the context window costs 14K tokens before you speak |
| The Terminal Was the First Agent Harness | Unix as agent pattern | How to namespace heterogeneous context sources |
| Your MCP Servers Are Costing You 10 Seconds | MCP performance overhead | Tool schemas consuming the context window silently |
| My AI Agent's Memory Paid for Itself | Long-term agent memory | On-demand knowledge injection into context |
| Claude Code vs Gemini CLI | Tool comparison | Two different strategies for managing the same context window |
| Self-Improving AI Skills | Recursive improvement | How trajectory data in context compounds across sessions |

Every post answered one question: what occupies the context window before the model reasons, and what does it cost?

Andrej Karpathy defined it: “Context engineering is the delicate art and science of filling the context window with just the right information for the next step.” I just didn’t have the name.


What the Model Actually Sees

I measured my context window. Not estimated — measured. I ran my CLAUDE.md and memory files through tiktoken, counted my MCP tool definitions, and mapped what Claude Code loads before I type a single character.

Here’s the literal structure of a fresh session in this project:

[LAYER 1 — CONSTITUTION]                              ~3,800 tokens
  Anthropic's system prompt: persona, safety rules,
  coding standards, tool usage instructions.
  (This is what "The 14K Token Debt" measured —
   but 14K included tool schemas. This post
   separates them into Layer 1 vs Layer 4.)

[LAYER 2 — IDENTITY]                                   1,160 tokens
  ~/.claude/CLAUDE.md                    182 tokens  ← measured
  MEMORY.md (index)                      163 tokens  ← measured
  Memory files (3 auto-loaded)           815 tokens  ← measured
  ─────────────────────────────────
  Subtotal:                            1,160 tokens

[LAYER 3 — KNOWLEDGE]                                      0 tokens
  Brain MCP results: loaded on-demand, not at boot.
  File reads: loaded when the model calls Read tool.

[LAYER 4 — TOOLS]                                     ~8,400 tokens
  6 MCP servers, ~40 tools total.
  After Tool Search deferral: ~1,800 tokens
  (tool names + descriptions only, not full schemas).

[LAYER 5 — CONVERSATION]                                  23 tokens
  My question: "fix the broken image path"

─────────────────────────────────────────────────────────
TOTAL AT FIRST TURN:                              ~13,383 tokens
TOTAL AFTER TOOL SEARCH:                           ~6,783 tokens

My question was 23 tokens. The infrastructure was 6,760 tokens (with Tool Search) or 13,360 tokens (without).

I call this the Signal Ratio — the percentage of the context window that’s actually your task. Mine was 0.17% at first turn without Tool Search. With Tool Search: 0.34%. Less than half a percent of my context window was the thing I wanted the agent to do.
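The ratio is simple arithmetic. A sanity check, using the measurements above (the helper name is mine):

```python
def signal_ratio(task_tokens: int, infra_tokens: int) -> float:
    """Fraction of the first-turn context that is the actual task."""
    return task_tokens / (task_tokens + infra_tokens)

# Measured numbers from this session: a 23-token question.
print(f"{signal_ratio(23, 13_360):.2%} without Tool Search")  # 0.17%
print(f"{signal_ratio(23, 6_760):.2%} with Tool Search")      # 0.34%
```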

Now, 13K tokens out of a 200K window is only 6.7% — plenty of headroom. But this is turn one. By turn ten, after file reads, tool calls, and assistant responses, the context has grown to 50,000-80,000 tokens. The infrastructure (Layers 1-4) stays fixed. The Signal Ratio climbs to 5-10%. But those fixed-cost tokens are still there — on every single turn — exerting gravitational pull on the model’s attention.

In The 14K Token Debt, I called this Prompt Gravity. In the MCP post, I called it Schema Gravity. They’re the same force operating at different layers. The unified name is Context Gravity — the cumulative attentional weight of all fixed-cost tokens across every layer of the stack. The heavier your context, the stronger the gravity, the harder it is for the model to attend to your actual question.

Prompt engineering asks “what should I put in the prompt?” That’s Layer 1 thinking — one-fifth of the stack. Context engineering asks: what should be in the entire window, at which layer, loaded when, at what cost?


The Context Engineering Stack

The Context Engineering Stack — five layers that every agent system defines, whether explicitly or by accident.

| Layer | What Goes Here | When Loaded | My Setup (measured) | Key Risk |
|---|---|---|---|---|
| 1: Constitution | System prompt, safety rules, persona | Every turn (immutable) | ~3,800 tokens | Prompt Gravity — biases all downstream reasoning |
| 2: Identity | CLAUDE.md, memory files, project rules | Session start | 1,160 tokens | Stale instructions from old projects; lost in the middle |
| 3: Knowledge | RAG results, Brain MCP queries, file reads | On-demand | 0-50,000 per query | Retrieval noise — wrong documents dilute reasoning |
| 4: Tools | MCP schemas, function definitions | Session start | 8,400 → 1,800 (after Tool Search) | Schema Gravity — unused tools waste budget |
| 5: Conversation | Messages, responses, tool results | Accumulates | Growing | Context rot — old turns degrade attention on recent ones |

The constraint is directional: layers below eat the budget of layers above. A bloated Layer 4 (55,000 tokens of MCP schemas from 5 servers, per Anthropic’s own measurement) directly starves Layer 5 and Layer 3. This is why I cut my global MCP servers from 9 to 2 — their Layer 4 cost was crushing my Layer 5 capacity.
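The zero-sum arithmetic is worth making explicit. A sketch (the function is mine; the 55,000-token figure is Anthropic's measurement cited above):

```python
def conversation_budget(window: int, constitution: int,
                        identity: int, tools: int) -> int:
    """Layer 5 gets whatever Layers 1, 2, and 4 leave behind."""
    return window - (constitution + identity + tools)

# Bloated Layer 4 (Anthropic's 55K of schemas) vs. a trimmed setup:
print(conversation_budget(200_000, 3_800, 1_160, 55_000))  # 140040
print(conversation_budget(200_000, 3_800, 1_160, 1_800))   # 193240
```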

Layer 5: The Neglected Layer

Every previous post in this series addressed Layers 1-4. Layer 5 — the conversation itself — is the one I never wrote about, and it’s where context engineering gets hardest.

Layer 5 grows with every turn. Each tool call adds the full result to the conversation history. A Read tool call on a 200-line file adds ~1,000 tokens. A Bash tool call returning build output adds 500-2,000 tokens. After 10 tool-heavy turns, Layer 5 can consume 40,000-60,000 tokens — dwarfing all other layers combined.
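Those per-turn costs compound quickly. A rough model, using the estimates above (a tool-heavy turn taken as two Reads, two Bash results, and one reply; these are assumptions, not measurements):

```python
# Rough per-turn Layer 5 growth. Figures are the post's estimates:
# a Read ~= 1,000 tokens, a Bash result ~= 500-2,000 tokens.
FIXED_LAYERS = 13_360                     # Layers 1-4, no Tool Search
PER_TURN = 2 * 1_000 + 2 * 1_000 + 1_000  # 2 Reads + 2 Bash + reply = 5,000

for turn in (1, 5, 10, 15):
    layer5 = turn * PER_TURN
    print(f"turn {turn:>2}: Layer 5 ~ {layer5:,} of {FIXED_LAYERS + layer5:,} total")
```

By turn ten, Layer 5 alone sits around 50,000 tokens under these assumptions, consistent with the range above.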

The problem is context rot. In the 14K Token Debt post, I cited Liang et al., who found that instruction drift is universally measurable within eight conversation rounds. By turn eight, the model's adherence to system prompt instructions starts degrading. By turn fifteen, it has collapsed into the statistical median of its pre-training distribution. Your carefully crafted Layer 1 constitution? It's being diluted by the sheer volume of Layer 5 tokens.

The Lost in the Middle paper (Liu et al., Stanford) measured this precisely: on multi-document QA with 20 documents, moving the relevant document from position 1 to position 10 caused a ~20 percentage point accuracy drop on GPT-3.5-Turbo. The model attends to the beginning and end of context; everything in the middle is an attention valley. Your Layer 2 instructions (CLAUDE.md) live in that valley.

This is why the most effective Claude Code users put critical instructions in two places: Layer 1 (system prompt, which sits at the very beginning) and Layer 5 (the current message, which sits at the very end). The middle is unreliable territory.


Claude Code vs Gemini CLI: Two Context Stacks

The five layers exist in both tools, but the engineering choices at each layer differ:

| Layer | Claude Code | Gemini CLI |
|---|---|---|
| 1: Constitution | ~3,800 tokens. Detailed coding standards, tool usage rules, style preferences. | ~2,000 tokens. Lighter system prompt, more deference to user config. |
| 2: Identity | Hierarchical: ~/.claude/CLAUDE.md → project → subdirectory. Lazy loading — subdirectory files load only when the agent reads files there. | ~/.gemini/GEMINI.md → project root + ancestors. Eager loading — all concatenated and sent every prompt. |
| 3: Knowledge | No built-in knowledge system. Extended via MCP (e.g., Brain MCP). | Built-in google_web_search and web_fetch tools for real-time grounding. |
| 4: Tools | Tool Search defers schemas — 89% reduction (77K → 8.7K tokens, per Anthropic). | No schema deferral. All tool definitions injected every turn. |
| 5: Conversation | Adaptive auto-compact with a 5-layer compression pipeline. | Fixed-threshold compression (50%, changed from 70%). Known compression-loop bug. |

The architectural bet is clear. Claude Code invests heavily in Layer 4 optimization (Tool Search) and Layer 5 management (multi-layer compression). Gemini CLI bets on a massive context window (1M tokens) to make optimization less critical — but without schema deferral, even 1M tokens fill when you connect enough MCP servers.

Context Mounting — the pattern from the Terminal post where heterogeneous context sources are projected into a uniform namespace — described the what. The Context Engineering Stack describes the when and the cost.


Audit Your Context Budget

Run this in a Claude Code session:

# Step 1: Count your MCP tools
/mcp
# Output shows: brain (4 tools), sequential-thinking (1 tool), etc.
# Count total tools across all servers.

# Step 2: Estimate Layer 4 cost
# Each tool schema ≈ 200-500 tokens of JSON.
# Rule of thumb: total_tools × 350 = Layer 4 budget
# My setup: 40 tools × 350 = ~14,000 tokens (before Tool Search)

# Step 3: Measure Layer 2 exactly
# Outside Claude Code:
pip install tiktoken
python3 -c "
import tiktoken
enc = tiktoken.encoding_for_model('gpt-4')
with open('$HOME/.claude/CLAUDE.md') as f:
    print(f'CLAUDE.md: {len(enc.encode(f.read()))} tokens')
"

Here’s my audit — before and after optimization:

| Layer | Before (9 MCP servers, 2025) | After (2 global + 4 project-scoped, 2026) | Delta |
|---|---|---|---|
| 1: Constitution | ~3,800 | ~3,800 | No change (Anthropic controls this) |
| 2: Identity | 1,160 | 1,160 | No change (my CLAUDE.md + memory) |
| 3: Knowledge | On-demand | On-demand | No change |
| 4: Tools | ~14,000 (40 tools) | ~1,800 (after Tool Search) | -87% |
| 5: Available budget | 181,040 | 193,240 | +12,200 tokens recovered |
| Signal Ratio (turn 1) | 0.17% | 0.34% | 2x improvement |

Those 12,200 recovered tokens translate directly to more file reads, longer conversations, and fewer auto-compact triggers. The context budget is zero-sum: every token saved at one layer is a token available at every other layer.


The Cutting Edge: Context as a Callable Tool

Current context management is passive. Claude Code’s auto-compact triggers when the context grows too large. Gemini CLI triggers at a fixed 50% threshold. Both approaches share the same flaw: the agent doesn’t decide to manage its context. The harness decides for it.

The CAT (Context as a Tool) paper by Liu et al. proposes the fix: make context management a tool the agent can call, just like read_file or bash. The agent decides when to compress, what to retain, and how to structure its memory.

CAT organizes context into three zones that map to the stack:

| CAT Zone | Context Engineering Stack Layer | Function |
|---|---|---|
| Fixed Segment | Layers 1-2 (Constitution + Identity) | Stable anchor — never compressed |
| Long-Term Memory | Layer 3 (Knowledge) | Condensed high-fidelity summaries, evolving |
| Working Memory | Layer 5 (Conversation) | Last K ReAct steps, fine-grained details |
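To make the idea concrete, here is a minimal sketch of what exposing compression as a callable tool could look like in a harness. The tool name, schema, and folding policy are my illustration, not CAT's implementation:

```python
# Hypothetical tool definition: the agent, not the harness, decides when
# to fold old turns into a summary and what must survive verbatim.
COMPRESS_TOOL = {
    "name": "compress_context",
    "description": "Summarize turns older than keep_last_k, preserving "
                   "every string in must_retain verbatim.",
    "input_schema": {
        "type": "object",
        "properties": {
            "keep_last_k": {"type": "integer", "minimum": 1},
            "must_retain": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["keep_last_k"],
    },
}

def apply_compression(history, keep_last_k, must_retain=()):
    """Fold everything but the last k turns into one summary stub."""
    if len(history) <= keep_last_k:
        return list(history)
    stub = "[summary of earlier turns] " + " | ".join(must_retain)
    return [stub] + list(history[-keep_last_k:])

history = [f"turn {i}" for i in range(12)]
compact = apply_compression(history, keep_last_k=3,
                            must_retain=["chose PostgreSQL over MongoDB"])
print(len(compact))  # 4: one summary stub plus the last 3 turns
```

The load-bearing detail is `must_retain`: it is the agent's answer to the invisible-loss problem, pinning decisions that static compression would deem low-priority.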

Their trained model, SWE-Compressor, hit 57.6% on SWE-Bench Verified — outperforming both ReAct agents and static compression baselines — while maintaining stable reasoning under a bounded context budget.

The insight: the agent that manages its own context outperforms the agent that lets the harness manage it. My Brain MCP (Layer 3) is a primitive version of this — the agent queries its own past on-demand rather than loading everything at startup. But it’s still me deciding when to query. The next step is the agent deciding when to compress, what to keep, and what to fold.

Related work pushes this further: LongLLMLingua achieves 4x prompt compression with ~95% performance retention by scoring tokens for relevance to the downstream question and reordering compressed chunks to mitigate the lost-in-the-middle effect. DSPy’s MIPROv2 optimizer automates context engineering entirely — jointly optimizing instructions, few-shot examples, and their combination across multi-stage pipelines, improving performance by 5-15 percentage points over hand-tuned prompts. These aren’t prompt tricks. They’re automated context engineering.


Where This Breaks

Lost in the middle is real, and Layer 2 lives in the valley. Liu et al. (2023) measured a ~20 percentage point accuracy drop when relevant information moves from positions 1 or 20 to position 10 in a 20-document context. Your CLAUDE.md instructions sit between the system prompt (beginning) and the conversation (end) — directly in the attention valley. The mitigation: put your most critical rules at the very top of CLAUDE.md (closer to Layer 1) and re-state them in your current message (Layer 5). Don’t bury important instructions at line 50 of a 100-line CLAUDE.md.

Compression is lossy, and the loss is invisible. Auto-compact summarizes your conversation history, but the summary may drop the specific decision you made in turn 3 that’s load-bearing for turn 15. I’ve seen this in practice: after auto-compact fires, Claude Code occasionally “forgets” a constraint I set earlier — not because the constraint was wrong, but because compression deemed it low-priority. The fix: checkpoint critical decisions explicitly. Say “Remember: we decided PostgreSQL over MongoDB because of X” in your current message. Put it in Layer 5 (recent, high-attention), not just in the history (compressed, low-attention).

Token counting is approximate. You never know your exact context budget until you hit the limit. Different models tokenize differently — the same text may be 1,000 tokens for Claude and 1,200 for Gemini. A “200K context window” isn’t 200K words; it’s roughly 150K words of English text, less for code. Build in margin: target 70% utilization, not 95%.
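A guard for that margin is a one-liner (the 70% target is the rule of thumb above; the estimate going in is itself approximate):

```python
def within_budget(estimated_tokens: int, window: int = 200_000,
                  target_utilization: float = 0.70) -> bool:
    """Leave headroom for tokenizer variance and conversation growth."""
    return estimated_tokens <= window * target_utilization

print(within_budget(63_000))   # True: plenty of headroom
print(within_budget(150_000))  # False: time to compress or trim a layer
```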

Context engineering is model-specific. CLAUDE.md works for Claude Code. GEMINI.md works for Gemini CLI. But the optimal Layer 1 for Claude (direct, imperative: “Always use TypeScript”) differs from the optimal Layer 1 for Gemini (structured, hierarchical context). Claude’s lazy-loading Layer 2 and Tool Search Layer 4 have no equivalents in Gemini CLI. A context strategy doesn’t transfer cleanly across models — rewrite, don’t copy.


Context Gravity Is the Unifying Force

Prompt Gravity was the observation that system prompt tokens bias all downstream reasoning. Schema Gravity was the observation that tool definitions exert the same force. They’re the same phenomenon at different layers.

Context Gravity is the unified name: the cumulative attentional weight of all fixed-cost tokens across every layer of the stack. The heavier your context, the stronger the gravity, the harder it is for the model to attend to the signal — your actual question, buried at the bottom of 13,000 tokens of infrastructure.

The Context Engineering Stack is the map. The Signal Ratio is the metric. Context Gravity is the force you’re fighting. Every decision in agent architecture — how many MCP servers to load, how long to let conversations run, when to compress, what to put in CLAUDE.md — is a context engineering decision. The teams building reliable agents are the ones who treat it as one.

Run /mcp. Count the tools. Measure your CLAUDE.md. Calculate your Signal Ratio. If your agent’s context is 95% infrastructure and 5% signal, you’re not engineering context — you’re drowning in it.


Sharad Jain is an AI engineer and the author of The 14K Token Debt, The Terminal Was the First Agent Harness, Your MCP Servers Are Costing You 10 Seconds, My AI Agent’s Memory Paid for Itself, Claude Code vs Gemini CLI, and I Built an AI Skill That Started Improving Itself. He writes about agent architecture, system prompts, and the infrastructure decisions that compound across every session. This is the sixth post in a series on the hidden mechanics of agentic AI systems.

#AI #context-engineering #Claude-Code #Gemini-CLI #MCP #Karpathy #system-prompts #agents #architecture #developer-experience