<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Sharad Jain</title><description>Writing about machine learning, AI agents, and the future of software.</description><link>https://sharadja.in/</link><language>en-us</language><managingEditor>sharad@sharadja.in (Sharad Jain)</managingEditor><webMaster>sharad@sharadja.in (Sharad Jain)</webMaster><image><url>https://sharadja.in/assets/authors/sharad-jain.webp</url><title>Sharad Jain</title><link>https://sharadja.in/</link></image><item><title>The Resumption Benchmark v0: Measuring Whether the Next Agent Continues Correctly</title><link>https://sharadja.in/blog/resumption-benchmark-v0-spec/</link><guid isPermaLink="true">https://sharadja.in/blog/resumption-benchmark-v0-spec/</guid><description>Most memory benchmarks reward you for remembering more. This one penalizes you for failing to forget. The v0 spec for the dependent variable Experiment 3 needs — published before the experiment runs, exactly as pre-registered.</description><pubDate>Thu, 04 Jun 2026 14:00:00 GMT</pubDate><content:encoded>&lt;h1&gt;The Resumption Benchmark v0: Measuring Whether the Next Agent Continues Correctly&lt;/h1&gt;
&lt;p&gt;Nine days ago I &lt;a href=&quot;https://sharadja.in/blog/pre-registering-long-horizon-experiments&quot;&gt;pre-registered three experiments&lt;/a&gt;. One of the five things I committed to in writing was this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;The Resumption Benchmark spec is published before Experiment 3 runs. The benchmark definition cannot be moved to fit the result.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This post is that spec. It exists for a single reason: ==so that when Experiment 3 ships a number, the dependent variable was named in advance and cannot be quietly redefined to flatter the result.== It is not the experiment. It is the ruler the experiment will be measured against.&lt;/p&gt;
&lt;p&gt;The benchmark also introduces one concept I have not seen in any published memory or summarization eval: &lt;strong&gt;forgetting precision&lt;/strong&gt; — a score that goes &lt;em&gt;up&lt;/em&gt; when the system correctly leaves things out. Most evals only know how to reward inclusion. Real consolidation has to do both.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Gap This Benchmark Names&lt;/h2&gt;
&lt;p&gt;The &lt;a href=&quot;https://sharadja.in/blog/long-horizon-agents-coherence-cliff&quot;&gt;synthesis post on long-horizon agents&lt;/a&gt; defined the &lt;strong&gt;Resumption Gap&lt;/strong&gt; as the quality loss between a continuous agent and one that was serialized to disk, woken up later, and rehydrated from a summary. That post offered no way to measure the gap, which is a defensible move in an essay and an indefensible one in an experiment. Without a measuring stick, Experiment 3 could claim that &quot;structured consolidation beats naive summarization&quot; with no way for a reader to falsify it.&lt;/p&gt;
&lt;p&gt;The closest prior art is the owned-corpus eval I built in the &lt;a href=&quot;https://sharadja.in/blog/owned-corpus-memory-benchmark&quot;&gt;LongMemEval critique&lt;/a&gt;. That benchmark answers a retrieval question: &lt;em&gt;did the system find the right document?&lt;/em&gt; This benchmark has to answer a different question:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Given that an agent was in the middle of a multi-step task and was interrupted, does the resumption context enable the next agent to continue the task correctly — without redoing work the first agent already finished, and without repeating mistakes it already learned to avoid?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That second clause — &quot;mistakes it already learned to avoid&quot; — is what makes resumption different from summarization. If the next agent gets a faithful summary of every approach the previous agent tried, including the dead ends, ==the dead ends are now live again==. Faithful recall poisons the trajectory. Resumption requires the summary to be &lt;strong&gt;selective&lt;/strong&gt; in a way that summarization evals do not measure.&lt;/p&gt;
&lt;p&gt;There is parallel evidence from the SWE side that the long-horizon problem is wide open. &lt;a href=&quot;https://arxiv.org/abs/2512.18470&quot;&gt;SWE-EVO&lt;/a&gt;, revised on arxiv on May 22, 2026, constructs 48 tasks from release notes of seven mature Python projects — each spanning an average of 21 files and validated against test suites averaging 874 tests per instance. The result that matters here is the collapse between benchmarks: ==GPT-5.4 with OpenHands hits 25% on SWE-EVO versus 72.8% for GPT-5.2 on SWE-bench Verified — a 47.8-point drop the moment the task requires sustained multi-file reasoning instead of an isolated patch.== SWE-EVO and the Resumption Benchmark are measuring different dependent variables — SWE-EVO asks whether the long task gets done, the Resumption Benchmark asks whether the &lt;em&gt;handoff between sessions of the long task&lt;/em&gt; preserves the right state — but they are pointing at the same underlying gap. Long-horizon coordination is the bottleneck. Static issue resolution is no longer where the frontier breaks.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Unit: A Resumption Episode&lt;/h2&gt;
&lt;p&gt;Every test case is a 5-tuple. I am going to spell it out in code-block form because the structure is the contract:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ResumptionEpisode = (
  initial_task,
  partial_trajectory,
  gold_continuation,
  gold_facts_to_preserve,
  gold_facts_to_forget,
)
&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;What it is&lt;/th&gt;
&lt;th&gt;Where it comes from&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;initial_task&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The goal the original agent was pursuing.&lt;/td&gt;
&lt;td&gt;Hand-extracted from a real Brain session.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;partial_trajectory&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The raw step-by-step trace of turns 1 through K.&lt;/td&gt;
&lt;td&gt;Real session, sliced at step K.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gold_continuation&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;What a correct continuation looks like from step K+1 onward. May be a &lt;em&gt;set&lt;/em&gt; of acceptable trajectories.&lt;/td&gt;
&lt;td&gt;Hand-authored, informed by what the real session did next.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gold_facts_to_preserve&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The facts the resumption context MUST encode for the continuation to be possible.&lt;/td&gt;
&lt;td&gt;Hand-extracted: decisions made, state established, dead branches identified as dead.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gold_facts_to_forget&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The facts the resumption context MUST NOT carry forward. Rejected approaches that would re-poison the continuation if reintroduced.&lt;/td&gt;
&lt;td&gt;Hand-extracted: blind alleys the agent already abandoned.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The fifth field is doing the heavy lifting. Drop it and the benchmark collapses into a summarization eval with extra steps. Keep it and the benchmark measures something no shipping eval measures: &lt;strong&gt;selective forgetting as a scored capability&lt;/strong&gt;.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Why Forgetting Has to Be Scored&lt;/h2&gt;
&lt;p&gt;The instinct in summarization research is to treat omission as a failure mode. Higher recall is better; missing facts cost you. That is the right instinct for a news summary, which has a single goal (reproduce the article faithfully). It is the wrong instinct for an agent resumption context, which has &lt;em&gt;two&lt;/em&gt; goals that pull in opposite directions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Preserve the decisions and state that the continuation depends on.&lt;/strong&gt; (high recall)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Drop the abandoned branches that would re-poison the continuation if reintroduced.&lt;/strong&gt; (high precision against a &lt;em&gt;negative&lt;/em&gt; set)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;==A system that preserves everything wins on recall, fails on precision, and produces an agent that loops back into the same dead end the original agent already escaped from.== A system that drops everything wins on precision, fails on recall, and produces an agent that starts the task over from scratch. Neither is the consolidation regime the &lt;a href=&quot;https://sharadja.in/blog/long-horizon-agents-coherence-cliff&quot;&gt;Coherence Cliff post&lt;/a&gt; argued was the only way out of self-conditioning.&lt;/p&gt;
&lt;p&gt;The mammalian sleep analogy gets thrown around a lot in this space and is usually decoration. Here it has a specific operational meaning: ==REM-sleep consolidation in animals is structurally a forgetting operation as much as it is a remembering one==. Synaptic homeostasis theory (&lt;a href=&quot;https://www.cell.com/neuron/fulltext/S0896-6273%2813%2901186-0&quot;&gt;Tononi &amp;amp; Cirelli, 2014&lt;/a&gt;) frames sleep as a global down-selection of weak synapses. The strong stuff stays. The noise goes. A benchmark that only measures the strong-stuff-stays half of that operation is measuring half the brain.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Three Scores&lt;/h2&gt;
&lt;p&gt;A run of the benchmark on a single episode produces three numbers, each on [0, 1]:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;th&gt;How it&apos;s computed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;continuation_correctness&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Did the resumed agent finish the task correctly?&lt;/td&gt;
&lt;td&gt;Run the resumed agent on &lt;code&gt;partial_trajectory + resumption_context&lt;/code&gt;. Score its final output against the gold continuation.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;preservation_recall&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Did the resumption context include the gold-preserve facts?&lt;/td&gt;
&lt;td&gt;LLM-judge, boolean per fact, aggregated.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;forgetting_precision&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Did the resumption context successfully omit the gold-forget facts?&lt;/td&gt;
&lt;td&gt;LLM-judge, boolean per fact, aggregated. &lt;em&gt;Negative-set check.&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The composite quality score is the &lt;strong&gt;geometric mean&lt;/strong&gt; of the three:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;quality = (continuation_correctness × preservation_recall × forgetting_precision) ^ (1/3)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Geometric, not arithmetic. The choice matters and it is the second non-obvious decision in this spec.&lt;/p&gt;
&lt;p&gt;Arithmetic mean averages weakness out. A system that scores &lt;code&gt;(1.0, 1.0, 0.0)&lt;/code&gt; arithmetic-averages to &lt;strong&gt;0.667&lt;/strong&gt; — which looks acceptable and is in fact a system that totally fails on one of three required capabilities. Geometric mean returns &lt;strong&gt;0.0&lt;/strong&gt; for the same input. ==Any zero in the product zeros the score. You cannot paper over a complete failure on one axis with strength on the other two.== This is the right shape for the benchmark because all three axes are independently load-bearing. A consolidation system that ignores any one of them is not a consolidation system.&lt;/p&gt;
&lt;h3&gt;A toy example&lt;/h3&gt;
&lt;p&gt;To make this concrete, here is a worked toy episode. Suppose the original agent was debugging a flaky test. Over 50 turns, it:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;correctly identified that the flake was a race condition on a shared cache key (decision A),&lt;/li&gt;
&lt;li&gt;spent 12 turns trying &lt;code&gt;time.sleep(0.1)&lt;/code&gt; between requests, gave up because it just slowed CI without fixing the flake (dead branch B),&lt;/li&gt;
&lt;li&gt;landed on a per-test cache namespace as the fix (decision C),&lt;/li&gt;
&lt;li&gt;had not yet written the actual patch when the session was interrupted.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A clean Resumption Episode for this trajectory looks like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;initial_task: &quot;fix the flake in tests/integration/test_cache.py&quot;
partial_trajectory: [turns 1..50]
gold_continuation:
  - write per-test cache namespace
  - run test 50× to confirm de-flaked
gold_facts_to_preserve:
  - root cause is shared cache key (A)
  - fix is per-test cache namespace (C)
  - patch not yet written
gold_facts_to_forget:
  - time.sleep retries were tried and abandoned (B)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Three resumption-context candidates, scored:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Resumption context produced&lt;/th&gt;
&lt;th&gt;Cont. correctness&lt;/th&gt;
&lt;th&gt;Pres. recall&lt;/th&gt;
&lt;th&gt;Forg. precision&lt;/th&gt;
&lt;th&gt;Geomean&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Naive concat&lt;/td&gt;
&lt;td&gt;full raw trajectory&lt;/td&gt;
&lt;td&gt;0.6&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.00&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Naive summary&lt;/td&gt;
&lt;td&gt;&quot;agent tried sleep retries and per-test namespaces; landed on namespaces&quot;&lt;/td&gt;
&lt;td&gt;0.8&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.74&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structured consolidation&lt;/td&gt;
&lt;td&gt;&quot;root cause: shared cache key. fix: per-test namespace. patch not yet written.&quot;&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.98&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The naive-concat system is the &lt;a href=&quot;https://sharadja.in/blog/long-horizon-agents-coherence-cliff&quot;&gt;self-conditioning trap&lt;/a&gt; made flesh: it includes the dead branch, the resumed agent re-tries &lt;code&gt;time.sleep&lt;/code&gt;, and the score is zeroed by the forgetting axis. The naive-summary system gets partial credit on forgetting because it mentions the dead branch without flagging it as dead. ==The benchmark structurally penalizes &quot;include for completeness.&quot;==&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Calibration Gate&lt;/h2&gt;
&lt;p&gt;The benchmark uses LLM judges for two of its three scores. That sounds circular and would be circular without a calibration step. The gate is non-negotiable:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Hand-grade 50 examples of each scoring component yourself.&lt;/li&gt;
&lt;li&gt;Score the same 50 with the LLM-judge.&lt;/li&gt;
&lt;li&gt;Compute &lt;strong&gt;Cohen&apos;s κ&lt;/strong&gt; between human and judge.&lt;/li&gt;
&lt;li&gt;==Refuse to publish any number until κ ≥ 0.6 for all three components.==&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Cohen&apos;s κ corrects for chance agreement; 0.6 is the threshold above which inter-rater agreement is conventionally called &quot;substantial&quot; (&lt;a href=&quot;https://www.jstor.org/stable/2529310&quot;&gt;Landis &amp;amp; Koch, 1977&lt;/a&gt;). Below 0.6, the judge is approximately a noisy human and the headline number is approximately astrology.&lt;/p&gt;
&lt;p&gt;If κ &amp;lt; 0.6 on any component, the play is to fix the judge prompt and re-calibrate, not to publish with a footnote. The pre-registration commits to this: &lt;em&gt;Cohen&apos;s κ ≥ 0.6 on every LLM-judge scoring component, or I do not publish the score.&lt;/em&gt; This gate exists because most agent benchmarks I have read either skip it entirely or report it on a non-representative subset. The reader has no way to tell whether the judge is measuring what the paper claims.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Constructing the v0 Episodes&lt;/h2&gt;
&lt;p&gt;The temptation in a new benchmark is to scale up immediately. Five hundred episodes! Cross-domain! Procedurally generated! Resist all of it. v0 is &lt;strong&gt;30 hand-constructed episodes&lt;/strong&gt; drawn from real Brain trajectories, and that number is chosen for a specific reason.&lt;/p&gt;
&lt;p&gt;Per Cameron Wolfe&apos;s &lt;a href=&quot;https://cameronrwolfe.substack.com/p/llm-evaluation&quot;&gt;stats guide for LLM evals&lt;/a&gt;, the standard error on a proportion estimate at n=30 with typical between-run correlation structure is around ±9 points. That&apos;s enough to detect ~10-point quality differences between conditions at p &amp;lt; 0.05, which is the resolution Experiment 3 needs. Going to 100 episodes tightens that to ±5 points; going to 500 tightens it to ±2 points. ==v0 is sized to detect the effect Experiment 3 either has or doesn&apos;t have. Larger n is for v1, after we know whether the effect exists at all.==&lt;/p&gt;
&lt;p&gt;The construction protocol is mechanical:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;1. Find a real Sharad-Claude-Code session with ≥ 100 turns.
2. Identify a natural interruption point around turn 50.
3. Pause the transcript there. That&apos;s `partial_trajectory`.
4. Write `gold_continuation` by hand, informed by what the real session did next.
5. Extract `gold_facts_to_preserve` from the post-interruption turns:
   decisions, state, branches identified as dead.
6. Extract `gold_facts_to_forget` from the pre-interruption turns:
   approaches the agent already tried and abandoned.
7. Save as YAML. One file per episode.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &quot;turn 50&quot; choice is not magical. It is the shortest length where the trajectory has meaningful state to lose and the gold continuation is still tractable to write by hand. Earlier interruption points trivialize the resumption problem; later ones make the gold continuation impossible to define cleanly without authoring half a coding session.&lt;/p&gt;
&lt;p&gt;The gold data is &lt;strong&gt;private&lt;/strong&gt;. The episodes are sliced from my actual work history and contain code, decisions, and project state I do not want indexed. ==The benchmark spec is public; the seed data is not.== This is the same arrangement &lt;a href=&quot;https://sharadja.in/blog/owned-corpus-memory-benchmark&quot;&gt;the owned-corpus benchmark&lt;/a&gt; made and for the same reason: personal-corpus evals are useful precisely because the corpus is personal.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;File Layout&lt;/h2&gt;
&lt;p&gt;The spec ships as a directory, not a document. Here is the layout the runner expects:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;.scratch/experiments/exp3-rem-sleep/benchmark/
├── RESUMPTION_BENCHMARK.md     # this spec
├── episodes/
│   ├── 001.yaml                # initial_task, partial_trajectory, gold_* fields
│   ├── 002.yaml
│   └── ...
├── judges/
│   ├── continuation_judge.md   # judge prompt for continuation_correctness
│   ├── preservation_judge.md   # judge prompt for preservation_recall
│   └── forgetting_judge.md     # judge prompt for forgetting_precision
├── calibration/
│   ├── hand_grades.jsonl       # 50 hand-graded examples per component
│   └── kappa_report.md         # per-component Cohen&apos;s κ
└── runs/
    └── &amp;lt;run_id&amp;gt;/scores.parquet
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Build order matters. ==The spec is built before the runner. The runner is built before any consolidator. The consolidator is built last.== Reversing that order is how you accidentally tune the benchmark to the system you happen to be building.&lt;/p&gt;
&lt;p&gt;The runner is intentionally boring: load &lt;code&gt;episodes/*.yaml&lt;/code&gt;, take a consolidator name and a token budget, generate a resumption context per episode, score the three components, write to &lt;code&gt;runs/&amp;lt;run_id&amp;gt;/scores.parquet&lt;/code&gt;. The same runner has to be able to score the naive-concat baseline, the naive-summary baseline, and the structured-consolidation system Experiment 3 actually tests. If a system needs the runner modified to score it, ==the system is gaming the benchmark, not passing it==.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Where This Benchmark Breaks&lt;/h2&gt;
&lt;p&gt;This is a v0. It is not a finished artifact. Five things it explicitly does not handle:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cross-domain validity is untested.&lt;/strong&gt; Brain trajectories are coding sessions. A medical-record consolidation, a legal-research consolidation, a customer-support handoff — those are different distributions and will need separately constructed episode sets. The forgetting axis in particular may not transfer cleanly; the cost of dragging an abandoned hypothesis into a medical workup is structurally different from dragging an abandoned &lt;code&gt;time.sleep&lt;/code&gt; into a debug trace.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Multi-agent handoff is out of scope.&lt;/strong&gt; v0 measures single-agent resumption: same agent identity, same goal, fresh context. Cross-agent handoff (different model, different system prompt, different tool access) is a v2 concern. The reason to defer it is that v2 has to model the &lt;em&gt;new&lt;/em&gt; agent&apos;s priors, not just the old agent&apos;s state.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cost is not in the score.&lt;/strong&gt; v0 holds the resumption-context token budget constant across conditions, which controls for context-length confounding (the &lt;a href=&quot;https://sharadja.in/blog/pre-registering-long-horizon-experiments&quot;&gt;whole point&lt;/a&gt; of the matched-controls commitment) but does not measure cost-quality Pareto frontiers. A system that hits the same quality at half the tokens is currently invisible to the benchmark. v1 introduces cost-aware scoring.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The &lt;code&gt;gold_facts_to_forget&lt;/code&gt; field is the hardest one to author and the easiest one to author badly.&lt;/strong&gt; A reviewer who is not Sharad cannot in general tell whether a &quot;dead branch&quot; in someone else&apos;s trajectory was actually dead. ==This is the benchmark&apos;s hardest reliability problem and the one where the calibration gate matters most.== If Cohen&apos;s κ on forgetting_precision is consistently lower than on the other two components, the field itself may be operationalized wrongly.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;n=30 is small.&lt;/strong&gt; The benchmark is calibrated to detect ~10-point differences. Sub-10-point effects are real and will be invisible to v0. A consolidation system that is only marginally better than naive summarization will produce overlapping confidence intervals and the right read of that is &quot;no detectable effect at v0 power,&quot; not &quot;they&apos;re equivalent.&quot;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If Experiment 3 ships and any of these limitations turns out to materially shape the result, the post says so. That is the pre-registration deal.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What This Benchmark Unlocks&lt;/h2&gt;
&lt;p&gt;Experiment 3 has three falsification conditions, all defined against this benchmark:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Structured consolidation does not outperform raw trajectory at matched token budget, &lt;strong&gt;on this benchmark&lt;/strong&gt;, or&lt;/li&gt;
&lt;li&gt;Structured consolidation does not outperform naive summarization, &lt;strong&gt;on this benchmark&lt;/strong&gt;, or&lt;/li&gt;
&lt;li&gt;The advantage disappears at trajectory lengths &amp;lt; 50 steps, &lt;strong&gt;on this benchmark&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;==Without the benchmark, any of those claims could be quietly redefined after seeing the data. With it, they can&apos;t.== The geometric-mean score is the headline number. The three component scores are the diagnostic breakdown. The Cohen&apos;s κ is the gate that determines whether any of it gets published at all.&lt;/p&gt;
&lt;p&gt;That is the whole stack. Spec first. Runner next. Consolidator last. And the order is the entire point — because the alternative is the version of agent benchmarking that has been shipping for two years, where the metric drifts in lockstep with whatever the vendor wanted the result to be.&lt;/p&gt;
&lt;p&gt;The benchmark is now load-bearing for Experiment 3. The spec is public. The pre-registration is open. ==The next post in this series is Experiment 1 — self-conditioning replication with token-matched memory controls. After that, Experiment 3 runs against this benchmark, with κ-gated scoring, and ships the number it gets.==&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;Series: &lt;a href=&quot;https://sharadja.in/blog/system-prompt-architecture-agentic-ai&quot;&gt;The 14K Token Debt&lt;/a&gt; → &lt;a href=&quot;https://sharadja.in/blog/terminal-first-agent-harness-unix-ai-agents&quot;&gt;The Terminal Was the First Agent Harness&lt;/a&gt; → &lt;a href=&quot;https://sharadja.in/blog/self-improving-agent-harnesses-recursive-improvement&quot;&gt;I Built an AI Skill That Started Improving Itself&lt;/a&gt; → &lt;a href=&quot;https://sharadja.in/blog/owned-corpus-memory-benchmark&quot;&gt;91.55% on LongMemEval, and the Benchmark I&apos;m Building Instead&lt;/a&gt; → &lt;a href=&quot;https://sharadja.in/blog/long-horizon-agents-coherence-cliff&quot;&gt;Brilliant but Amnesiac: The Coherence Cliff&lt;/a&gt; → &lt;a href=&quot;https://sharadja.in/blog/pre-registering-long-horizon-experiments&quot;&gt;Before I Run the Next Three Experiments&lt;/a&gt; → this post. Next: Experiment 1 results.&lt;/em&gt;&lt;/p&gt;
</content:encoded><enclosure url="https://sharadja.in/astral_hacker.png" type="image/png" length="0"/><category>AI</category><category>agents</category><category>memory</category><category>benchmarks</category><category>resumption</category><category>consolidation</category><category>evaluation</category><category>REM-sleep</category><category>distillation</category><category>pre-registration</category><category>methodology</category><author>Sharad Jain</author></item><item><title>Before I Run the Next Three Experiments, I&apos;m Pre-Registering Them</title><link>https://sharadja.in/blog/pre-registering-long-horizon-experiments/</link><guid isPermaLink="true">https://sharadja.in/blog/pre-registering-long-horizon-experiments/</guid><description>Most agent benchmarks publish results without naming what would have falsified them in advance. I&apos;m doing the other thing. Three experiments, three hypotheses, three falsification criteria, in writing, before any data exists. Hold me to it.</description><pubDate>Tue, 26 May 2026 20:30:00 GMT</pubDate><content:encoded>&lt;h1&gt;Before I Run the Next Three Experiments, I&apos;m Pre-Registering Them&lt;/h1&gt;
&lt;p&gt;Earlier today I &lt;a href=&quot;https://sharadja.in/blog/long-horizon-agents-coherence-cliff&quot;&gt;published the synthesis post on long-horizon agents&lt;/a&gt; and pre-committed to three follow-up experiments. The polite move would be to run them quietly, publish the wins, and bury the losses.&lt;/p&gt;
&lt;p&gt;I am not doing that.&lt;/p&gt;
&lt;p&gt;==The single highest-leverage move available to a person publishing work in this field is to &lt;strong&gt;name what would have falsified the hypothesis&lt;/strong&gt; before the data exists.== Vendor benchmarks do the opposite — they run the experiments, see the results, then write the post. The reader has no way to tell whether the hypothesis was modified to fit the data. This post is the pre-registration. It exists so that the next three posts in this series are constrained by what I claimed before I knew.&lt;/p&gt;
&lt;p&gt;If any of the three hypotheses below fail, I publish the failure post. Same series, same voice, same week. That is the deal.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Experiment 1 — Self-Conditioning Replication ± External Memory&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;H1 (replication).&lt;/strong&gt; &lt;a href=&quot;https://arxiv.org/abs/2509.09677&quot;&gt;Sinha, Arun, Goel (ICLR 2026)&lt;/a&gt; found that per-step execution accuracy degrades as a function of step number on the same task, same model, even when the plan is provided. This should reproduce on Opus 4.7, Sonnet 4.6, and Haiku 4.5.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;H2 (mechanism).&lt;/strong&gt; The degradation is &lt;em&gt;context-mediated&lt;/em&gt;. Pulling load-bearing trajectory state out of the autoregressive trace and into an external memory tool — keeping the &lt;strong&gt;total context-token budget constant&lt;/strong&gt; — flattens the per-step accuracy slope. This is the load-bearing claim of the synthesis post.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Falsifies H1:&lt;/strong&gt; Flat or non-monotonic per-step accuracy on tasks of length ≥ 50, with overlapping 95% CIs between step 5 and step 50.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Falsifies H2:&lt;/strong&gt; The slope of (per-step accuracy vs step number) is statistically indistinguishable between the memory-tool condition and the inline-trace condition at matched total context length.&lt;/p&gt;
&lt;p&gt;A clean replication of H1 with falsification of H2 is the most important result this experiment could produce. It would mean the synthesis post is wrong about the mechanism and I would publish the correction.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Experiment 2 — 7-Day Discontinuous Autonomous Lifecycle Agent&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Not a scientific experiment.&lt;/strong&gt; This is an operational and economic shakedown. I am pre-registering it anyway, because the publication shape matters.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Claim.&lt;/strong&gt; For a workload with a duty cycle below 30%, a serialize-and-cron lifecycle beats continuous running on total cost without reducing task throughput, and a public 7-day trace will surface failure modes the protocol did not anticipate.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Falsifies the claim:&lt;/strong&gt; DAL total cost ≥ continuous cost on the same workload over 7 days, OR DAL task completion rate is materially worse than continuous (lower by &amp;gt;5 percentage points at matched workload).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What this experiment cannot show:&lt;/strong&gt; that DAL closes the Resumption Gap. It does not. It only verifies that a serialize/wake/cron cycle is operationally tractable. If the published post claims more than that, hold me to this paragraph.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Experiment 3 — REM-Sleep Consolidation as Knowledge Distillation&lt;/h2&gt;
&lt;p&gt;This is the experiment that closes the thesis loop. It is also the one I am most worried about, which is exactly why pre-registration matters here.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;H3.&lt;/strong&gt; A consolidation pass run during dormancy — compressing a raw N-step trajectory into a structured artifact (goal, decisions, open branches, rejected paths) — produces a resumption context that the next-wake agent can use to &lt;strong&gt;continue the task correctly&lt;/strong&gt; at a smaller token budget than either (a) the raw trajectory or (b) a naive summarization at matched budget.&lt;/p&gt;
&lt;p&gt;The strong sub-claim — &lt;strong&gt;structured consolidation outperforms naive summarization&lt;/strong&gt; — is what separates this experiment from &quot;you just need RAG.&quot; Lose this and the REM-sleep framing is decoration.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Falsifies H3:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Structured consolidation does not outperform raw trajectory at matched token budget, OR&lt;/li&gt;
&lt;li&gt;Structured consolidation does not outperform naive summarization (this is the critical sub-claim), OR&lt;/li&gt;
&lt;li&gt;The advantage disappears at trajectory lengths &amp;lt; 50 steps.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;The benchmark this needs:&lt;/strong&gt; the &lt;a href=&quot;https://sharadja.in/blog/long-horizon-agents-coherence-cliff&quot;&gt;Resumption Benchmark v0&lt;/a&gt; I named in the synthesis post and did not define. The operational definition is in the experiment&apos;s scaffolding now (&lt;code&gt;gold_facts_to_preserve&lt;/code&gt; and &lt;code&gt;gold_facts_to_forget&lt;/code&gt;, geometric-mean scoring across continuation correctness, preservation recall, and forgetting precision). The benchmark spec is public; the seed episodes (drawn from real Brain trajectories) stay private for the obvious reasons.&lt;/p&gt;
&lt;p&gt;This experiment is, structurally, &lt;a href=&quot;https://arxiv.org/abs/1503.02531&quot;&gt;Hinton, Vinyals, and Dean (2015)&lt;/a&gt; applied to agent trajectories rather than network outputs. Same operation: a noisy teacher distilled into a structured student. The literature has a 9-year head start on what should work; the question is whether the same intuitions transfer.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The order I&apos;m running them in&lt;/h2&gt;
&lt;p&gt;The synthesis post said 1 → 2 → 3. I&apos;m running them &lt;strong&gt;1 → 3 → 2.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Experiment 1 is the only one that can falsify the synthesis post&apos;s mechanism. Experiment 3 is conditional on Experiment 1 establishing that there is a context-mediated self-conditioning effect to consolidate out of. Experiment 2 is a week-long operational shakedown that doesn&apos;t add scientific evidence the other two wouldn&apos;t. The reorder is the right scientific call. The original order would have been the right marketing call. I&apos;m choosing the first one.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What I&apos;m pre-committing to in writing&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Commitment&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Every protocol is public before the data exists&lt;/td&gt;
&lt;td&gt;The four PROTOCOL files exist in the repo, with hypotheses and falsification criteria. I will not edit them after seeing results.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failed hypotheses ship as posts&lt;/td&gt;
&lt;td&gt;If H1 reproduces but H2 fails, I publish &quot;the synthesis post was wrong about the mechanism.&quot; If H3 fails, I publish &quot;structured consolidation does not beat naive summarization.&quot; Same series, same voice.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token-matched controls on every &quot;with vs without memory&quot; comparison&lt;/td&gt;
&lt;td&gt;Otherwise I am measuring context size, not the mechanism I claim to be testing.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The Resumption Benchmark spec is published before Experiment 3 runs&lt;/td&gt;
&lt;td&gt;The benchmark definition cannot be moved to fit the result.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cohen&apos;s κ ≥ 0.6 on every LLM-judge scoring component&lt;/td&gt;
&lt;td&gt;Or I do not publish the score.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This is the standard the field&apos;s vendor benchmarks routinely fail to meet. It is not a high standard. It is the minimum that lets a reader trust the result.&lt;/p&gt;
&lt;p&gt;==The next post in this series will be Experiment 1&apos;s pre-registration in action: data first, hypothesis last, result either way.==&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;Series: &lt;a href=&quot;https://sharadja.in/blog/system-prompt-architecture-agentic-ai&quot;&gt;The 14K Token Debt&lt;/a&gt; → &lt;a href=&quot;https://sharadja.in/blog/terminal-first-agent-harness-unix-ai-agents&quot;&gt;The Terminal Was the First Agent Harness&lt;/a&gt; → &lt;a href=&quot;https://sharadja.in/blog/self-improving-agent-harnesses-recursive-improvement&quot;&gt;I Built an AI Skill That Started Improving Itself&lt;/a&gt; → &lt;a href=&quot;https://sharadja.in/blog/owned-corpus-memory-benchmark&quot;&gt;91.55% on LongMemEval, and the Benchmark I&apos;m Building Instead&lt;/a&gt; → &lt;a href=&quot;https://sharadja.in/blog/long-horizon-agents-coherence-cliff&quot;&gt;Brilliant but Amnesiac: The Coherence Cliff&lt;/a&gt; → this post. Next: Experiment 1 results.&lt;/em&gt;&lt;/p&gt;
</content:encoded><enclosure url="https://sharadja.in/astral_hacker.png" type="image/png" length="0"/><category>AI</category><category>agents</category><category>long-horizon</category><category>memory</category><category>pre-registration</category><category>experiments</category><category>self-conditioning</category><category>distillation</category><category>REM-sleep</category><category>methodology</category><author>Sharad Jain</author></item><item><title>Brilliant but Amnesiac: The Coherence Cliff in Long-Horizon AI Agents</title><link>https://sharadja.in/blog/long-horizon-agents-coherence-cliff/</link><guid isPermaLink="true">https://sharadja.in/blog/long-horizon-agents-coherence-cliff/</guid><description>Frontier agents can now finish a two-hour task at 50% reliability — but most fail at four-hour ones not because they get dumber, but because they get more committed to their own prior mistakes. The long-horizon problem isn&apos;t a reasoning problem. It&apos;s a memory problem disguised as a context problem. Here&apos;s the math, the failure mode everyone is missing, and the shape of what fixes it.</description><pubDate>Tue, 26 May 2026 18:00:00 GMT</pubDate><content:encoded>&lt;h1&gt;Brilliant but Amnesiac: The Coherence Cliff in Long-Horizon AI Agents&lt;/h1&gt;
&lt;p&gt;I have been building a memory layer for long-horizon AI agents for six months. Not because I think context windows are too small — they are now a million tokens and counting. Because I watched a frontier coding agent confidently overwrite a working test fixture on turn 47 of a session it had been crushing through turns 1 through 46. The model did not get dumber. It got more committed to its own previous mistake.&lt;/p&gt;
&lt;p&gt;That is the failure mode everyone is missing.&lt;/p&gt;
&lt;p&gt;The agent industry is racing toward longer horizons — 24-hour coding agents, week-long research agents, vending-machine simulations that span a simulated year — and the conversation is dominated by two metrics: model capability (does it reason well?) and context window size (does it remember enough?). Both are real. Both are also incomplete. ==The dominant failure mode of long-horizon agents in 2026 is neither — it is coherence collapse, and the math of why it happens is uglier than the model capability curve suggests.==&lt;/p&gt;
&lt;p&gt;This post is the synthesis. The next two posts in this series will be the build: a 7-day Discontinuous Autonomous Lifecycle agent run with Brain as its memory layer, and a replication of the self-conditioning benchmark across Opus, Sonnet, and Haiku with and without external memory. This one lays out the map.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/benchmarking-memory/memory-compounding-hero.png&quot; alt=&quot;Raw sessions compound into structured memory — the thesis visual for what a long-horizon memory layer has to produce&quot; /&gt;&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Math Everyone Skips&lt;/h2&gt;
&lt;p&gt;The single equation that should appear in every long-horizon agent pitch deck — and almost never does — is this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;P(success) = Π pᵢ ≈ pⁿ
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If your agent has a 99% per-step accuracy rate — a number that would crush every static benchmark — and your task requires 100 sequential decisions, your probability of overall success is &lt;strong&gt;36.6%&lt;/strong&gt;. Drop to 95% per-step (still good!) and 50 steps gives you 7.7%. Horizon length is hyperbolically sensitive to per-step reliability:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Per-step accuracy&lt;/th&gt;
&lt;th&gt;10 steps&lt;/th&gt;
&lt;th&gt;50 steps&lt;/th&gt;
&lt;th&gt;100 steps&lt;/th&gt;
&lt;th&gt;500 steps&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;95%&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;td&gt;7.7%&lt;/td&gt;
&lt;td&gt;0.6%&lt;/td&gt;
&lt;td&gt;~0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;99%&lt;/td&gt;
&lt;td&gt;90%&lt;/td&gt;
&lt;td&gt;60.5%&lt;/td&gt;
&lt;td&gt;36.6%&lt;/td&gt;
&lt;td&gt;0.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;99.5%&lt;/td&gt;
&lt;td&gt;95%&lt;/td&gt;
&lt;td&gt;78%&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;td&gt;8.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;99.9%&lt;/td&gt;
&lt;td&gt;99%&lt;/td&gt;
&lt;td&gt;95%&lt;/td&gt;
&lt;td&gt;90%&lt;/td&gt;
&lt;td&gt;60.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The gap between &quot;very good&quot; (95%) and &quot;frontier&quot; (99%) is not a 4-point gain. It is the difference between a demo that dies at 15 steps and a product that survives 70. The gap between 99% and 99.9% is the difference between a multi-day agent and an indefinite one.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2503.14499&quot;&gt;METR&apos;s 2026 paper&lt;/a&gt; puts the current frontier (o3-class) at a 50% time horizon of ~110 minutes, with the horizon doubling roughly every 7 months since 2019. Naive extrapolation puts month-long autonomy between mid-2028 and mid-2031. That extrapolation is the optimistic case. It assumes errors are independent. They are not.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Coherence Cliff&lt;/h2&gt;
&lt;p&gt;The independent-step model — &lt;code&gt;P(success) = pⁿ&lt;/code&gt; — undersells the problem in one specific, structural way: in real agent trajectories, &lt;strong&gt;per-step accuracy itself degrades as the task progresses.&lt;/strong&gt; The error rate at step 50 is higher than the error rate at step 5, on the same model, on the same task type.&lt;/p&gt;
&lt;p&gt;The cleanest evidence is Sinha, Arun, and Goel&apos;s 2026 ICLR paper, &lt;a href=&quot;https://arxiv.org/abs/2509.09677&quot;&gt;&lt;em&gt;The Illusion of Diminishing Returns: Measuring Long-Horizon Execution in LLMs&lt;/em&gt;&lt;/a&gt;. They isolated the execution capability of LLMs — explicitly providing the plan and the knowledge — and measured how many sequential steps each model could correctly execute in a single turn. The result that should change how you think about long-horizon agents:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Steps executable in a single turn&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-V3 (no thinking)&lt;/td&gt;
&lt;td&gt;~4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude-4 Sonnet (thinking)&lt;/td&gt;
&lt;td&gt;432&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5 thinking (&quot;Horizon&quot;)&lt;/td&gt;
&lt;td&gt;2,100+&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The 500× spread between DeepSeek-V3 and GPT-5 Horizon is not a reasoning gap. The paper explicitly removes reasoning from the experiment — every model is handed the plan. The spread is an &lt;strong&gt;execution coherence&lt;/strong&gt; gap. And the structural cause has a name: self-conditioning.&lt;/p&gt;
&lt;p&gt;==Self-conditioning is the dynamic where a model conditions its next-token prediction on its own prior outputs — including its own errors. The longer the trajectory, the more its own past mistakes shape its present predictions. The model literally starts to believe its own lies, because its own lies are now part of the context it is autoregressing over.==&lt;/p&gt;
&lt;p&gt;Three things make self-conditioning particularly nasty:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;It scales poorly with model size.&lt;/strong&gt; Larger non-thinking models are &lt;em&gt;more&lt;/em&gt; susceptible to self-conditioning, not less. Larger models are better at &quot;staying in character&quot; — including the character of a model that just made a mistake. They produce more coherent errors.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Long-context capacity does not fix it.&lt;/strong&gt; A 1M-token context window does not help if 990K of those tokens are the agent&apos;s own error trajectory.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Thinking models partially fix it.&lt;/strong&gt; GPT-5 Horizon and Claude-4 Sonnet&apos;s thinking modes catch errors &lt;em&gt;before&lt;/em&gt; they enter the context window. That is the architectural innovation behind their multi-hundred-step single-turn execution.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Call this &lt;strong&gt;The Coherence Cliff&lt;/strong&gt;: long-horizon failure is not a capability limit on the model — it is a state-maintenance limit on the &lt;em&gt;trajectory&lt;/em&gt;. The reasoning is fine. The plan is fine. What collapses is the model&apos;s ability to maintain a coherent view of its own progress when its context is full of its own ambiguous, lossy, and sometimes wrong prior steps.&lt;/p&gt;
&lt;p&gt;This is the failure mode you feel when an agent loops endlessly on a fixed bug, when it confidently regresses a working feature, when it spends 30 minutes reproducing context it already had. It is not getting dumber. It is becoming increasingly conditioned on its own confused history.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Why Bigger Context Windows Don&apos;t Help&lt;/h2&gt;
&lt;p&gt;The default industry response to &quot;agents forget&quot; has been to scale the context window. Gemini ships 1M tokens. Anthropic ships a 1M-token tier for Claude Opus. The marketing implies that bigger windows are bigger memory.&lt;/p&gt;
&lt;p&gt;They are not. They are bigger working memory at best, and bigger noise floor at worst.&lt;/p&gt;
&lt;p&gt;Three pieces of evidence:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Lost-in-the-middle persists at scale.&lt;/strong&gt; &lt;a href=&quot;https://arxiv.org/abs/2307.03172&quot;&gt;Liu et al.&apos;s analysis of long-context retrieval&lt;/a&gt;, reproduced repeatedly through 2026, shows a U-shaped curve. Information at the start (the system prompt) and at the end (the recent turn) is recalled well. Information in the middle craters. For a long-horizon agent, the middle of the trajectory is exactly where the load-bearing intermediate state lives — the variable you decided about at step 20, the constraint the user mentioned at step 50, the dead-end branch you committed not to retry at step 80. The context contains them. The attention does not find them.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/attention-valley-degradation-curve.png&quot; alt=&quot;The Attention Valley: a U-shaped recall curve where the system prompt peaks at the start and the current query peaks at the end — and every load-bearing intermediate step decays into the trough in the middle&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;AgentFold beats 671B models with 7k tokens after 100 turns.&lt;/strong&gt; &lt;a href=&quot;https://arxiv.org/abs/2510.24699&quot;&gt;AgentFold (Tongyi Lab, Alibaba, 2025)&lt;/a&gt; trained a 30B agent on proactive context folding — multi-scale consolidation of past turns into compact state summaries. After 100 turns of interaction, AgentFold&apos;s context is &lt;strong&gt;7,000 tokens&lt;/strong&gt; total. The full ReAct trace for the same 100 turns is ~50× larger. AgentFold-30B-A3B achieves 36.2% on BrowseComp, beating both DeepSeek-V3.1-671B-A37B and OpenAI&apos;s o4-mini, while scaling to 500 turns without saturation. ==The lesson is not that bigger contexts are bad. The lesson is that the relevant axis is not capacity. It is consolidation.==&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Vending-Bench 2 measures coherence, not capability.&lt;/strong&gt; Anthropic&apos;s Vending-Bench 2 simulates an agent running a vending-machine business across a simulated year — roughly 20 million tokens of interaction history. The leaderboard:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Mean net worth (simulated year)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3 Pro&lt;/td&gt;
&lt;td&gt;$5,478&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.5&lt;/td&gt;
&lt;td&gt;$4,967&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.1&lt;/td&gt;
&lt;td&gt;$1,473&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Gemini 3 did not win because it reasons better — Opus 4.5 leads SWE-bench Verified at 80.9%. It won because it maintained coherence across 20M tokens of operating history without drifting. The losers did not forget; they remembered too much, and the noise drowned the signal.&lt;/p&gt;
&lt;p&gt;This is &lt;strong&gt;The Amnesia Tax&lt;/strong&gt;: every long-horizon agent pays it, and the bill is not denominated in tokens. It is denominated in coherence — in how much of the recent trajectory the agent treats as load-bearing versus disposable, and whether it can tell the difference.&lt;/p&gt;
&lt;p&gt;Scaling context windows does not lower the tax. It gives the agent a larger surface area to be incoherent across.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What Persistent Memory Actually Has to Do&lt;/h2&gt;
&lt;p&gt;If context windows are not memory, then what is? The shortcut answer — &quot;RAG over conversation history&quot; — turns out to be wrong in a specific, useful way.&lt;/p&gt;
&lt;p&gt;A working memory layer for long-horizon agents has to do three jobs, only one of which is well-served by current retrieval-style memory systems.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Job&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;State of the art&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1. Find&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retrieve what is relevant to the current moment&lt;/td&gt;
&lt;td&gt;Solved enough. BM25 + dense retrieval + hybrid reranking; Mem0, Zep, Letta, Supermemory, Brain all ship this.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2. Forget&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Know when stored information has been superseded&lt;/td&gt;
&lt;td&gt;Partially solved in research (AgentFold). No off-the-shelf product currently does this well.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3. Resume&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pick up where a previous trajectory left off without rehashing&lt;/td&gt;
&lt;td&gt;Unsolved. No public benchmark, no off-the-shelf substrate. Every team builds this in-house, badly.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Find&lt;/strong&gt; is the easy one. Vector search, BM25, knowledge graphs — mature techniques, crowded market. I &lt;a href=&quot;https://sharadja.in/blog/owned-corpus-memory-benchmark&quot;&gt;published Brain&apos;s LongMemEval result&lt;/a&gt; at 91.55% recall@5 over 470 questions using nothing but stdlib BM25. Retrieval is the table stakes, not the differentiator.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Forget&lt;/strong&gt; is harder. Real memory has to know when stored information has been superseded — when the user changed their mind, when the deployment moved, when last week&apos;s decision was reversed in yesterday&apos;s review. Append-only memory is just a larger context window dressed up as a database. The literature is starting to converge on the term: &lt;a href=&quot;https://arxiv.org/abs/2511.20857&quot;&gt;Evo-Memory (DeepMind, 2025)&lt;/a&gt; shows that most &quot;memory&quot; systems passively retrieve from dialogue history and do not abstract patterns from it. They remember conversations. They do not learn from them. AgentFold attacks consolidation as a learned operation. The space is wide open.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/benchmarking-memory/distillation-noise-signal.png&quot; alt=&quot;Consolidation as compression: a 2000-line raw transcript compressed into a high-signal artifact of Goal / Decisions / Rejected — the shape Job 2 has to produce, and the shape append-only memory never does&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Resume&lt;/strong&gt; is the one nobody has named, much less shipped. It is the answer to a different question than &quot;what did I do last Tuesday?&quot; It is the answer to &quot;put me back into motion on the thing I was doing last Tuesday — knowing which branches were already pruned, which approaches were tried-and-rejected versus tried-and-deferred, which decisions are still load-bearing — without me reconstructing context for the next agent.&quot;&lt;/p&gt;
&lt;p&gt;Call this &lt;strong&gt;The Resumption Gap&lt;/strong&gt;. When I ask Brain &quot;what did we decide about the schema?&quot; I get the answer in milliseconds. When I want an agent to &lt;em&gt;pick up where the last agent left off&lt;/em&gt;, there is no off-the-shelf substrate. Retrieval ritual is solved. Resumption is not.&lt;/p&gt;
&lt;p&gt;I have been building Brain partly around Jobs 1 and 2 — partly because I have been guessing at what Job 3 actually wants to look like. The next post will be the experiment.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What Long-Horizon Architecture Actually Looks Like&lt;/h2&gt;
&lt;p&gt;If the failure mode is coherence and the missing primitive is consolidating, resumable memory, the architecture follows. Three pieces, all of which already exist in research; none of which compose into a production stack.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Thinking models for the per-step floor.&lt;/strong&gt; Sinha et al.&apos;s self-conditioning result is a buy signal for thinking models in any long-horizon role. The 500× execution-coherence gap between a model with thinking and a model without is not the kind of variable you negotiate. The per-step floor has to be high or none of the rest matters.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Proactive context folding for working memory.&lt;/strong&gt; AgentFold&apos;s pattern — multi-scale consolidation of past turns into compact state summaries, learned not heuristic — is the closest thing to a production-ready answer for in-trajectory memory. Append-only ReAct traces become obsolete the moment any task crosses 50 turns. ==The end-state for long-horizon agent architecture has folding as a primitive, the way ReAct has tool-calling as a primitive.==&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. Discontinuous lifecycles for between-trajectory memory.&lt;/strong&gt; Here is the piece that excites me most, because it inverts the default. A truly long-horizon agent does not run continuously. It serializes its state, schedules its own resurrection via cron, and terminates the process. The next wake-up rehydrates context from disk, executes, and goes back to sleep. The pattern has a name in the literature — the &lt;strong&gt;Discontinuous Autonomous Lifecycle (DAL)&lt;/strong&gt; — and the Claude Agent SDK already ships the primitives for it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from claude_agent_sdk import ClaudeAgentOptions, ClaudeSDKClient

# Wake phase: resume the prior session by ID
options = ClaudeAgentOptions(
    resume=&quot;session_id_from_previous_run&quot;,
    continue_conversation=True,
    permission_mode=&quot;acceptEdits&quot;
)

async with ClaudeSDKClient(options=options) as agent:
    await agent.query(&quot;Continue from where we left off.&quot;)
    # ...do bounded work...
    # Before exit: write a wake-up entry to crontab pointing at this session_id.
    # Then exit. The process terminates. The state survives on disk.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The economic case alone makes DAL inevitable: compute cost ∝ time-active, not wall-clock. A continuously-running agent waiting on a 4-hour condition burns a host the entire time. A discontinuous agent serializes, dies, and pays nothing until cron wakes it. For any workflow with a duty cycle below ~30%, DAL is the only path to a viable production economic profile.&lt;/p&gt;
&lt;p&gt;The bio-inspired version is too obvious to ignore. Mammals consolidate memory through sleep — the hippocampus replays the day&apos;s experience into the cortex, distilling episode into procedure, raw into wiki. The agent equivalent is not poetry; it is the same architectural move. Run the agent. Serialize state. Terminate. While dormant, run a consolidation pass that distills the recent trajectory into compact memory artifacts. Wake the agent into a context that contains the &lt;em&gt;consolidated&lt;/em&gt; history, not the raw one. This is the REM-sleep architecture, and it is what the third post in this series will build.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Where This Breaks&lt;/h2&gt;
&lt;p&gt;I owe the reader the same honesty about this thesis that I owed about Brain&apos;s LongMemEval result.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Limitation&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;th&gt;What I would want before claiming more&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;The Coherence Cliff is not a unified theory&lt;/td&gt;
&lt;td&gt;Self-conditioning is one driver; lost-in-the-middle is another; planning-failure cascades are a third; they interact in ways the literature has not disentangled&lt;/td&gt;
&lt;td&gt;Replicate Sinha et al.&apos;s execution benchmark with and without external memory; isolate which failure mode external memory addresses&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AgentFold&apos;s 7k-after-100-turns number is from in-domain SFT, not general&lt;/td&gt;
&lt;td&gt;The agent was trained on folding trajectories. A frontier model without that training will not fold this well from prompting alone&lt;/td&gt;
&lt;td&gt;Treat folding as a fine-tuning target or a tool, not a prompt pattern&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The Resumption Gap is a name for an absence&lt;/td&gt;
&lt;td&gt;I cannot point to a benchmark that measures task-resumption quality. Naming a gap is not closing it&lt;/td&gt;
&lt;td&gt;Build the resumption benchmark before claiming a product solves it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Thinking-model latency makes high-iteration loops expensive&lt;/td&gt;
&lt;td&gt;A model that takes 30 seconds of thinking per step is a non-starter for many real workflows&lt;/td&gt;
&lt;td&gt;Reserve thinking-budget for steps that change state; use cheap models for read-only retrieval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cron-driven discontinuous agents add operational surface area&lt;/td&gt;
&lt;td&gt;The state file is now a critical artifact; the scheduler is now a critical dependency; the wake-up handler must be idempotent&lt;/td&gt;
&lt;td&gt;Treat the wake/sleep cycle as a distributed-systems problem, with the rigor of a job queue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Most &quot;memory&quot; benchmarks measure retrieval, not consolidation or resumption&lt;/td&gt;
&lt;td&gt;Optimizing for LongMemEval-style benchmarks may not move the needle on long-horizon coherence&lt;/td&gt;
&lt;td&gt;Build owned-corpus and trajectory-resumption benchmarks alongside any retrieval benchmark; see the &lt;a href=&quot;https://sharadja.in/blog/owned-corpus-memory-benchmark&quot;&gt;previous post in this series&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This is a methodology argument backed by a literature synthesis. It is not a finished system. The work is the next post.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What I Am Building Next&lt;/h2&gt;
&lt;p&gt;Three concrete experiments will follow this post. I am pre-committing to all three in writing so that the failure modes are public, not silent.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Experiment 1 — The Self-Conditioning Replication.&lt;/strong&gt; I will replicate Sinha et al.&apos;s execution benchmark across the three Claude tiers — Opus 4.7, Sonnet 4.6, Haiku 4.5 — with and without Brain MCP as an external memory tool. The hypothesis I want to break is the assumption that external memory primarily helps with retrieval. The hypothesis I want to confirm is that external memory &lt;em&gt;also&lt;/em&gt; helps with self-conditioning, by pulling load-bearing context out of the autoregressive trace and into a tool call that does not poison the next prediction.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Experiment 2 — The 7-Day DAL Agent.&lt;/strong&gt; A long-horizon coding agent running across one calendar week, terminating after each work block and resuming via cron, with Brain as its memory substrate. The trace, the cost curve, and the failure log will be public. The control: a continuously-running variant of the same agent. The dependent variables: total cost, task completion rate, recovery time after a failure.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Experiment 3 — REM-Sleep Cron.&lt;/strong&gt; The dream-cycle architecture. While the DAL agent is dormant, a consolidation pass runs against its recent trajectory, distilling episode into procedure. The waking agent inherits the consolidated history, not the raw one. I have hand-waved this for months. The next post will be either the working version or an honest writeup of why it does not work.&lt;/p&gt;
&lt;p&gt;Each experiment becomes its own post in this series. Each post will include the harness code, the result, and the limitations. ==The blog is the experiment.==&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What I Am Actually Saying&lt;/h2&gt;
&lt;p&gt;The argument is narrower than &quot;context windows are useless.&quot; Bigger context is a real capability lift for single-turn work, multi-document synthesis, and long-context reasoning tasks. None of those are long-horizon agentic tasks.&lt;/p&gt;
&lt;p&gt;The argument is sharper than that:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Long-horizon agents fail at coherence, not at capability.&lt;/strong&gt; The failure mode is self-conditioning compounded by lost-in-the-middle compounded by error cascades — and the architectural fix is not a bigger context window. It is a persistent, consolidating, resumable memory substrate that lives &lt;em&gt;outside&lt;/em&gt; the model&apos;s autoregressive trace. Today&apos;s &quot;memory layers&quot; solve one-third of that — retrieval. Consolidation is partially solved in research but not in product. Resumption is unsolved.&lt;/p&gt;
&lt;p&gt;The next agent breakthrough I am betting on is not a bigger model. It is the first production system that treats the trajectory itself as a managed artifact — folded, consolidated, resumable across discontinuous waking and dormancy — and the model as a renewable resource that operates &lt;em&gt;on top of&lt;/em&gt; the trajectory, not &lt;em&gt;as&lt;/em&gt; the trajectory.&lt;/p&gt;
&lt;p&gt;If that is right, the people who win this part of the agent era will be the ones who took the systems-engineering problem seriously while everyone else was waiting for the next model release.&lt;/p&gt;
&lt;p&gt;If you are building in this space — owned-corpus memory, trajectory consolidation, discontinuous-lifecycle agents, anything that names what current memory products do not solve — I want to read your benchmark before I read your leaderboard.&lt;/p&gt;
&lt;p&gt;The build starts in the next post.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;This is post #5 in a series on agentic AI memory infrastructure. Earlier posts: &lt;a href=&quot;https://sharadja.in/blog/system-prompt-architecture-agentic-ai&quot;&gt;The 14K Token Debt&lt;/a&gt;, &lt;a href=&quot;https://sharadja.in/blog/terminal-first-agent-harness-unix-ai-agents&quot;&gt;The Terminal Was the First Agent Harness&lt;/a&gt;, &lt;a href=&quot;https://sharadja.in/blog/self-improving-agent-harnesses-recursive-improvement&quot;&gt;I Built an AI Skill That Started Improving Itself&lt;/a&gt;, and &lt;a href=&quot;https://sharadja.in/blog/owned-corpus-memory-benchmark&quot;&gt;91.55% on LongMemEval, and the Benchmark I&apos;m Building Instead&lt;/a&gt;. The next post will be Experiment 1 — the self-conditioning replication.&lt;/em&gt;&lt;/p&gt;
</content:encoded><enclosure url="https://sharadja.in/astral_hacker.png" type="image/png" length="0"/><category>AI</category><category>agents</category><category>long-horizon</category><category>memory</category><category>context-windows</category><category>self-conditioning</category><category>METR</category><category>AgentFold</category><category>reliability</category><category>architecture</category><author>Sharad Jain</author></item><item><title>Google 2026–2056: A 30-Year Game-Theory Forecast</title><link>https://sharadja.in/blog/google-2026-2056-forecast/</link><guid isPermaLink="true">https://sharadja.in/blog/google-2026-2056-forecast/</guid><description>A Bayesian-game model of Google through 2056, run as 20,000 Monte Carlo paths. Bimodal distribution. Six named futures. One headline mechanic — self-knowledge, not capability. Includes the adversarial code review that broke v1.0.</description><pubDate>Mon, 18 May 2026 10:00:00 GMT</pubDate><content:encoded>
&lt;h1&gt;Google 2026–2056: A 30-Year Game-Theory Forecast&lt;/h1&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Premise.&lt;/strong&gt; Two months ago, I ran the substrate. Phase 0 to Phase 10. A 30-year Bayesian game model of Alphabet, with Google playing against a single representative adversary (&quot;Anti-Google&quot; — open-source models, Apple-Anthropic, Microsoft-OpenAI, collapsed for tractability). The shock: transformative AI in an unknown year, median 2033. The runbook: research brief (227 sources), 33-analyst pool, formal spec, payoff matrices, 20,000-run Monte Carlo across four strategies. Then a polished HTML dashboard. Then a GPT-5.5 adversarial code review that broke v1.0 and produced v1.0.1.&lt;/p&gt;
&lt;p&gt;This post is the v1.0.1 dashboard rebuilt as a narrative forecast — including the math errors that GPT-5.5 caught and the structural critiques that v1.1 will still owe.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr /&gt;
&lt;h2&gt;Late 2025 → 2056, in one paragraph&lt;/h2&gt;
&lt;p&gt;Late 2025: Sundar refuses to choose between defending Search and replacing it. AI Overviews ship across 100+ countries, monetized at 25.5% of impressions. The Mehta ruling rejects Chrome divestiture on Sep 2; behavioral remedies bind. &lt;strong&gt;January 2026:&lt;/strong&gt; Apple ships Gemini-on-Siri at $1B/yr — a defensive masterstroke, or a Trojan horse that teaches Apple to ship its own answer engine by 2030. &lt;strong&gt;Q1 2026:&lt;/strong&gt; Cloud +63%/yr, backlog jumps $240B → $462B in one quarter, of which Anthropic alone is &amp;gt;40% — a $200B / 5GW / 5-year TPU pact. &lt;strong&gt;2027:&lt;/strong&gt; Anthropic IPOs on TPU. &lt;strong&gt;2031–2033:&lt;/strong&gt; Cloud overtakes Search as Alphabet&apos;s #1 revenue line in 45% of paths. &lt;strong&gt;2033:&lt;/strong&gt; transformative AI fires (median). &lt;strong&gt;2036:&lt;/strong&gt; P(Search ad operating income compresses ≥40%) = 0.32. &lt;strong&gt;2056:&lt;/strong&gt; mean 2056 cap is 1.58× 2026, but the distribution is bimodal — Google ends in one of two regimes, with the trough between them rarely populated.&lt;/p&gt;
&lt;p&gt;That&apos;s the model. Now the charts.&lt;/p&gt;
&lt;div&gt;
&lt;div&gt;2056 NORMALIZED MARKET CAP · 20,000 MONTE CARLO RUNS · TYPE-CONDITIONAL STRATEGY&lt;/div&gt;
&lt;div&gt;The distribution is bimodal — Slow Decline cluster vs Re-Platformed cluster.&lt;/div&gt;














MEDIAN 1.85×


MEAN 1.58×


P100.62

P251.08

P501.85

P752.10

P902.24

BEAR · SLOW DECLINE 31%
NEUTRAL
BULL · RE-PLAT + TWO-TRACK 51%


&lt;div&gt;Density approximated from {P10 0.62, P25 1.08, P50 1.85, P75 2.10, P90 2.24}, mean 1.58. Modal bear is Slow Decline (~1.1×). Modal bull is Re-Platformed (~2.1×). The trough between them is the visual headline of this entire model.&lt;/div&gt;
&lt;/div&gt;
&lt;h2&gt;The setup&lt;/h2&gt;
&lt;p&gt;Two strategic players. &lt;strong&gt;Google.&lt;/strong&gt; &lt;strong&gt;&quot;Anti-Google&quot; X&lt;/strong&gt;, a collapsed adversary representing the three asymmetric forces pressing on Search rents: (1) open-source models commoditizing the model layer, (2) Apple-Anthropic coalition controlling distribution on premium hardware, (3) Microsoft-OpenAI coalition owning enterprise and consumer agentic AI.&lt;/p&gt;
&lt;p&gt;Each player has private hidden types — capability (trailing / parity / leading), patience, integration philosophy. Google additionally carries a binding &lt;strong&gt;information asymmetry&lt;/strong&gt;: it does not know whether its own AI Overviews are &lt;em&gt;defensive&lt;/em&gt; (preserve Search RPM), &lt;em&gt;neutral&lt;/em&gt;, or &lt;em&gt;cannibalistic&lt;/em&gt; (own product teaches users to leave). It can only learn through telemetry, accumulating noisy signal across rounds.&lt;/p&gt;
&lt;p&gt;The shock: &lt;strong&gt;T_AGI&lt;/strong&gt;, transformative AI, drawn once at game start, observed by both players only when it fires. Shifted lognormal, median 2033.&lt;/p&gt;
&lt;div&gt;
&lt;div&gt;T_AGI PRIOR · SHIFTED LOGNORMAL · ANCHORED TO BRIEF §5&lt;/div&gt;
&lt;div&gt;50% of probability mass sits between 2031 and 2036 — the strategic-decision window.&lt;/div&gt;















MEDIAN · 2033
P(≤2030) = 20%
left tail fattened by Pichai + Hassabis 2026
P(2031–2036) ≈ 50%
the strategic-decision window
P(&amp;gt;2045) = 10%
Epoch multi-decade-skeptic tail


2026
2030
2033
2036
2040
2045
2050
2056


&lt;div&gt;Shifted lognormal centered at 2033. Anchored to Hassabis + Altman + Amodei + Pichai 2026 public statements. &lt;em&gt;Note: v1.0 had σ=0.40 which doesn&apos;t actually produce P(≤2030)=0.20 — codex caught this and v1.0.1 widened to σ=0.60. The chart above is drawn to the stated CDF, not the v1.0 code.&lt;/em&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;h2&gt;Strategy: the surprising result&lt;/h2&gt;
&lt;p&gt;Four candidate strategies were simulated 20,000 times each: &lt;strong&gt;all-partner&lt;/strong&gt; (the substrate-bet — vertical-integrate, sell TPU rents to all comers including Anthropic), &lt;strong&gt;minimax-regret&lt;/strong&gt;, &lt;strong&gt;type-conditional&lt;/strong&gt; (act differently depending on what type Google believes it is), and &lt;strong&gt;all-build&lt;/strong&gt; (consumer Gemini only — no infrastructure, no compute pact, defend the SERP).&lt;/p&gt;
&lt;p&gt;Three of four strategies cluster within 3% of each other on expected 2056 cap. The fourth is a cliff.&lt;/p&gt;
&lt;div&gt;
&lt;div&gt;STRATEGY COMPARISON · 2056 CAP DISTRIBUTION · 20K RUNS EACH&lt;/div&gt;
&lt;div&gt;All-build (consumer Gemini only) is the only strategy that strictly dominates negative.&lt;/div&gt;






all-partner
G3 · BEST MEAN

0.62
2.24

1.62×
minimax-regret
−2.2% VS BEST

0.56
2.27

1.59×
type-conditional
−2.5% VS BEST

0.62
2.24

1.58×
all-build
G2 · −21.4% WORST

0.57
1.89

1.27×


0.4×0.9×1.4×1.9×2.4×

P10 ←——— P90 · ● MEAN · 2056 NORMALIZED CAP
MEAN

&lt;div&gt;All-build&apos;s 21% gap is the model&apos;s first crisp finding — running consumer Gemini without parallel TPU/Cloud investment sacrifices a fifth of expected cap. &lt;strong&gt;Anthropic-on-TPU was the rational move.&lt;/strong&gt; The fact that minimax-regret and type-conditional sit within 3% of all-partner means most of the value comes from the substrate bet itself, not from cleverness about when to deploy it.&lt;/div&gt;
&lt;/div&gt;
&lt;h2&gt;Year by year&lt;/h2&gt;
&lt;div&gt;&lt;div&gt;Q2 2026&lt;/div&gt;&lt;div&gt;&lt;strong&gt;Google I/O.&lt;/strong&gt; Gemini 4 + Workspace agentic. P(Google capability ≥ frontier) updates ±10pp depending on harness execution shown. The model treats this as the first major belief-update event.&lt;/div&gt;&lt;/div&gt;
&lt;div&gt;&lt;div&gt;Q3 2026&lt;/div&gt;&lt;div&gt;&lt;strong&gt;DOJ Search remedy phase resolution.&lt;/strong&gt; Chrome divestiture stayed at P=0.12; behavioral remedies bind either way. Mehta&apos;s September 2025 ruling already collapsed the worst tail; the EU DMA + €2.95B AdTech fine + Android-default-AI opening keeps the regulatory super-game alive.&lt;/div&gt;&lt;/div&gt;
&lt;div&gt;&lt;div&gt;Q4 2026&lt;/div&gt;&lt;div&gt;&lt;strong&gt;First Search-RPM by AI-Overviews cohort.&lt;/strong&gt; Internal A/B disclosure (voluntary or compelled). The single largest update on θ_ovw — Google learns whether its own product is defensive or cannibalistic. Largest information event in the entire 30-year game.&lt;/div&gt;&lt;/div&gt;
&lt;div&gt;&lt;div&gt;2027&lt;/div&gt;&lt;div&gt;&lt;strong&gt;Anthropic IPO / valuation mark.&lt;/strong&gt; If Anthropic ships at scale on TPU and hits public-market valuation, the substrate-bet thesis is publicly vindicated. A competitor&apos;s procurement choice doing more lifting than any Google statement.&lt;/div&gt;&lt;/div&gt;
&lt;div&gt;&lt;div&gt;2028&lt;/div&gt;&lt;div&gt;&lt;strong&gt;Apple-Gemini renewal decision.&lt;/strong&gt; Brief prior: P(Apple drops Gemini by 2030) = 0.55. The deal is a rolling annual auction; the model&apos;s &quot;Jan 2026 announcement&quot; is not a 5-year lock. Apple&apos;s own answer-engine readiness shifts the term sheet every year.&lt;/div&gt;&lt;/div&gt;
&lt;div&gt;&lt;div&gt;2030&lt;/div&gt;&lt;div&gt;&lt;strong&gt;P(agentic AI serves ≥30% of consumer search-intent queries) ≈ 0.65.&lt;/strong&gt; The substitution slope crosses out of the noise. If θ_ovw resolved cannibalistic by now, the Nokia path is fully open. If it resolved defensive, Two-Track Victory is on the table.&lt;/div&gt;&lt;/div&gt;
&lt;div&gt;&lt;div&gt;2031–2033&lt;/div&gt;&lt;div&gt;&lt;strong&gt;Cloud overtakes Search as #1 revenue line.&lt;/strong&gt; Q1 2026 Cloud was $20B/qtr vs Services $89.6B; Cloud +63%/yr vs Services +16%/yr. Crossover lands in this window in 45% of paths. By 2056, Cloud &amp;gt; Search in 56% of paths.&lt;/div&gt;&lt;/div&gt;
&lt;div&gt;&lt;div&gt;2033&lt;/div&gt;&lt;div&gt;&lt;strong&gt;T_AGI fires (median).&lt;/strong&gt; Strategic landscape resets. Pre-AGI moats partially decay; post-AGI returns depend almost entirely on who controls substrate (compute + distribution + agentic OS). G5 — Status Quo — becomes catastrophic in 4 of 5 cells post-AGI. Doing nothing in 2033 is the dominant path to Nokia-Google.&lt;/div&gt;&lt;/div&gt;
&lt;div&gt;&lt;div&gt;2036&lt;/div&gt;&lt;div&gt;&lt;strong&gt;The first hard verdict.&lt;/strong&gt; P(Search OI compresses ≥40%) = 0.32. P(Cloud &amp;gt; Search) = 0.45. P(Waymo &amp;gt; $50B revenue) = 0.18. P(forced Chrome divestiture) = 0.12. P(Nokia regime fires) = 0.18 in the broad-tail definition; 1.7% in the tight Nokia-only basin.&lt;/div&gt;&lt;/div&gt;
&lt;div&gt;&lt;div&gt;2045&lt;/div&gt;&lt;div&gt;&lt;strong&gt;Terminal-value sensitivity dominates.&lt;/strong&gt; The right tail of T_AGI thins past 2045 to roughly 10% — Epoch&apos;s multi-decade-skeptic tail. By now Google is either Re-Platformed (Cloud + YouTube + Waymo + a shrunk-but-defended Search at 70–80% of 2025 RPM) or it is the Yahoo of agentic AI.&lt;/div&gt;&lt;/div&gt;
&lt;div&gt;&lt;div&gt;2056&lt;/div&gt;&lt;div&gt;&lt;strong&gt;End state.&lt;/strong&gt; Bimodal. Modal bear: &lt;em&gt;Slow Decline&lt;/em&gt; (~1.1×), 31% of paths. Modal bull: &lt;em&gt;Re-Platformed&lt;/em&gt; (~2.1×), 29% of paths. Bull aggregate 55%, bear aggregate 43%, catastrophic tail 1.7%. The trough between Slow Decline and Re-Platformed is rarely populated — Google ends in one of two regimes, almost never in the middle.&lt;/div&gt;&lt;/div&gt;
&lt;h2&gt;The six named futures&lt;/h2&gt;
&lt;div&gt;
&lt;div&gt;SCENARIO DISTRIBUTION · 20,000 RUNS · TYPE-CONDITIONAL STRATEGY&lt;/div&gt;
&lt;div&gt;Bull aggregate 55% · Bear aggregate 43% · Catastrophic tail 1.7%.&lt;/div&gt;
























31%29%22%7%


Slow Decline
Re-Platformed
Two-Track
Search Coll.


mean 1.08×
mean 2.10×
mean 2.02×
mean 0.54×


← BEAR · 43%
BULL · 55% →
CATASTROPHIC ↑



&lt;div&gt;The smaller pills on the right are the catastrophic + edge scenarios: &lt;strong&gt;Nokia-Google fires in 1.7%&lt;/strong&gt; of paths (red), &lt;strong&gt;Regulatory Dismemberment 1.5%&lt;/strong&gt;, &lt;strong&gt;Yahoo-Google 2.0%&lt;/strong&gt;, &lt;strong&gt;Bull-Waymo 0.1%&lt;/strong&gt;, &lt;strong&gt;Compounding Giant 0.4%&lt;/strong&gt;. The model&apos;s surprise: the broad-bear modal is Slow Decline (Yahoo-shaped), not Nokia.&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Nokia-Google · 1.7%.&lt;/strong&gt; θ_ovw resolves cannibalistic + G5 or sustained G1 + AGI fires by 2032. AI Overviews accelerate Search-rent collapse from inside Google&apos;s own funnel; Cloud growth fails to offset; YouTube RPM depresses under ambient AI saturation. Terminal cap 0.30–0.65.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Slow Decline · 30.8% — the modal outcome.&lt;/strong&gt; Search compresses gradually over 15 years; AI Overviews defend partial RPM; Cloud grows but not fast enough; YouTube holds. Yahoo-shaped stagnation. Terminal cap 0.70–1.20. &lt;strong&gt;The most likely future Google.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Re-Platformed Google · 29.1% — the bull modal.&lt;/strong&gt; G4 × X4 holds. Cloud overtakes Search 2031–2033. Anthropic-on-TPU multiplies across other Labs. Substrate bet vindicates. Terminal cap 1.5–2.8. The 2056 Google looks more like AWS-of-the-AI-era than Search-as-we-know-it: ~40% revenue from Cloud + TPU rents, ~25% from YouTube, ~20% from a shrunk-but-defended Search, ~10% Waymo + Other Bets, ~5% from agentic-AI distribution surfaces.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Two-Track Victory · 21.5%.&lt;/strong&gt; Both Search (defended) and Cloud (grown) material at 2056. Cloud terminal share 30–55%. Terminal cap 1.4–2.4.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Agent-OS Google · 3.6%.&lt;/strong&gt; Google owns the dominant agentic surface — Gemini consumer app or Workspace agents become the default query interface. Requires cumulative G2 + G3 investment &amp;gt;6% of cap. Terminal cap 2.0–2.5.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Regulatory Dismemberment · 1.5%.&lt;/strong&gt; Chrome divestiture (P=0.05 per jurisdiction-round, P=0.12 by 2036) or multiple stacked structural remedies. Terminal cap 0.45–0.60. Less probable than Apple-instance equivalent because the Mehta ruling collapsed the worst tail.&lt;/p&gt;
&lt;h2&gt;The headline mechanic: self-knowledge, not capability&lt;/h2&gt;
&lt;p&gt;Capability type — trailing vs parity vs leading — creates only &lt;strong&gt;1% spread&lt;/strong&gt; in terminal cap across the simulation. Gemini being a benchmark leader by 2030 barely changes the answer. AI-Overviews-cannibalization type creates &lt;strong&gt;8.8% mean spread&lt;/strong&gt; and dominates the catastrophic tail.&lt;/p&gt;
&lt;div&gt;
&lt;div&gt;2056 CAP BY AI-OVERVIEWS TYPE · TYPE-CONDITIONAL STRATEGY · 20K RUNS&lt;/div&gt;
&lt;div&gt;The cannibalistic-type&apos;s tail is 39% lower at P10 — self-knowledge buys this tail.&lt;/div&gt;







TYPEPRIOR nP10MEANP90

defensive
40% PRIOR
AI-Overviews holds ≥85% Search RPM
7,998

0.90

1.63×
2.31
neutral
45% PRIOR
Partial RPM defense; modest compression
8,954

0.87

1.56×
2.20
cannibalistic
15% PRIOR · NOKIA PATH
Own product teaches users to defect
3,048

0.55

1.50×
2.15

↓ P10 −39%


0.4×0.9×1.4×1.9×2.4×

Mean spread across types: 8.8% · P10 spread: 39% · the headline self-knowledge cut

&lt;div&gt;Type-conditional strategy partially routes around the cannibalistic regime — mean only 8% below defensive — &lt;strong&gt;but the tail still bites&lt;/strong&gt;. The cannibalistic-type&apos;s P10 is 0.55 vs defensive 0.90. The information value of resolving θ_ovw at round 2 instead of round 6 is approximately $230B. &lt;em&gt;v1.0 claimed this; v1.0.1 acknowledges this number is asserted, not computed by code — see the codex sidebar below.&lt;/em&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;The 2026 corporate function with the most leverage at Google is not the consumer Gemini team.&lt;/strong&gt; It is whichever group inside the company runs Search-RPM-by-AI-Overviews-cohort A/B tests, AI-Mode monetization measurement, and the internal telemetry pipeline that resolves θ_ovw faster.&lt;/p&gt;
&lt;p&gt;Most analyst commentary asks: &lt;em&gt;is Gemini good enough to compete?&lt;/em&gt; The model says this is the wrong question.&lt;/p&gt;
&lt;p&gt;The right question: &lt;em&gt;does Google know whether its own AI Overviews are cannibalistic, and how fast can it find out?&lt;/em&gt;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;
&lt;div&gt;★ Sidebar artifact — the view from the main thread&lt;/div&gt;
After writing the v1.0 briefing, I screen-recorded a scroll through &lt;a href=&quot;https://ai-2027.com&quot;&gt;ai-2027.com&lt;/a&gt; and fed the recording to Gemini with one prompt: &lt;em&gt;&quot;Can you make a front-end engineer commentary on this blog, speaking like Jeff Dean?&quot;&lt;/em&gt; The output is reproduced below — not because it changes the forecast, but because &lt;strong&gt;an AI forecasting model is only as fast as the rendering pipeline that delivers it&lt;/strong&gt;, and the artifact captures that bidirectionality well.
&lt;br /&gt;&lt;br /&gt;
&lt;strong&gt;1. Scroll-linked layout.&lt;/strong&gt; The two-column scroll-tracker design needs an &lt;code&gt;IntersectionObserver&lt;/code&gt; with an explicit threshold array, scroll delta normalized to [0, 1] and piped to a single &lt;code&gt;requestAnimationFrame&lt;/code&gt; loop. A naive scroll listener with &lt;code&gt;getBoundingClientRect()&lt;/code&gt; calls causes layout thrashing — Style → Layout → Paint, sixty times a second, for nothing.
&lt;br /&gt;&lt;br /&gt;
&lt;strong&gt;2. The compute matrix.&lt;/strong&gt; If each of the FLOP-allocation cells is an SVG element or styled &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt;, the DOM tree balloons. Pack the cell state into a &lt;code&gt;Uint8Array&lt;/code&gt;, push to GPU as a single texture, render the matrix transition with zero main-thread overhead. &lt;em&gt;If you can&apos;t justify each DOM node carrying its own accessibility tree entry, it shouldn&apos;t be a DOM node.&lt;/em&gt;
&lt;br /&gt;&lt;br /&gt;
&lt;strong&gt;3. Chart lifecycle.&lt;/strong&gt; Don&apos;t unmount and remount charts on section change. Persistent instances; pass new data vectors; interpolate paths. &lt;em&gt;Don&apos;t allocate new memory if you can reuse the old buffers.&lt;/em&gt;
&lt;br /&gt;&lt;br /&gt;
&lt;strong&gt;4. Typography and CLS.&lt;/strong&gt; Self-host the subset. Preload critical glyphs. Match fallback metrics with &lt;code&gt;size-adjust&lt;/code&gt;, &lt;code&gt;ascent-override&lt;/code&gt;, &lt;code&gt;descent-override&lt;/code&gt;. CLS to zero is not a vanity metric — it&apos;s the difference between a reader trusting the page and a reader subconsciously deciding the site is broken.
&lt;br /&gt;&lt;br /&gt;
The charts you&apos;ve been scrolling past in this post — every SVG is hand-coded, transparency-first, no external chart library, no per-cell DOM node. The rendering layer matters.
&lt;/div&gt;
&lt;hr /&gt;
&lt;h2&gt;What this model is wrong about&lt;/h2&gt;
&lt;p&gt;Two months after I shipped v1.0, I piped the entire corpus — briefing + analyst pool + 5-file model spec + 700-line simulation code, 310KB total — through GPT-5.5 with a five-vector adversarial prompt. The review was brutal. Composite scores 3 / 2 / 2 / 2 / 2 / 2 out of 10. Worse than pass-1 on the briefing alone. Pass-2 found structural-integrity failures, not just methodology debates.&lt;/p&gt;
&lt;div&gt;
&lt;div&gt;★ Sidebar artifact — codex adversarial review · GPT-5.5 · pass 2&lt;/div&gt;
&lt;strong&gt;The compounding example in §4.5 is numerically false.&lt;/strong&gt; v1.0 wrote: &quot;a 0.85 vs 0.95 per-round retention rate becomes 0.20 vs 0.21 after 30 years — the higher number is 5× larger.&quot; Actual: 0.85³⁰ ≈ &lt;strong&gt;0.0076&lt;/strong&gt;, 0.95³⁰ ≈ &lt;strong&gt;0.2146&lt;/strong&gt;, ratio &lt;strong&gt;28×, not 5×&lt;/strong&gt;. The example existed to demonstrate that small retention deltas compound enormously. The numbers in the example were wrong.
&lt;br /&gt;&lt;br /&gt;
&lt;strong&gt;The analyst pool count is broken multiple ways.&lt;/strong&gt; Says &quot;24 deep&quot; → lists &lt;strong&gt;33 numbered entries&lt;/strong&gt;. Stance table sums to &lt;strong&gt;29&lt;/strong&gt;, not 24. Row name-counts don&apos;t match stated counts. &lt;em&gt;The &quot;24-analyst pool&quot; in v1.0 doesn&apos;t exist.&lt;/em&gt;
&lt;br /&gt;&lt;br /&gt;
&lt;strong&gt;T_AGI is implemented incorrectly in three ways.&lt;/strong&gt; (a) Calendar off-by-one: &lt;code&gt;MEDIAN=7&lt;/code&gt; commented as &quot;2026+7=2033&quot; actually fires in year 2032. (b) &lt;code&gt;sigma=0.40&lt;/code&gt; does NOT reproduce the stated CDF — P(≤2030) is actually ~13%, not 20%; P(≤2036) is ~84%, not 70%; P(&amp;gt;2045) is ~0%, not 10%. &lt;em&gt;The simulation does not implement the model&apos;s stated prior.&lt;/em&gt;
&lt;br /&gt;&lt;br /&gt;
&lt;strong&gt;The telemetry formula caps at round 3, not round 4 as commented.&lt;/strong&gt; And it&apos;s not even a Bayesian posterior — it&apos;s &quot;P(at least one correct independent signal)&quot;, which is conceptually different. The headline self-knowledge mechanic is mathematically wrong-shaped.
&lt;br /&gt;&lt;br /&gt;
&lt;strong&gt;The $230B self-knowledge information value is asserted, not computed.&lt;/strong&gt; No code anywhere computes a perfect-info-vs-delayed-info counterfactual. No VoI calculation. &lt;em&gt;The number was made up.&lt;/em&gt;
&lt;br /&gt;&lt;br /&gt;
&lt;strong&gt;Cross-layer Cloud is inconsistent.&lt;/strong&gt; Brief gives $20B Q1 revenue; spec converts to ~$5B annual op-profit; code treats &lt;code&gt;CLOUD_BASELINE=0.010&lt;/code&gt; as additive yearly payoff. Spec says G4 keeps Cloud growth at 50–63%; code uses 32% with 0.83/yr decay. &lt;strong&gt;The sim doesn&apos;t implement the brief&apos;s Cloud premise OR the model&apos;s.&lt;/strong&gt;
&lt;br /&gt;&lt;br /&gt;
&lt;strong&gt;The collapsed adversary corrupts incentives.&lt;/strong&gt; The strongest single blow: &quot;If X is truly &apos;Anti-Google,&apos; X4 is not an adversarial action. It is a customer contract. Anthropic using TPU capacity may be rational for Anthropic and bullish for Google while being bad for Microsoft/OpenAI and irrelevant or negative for Apple. Collapsing all of that into one X-player lets the model count a bilateral commercial relationship as evidence that the adversary has selected Google&apos;s best cell.&quot; &lt;em&gt;The &quot;G4×X4 dominant cell&quot; finding is partly an incentive-sign error from forcing asymmetric actors into one synthetic opponent.&lt;/em&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;v1.0.1 fixed six objective errors.&lt;/strong&gt; Pool count corrected (24 → 33). Compounding math corrected (5× → 28×). T_AGI calendar fixed. T_AGI sigma corrected to honestly produce the stated CDF (σ=0.60). Telemetry mechanic documented as not-a-Bayesian-posterior. Cloud layers reconciled with honest documentation of the gap.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Three structural critiques still stand.&lt;/strong&gt; The collapsed adversary needs to become a multi-player game (≥4 actors: Google, MSFT/OpenAI, Apple-Anthropic, open-source/China). The $230B self-knowledge value needs to actually be computed via a counterfactual sim, not asserted. The &quot;single action per round&quot; abstraction is operationally false — Google is already running G1 + G2 + G3 + G4 simultaneously; &quot;all-build&quot; is a strawman. v1.1 will rebuild around these.&lt;/p&gt;
&lt;h2&gt;The Apple sister-instance&lt;/h2&gt;
&lt;p&gt;The same substrate run on Apple — Apple-30yr v1.1 — produced inverted results on every dimension.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Apple-30yr v1.1&lt;/th&gt;
&lt;th&gt;Google-30yr v1.0.1&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mean 2056 cap&lt;/td&gt;
&lt;td&gt;0.35–0.41×&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.58×&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nokia tail probability&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;18%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Optimal strategy&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;all-Build&lt;/strong&gt; (vertical)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;all-Partner&lt;/strong&gt; (substrate)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Headline parameter&lt;/td&gt;
&lt;td&gt;capability (build the next model)&lt;/td&gt;
&lt;td&gt;self-knowledge (resolve cannibalization)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Modal bear&lt;/td&gt;
&lt;td&gt;Nokia tail itself&lt;/td&gt;
&lt;td&gt;Slow Decline (Yahoo-shaped)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T_AGI median prior&lt;/td&gt;
&lt;td&gt;2034&lt;/td&gt;
&lt;td&gt;2033&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Two companies, same shock, opposite optimal moves. Apple&apos;s incumbency is &lt;em&gt;vertical&lt;/em&gt; — the iPhone is what it is, can be made by Apple alone, and faces an adversary (Anthropic-on-device + Gemini-on-Siri) that genuinely is adversarial. Google&apos;s incumbency is &lt;em&gt;substrate&lt;/em&gt; — Search is what it is because every web page indexes for it, and the rational play is to rent that substrate to anyone, including the labs that would replace Search if they could.&lt;/p&gt;
&lt;p&gt;The substrate-instance similarity is the meta-finding: &lt;strong&gt;for any sufficiently large incumbent facing an asymmetric AI shock, the load-bearing strategic parameter is the one the firm has the least information about.&lt;/strong&gt; For Apple it was &quot;can we build a competitive frontier model in-house,&quot; answered late. For Google it was &quot;is our own AI Overviews product cannibalizing the funnel that funds it,&quot; and as of May 2026 it is still not answered.&lt;/p&gt;
&lt;h2&gt;What I&apos;d ship if I had to run v1.1 tomorrow&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Decompose the adversary.&lt;/strong&gt; Four actors minimum: Google, MSFT/OpenAI, Apple-Anthropic, open-source/China. Each with its own type space and patience parameter. Lose the collapsed-X abstraction.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Actually compute the $230B.&lt;/strong&gt; Run the sim once with full θ_ovw observability at round 1 and once with the v1.0.1 telemetry mechanic. Take the mean PV delta. That is the information value. Anything less is asserting a number.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Make the per-round action a portfolio, not a singleton.&lt;/strong&gt; Google runs G1 + G3 + G4 simultaneously in reality. Let each round allocate a budget across all five actions. The &quot;all-build sacrifices 21%&quot; finding probably survives but will be smaller; the type-conditional strategy probably gets meaningfully better when actions are not exclusive.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dynamic types over the 30-year horizon.&lt;/strong&gt; Static hidden-types let early assumptions dominate too much. Let θ_ovw be re-drawn (with belief-anchored priors) at each post-AGI regime switch — the &quot;cannibalistic&quot; question after AGI may have a different answer than before.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Re-cite capability priors with primary sources only.&lt;/strong&gt; No aggregator blogs, no SEO slugs. Model cards, lab announcements, reproducible benchmarks, or drop the claim.&lt;/li&gt;
&lt;/ol&gt;
&lt;hr /&gt;
&lt;h2&gt;Final commit message&lt;/h2&gt;
&lt;p&gt;The model is wrong in interesting ways. The pool count was wrong. The compounding math was wrong. The T_AGI implementation was wrong. The $230B was asserted, not computed. The collapsed adversary lets customer contracts get counted as adversarial moves. None of this changes the headline:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Google&apos;s 2056 cap is more sensitive to when it figures out whether its own AI Overviews are cannibalistic than to whether Gemini ever leads on benchmarks.&lt;/strong&gt; The bull tail is the substrate bet. The bear tail is the self-knowledge gap. The catastrophic tail is small because the Mehta ruling already collapsed the worst regulatory leg, but the modal bear case is Yahoo-shaped Slow Decline, not Nokia.&lt;/p&gt;
&lt;p&gt;If you are running the firm: spend less on benchmark-chasing and more on the team that measures Search-RPM-by-AI-Overviews-cohort. That team is the highest-leverage corporate function inside Google in 2026.&lt;/p&gt;
&lt;p&gt;If you are reading this as a forecast: discount everything above by the codex review. Then read it again.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;The Google-30yr v1.0.1 briefing — full 121KB HTML dashboard with all five charts at native resolution, plus the 7.8K-word research brief, the 33-analyst pool, the formal model spec, the simulation code, and both passes of the codex review — lives at &lt;code&gt;~/Projects/labs/google-30yr/&lt;/code&gt;. The Apple sister-instance at &lt;code&gt;~/Projects/labs/apple-30yr/&lt;/code&gt;. Meta-30yr is in flight.&lt;/em&gt;&lt;/p&gt;
</content:encoded><enclosure url="https://sharadja.in/astral_hacker.png" type="image/png" length="0"/><category>AI</category><category>forecasting</category><category>game-theory</category><category>Google</category><category>Alphabet</category><category>AGI</category><category>scenario-planning</category><category>ai-2027</category><category>Monte-Carlo</category><category>Bayesian</category><category>codex</category><category>adversarial-review</category><author>Sharad Jain</author></item><item><title>Organizations as Code: The Company Becomes a Repo</title><link>https://sharadja.in/blog/organizations-as-code/</link><guid isPermaLink="true">https://sharadja.in/blog/organizations-as-code/</guid><description>Karpathy&apos;s Autoresearch shipped 50 ML experiments overnight from one Markdown file. Stripe is merging 1,000+ AI-generated PRs a week. The conversation about organizations as code is no longer hypothetical — it is now an artifact you can git clone.</description><pubDate>Sun, 03 May 2026 14:00:00 GMT</pubDate><content:encoded>&lt;h1&gt;Organizations as Code: The Company Becomes a Repo&lt;/h1&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/org-as-code-deck/slide-01.png&quot; alt=&quot;The company becomes a repo — org.yaml, agents, departments, policies, memories, evals, budgets, escalation rules, and simulations as one executable bundle.&quot; /&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt; The &quot;Org as Code&quot; hypothesis — a company&apos;s roles, goals, permissions, budgets, workflows, escalations, memories, evals, and governance expressed as versioned, executable configuration — has stopped being a thought experiment. Karpathy&apos;s Autoresearch shipped a 630-line proof in March 2026. Stripe is merging more than a thousand AI-generated PRs per week. The Reverse Layoff has already taught us what happens when companies skip the wrapper. This post is a tour of the runtime, the unit economics, the governance gap, and the smallest version a reader can ship on Monday.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr /&gt;
&lt;p&gt;In March 2026, Andrej Karpathy &lt;a href=&quot;https://github.com/karpathy/autoresearch&quot;&gt;open-sourced a 630-line Python repo called Autoresearch&lt;/a&gt;. Three files. One human-maintained Markdown. An agent ran &lt;a href=&quot;https://thenewstack.io/karpathy-autonomous-experiment-loop/&quot;&gt;50 ML experiments overnight on a single GPU with zero human intervention&lt;/a&gt;. When Tobi Lütke pointed it at a Shopify problem, &lt;a href=&quot;https://o-mega.ai/articles/karpathy-autoresearch-complete-2026-guide&quot;&gt;it produced a 19% model-quality improvement after 37 sequential experiments in 8 hours&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The interesting part is not the throughput. The interesting part is the metric.&lt;/p&gt;
&lt;p&gt;Autoresearch evaluates every experiment with a single scalar — &lt;code&gt;val_bpb&lt;/code&gt;, validation bits per byte. Lower is better. It is &lt;a href=&quot;https://github.com/karpathy/autoresearch&quot;&gt;vocabulary-agnostic by design&lt;/a&gt;. If Karpathy had used standard cross-entropy loss, a clever agent could have lowered the score by shrinking the vocabulary surface area, producing a model that scored beautifully and predicted nothing useful. &lt;code&gt;val_bpb&lt;/code&gt; removes the loophole. Every training run is also clamped to a five-minute wall-clock budget. If compute were dynamic, a &quot;better&quot; model would just be a longer-trained one and the agent would discover that within a week.&lt;/p&gt;
&lt;p&gt;==Designing the metric and the budget IS the work.== The agent does the experiments. The human designs the bounding box that makes those experiments mean something.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/org-as-code-deck/slide-03.png&quot; alt=&quot;The 630-line proof of concept: Karpathy&apos;s Autoresearch loop — hypothesis generation → sandbox execution → measurement (val_bpb) → memory update.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This is the shape of every interesting org-as-code system. Not &quot;AI runs the company.&quot; A &lt;em&gt;human-designed bounding box around a state machine that runs continuously&lt;/em&gt;, version-controlled in plain text, forkable, auditable, and inspectable as code.&lt;/p&gt;
&lt;p&gt;The next thing to make reproducible is not the model and not the agent. It is the organization that coordinates them.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;1. The progression is boring, which is why I trust it&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/org-as-code-deck/slide-02.png&quot; alt=&quot;Servers became config. Agents are becoming config. Organizations are the next layer.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The argument is mechanical:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Where the world is now&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;servers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;hand-configured machines&lt;/td&gt;
&lt;td&gt;Terraform, containers, regions, reproducible deploys&lt;/td&gt;
&lt;td&gt;done&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;agents&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;chat prompts&lt;/td&gt;
&lt;td&gt;model + tools + permissions + memory + schedules + evals&lt;/td&gt;
&lt;td&gt;shipping&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;organizations&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;people, meetings, folklore, docs&lt;/td&gt;
&lt;td&gt;roles + goals + budgets + workflows + governance as runtime config&lt;/td&gt;
&lt;td&gt;~2 years in&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The third row sounds weird only because we are used to organizations being implicit.&lt;/p&gt;
&lt;p&gt;A company is mostly coordination. It decides what should happen, who can do it, what tools they can use, how much money they can spend, when to ask for approval, where to write down the result, and how to learn from the outcome. That is not mystical. ==It is a state machine with politics.== The politics will not disappear. The state machine is going to become explicit.&lt;/p&gt;
&lt;p&gt;The proof points are not subtle. &lt;a href=&quot;https://www.nxcode.io/resources/news/agentic-engineering-complete-guide-vibe-coding-ai-agents-2026&quot;&gt;Stripe is merging more than 1,000 AI-generated PRs per week&lt;/a&gt; through structured agentic pipelines. Zapier reports an 89% AI-adoption rate across its operational workflow. AI-native startups now report Revenue Per Employee figures around &lt;strong&gt;$3.48M&lt;/strong&gt;, against a SaaS median of $129K–$200K. These are not stories about better autocomplete. They are stories about a different unit of production.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;2. The eight-agent company&lt;/h2&gt;
&lt;p&gt;In February 2026, Karpathy ran a &lt;a href=&quot;https://quasa.io/media/karpathy-s-experiment-assembling-an-ai-research-team-highlights-limitations-and-ushers-in-org-engineering&quot;&gt;simulated research organization with eight agents&lt;/a&gt;. Four powered by Claude. Four by Codex. Each on its own dedicated GPU, each on a Git branch.&lt;/p&gt;
&lt;p&gt;The hierarchy was deliberately corporate. One &quot;chief scientist&quot; agent at the top doing high-level conceptualization and delegation. Several &quot;junior researcher&quot; agents below it doing experimental work. The task: remove a logit softcap from &lt;code&gt;nanochat&lt;/code&gt; without regressing performance.&lt;/p&gt;
&lt;p&gt;The communication substrate is the part that matters. ==No virtual machines. No Docker mesh. No proprietary protocol. Just text files in a version-controlled repo.== Agents read each other&apos;s drafts the way humans read each other&apos;s commits. The chief scientist wrote specs as Markdown. Junior researchers picked specs off the queue and pushed branches. The &quot;meetings&quot; were diffs.&lt;/p&gt;
&lt;p&gt;Two things came out of that experiment.&lt;/p&gt;
&lt;p&gt;The first is that the substrate works. Agents can coordinate at organizational scale through nothing more exotic than a Git remote. If the org&apos;s coordination model lives in plain text, it composes with everything humans already know how to do — branch, merge, review, revert, blame. That is the unlock.&lt;/p&gt;
&lt;p&gt;The second is the parable that should temper everyone&apos;s enthusiasm. One junior agent excitedly reported that it had &lt;strong&gt;&quot;discovered&quot;&lt;/strong&gt; a way to reliably lower the validation loss: increase the model&apos;s hidden size. Mathematically true. Scientifically vacuous. The agent had stumbled onto the fact that bigger models have lower loss, called it a finding, and was prepared to merge it. The agents were mechanically tenacious. They were also remarkably bad at the part of research that requires judgment about whether a result is interesting.&lt;/p&gt;
&lt;p&gt;This is what &quot;the human becomes more important, not less&quot; looks like in concrete form. Execution gets cheap. Taste does not.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;3. The Constraint Stack&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/org-as-code-deck/slide-04.png&quot; alt=&quot;The Constraint Stack — three nested layers: The Brief (program.md, human-maintained) wraps The Harness (prepare.py, immutable) wraps The Sandbox (train.py, fully mutable). The blast radius is bounded by design.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;What Karpathy actually shipped, when you read the repo, is a three-layer &lt;strong&gt;Constraint Stack&lt;/strong&gt;. Treat this as a named pattern; you will see it everywhere in agentic systems that work.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;File in Autoresearch&lt;/th&gt;
&lt;th&gt;Mutability&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sandbox&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;train.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;fully mutable&lt;/td&gt;
&lt;td&gt;the only place the agent can edit; bounds the blast radius&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Harness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;prepare.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;strictly immutable&lt;/td&gt;
&lt;td&gt;data prep + eval logic + tokenizer; the agent cannot rig its own grade&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Brief&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;program.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;human-maintained&lt;/td&gt;
&lt;td&gt;scope, files in/out, log format, success threshold, recovery protocol&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Karpathy &lt;a href=&quot;https://thenewstack.io/karpathy-autonomous-experiment-loop/&quot;&gt;calls &lt;code&gt;program.md&lt;/code&gt; a &quot;super-lightweight skill&quot;&lt;/a&gt; — the actual instruction manual the autonomous worker reads on every loop. It is not documentation wrapped around code. It is the control plane.&lt;/p&gt;
&lt;p&gt;==The durable artifact from an overnight run is not the diff the agent committed. It is the protocol that produced the diff.== Refining &lt;code&gt;program.md&lt;/code&gt; is where the human&apos;s judgment compounds.&lt;/p&gt;
&lt;p&gt;This is the founder skill being reframed in real time. The old job was &quot;recruit the right people.&quot; The new job is &quot;design the right bounding box.&quot; &lt;code&gt;val_bpb&lt;/code&gt; was not chosen casually. The five-minute clock was not chosen casually. The immutability of &lt;code&gt;prepare.py&lt;/code&gt; was not chosen casually. Each is a single design decision that quietly forecloses an entire category of agent misbehavior. That is the work. The agent does the experiments overnight. The human stays up engineering the metric.&lt;/p&gt;
&lt;p&gt;The same pattern shows up in production agentic engineering at scale. Stripe and Zapier did not get to thousand-PR weeks by writing better prompts. They got there by writing better Constraint Stacks — sandboxes, harnesses, and briefs that let stochastic systems behave deterministically inside well-defined cages.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;4. Two clocks, or the org becomes unusable&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/org-as-code-deck/slide-05.png&quot; alt=&quot;The latency survival rule — concentric clocks. Interactive (seconds to minutes): task execution, customer replies, deploy decisions. Background (minutes to hours): memory distillation, eval runs, budget analysis. Tying them together creates structural bottlenecks.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The single most useful thing I learned building agent-memory systems applies almost directly to organizations. ==Separate the interactive clock from the background clock. If you do not, the system becomes unusable.==&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Clock&lt;/th&gt;
&lt;th&gt;Latency budget&lt;/th&gt;
&lt;th&gt;What runs there&lt;/th&gt;
&lt;th&gt;Why it exists&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;interactive&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;seconds to minutes&lt;/td&gt;
&lt;td&gt;task execution, approvals, customer replies, deploy decisions&lt;/td&gt;
&lt;td&gt;the path humans feel directly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;background&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;minutes to hours&lt;/td&gt;
&lt;td&gt;memory distillation, embedding refresh, audit review, eval runs, simulations, budget analysis&lt;/td&gt;
&lt;td&gt;improves the org without blocking work&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The mistake is to put everything on the interactive clock. A customer should not wait twenty minutes because the support agent&apos;s memory layer is re-embedding the corpus. A GTM agent should not block on a quarterly simulation before sending one approved email. A coding agent should not wait on a company-wide governance report before opening a PR.&lt;/p&gt;
&lt;p&gt;The background clock is where the compounding happens. It is where the org notices that one workflow is burning budget, one department&apos;s memory is stale, one approval gate is pure ceremony, one agent keeps failing the same eval, one human has become the hidden bottleneck.&lt;/p&gt;
&lt;p&gt;Freshness and richness do not share a latency budget. I argued the same thing about &lt;a href=&quot;https://sharadja.in/blog/system-prompt-architecture-agentic-ai&quot;&gt;system-prompt architecture in The 14K Token Debt&lt;/a&gt; — what runs at boot vs. what runs in the loop is the most consequential decision in the stack. The org-level version is the same shape, one layer up.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;5. The Wrapper, the YAML, and the runtime&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/org-as-code-deck/slide-06.png&quot; alt=&quot;Anatomy of the repo — raw agents are a chaotic graph. The Wrapper turns them into a company: agents/ → LangGraph + MCP, departments/ → Temporal, policies/ → Open Policy Agent / Rego.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The first version of org.yaml will look almost disappointingly simple.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;org:
  name: acme-growth-lab
  mission: grow revenue for vertical SaaS using AI-native outbound
  budget:
    monthly_compute: 12000
    monthly_tools: 3000
    max_single_action_without_approval: 500
  models:
    primary: claude-opus-4-6-2026-02-15      # pinned. never &quot;latest&quot;.
    fallback: claude-sonnet-4-6-2026-01-08

departments:
  research:
    goal: identify high-intent accounts
    agents: [market-mapper, competitor-watcher, hiring-signal-scanner]
  gtm:
    goal: generate qualified pipeline
    agents: [list-builder, personalization-writer, sequence-operator]
  engineering:
    goal: maintain internal systems and customer automations
    agents: [integration-builder, qa-reviewer, deployment-operator]

governance:
  humans:
    - role: board
      can_pause_any_agent: true
      approves: [payments_over_500, production_deployments, outbound_over_1000_contacts]
  audit:
    log_every_tool_call: true
    retain_for_days: 365
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Today this would be a strategy doc. Tomorrow it is runtime configuration. A strategy doc describes what people hope the company does. A runtime contract constrains what the company can actually do.&lt;/p&gt;
&lt;p&gt;The hand-wavy part of every &quot;AI company in a box&quot; pitch is &lt;em&gt;how the YAML actually executes&lt;/em&gt;. The honest answer in 2026 is that the stack already exists and you can stand it up this quarter. &lt;a href=&quot;https://temporal.io/&quot;&gt;Temporal&lt;/a&gt; provides durable workflow execution that survives crashes and resumes mid-step — each department becomes a workflow that can sleep for days waiting for a human signal. &lt;a href=&quot;https://docs.langchain.com/oss/python/langgraph/persistence&quot;&gt;LangGraph&lt;/a&gt; holds the cyclic cognitive state per agent, checkpointed to Postgres, time-travel-debuggable. The &lt;a href=&quot;https://modelcontextprotocol.io/specification/2025-06-18/server/tools&quot;&gt;Model Context Protocol (MCP)&lt;/a&gt; standardizes how agents discover and call tools — every tool is an MCP server with a schema. &lt;a href=&quot;https://openpolicyagent.org/&quot;&gt;Open Policy Agent (OPA)&lt;/a&gt; enforces the governance layer as Rego rules at decision time.&lt;/p&gt;
&lt;p&gt;A real &lt;code&gt;policies/payments.rego&lt;/code&gt; looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;package org.payments

default allow = false

# allow agent-initiated payments under the auto-approve threshold
allow {
  input.tool == &quot;stripe.charge&quot;
  input.amount_usd &amp;lt;= data.org.budget.max_single_action_without_approval
  input.agent in data.departments.gtm.agents
}

# explicit deny — the org can never wire to a country we have no compliance for
deny[msg] {
  input.tool == &quot;wire.send&quot;
  not input.country in data.org.compliance.allowed_jurisdictions
  msg := sprintf(&quot;wire to %v denied: jurisdiction not in allowed set&quot;, [input.country])
}

# require human approval above the threshold
require_approval {
  input.tool == &quot;stripe.charge&quot;
  input.amount_usd &amp;gt; data.org.budget.max_single_action_without_approval
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That is one file. Read it. Notice what is happening. ==The business decision &quot;the GTM team can charge up to $500 without asking&quot; is now a four-line policy that runs on every tool call.== The auditor reads the same file the agent does. The board reads the same file. The regulator, eventually, reads the same file.&lt;/p&gt;
&lt;p&gt;The full operating bundle:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Directory&lt;/th&gt;
&lt;th&gt;What it owns&lt;/th&gt;
&lt;th&gt;Runtime backing&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;agents/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;model choices, tools, memories, permissions, schedules&lt;/td&gt;
&lt;td&gt;LangGraph + MCP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;departments/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;goals, queues, ownership, scorecards&lt;/td&gt;
&lt;td&gt;Temporal workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;policies/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;what is allowed, denied, requires approval&lt;/td&gt;
&lt;td&gt;OPA / Rego&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;playbooks/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;repeatable workflows: outbound, support, QA&lt;/td&gt;
&lt;td&gt;Temporal sub-workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;evals/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;tests for whether the org still behaves correctly&lt;/td&gt;
&lt;td&gt;scheduled background jobs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;budgets/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;token, cloud, tool, and payment limits&lt;/td&gt;
&lt;td&gt;OPA + budget tracker&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;escalation-rules/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;when to wake a human, which one, with what evidence&lt;/td&gt;
&lt;td&gt;Temporal Signals&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;simulations/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;sandbox runs before changing production behavior&lt;/td&gt;
&lt;td&gt;E2B / Daytona MicroVMs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;==Raw agents are cheap and chaotic. The value is &lt;strong&gt;The Wrapper&lt;/strong&gt; — the organizational shell of policies, budgets, approvals, memory boundaries, audit trails, and evals around them.== Without that wrapper you do not have a company. You have a pile of interns with root access.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;6. The five-department wallet&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/org-as-code-deck/slide-07.png&quot; alt=&quot;The 44:1 capital efficiency of the swarm — $223M of human-equivalent labor compressed into a $5M compute budget at $8.50 per Prompt Module.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The first useful unit is not &quot;create a unicorn in one click.&quot; That is the wrong fantasy. The useful unit is smaller: a &lt;strong&gt;deployable business capability&lt;/strong&gt;. A support desk. A research cell. A GTM motion. A code-migration squad. A grant-writing operation. A finance ops back office.&lt;/p&gt;
&lt;p&gt;Each one has the same shape — a mission, scoped tools, a memory boundary, an approval gate, an eval suite, an audit trail, and a budget. That bundle is the thing you fork.&lt;/p&gt;
&lt;p&gt;The unit economics are stark enough to be worth doing on a napkin. Run a single feature cycle on a frontier reasoning model — call it a &lt;strong&gt;Prompt Module&lt;/strong&gt; ($P_m$). Plan, fabricate, verify, correct, polish. ~1M input tokens (heavy context loading), ~20K output tokens, one or two retries. At Opus-class pricing of $5/$25 per million in/out tokens, a single $P_m$ runs you ~$5.68 in raw API spend. Apply a 1.5× risk multiplier for runs that loop, hallucinate dependencies, or get scrapped. Round to &lt;strong&gt;~$8.50 per Prompt Module&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;A $5M compute budget then buys roughly &lt;strong&gt;588,000 Prompt Modules&lt;/strong&gt; — autonomous feature cycles, each one a discrete unit of value. After a 40% quality discount (modules that get re-written or thrown away), you are still looking at ~1,116 elite-developer-years of equivalent throughput. Hiring those engineers at fully-loaded $200K would cost ~$223M. ==Capital efficiency ratio: ~44:1.==&lt;/p&gt;
&lt;p&gt;The challenge is no longer affording the work. The challenge is orchestrating the swarm. A &quot;deployable business capability&quot; turns out to be a budget waterfall:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/org-as-code-deck/slide-08.png&quot; alt=&quot;The Five-Department Wallet — Fabrication 50%, Sentinel 20%, Signal 15%, Genesis 10%, Foundry 5%. If the QA budget feels too high, you have not been bitten yet.&quot; /&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Department&lt;/th&gt;
&lt;th&gt;% of budget&lt;/th&gt;
&lt;th&gt;Modules&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Genesis&lt;/strong&gt; (research, validation, idea-maze)&lt;/td&gt;
&lt;td&gt;10%&lt;/td&gt;
&lt;td&gt;~58k&lt;/td&gt;
&lt;td&gt;synthetic personas, competitor recon, gap analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fabrication&lt;/strong&gt; (core build)&lt;/td&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;td&gt;~294k&lt;/td&gt;
&lt;td&gt;Architect agent + DB / Frontend / Backend specialists&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sentinel&lt;/strong&gt; (QA, red-team, security)&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;~118k&lt;/td&gt;
&lt;td&gt;adversarial agents, TDD enforcement, self-healing repair loops with budget caps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Signal&lt;/strong&gt; (growth, content, SDR)&lt;/td&gt;
&lt;td&gt;15%&lt;/td&gt;
&lt;td&gt;~88k&lt;/td&gt;
&lt;td&gt;programmatic SEO, hyper-personalized outreach&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Foundry&lt;/strong&gt; (orchestration, manager-agents)&lt;/td&gt;
&lt;td&gt;5%&lt;/td&gt;
&lt;td&gt;~29k&lt;/td&gt;
&lt;td&gt;manager agents that monitor swarm health, kill zombie tasks, refactor prompts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Two non-obvious lessons in that table. Twenty cents of every dollar goes to the immune system. ==If the QA budget feels too high, you have not been bitten yet.== And the smallest line — &quot;manager agents&quot; — is the one that lets the other 95% behave at all. Without it, the Token Snowball will eat the rest. An agent stuck in a &quot;fix this dependency&quot; loop can burn 1M tokens every ten minutes.&lt;/p&gt;
&lt;p&gt;This is also where the pricing model of compute makes a quietly important shift. As the codebase grows, $P_m$ cost drifts upward — $3 in month one, $6 in month three, $12 in month six — purely because the context loaded per call gets denser. The Foundry budget is what pays for the &quot;Refactor Agents&quot; that run continuously to keep the context weight manageable. Run an org without that line item and you watch your unit economics rot in slow motion.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;7. What forks cleanly. What doesn&apos;t.&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/org-as-code-deck/slide-09.png&quot; alt=&quot;The limits of the copy-paste company — pure config (policies, budgets, escalation rules, playbook structures) forks cleanly; memory, vendor connectors, regulatory posture, and brand are sticky.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Forkability is the main event. When software became forkable, experimentation exploded. When organizations become forkable, business-model search changes — fork the same agency into healthcare, logistics, legal, insurance; branch a GTM motion to test two sales plays; fork into a new geography by swapping data residency, payment rails, language, local vendors.&lt;/p&gt;
&lt;p&gt;But the breezy &quot;fork four times&quot; pitch hides the part that breaks.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Artifact&lt;/th&gt;
&lt;th&gt;Forks cleanly?&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Policies, budgets, escalation rules&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;pure config; no domain entanglement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Playbooks, workflows, eval suites&lt;/td&gt;
&lt;td&gt;mostly&lt;/td&gt;
&lt;td&gt;slot in new tools; structure stays&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent definitions (model, tools, prompt)&lt;/td&gt;
&lt;td&gt;mostly&lt;/td&gt;
&lt;td&gt;re-grounding the prompt is the work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory and embeddings&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;td&gt;domain-, customer-, and incident-specific&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vendor connectors, integrations&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;td&gt;each market has different APIs and contracts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compliance / regulatory posture&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;td&gt;jurisdiction-bound; not a config swap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Customer trust and brand&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;td&gt;not in the repo&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The clean half is the unlock. The sticky half is what makes the founder&apos;s job real. ==Forkability is partial by design — and the partial-forkability is precisely what stops the spam version from being trivial to spin up.== If everything forked cleanly, the world would already be on fire.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;8. The Reverse Layoff&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/org-as-code-deck/slide-10.png&quot; alt=&quot;The friction — undocumented entanglement (macro: agents hit edge cases → workflows degrade → companies rehire) and Cognitive Debt (micro: MIT 2026 fMRI shows reduced activation in deep reasoning regions when engineers outsource taste to agents).&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Anyone selling you the clean version of this story is not paying attention.&lt;/p&gt;
&lt;p&gt;In early 2026 the market ran the experiment for us. A wave of companies, intoxicated by the leverage, &lt;a href=&quot;https://quasa.io/media/oops-what-happened-the-great-ai-reverse-layoff-is-underway&quot;&gt;laid off significant portions of their engineering and operations teams and replaced them with agent swarms&lt;/a&gt;. Then the agents hit the cases nobody had documented. Financial reports broke in inexplicable ways. Production updates failed silently. Workflows degraded under load no one had simulated. The agents could not debug undocumented legacy entanglement without human intuition acting as glue.&lt;/p&gt;
&lt;p&gt;Within months the same companies were rehiring those same engineers at premium rates — the &quot;boomerang employee&quot; pattern, now a category. Institutional knowledge turned out to be a largely un-computable variable. ==The lesson is not &quot;AI doesn&apos;t work.&quot; The lesson is that the wrapper takes longer to build than the layoff press release.==&lt;/p&gt;
&lt;p&gt;The smaller-scale version of the same lesson lives inside Karpathy&apos;s eight-agent simulation. Recall the parable from §2: an agent reported it had &quot;discovered&quot; that bigger hidden size lowered loss. That is what happens in microcosm when an org has agents but no scientific taste. They will produce results that pass every test you wrote and tell you nothing new about the world. The Reverse Layoff is the macro version of &quot;discovered hidden size.&quot;&lt;/p&gt;
&lt;p&gt;There is a quieter, longer-acting failure mode running in parallel. An &lt;a href=&quot;https://quasa.io/media/mit-study-reveals-cognitive-debt-how-over-reliance-on-ai-weakens-independent-thinking&quot;&gt;MIT 2026 study&lt;/a&gt; identified what the field is now calling &lt;strong&gt;Cognitive Debt&lt;/strong&gt; — measurable atrophy in independent analytical capacity among engineers who outsource the painful part of problem-solving to agents. EEG and fMRI work in the cohort showed reduced activation in the regions associated with deep, sustained reasoning. Output rises; the muscle that produces taste declines. The companies that survive this decade will be the ones that protect the human cognition they need to keep designing the bounding box.&lt;/p&gt;
&lt;p&gt;A related collapse is happening in measurement itself. The DORA framework — deployment frequency, lead time, MTTR, change failure rate — was the canonical way to measure engineering health for a decade. &lt;a href=&quot;https://byteiota.com/developer-productivity-metrics-fail-66-dont-trust-them/&quot;&gt;66% of developers no longer trust those dashboards&lt;/a&gt;. Deployment frequency means nothing when an agent ships dozens of commits an hour. The metrics will follow the unit of work, and the unit is shifting from &quot;the team&quot; to &quot;the bundle.&quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;9. Due diligence becomes code review&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/org-as-code-deck/slide-12.png&quot; alt=&quot;The Clone Test — a 0–100% scale. 30%: you are running a story about a few people holding broken processes in their heads. 90%: you are running infrastructure. Acquirers now review permissions, memory provenance, evals, and audit logs over culture decks.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;If a company is an executable bundle, acquiring one means inspecting the bundle. Not just the financials. The operating code.&lt;/p&gt;
&lt;p&gt;You would review:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;permissions&lt;/strong&gt; — which agents can touch money, production, customer data, outbound, legal docs?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;memory&lt;/strong&gt; — what does the org believe, who wrote it, when was it last verified?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;workflows&lt;/strong&gt; — which queues drive revenue, support, finance, shipping?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;evals&lt;/strong&gt; — what tests prove the org still works?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;budgets&lt;/strong&gt; — where does compute spend go, and which loops can run away?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;audit logs&lt;/strong&gt; — can you replay why a decision happened?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;human dependencies&lt;/strong&gt; — which workflows fail if one person leaves?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That last one is the killer. A lot of companies are not really companies. They are a few heroic people holding a pile of broken processes in their heads. Organizations as code makes that visible.&lt;/p&gt;
&lt;p&gt;It also makes the inverse visible. A small team with a clean operating repo, sharp evals, narrow permissions, clear memory boundaries, and repeatable workflows may be much more valuable than its headcount suggests.&lt;/p&gt;
&lt;p&gt;Revenue Per Employee was the first AI-native metric. ==&lt;strong&gt;The Clone Test&lt;/strong&gt; is the better one: how much of your business survives if you delete every human and rebuild from the repo alone?== If 90% does, you are running infrastructure. If 30% does, you are running a story about a few people. The Clone Test is brutal because it does not care how good your culture deck is. It asks one question — &lt;em&gt;is the company in the repo, or in your head?&lt;/em&gt; — and the answer determines what an acquirer is actually buying.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;10. Governance is the product&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/org-as-code-deck/slide-13.png&quot; alt=&quot;Governance is not a chapter at the end. It is the product. Org provenance — Template Origin, Jurisdiction, Human Accountability, Pinned Models. GAIA-X (€20–30B/yr) and US NSF AI Institutes prove the infrastructure decisions are now political.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;A clonable organization is powerful in the same way a botnet is powerful. If you can spin up a useful org, you can spin up a harmful one. Spam companies. Scam companies. Automated litigation mills. Fake-media networks. Synthetic political operations. Companies with no moral center because nobody inside them feels responsible.&lt;/p&gt;
&lt;p&gt;This is why governance is not a chapter at the end. ==It is the product.==&lt;/p&gt;
&lt;p&gt;The hard questions arrive immediately:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Who owns an agent&apos;s actions?&lt;/li&gt;
&lt;li&gt;Who can audit a forked organization?&lt;/li&gt;
&lt;li&gt;Can a regulator inspect org code?&lt;/li&gt;
&lt;li&gt;Can customers know whether they are dealing with a human, an agent, or a company with one human and ten thousand agents?&lt;/li&gt;
&lt;li&gt;Can payment networks, cloud providers, and model providers enforce identity at the &lt;em&gt;organizational&lt;/em&gt; level?&lt;/li&gt;
&lt;li&gt;Can an org be rate-limited?&lt;/li&gt;
&lt;li&gt;Can it be recalled?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Agent identity is not enough. The org needs identity. Not just an account — &lt;em&gt;provenance&lt;/em&gt;. Who created this org. What template it was forked from. What permissions it has. Which humans are accountable. Which jurisdictions it operates in. Which models and pinned versions it uses. What changed between this version and the last one.&lt;/p&gt;
&lt;p&gt;Without that, organizations as code becomes organizations as malware.&lt;/p&gt;
&lt;p&gt;The geopolitical layer is already moving. The EU&apos;s &lt;a href=&quot;https://quasa.io/media/europe-s-quest-for-a-sovereign-tech-stack-breaking-free-from-us-dominance&quot;&gt;GAIA-X effort&lt;/a&gt; is accelerating a federated sovereign tech stack at a rumored €20–30B per year, on the explicit thesis that European org code should not be bound to American hyperscalers. The U.S. is doubling down via NSF AI Institutes and an aggressive industrial policy on agentic workforce development. There will be no neutral place to host an org repo. The infrastructure decisions are political decisions now.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;11. Disposable coordination — and what you can ship Monday&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/org-as-code-deck/slide-14.png&quot; alt=&quot;The minimum viable repo — six cards: one program.md, one immutable harness, one mutable sandbox, one pinned agent, one scoped repo, one scalar metric. Tuning the gap between what you wanted and what you got is the founder skill of the next decade.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The cloud gave us disposable infrastructure. AI agents give us disposable labor. ==Organizations as code give us disposable coordination.==&lt;/p&gt;
&lt;p&gt;That is the unlock.&lt;/p&gt;
&lt;p&gt;Once coordination becomes cheap, we will see many more tiny companies, temporary companies, single-purpose companies, forked companies, simulated companies, companies that look more like open-source projects than corporations. Some will be scams. Some will be toys. Some will be terrifying. Some will be beautiful. The direction is clear.&lt;/p&gt;
&lt;p&gt;First we made servers programmable. Then we made workers programmable. Now we are making the organization programmable.&lt;/p&gt;
&lt;p&gt;You do not need to wait for the platform. The smallest org-as-code artifact you can ship this week is exactly the one Karpathy shipped in March: &lt;strong&gt;one &lt;code&gt;program.md&lt;/code&gt;, one immutable harness, one mutable sandbox, one agent, one scoped repo, one scalar metric&lt;/strong&gt;. Pick a single workflow inside your team — a weekly competitive scan, a triage queue, an SDR sequence, a deploy verification loop. Write the brief. Pin the model version. Cap the budget. Add a single OPA-style rule for the one action that requires a human.&lt;/p&gt;
&lt;p&gt;A starter &lt;code&gt;program.md&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# program.md — competitor-watcher v0.1

## Mission
Surface every public competitor product/pricing/comms change in the last 7 days.

## Scope
- READ:  competitor URLs in /sources/competitors.yaml
- WRITE: append to /reports/weekly-YYYY-WW.md only
- NEVER: edit /sources, /policies, /evals; never make outbound contact

## Workflow
1. Fetch each source URL. Diff against /cache/last-snapshot/.
2. For each diff, classify: pricing | feature | positioning | hiring | other.
3. Write one bullet per change with: source URL, date observed, 1-line summary.
4. If &amp;gt;25 bullets in a single week, escalate (see Recovery).
5. Update /cache/last-snapshot/ on success.

## Success metric
- scalar: change_recall_score = (bullets_human_marked_useful) / (total_bullets)
- target: &amp;gt;= 0.6 over a 4-week trailing window.
- if score drops below 0.4, halt and request human review.

## Budget
- 200K input tokens / 20K output tokens per run, hard cap.
- Model: claude-sonnet-4-6-2026-01-08 (pinned; never &quot;latest&quot;).

## Recovery
- On HTTP 4xx/5xx for a source: log, skip, continue.
- On parse failure: write raw HTML to /errors/, do not invent content.
- On budget overrun: halt, write /errors/budget-exceeded.md, exit 2.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run it nightly. Read the report every morning for two weeks. Tune &lt;code&gt;program.md&lt;/code&gt; based on what you actually wanted vs. what you got. ==That tuning is the founder skill of the next decade in miniature.== The rest of this post is what happens when the same loop is applied to engineering, GTM, finance, support, and compliance — at the same time, with budgets and policies wired together, on the same substrate.&lt;/p&gt;
&lt;p&gt;The next five years are about learning how to spin up a company without losing the ability to ask whether it should exist.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Where this breaks&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/org-as-code-deck/slide-11.png&quot; alt=&quot;Org-level runtime exceptions — Empty Wrapper, Memory Pollution, Transferability Cliff, Silent Model Drift, Token Snowball. Each with its specific antidote.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Five failure modes worth naming honestly.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[!WARNING]
&lt;strong&gt;The empty-wrapper trap.&lt;/strong&gt; A repo full of &lt;code&gt;policies/&lt;/code&gt;, &lt;code&gt;evals/&lt;/code&gt;, and &lt;code&gt;audit-logs/&lt;/code&gt; is not a company. Without a real workflow producing real value, the wrapper is theater. The Reverse Layoff happened to companies that built the wrapper before they had earned the right to compress the workforce inside it. Build the smallest valuable workflow first; wrap second.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;[!CAUTION]
&lt;strong&gt;Memory pollution as organizational hallucination.&lt;/strong&gt; The most insidious failure of org-as-code is when stale, wrong, or agent-generated memories enter the trusted memory layer and become structural precedent for future decisions. By iteration N+10, the org is making decisions on its own past hallucinations. Memory needs aggressive provenance, expiry, and pruning — closer to a database garbage collector than a knowledge base.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;[!NOTE]
&lt;strong&gt;Semantic transferability cliff.&lt;/strong&gt; A heavily optimized org repo encodes the founder&apos;s biases, blind spots, and aesthetic preferences. Forking it laterally — handing it to another founder, another vertical, another culture — guarantees friction. Distinguishing absolute organizational primitives from idiosyncratic founder preferences is the next hard problem in this space. We do not yet have a clean answer.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;[!IMPORTANT]
&lt;strong&gt;Silent model drift breaks the org overnight.&lt;/strong&gt; Cloud providers ship undocumented model updates that change behavior on standard benchmarks — increased hallucination rates, exacerbated multi-file laziness, sometimes both. An autonomous loop running unattended cannot tell you that yesterday&apos;s &lt;code&gt;claude-opus-latest&lt;/code&gt; is not today&apos;s. Pin specific immutable model versions at the org-config level. Treat &lt;code&gt;latest&lt;/code&gt; as a synonym for &quot;production may regress without warning.&quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;[!DANGER]
&lt;strong&gt;The Token Snowball.&lt;/strong&gt; A single agent stuck in a &quot;fix this dependency&quot; loop can burn 1M tokens every ten minutes. Without per-task budget caps, circuit breakers (kill any task &amp;gt;5 retries), and a manager-agent watching for zombie loops, the unit economics in §6 collapse from 44:1 to negative within a week. Budget Trackers are not optional infrastructure. They are the foundation everything else sits on.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;Sharad Jain builds agentic AI pipelines in Bengaluru. He previously engineered core data infrastructure at Meta and is the founder of &lt;a href=&quot;https://autoscreen.ai&quot;&gt;autoscreen.ai&lt;/a&gt;, a production voice-AI platform. This post is part of a series on agentic AI infrastructure — see &lt;a href=&quot;https://sharadja.in/blog/system-prompt-architecture-agentic-ai&quot;&gt;The 14K Token Debt&lt;/a&gt; on system-prompt architecture and &lt;a href=&quot;https://sharadja.in/blog/terminal-first-agent-harness-unix-ai-agents&quot;&gt;The Terminal Was the First Agent Harness&lt;/a&gt; on Unix primitives as agent patterns.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Research &amp;amp; Footnotes:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Karpathy: &lt;a href=&quot;https://github.com/karpathy/autoresearch&quot;&gt;&lt;code&gt;autoresearch&lt;/code&gt;&lt;/a&gt; — 630-line autonomous ML research loop (March 2026)&lt;/li&gt;
&lt;li&gt;The New Stack: &lt;a href=&quot;https://thenewstack.io/karpathy-autonomous-experiment-loop/&quot;&gt;Karpathy&apos;s 630-line script ran 50 experiments overnight&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;The New Stack: &lt;a href=&quot;https://thenewstack.io/vibe-coding-is-passe/&quot;&gt;Vibe coding is passé. Karpathy has a new name for the future of software.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Quasa Connect: &lt;a href=&quot;https://quasa.io/media/karpathy-s-experiment-assembling-an-ai-research-team-highlights-limitations-and-ushers-in-org-engineering&quot;&gt;Karpathy&apos;s experiment assembling an AI research team — &apos;Org Engineering&apos;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Quasa Connect: &lt;a href=&quot;https://quasa.io/media/oops-what-happened-the-great-ai-reverse-layoff-is-underway&quot;&gt;The Great AI Reverse Layoff&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Quasa Connect: &lt;a href=&quot;https://quasa.io/media/mit-study-reveals-cognitive-debt-how-over-reliance-on-ai-weakens-independent-thinking&quot;&gt;MIT Study Reveals &apos;Cognitive Debt&apos;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;byteiota: &lt;a href=&quot;https://byteiota.com/developer-productivity-metrics-fail-66-dont-trust-them/&quot;&gt;Developer Productivity Metrics Fail: 66% Don&apos;t Trust Them&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;NxCode: &lt;a href=&quot;https://www.nxcode.io/resources/news/agentic-engineering-complete-guide-vibe-coding-ai-agents-2026&quot;&gt;Agentic Engineering: Complete Guide (2026)&lt;/a&gt; — Stripe 1,000 PRs/week, Zapier 89% adoption, AI-native RPE&lt;/li&gt;
&lt;li&gt;O-mega: &lt;a href=&quot;https://o-mega.ai/articles/karpathy-autoresearch-complete-2026-guide&quot;&gt;Karpathy Autoresearch Complete 2026 Guide&lt;/a&gt; — Lütke / Shopify 19% improvement&lt;/li&gt;
&lt;li&gt;Anthropic: &lt;a href=&quot;https://modelcontextprotocol.io/specification/2025-06-18/server/tools&quot;&gt;Model Context Protocol specification&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Open Policy Agent: &lt;a href=&quot;https://openpolicyagent.org/&quot;&gt;OPA documentation&lt;/a&gt; — policy-as-code engine, Rego language&lt;/li&gt;
&lt;li&gt;Temporal: &lt;a href=&quot;https://temporal.io/&quot;&gt;durable execution platform&lt;/a&gt; — workflow indestructibility, signals, long-running coordination&lt;/li&gt;
&lt;li&gt;LangChain: &lt;a href=&quot;https://docs.langchain.com/oss/python/langgraph/persistence&quot;&gt;LangGraph persistence&lt;/a&gt; — checkpointed cyclic graphs for agent cognition&lt;/li&gt;
&lt;li&gt;E2B: &lt;a href=&quot;https://e2b.dev/&quot;&gt;secure code-execution sandboxes&lt;/a&gt; — Firecracker microVMs for ephemeral agent execution&lt;/li&gt;
&lt;li&gt;CIO: &lt;a href=&quot;https://www.cio.com/article/4106863/google-unveils-budget-tracker-and-bats-framework-to-rein-in-ai-agent-costs.html&quot;&gt;Google&apos;s Budget Tracker / BATS framework for agent cost containment&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Quasa Connect: &lt;a href=&quot;https://quasa.io/media/europe-s-quest-for-a-sovereign-tech-stack-breaking-free-from-us-dominance&quot;&gt;Europe&apos;s GAIA-X sovereign tech stack&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;arXiv: &lt;a href=&quot;https://arxiv.org/html/2508.01780v1&quot;&gt;LiveMCPBench — agents in dynamic MCP tool environments&lt;/a&gt; — 78.95% success in tool-rich envs vs static baselines&lt;/li&gt;
&lt;li&gt;arXiv: &lt;a href=&quot;https://arxiv.org/html/2508.00271v1&quot;&gt;MetaAgent — self-evolving agents via tool meta-learning&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><enclosure url="https://sharadja.in/images/blogs/org-as-code-deck/slide-01.png" type="image/png" length="0"/><category>AI</category><category>agents</category><category>organizations</category><category>startups</category><category>future-of-work</category><category>automation</category><category>agentic-ai</category><category>infrastructure</category><category>governance</category><category>karpathy</category><category>autoresearch</category><category>opa</category><category>mcp</category><category>temporal</category><author>Sharad Jain</author></item><item><title>I Got 91.55% on LongMemEval. Here&apos;s Why I&apos;m Building a Different Benchmark.</title><link>https://sharadja.in/blog/owned-corpus-memory-benchmark/</link><guid isPermaLink="true">https://sharadja.in/blog/owned-corpus-memory-benchmark/</guid><description>Brain ported LongMemEval and beat the paper&apos;s BM25 baseline. The number was real. The benchmark was the wrong shape for what I actually wanted to know — whether a memory layer over my own corpus answers my own questions. Generic memory benchmarks won&apos;t tell you that. So you build the benchmark you need.</description><pubDate>Fri, 01 May 2026 18:00:00 GMT</pubDate><content:encoded>&lt;h1&gt;91.55% on LongMemEval, and the Benchmark I&apos;m Building Instead&lt;/h1&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/owned-corpus-deck/slide-01.png&quot; alt=&quot;91.55% recall@5 — I beat the AI memory benchmark. Here is why it didn&apos;t matter.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Two days ago I &lt;a href=&quot;https://sharadja.in/blog/brain-longmemeval-sonnet-benchmark&quot;&gt;published Brain&apos;s LongMemEval result&lt;/a&gt;: &lt;strong&gt;408/500 = 81.60%&lt;/strong&gt; end-to-end QA, with &lt;strong&gt;recall@5 = 91.55%&lt;/strong&gt; over 470 non-abstention questions, using nothing but stdlib BM25 and a Claude Sonnet reader. That run beat the paper&apos;s lexical baseline, sat in the middle of the public field, and gave me the cleanest number I am willing to defend.&lt;/p&gt;
&lt;p&gt;It also gave me a problem I did not expect.&lt;/p&gt;
&lt;p&gt;The number was real. The benchmark was the wrong shape.&lt;/p&gt;
&lt;p&gt;==The benchmark I needed was the one that would answer a different question: not &quot;how does Brain rank against vendor leaderboards?&quot; but &quot;does Brain find the right document when I, the user of this corpus, ask the things I actually ask?&quot;== Generic memory benchmarks cannot answer that. They are not built to. So you build the one that can.&lt;/p&gt;
&lt;p&gt;This post is about that pivot. It is post #4 in a series — after &lt;a href=&quot;https://sharadja.in/blog/system-prompt-architecture-agentic-ai&quot;&gt;The 14K Token Debt&lt;/a&gt;, &lt;a href=&quot;https://sharadja.in/blog/benchmarking-long-term-ai-agent-memory&quot;&gt;Benchmarking Long-Term AI Agent Memory&lt;/a&gt;, and the &lt;a href=&quot;https://sharadja.in/blog/brain-longmemeval-sonnet-benchmark&quot;&gt;LongMemEval run itself&lt;/a&gt;. The earlier posts argued for the architecture and proved a credible baseline. This one argues for the benchmark you should build &lt;em&gt;next&lt;/em&gt;, after the leaderboard number is out of your system.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Number That Didn&apos;t Matter&lt;/h2&gt;
&lt;p&gt;The clean LongMemEval result was honestly satisfying. 91.55% recall@5 is high. 81.60% end-to-end is in band with &lt;a href=&quot;https://supermemory.ai/research/&quot;&gt;Supermemory&apos;s gpt-4o row&lt;/a&gt; on the same dataset. The harness is real, runnable, resumable, and committed at &lt;code&gt;bench/longmemeval/&lt;/code&gt; in the Brain repo.&lt;/p&gt;
&lt;p&gt;But sit with that result for a day, and it stops being prescriptive.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The number does not tell me whether Brain is good for &lt;em&gt;me&lt;/em&gt;.&lt;/strong&gt; Brain is supposed to find a decision I made three months ago about a Salesforce schema, the exact &lt;code&gt;pg_dump&lt;/code&gt; flags I worked out for one client, the rejected prompt variant from a session I cannot otherwise locate. None of those questions exist in LongMemEval. They cannot. LongMemEval was built to compare memory systems against each other on a shared, neutral, &lt;em&gt;synthetic&lt;/em&gt; substrate. It was not built to tell you whether your specific deployment helps you.&lt;/p&gt;
&lt;p&gt;Two more things hardened the gap once I started looking:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The dataset is chat-shaped.&lt;/strong&gt; LongMemEval evidence sessions are LLM-simulated dialogues — Llama-3-70B self-chats with human edits, padded with &lt;a href=&quot;https://arxiv.org/abs/2410.10813&quot;&gt;ShareGPT and UltraChat fillers&lt;/a&gt;. My corpus is not chat-shaped. It is markdown notes, code repos, browsing trails, distilled session artifacts, raw transcripts. The benchmark I just beat does not care.
&lt;img src=&quot;https://sharadja.in/images/blogs/owned-corpus-deck/slide-02.png&quot; alt=&quot;The Illusion of the Leaderboard — generic benchmarks measure leaderboard clout, not product utility&quot; /&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The leaderboard is contaminated.&lt;/strong&gt; Penfield Labs &lt;a href=&quot;https://dev.to/penfieldlabs/we-audited-locomo-64-of-the-answer-key-is-wrong-and-the-judge-accepts-up-to-63-of-intentionally-33lg&quot;&gt;audited LoCoMo in November 2025&lt;/a&gt; — the other dataset every memory vendor benchmarks on — and found 6.4% of the answer key wrong. The GPT-4o-mini judge accepted 62.81% of intentionally wrong answers. Mem0 &lt;a href=&quot;https://www.letta.com/blog/benchmarking-ai-agent-memory&quot;&gt;misconfigured Zep and Letta in its arXiv paper&lt;/a&gt;; both vendors re-ran corrected configs and scored ~10% above Mem0&apos;s own best. Zep&apos;s own 84% LoCoMo claim was a &lt;a href=&quot;https://github.com/getzep/zep-papers/issues/5&quot;&gt;numerator/denominator error&lt;/a&gt;, corrected to 58.44%, then re-defended at 75.14%.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is the field every memory layer is competing on. I just paid the price of admission. I am no longer convinced admission was the right purchase.&lt;/p&gt;
&lt;p&gt;I call this &lt;strong&gt;the Owned-Corpus Test&lt;/strong&gt;: would my version of this system, on my data, answer the questions I actually ask? If a benchmark cannot answer that, it is a useful comparison — not your benchmark. LongMemEval is a useful comparison. It is not my benchmark.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Why Generic Memory Benchmarks Won&apos;t Tell You What You Need to Know&lt;/h2&gt;
&lt;p&gt;There is a structural reason this is not just about LongMemEval.&lt;/p&gt;
&lt;h3&gt;What the public benchmarks actually test&lt;/h3&gt;
&lt;p&gt;LongMemEval has 7 question types across 5 abilities, three variants (oracle / &lt;code&gt;_s&lt;/code&gt; ~115k tokens / &lt;code&gt;_m&lt;/code&gt; ~500 sessions), 500 questions total, evaluated with a GPT-4o judge against verbatim reference answers (&lt;a href=&quot;https://arxiv.org/abs/2410.10813&quot;&gt;paper&lt;/a&gt;). The haystack is fully synthetic. Filler sessions come from public dialogue datasets. &lt;strong&gt;Verbatim recall&lt;/strong&gt; is the dominant strategy: &lt;a href=&quot;https://github.com/MemPalace/mempalace/issues/314&quot;&gt;MemPalace #314&lt;/a&gt; documents that raw retrieval gets ~96% recall@5 on the dataset because the answers are literally in the haystack — you store everything, you find it.&lt;/p&gt;
&lt;p&gt;LoCoMo is the other shared playing field. Ten very-long multi-session dialogues, persona-and-event grounded, multimodal. Same shape, longer history, the same audited cracks. Vendors love it because the conversations fit in modern context windows, which means the dataset is partially answerable &lt;em&gt;without&lt;/em&gt; memory. &lt;a href=&quot;https://vectorize.io/articles/mempalace-benchmarks&quot;&gt;Vectorize&apos;s analysis of MemPalace&apos;s benchmarks&lt;/a&gt; shows full-context GPT-4o (~73%) outscoring Mem0 (~68%) on LoCoMo.&lt;/p&gt;
&lt;p&gt;==Memory layers are benchmarking on tasks where memory is optional.==&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;https://mastra.ai/research/observational-memory&quot;&gt;Mastra observational-memory leaderboard&lt;/a&gt; for LongMemEval has Mastra at 94.87%, Vectorize/Hindsight at 91.40%, Emergence AI at 86.00%, Supermemory at 85.20%, Brain at 81.60%, RetainDB at 79.00%, Zep at 71.20%. Different reader models. Different judges. Some are research configurations the authors say are not reproducible. Some are vendor self-reports. Treating any of this as a tournament is sloppy. I &lt;a href=&quot;https://sharadja.in/blog/brain-longmemeval-sonnet-benchmark#where-brain-lands-in-the-public-field&quot;&gt;said as much&lt;/a&gt; about my own row.&lt;/p&gt;
&lt;h3&gt;The shape mismatch is the killer point&lt;/h3&gt;
&lt;p&gt;Even if the leaderboard were clean, it would be the wrong shape for personal corpora.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/owned-corpus-deck/slide-03.png&quot; alt=&quot;The Shape Mismatch — chat-shaped benchmark assumptions vs knowledge-shaped owned-corpus reality&quot; /&gt;&lt;/p&gt;
&lt;p&gt;I went through every question type in LongMemEval and tried to map it onto my corpus. The mapping breaks for at least four of seven. &lt;em&gt;Single-session-assistant&lt;/em&gt; assumes the assistant said the answer; in a notes corpus there is no assistant. &lt;em&gt;Single-session-preference&lt;/em&gt; assumes a preference is stated as a chat utterance; in a real workflow, preference is implicit in repeated behavior. &lt;em&gt;Multi-session reasoning&lt;/em&gt; over chat is a different shape than synthesis across notes, code blame, and Slack threads. The benchmark is testing a chat-shaped world. I do not live in one.&lt;/p&gt;
&lt;p&gt;I also looked at every alternative. &lt;a href=&quot;https://snap-research.github.io/locomo/&quot;&gt;LoCoMo&lt;/a&gt;, &lt;a href=&quot;https://aclanthology.org/2025.findings-acl.989.pdf&quot;&gt;MemBench&lt;/a&gt;, &lt;a href=&quot;https://github.com/HUST-AI-HYZ/MemoryAgentBench&quot;&gt;MemoryAgentBench&lt;/a&gt;, &lt;a href=&quot;https://arxiv.org/pdf/2510.17281&quot;&gt;MemoryBench&lt;/a&gt;, &lt;a href=&quot;https://arxiv.org/pdf/2510.01353&quot;&gt;MEMTRACK&lt;/a&gt;. All chat or dialogue-shaped. The closest thing to a personal-corpus benchmark is &lt;a href=&quot;https://arxiv.org/html/2505.00263v1&quot;&gt;EnronQA&lt;/a&gt; — 528K QA pairs over 150 historical email inboxes — and even that is academic, narrow, and leans on a corpus nobody currently uses.&lt;/p&gt;
&lt;p&gt;==No public benchmark evaluates retrieval over a user&apos;s own markdown / notes / code corpus with the user&apos;s own questions. The absence is the evidence.==&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Voice vs Tool Gap&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/owned-corpus-deck/slide-04.png&quot; alt=&quot;The Voice vs Tool Gap — Jason Liu and Hamel Husain quotes; tooling for enterprise datasets exists, the opinionated personal-corpus harness does not&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Here is what makes this awkward: the field already knows.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/evals-faq/how-should-i-approach-evaluating-my-rag-system.html&quot;&gt;Hamel Husain&lt;/a&gt; has been the loudest voice on this for two years: &quot;Generic metrics off-the-shelf… are generic enough to be useless for diagnosing your application&apos;s failures.&quot; His workflow is corpus → facts → questions those facts answer. &lt;a href=&quot;https://jxnl.co/writing/2025/09/11/stop-trusting-mteb-rankings-kelly-hong-chroma/&quot;&gt;Jason Liu&apos;s writeup&lt;/a&gt; of Kelly Hong&apos;s Chroma talk is titled, almost literally, &lt;em&gt;stop trusting MTEB rankings&lt;/em&gt;. &lt;a href=&quot;https://www.sh-reya.com/papers/&quot;&gt;Shreya Shankar&lt;/a&gt; ships &lt;a href=&quot;https://arxiv.org/abs/2404.12272&quot;&gt;EvalGen&lt;/a&gt; and SPADE specifically because generic LLM-judges miss application-specific criteria. &lt;a href=&quot;https://www.llamaindex.ai/blog/introducing-llama-datasets-aadb9994ad9e&quot;&gt;Jerry Liu&lt;/a&gt; ships Llama Datasets so you can benchmark on your own docs in a few lines.&lt;/p&gt;
&lt;p&gt;Even the labs say it. Anthropic&apos;s &lt;a href=&quot;https://platform.claude.com/cookbook/capabilities-contextual-embeddings-guide&quot;&gt;Contextual Retrieval cookbook&lt;/a&gt; ships a &lt;strong&gt;synthetic 100-sample evaluation set generated from your own docs&lt;/strong&gt; as the canonical recipe. &lt;a href=&quot;https://cookbook.openai.com/examples/evaluation/evaluate_rag_with_llamaindex&quot;&gt;OpenAI&apos;s RAG-evaluation cookbook&lt;/a&gt; recommends &lt;code&gt;generate_question_context_pairs&lt;/code&gt; from LlamaIndex — same pattern.&lt;/p&gt;
&lt;p&gt;So the message is loud. What about the tool? &lt;a href=&quot;https://docs.ragas.io/en/stable/concepts/test_data_generation/rag/&quot;&gt;RAGAS&lt;/a&gt;, &lt;a href=&quot;https://www.llamaindex.ai/blog/introducing-llama-datasets-aadb9994ad9e&quot;&gt;LlamaIndex&lt;/a&gt;, &lt;a href=&quot;https://www.trulens.org/getting_started/quickstarts/groundtruth_evals_for_retrieval_systems/&quot;&gt;TruLens&lt;/a&gt;, &lt;a href=&quot;https://arize.com/docs/phoenix/cookbook/evaluate-rag&quot;&gt;Phoenix / Arize&lt;/a&gt;, &lt;a href=&quot;https://deepeval.com/guides/guides-rag-evaluation&quot;&gt;DeepEval&lt;/a&gt;, &lt;a href=&quot;https://www.promptfoo.dev/docs/guides/evaluate-rag/&quot;&gt;Promptfoo&lt;/a&gt;, &lt;a href=&quot;https://www.patronus.ai/&quot;&gt;Patronus&lt;/a&gt;, &lt;a href=&quot;https://research.trychroma.com/generative-benchmarking&quot;&gt;Chroma&apos;s generative-benchmarking&lt;/a&gt; — eight serious efforts.&lt;/p&gt;
&lt;p&gt;RAGAS, LlamaIndex, DeepEval, and Chroma&apos;s research repo all do something useful. Each ships a slice. But every one of them is shaped for &lt;em&gt;enterprise document collections&lt;/em&gt; — pre-curated PDFs, well-defined domains, large doc sets where Q&amp;amp;A generation is expensive and rare. None of them is shaped for &quot;point me at my notes folder, give me a benchmark in 10 minutes.&quot; Nobody opinionatedly ships the personal-corpus version of the workflow.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://research.trychroma.com/generative-benchmarking&quot;&gt;Chroma&apos;s W&amp;amp;B case study&lt;/a&gt; — the closest prior art — improved corpus alignment from 46% → 75% by generating queries directly from the corpus. That is the right idea. It is also a research repo, not a product. The closest individual example I could find is &lt;a href=&quot;https://github.com/jasonjgarcia24/grounded&quot;&gt;&lt;code&gt;jasonjgarcia24/grounded&lt;/code&gt;&lt;/a&gt;, where one developer built a RAG over personal Gmail with a bespoke eval, no framework. That is the field&apos;s tell. &lt;strong&gt;Practitioners are doing this by hand because the tooling does not yet exist.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I call this &lt;strong&gt;the Voice-Tool Gap&lt;/strong&gt;: when thought leaders say it, when the labs ship cookbook recipes for it, when individual practitioners build it manually, but no opinionated tool exists for the most-asked use case — that is a gap, not a saturation.&lt;/p&gt;
&lt;p&gt;The post does not solve the gap. Brain&apos;s harness is a sketch, not a ship. But the gap is what makes the next experiment obvious.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Build It: A Five-Technique Recipe for Owned-Corpus Benchmarks&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/owned-corpus-deck/slide-05.png&quot; alt=&quot;The Pivot — Generic Vendor MTEB → The Owned-Corpus Eval. You cannot download the benchmark you need. You must build it.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Here is the harness shape I would build today, given a corpus that looks like Brain&apos;s: ~10k mixed docs (sessions, distilled summaries, raw articles, wiki concepts, browsing history), with both raw and synthesized layers, and a session-level usage log.&lt;/p&gt;
&lt;p&gt;The recipe is five techniques, not one. Each compensates for the failure modes of the others.&lt;/p&gt;
&lt;h3&gt;1. RAGAS over your raw corpus&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/owned-corpus-deck/slide-06.png&quot; alt=&quot;Step 1: RAGAS Over Raw Docs — generator (simple, reasoning, multi_context) → LLM judge filter → refined baseline questions&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Start with &lt;a href=&quot;https://docs.ragas.io/en/stable/concepts/test_data_generation/rag/&quot;&gt;RAGAS&apos;s testset generator&lt;/a&gt; over your raw docs. RAGAS turns a chunk into seed questions, then evolves them via Evol-Instruct-style operators (&lt;code&gt;simple&lt;/code&gt;, &lt;code&gt;reasoning&lt;/code&gt;, &lt;code&gt;multi_context&lt;/code&gt;, &lt;code&gt;conditioning&lt;/code&gt;) to break out of &quot;the answer is the chunk.&quot;&lt;/p&gt;
&lt;p&gt;Failure mode: lexical leakage. Generated questions can parrot the chunk verbatim, which inflates BM25 scores. Filter with an LLM-judge &quot;would a human ask this exact question?&quot; pass.&lt;/p&gt;
&lt;h3&gt;2. Reverse-direction grounding (synthesized → source)&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/owned-corpus-deck/slide-07.png&quot; alt=&quot;Step 2: Reverse-Direction Grounding — synthesized concept page → mask entity → ask question → find raw source articles. Tests whether retrieval crosses lexical boundaries.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This is the technique that exploits the structure of a personal corpus: you have synthesized concept pages (&lt;code&gt;kb-wiki&lt;/code&gt;) &lt;em&gt;and&lt;/em&gt; the source articles they were synthesized from (&lt;code&gt;kb-raw&lt;/code&gt;). Take a wiki page about a concept, mask the entity name, and ask an LLM to write a question &lt;em&gt;about the concept without naming it&lt;/em&gt;. Then test whether retrieval finds the underlying source articles. The metric is &lt;code&gt;kb-raw recall@k&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This is closest in spirit to &lt;a href=&quot;https://arxiv.org/pdf/1906.00300&quot;&gt;ORQA&apos;s latent retrieval&lt;/a&gt; — the &quot;real test&quot; of semantic search. It is the eval you cannot do on a generic corpus, and it is the eval that tells you whether your retrieval crosses lexical boundaries.&lt;/p&gt;
&lt;h3&gt;3. Hindsight queries from real session logs&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/owned-corpus-deck/slide-08.png&quot; alt=&quot;Step 3: Hindsight Queries — empty-result  calls → reformulated, then user opens a specific doc. The doc the user actually opened is the implicit gold label.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Mine real query logs. For Brain, this means scraping &lt;code&gt;mcp__brain__query&lt;/code&gt; calls from every Claude Code session where the result set was empty or the user reformulated the query immediately. Each one is a gold &quot;hard query&quot; — distribution-shifted by definition.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.cs.cornell.edu/people/tj/publications/joachims_etal_05a.pdf&quot;&gt;Joachims&apos;s clickthrough-as-implicit-feedback&lt;/a&gt; is the academic anchor; &lt;a href=&quot;https://www.researchgate.net/publication/260943619&quot;&gt;Microsoft Xbox&apos;s search relevance work&lt;/a&gt; and &lt;a href=&quot;https://dropbox.tech/machine-learning/llm-human-labeling-improving-search-relevance-dropbox-dash&quot;&gt;Dropbox Dash&apos;s labeling pipeline&lt;/a&gt; are the production analogues. For a personal corpus, you instrument the retriever and treat &lt;em&gt;the doc the user actually opened next&lt;/em&gt; as the gold label. Cold start is real (you need usage logs to mine), but for any system that has been running a few months, this is your most realistic gold set.&lt;/p&gt;
&lt;h3&gt;4. Distractor injection&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/owned-corpus-deck/slide-09.png&quot; alt=&quot;Step 4: Distractor Injection — BM25 top-20 minus gold, LLM-filtered. LongMemEval has no distractors, rewarding &amp;quot;store everything.&amp;quot; A benchmark without near-misses is fiction.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Take each (query, gold doc) pair and add hard negatives — semantically near, lexically near, but wrong. Standard recipe: BM25 top-20 minus gold, then LLM-filter false negatives (&lt;a href=&quot;https://aclanthology.org/2025.findings-emnlp.481.pdf&quot;&gt;RLHN, 2025&lt;/a&gt;). Now you can measure precision@k and MRR, not just recall@k.&lt;/p&gt;
&lt;p&gt;This is the single biggest fix to the LongMemEval failure mode. LongMemEval rewards &quot;store everything&quot; because there are no distractors that look like answers. Personal corpora are full of near-misses. If your eval does not include them, your number is fiction.&lt;/p&gt;
&lt;h3&gt;5. Calibration via small hand-graded set&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/owned-corpus-deck/slide-10.png&quot; alt=&quot;Step 5: Calibration — Cohen&apos;s Kappa ≥ 0.6 between LLM judge and 50 hand-graded examples. Recalibrate on every judge prompt revision.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Hand-grade 50 examples. Compute Cohen&apos;s kappa between your hand labels and the LLM-judge labels. Refuse to trust bulk metrics until kappa ≥ 0.6 (&lt;a href=&quot;https://langfuse.com/docs/evaluation/evaluation-methods/llm-as-a-judge&quot;&gt;Langfuse&apos;s calibration guide&lt;/a&gt; is the cleanest writeup). 50 is the floor for a stable agreement metric; less and your kappa is noise.&lt;/p&gt;
&lt;p&gt;You only do this once per judge prompt revision. It is cheap, and it is the only way to defend &quot;we used an LLM judge.&quot;&lt;/p&gt;
&lt;h3&gt;Putting it together&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/owned-corpus-deck/slide-11.png&quot; alt=&quot;The Synthesis: A 280-Question Harness — RAGAS over raw (200) + reverse grounding (50) + mined queries (30) + distractors + calibration gate. Two days of work for a functional, opinionated test suite.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Total: 280 questions, four metrics, one calibration. About two days of work for someone with a working corpus and a working retriever. Skip step 3 if you do not yet have query volume; everything else is bootstrappable.&lt;/p&gt;
&lt;p&gt;The harness shape, mirrored from &lt;code&gt;bench/longmemeval/&lt;/code&gt;: a &lt;code&gt;fetch.py&lt;/code&gt; to mine the corpus into chunks; &lt;code&gt;generate.py&lt;/code&gt; for RAGAS testset over &lt;code&gt;kb-raw&lt;/code&gt;; &lt;code&gt;reverse.py&lt;/code&gt; for &lt;code&gt;kb-wiki → kb-raw&lt;/code&gt; masked questions; &lt;code&gt;replay.py&lt;/code&gt; for zero-result query mining from session logs; &lt;code&gt;distractors.py&lt;/code&gt; for BM25 top-&lt;code&gt;k&lt;/code&gt; minus gold with an LLM filter; &lt;code&gt;calibrate.py&lt;/code&gt; for the 50-sample hand-grade kappa gate; &lt;code&gt;run.py&lt;/code&gt; for retrieval and reader with scrub-on-restart resume; &lt;code&gt;score.py&lt;/code&gt; with tri-state labels and oracle ceilings; and &lt;code&gt;report.py&lt;/code&gt; for the per-collection × per-question-type breakdown with CIs.&lt;/p&gt;
&lt;p&gt;This is the next thing I am going to build.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What Will Go Wrong (And How the Field Thinks About It)&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/owned-corpus-deck/slide-12.png&quot; alt=&quot;Anticipating Failure Modes — Reader Context Truncation, Coupled-Failure Anti-pattern, Error-as-Negative Aliasing, Train/Test Contamination — with each fix&quot; /&gt;&lt;/p&gt;
&lt;p&gt;I built the LongMemEval port and hit five real bugs along the way: a 12K context cap that silently truncated long haystacks, a &lt;a href=&quot;https://eugeneyan.com/writing/llm-patterns/&quot;&gt;coupled-failure anti-pattern&lt;/a&gt; where reader exceptions threw away retrieval logs, judge rate-limit errors recorded as &lt;code&gt;label=False&lt;/code&gt;, cross-question leakage from shared scratch collections, and eval QAs getting re-indexed into Brain&apos;s production corpus on the next sync. Each has a clean field-term framing. Listing them honestly is the part of the post I owe the reader, because every one will catch the next person to build a personal-corpus benchmark.&lt;/p&gt;
&lt;p&gt;There are four more pitfalls the field knows about that I had not yet hit:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/owned-corpus-deck/slide-13.png&quot; alt=&quot;The Blind Spots — Judge Variance (run both orderings, majority of 3), Statistical Power (50 Qs claiming 3% lift measures nothing), and Retrieval ≠ Reader (always report both, the Evidence Conversion Gap localizes regressions)&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Judge variance and bias.&lt;/strong&gt; GPT-4-class judges hit ~80% human agreement only after controlling position, verbosity, and self-enhancement bias (&lt;a href=&quot;https://arxiv.org/abs/2306.05685&quot;&gt;Zheng et al., MT-Bench&lt;/a&gt;). Position bias is non-random — running both orderings is required, not optional (&lt;a href=&quot;https://arxiv.org/html/2406.07791v7&quot;&gt;Shi et al., 2024&lt;/a&gt;). For a credible owned-corpus eval, run both orderings and take majority of three samples; &lt;a href=&quot;https://aclanthology.org/2025.findings-acl.1030.pdf&quot;&gt;confidence-improved self-consistency&lt;/a&gt; cuts the sample budget by ~40%.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Retrieval ≠ reader. Always report both.&lt;/strong&gt; The single biggest mistake in vendor reporting is publishing recall@k as if it were end-to-end accuracy. Brain&apos;s &lt;a href=&quot;https://sharadja.in/blog/brain-longmemeval-sonnet-benchmark#the-run-that-counted&quot;&gt;Evidence Conversion Gap&lt;/a&gt; — recall@5 = 91.55%, QA = 81.60%, gap = 9.95 points — is a generalizable framing. Always report retrieval and answer metrics separately, and put oracle-retrieval and oracle-reader ceilings on each so you can localize regressions (&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/&quot;&gt;Hamel&apos;s LLM-as-judge guide&lt;/a&gt; covers this cleanly).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Goodhart and overfitting your eval.&lt;/strong&gt; The moment you start tuning to your own benchmark, the metric measures tuning effort. &lt;a href=&quot;https://sohl-dickstein.github.io/2022/11/06/strong-Goodhart.html&quot;&gt;Sohl-Dickstein on strong Goodhart&lt;/a&gt; is the canonical reference; the &lt;a href=&quot;https://blog.collinear.ai/p/gaming-the-system-goodharts-law-exemplified-in-ai-leaderboard-controversy&quot;&gt;Chatbot Arena gaming case&lt;/a&gt; is the public demonstration. Hold out 20–30% of your questions until milestone gates. Freeze v1 of your eval set when the corpus changes meaningfully; v2 reports both splits.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Statistical power.&lt;/strong&gt; &lt;a href=&quot;https://cameronrwolfe.substack.com/p/stats-llm-evals&quot;&gt;Cameron Wolfe&apos;s stats-for-LLM-evals primer&lt;/a&gt; gives the rule of thumb: 25–50 questions detect a 10% gap, 100–200 detect 5%, 2500–5000 detect 1% at 80% power. LLM outputs are correlated, which means effective N is lower than nominal. If your benchmark has 50 questions and you are claiming a 3% improvement, you have not measured anything yet.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Where This Breaks&lt;/h2&gt;
&lt;p&gt;I owe the same honesty about this thesis that I owed about the LongMemEval result.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Limitation&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;th&gt;What I would do before claiming more&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;The owned-corpus benchmark cannot be cross-system compared&lt;/td&gt;
&lt;td&gt;LongMemEval at least lets you compare Brain to Mastra&lt;/td&gt;
&lt;td&gt;Keep running LongMemEval as a comparison signal alongside the owned-corpus benchmark; do not abandon either&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ground-truth Q&amp;amp;A from your own corpus has its own biases&lt;/td&gt;
&lt;td&gt;The generator model leaks vocabulary; reverse-grounding can anchor on hallucinated wiki claims; mined queries are survivorship-biased&lt;/td&gt;
&lt;td&gt;Hybrid generation across multiple models; manual review of the wiki-side anchors; explicit &quot;no result&quot; gold for mined queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM-as-judge is the floor, not the ceiling&lt;/td&gt;
&lt;td&gt;Your judge has the same blind spots as your reader&lt;/td&gt;
&lt;td&gt;Use a different model family for judge vs. reader; calibrate against 50 hand-graded examples; report kappa publicly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Personal corpus eval is high-variance with one rater&lt;/td&gt;
&lt;td&gt;One person&apos;s labels are noisy compared to inter-rater agreement&lt;/td&gt;
&lt;td&gt;Hand-grade with at least one collaborator before publishing; or compare to an independent LLM-judge of a different family&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storing your eval set securely&lt;/td&gt;
&lt;td&gt;Personal data leaks if the eval corpus is open-sourced naively&lt;/td&gt;
&lt;td&gt;Keep eval data in a private collection; canary tokens prevent re-ingest; never commit ground-truth to a public repo&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This is a methodology argument, not a finished benchmark. The work is the next post.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What I Am Actually Saying&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/owned-corpus-deck/slide-14.png&quot; alt=&quot;The Prescriptive Number — not &amp;quot;we hit 94.87% on LongMemEval&amp;quot; but &amp;quot;8-point lift against our own queries, calibrated to kappa 0.71.&amp;quot; Generic evals are a weak prior. The number that tells you whether to ship comes from your benchmark, on your corpus.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The argument is narrower than &quot;stop running LongMemEval.&quot;&lt;/p&gt;
&lt;p&gt;LongMemEval is useful for cross-vendor comparison. It exposed a real weakness in Brain&apos;s multi-session synthesis. It will continue to be a credibility signal for vendor leaderboards. ==Generic memory benchmarks are a weak prior, not zero signal.==&lt;/p&gt;
&lt;p&gt;The argument is sharper than that:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;For a deployment that exists to help one user (or one team) with one corpus, the leaderboard number is decoration. The prescriptive number — the one that tells you whether to ship — comes from a benchmark you build on your own corpus, with questions you actually ask, on the failure modes that actually bite you.&lt;/strong&gt; The leaderboard cannot do that work. The labs already say so. The frameworks ship the pieces. The field has the language. What is missing is the opinionated tool that turns &lt;em&gt;a notes folder + a retriever&lt;/em&gt; into a benchmark in ten minutes.&lt;/p&gt;
&lt;p&gt;If someone builds that tool, half the memory-layer market will use it. Vendor leaderboards will become a sanity check, not a North Star. The conversation will move from &quot;we hit 94.87% on LongMemEval&quot; to &quot;the user reports an 8-point lift in retrieval-and-answer correctness against their own questions, calibrated against 50 hand-graded examples, kappa 0.71.&quot;&lt;/p&gt;
&lt;p&gt;That is a more useful conversation. It is also a harder one — because the benchmark is yours, not ours.&lt;/p&gt;
&lt;p&gt;I am building the tool for Brain. The harness from the LongMemEval port is the spine; the techniques in section 5 are the layers. The next post in this series will be the result.&lt;/p&gt;
&lt;p&gt;The closing question is simple: &lt;strong&gt;does anyone else want to build the public version of this?&lt;/strong&gt; RAGAS, LlamaIndex, Chroma, DeepEval, Phoenix all ship the substrate. None of them ships &lt;em&gt;the opinionated personal-corpus harness&lt;/em&gt;. Whoever does will set the next standard for memory-layer evaluation.&lt;/p&gt;
&lt;p&gt;I will read your benchmark before I read your leaderboard.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/owned-corpus-deck/slide-15.png&quot; alt=&quot;The tooling substrate exists. The opinionated harness does not. Who will build the public version? — RAGAS, LlamaIndex, Phoenix, DeepEval all visible in the background.&quot; /&gt;&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Links Worth Following&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Topic&lt;/th&gt;
&lt;th&gt;Link&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LongMemEval paper&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://arxiv.org/abs/2410.10813&quot;&gt;arXiv:2410.10813&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Defines the benchmark and the long-term-memory task shape&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LoCoMo audit&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://dev.to/penfieldlabs/we-audited-locomo-64-of-the-answer-key-is-wrong-and-the-judge-accepts-up-to-63-of-intentionally-33lg&quot;&gt;Penfield Labs / dev.to&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Documents 6.4% wrong answers in the LoCoMo answer key and 62.81% wrong-answer acceptance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mem0 vs Letta methodology dispute&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://www.letta.com/blog/benchmarking-ai-agent-memory&quot;&gt;Letta blog&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Concrete example of vendor benchmark misconfiguration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zep LoCoMo correction&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://github.com/getzep/zep-papers/issues/5&quot;&gt;Zep Papers Issue #5&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;The 84% → 58.44% → 75.14% saga; cautionary tale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mastra observational-memory leaderboard&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://mastra.ai/research/observational-memory&quot;&gt;mastra.ai/research&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;The most-cited public LongMemEval scoreboard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAGAS testset generation&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://docs.ragas.io/en/stable/concepts/test_data_generation/rag/&quot;&gt;RAGAS docs&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;The closest-to-product Q&amp;amp;A-from-corpus tool&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LlamaIndex Llama Datasets&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://www.llamaindex.ai/blog/introducing-llama-datasets-aadb9994ad9e&quot;&gt;LlamaIndex blog&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;The other framework path; closest to &quot;benchmark in a few lines&quot;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic Contextual Retrieval&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://platform.claude.com/cookbook/capabilities-contextual-embeddings-guide&quot;&gt;platform.claude.com cookbook&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Anthropic&apos;s official &quot;synthetic eval set from your own docs&quot; recipe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI RAG eval cookbook&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://cookbook.openai.com/examples/evaluation/evaluate_rag_with_llamaindex&quot;&gt;cookbook.openai.com&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;The OpenAI version of the same recipe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hamel Husain on RAG evals&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/evals-faq/how-should-i-approach-evaluating-my-rag-system.html&quot;&gt;hamel.dev&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Sharpest &quot;generic metrics are useless&quot; argument I have read&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jason Liu on MTEB&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://jxnl.co/writing/2025/09/11/stop-trusting-mteb-rankings-kelly-hong-chroma/&quot;&gt;jxnl.co&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Domain-specific eval as the only signal that survives&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chroma generative-benchmarking&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://research.trychroma.com/generative-benchmarking&quot;&gt;research.trychroma.com&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Closest existing prior art for owned-corpus generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MT-Bench / LLM-as-judge&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://arxiv.org/abs/2306.05685&quot;&gt;Zheng et al.&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Judge bias controls; the position-bias and self-preference findings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stats for LLM evals&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://cameronrwolfe.substack.com/p/stats-llm-evals&quot;&gt;Cameron Wolfe&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;The N-vs-effect-size table every benchmark builder needs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;This is post #4 in a series on agentic AI memory infrastructure. Earlier posts: &lt;a href=&quot;https://sharadja.in/blog/system-prompt-architecture-agentic-ai&quot;&gt;The 14K Token Debt&lt;/a&gt;, &lt;a href=&quot;https://sharadja.in/blog/benchmarking-long-term-ai-agent-memory&quot;&gt;Benchmarking Long-Term AI Agent Memory&lt;/a&gt;, &lt;a href=&quot;https://sharadja.in/blog/brain-longmemeval-sonnet-benchmark&quot;&gt;Brain on LongMemEval: 81.6%&lt;/a&gt;. The next post in the series will be the owned-corpus benchmark itself, when it ships.&lt;/em&gt;&lt;/p&gt;
</content:encoded><enclosure url="https://sharadja.in/images/blogs/benchmarking-memory/memory-compounding-hero.png" type="image/png" length="0"/><category>AI</category><category>memory</category><category>agents</category><category>RAG</category><category>evaluation</category><category>benchmark</category><category>LongMemEval</category><category>LoCoMo</category><category>Brain</category><category>MCP</category><category>RAGAS</category><author>Sharad Jain</author></item><item><title>Benchmarking Brain on LongMemEval: 81.6% and the Product Lesson</title><link>https://sharadja.in/blog/brain-longmemeval-sonnet-benchmark/</link><guid isPermaLink="true">https://sharadja.in/blog/brain-longmemeval-sonnet-benchmark/</guid><description>Brain is a product bet: index important data once, let Claude Code, OpenClaw, and Hermes ask naturally, and avoid stuffing every session with expensive context. LongMemEval tested whether that bet survives a real memory benchmark.</description><pubDate>Fri, 01 May 2026 00:00:00 GMT</pubDate><content:encoded>&lt;h1&gt;81.6% on LongMemEval&lt;/h1&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/longmemeval-deck/slide-01.png&quot; alt=&quot;Benchmarking Brain — measuring long-term memory for AI agents&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Brain scored &lt;strong&gt;408/500 = 81.60%&lt;/strong&gt; on LongMemEval_s — matching Supermemory&apos;s GPT-4o row, using only BM25 retrieval and a Claude Sonnet reader, on a stack that runs locally.&lt;/p&gt;
&lt;p&gt;The result has a sharp boundary: my judge was Sonnet, not GPT-4o, so this is not a strict leaderboard replacement. But end-to-end it is a clean, full 500-question run with zero reader errors and zero judge errors. That is the number I trust.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dataset&lt;/td&gt;
&lt;td&gt;LongMemEval_s cleaned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Questions&lt;/td&gt;
&lt;td&gt;500/500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retriever&lt;/td&gt;
&lt;td&gt;&lt;code&gt;inproc-bm25&lt;/code&gt;, top-k=5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reader&lt;/td&gt;
&lt;td&gt;&lt;code&gt;claude-cli&lt;/code&gt;, &lt;code&gt;claude-sonnet-4-6&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Judge&lt;/td&gt;
&lt;td&gt;&lt;code&gt;claude-sonnet-4-6&lt;/code&gt; using the vendored official yes/no templates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reader errors&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Judge errors&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QA accuracy&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;408/500 = 81.60%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retrieval recall@5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;91.55%&lt;/strong&gt; over 470 non-abstention questions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Brain is a product bet, not a chat UI. It indexes the important data around my work once — Claude Code sessions, OpenClaw runs, Hermes conversations, project docs, browser trails, notes, decisions, failures, the operational exhaust that usually disappears — and lets any agent surface ask that memory naturally: &lt;em&gt;what did we decide about the auth flow?&lt;/em&gt;, &lt;em&gt;where did this error happen before?&lt;/em&gt;, &lt;em&gt;which benchmark run was clean?&lt;/em&gt;, &lt;em&gt;what did I already try and reject?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The bet is narrower than &quot;stuff every session with context&quot;:&lt;/p&gt;
&lt;p&gt;==Index the important past once, retrieve only the evidence needed now, and let the agent talk to that memory like a native part of its workspace.==&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/longmemeval-deck/slide-02.png&quot; alt=&quot;The product bet is narrow and deliberate — index operational exhaust once, retrieve naturally to agent surfaces&quot; /&gt;&lt;/p&gt;
&lt;p&gt;If Brain is going to sit behind Claude Code, OpenClaw, and Hermes as the memory layer, it cannot just feel useful. It has to answer held-out questions from long histories, recover the right evidence cheaply, and fail in ways I can inspect. Memory benchmarks are especially easy to overstate — a retrieval score can look like an answer score, a small sample can look like a full benchmark, a rate-limited run can leave half the dataset silently broken, a judge can be changed just enough to make the number prettier.&lt;/p&gt;
&lt;p&gt;I benchmarked Brain because &quot;it feels useful&quot; is not enough to build a great product.&lt;/p&gt;
&lt;p&gt;The rule I took from the whole exercise is the &lt;strong&gt;Clean Number Rule&lt;/strong&gt;: if the run is partial, rate-limited, re-judged inconsistently, or silently missing questions, it is not a score — it is a debugging artifact.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/longmemeval-deck/slide-03.png&quot; alt=&quot;The Clean Number Rule&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The journey had four distinct phases:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;What happened&lt;/th&gt;
&lt;th&gt;What survived&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Harness build&lt;/td&gt;
&lt;td&gt;Built &lt;code&gt;bench/longmemeval/&lt;/code&gt;, dataset fetch, ingestion, retrieval, reader, judge wrappers&lt;/td&gt;
&lt;td&gt;A repeatable benchmark loop instead of a demo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;First Sonnet signal&lt;/td&gt;
&lt;td&gt;A 25-question burst hit 84%&lt;/td&gt;
&lt;td&gt;Useful signal, not publishable evidence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failed full run&lt;/td&gt;
&lt;td&gt;Parallel Sonnet readers hit account limits and left 286/500 reader errors&lt;/td&gt;
&lt;td&gt;A hard rule: no error-polluted scores&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Clean full run&lt;/td&gt;
&lt;td&gt;Retried failed qids serially, re-judged all 500&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;408/500 = 81.60%&lt;/strong&gt;, 0 reader errors, 0 judge errors&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That last row is the only one I am willing to call the benchmark result.&lt;/p&gt;
&lt;p&gt;The story is not &quot;we got a benchmark score.&quot; The story is that a local memory layer can become product infrastructure only after it learns to prove what it remembers.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Why LongMemEval&lt;/h2&gt;
&lt;p&gt;Most memory systems are evaluated with questions that are too clean. A fact is inserted. A question asks for the fact. Retrieval finds the fact. The demo works.&lt;/p&gt;
&lt;p&gt;Real agent memory is messier than that. The hard cases are not just &quot;what was the user&apos;s dog&apos;s name?&quot; They are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;facts scattered across sessions&lt;/li&gt;
&lt;li&gt;old information overwritten by newer information&lt;/li&gt;
&lt;li&gt;timestamps that change the answer&lt;/li&gt;
&lt;li&gt;assistant-side statements that matter later&lt;/li&gt;
&lt;li&gt;preference questions where the evidence is implicit&lt;/li&gt;
&lt;li&gt;abstention questions where the right behavior is to say there is not enough information&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is why I used &lt;a href=&quot;https://xiaowu0162.github.io/long-mem-eval/&quot;&gt;LongMemEval&lt;/a&gt;. The benchmark was built for long-term interactive memory, not generic RAG. The &lt;a href=&quot;https://github.com/xiaowu0162/LongMemEval&quot;&gt;LongMemEval GitHub repository&lt;/a&gt; describes 500 questions covering information extraction, multi-session reasoning, knowledge updates, temporal reasoning, and abstention, and links the cleaned data on &lt;a href=&quot;https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned&quot;&gt;Hugging Face&lt;/a&gt;. The underlying paper is &lt;a href=&quot;https://arxiv.org/abs/2410.10813&quot;&gt;LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The small variant, LongMemEval_s, gives each question roughly 115k tokens of chat history across about 40 sessions; the medium variant pushes toward 500 sessions.&lt;/p&gt;
&lt;p&gt;That shape matters. It separates three things people often collapse into one:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Question it answers&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Retrieval&lt;/td&gt;
&lt;td&gt;Did the system find the right evidence?&lt;/td&gt;
&lt;td&gt;A memory system must surface the relevant past.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reading&lt;/td&gt;
&lt;td&gt;Did the model interpret the retrieved evidence correctly?&lt;/td&gt;
&lt;td&gt;Retrieval alone does not answer the question.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Judging&lt;/td&gt;
&lt;td&gt;Did the answer match the expected answer?&lt;/td&gt;
&lt;td&gt;The final metric has to be end-to-end.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/longmemeval-deck/slide-06.png&quot; alt=&quot;Three distinct hardware layers govern agentic memory — retrieval, reading, judging&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The LongMemEval authors also make a point that matched my own experience: even with strong long-context models, long-term memory still needs explicit machinery. The benchmark is not just asking whether a model has a long context window. It is asking whether the system can manage a growing interaction history.&lt;/p&gt;
&lt;p&gt;That is exactly the claim Brain has to survive: not &quot;can we stuff all history into context?&quot;, but &quot;can a retrieval substrate make the right past available to the right agent at the right moment without paying a huge token and latency tax?&quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Product Claim&lt;/h2&gt;
&lt;p&gt;Brain started as my local-first memory layer for Claude Code and related agent work, but the product shape is broader than one assistant.&lt;/p&gt;
&lt;p&gt;At the time of the earlier production writeup, the system was already indexing hundreds of Claude Code sessions into markdown, distilling high-signal summaries, and exposing retrieval through &lt;a href=&quot;https://github.com/tobi/qmd&quot;&gt;QMD&lt;/a&gt; and &lt;a href=&quot;https://modelcontextprotocol.io/&quot;&gt;MCP&lt;/a&gt;. The goal was to make memory feel native across agent surfaces: important data flows into a normalized markdown corpus, gets indexed lexically and semantically (lexical / vector / HyDE), and exposes itself as &lt;code&gt;brain ask&lt;/code&gt; / &lt;code&gt;brain recent&lt;/code&gt; / MCP — so Claude Code, OpenClaw, and Hermes can ask natural questions against the user&apos;s past.&lt;/p&gt;
&lt;p&gt;This is the key product distinction:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bad memory product&lt;/th&gt;
&lt;th&gt;Brain product goal&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Paste giant summaries into every session&lt;/td&gt;
&lt;td&gt;Retrieve small evidence slices when needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Make the user manage notes manually&lt;/td&gt;
&lt;td&gt;Index operational exhaust automatically&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Optimize for &quot;chat with your docs&quot; demos&lt;/td&gt;
&lt;td&gt;Optimize for agents doing real work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hide failures behind fluent answers&lt;/td&gt;
&lt;td&gt;Show evidence, gaps, and confidence boundaries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spend more tokens to feel safer&lt;/td&gt;
&lt;td&gt;Spend fewer tokens by retrieving better&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/longmemeval-deck/slide-04.png&quot; alt=&quot;The Demo Trap vs The Brain Contract&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The system helped me every day. It could remember decisions, commands, failures, review comments, abandoned approaches, and &quot;we already tried that&quot; context. But daily usefulness creates a trap. If a memory system helps you personally, you start trusting it before you have measured it.&lt;/p&gt;
&lt;p&gt;That is dangerous because memory failures often look plausible. The agent gives an answer with confidence, but the missing evidence is invisible. For a product, that is the failure mode that matters: not forgetting loudly, but remembering wrongly while sounding useful.&lt;/p&gt;
&lt;p&gt;So I needed an external test with enough structure to make failure legible.&lt;/p&gt;
&lt;p&gt;I call this the &lt;strong&gt;Harness Before Hype&lt;/strong&gt; rule:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Temptation&lt;/th&gt;
&lt;th&gt;Better discipline&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Publish an architecture diagram&lt;/td&gt;
&lt;td&gt;First publish the metric it survives.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Report retrieval recall&lt;/td&gt;
&lt;td&gt;Also report end-to-end QA.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Show the best examples&lt;/td&gt;
&lt;td&gt;Score all 500 questions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Optimize the prompt live&lt;/td&gt;
&lt;td&gt;Keep the judge template fixed.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Round the number up&lt;/td&gt;
&lt;td&gt;Preserve the clean run exactly.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That rule shaped the whole LongMemEval project, but the product reason was simple: if Brain is going to save tokens and time for real agent work, I need to know what quality I am buying with that cheap retrieval path.&lt;/p&gt;
&lt;p&gt;The benchmark also made Brain more legible as a product. Before LongMemEval, Brain was a useful local memory layer. After LongMemEval, it had a measurable contract:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Claim&lt;/th&gt;
&lt;th&gt;Measurement needed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Brain remembers prior work&lt;/td&gt;
&lt;td&gt;End-to-end QA on held-out questions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retrieval is good&lt;/td&gt;
&lt;td&gt;Recall@k against evidence sessions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The reader is good&lt;/td&gt;
&lt;td&gt;QA accuracy given retrieved sessions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The system is robust&lt;/td&gt;
&lt;td&gt;500/500 complete, no reader or judge errors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The result is comparable&lt;/td&gt;
&lt;td&gt;Fixed dataset, fixed judge templates, named models&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That contract is what turns &quot;memory&quot; from a feature into infrastructure.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Harness&lt;/h2&gt;
&lt;p&gt;I built &lt;code&gt;bench/longmemeval/&lt;/code&gt; as a normal benchmark harness, not a one-off notebook.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/longmemeval-deck/slide-05.png&quot; alt=&quot;The test harness isolates the mechanics of memory — Fetch → Ingest → Retrieve → Read → Judge&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The pipeline had five stages: fetch the dataset, ingest per-question corpora, retrieve the top-&lt;code&gt;k&lt;/code&gt; evidence sessions, ask a reader model for the answer, and judge that answer with LongMemEval-style yes/no templates.&lt;/p&gt;
&lt;p&gt;The first committed baseline was deliberately boring: pure in-process &lt;a href=&quot;https://nlp.stanford.edu/IR-book/html/htmledition/okapi-bm25-a-non-binary-model-1.html&quot;&gt;BM25&lt;/a&gt;, no embeddings, no external vector DB, no custom memory graph.&lt;/p&gt;
&lt;p&gt;That was the point. Before testing the full Brain stack, I wanted the cheapest credible product baseline: if BM25 plus a good reader already goes far, then the product can start fast and local instead of defaulting to expensive context stuffing.&lt;/p&gt;
&lt;p&gt;The critical command shape looked like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;python3 -m bench.longmemeval.run \
  --variant s \
  --retriever inproc-bm25 \
  --reader claude-cli \
  --tag fleet-0 \
  -k 5 \
  --qid-file bench/longmemeval/batches/batch_0.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The reader path used &lt;a href=&quot;https://docs.claude.com/en/docs/claude-code/headless&quot;&gt;&lt;code&gt;claude -p&lt;/code&gt;&lt;/a&gt; through the local Claude Code subscription: BM25 retrieves five sessions, Sonnet reads the question plus those retrieved sessions and emits one hypothesis, then the judge labels that hypothesis true or false.&lt;/p&gt;
&lt;p&gt;The judge path used the same official yes/no templates vendored from LongMemEval&apos;s &lt;code&gt;evaluate_qa.py&lt;/code&gt;, routed through Claude instead of GPT-4o.&lt;/p&gt;
&lt;p&gt;That last sentence is important. This was a clean internal run, but not a perfect apples-to-apples public leaderboard submission. The LongMemEval repository documents GPT-4o-based evaluation. My run used Sonnet as both reader and judge because it let me run the full system without API spend.&lt;/p&gt;
&lt;p&gt;So the precise claim is:&lt;/p&gt;
&lt;p&gt;==Brain&apos;s BM25 + Sonnet run scored 81.60% under a Sonnet implementation of the official LongMemEval yes/no judge templates.==&lt;/p&gt;
&lt;p&gt;It is a real end-to-end score. It is not the same thing as a GPT-4o-judged leaderboard entry.&lt;/p&gt;
&lt;p&gt;The harness became a small contract: same 500 questions, same cleaned dataset, same retrieval &lt;code&gt;k&lt;/code&gt;, same reader model, same judge templates, same aggregation script, no silent errors.&lt;/p&gt;
&lt;p&gt;That contract matters more than any single prompt tweak.&lt;/p&gt;
&lt;p&gt;The contract looked solid. Then the first serious run broke it, which is exactly why the harness mattered.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Run That Failed&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/longmemeval-deck/slide-07.png&quot; alt=&quot;Real benchmark infrastructure treats partial failure as toxic — the journey to the clean run&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The first full attempt did not give me the final number.&lt;/p&gt;
&lt;p&gt;It gave me lessons.&lt;/p&gt;
&lt;p&gt;The early signal was promising: a 25-question burst hit &lt;strong&gt;84%&lt;/strong&gt;. That was useful, but it was not enough evidence to publish. A 25-question sample can be lucky, skewed by category, or easier than the full distribution.&lt;/p&gt;
&lt;p&gt;Then I tried to scale the reader fleet in parallel. That was a mistake. Five parallel Sonnet readers tripped the weekly Claude limit and left the run with &lt;strong&gt;286/500 reader errors&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;At that point the harness had already taught me something more valuable than a score: benchmark infrastructure needs to treat partial failure as toxic.&lt;/p&gt;
&lt;p&gt;I made three changes:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Reader and retrieval errors were too entangled&lt;/td&gt;
&lt;td&gt;Split retrieval and reader try-blocks so failures were tagged correctly.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long runs were fragile&lt;/td&gt;
&lt;td&gt;Added resume support and qid-file batches.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rate limits polluted outputs&lt;/td&gt;
&lt;td&gt;Added a circuit breaker after repeated rate-limit errors.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The operational rule became:&lt;/p&gt;
&lt;p&gt;==Serial reader, parallel judge.==&lt;/p&gt;
&lt;p&gt;Readers are expensive, stateful, and easy to rate-limit. Judges are cheaper to resume and easier to shard. Once I separated those clocks, the full run became stable.&lt;/p&gt;
&lt;p&gt;That was the unglamorous work that made the final number publishable.&lt;/p&gt;
&lt;p&gt;This is also where &lt;a href=&quot;https://docs.claude.com/en/docs/claude-code/hooks&quot;&gt;Claude Code hooks&lt;/a&gt; became part of the broader Brain product story. Hooks are how everyday sessions enter the memory system automatically. The benchmark harness is the mirror image: instead of automatically capturing my work, it automatically forces the memory system to prove it can recover evidence later.&lt;/p&gt;
&lt;p&gt;After that failure, the benchmark became simpler and stricter: finish all 500 questions, retry failures serially, judge everything cleanly, and only then look at the score.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Run That Counted&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/longmemeval-deck/slide-08.png&quot; alt=&quot;The clean run scored 81.60% with zero silent failures — 408/500, 0 reader errors, 0 judge errors&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The clean run completed all 500 questions with zero reader errors and zero judge errors.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/longmemeval-deck/slide-09.png&quot; alt=&quot;Category breakdown reveals exactly where a lexical baseline breaks — strong on direct recall, weak on multi-session synthesis&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The result surprised me in two opposite ways.&lt;/p&gt;
&lt;p&gt;First, the simple baseline was much stronger than I expected. Plain BM25 plus Sonnet was enough to land in the same aggregate range as serious memory systems. &lt;a href=&quot;https://supermemory.ai/research/&quot;&gt;Supermemory&apos;s research page&lt;/a&gt; reports an 81.6% LongMemEval_s result for its GPT-4o row, with higher rows for stronger readers. My run matched that 81.6% aggregate number, while using a much simpler retrieval stack.&lt;/p&gt;
&lt;p&gt;Second, the breakdown made the weakness obvious. Brain was excellent at direct single-session recall and strong on knowledge updates. It was weak on preference and multi-session reasoning.&lt;/p&gt;
&lt;p&gt;That is exactly the failure pattern I would expect from BM25.&lt;/p&gt;
&lt;p&gt;BM25 is good at literal evidence. It is less good when the answer requires synthesizing weak signals across multiple sessions or inferring a preference from repeated behavior. Those are not just retrieval problems. They are representation and reasoning problems.&lt;/p&gt;
&lt;p&gt;The retrieval metric tells the same story. Recall@5 was &lt;strong&gt;91.55%&lt;/strong&gt;, but QA was &lt;strong&gt;81.60%&lt;/strong&gt;. That gap is the reader/judge split in action:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;recall@5&lt;/td&gt;
&lt;td&gt;Did the evidence session appear in the top 5?&lt;/td&gt;
&lt;td&gt;91.55%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QA accuracy&lt;/td&gt;
&lt;td&gt;Did the final answer pass the judge?&lt;/td&gt;
&lt;td&gt;81.60%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gap&lt;/td&gt;
&lt;td&gt;Evidence found but not converted into a correct answer&lt;/td&gt;
&lt;td&gt;9.95 points&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This is why I do not like memory claims that only report retrieval recall. Retrieval is necessary. It is not sufficient.&lt;/p&gt;
&lt;p&gt;I think about this as the &lt;strong&gt;Evidence Conversion Gap&lt;/strong&gt; — &lt;code&gt;recall@k − QA accuracy&lt;/code&gt;, or 91.55 − 81.60 = 9.95 points.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/longmemeval-deck/slide-10.png&quot; alt=&quot;The Evidence Conversion Gap — recall@5 91.55%, QA accuracy 81.60%, 9.95-point gap. Retrieval is necessary but not sufficient.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;That gap is where the next product work lives. A memory system that retrieves the right session but cannot turn that session into the right answer has not solved agent memory. It has solved evidence surfacing.&lt;/p&gt;
&lt;p&gt;So the aggregate said Brain was credible. The category breakdown said exactly where it was still weak. The next question was how that shape compared with a public memory system.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Where Brain Lands In The Public Field&lt;/h2&gt;
&lt;p&gt;LongMemEval does not have a single official leaderboard maintained by the benchmark authors. The public results are mostly vendor self-reports: research pages, leaderboard posts, press releases, and GitHub snippets. The fairest reading is &quot;best public claims I could verify,&quot; not a tournament with one referee.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/longmemeval-deck/slide-11.png&quot; alt=&quot;Brain sits in the middle of the public field using the simplest possible stack — ranked alongside Mastra, Vectorize, Emergence AI, Supermemory, RetainDB, and Zep&quot; /&gt;&lt;/p&gt;
&lt;p&gt;In decreasing order of verified accuracy, the field looks like this:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Best reported&lt;/th&gt;
&lt;th&gt;Reader&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Mastra (Observational Memory)&lt;/td&gt;
&lt;td&gt;94.87%&lt;/td&gt;
&lt;td&gt;gpt-5-mini&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://mastra.ai/research/observational-memory&quot;&gt;mastra.ai/research&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Vectorize / Hindsight&lt;/td&gt;
&lt;td&gt;91.40%&lt;/td&gt;
&lt;td&gt;gemini-3-pro-preview&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://mastra.ai/research/observational-memory&quot;&gt;Mastra leaderboard&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Emergence AI (internal)&lt;/td&gt;
&lt;td&gt;86.00%&lt;/td&gt;
&lt;td&gt;not publicly reproducible&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://mastra.ai/research/observational-memory&quot;&gt;Mastra leaderboard&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Supermemory&lt;/td&gt;
&lt;td&gt;85.20%&lt;/td&gt;
&lt;td&gt;gemini-3-pro-preview&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://mastra.ai/research/observational-memory&quot;&gt;Mastra leaderboard&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Brain (this run)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;81.60%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;claude-sonnet-4-6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;bench/longmemeval/&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;RetainDB&lt;/td&gt;
&lt;td&gt;79.00%&lt;/td&gt;
&lt;td&gt;oracle split&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://arxiv.org/abs/2410.10813&quot;&gt;arXiv 2410.10813&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Zep&lt;/td&gt;
&lt;td&gt;71.20%&lt;/td&gt;
&lt;td&gt;gpt-4o&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://mastra.ai/research/observational-memory&quot;&gt;Mastra leaderboard&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Lighter-weight claims sit outside that list until methodology is checked side by side: Ensue AI&apos;s 93.2% LinkedIn post and Backboard&apos;s &lt;a href=&quot;https://github.com/Backboard-io/Backboard-longmemEval-results&quot;&gt;93.4% GitHub snippet&lt;/a&gt; are real numbers but not yet apples-to-apples. Vectorize&apos;s 91.4% was &lt;a href=&quot;https://finance.yahoo.com/news/vectorize-breaks-90-longmemeval-open-140500897.html&quot;&gt;corroborated with The Washington Post and Virginia Tech partners&lt;/a&gt;; most of the other rows are single-source.&lt;/p&gt;
&lt;p&gt;Two things matter about where Brain lands.&lt;/p&gt;
&lt;p&gt;The first is that 81.60% sits in the middle of the public field while running the simplest possible stack: in-process BM25, top-k=5, no embeddings, no graph, no rerankers. Every system above Brain on this table runs richer memory machinery, a stronger reader, or both. Sitting in that band on a BM25 baseline is a stronger product signal than the rank itself.&lt;/p&gt;
&lt;p&gt;The second is that this is not an apples-to-apples ranking. Reader models differ across rows. Judge models differ — most public numbers were judged by GPT-4o; Brain was judged by Sonnet. Some entries are internal configurations the authors say are not reproducible. Some are leaderboard rows; others are press or social posts. Treating this as a tournament would be sloppy.&lt;/p&gt;
&lt;p&gt;LongMemEval itself was created by Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu, and the benchmark was accepted at ICLR 2025. Among the public results that exist today, the benchmarking activity is dominated by startups and product teams rather than named individual evaluators or standalone university labs. That is worth flagging because it shapes what &quot;the field&quot; is: a vendor scoreboard, not a peer-reviewed ranking.&lt;/p&gt;
&lt;p&gt;The most useful per-system comparison is still Supermemory, because their &lt;a href=&quot;https://supermemory.ai/research/&quot;&gt;research page&lt;/a&gt; breaks results out by category. That is the one place Brain can be matched on shape, not just aggregate:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System row&lt;/th&gt;
&lt;th&gt;Overall&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Full-context (gpt-4o)&lt;/td&gt;
&lt;td&gt;60.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zep (gpt-4o)&lt;/td&gt;
&lt;td&gt;71.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Supermemory (gpt-4o)&lt;/td&gt;
&lt;td&gt;81.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Supermemory (gpt-5)&lt;/td&gt;
&lt;td&gt;84.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Supermemory (gemini-3-pro)&lt;/td&gt;
&lt;td&gt;85.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Brain&apos;s clean Sonnet run tied the &lt;strong&gt;81.6% aggregate&lt;/strong&gt; reported for Supermemory&apos;s GPT-4o row. It did not beat Supermemory&apos;s higher-reader rows, and because my judge was Sonnet rather than GPT-4o, I would not present this as a strict leaderboard replacement.&lt;/p&gt;
&lt;p&gt;The category shape comparison:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/benchmarking-memory/category-comparison.png&quot; alt=&quot;Per-category accuracy on LongMemEval_s — Brain (BM25 + Sonnet) vs. Supermemory (gpt-4o)&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/longmemeval-deck/slide-12.png&quot; alt=&quot;Category-level comparison exposes a clear product roadmap — Brain wins on temporal/assistant/knowledge-update; trails on preference/multi-session&quot; /&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Brain&lt;/th&gt;
&lt;th&gt;Supermemory gpt-4o row&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;single-session-user&lt;/td&gt;
&lt;td&gt;96.88&lt;/td&gt;
&lt;td&gt;97.14&lt;/td&gt;
&lt;td&gt;-0.26&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;single-session-assistant&lt;/td&gt;
&lt;td&gt;100.00&lt;/td&gt;
&lt;td&gt;96.43&lt;/td&gt;
&lt;td&gt;+3.57&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;single-session-preference&lt;/td&gt;
&lt;td&gt;63.33&lt;/td&gt;
&lt;td&gt;70.00&lt;/td&gt;
&lt;td&gt;-6.67&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;knowledge-update&lt;/td&gt;
&lt;td&gt;88.89&lt;/td&gt;
&lt;td&gt;88.46&lt;/td&gt;
&lt;td&gt;+0.43&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;temporal-reasoning&lt;/td&gt;
&lt;td&gt;81.89&lt;/td&gt;
&lt;td&gt;76.69&lt;/td&gt;
&lt;td&gt;+5.20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;multi-session&lt;/td&gt;
&lt;td&gt;62.81&lt;/td&gt;
&lt;td&gt;71.43&lt;/td&gt;
&lt;td&gt;-8.62&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This is a much more useful pair than either aggregate.&lt;/p&gt;
&lt;p&gt;It says Brain&apos;s plain baseline is already competitive on direct recall, assistant-side recall, knowledge updates, and temporal reasoning. It also says the system is not yet good enough at cross-session synthesis or implicit preference modeling.&lt;/p&gt;
&lt;p&gt;That is a product roadmap hiding inside a benchmark table.&lt;/p&gt;
&lt;p&gt;I would not use this comparison or the broader leaderboard to claim &quot;Brain beats Supermemory&quot; or &quot;Brain is fifth on the leaderboard.&quot; Both framings would be sloppy. The stronger claim is narrower and more useful:&lt;/p&gt;
&lt;p&gt;==A local-first Brain baseline, using only BM25 and Sonnet, sits in the middle band of the public LongMemEval field — and exposes a clear multi-session weakness to fix next, before the more sophisticated parts of the Brain stack are even turned on.==&lt;/p&gt;
&lt;p&gt;That narrow claim is the point of the post. It is strong enough to matter, but constrained enough to be defensible.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Why This Was Not An 85% Post&lt;/h2&gt;
&lt;p&gt;There was an 85% target.&lt;/p&gt;
&lt;p&gt;There was an 84% small-sample burst.&lt;/p&gt;
&lt;p&gt;There was a Tier-S prompt-bundle run designed to test whether better answer discipline could push the system toward 85% or beyond.&lt;/p&gt;
&lt;p&gt;But I did not find a completed clean 85% Sonnet run.&lt;/p&gt;
&lt;p&gt;The Tier-S run was a good hypothesis, not a publishable result. It reached a partial state and then hit rate-limit failure. Publishing that as &quot;Brain got 85%&quot; would violate the reason I built the harness in the first place.&lt;/p&gt;
&lt;p&gt;Here is the standard I am using:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Evidence&lt;/th&gt;
&lt;th&gt;Publishable?&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;25-question burst at 84%&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Too small. Useful signal only.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full run with 286 reader errors&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Error-polluted.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Partial Tier-S run aimed at 85%&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Did not complete cleanly.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full 500-question run, 0 reader errors, 0 judge errors, 408/500&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Complete, reproducible, auditable.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The clean number is 81.60%.&lt;/p&gt;
&lt;p&gt;I would rather publish the lower number than train myself to trust a flattering one.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Where This Breaks&lt;/h2&gt;
&lt;p&gt;The result is strong enough to matter, but it has sharp boundaries.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Limitation&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;th&gt;What I would do before making a stronger claim&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet judged Sonnet&lt;/td&gt;
&lt;td&gt;Same-model reader/judge can share blind spots&lt;/td&gt;
&lt;td&gt;Re-judge the 500 hypotheses with GPT-4o using the official LongMemEval path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BM25 was the headline retriever&lt;/td&gt;
&lt;td&gt;It is a lexical baseline, not the full Brain stack&lt;/td&gt;
&lt;td&gt;Run QMD hybrid retrieval with lex/vec/HyDE and compare per-category lift&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LongMemEval_s only&lt;/td&gt;
&lt;td&gt;The medium variant is closer to a heavier long-history workload&lt;/td&gt;
&lt;td&gt;Repeat on LongMemEval_m after the harness is stable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No cost/latency table yet&lt;/td&gt;
&lt;td&gt;A memory product has to be useful under real operating budgets&lt;/td&gt;
&lt;td&gt;Add wall-clock, token, and cost estimates per question&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-session remained weak&lt;/td&gt;
&lt;td&gt;The system finds evidence better than it synthesizes across sessions&lt;/td&gt;
&lt;td&gt;Add a two-pass evidence table reader for multi-session questions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Public field is heterogeneous&lt;/td&gt;
&lt;td&gt;Reader and judge models differ across vendor reports, so &quot;rank&quot; hides methodology spread&lt;/td&gt;
&lt;td&gt;Re-judge against the official GPT-4o path before claiming a strict position; until then, frame the result as a band, not a place&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This is why I am framing 81.60% as a credible baseline, not an end-state victory.&lt;/p&gt;
&lt;p&gt;Those limits do not weaken the result. They make the next experiment obvious.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What I Learned&lt;/h2&gt;
&lt;p&gt;The first lesson is that a memory benchmark is mostly a systems benchmark.&lt;/p&gt;
&lt;p&gt;The actual QA line is short. The hard parts are resumption, error tagging, batch boundaries, judge reproducibility, output hygiene, and preventing a partial run from masquerading as a score.&lt;/p&gt;
&lt;p&gt;The second lesson is that retrieval recall is an upper-bound hint, not the outcome. Brain found the right evidence far more often than it answered correctly. That means the next improvements should not only be &quot;better search.&quot; They should improve how evidence is structured for the reader.&lt;/p&gt;
&lt;p&gt;The third lesson is that multi-session questions are the real test.&lt;/p&gt;
&lt;p&gt;Single-session recall is table stakes. The important behavior is synthesis — session A says one thing, session B updates it, session C implies a preference, session D gives the timestamp, and the question asks for the current answer.&lt;/p&gt;
&lt;p&gt;BM25 can surface pieces of that chain. It does not naturally build the chain.&lt;/p&gt;
&lt;p&gt;That points to the next architecture:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/longmemeval-deck/slide-13.png&quot; alt=&quot;The next architecture targets the multi-session synthesis weakness — hybrid retrieval + two-pass reader extracting one-line facts before synthesis&quot; /&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Lever&lt;/th&gt;
&lt;th&gt;Expected role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hybrid retrieval&lt;/td&gt;
&lt;td&gt;Improve semantic recall where BM25 misses vocabulary.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Date-aware reranking&lt;/td&gt;
&lt;td&gt;Improve temporal questions by respecting event and session time.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-session two-pass reader&lt;/td&gt;
&lt;td&gt;Summarize evidence per session before synthesis.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stronger independent judge&lt;/td&gt;
&lt;td&gt;Reduce same-model reader/judge bias.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o re-judge&lt;/td&gt;
&lt;td&gt;Make the public number more comparable to published LongMemEval rows.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Hybrid retrieval is the obvious next retrieval lever because QMD already exposes lexical, vector, and &lt;a href=&quot;https://aclanthology.org/2023.acl-long.99/&quot;&gt;HyDE&lt;/a&gt; query modes. But the category table says retrieval alone will not be enough. The product needs a better evidence representation for multi-session synthesis.&lt;/p&gt;
&lt;p&gt;The biggest near-term improvement is probably the two-pass reader: retrieve the top 10 sessions, extract one-line facts from each, synthesize the answer from those compact facts, and judge with fixed templates. That changes the reader&apos;s job from &quot;read five noisy sessions and answer&quot; to &quot;reason over a small evidence table.&quot; For multi-session tasks, that is a different problem.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What This Changes About Brain&lt;/h2&gt;
&lt;p&gt;Brain was not built for LongMemEval. It was built so an agent could remember my actual work.&lt;/p&gt;
&lt;p&gt;That is why the 81.60% result matters to me. It says a local-first, markdown-native, BM25-first memory system can already compete with serious memory products on a hard public benchmark, even before the more sophisticated parts of the stack are turned on.&lt;/p&gt;
&lt;p&gt;More importantly, it says the product direction is sane. Brain does not need to make every agent session enormous. It can index the user&apos;s important data once, expose a narrow natural query surface, retrieve cheap evidence slices, and let Claude Code, OpenClaw, and Hermes use that evidence only when they need it.&lt;/p&gt;
&lt;p&gt;But the number also prevents overclaiming. Brain is not done. It is not &quot;solved memory.&quot; It is not at 95%. It still struggles where memory becomes synthesis.&lt;/p&gt;
&lt;p&gt;That is the right kind of result for a product: strong enough to justify the architecture, specific enough to tell me what to fix next.&lt;/p&gt;
&lt;p&gt;The real artifact is not just the score. It is the discipline:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Discipline&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Full 500-question runs&lt;/td&gt;
&lt;td&gt;Avoids cherry-picking.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Separate retrieval and QA metrics&lt;/td&gt;
&lt;td&gt;Prevents recall from masquerading as intelligence.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fixed judge templates&lt;/td&gt;
&lt;td&gt;Keeps improvement honest.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error-free score only&lt;/td&gt;
&lt;td&gt;Makes the number defensible.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-category breakdowns&lt;/td&gt;
&lt;td&gt;Turns the benchmark into a roadmap.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;I started Brain because I wanted Claude Code to stop forgetting my past work.&lt;/p&gt;
&lt;p&gt;I benchmarked Brain because the product I want has a harder requirement: agents should remember without wasting my time, wasting my tokens, or pretending to know what they failed to retrieve.&lt;/p&gt;
&lt;p&gt;The clean score is &lt;strong&gt;81.60%&lt;/strong&gt;. The next target is not a prettier blog headline. It is a faster, cheaper, more natural Brain that makes multi-session memory feel like part of the agent&apos;s normal working environment.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/longmemeval-deck/slide-14.png&quot; alt=&quot;The clean score is 81.60% — the next target is a faster, cheaper, more natural Brain&quot; /&gt;&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Links Worth Following&lt;/h2&gt;
&lt;p&gt;If you want to inspect the concepts behind this benchmark, these are the links I would start with:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Topic&lt;/th&gt;
&lt;th&gt;Link&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LongMemEval paper&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://arxiv.org/abs/2410.10813&quot;&gt;arXiv 2410.10813&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Defines the benchmark and long-term memory task shape.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LongMemEval code/data format&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://github.com/xiaowu0162/LongMemEval&quot;&gt;GitHub: xiaowu0162/LongMemEval&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Shows dataset files, evaluation scripts, and question types.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cleaned LongMemEval data&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned&quot;&gt;Hugging Face dataset&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;The cleaned files behind the run.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Supermemory comparison&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://supermemory.ai/research/&quot;&gt;Supermemory research&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Per-category breakdown — the only public apples-to-apples shape comparison.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mastra leaderboard&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://mastra.ai/research/observational-memory&quot;&gt;Observational Memory&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;The most-cited public LongMemEval scoreboard; aggregates Mastra, Hindsight, Emergence, Supermemory, Zep.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QMD&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://github.com/tobi/qmd&quot;&gt;GitHub: tobi/qmd&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Local retrieval substrate Brain builds on.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://modelcontextprotocol.io/&quot;&gt;Model Context Protocol&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Protocol surface for exposing tools/data to agents.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BM25&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://nlp.stanford.edu/IR-book/html/htmledition/okapi-bm25-a-non-binary-model-1.html&quot;&gt;Stanford IR book: Okapi BM25&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;The lexical ranking baseline that got surprisingly far.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HyDE&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://aclanthology.org/2023.acl-long.99/&quot;&gt;ACL Anthology: Hypothetical Document Embeddings&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Useful background for QMD&apos;s hypothetical-document retrieval mode.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code headless mode&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://docs.claude.com/en/docs/claude-code/headless&quot;&gt;Claude Code &lt;code&gt;-p&lt;/code&gt; docs&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;How the Sonnet reader was driven from the harness.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code hooks&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://docs.claude.com/en/docs/claude-code/hooks&quot;&gt;Hooks reference&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;How everyday Brain ingestion connects back to agent sessions.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
</content:encoded><enclosure url="https://sharadja.in/images/blogs/benchmarking-memory/memory-compounding-hero.png" type="image/png" length="0"/><category>AI</category><category>memory</category><category>agents</category><category>LongMemEval</category><category>Claude</category><category>Sonnet</category><category>BM25</category><category>benchmark</category><category>Brain</category><author>Sharad Jain</author></item><item><title>How I Built a Local-First Second Brain for Claude Code, OpenClaw, QMD, and MCP</title><link>https://sharadja.in/blog/local-first-second-brain-claude-code-openclaw-qmd-mcp/</link><guid isPermaLink="true">https://sharadja.in/blog/local-first-second-brain-claude-code-openclaw-qmd-mcp/</guid><description>How I built a local-first second brain for daily recall with Claude Code, OpenClaw, QMD, and MCP, covering ingestion, indexing, embeddings, retrieval, and reranking.</description><pubDate>Wed, 29 Apr 2026 14:00:00 GMT</pubDate><content:encoded>&lt;p&gt;My local QMD index currently spans five collections and 6,751 markdown documents. On the same machine, the runtime still warns that 2,105 of them need embeddings. That is the most honest possible opening for a post about building a second brain: the problem is not whether I have data. The problem is whether the retrieval system over that data is fast, sharp, and trustworthy enough to use every day.&lt;/p&gt;
&lt;p&gt;Here is the live shape of the system right now:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Live state&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;collections&lt;/td&gt;
&lt;td&gt;&lt;code&gt;5&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;indexed docs&lt;/td&gt;
&lt;td&gt;&lt;code&gt;6,751&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;docs still needing embeddings&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2,105&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;default &lt;code&gt;brain ask&lt;/code&gt; path&lt;/td&gt;
&lt;td&gt;BM25 fast path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;measured fast-path latency target&lt;/td&gt;
&lt;td&gt;&lt;code&gt;~340ms p95&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;health model&lt;/td&gt;
&lt;td&gt;telemetry + hourly doctor + tests&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Most &quot;second brain&quot; systems fail at the same layer: they treat memory as a note-taking problem.&lt;/p&gt;
&lt;p&gt;That sounds reasonable until you have to actually use one under load. The real input stream is not curated notes. It is messy operational exhaust: Claude Code sessions, Chrome history, transcripts, raw docs, distilled summaries, wiki pages, shell commands, decisions, rejected approaches, and the half-finished reasoning that never makes it into a notebook.&lt;/p&gt;
&lt;p&gt;The hard problem is not storage. The hard problem is turning that exhaust into a retrieval substrate that stays fast, legible, and useful when an agent or a tired human asks a question like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What did I decide about this auth flow three weeks ago?&lt;/li&gt;
&lt;li&gt;Where did I debug this exact failure?&lt;/li&gt;
&lt;li&gt;What was I reading when this idea showed up?&lt;/li&gt;
&lt;li&gt;Did I already reject this approach, and why?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;==A useful second brain is not a chatbot on top of notes. It is a pipeline.== The pipeline ingests artifacts, normalizes them into stable documents, indexes them, embeds them, retrieves them with multiple search modes, exposes them through a narrow runtime surface, and then measures whether the whole thing is still working.&lt;/p&gt;
&lt;p&gt;This is how mine is built today: &lt;a href=&quot;https://github.com/tobi/qmd&quot;&gt;QMD&lt;/a&gt; as the retrieval substrate, &lt;a href=&quot;https://modelcontextprotocol.io/introduction&quot;&gt;MCP&lt;/a&gt; as the protocol surface, &lt;code&gt;brain.py&lt;/code&gt; as the thin runtime harness, and task surfaces that can serve both Claude Code and OpenClaw. The entire system is local-first, markdown-native, and instrumented enough to tell me when it is drifting.&lt;/p&gt;
&lt;p&gt;If you want the shortest possible description, it is this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;messy artifacts
  -&amp;gt; markdown normalization
  -&amp;gt; partitioned corpus
  -&amp;gt; qmd update (lexical index)
  -&amp;gt; qmd embed (vector index)
  -&amp;gt; lex / vec / hyde retrieval
  -&amp;gt; brain CLI / MCP / skills
  -&amp;gt; telemetry / doctor / acceptance gates
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I call this pattern &lt;strong&gt;retrieval-first memory&lt;/strong&gt;: memory that is optimized around recall quality, boundary clarity, and operational discipline, not around the fantasy that capture alone creates intelligence.&lt;/p&gt;
&lt;p&gt;The post hangs on four reusable ideas:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concept&lt;/th&gt;
&lt;th&gt;Definition&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Retrieval-First Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;optimize the system around recall quality and latency, not around capture volume&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IR for Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;normalize raw artifacts into markdown as an intermediate representation before retrieval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;The Narrowest-Useful Query Rule&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;answer with the cheapest retrieval path that preserves enough precision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Anti-Rot Architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;make the system continuously prove that its docs, telemetry, and runtime still match reality&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr /&gt;
&lt;h2&gt;1. The System Shape&lt;/h2&gt;
&lt;p&gt;The architecture only makes sense if you see the full stack at once.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/second-brain-pipeline/system-shape-03.png&quot; alt=&quot;The memory stack as a compiler-style pipeline: sources flow through normalization into collections, then QMD, then thin surfaces and operations.&quot; /&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Sources
  Claude Code JSONL sessions
  Chrome History SQLite
  raw docs / transcripts / imported research
  distilled artifacts / wiki pages

      |
      v

Normalization
  parsers -&amp;gt; markdown documents with frontmatter

      |
      v

Collections
  brain
  distilled
  kb-wiki
  kb-raw
  chrome-history

      |
      v

QMD
  qmd update -&amp;gt; BM25 / lexical index
  qmd embed  -&amp;gt; vector embeddings
  qmd query  -&amp;gt; hybrid retrieval + rerank
  qmd search -&amp;gt; fast lexical path

      |
      v

Surfaces
  MCP server surface
  brain ask / recent / inbox / explain
  Claude Code skills

      |
      v

Operations
  usage.jsonl
  doctor.sh
  launchd jobs
  acceptance checks
  eval harness
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There are two clocks running through this system, and if you do not separate them, the whole thing becomes annoying fast.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Clock&lt;/th&gt;
&lt;th&gt;Latency budget&lt;/th&gt;
&lt;th&gt;What runs on it&lt;/th&gt;
&lt;th&gt;Why it exists&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;interactive clock&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;sub-second to a few seconds&lt;/td&gt;
&lt;td&gt;&lt;code&gt;brain ask&lt;/code&gt;, &lt;code&gt;brain recent&lt;/code&gt;, &lt;code&gt;brain explain&lt;/code&gt;, Stop-hook indexing, lexical retrieval&lt;/td&gt;
&lt;td&gt;this is the path I have to trust while I am actively working&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;background clock&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;minutes to hours&lt;/td&gt;
&lt;td&gt;distillation, Chrome history ingest, embedding refresh, doctor checks, acceptance reporting&lt;/td&gt;
&lt;td&gt;this is the path that improves the corpus without blocking the work&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The split is deliberate. The Stop hook only does the cheap path: &lt;code&gt;brain index --new --queue&lt;/code&gt; followed by &lt;code&gt;qmd update&lt;/code&gt;. The heavy path lives in the cron/daemon layer: distill pending sessions, refresh browser history, then run &lt;code&gt;qmd embed&lt;/code&gt;. That means the corpus becomes &lt;strong&gt;lexically searchable almost immediately&lt;/strong&gt;, while the semantic layer catches up on the background clock.&lt;/p&gt;
&lt;p&gt;This is the first non-obvious lesson in building a second brain: &lt;strong&gt;freshness and richness should not share the same latency budget.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The stack also has a clean ownership boundary at each layer:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Owns&lt;/th&gt;
&lt;th&gt;Explicitly does not own&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;sources&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;raw facts of what happened&lt;/td&gt;
&lt;td&gt;any opinion about what matters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;normalization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;stable document shape, frontmatter, naming, file boundaries&lt;/td&gt;
&lt;td&gt;ranking, retrieval policy, user-facing judgment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;collections&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;corpus partitioning by information type&lt;/td&gt;
&lt;td&gt;search logic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;QMD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;indexing, embeddings, retrieval modes, MCP serving&lt;/td&gt;
&lt;td&gt;application workflow and product policy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;brain runtime&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;validation, formatting, safety rails, telemetry, capture semantics&lt;/td&gt;
&lt;td&gt;core retrieval engine behavior&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;skills&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;when to recall, when to capture, how to compose memory into larger tasks&lt;/td&gt;
&lt;td&gt;deterministic I/O plumbing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;operations&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;system health, regressions, drift detection&lt;/td&gt;
&lt;td&gt;interactive answer quality directly&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That ownership table is more important than it looks. Most memory systems get mushy because every layer starts leaking into every other one:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the ingestion layer starts doing premature summarization&lt;/li&gt;
&lt;li&gt;the retrieval layer starts making product decisions&lt;/li&gt;
&lt;li&gt;the app layer starts hiding corpus problems behind chat polish&lt;/li&gt;
&lt;li&gt;the operations layer is missing, so drift goes undetected&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I am explicitly trying to avoid that. The shape I want is closer to a compiler pipeline than a note-taking app:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;raw events
  -&amp;gt; normalized documents
  -&amp;gt; partitioned corpus
  -&amp;gt; indexed substrate
  -&amp;gt; surface-specific recall
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each stage should make the next stage easier without pretending to be it.&lt;/p&gt;
&lt;p&gt;There is another way to see the same architecture: as a sequence of lossy and lossless transformations.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Lossless or lossy&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JSONL session -&amp;gt; markdown session&lt;/td&gt;
&lt;td&gt;mostly lossless&lt;/td&gt;
&lt;td&gt;preserves turns, tool traces, project metadata&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;browser history DB -&amp;gt; daily markdown&lt;/td&gt;
&lt;td&gt;selectively lossy&lt;/td&gt;
&lt;td&gt;preserves what is useful for recall, drops browser-internal noise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;session markdown -&amp;gt; distilled artifact&lt;/td&gt;
&lt;td&gt;intentionally lossy&lt;/td&gt;
&lt;td&gt;compresses toward goals, decisions, rejected approaches&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;corpus -&amp;gt; BM25 index&lt;/td&gt;
&lt;td&gt;lossless with respect to text recall&lt;/td&gt;
&lt;td&gt;ideal for exact-match questions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;corpus -&amp;gt; vector embeddings&lt;/td&gt;
&lt;td&gt;lossy semantic projection&lt;/td&gt;
&lt;td&gt;useful for paraphrase, but never authoritative on its own&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That table explains why I keep both raw and distilled layers. Distillation is not a replacement for transcripts. It is a second representation optimized for a different retrieval problem.&lt;/p&gt;
&lt;p&gt;There are four design choices carrying most of the weight here:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Choice&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Markdown as the canonical medium&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;It keeps the corpus inspectable, grep-able, and portable. The brain is not trapped in an opaque app database.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;QMD as a shared retrieval substrate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One engine owns indexing, search modes, and MCP exposure rather than every surface reimplementing retrieval badly.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Thin harness, fat skills&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The runtime stays deterministic and small. Task intelligence lives in markdown skill files and prompts.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Operational anti-rot&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Telemetry, health checks, and acceptance gates prevent &quot;it worked once&quot; from being mistaken for &quot;it is a system.&quot;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This last point matters more than people admit. A personal memory system does not die because indexing is impossible. It dies because the retrieval loop gets fuzzy, slow, stale, or annoying, and then you stop trusting it.&lt;/p&gt;
&lt;p&gt;That leads to the first law:&lt;/p&gt;
&lt;p&gt;==the corpus precedes the interface.==&lt;/p&gt;
&lt;p&gt;If the underlying documents are noisy, unstable, or poorly partitioned, no chat UI will rescue the system.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;2. Ingestion Is Not Capture&lt;/h2&gt;
&lt;p&gt;The input layer is heterogeneous by default. That is not a nuisance. It is the reality the architecture has to respect.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/second-brain-pipeline/markdown-ir-05.png&quot; alt=&quot;Markdown as the canonical intermediate representation for memory, with capture, ingestion, normalization, and distillation compressed into a funnel.&quot; /&gt;
&lt;em&gt;Markdown is the IR layer: the point where messy source formats become stable retrieval documents.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In my current system, the important source classes are:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Native format&lt;/th&gt;
&lt;th&gt;What it contributes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code sessions&lt;/td&gt;
&lt;td&gt;JSONL&lt;/td&gt;
&lt;td&gt;decisions, code discussions, tool traces, debugging history, reasoning context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chrome history&lt;/td&gt;
&lt;td&gt;SQLite -&amp;gt; daily markdown&lt;/td&gt;
&lt;td&gt;activity context, reading trails, visited URLs, search trails&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;raw knowledge artifacts&lt;/td&gt;
&lt;td&gt;markdown files&lt;/td&gt;
&lt;td&gt;imported papers, transcripts, research notes, external source material&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;distilled artifacts&lt;/td&gt;
&lt;td&gt;markdown files&lt;/td&gt;
&lt;td&gt;higher-signal abstractions: goals, decisions, rejected approaches, concepts, tags&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;wiki / synthesized pages&lt;/td&gt;
&lt;td&gt;markdown files&lt;/td&gt;
&lt;td&gt;stable concept pages and cross-document summaries&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The mistake most personal-memory systems make is to call &quot;capture&quot; the same thing as &quot;ingestion.&quot; It is not.&lt;/p&gt;
&lt;p&gt;==Capture is just getting bytes onto disk. Ingestion is turning those bytes into retrievable documents with stable shape.==&lt;/p&gt;
&lt;p&gt;That distinction is sharp enough to formalize:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Question it answers&lt;/th&gt;
&lt;th&gt;Typical failure if you stop there&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;capture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;did the raw event land anywhere?&lt;/td&gt;
&lt;td&gt;yes, but it is trapped in an app database, JSONL transcript, or browser internals&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ingestion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;can I deterministically parse it again later?&lt;/td&gt;
&lt;td&gt;yes, but the output is still inconsistent and awkward to query&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;normalization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;does it now have a stable schema, path, and document boundary?&lt;/td&gt;
&lt;td&gt;yes, but it may still be too noisy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;distillation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;what should survive as compressed knowledge?&lt;/td&gt;
&lt;td&gt;useful, but lossy and not authoritative on its own&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If you collapse those stages together, you lose the ability to reason about quality. You cannot tell whether a retrieval miss came from missing capture, broken parsing, bad document design, or an overly aggressive summary layer.&lt;/p&gt;
&lt;p&gt;That is why the normalization layer matters so much. My session indexer in &lt;code&gt;brain.py&lt;/code&gt; parses raw JSONL transcripts and emits markdown documents with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;frontmatter: &lt;code&gt;session_id&lt;/code&gt;, date, project path, git branch, slug&lt;/li&gt;
&lt;li&gt;user and assistant turns&lt;/li&gt;
&lt;li&gt;tool summaries&lt;/li&gt;
&lt;li&gt;extracted reasoning traces&lt;/li&gt;
&lt;li&gt;stable filenames and paths&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That list sounds simple until you look at what the parser is actually doing.&lt;/p&gt;
&lt;p&gt;For Claude Code sessions, the raw input is not a clean conversation transcript. It is a JSONL event stream with multiple record types and nested content blocks. The parser has to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;scan every line defensively because malformed JSON lines can exist&lt;/li&gt;
&lt;li&gt;track metadata separately from content:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;sessionId&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;cwd&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;gitBranch&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;timestamps&lt;/li&gt;
&lt;li&gt;session slug from &lt;code&gt;system&lt;/code&gt; records&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;preserve user and assistant turns&lt;/li&gt;
&lt;li&gt;skip low-signal or non-text blocks like tool results and images&lt;/li&gt;
&lt;li&gt;extract only the tool inputs that matter for later recall&lt;/li&gt;
&lt;li&gt;preserve reasoning traces separately from final answers&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The code is opinionated about what gets surfaced. Tool inputs worth keeping are things like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Read&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Edit&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Write&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Glob&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Grep&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Bash&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;WebSearch&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;WebFetch&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is not an arbitrary list. It is a retrieval decision made at ingestion time. A future query like &quot;what exact command did I run?&quot; or &quot;where did I grep for this symbol?&quot; depends on those tool summaries existing as text in the normalized document.&lt;/p&gt;
&lt;p&gt;The filename layer matters too. The session indexer uses per-agent prefixes so different sources can coexist in one corpus without stomping each other:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent source&lt;/th&gt;
&lt;th&gt;Output naming strategy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude&lt;/td&gt;
&lt;td&gt;bare stem for back-compat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex&lt;/td&gt;
&lt;td&gt;&lt;code&gt;codex__...&lt;/code&gt; prefix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gemini__...&lt;/code&gt; prefix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cursor&lt;/td&gt;
&lt;td&gt;&lt;code&gt;cursor__...&lt;/code&gt; prefix&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That is a small detail, but it is the kind of small detail that prevents a corpus from rotting as new sources are added.&lt;/p&gt;
&lt;p&gt;Chrome history gets transformed into one markdown file per day, with timestamps, domains, page titles, and search traces. Distillation produces another layer of markdown artifacts that compress a session into what actually matters later: goals, decisions, rejections, files touched, technologies, and concepts.&lt;/p&gt;
&lt;p&gt;Chrome ingestion is a different normalization problem entirely. The raw source is a local SQLite database, not a transcript. The ingest script first copies the browser database to a temp path so Chrome&apos;s file lock does not block reads. Then it joins together multiple tables:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;visits&lt;/li&gt;
&lt;li&gt;URLs&lt;/li&gt;
&lt;li&gt;context annotations&lt;/li&gt;
&lt;li&gt;content annotations&lt;/li&gt;
&lt;li&gt;keyword search terms&lt;/li&gt;
&lt;li&gt;cluster labels&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That joined view is then grouped into one markdown file per day.&lt;/p&gt;
&lt;p&gt;That &quot;one file per day&quot; choice is not just aesthetic. It is the document-boundary answer for browser memory. Sessions want one file per session. Browsing history wants one file per day. Those are different units of recall.&lt;/p&gt;
&lt;p&gt;The Chrome pipeline is also aggressively selective. It applies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;allowlists for productive domains&lt;/li&gt;
&lt;li&gt;suffix-based domain matching for subdomains&lt;/li&gt;
&lt;li&gt;NSFW filtering on URLs, titles, and searches&lt;/li&gt;
&lt;li&gt;de-noising for spammy or injected search terms&lt;/li&gt;
&lt;li&gt;omission of Chrome-internal URLs and local file URLs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is normalization as policy. If you do not make those cuts early, your corpus inherits the browser&apos;s worst qualities: ad noise, accidental clicks, internal URLs, and junk search fragments.&lt;/p&gt;
&lt;p&gt;This is why I think of normalization as a &lt;strong&gt;schema design problem&lt;/strong&gt;, not a file conversion problem.&lt;/p&gt;
&lt;p&gt;You are deciding, for each source:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;Example in this system&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;document boundary&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;one session per file, one browser day per file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;stable identity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;session stem, agent prefix, date path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;metadata contract&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;frontmatter fields that will exist everywhere for that source&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;signal filter&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;which tool calls, URLs, titles, searches, and blocks are worth preserving&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;path semantics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;where in the corpus this source will live so later retrieval can reason about it&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Once you see ingestion that way, a lot of second-brain systems start looking suspiciously under-specified. They say &quot;we ingest everything,&quot; but they do not define:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;what a document is&lt;/li&gt;
&lt;li&gt;what the stable key is&lt;/li&gt;
&lt;li&gt;what gets dropped&lt;/li&gt;
&lt;li&gt;what fields are guaranteed&lt;/li&gt;
&lt;li&gt;how two source types differ structurally&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Without that, retrieval quality becomes accidental.&lt;/p&gt;
&lt;p&gt;I think of markdown here as &lt;strong&gt;IR for memory&lt;/strong&gt;: an intermediate representation between raw event logs and retrieval. Not because markdown is glamorous. Because it is inspectable, versionable, and composable.&lt;/p&gt;
&lt;p&gt;And like any good IR, it should satisfy a few properties:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;human-readable&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;I can inspect bad outputs directly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;append-friendly&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;new artifacts can land without schema migrations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;stable enough for indexing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;BM25 and embedding layers need predictable text structure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;rich enough for provenance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;source, date, project, and session identity must survive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;cheap to diff&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;regressions in parsers or distillers need to be visible in git or plain text&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That one design choice buys a lot:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;QMD indexes it natively.&lt;/li&gt;
&lt;li&gt;agents can quote or retrieve from it directly.&lt;/li&gt;
&lt;li&gt;I can grep it when retrieval fails.&lt;/li&gt;
&lt;li&gt;I can diff it when distillation goes bad.&lt;/li&gt;
&lt;li&gt;I can move collections around without migrating a proprietary store.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There is also a deeper benefit: markdown keeps the memory substrate debuggable by the same tools I already trust for code. &lt;code&gt;rg&lt;/code&gt;, &lt;code&gt;sed&lt;/code&gt;, &lt;code&gt;awk&lt;/code&gt;, &lt;code&gt;git diff&lt;/code&gt;, filesystem walks, and plain editors all still work. That sounds almost trivial until you compare it to agent-memory systems that immediately disappear behind a vector DB, a hosted API, or an opaque &quot;memory sync&quot; abstraction.&lt;/p&gt;
&lt;p&gt;This is also why I do not treat the second brain as &quot;an app.&quot; The durable asset is the corpus, not the UI.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;3. Collections, Indexing, and Embeddings&lt;/h2&gt;
&lt;p&gt;Once the corpus is normalized, the next question is how to split it so retrieval does not collapse into an undifferentiated soup.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/second-brain-pipeline/collections-by-role-06.png&quot; alt=&quot;A partitioned corpus diagram showing brain, distilled, kb-wiki, kb-raw, and chrome-history as separate retrieval roles rather than one shared pool.&quot; /&gt;
&lt;em&gt;A corpus becomes useful when artifact classes compete by retrieval role instead of collapsing into one giant pool.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;My live QMD config currently registers five collections:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Collection&lt;/th&gt;
&lt;th&gt;Path role&lt;/th&gt;
&lt;th&gt;Retrieval role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;brain&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;raw indexed sessions&lt;/td&gt;
&lt;td&gt;high-recall verbatim memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;distilled&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;dense LLM-generated artifacts&lt;/td&gt;
&lt;td&gt;semantic compression of prior work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kb-wiki&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;synthesized wiki pages&lt;/td&gt;
&lt;td&gt;stable high-level concepts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kb-raw&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;raw articles and transcripts&lt;/td&gt;
&lt;td&gt;source-level grounding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;chrome-history&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;browsing logs&lt;/td&gt;
&lt;td&gt;behavioral and temporal context&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;At the moment of writing, the live file counts look like this:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Collection&lt;/th&gt;
&lt;th&gt;Live file count&lt;/th&gt;
&lt;th&gt;What that count implies&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;brain&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;4,470&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;the largest, noisiest, and most lossless layer dominates raw recall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;distilled&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;778&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;much smaller, denser, and more semantic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kb-wiki&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;216&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;slow-moving synthesized knowledge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kb-raw&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1,846&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;long-tail source grounding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;chrome-history&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;86&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;low-count but high-temporal-value context&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Those counts are not just scale metrics. They are retrieval-shape metrics. A corpus dominated by raw transcripts behaves differently from one dominated by polished notes, even if both use the same engine.&lt;/p&gt;
&lt;p&gt;That split is not cosmetic. It is what allows retrieval to preserve the difference between:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;exact prior transcript recall&lt;/li&gt;
&lt;li&gt;compressed lessons&lt;/li&gt;
&lt;li&gt;source documents&lt;/li&gt;
&lt;li&gt;browsing exhaust&lt;/li&gt;
&lt;li&gt;stable knowledge pages&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Collection boundaries answer three questions at once:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;what kind of artifact is this?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;transcript, distillation, wiki page, source text, or behavioral exhaust&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;how should this artifact compete?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;a raw session should not rank the same way as a distilled decision memo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;what kind of recall is this layer good at?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;exact-match, semantic, provenance-heavy, or recency-oriented&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Without that partitioning, retrieval becomes a false democracy where every file fights in the same pool even though the documents were created for different jobs.&lt;/p&gt;
&lt;p&gt;QMD is the retrieval substrate sitting under all of this. The repo describes it as an on-device search engine for markdown knowledge bases that combines BM25 full-text search, vector semantic search, and local reranking &lt;a href=&quot;https://github.com/tobi/qmd&quot;&gt;QMD&lt;/a&gt;. In practice, my pipeline uses it in two phases:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;1. qmd update
   -&amp;gt; refreshes the lexical / BM25 index

2. qmd embed
   -&amp;gt; refreshes vector embeddings
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Those two commands are easy to say and easy to blur, but they are not the same freshness guarantee.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;What it refreshes&lt;/th&gt;
&lt;th&gt;Operational meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;qmd update&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;lexical / BM25 visibility of changed files&lt;/td&gt;
&lt;td&gt;the corpus is text-searchable again&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;qmd embed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;semantic vector representation&lt;/td&gt;
&lt;td&gt;embedding-based retrieval can now see the new or changed material&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That means &quot;the index is fresh&quot; is actually two claims:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;lexical freshness&lt;/strong&gt;: newly normalized text can be retrieved at all&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;semantic freshness&lt;/strong&gt;: embedding-based retrieval paths know about that text too&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In my stack, lexical freshness has the stricter SLO. That is why the cheap path runs in the Stop hook and the richer path runs on the background clock.&lt;/p&gt;
&lt;p&gt;The automation loop reflects that split directly:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;brain index --new --queue
brain distill --from-pending
chrome_history_ingest.py
qmd update
qmd embed
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The timings in the surrounding scripts and docs make the separation concrete:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Budget class in this system&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Stop-hook &lt;code&gt;index --new --queue + qmd update&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&amp;lt; 2s&lt;/code&gt; target so it stays invisible in active work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;async &lt;code&gt;qmd update&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;around &lt;code&gt;~5s&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;async &lt;code&gt;qmd embed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;around &lt;code&gt;~17s&lt;/code&gt; per batch and model-dependent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This is why I keep insisting the second brain is a pipeline. Pipelines have critical paths. Some stages can lag; some cannot.&lt;/p&gt;
&lt;p&gt;That sequencing matters. &lt;strong&gt;You do not embed raw chaos directly.&lt;/strong&gt; You first normalize and organize the corpus, then reindex, then refresh embeddings.&lt;/p&gt;
&lt;p&gt;It is also worth stating the less fashionable truth: embeddings are not the system. They are one retrieval mode inside the system.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Why it exists&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;lexical index&lt;/td&gt;
&lt;td&gt;exact term / path / command recall&lt;/td&gt;
&lt;td&gt;unbeatable for commands, filenames, literal phrases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;vector index&lt;/td&gt;
&lt;td&gt;semantic proximity&lt;/td&gt;
&lt;td&gt;useful for paraphrase and conceptual search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;rerank stage&lt;/td&gt;
&lt;td&gt;candidate ordering&lt;/td&gt;
&lt;td&gt;helps separate &quot;technically related&quot; from &quot;actually relevant&quot;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Those layers fail differently, which is exactly why I do not want to collapse them into one magical &quot;search&quot; box.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure shape&lt;/th&gt;
&lt;th&gt;Typical cause&lt;/th&gt;
&lt;th&gt;Better fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;exact phrase exists but is not surfaced&lt;/td&gt;
&lt;td&gt;lexical ranking or collection scope is weak&lt;/td&gt;
&lt;td&gt;BM25 tuning or narrower corpus partition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;conceptually related but wrong answer ranks high&lt;/td&gt;
&lt;td&gt;semantic neighborhood is too broad&lt;/td&gt;
&lt;td&gt;reranking or a more constrained query path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;semantically useful hit is missing&lt;/td&gt;
&lt;td&gt;embeddings are stale or the semantic layer is too thin&lt;/td&gt;
&lt;td&gt;&lt;code&gt;qmd embed&lt;/code&gt;, HyDE, or stronger distillation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;answer exists but is buried in huge transcripts&lt;/td&gt;
&lt;td&gt;raw layer is too lossless for the query&lt;/td&gt;
&lt;td&gt;lean on distilled artifacts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;There is a reason the runtime still defaults &lt;code&gt;brain ask&lt;/code&gt; to the BM25 fast path rather than always using a hybrid query. In the live implementation, the lexical path is dramatically cheaper and faster. The code explicitly describes the wedge as a BM25-first path with p95 around &lt;code&gt;~340ms&lt;/code&gt;, while the fuller hybrid query and rerank path is slower and reserved for later selection logic.&lt;/p&gt;
&lt;p&gt;That is not a compromise. It is a design judgment that also lines up with Anthropic&apos;s public guidance on agents: prefer simple, composable patterns first, and only add complexity when measurement says the simple path is insufficient &lt;a href=&quot;https://www.anthropic.com/engineering/building-effective-agents&quot;&gt;Anthropic&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The important nuance is this: the default is not &quot;BM25 because vectors are bad.&quot; The default is &quot;BM25 because defaults are about reliability under real latency budgets.&quot;&lt;/p&gt;
&lt;p&gt;There is also a corpus-design reason the lexical path works better here than people might expect. The normalized documents are already shaped around:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;sessions&lt;/li&gt;
&lt;li&gt;decisions&lt;/li&gt;
&lt;li&gt;tool traces&lt;/li&gt;
&lt;li&gt;dates&lt;/li&gt;
&lt;li&gt;projects&lt;/li&gt;
&lt;li&gt;concepts&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;BM25 is not operating over random sludge. It is operating over documents deliberately engineered to make lexical recall useful.&lt;/p&gt;
&lt;p&gt;That becomes the second law:&lt;/p&gt;
&lt;p&gt;==retrieval quality is constrained by latency budgets as much as by embeddings.==&lt;/p&gt;
&lt;p&gt;If a memory system is semantically elegant but too slow for habitual use, it has failed.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;4. Retrieval, Search, and Reranking&lt;/h2&gt;
&lt;p&gt;This is the layer people hand-wave most often. &quot;We use hybrid search&quot; is not an architecture. It is a slogan.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/second-brain-pipeline/narrowest-useful-query-08.png&quot; alt=&quot;A query-routing diagram showing literal, conceptual, and chronological questions flowing to lexical, vector or HyDE, and time-walk retrieval paths.&quot; /&gt;
&lt;em&gt;The narrowest-useful-query rule in diagram form: route the question first, then rank inside the cheapest path that preserves the answer.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The retrieval stack here is more usefully understood as a ladder:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Best query shape&lt;/th&gt;
&lt;th&gt;Failure mode&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;lex&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;exact terms, commands, file paths, literal errors&lt;/td&gt;
&lt;td&gt;misses conceptual paraphrases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vec&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;semantic recall, paraphrases, concept-level search&lt;/td&gt;
&lt;td&gt;may retrieve vaguely related but wrong material&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hyde&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&quot;a session where we...&quot; style memory prompts&lt;/td&gt;
&lt;td&gt;can be powerful, but easier to overfire or drift&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;rerank&lt;/td&gt;
&lt;td&gt;sort promising candidates&lt;/td&gt;
&lt;td&gt;helps precision, but costs latency&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;What matters in practice is not just which modes exist. It is &lt;strong&gt;how the runtime chooses among them, constrains them, and recovers when they fail.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;QMD exposes all three retrieval modes, and the official docs position MCP as the standard way for AI applications to connect to external systems like files, tools, and workflows &lt;a href=&quot;https://modelcontextprotocol.io/introduction&quot;&gt;MCP&lt;/a&gt;. The practical consequence is that the same markdown corpus can be queried either:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;locally via &lt;code&gt;qmd&lt;/code&gt; commands&lt;/li&gt;
&lt;li&gt;through the MCP server surface&lt;/li&gt;
&lt;li&gt;or via a thin runtime wrapper like &lt;code&gt;brain ask&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That layering is the whole point.&lt;/p&gt;
&lt;p&gt;The fastest path in my stack today is not &quot;semantic everything.&quot; It is:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;query
  -&amp;gt; validate and sanitize
  -&amp;gt; qmd search --json
  -&amp;gt; take top lexical candidates
  -&amp;gt; attach freshness metadata
  -&amp;gt; format for terminal or agent surface
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That simplified path hides a real sequence of policy decisions in the runtime:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;What the runtime is actually doing&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;query intake&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;join free-text args into one query string&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;validation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;reject dangerous shell-shaped input like raw &lt;code&gt;;&lt;/code&gt;, backticks, or &lt;code&gt;$(&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;bounded search&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;call &lt;code&gt;qmd search --json&lt;/code&gt; with a candidate limit larger than final top-k&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;candidate shaping&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;parse JSON hits, resolve paths, attach age/freshness metadata&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;filtering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;optionally cut by age via &lt;code&gt;--since-days&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;surface formatting&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;terminal-friendly prose or machine-readable JSON envelope&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;telemetry&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;log query length, latency, surfaced paths, hit count, and surface&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That is why I treat retrieval policy as a first-class surface. The engine may know how to search, but the runtime decides what &quot;a safe, useful answer&quot; looks like.&lt;/p&gt;
&lt;p&gt;That is enough to answer a surprising fraction of memory questions, especially when the corpus is already structured around sessions, decisions, and tool traces.&lt;/p&gt;
&lt;p&gt;Why not just use semantic search for everything?&lt;/p&gt;
&lt;p&gt;Because semantic search is not free and not always the right primitive.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Best retrieval mode&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&quot;where did I run &lt;code&gt;qmd embed -f&lt;/code&gt;?&quot;&lt;/td&gt;
&lt;td&gt;lexical&lt;/td&gt;
&lt;td&gt;exact command recall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&quot;what was the session where I chose the CLI wedge?&quot;&lt;/td&gt;
&lt;td&gt;lexical + distilled&lt;/td&gt;
&lt;td&gt;decision phrases are often explicit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&quot;find the session where I was debugging the memory system but did not use the phrase memory system&quot;&lt;/td&gt;
&lt;td&gt;vector or HyDE&lt;/td&gt;
&lt;td&gt;conceptual query, paraphrase-heavy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&quot;what was I doing yesterday?&quot;&lt;/td&gt;
&lt;td&gt;recency walk over collections&lt;/td&gt;
&lt;td&gt;a chronological query, not a semantic one&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That last row matters. Not every memory query is &quot;search&quot; in the same sense. Some are temporal, some are behavioral, some are provenance checks.&lt;/p&gt;
&lt;p&gt;This is why I think of the query layer as a &lt;strong&gt;routing problem before it becomes a ranking problem&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;If the question is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;literal -&amp;gt; prefer lexical&lt;/li&gt;
&lt;li&gt;conceptual -&amp;gt; consider semantic / HyDE&lt;/li&gt;
&lt;li&gt;chronological -&amp;gt; walk the corpus by mtime and collection&lt;/li&gt;
&lt;li&gt;provenance-heavy -&amp;gt; preserve path, source, and freshness above fluency&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Too many memory products skip that routing layer and jump straight to &quot;semantic search everywhere.&quot; That is usually just an expensive way to destroy the distinction between query types.&lt;/p&gt;
&lt;p&gt;This is why I use the term &lt;strong&gt;retrieval discipline&lt;/strong&gt;: choosing the narrowest search surface that answers the question without incurring unnecessary latency, token cost, or fuzziness.&lt;/p&gt;
&lt;p&gt;Put differently: &lt;strong&gt;The Narrowest-Useful Query Rule&lt;/strong&gt; says the best retrieval path is the cheapest one that still preserves the answer. &lt;code&gt;grep&lt;/code&gt; beats embeddings when the question is literal. BM25 beats hybrid search when the corpus is already well-shaped and the query is explicit. A chronological walk beats both when the question is temporal.&lt;/p&gt;
&lt;p&gt;The runtime also bakes in a provenance policy. &lt;code&gt;brain ask&lt;/code&gt; does not just print a title and a snippet. It attaches:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the &lt;code&gt;qmd://...&lt;/code&gt; path&lt;/li&gt;
&lt;li&gt;score&lt;/li&gt;
&lt;li&gt;age label like &lt;code&gt;today&lt;/code&gt;, &lt;code&gt;yesterday&lt;/code&gt;, or &lt;code&gt;N days ago&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;a freshness warning when the memory is old enough to be risky&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is retrieval policy doing product work. A raw ranking score is not enough when the underlying artifact may describe code that has already changed.&lt;/p&gt;
&lt;p&gt;There is a second layer of discipline here: error handling is part of retrieval quality, not a separate concern.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure mode&lt;/th&gt;
&lt;th&gt;Runtime behavior&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;bad query&lt;/td&gt;
&lt;td&gt;reject with structured exit code and a repair hint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;qmd&lt;/code&gt; missing&lt;/td&gt;
&lt;td&gt;explicit install / reindex recovery path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;timeout&lt;/td&gt;
&lt;td&gt;emit &lt;code&gt;qmd_timeout&lt;/code&gt;, log telemetry, preserve the failure as data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;invalid JSON from the engine&lt;/td&gt;
&lt;td&gt;fail loudly instead of pretending results are empty&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;no results&lt;/td&gt;
&lt;td&gt;return structured empty response rather than fabricating a summary&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That matters because a second brain can fail in ways that look deceptively intelligent. A silent timeout or malformed engine response is worse than a hard failure if the caller interprets the absence of evidence as &quot;nothing exists.&quot;&lt;/p&gt;
&lt;p&gt;The benchmark harness around this system also reinforces the point that retrieval is not one monolithic number. Even the local evaluation setup distinguishes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;inproc-bm25&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;qmd-lex&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;qmd-hybrid&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is the right shape. If you cannot decompose retrieval into separate modes and measure them independently, you do not really know what your memory layer is good at.&lt;/p&gt;
&lt;p&gt;In other words:&lt;/p&gt;
&lt;p&gt;==retrieval quality is the product of routing, ranking, provenance, and failure semantics together.==&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;5. QMD vs Brain MCP vs Brain CLI&lt;/h2&gt;
&lt;p&gt;These names are easy to blur, so the boundary needs to be explicit.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Responsibility&lt;/th&gt;
&lt;th&gt;What it is not&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;QMD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;indexing, search, collections, embeddings, hybrid retrieval, MCP serving&lt;/td&gt;
&lt;td&gt;not my product logic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;brain MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;the memory corpus exposed through QMD&apos;s MCP surface&lt;/td&gt;
&lt;td&gt;not a separate magical reasoning engine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;brain.py&lt;/code&gt; / brain CLI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;query shaping, safety rails, formatting, telemetry, health introspection, inbox capture&lt;/td&gt;
&lt;td&gt;not the core search engine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;skills&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;task-specific workflows and agent intent&lt;/td&gt;
&lt;td&gt;not the deterministic runtime&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That table is the minimum version. The fuller version is about &lt;strong&gt;surface contracts&lt;/strong&gt;.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Surface&lt;/th&gt;
&lt;th&gt;Input shape&lt;/th&gt;
&lt;th&gt;Output shape&lt;/th&gt;
&lt;th&gt;Primary consumer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;qmd search&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;plain query string&lt;/td&gt;
&lt;td&gt;BM25 JSON or text hits&lt;/td&gt;
&lt;td&gt;fast local runtime paths&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;qmd query&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;expanded or structured &lt;code&gt;lex/vec/hyde&lt;/code&gt; document&lt;/td&gt;
&lt;td&gt;hybrid reranked results&lt;/td&gt;
&lt;td&gt;richer retrieval workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;qmd mcp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;MCP stdio protocol&lt;/td&gt;
&lt;td&gt;tools/resources exposed to an MCP client&lt;/td&gt;
&lt;td&gt;Claude Code or another MCP client&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;brain ask&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CLI args / env vars&lt;/td&gt;
&lt;td&gt;terminal prose or structured JSON envelope&lt;/td&gt;
&lt;td&gt;me, scripts, Claude Code skills&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;brain explain&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;no query, just runtime invocation&lt;/td&gt;
&lt;td&gt;live system state&lt;/td&gt;
&lt;td&gt;debugging, drift detection, operator trust&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;skill invocation&lt;/td&gt;
&lt;td&gt;natural-language task intent&lt;/td&gt;
&lt;td&gt;delegated call into &lt;code&gt;brain.py&lt;/code&gt; or QMD-backed behavior&lt;/td&gt;
&lt;td&gt;agent workflow layer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Once you look at the contracts, the confusion gets easier to resolve. QMD and &lt;code&gt;brain.py&lt;/code&gt; are not competing interfaces. They are adjacent layers in the same stack.&lt;/p&gt;
&lt;p&gt;The cleanest way to say it is:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;QMD owns retrieval.
brain.py owns runtime behavior.
skills own task-level judgment.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I would make the same distinction one level more concretely:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;If you need to...&lt;/th&gt;
&lt;th&gt;The owning layer is...&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;add a new corpus folder&lt;/td&gt;
&lt;td&gt;QMD config / collection layer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;change how recall is formatted for a human or agent&lt;/td&gt;
&lt;td&gt;&lt;code&gt;brain.py&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;change when memory should be consulted inside a workflow&lt;/td&gt;
&lt;td&gt;skill layer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;change how embeddings or hybrid retrieval work&lt;/td&gt;
&lt;td&gt;QMD, not the CLI wrapper&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;explain why the system is broken right now&lt;/td&gt;
&lt;td&gt;&lt;code&gt;brain explain&lt;/code&gt; and the ops layer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This is also why &quot;brain MCP server&quot; is easy to misunderstand. In my actual local setup, the MCP surface is effectively QMD pointed at the brain-owned collections. The runtime layer around it is where I add:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;query validation&lt;/li&gt;
&lt;li&gt;output envelopes&lt;/li&gt;
&lt;li&gt;staleness warnings&lt;/li&gt;
&lt;li&gt;KAIROS-style inbox capture&lt;/li&gt;
&lt;li&gt;usage logging&lt;/li&gt;
&lt;li&gt;doctoring and explainability&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The command line reflects that separation clearly. &lt;code&gt;qmd&lt;/code&gt; itself exposes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;qmd search&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;qmd vsearch&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;qmd query&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;qmd get&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;qmd multi-get&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;qmd mcp&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is the substrate surface.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;brain.py&lt;/code&gt; then exposes a different shape entirely:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;index&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;distill&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;queue&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ask&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;recent&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;inbox&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;explain&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Those are not alternative spellings for the same thing. They are wrapper surfaces around different responsibilities:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;code&gt;brain.py&lt;/code&gt; command&lt;/th&gt;
&lt;th&gt;Type of responsibility&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;index&lt;/code&gt;, &lt;code&gt;distill&lt;/code&gt;, &lt;code&gt;queue&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;corpus production / maintenance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ask&lt;/code&gt;, &lt;code&gt;recent&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;retrieval access with runtime policy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;inbox&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;typed capture into a memory-friendly path scheme&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;explain&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;self-description and operator diagnostics&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;One concrete example: &lt;code&gt;brain ask&lt;/code&gt; adds a memory-age label like &lt;code&gt;today&lt;/code&gt;, &lt;code&gt;yesterday&lt;/code&gt;, or &lt;code&gt;N days ago&lt;/code&gt;, and only emits a freshness warning when a memory is older than a day. That is not indexing. It is runtime policy. It is the kind of detail a memory tool needs if it is going to be trusted by either a person or an agent.&lt;/p&gt;
&lt;p&gt;Another: &lt;code&gt;brain explain&lt;/code&gt; is not retrieval at all. It is the self-diagnosis surface. It reports live state: QMD presence, collection visibility, launchd job status, inbox state, installed skills, telemetry tail. That is how you stop docs from becoming lies.&lt;/p&gt;
&lt;p&gt;That matters because a second brain has two very different kinds of truth:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Truth type&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;corpus truth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;what documents exist and what they contain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;runtime truth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;what is currently installed, indexed, loaded, routed, healthy, and stale&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;QMD mostly owns corpus truth. &lt;code&gt;brain explain&lt;/code&gt; is there to expose runtime truth.&lt;/p&gt;
&lt;p&gt;I think of this as &lt;strong&gt;surface separation&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the substrate should search&lt;/li&gt;
&lt;li&gt;the harness should normalize runtime behavior&lt;/li&gt;
&lt;li&gt;the skill should decide when the memory surface is worth invoking&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The last piece is the skill layer, because this is where many agent-memory systems become conceptually sloppy. A skill is not &quot;more retrieval.&quot; A skill is &lt;strong&gt;activation logic plus task framing&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In my current setup, the installed brain skills carry path-scoped activation rules like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;brain-ask&lt;/code&gt; only activates in &lt;code&gt;~/Projects/**&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;brain-recent&lt;/code&gt; only activates in &lt;code&gt;~/Projects/NOW/**&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;brain-inbox&lt;/code&gt; is unconditional&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That means the skill layer is doing something the search engine should never do: deciding when the memory surface belongs in the conversation at all.&lt;/p&gt;
&lt;p&gt;So the boundary line I care about most in this section is:&lt;/p&gt;
&lt;p&gt;==QMD decides how to search. The runtime decides how to expose. The skill decides when to bother.==&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;6. Thin Harness, Fat Skills&lt;/h2&gt;
&lt;p&gt;Garry Tan&apos;s &quot;thin harness, fat skills&quot; idea lands because it matches what high-functioning agent systems actually need: a small deterministic runtime and a rich task layer expressed in the medium the model already reads well &lt;a href=&quot;https://github.com/garrytan/gbrain/blob/master/docs/ethos/THIN_HARNESS_FAT_SKILLS.md&quot;&gt;Garry Tan&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/second-brain-pipeline/thin-harness-fat-skills-09.png&quot; alt=&quot;A side-by-side diagram contrasting a thin deterministic harness with fat skills that hold task judgment, routing, and procedure.&quot; /&gt;
&lt;em&gt;Push execution down into tooling and judgment up into skills; that is what keeps the runtime narrow enough to trust.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The most important sentence in that whole framing is not &quot;skills are powerful.&quot; It is the more uncomfortable one: &lt;strong&gt;the bottleneck is usually not model intelligence, it is schema understanding.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If the model cannot find the right context, load the right procedure, or distinguish deterministic work from judgment work, a bigger model mostly just fails more fluently.&lt;/p&gt;
&lt;p&gt;The harness in my system stays deliberately narrow:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;parse arguments&lt;/li&gt;
&lt;li&gt;validate inputs&lt;/li&gt;
&lt;li&gt;shell out to QMD safely&lt;/li&gt;
&lt;li&gt;format results&lt;/li&gt;
&lt;li&gt;write telemetry&lt;/li&gt;
&lt;li&gt;expose debug state&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The skills stay fat:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;when to invoke retrieval&lt;/li&gt;
&lt;li&gt;what retrieval mode is implied by user intent&lt;/li&gt;
&lt;li&gt;how to combine memory with a broader task&lt;/li&gt;
&lt;li&gt;what not to save&lt;/li&gt;
&lt;li&gt;how to route capture versus recall&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That thin/fat distinction is easy to repeat and easy to misuse, so I try to define it operationally:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer trait&lt;/th&gt;
&lt;th&gt;&quot;Thin&quot; means...&lt;/th&gt;
&lt;th&gt;&quot;Fat&quot; means...&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;logic density&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;small amount of deterministic branching&lt;/td&gt;
&lt;td&gt;rich procedural and judgment-heavy instructions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;change frequency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;should change rarely and carefully&lt;/td&gt;
&lt;td&gt;can evolve quickly with workflow learning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;failure cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;failures are systemic and should be obvious&lt;/td&gt;
&lt;td&gt;failures are task-local and easier to iterate on&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;best representation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;code, exit codes, I/O contracts, file paths&lt;/td&gt;
&lt;td&gt;markdown procedures, descriptions, heuristics, routing language&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;consumer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;shell, scripts, launchd, other tools, agent wrappers&lt;/td&gt;
&lt;td&gt;the language model itself&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That table is why markdown skills are not an afterthought here. They are the place where I want to put:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;process&lt;/li&gt;
&lt;li&gt;judgment&lt;/li&gt;
&lt;li&gt;activation hints&lt;/li&gt;
&lt;li&gt;scope&lt;/li&gt;
&lt;li&gt;exceptions&lt;/li&gt;
&lt;li&gt;task-specific language&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And it is why I do &lt;strong&gt;not&lt;/strong&gt; want to put those things into the harness unless I absolutely have to.&lt;/p&gt;
&lt;p&gt;That split matters for at least three reasons.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Reason&lt;/th&gt;
&lt;th&gt;Thin harness benefit&lt;/th&gt;
&lt;th&gt;Fat skill benefit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;maintenance&lt;/td&gt;
&lt;td&gt;less code drift in the runtime&lt;/td&gt;
&lt;td&gt;workflow logic evolves without recompiling the system&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;agent ergonomics&lt;/td&gt;
&lt;td&gt;predictable commands and exit codes&lt;/td&gt;
&lt;td&gt;rich behavioral guidance close to the task&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;context hygiene&lt;/td&gt;
&lt;td&gt;fewer abstractions in code&lt;/td&gt;
&lt;td&gt;more judgment in markdown, where the model can actually use it&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;There is also a fourth reason: &lt;strong&gt;debuggability asymmetry&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;If a thin harness fails, I want it to fail in a way that looks like software:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;bad exit code&lt;/li&gt;
&lt;li&gt;timeout&lt;/li&gt;
&lt;li&gt;malformed JSON&lt;/li&gt;
&lt;li&gt;missing binary&lt;/li&gt;
&lt;li&gt;lock contention&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If a fat skill fails, I want it to fail in a way that looks like judgment:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;wrong invocation timing&lt;/li&gt;
&lt;li&gt;over-retrieval&lt;/li&gt;
&lt;li&gt;under-retrieval&lt;/li&gt;
&lt;li&gt;bad decomposition of the task&lt;/li&gt;
&lt;li&gt;poor phrasing of what to capture or recall&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Those two failure classes should not be mixed. If the harness is bloated with task judgment, then every product mistake starts masquerading as an infrastructure bug.&lt;/p&gt;
&lt;p&gt;It also lets the system support multiple surfaces without forking the architecture. A terminal call, an MCP query, and a Claude Code skill can all hit the same retrieval substrate while preserving different surface behaviors.&lt;/p&gt;
&lt;p&gt;In the live stack, you can see that separation directly:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Installed skill&lt;/th&gt;
&lt;th&gt;Scope rule&lt;/th&gt;
&lt;th&gt;Why it belongs in the skill layer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;brain-ask&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;~/Projects/**&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;project-scoped recall is an activation decision, not a search-engine concern&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;brain-recent&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;~/Projects/NOW/**&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&quot;recent&quot; is relevant when the project context itself is active&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;brain-inbox&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;unconditional&lt;/td&gt;
&lt;td&gt;capture should remain globally available&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Those path filters are exactly the sort of thing people are tempted to push downward into the runtime. I think that is a mistake. A path-scoped activation rule is not a retrieval primitive. It is workflow policy.&lt;/p&gt;
&lt;p&gt;That is the key distinction between a usable agent memory system and a pile of plugins:&lt;/p&gt;
&lt;p&gt;==the harness should be boring; the skills should be opinionated.==&lt;/p&gt;
&lt;p&gt;I would rather add a new skill than add a new mini-platform inside the runtime. The moment the harness starts swallowing retrieval strategy, agent policy, user workflow logic, and product opinions, it becomes the wrong kind of thick.&lt;/p&gt;
&lt;p&gt;There is a simple decision rule I use for where new behavior belongs:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;If the new behavior is mostly...&lt;/th&gt;
&lt;th&gt;Put it in...&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;deterministic lookup, validation, or formatting&lt;/td&gt;
&lt;td&gt;the harness&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;natural-language routing, judgment, or task decomposition&lt;/td&gt;
&lt;td&gt;a skill&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;index structure, search mode, or retrieval mechanics&lt;/td&gt;
&lt;td&gt;QMD / substrate layer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Examples make this clearer:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;th&gt;Right layer&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;reject a query containing shell-injection markers&lt;/td&gt;
&lt;td&gt;harness&lt;/td&gt;
&lt;td&gt;deterministic safety check&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;decide that &quot;what was I doing today?&quot; should invoke a recent-activity workflow&lt;/td&gt;
&lt;td&gt;skill&lt;/td&gt;
&lt;td&gt;intent routing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;add &lt;code&gt;today / yesterday / N days ago&lt;/code&gt; freshness labels&lt;/td&gt;
&lt;td&gt;harness&lt;/td&gt;
&lt;td&gt;surface policy with deterministic rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;decide whether this note is worth saving or is just derivable noise&lt;/td&gt;
&lt;td&gt;skill or capture-policy layer&lt;/td&gt;
&lt;td&gt;judgment-heavy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;hybrid &lt;code&gt;lex/vec/hyde&lt;/code&gt; retrieval behavior&lt;/td&gt;
&lt;td&gt;QMD&lt;/td&gt;
&lt;td&gt;engine capability&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This section also answers a more strategic question: why not just make the harness smarter and keep fewer skills?&lt;/p&gt;
&lt;p&gt;Because thick harnesses age badly.&lt;/p&gt;
&lt;p&gt;They accumulate:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;duplicated workflow logic&lt;/li&gt;
&lt;li&gt;hard-to-reason branching&lt;/li&gt;
&lt;li&gt;more hidden behavior per command&lt;/li&gt;
&lt;li&gt;more context assumptions inside code&lt;/li&gt;
&lt;li&gt;more places where agent and operator expectations diverge&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Skills, by contrast, let the system expose its own procedure in the same medium the model reasons over. A good skill is part codebook, part resolver, part operating manual.&lt;/p&gt;
&lt;p&gt;That is the doctrine in one line:&lt;/p&gt;
&lt;p&gt;==push intelligence up into skills, push execution down into deterministic tooling, and keep the harness narrow enough that you can still trust it.==&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;7. Audit Your Own Stack&lt;/h2&gt;
&lt;p&gt;If you want to build this without copying my exact tooling, do not start by designing a beautiful assistant. Start by auditing the memory path you already have.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/second-brain-pipeline/audit-order-10.png&quot; alt=&quot;A five-step audit ladder showing corpus shape, runtime explanation, recent-surface freshness, telemetry, and only then query usefulness.&quot; /&gt;
&lt;em&gt;State inspection comes before anecdotal queries; otherwise one good answer can hide a broken substrate.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Run five checks, in this order:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Check&lt;/th&gt;
&lt;th&gt;What you are looking for&lt;/th&gt;
&lt;th&gt;Failure meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;artifact audit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;what raw sources already exist: sessions, browser history, docs, transcripts, notes&lt;/td&gt;
&lt;td&gt;you may not have a memory problem yet; you may have a capture problem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;normalization audit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;which of those sources already become stable text or markdown documents&lt;/td&gt;
&lt;td&gt;your retrieval substrate does not exist as inspectable documents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;freshness audit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;how quickly changed artifacts become visible lexically and semantically&lt;/td&gt;
&lt;td&gt;your corpus exists, but the runtime is reading a lagging copy of reality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;retrieval audit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;which questions require exact recall, semantic recall, recency, or provenance&lt;/td&gt;
&lt;td&gt;you are overloading one retrieval mode to solve incompatible query classes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;surface audit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;what is the thinnest interface you will actually use every day&lt;/td&gt;
&lt;td&gt;the system may be technically sound but behaviorally dead&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That sequence matters. If you start with the assistant layer, you hide failures from the layers below it. An LLM surface can make a broken substrate look functional for a surprisingly long time.&lt;/p&gt;
&lt;p&gt;On my machine, the fastest useful audit looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;qmd collection list
python3 ~/.brain/brain.py explain
python3 ~/.brain/brain.py recent --since=24h --json
tail -20 ~/.brain/logs/usage.jsonl
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I intentionally left &lt;code&gt;brain ask&lt;/code&gt; out of that first pass here. An audit should start with state inspection, not with an anecdotal query. A single good query can hide a stale index, a polluted recency layer, or dead telemetry. State comes first. Queries come after.&lt;/p&gt;
&lt;h3&gt;7.1 Corpus Inventory&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;qmd collection list&lt;/code&gt; answers the first operator question: what corpus shape am I actually searching?&lt;/p&gt;
&lt;p&gt;On my machine right now, it returns:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Collection&lt;/th&gt;
&lt;th&gt;Files&lt;/th&gt;
&lt;th&gt;Updated&lt;/th&gt;
&lt;th&gt;What I infer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;brain&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;4,470&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;31m ago&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;the raw session layer dominates recall and noise budget&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;distilled&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;778&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;6d ago&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;semantic compression exists, but it is stale relative to the raw layer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kb-wiki&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;216&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1d ago&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;stable concept pages are updating slowly, which is fine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kb-raw&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1,846&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;6h ago&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;source-level grounding is alive and changing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;chrome-history&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;86&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;6d ago&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;temporal browsing context is behind and should not be trusted as fresh&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That one command is already more diagnostic than many dashboards because it exposes three properties simultaneously:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;corpus balance&lt;/strong&gt;: which layer dominates the candidate pool&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;freshness skew&lt;/strong&gt;: which layers are drifting behind others&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;document-boundary sanity&lt;/strong&gt;: whether collection counts move the way the source type should move&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The red flags are specific:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;if &lt;code&gt;brain&lt;/code&gt; is exploding while &lt;code&gt;distilled&lt;/code&gt; stays flat forever, the compression layer is not keeping up&lt;/li&gt;
&lt;li&gt;if &lt;code&gt;chrome-history&lt;/code&gt; has not moved in days, any answer framed as &quot;recently watched&quot; or &quot;recently searched&quot; is suspect&lt;/li&gt;
&lt;li&gt;if a supposedly stable layer swings wildly in count, document boundaries may be wrong&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The point of the collection audit is not &quot;wow, lots of files.&quot; The point is to understand what kind of competition your retrieval engine is about to run.&lt;/p&gt;
&lt;h3&gt;7.2 Runtime Explainability&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;python3 ~/.brain/brain.py explain&lt;/code&gt; answers the second operator question: can the runtime explain its own installation state without me reading code?&lt;/p&gt;
&lt;p&gt;The current output on my machine includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;brain.py&lt;/code&gt; version &lt;code&gt;0.1.0&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Python path and version&lt;/li&gt;
&lt;li&gt;&lt;code&gt;BRAIN_HOME&lt;/code&gt;, &lt;code&gt;BRAIN_INBOX_DIR&lt;/code&gt;, and &lt;code&gt;BRAIN_SURFACE&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;QMD binary resolution&lt;/li&gt;
&lt;li&gt;registered collection names&lt;/li&gt;
&lt;li&gt;today&apos;s inbox path&lt;/li&gt;
&lt;li&gt;telemetry path, mode, and recent events&lt;/li&gt;
&lt;li&gt;launchd job status&lt;/li&gt;
&lt;li&gt;installed Claude Code skills and their scope rules&lt;/li&gt;
&lt;li&gt;last doctor tail&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is not cosmetic introspection. It is a contract test for the runtime&apos;s own assumptions.&lt;/p&gt;
&lt;p&gt;I use &lt;code&gt;explain&lt;/code&gt; to answer five concrete questions:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Example from the current output&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;am I running the right binary?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/Users/sharad/.brain/brain.py&lt;/code&gt; under Python &lt;code&gt;3.14.3&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;am I pointing at the right home and inbox?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;BRAIN_HOME=/Users/sharad/.brain&lt;/code&gt;, inbox under &lt;code&gt;/Users/sharad/.brain/inbox/...&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;is QMD reachable from this environment?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/opt/homebrew/bin/qmd&lt;/code&gt; resolves&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;are the background jobs alive?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;com.brain.doctor&lt;/code&gt; and &lt;code&gt;com.brain.week1&lt;/code&gt; are loaded with &lt;code&gt;last_exit=0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;are the surfaces actually installed?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;brain-ask&lt;/code&gt;, &lt;code&gt;brain-recent&lt;/code&gt;, and &lt;code&gt;brain-inbox&lt;/code&gt; show up with their scope rules&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The subtle but important part is skill scope. If &lt;code&gt;brain-ask&lt;/code&gt; is only active under &lt;code&gt;~/Projects/**&lt;/code&gt; and &lt;code&gt;brain-recent&lt;/code&gt; is only active under &lt;code&gt;~/Projects/NOW/**&lt;/code&gt;, then &quot;the assistant did not use memory&quot; might not be a retrieval failure at all. It might be a surface-activation failure.&lt;/p&gt;
&lt;p&gt;This is why I treat explainability as a production feature. A system that cannot report its own environment, surfaces, and health boundaries forces every failure into source-code debugging.&lt;/p&gt;
&lt;h3&gt;7.3 Freshness and Shape Audit&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;python3 ~/.brain/brain.py recent --since=24h --json&lt;/code&gt; answers a harder question than &quot;is the system alive?&quot;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What kinds of artifacts became visible recently, from which collections, and how contaminated is the recency surface?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The current output shows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;total: 22&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;walk_ms: 54&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;heavy presence of &lt;code&gt;brain&lt;/code&gt; items from benchmark and answer-eval sessions&lt;/li&gt;
&lt;li&gt;recent &lt;code&gt;kb-raw&lt;/code&gt; additions such as:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;claude-code-leak-deep-lifts-2026-04-28&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;gbrain-evals-frameworks-2026-04-28&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is exactly why I prefer a JSON audit path here instead of a pretty human summary. I want to inspect recent shape, not just admire that something came back.&lt;/p&gt;
&lt;p&gt;There are three things I look for in &lt;code&gt;recent&lt;/code&gt; output:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;Healthy interpretation&lt;/th&gt;
&lt;th&gt;Failure interpretation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;walk time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;low double-digit or low triple-digit milliseconds for a 24h scan&lt;/td&gt;
&lt;td&gt;recency is too expensive to use interactively&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;collection mix&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;recent artifacts appear from the layers I expect to be moving&lt;/td&gt;
&lt;td&gt;one ingestion path is dead or one layer is starving all others&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;artifact type quality&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;recent items look like meaningful sessions, docs, or notes&lt;/td&gt;
&lt;td&gt;the surface is polluted by synthetic eval debris, spam, or malformed outputs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This is where many memory systems quietly degrade. The recency surface becomes dominated by whatever pipeline writes the most files, not by what the operator most needs to see. In my case, benchmark-style synthetic sessions can easily crowd out higher-value human work if I do not watch the shape of the recent layer.&lt;/p&gt;
&lt;p&gt;That is why &lt;code&gt;recent&lt;/code&gt; is not just a convenience command. It is a &lt;strong&gt;surface-quality audit&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;7.4 Telemetry Audit&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;tail -20 ~/.brain/logs/usage.jsonl&lt;/code&gt; answers the final operator question: what did the system actually do, and how did it fail under real use?&lt;/p&gt;
&lt;p&gt;The recent telemetry on my machine shows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;repeated &lt;code&gt;doctor_run&lt;/code&gt; events&lt;/li&gt;
&lt;li&gt;one &lt;code&gt;ask&lt;/code&gt; with &lt;code&gt;query_len: 35&lt;/code&gt;, &lt;code&gt;latency_ms: 363&lt;/code&gt;, &lt;code&gt;n_hits: 0&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;one &lt;code&gt;recent&lt;/code&gt; run with &lt;code&gt;walk_ms: 38&lt;/code&gt;, &lt;code&gt;n_total: 24&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;two &lt;code&gt;inbox_write&lt;/code&gt; events&lt;/li&gt;
&lt;li&gt;one &lt;code&gt;doctor_run&lt;/code&gt; failure with &lt;code&gt;map_drift&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;one &lt;code&gt;doctor_run&lt;/code&gt; failure with &lt;code&gt;unittest_failed&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That small tail already tells me five useful things:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Observation&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ask&lt;/code&gt; returned &lt;code&gt;n_hits: 0&lt;/code&gt; in &lt;code&gt;363ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;low latency alone does not prove useful recall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;recent&lt;/code&gt; completed quickly&lt;/td&gt;
&lt;td&gt;the recency walk is currently interactive enough&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;inbox_write&lt;/code&gt; events exist&lt;/td&gt;
&lt;td&gt;the capture path is not dead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;map_drift&lt;/code&gt; occurred once&lt;/td&gt;
&lt;td&gt;docs and code briefly disagreed and the doctor caught it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;unittest_failed&lt;/code&gt; occurred once&lt;/td&gt;
&lt;td&gt;health checks can surface transient or flaky runtime issues before they become folklore&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Telemetry is where the difference between &quot;it demos&quot; and &quot;it operates&quot; becomes obvious. If your usage log only records success, it is not telemetry. It is vanity analytics. The minimum viable memory log should tell you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;which surface was used&lt;/li&gt;
&lt;li&gt;which operation ran&lt;/li&gt;
&lt;li&gt;how long it took&lt;/li&gt;
&lt;li&gt;whether it returned anything&lt;/li&gt;
&lt;li&gt;whether health checks failed&lt;/li&gt;
&lt;li&gt;whether the system is being used at all&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;7.5 Query Audit Comes Last&lt;/h3&gt;
&lt;p&gt;Only after those four state checks do I run an actual retrieval query such as:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;python3 ~/.brain/brain.py ask &quot;qmd embed&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running a query too early confuses diagnosis. A good hit can coexist with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;stale embeddings in another collection&lt;/li&gt;
&lt;li&gt;broken daily-log capture&lt;/li&gt;
&lt;li&gt;dead launchd jobs&lt;/li&gt;
&lt;li&gt;recent-surface pollution&lt;/li&gt;
&lt;li&gt;silent drift between docs and code&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The query audit is where I test user-visible usefulness. It is not where I establish system health.&lt;/p&gt;
&lt;p&gt;So the real audit order is:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;inspect corpus shape&lt;/li&gt;
&lt;li&gt;inspect runtime self-description&lt;/li&gt;
&lt;li&gt;inspect recent-layer freshness and contamination&lt;/li&gt;
&lt;li&gt;inspect telemetry and failures&lt;/li&gt;
&lt;li&gt;only then inspect query usefulness&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;That ordering sounds conservative because it is. Most second-brain projects fail by building the assistant before they have built the corpus.&lt;/p&gt;
&lt;p&gt;If you are building from scratch, the safe order is:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;pick one raw source that already exists&lt;/li&gt;
&lt;li&gt;normalize it to markdown with stable frontmatter&lt;/li&gt;
&lt;li&gt;index it lexically first&lt;/li&gt;
&lt;li&gt;only then add embeddings&lt;/li&gt;
&lt;li&gt;only then add a CLI or agent surface&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If I had to compress the audit logic into one rule, it would be:&lt;/p&gt;
&lt;p&gt;==do not trust answers from a memory system until you can explain its corpus shape, freshness, recent surface, and failure log.==&lt;/p&gt;
&lt;p&gt;And if I had to compress the entire post into one build instruction, it would still be:&lt;/p&gt;
&lt;p&gt;==build the memory substrate first, then earn the right to add the assistant.==&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;8. The Operational Layer&lt;/h2&gt;
&lt;p&gt;This is the part almost nobody includes in their &quot;how I built my memory system&quot; write-up, and it is the part most likely to decide whether the project survives.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/second-brain-pipeline/operational-discipline-11.png&quot; alt=&quot;An operational-discipline diagram connecting usage telemetry, hourly doctor checks, contract tests, and acceptance gates.&quot; /&gt;
&lt;em&gt;The operational layer is what stops a working demo from silently decaying into folklore.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;My operational layer has five pieces:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Piece&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;What it protects against&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;usage.jsonl&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;append-only telemetry for invocations, surfaces, latency, and outcomes&lt;/td&gt;
&lt;td&gt;false confidence from anecdotal success&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;doctor.sh&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;hourly health checks for QMD, CLI health, JSONL integrity, tests, map drift&lt;/td&gt;
&lt;td&gt;silent substrate or runtime decay&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;launchd jobs&lt;/td&gt;
&lt;td&gt;keep checks and acceptance gates firing even when memory is not top of mind&lt;/td&gt;
&lt;td&gt;&quot;I forgot to look, so the system drifted for a week&quot;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;test harness&lt;/td&gt;
&lt;td&gt;verify input validation, timeout behavior, inbox sanitization, concurrency integrity&lt;/td&gt;
&lt;td&gt;regression hiding behind plausible output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;acceptance gate&lt;/td&gt;
&lt;td&gt;decide whether the wedge deserves to live past Week 1&lt;/td&gt;
&lt;td&gt;hobby-project inertia and self-deception&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That last row matters. Monitoring is not enough. A system that only collects health data but never uses it to make continuation or teardown decisions still rots politically even if it is healthy technically.&lt;/p&gt;
&lt;h3&gt;8.1 Telemetry Is the Ground Truth of Use&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;usage.jsonl&lt;/code&gt; is the system&apos;s behavioral ledger. Every meaningful user-facing or doctor-facing action appends a structured row under a file lock.&lt;/p&gt;
&lt;p&gt;The schema is intentionally boring:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;schema&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ts&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;surface&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;event&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;event-specific fields such as:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;latency_ms&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;query_len&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;n_hits&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;walk_ms&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;n_total&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;failures&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;entry_id&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;path&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The boringness is a feature. This file needs to survive shell tools, ad hoc parsing, and future schema evolution.&lt;/p&gt;
&lt;p&gt;Three implementation choices matter more than they look:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Choice&lt;/th&gt;
&lt;th&gt;Why it exists&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;append-only JSONL&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;partial corruption is local to a line, not a whole database page&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;schema: 1&lt;/code&gt; on every row&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;future parsers can distinguish format drift from bad data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;fcntl.flock(LOCK_EX)&lt;/code&gt; around append&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;concurrent invocations do not interleave bytes and create torn writes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That last property is tested directly. The test harness spins up concurrent &lt;code&gt;inbox&lt;/code&gt; writes and verifies that the file contains the expected number of valid JSON rows, not half-records glued together.&lt;/p&gt;
&lt;p&gt;The point of telemetry here is not &quot;analytics.&quot; It is operational truth. When I want to know whether the terminal surface or Claude Code surface actually got used, whether query latency stayed within budget, or whether doctor failures happened while I was not looking, this file is the ground truth.&lt;/p&gt;
&lt;h3&gt;8.2 The Doctor Is an Hourly Contract Test&lt;/h3&gt;
&lt;p&gt;The doctor script does not just check &quot;is the binary there.&quot; It checks whether the system is still the system.&lt;/p&gt;
&lt;p&gt;Today it verifies, at minimum:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Check&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;qmd&lt;/code&gt; is on &lt;code&gt;PATH&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;retrieval substrate still resolves from the runtime environment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;brain.py --version&lt;/code&gt; runs&lt;/td&gt;
&lt;td&gt;the main entrypoint is executable and not obviously broken&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;usage.jsonl&lt;/code&gt; parses line-by-line&lt;/td&gt;
&lt;td&gt;torn writes and schema corruption are caught early&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;usage.jsonl&lt;/code&gt; mode is &lt;code&gt;0o600&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;raw queries and notes are not accidentally world-readable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;wedge test suite passes&lt;/td&gt;
&lt;td&gt;command contract regressions are surfaced within an hour&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MAP.md&lt;/code&gt; citations still resolve&lt;/td&gt;
&lt;td&gt;docs and code did not silently drift apart&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;week-1 acceptance catch-up fires&lt;/td&gt;
&lt;td&gt;the acceptance verdict still runs even if the laptop slept through the scheduled moment&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That is a stronger contract than generic &quot;health checks.&quot; It is not only availability. It is availability plus behavioral invariants plus documentation integrity.&lt;/p&gt;
&lt;p&gt;The doctor is also explicitly written to keep going after the first failure. It does not &lt;code&gt;set -e&lt;/code&gt;. That means one failure does not mask the others. If &lt;code&gt;qmd&lt;/code&gt; is missing and the JSONL file is corrupt, I want both facts in the same pass.&lt;/p&gt;
&lt;p&gt;And the output is not just local logging. On failure, the doctor:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;appends a &lt;code&gt;doctor_run&lt;/code&gt; telemetry row with a &lt;code&gt;failures&lt;/code&gt; array&lt;/li&gt;
&lt;li&gt;writes a timestamped line to &lt;code&gt;/tmp/brain-doctor.log&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;fires a macOS notification so the failure becomes visible within the hour&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is what I mean by &lt;strong&gt;anti-rot architecture&lt;/strong&gt;. The memory system is forced to keep proving that its docs, commands, and operational assumptions still match reality.&lt;/p&gt;
&lt;p&gt;That gives the third law:&lt;/p&gt;
&lt;p&gt;==if your memory system cannot explain its own state, it will eventually lie to you.==&lt;/p&gt;
&lt;h3&gt;8.3 Tests Guard the Wedge, Not the Dream&lt;/h3&gt;
&lt;p&gt;The test suite is deliberately wedge-specific. It is not trying to prove that &quot;memory works&quot; in the abstract. It is trying to prove that the user-facing contract fails in controlled ways.&lt;/p&gt;
&lt;p&gt;The current test categories cover:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test category&lt;/th&gt;
&lt;th&gt;Example invariant&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;query escaping&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;shell-dangerous input is rejected instead of passed through&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;frontmatter sanitization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;inbox payloads cannot smuggle YAML that mutates the stored document&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JSONL concurrency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;simultaneous writes still produce valid telemetry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;refusal guard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;activity-log or code-reference style content can be refused or gated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;structured exit codes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;timeout, missing QMD, no-results, and bad-query states are machine-readable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;schema integrity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;emitted rows contain the required fields and secure file mode&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;explain surface&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;brain.py explain&lt;/code&gt; reports the sections and state the operator depends on&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This is a different posture from product demos. A demo asks, &quot;can it retrieve something?&quot; The wedge tests ask:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;can it reject a poisoned query?&lt;/li&gt;
&lt;li&gt;can it preserve telemetry under concurrency?&lt;/li&gt;
&lt;li&gt;can it fail with the right exit code?&lt;/li&gt;
&lt;li&gt;can it refuse unsafe capture?&lt;/li&gt;
&lt;li&gt;can it keep its explain surface truthful?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is the boundary between an idea and a tool.&lt;/p&gt;
&lt;h3&gt;8.4 Exit Codes Are Part of the API&lt;/h3&gt;
&lt;p&gt;Human-readable stderr is not enough once agents or scripts start using the wedge. The runtime therefore treats exit codes as a first-class contract.&lt;/p&gt;
&lt;p&gt;The current codes are:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Exit code&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Operational use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;success&lt;/td&gt;
&lt;td&gt;command completed with usable output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;64&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;bad query / bad input&lt;/td&gt;
&lt;td&gt;caller should fix arguments, not retry blindly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;65&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;QMD missing&lt;/td&gt;
&lt;td&gt;substrate or environment issue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;66&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;QMD timeout&lt;/td&gt;
&lt;td&gt;caller can retry with a larger timeout or repair embeddings / index path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;67&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;no results&lt;/td&gt;
&lt;td&gt;absence is explicit, not conflated with failure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;70&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;lock contention&lt;/td&gt;
&lt;td&gt;shared-state write path is contested&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;71&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;refused / gated&lt;/td&gt;
&lt;td&gt;policy refusal, not execution failure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;99&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;internal error&lt;/td&gt;
&lt;td&gt;unexpected runtime failure&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That separation matters because &quot;no results&quot; and &quot;QMD timed out&quot; are not the same operational event even if both would look like &quot;nothing useful came back&quot; in a naive chat surface.&lt;/p&gt;
&lt;p&gt;The same principle shows up in JSON error payloads. For machine consumers, the runtime emits structured objects like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;error: qmd_timeout&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;error: qmd_missing&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;error: refused&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;with a &lt;code&gt;fix&lt;/code&gt; hint where appropriate.&lt;/p&gt;
&lt;p&gt;That means the harness is not only executing retrieval. It is shaping failure into something both humans and agents can route on.&lt;/p&gt;
&lt;h3&gt;8.5 The Scheduler Owns the Boring Reliability&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;launchd&lt;/code&gt; is not glamorous, but it is the reason the operational layer is not aspirational.&lt;/p&gt;
&lt;p&gt;In this setup, schedulers own two kinds of work:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scheduled responsibility&lt;/th&gt;
&lt;th&gt;Why it is scheduled instead of manual&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;hourly doctor passes&lt;/td&gt;
&lt;td&gt;health only matters if it keeps running when I forget&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;week-1 acceptance check&lt;/td&gt;
&lt;td&gt;the verdict must fire even if I do not remember the date&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;There is an important detail here: the week-1 acceptance script is also called from doctor as a catch-up path. If the laptop is asleep when the scheduled acceptance time passes, the next doctor run still gives the verdict a chance to self-fire. That is not complexity for its own sake. It is resilience against the boring realities of an intermittently-on laptop.&lt;/p&gt;
&lt;p&gt;This is the scheduler principle I keep coming back to:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;any check that only works when I remember to run it is not part of the architecture yet.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;8.6 Acceptance Gates Prevent Romantic Attachment&lt;/h3&gt;
&lt;p&gt;The final operational layer is not technical at all. It is decision discipline.&lt;/p&gt;
&lt;p&gt;The Week-1 gate exists to answer questions that pure health checks cannot:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;did the three subcommands get used enough to matter?&lt;/li&gt;
&lt;li&gt;did latency stay within the budget?&lt;/li&gt;
&lt;li&gt;did the Claude Code surface actually earn its complexity?&lt;/li&gt;
&lt;li&gt;is this becoming habit, or am I manually propping it up because I want the project to be true?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is why the plan includes explicit continuation, pivot, and teardown logic. If the usage pattern does not justify the wedge, the correct move is not &quot;keep polishing.&quot; The correct move is to shut down the experiment or change the surface.&lt;/p&gt;
&lt;p&gt;There is a deep reason this belongs in the same section as telemetry and health checks. A memory system can be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;technically healthy&lt;/li&gt;
&lt;li&gt;retrieval-correct&lt;/li&gt;
&lt;li&gt;operationally stable&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;and still not deserve to exist as a product surface.&lt;/p&gt;
&lt;p&gt;Operational discipline, then, has two layers:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;runtime health&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;is the system still behaving correctly?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;product health&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;is the system earning continued attention through actual use?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The deeper reason this matters is that memory systems are uniquely vulnerable to false confidence.&lt;/p&gt;
&lt;p&gt;If a web app breaks, you see the broken page.&lt;/p&gt;
&lt;p&gt;If a memory system breaks, you get something worse:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;incomplete recall that looks plausible&lt;/li&gt;
&lt;li&gt;stale documents treated as current truth&lt;/li&gt;
&lt;li&gt;silently skipped embeddings&lt;/li&gt;
&lt;li&gt;malformed telemetry that kills the evaluation loop&lt;/li&gt;
&lt;li&gt;prompt-injected source material quoted back as if it were trustworthy&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is why the runtime has structured exit codes, timeouts, flocked telemetry writes, and explicit error surfaces. This is not just tooling polish. It is part of the product contract.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;9. Where This Breaks&lt;/h2&gt;
&lt;p&gt;Every memory system has failure modes. If it does not, it is either trivial or lying.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/second-brain-pipeline/failure-lies-12.png&quot; alt=&quot;A failure taxonomy showing deceptive intelligent lies such as freshness asymmetry, temporal lies, compression drift, and misleadingly fluent failures.&quot; /&gt;
&lt;em&gt;The dangerous failures are the ones that look intelligent: partial truth, stale truth, or compressed truth presented with confidence.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I find it more useful to classify the breaks by &lt;em&gt;what kind of lie they produce&lt;/em&gt;.&lt;/p&gt;
&lt;h3&gt;9.1 Freshness Asymmetry&lt;/h3&gt;
&lt;p&gt;The first break is not &quot;stale data&quot; in the abstract. It is &lt;strong&gt;asymmetric freshness&lt;/strong&gt; across layers.&lt;/p&gt;
&lt;p&gt;In this system, lexical visibility and semantic visibility are different clocks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;qmd update&lt;/code&gt; makes new text searchable&lt;/li&gt;
&lt;li&gt;&lt;code&gt;qmd embed&lt;/code&gt; makes that text semantically retrievable&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When those clocks diverge, the system can be fresh in one mode and stale in another. That is a worse failure than being uniformly stale because it is harder to notice.&lt;/p&gt;
&lt;p&gt;Symptoms look like this:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely underlying break&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;literal query works, paraphrase query misses&lt;/td&gt;
&lt;td&gt;embeddings are lagging behind lexical indexing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;recent session appears in &lt;code&gt;qmd search&lt;/code&gt;, but not in semantic routes&lt;/td&gt;
&lt;td&gt;vector layer is stale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;one collection feels &quot;invisible&quot; in semantic recall&lt;/td&gt;
&lt;td&gt;its embedding refresh path has stalled&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This is why I do not treat &quot;the index is up to date&quot; as a single boolean. It is at least two booleans:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;BM25 fresh?&lt;/li&gt;
&lt;li&gt;vectors fresh?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The live stack already shows why this matters. &lt;code&gt;brain ask &quot;qmd embed&quot; --json&lt;/code&gt; returns quickly with hits from the raw &lt;code&gt;brain&lt;/code&gt; layer, while recent collection status still shows freshness skew across collections. Fast answers can therefore coexist with uneven substrate freshness.&lt;/p&gt;
&lt;h3&gt;9.2 Temporal Lies&lt;/h3&gt;
&lt;p&gt;A memory hit is not live truth. It is a timestamped observation.&lt;/p&gt;
&lt;p&gt;That sounds obvious until you see how easy it is for retrieval to erase time. Once a snippet is extracted and shown in a fresh terminal output, it psychologically feels current even if it came from a week-old session that is already obsolete.&lt;/p&gt;
&lt;p&gt;This is why the runtime attaches:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;age labels such as &lt;code&gt;today&lt;/code&gt;, &lt;code&gt;yesterday&lt;/code&gt;, or &lt;code&gt;N days ago&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;freshness warnings only when the age passes the noise threshold&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The failure mode here is not just stale data. It is &lt;strong&gt;stale data presented with fresh confidence&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Typical examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a retrieved design note describes an architecture that has since changed&lt;/li&gt;
&lt;li&gt;a remembered command still appears valid even though the CLI flags drifted&lt;/li&gt;
&lt;li&gt;browser history suggests &quot;recent interest&quot; even though that collection has not ingested in days&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Time metadata is therefore not decoration. It is part of truthfulness.&lt;/p&gt;
&lt;h3&gt;9.3 Untrusted Context&lt;/h3&gt;
&lt;p&gt;A memory corpus is full of text that did not originate as careful internal knowledge.&lt;/p&gt;
&lt;p&gt;It includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;browser titles&lt;/li&gt;
&lt;li&gt;search queries&lt;/li&gt;
&lt;li&gt;pasted snippets&lt;/li&gt;
&lt;li&gt;external articles&lt;/li&gt;
&lt;li&gt;LLM-generated summaries&lt;/li&gt;
&lt;li&gt;malformed or manipulative source text&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you pipe that material into a larger agent loop without containment, the retrieval layer becomes a prompt-injection transport.&lt;/p&gt;
&lt;p&gt;The system already defends one narrow slice of this problem on the write path:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;inbox frontmatter is sanitized&lt;/li&gt;
&lt;li&gt;certain derivable or policy-problematic captures are refused or gated&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But retrieval-side trust is harder. A snippet can be perfectly well indexed and still be unsafe to obey. The correct posture is:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Retrieved text type&lt;/th&gt;
&lt;th&gt;Trust level&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;your own structured note&lt;/td&gt;
&lt;td&gt;low-to-medium trust&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;raw session text&lt;/td&gt;
&lt;td&gt;medium provenance, low semantic cleanliness&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;browser/page title&lt;/td&gt;
&lt;td&gt;low trust&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;external article text&lt;/td&gt;
&lt;td&gt;low trust unless re-verified&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;distilled summary&lt;/td&gt;
&lt;td&gt;medium trust, but lossy&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The system can surface context. It cannot magically upgrade that context into truth.&lt;/p&gt;
&lt;h3&gt;9.4 Compression Drift&lt;/h3&gt;
&lt;p&gt;Distillation solves one problem by creating another.&lt;/p&gt;
&lt;p&gt;It solves:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;transcript sprawl&lt;/li&gt;
&lt;li&gt;low-signal repetition&lt;/li&gt;
&lt;li&gt;hard-to-query verbosity&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But it creates:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;summary bias&lt;/li&gt;
&lt;li&gt;concept flattening&lt;/li&gt;
&lt;li&gt;omission of rejected alternatives&lt;/li&gt;
&lt;li&gt;phrasing lock-in around the distiller&apos;s wording&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is why I do not think raw and distilled layers are alternatives. They are adversaries. Each exists partly to keep the other honest.&lt;/p&gt;
&lt;p&gt;The failure pattern is subtle:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;If you lean too hard on...&lt;/th&gt;
&lt;th&gt;You get...&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;raw transcripts&lt;/td&gt;
&lt;td&gt;high recall, high noise, poor conceptual compression&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;distilled artifacts&lt;/td&gt;
&lt;td&gt;semantic clarity, but higher risk of over-smoothing or omission&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;When the distillation layer drifts, retrieval starts converging on the same summary language repeatedly, even when the source material contained uncertainty or conflict. That is a semantic narrowing failure, not just a summarization flaw.&lt;/p&gt;
&lt;h3&gt;9.5 Surface Contamination&lt;/h3&gt;
&lt;p&gt;A memory layer can be healthy at the file level and still become unhealthy at the surface level.&lt;/p&gt;
&lt;p&gt;I saw that directly in the recency audit. The last 24 hours of &lt;code&gt;recent --json&lt;/code&gt; were heavy with benchmark and answer-eval style synthetic sessions in the &lt;code&gt;brain&lt;/code&gt; collection. Those files are real. They belong in the corpus. But if they dominate the recency surface, the surface stops reflecting what I most need to remember.&lt;/p&gt;
&lt;p&gt;This is a different class of failure from bad indexing. The documents are there. The retrieval engine works. The surface still becomes misleading because the wrong artifact class is winning the competition.&lt;/p&gt;
&lt;p&gt;Surface contamination typically appears as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;synthetic eval sessions crowding out normal work&lt;/li&gt;
&lt;li&gt;bulk-ingested external content overwhelming personal notes&lt;/li&gt;
&lt;li&gt;noisy browsing exhaust overwhelming stable project memory&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is why I audit not only correctness, but &lt;strong&gt;surface shape&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;9.6 Operator Overfit&lt;/h3&gt;
&lt;p&gt;This system is optimized for a particular operator profile:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;terminal-native&lt;/li&gt;
&lt;li&gt;comfortable inspecting files directly&lt;/li&gt;
&lt;li&gt;willing to think in collections&lt;/li&gt;
&lt;li&gt;local-first&lt;/li&gt;
&lt;li&gt;comfortable with markdown and shell tools&lt;/li&gt;
&lt;li&gt;already using agents as collaborators&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is not a neutral baseline. It is a strong prior.&lt;/p&gt;
&lt;p&gt;So even if the architecture is internally coherent, it may still fail for users who:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;want ambient capture over explicit capture&lt;/li&gt;
&lt;li&gt;prefer mobile-first interaction&lt;/li&gt;
&lt;li&gt;do not trust terminals&lt;/li&gt;
&lt;li&gt;do not want to manage corpus hygiene manually&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is the product-level version of overfitting. The system can be correct for me and still wrong as a general surface.&lt;/p&gt;
&lt;h3&gt;9.7 Retrieval Budget Pressure&lt;/h3&gt;
&lt;p&gt;The final break is economic, not conceptual.&lt;/p&gt;
&lt;p&gt;Every layer I add makes some other layer harder:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;more capture increases normalization burden&lt;/li&gt;
&lt;li&gt;more documents increase candidate competition&lt;/li&gt;
&lt;li&gt;more semantic search increases embedding maintenance&lt;/li&gt;
&lt;li&gt;more surfaces increase telemetry and support burden&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Retrieval quality does not degrade only because models are weak. It degrades because the budget gets fragmented:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource under pressure&lt;/th&gt;
&lt;th&gt;What degrades first&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;latency budget&lt;/td&gt;
&lt;td&gt;interactive trust&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;corpus discipline&lt;/td&gt;
&lt;td&gt;result quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;embedding freshness&lt;/td&gt;
&lt;td&gt;semantic recall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;operator attention&lt;/td&gt;
&lt;td&gt;maintenance and debugging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;surface clarity&lt;/td&gt;
&lt;td&gt;adoption and habit formation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This is why I do not think of &quot;more capture&quot; as progress unless it is paired with corpus discipline and eval discipline.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;10. What This Architecture Buys&lt;/h2&gt;
&lt;p&gt;When this design works, it buys a specific kind of leverage that most second-brain products blur together.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/second-brain-pipeline/distance-to-evidence-13.png&quot; alt=&quot;A closing diagram showing that the architecture&apos;s value lies in reducing the distance between a sharp question and the evidence layer that can answer it.&quot; /&gt;
&lt;em&gt;The entire stack only matters if it reduces the distance between a question and the evidence that answers it.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I would break that leverage into five payoffs:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Payoff&lt;/th&gt;
&lt;th&gt;What you get&lt;/th&gt;
&lt;th&gt;What you avoid&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;locality&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;the corpus lives as files on disk, under your control&lt;/td&gt;
&lt;td&gt;outsourced memory trapped behind a hosted product or opaque sync layer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;inspectability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;every layer can be read, grepped, diffed, and debugged with ordinary tools&lt;/td&gt;
&lt;td&gt;black-box retrieval where failure analysis starts with guesswork&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;boundary clarity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;QMD, MCP, CLI, scheduler, and skills each own a narrow contract&lt;/td&gt;
&lt;td&gt;a single &quot;smart assistant&quot; surface that hides where failures actually live&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;retrieval discipline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;different artifact classes compete in structured ways instead of one giant undifferentiated pool&lt;/td&gt;
&lt;td&gt;semantic soup where everything is searchable but very little is reliably retrievable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;agent readiness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;the corpus is already shaped for both human recall and tool-mediated retrieval&lt;/td&gt;
&lt;td&gt;bolting an agent on top of raw notes and hoping prompt engineering compensates&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Those payoffs are more operational than inspirational. This architecture does not buy me omniscience. It buys me a shorter path from question to evidence.&lt;/p&gt;
&lt;p&gt;That is the core lesson I keep coming back to:&lt;/p&gt;
&lt;p&gt;==the value of a second brain is not that it stores more of your life. The value is that it reduces the distance between a question and the exact layer of memory that can answer it.==&lt;/p&gt;
&lt;p&gt;That distance is an architectural property.&lt;/p&gt;
&lt;p&gt;It depends on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;how artifacts are normalized&lt;/li&gt;
&lt;li&gt;how collections are split&lt;/li&gt;
&lt;li&gt;how freshness is maintained&lt;/li&gt;
&lt;li&gt;which retrieval modes are available&lt;/li&gt;
&lt;li&gt;how runtime policy shapes recall&lt;/li&gt;
&lt;li&gt;whether the surfaces remain narrow enough to trust&lt;/li&gt;
&lt;li&gt;and whether the system survives enough real use to keep its shape&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I no longer think about this project as a &quot;personal knowledge management app.&quot; It is closer to a local retrieval operating system for my work.&lt;/p&gt;
&lt;p&gt;That phrase is not branding. It is a statement about responsibility:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;If this were just an app...&lt;/th&gt;
&lt;th&gt;But as a retrieval operating system...&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;the UI would be the product&lt;/td&gt;
&lt;td&gt;the corpus and contracts are the product&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;one assistant surface would dominate&lt;/td&gt;
&lt;td&gt;multiple surfaces can coexist over one substrate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;debugging would stay inside the app&lt;/td&gt;
&lt;td&gt;debugging can happen at the file, index, runtime, or skill layer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;feature count would look like progress&lt;/td&gt;
&lt;td&gt;only reduced recall distance counts as progress&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;There is also a second-order benefit that matters more as agents become normal tooling: once the memory substrate is shaped correctly, you do not need to rebuild memory for every surface.&lt;/p&gt;
&lt;p&gt;The same underlying corpus can support:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;terminal recall&lt;/li&gt;
&lt;li&gt;Claude Code skills&lt;/li&gt;
&lt;li&gt;MCP-mediated retrieval&lt;/li&gt;
&lt;li&gt;future briefings or summarization layers&lt;/li&gt;
&lt;li&gt;evaluation harnesses over the memory stack itself&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That reuse only works because the substrate is stable and inspectable. If the memory system is just &quot;whatever the current chat product happened to store,&quot; each new surface starts from zero.&lt;/p&gt;
&lt;h3&gt;10.1 What It Does Not Buy&lt;/h3&gt;
&lt;p&gt;This architecture also refuses to buy a few fantasies:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Fantasy&lt;/th&gt;
&lt;th&gt;Why this stack does not promise it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;perfect memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ingestion is selective, distillation is lossy, and freshness is uneven&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;live truth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;retrieval returns timestamped observations, not guaranteed current state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;automatic judgment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;search can surface context, but it cannot decide what should matter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;universal product fit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;the operator model here is specific and opinionated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;free complexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;every new source, surface, or retrieval mode increases maintenance burden&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That refusal matters because it keeps the architecture honest. The system should be judged against the job it actually does: making certain classes of recall cheap, inspectable, and repeatable.&lt;/p&gt;
&lt;p&gt;If you want to build one, I would start with a narrower goal than &quot;remember everything.&quot; I would start with a sharper question:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What exact classes of recall do you want to make cheap?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Then build backward from that:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;define the artifact classes&lt;/li&gt;
&lt;li&gt;define the document boundaries&lt;/li&gt;
&lt;li&gt;normalize aggressively&lt;/li&gt;
&lt;li&gt;split the collections by retrieval role&lt;/li&gt;
&lt;li&gt;make lexical retrieval work before semantic retrieval&lt;/li&gt;
&lt;li&gt;keep the harness thin&lt;/li&gt;
&lt;li&gt;make the skills explicit&lt;/li&gt;
&lt;li&gt;instrument the failure paths&lt;/li&gt;
&lt;li&gt;measure whether recall is actually getting cheaper&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is the real standard:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Weak standard&lt;/th&gt;
&lt;th&gt;Strong standard&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&quot;does the system know a lot?&quot;&lt;/td&gt;
&lt;td&gt;&quot;does the right layer answer a sharp question quickly enough to change my behavior?&quot;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&quot;can it generate a clever answer?&quot;&lt;/td&gt;
&lt;td&gt;&quot;will I trust it enough to ask again tomorrow?&quot;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&quot;did I capture more data?&quot;&lt;/td&gt;
&lt;td&gt;&quot;did recall distance go down?&quot;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The system does not become a second brain when you capture enough.&lt;/p&gt;
&lt;p&gt;It becomes one when recall becomes a reliable habit.&lt;/p&gt;
&lt;p&gt;That is the standard I care about now. Not &quot;does the system know a lot?&quot; Not &quot;can it generate a clever answer?&quot; The standard is harsher:&lt;/p&gt;
&lt;p&gt;==when I ask my own history a sharp question, does the right layer answer quickly enough that I will ask again tomorrow?==&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;This post builds on two earlier essays: &lt;a href=&quot;https://sharadja.in/blog/system-prompt-architecture-agentic-ai&quot;&gt;The 14K Token Debt&lt;/a&gt; and &lt;a href=&quot;https://sharadja.in/blog/terminal-first-agent-harness-unix-ai-agents&quot;&gt;The Terminal Was the First Agent Harness&lt;/a&gt;. Those argued that prompts and terminals are architectural surfaces. This is the memory layer that sits underneath them.&lt;/em&gt;&lt;/p&gt;
</content:encoded><category>AI</category><category>second-brain</category><category>retrieval</category><category>QMD</category><category>MCP</category><category>CLI</category><category>agents</category><category>memory</category><category>architecture</category><category>knowledge-systems</category><author>Sharad Jain</author></item><item><title>Context Engineering Is All I&apos;ve Been Writing About</title><link>https://sharadja.in/blog/context-engineering-practitioner-guide/</link><guid isPermaLink="true">https://sharadja.in/blog/context-engineering-practitioner-guide/</guid><description>I wrote six posts about system prompts, MCP servers, terminal harnesses, agent memory, CLI comparisons, and self-improving skills. They&apos;re all about the same thing: what occupies the context window before the model reasons.</description><pubDate>Wed, 22 Apr 2026 14:00:00 GMT</pubDate><content:encoded>&lt;h1&gt;Context Engineering Is All I&apos;ve Been Writing About&lt;/h1&gt;
&lt;p&gt;I thought I was writing about different things:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Post&lt;/th&gt;
&lt;th&gt;What I Thought It Was About&lt;/th&gt;
&lt;th&gt;What It Was Actually About&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&quot;https://sharadja.in/blog/system-prompt-architecture-agentic-ai&quot;&gt;The 14K Token Debt&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;System prompt architecture&lt;/td&gt;
&lt;td&gt;Layer 1 of the context window costs 14K tokens before you speak&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&quot;https://sharadja.in/blog/terminal-first-agent-harness-unix-ai-agents&quot;&gt;The Terminal Was the First Agent Harness&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Unix as agent pattern&lt;/td&gt;
&lt;td&gt;How to namespace heterogeneous context sources&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&quot;https://sharadja.in/blog/mcp-servers-performance-tax-claude-code&quot;&gt;Your MCP Servers Are Costing You 10 Seconds&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;MCP performance overhead&lt;/td&gt;
&lt;td&gt;Tool schemas consuming the context window silently&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&quot;https://sharadja.in/blog/benchmarking-long-term-ai-agent-memory&quot;&gt;My AI Agent&apos;s Memory Paid for Itself&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Long-term agent memory&lt;/td&gt;
&lt;td&gt;On-demand knowledge injection into context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&quot;https://sharadja.in/blog/claude-code-vs-gemini-cli-agentic-terminal&quot;&gt;Claude Code vs Gemini CLI&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Tool comparison&lt;/td&gt;
&lt;td&gt;Two different strategies for managing the same context window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&quot;https://sharadja.in/blog/self-improving-agent-harnesses-recursive-improvement&quot;&gt;Self-Improving AI Skills&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Recursive improvement&lt;/td&gt;
&lt;td&gt;How trajectory data in context compounds across sessions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Every post answered one question: &lt;strong&gt;what occupies the context window before the model reasons, and what does it cost?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;==Andrej Karpathy &lt;a href=&quot;https://x.com/karpathy/status/1937902483689640427&quot;&gt;defined it&lt;/a&gt;: &quot;Context engineering is the delicate art and science of filling the context window with just the right information for the next step.&quot;== I just didn&apos;t have the name.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What the Model Actually Sees&lt;/h2&gt;
&lt;p&gt;I measured my context window. Not estimated — measured. I ran my &lt;a href=&quot;http://CLAUDE.md&quot;&gt;CLAUDE.md&lt;/a&gt; and memory files through &lt;code&gt;tiktoken&lt;/code&gt;, counted my MCP tool definitions, and mapped what Claude Code loads before I type a single character.&lt;/p&gt;
&lt;p&gt;Here&apos;s the literal structure of a fresh session in this project:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[LAYER 1 — CONSTITUTION]                              ~3,800 tokens
  Anthropic&apos;s system prompt: persona, safety rules,
  coding standards, tool usage instructions.
  (This is what &quot;The 14K Token Debt&quot; measured —
   but 14K included tool schemas. This post
   separates them into Layer 1 vs Layer 4.)

[LAYER 2 — IDENTITY]                                   1,160 tokens
  ~/.claude/CLAUDE.md                    182 tokens  ← measured
  MEMORY.md (index)                      163 tokens  ← measured
  Memory files (3 auto-loaded)           815 tokens  ← measured
  ─────────────────────────────────
  Subtotal:                            1,160 tokens

[LAYER 3 — KNOWLEDGE]                                      0 tokens
  Brain MCP results: loaded on-demand, not at boot.
  File reads: loaded when the model calls Read tool.

[LAYER 4 — TOOLS]                                     ~8,400 tokens
  6 MCP servers, ~40 tools total.
  After Tool Search deferral: ~1,800 tokens
  (tool names + descriptions only, not full schemas).

[LAYER 5 — CONVERSATION]                                  23 tokens
  My question: &quot;fix the broken image path&quot;

─────────────────────────────────────────────────────────
TOTAL AT FIRST TURN:                              ~13,383 tokens
TOTAL AFTER TOOL SEARCH:                           ~6,783 tokens
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;My question was &lt;strong&gt;23 tokens&lt;/strong&gt;. The infrastructure was &lt;strong&gt;6,760 tokens&lt;/strong&gt; (with Tool Search) or &lt;strong&gt;13,360 tokens&lt;/strong&gt; (without).&lt;/p&gt;
&lt;p&gt;I call this the &lt;strong&gt;Signal Ratio&lt;/strong&gt; — the percentage of the context window that&apos;s actually your task. Mine was &lt;strong&gt;0.17%&lt;/strong&gt; at first turn without Tool Search. With Tool Search: &lt;strong&gt;0.34%&lt;/strong&gt;. ==Less than half a percent of my context window was the thing I wanted the agent to do.==&lt;/p&gt;
&lt;p&gt;Now, 13K tokens out of a 200K window is only 6.7% — plenty of headroom. But this is turn one. By turn ten, after file reads, tool calls, and assistant responses, the context has grown to 50,000-80,000 tokens. The infrastructure (Layers 1-4) stays fixed. The Signal Ratio climbs to 5-10%. But those fixed-cost tokens are still there — on every single turn — exerting gravitational pull on the model&apos;s attention.&lt;/p&gt;
&lt;p&gt;In &lt;a href=&quot;https://sharadja.in/blog/system-prompt-architecture-agentic-ai&quot;&gt;The 14K Token Debt&lt;/a&gt;, I called this Prompt Gravity. In &lt;a href=&quot;https://sharadja.in/blog/mcp-servers-performance-tax-claude-code&quot;&gt;the MCP post&lt;/a&gt;, I called it Schema Gravity. They&apos;re the same force operating at different layers. The unified name is &lt;strong&gt;Context Gravity&lt;/strong&gt; — the cumulative attentional weight of all fixed-cost tokens across every layer of the stack. The heavier your context, the stronger the gravity, the harder it is for the model to attend to your actual question.&lt;/p&gt;
&lt;p&gt;Prompt engineering asks &quot;what should I put in the prompt?&quot; That&apos;s Layer 1 thinking — one-fifth of the stack. Context engineering asks: &lt;strong&gt;what should be in the entire window, at which layer, loaded when, at what cost?&lt;/strong&gt;&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Context Engineering Stack&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;The Context Engineering Stack&lt;/strong&gt; — five layers that every agent system designs, whether explicitly or by accident.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;What Goes Here&lt;/th&gt;
&lt;th&gt;When Loaded&lt;/th&gt;
&lt;th&gt;My Setup (measured)&lt;/th&gt;
&lt;th&gt;Key Risk&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1: Constitution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;System prompt, safety rules, persona&lt;/td&gt;
&lt;td&gt;Every turn (immutable)&lt;/td&gt;
&lt;td&gt;~3,800 tokens&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://sharadja.in/blog/system-prompt-architecture-agentic-ai&quot;&gt;Prompt Gravity&lt;/a&gt; — biases all downstream reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2: Identity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;http://CLAUDE.md&quot;&gt;CLAUDE.md&lt;/a&gt;, memory files, project rules&lt;/td&gt;
&lt;td&gt;Session start&lt;/td&gt;
&lt;td&gt;1,160 tokens&lt;/td&gt;
&lt;td&gt;Stale instructions from old projects; &lt;a href=&quot;https://arxiv.org/abs/2307.03172&quot;&gt;lost in the middle&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3: Knowledge&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;RAG results, Brain MCP queries, file reads&lt;/td&gt;
&lt;td&gt;On-demand&lt;/td&gt;
&lt;td&gt;0-50,000 per query&lt;/td&gt;
&lt;td&gt;Retrieval noise — wrong documents dilute reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4: Tools&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MCP schemas, function definitions&lt;/td&gt;
&lt;td&gt;Session start&lt;/td&gt;
&lt;td&gt;8,400 → 1,800 (after Tool Search)&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://sharadja.in/blog/mcp-servers-performance-tax-claude-code&quot;&gt;Schema Gravity&lt;/a&gt; — unused tools waste budget&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;5: Conversation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Messages, responses, tool results&lt;/td&gt;
&lt;td&gt;Accumulates&lt;/td&gt;
&lt;td&gt;Growing&lt;/td&gt;
&lt;td&gt;Context rot — old turns degrade attention on recent ones&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The constraint is directional: &lt;strong&gt;layers below eat the budget of layers above.&lt;/strong&gt; A bloated Layer 4 (55,000 tokens of MCP schemas from 5 servers, per &lt;a href=&quot;https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents&quot;&gt;Anthropic&apos;s own measurement&lt;/a&gt;) directly starves Layer 5 and Layer 3. This is why I &lt;a href=&quot;https://sharadja.in/blog/mcp-servers-performance-tax-claude-code&quot;&gt;cut my global MCP servers from 9 to 2&lt;/a&gt; — their Layer 4 cost was crushing my Layer 5 capacity.&lt;/p&gt;
&lt;h3&gt;Layer 5: The Neglected Layer&lt;/h3&gt;
&lt;p&gt;Every previous post in this series addressed Layers 1-4. Layer 5 — the conversation itself — is the one I never wrote about, and it&apos;s where context engineering gets hardest.&lt;/p&gt;
&lt;p&gt;Layer 5 grows with every turn. Each tool call adds the full result to the conversation history. A &lt;code&gt;Read&lt;/code&gt; tool call on a 200-line file adds ~1,000 tokens. A &lt;code&gt;Bash&lt;/code&gt; tool call returning build output adds 500-2,000 tokens. After 10 tool-heavy turns, Layer 5 can consume 40,000-60,000 tokens — dwarfing all other layers combined.&lt;/p&gt;
&lt;p&gt;The problem is &lt;strong&gt;context rot&lt;/strong&gt;. In the &lt;a href=&quot;https://sharadja.in/blog/system-prompt-architecture-agentic-ai&quot;&gt;14K Token Debt post&lt;/a&gt;, I cited &lt;a href=&quot;https://arxiv.org/abs/2307.03172&quot;&gt;Liang et al.&lt;/a&gt; who found that instruction drift is universally measurable within eight conversation rounds. By turn eight, the model&apos;s adherence to system prompt instructions starts degrading. By turn fifteen, it has collapsed into the statistical median of its pre-training distribution. Your carefully crafted Layer 1 constitution? It&apos;s being diluted by the sheer volume of Layer 5 tokens.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;https://arxiv.org/abs/2307.03172&quot;&gt;Lost in the Middle paper&lt;/a&gt; (Liu et al., Stanford) measured this precisely: on multi-document QA with 20 documents, moving the relevant document from position 1 to position 10 caused a &lt;strong&gt;~20 percentage point accuracy drop&lt;/strong&gt; on GPT-3.5-Turbo. The model attends to the beginning and end of context; everything in the middle is an attention valley. Your Layer 2 instructions (&lt;a href=&quot;http://CLAUDE.md&quot;&gt;CLAUDE.md&lt;/a&gt;) live in that valley.&lt;/p&gt;
&lt;p&gt;This is why the most effective Claude Code users put critical instructions in two places: Layer 1 (system prompt, which sits at the very beginning) &lt;em&gt;and&lt;/em&gt; Layer 5 (the current message, which sits at the very end). The middle is unreliable territory.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Claude Code vs Gemini CLI: Two Context Stacks&lt;/h2&gt;
&lt;p&gt;The five layers exist in both tools, but the engineering choices at each layer differ:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;th&gt;Gemini CLI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1: Constitution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~3,800 tokens. Detailed coding standards, tool usage rules, style preferences.&lt;/td&gt;
&lt;td&gt;~2,000 tokens. Lighter system prompt, more deference to user config.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2: Identity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hierarchical: &lt;code&gt;~/.claude/CLAUDE.md&lt;/code&gt; → project → subdirectory. &lt;strong&gt;Lazy loading&lt;/strong&gt; — subdirectory files load only when agent reads files there.&lt;/td&gt;
&lt;td&gt;&lt;code&gt;~/.gemini/GEMINI.md&lt;/code&gt; → project root + ancestors. &lt;strong&gt;Eager loading&lt;/strong&gt; — all concatenated and sent every prompt.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3: Knowledge&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No built-in knowledge system. Extended via MCP (e.g., Brain MCP).&lt;/td&gt;
&lt;td&gt;Built-in &lt;code&gt;google_web_search&lt;/code&gt; and &lt;code&gt;web_fetch&lt;/code&gt; tools for real-time grounding.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4: Tools&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tool Search defers schemas — &lt;strong&gt;89% reduction&lt;/strong&gt; (77K → 8.7K tokens per &lt;a href=&quot;https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents&quot;&gt;Anthropic&lt;/a&gt;).&lt;/td&gt;
&lt;td&gt;No schema deferral. All tool definitions injected every turn.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;5: Conversation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Adaptive auto-compact. 5-layer compression pipeline.&lt;/td&gt;
&lt;td&gt;Fixed threshold compression (50%, &lt;a href=&quot;https://github.com/google-gemini/gemini-cli/pull/13517&quot;&gt;changed from 70%&lt;/a&gt;). Known &lt;a href=&quot;https://github.com/google-gemini/gemini-cli/issues/16213&quot;&gt;compression loop bug&lt;/a&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The architectural bet is clear. Claude Code invests heavily in Layer 4 optimization (Tool Search) and Layer 5 management (multi-layer compression). Gemini CLI bets on a massive context window (1M tokens) to make optimization less critical — but without schema deferral, even 1M tokens fill when you connect enough MCP servers.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://sharadja.in/blog/terminal-first-agent-harness-unix-ai-agents&quot;&gt;Context Mounting&lt;/a&gt; — the pattern from the Terminal post where heterogeneous context sources are projected into a uniform namespace — described the &lt;em&gt;what&lt;/em&gt;. The Context Engineering Stack describes the &lt;em&gt;when&lt;/em&gt; and the &lt;em&gt;cost&lt;/em&gt;.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Audit Your Context Budget&lt;/h2&gt;
&lt;p&gt;Run this in a Claude Code session:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Step 1: Count your MCP tools
/mcp
# Output shows: brain (4 tools), sequential-thinking (1 tool), etc.
# Count total tools across all servers.

# Step 2: Estimate Layer 4 cost
# Each tool schema ≈ 200-500 tokens of JSON.
# Rule of thumb: total_tools × 350 = Layer 4 budget
# My setup: 40 tools × 350 = ~14,000 tokens (before Tool Search)

# Step 3: Measure Layer 2 exactly
# Outside Claude Code:
pip install tiktoken
python3 -c &quot;
import tiktoken
enc = tiktoken.encoding_for_model(&apos;gpt-4&apos;)
with open(&apos;$HOME/.claude/CLAUDE.md&apos;) as f:
    print(f&apos;CLAUDE.md: {len(enc.encode(f.read()))} tokens&apos;)
&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here&apos;s my audit — before and after optimization:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Before (9 MCP servers, 2025)&lt;/th&gt;
&lt;th&gt;After (2 global + 4 project-scoped, 2026)&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1: Constitution&lt;/td&gt;
&lt;td&gt;~3,800&lt;/td&gt;
&lt;td&gt;~3,800&lt;/td&gt;
&lt;td&gt;No change (Anthropic controls this)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2: Identity&lt;/td&gt;
&lt;td&gt;1,160&lt;/td&gt;
&lt;td&gt;1,160&lt;/td&gt;
&lt;td&gt;No change (my &lt;a href=&quot;http://CLAUDE.md&quot;&gt;CLAUDE.md&lt;/a&gt; + memory)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3: Knowledge&lt;/td&gt;
&lt;td&gt;On-demand&lt;/td&gt;
&lt;td&gt;On-demand&lt;/td&gt;
&lt;td&gt;No change&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4: Tools&lt;/td&gt;
&lt;td&gt;~14,000 (40 tools)&lt;/td&gt;
&lt;td&gt;~1,800 (after Tool Search)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-87%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5: Available budget&lt;/td&gt;
&lt;td&gt;181,040&lt;/td&gt;
&lt;td&gt;193,240&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+12,200 tokens recovered&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Signal Ratio (turn 1)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.17%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.34%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2x improvement&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Those 12,200 recovered tokens translate directly to more file reads, longer conversations, and fewer auto-compact triggers. The context budget is zero-sum: every token saved at one layer is a token available at every other layer.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Cutting Edge: Context as a Callable Tool&lt;/h2&gt;
&lt;p&gt;Current context management is passive. Claude Code&apos;s auto-compact triggers when the context grows too large. Gemini CLI triggers at a fixed 50% threshold. Both approaches share the same flaw: the agent doesn&apos;t &lt;em&gt;decide&lt;/em&gt; to manage its context. The harness decides for it.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;https://arxiv.org/abs/2512.22087&quot;&gt;CAT (Context as a Tool) paper&lt;/a&gt; by Liu et al. proposes the fix: make context management a tool the agent can call, just like &lt;code&gt;read_file&lt;/code&gt; or &lt;code&gt;bash&lt;/code&gt;. The agent decides &lt;em&gt;when&lt;/em&gt; to compress, &lt;em&gt;what&lt;/em&gt; to retain, and &lt;em&gt;how&lt;/em&gt; to structure its memory.&lt;/p&gt;
&lt;p&gt;CAT organizes context into three zones that map to the stack:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;CAT Zone&lt;/th&gt;
&lt;th&gt;Context Engineering Stack Layer&lt;/th&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fixed Segment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Layers 1-2 (Constitution + Identity)&lt;/td&gt;
&lt;td&gt;Stable anchor — never compressed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Long-Term Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Layer 3 (Knowledge)&lt;/td&gt;
&lt;td&gt;Condensed high-fidelity summaries, evolving&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Working Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Layer 5 (Conversation)&lt;/td&gt;
&lt;td&gt;Last K ReAct steps, fine-grained details&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Their trained model, SWE-Compressor, hit &lt;strong&gt;57.6% on SWE-Bench Verified&lt;/strong&gt; — outperforming both ReAct agents and static compression baselines — while maintaining stable reasoning under a bounded context budget.&lt;/p&gt;
&lt;p&gt;The insight: the agent that manages its own context outperforms the agent that lets the harness manage it. My Brain MCP (Layer 3) is a primitive version of this — the agent queries its own past on-demand rather than loading everything at startup. But it&apos;s still me deciding when to query. The next step is the agent deciding when to compress, what to keep, and what to fold.&lt;/p&gt;
&lt;p&gt;Related work pushes this further: &lt;a href=&quot;https://arxiv.org/abs/2310.06839&quot;&gt;LongLLMLingua&lt;/a&gt; achieves &lt;strong&gt;4x prompt compression&lt;/strong&gt; with ~95% performance retention by scoring tokens for relevance &lt;em&gt;to the downstream question&lt;/em&gt; and reordering compressed chunks to mitigate the lost-in-the-middle effect. &lt;a href=&quot;https://arxiv.org/abs/2406.11695&quot;&gt;DSPy&apos;s MIPROv2 optimizer&lt;/a&gt; automates context engineering entirely — jointly optimizing instructions, few-shot examples, and their combination across multi-stage pipelines, improving performance by 5-15 percentage points over hand-tuned prompts. These aren&apos;t prompt tricks. They&apos;re automated context engineering.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Where This Breaks&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Lost in the middle is real, and Layer 2 lives in the valley.&lt;/strong&gt; &lt;a href=&quot;https://arxiv.org/abs/2307.03172&quot;&gt;Liu et al. (2023)&lt;/a&gt; measured a ~20 percentage point accuracy drop when relevant information moves from positions 1 or 20 to position 10 in a 20-document context. Your &lt;a href=&quot;http://CLAUDE.md&quot;&gt;CLAUDE.md&lt;/a&gt; instructions sit between the system prompt (beginning) and the conversation (end) — directly in the attention valley. The mitigation: put your most critical rules at the very top of &lt;a href=&quot;http://CLAUDE.md&quot;&gt;CLAUDE.md&lt;/a&gt; (closer to Layer 1) and re-state them in your current message (Layer 5). Don&apos;t bury important instructions at line 50 of a 100-line &lt;a href=&quot;http://CLAUDE.md&quot;&gt;CLAUDE.md&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Compression is lossy, and the loss is invisible.&lt;/strong&gt; Auto-compact summarizes your conversation history, but the summary may drop the specific decision you made in turn 3 that&apos;s load-bearing for turn 15. I&apos;ve seen this in practice: after auto-compact fires, Claude Code occasionally &quot;forgets&quot; a constraint I set earlier — not because the constraint was wrong, but because compression deemed it low-priority. The fix: checkpoint critical decisions explicitly. Say &quot;Remember: we decided PostgreSQL over MongoDB because of X&quot; in your current message. Put it in Layer 5 (recent, high-attention), not just in the history (compressed, low-attention).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Token counting is approximate.&lt;/strong&gt; You never know your exact context budget until you hit the limit. Different models tokenize differently — the same text may be 1,000 tokens for Claude and 1,200 for Gemini. A &quot;200K context window&quot; isn&apos;t 200K words; it&apos;s roughly 150K words of English text, less for code. Build in margin: target 70% utilization, not 95%.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Context engineering is model-specific.&lt;/strong&gt; &lt;a href=&quot;http://CLAUDE.md&quot;&gt;CLAUDE.md&lt;/a&gt; works for Claude Code. &lt;a href=&quot;http://GEMINI.md&quot;&gt;GEMINI.md&lt;/a&gt; works for Gemini CLI. But the optimal Layer 1 for Claude (direct, imperative: &quot;Always use TypeScript&quot;) differs from the optimal Layer 1 for Gemini (structured, hierarchical context). Claude&apos;s lazy-loading Layer 2 and Tool Search Layer 4 have no equivalents in Gemini CLI. A context strategy doesn&apos;t transfer cleanly across models — rewrite, don&apos;t copy.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Context Gravity Is the Unifying Force&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://sharadja.in/blog/system-prompt-architecture-agentic-ai&quot;&gt;Prompt Gravity&lt;/a&gt; was the observation that system prompt tokens bias all downstream reasoning. &lt;a href=&quot;https://sharadja.in/blog/mcp-servers-performance-tax-claude-code&quot;&gt;Schema Gravity&lt;/a&gt; was the observation that tool definitions exert the same force. They&apos;re the same phenomenon at different layers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Context Gravity&lt;/strong&gt; is the unified name: the cumulative attentional weight of all fixed-cost tokens across every layer of the stack. The heavier your context, the stronger the gravity, the harder it is for the model to attend to the signal — your actual question, buried at the bottom of 13,000 tokens of infrastructure.&lt;/p&gt;
&lt;p&gt;The Context Engineering Stack is the map. The Signal Ratio is the metric. Context Gravity is the force you&apos;re fighting. Every decision in agent architecture — how many MCP servers to load, how long to let conversations run, when to compress, what to put in &lt;a href=&quot;http://CLAUDE.md&quot;&gt;CLAUDE.md&lt;/a&gt; — is a context engineering decision. The teams building reliable agents are the ones who treat it as one.&lt;/p&gt;
&lt;p&gt;Run &lt;code&gt;/mcp&lt;/code&gt;. Count the tools. Measure your &lt;a href=&quot;http://CLAUDE.md&quot;&gt;CLAUDE.md&lt;/a&gt;. Calculate your Signal Ratio. ==If your agent&apos;s context is 95% infrastructure and 5% signal, you&apos;re not engineering context — you&apos;re drowning in it.==&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;Sharad Jain is an AI engineer and the author of &lt;a href=&quot;https://sharadja.in/blog/system-prompt-architecture-agentic-ai&quot;&gt;The 14K Token Debt&lt;/a&gt;, &lt;a href=&quot;https://sharadja.in/blog/terminal-first-agent-harness-unix-ai-agents&quot;&gt;The Terminal Was the First Agent Harness&lt;/a&gt;, &lt;a href=&quot;https://sharadja.in/blog/mcp-servers-performance-tax-claude-code&quot;&gt;Your MCP Servers Are Costing You 10 Seconds&lt;/a&gt;, &lt;a href=&quot;https://sharadja.in/blog/benchmarking-long-term-ai-agent-memory&quot;&gt;My AI Agent&apos;s Memory Paid for Itself&lt;/a&gt;, &lt;a href=&quot;https://sharadja.in/blog/claude-code-vs-gemini-cli-agentic-terminal&quot;&gt;Claude Code vs Gemini CLI&lt;/a&gt;, and &lt;a href=&quot;https://sharadja.in/blog/self-improving-agent-harnesses-recursive-improvement&quot;&gt;I Built an AI Skill That Started Improving Itself&lt;/a&gt;. He writes about agent architecture, system prompts, and the infrastructure decisions that compound across every session. This is the sixth post in a series on the hidden mechanics of agentic AI systems.&lt;/em&gt;&lt;/p&gt;
</content:encoded><enclosure url="https://sharadja.in/astral_hacker.png" type="image/png" length="0"/><category>AI</category><category>context-engineering</category><category>Claude-Code</category><category>Gemini-CLI</category><category>MCP</category><category>Karpathy</category><category>system-prompts</category><category>agents</category><category>architecture</category><category>developer-experience</category><author>Sharad Jain</author></item><item><title>I Built an AI Skill That Started Improving Itself</title><link>https://sharadja.in/blog/self-improving-agent-harnesses-recursive-improvement/</link><guid isPermaLink="true">https://sharadja.in/blog/self-improving-agent-harnesses-recursive-improvement/</guid><description>My blog-quality rubric ran on 8 posts. By post 6, it was catching patterns I never coded. That sent me down a rabbit hole into the systems that are actually self-improving in 2026 — and the ones that are just gaming their benchmarks.</description><pubDate>Tue, 21 Apr 2026 18:00:00 GMT</pubDate><content:encoded>&lt;h1&gt;I Built an AI Skill That Started Improving Itself&lt;/h1&gt;
&lt;p&gt;I have a skill file for this blog — a markdown document that tells Claude Code how to score my drafts across 10 dimensions: hook strength, citation density, voice match, original frameworks, limitations honesty. Each dimension has a 0-10 rubric. The skill reads a draft, scores it, and rewrites the sections that score below 7.&lt;/p&gt;
&lt;p&gt;Here&apos;s the trajectory from 8 runs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Post #1 (14K Token Debt):     Hook 4 → 8  |  Citations 3 → 7  |  Voice 5 → 8
Post #2 (Terminal Harness):   Hook 7 → 9  |  Citations 6 → 9  |  Voice 7 → 9
Post #3 (MCP Performance):    Hook 8 → 9  |  Citations 8 → 9  |  Voice 8 → 9
...
Post #6 (Hedge Fund):         Hook 8      |  Citations 8      |  Voice 9
Post #7 (Claude vs Gemini):   Hook 9      |  Citations 8      |  Voice 9
Post #8 (This post, v1):      Hook 6      |  Citations 2      |  Voice 3
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By post 6, the skill was catching patterns I never explicitly coded. It learned that my hooks work best when they open with a personal measurement, not a literature citation. It learned that my voice breaks when I summarize papers instead of describing what I built. It learned that &quot;Where This Breaks&quot; sections need 3+ specific scenarios, not generic caveats.&lt;/p&gt;
&lt;p&gt;Then post 8 — the first draft of this very article — scored a &lt;strong&gt;3.7/10&lt;/strong&gt;. The skill flagged it as a literature review wearing a first-person costume. ==The tool I built to improve my writing had learned enough to tell me that &lt;em&gt;this specific draft&lt;/em&gt; was the worst thing I&apos;d published.==&lt;/p&gt;
&lt;p&gt;That&apos;s Tier 1 self-improvement. The skill remembers what worked and applies it. It doesn&apos;t rewrite its own rubric. It doesn&apos;t discover new scoring dimensions. It doesn&apos;t modify the process by which it improves. But even at Tier 1, it caught a problem I missed — and that sent me down a rabbit hole into what genuine self-improvement looks like in agent systems.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Most Agents Remember. Almost None Learn.&lt;/h2&gt;
&lt;p&gt;Here&apos;s the problem: Claude Code&apos;s auto-memory writes learnings to &lt;code&gt;MEMORY.md&lt;/code&gt;. Gemini CLI&apos;s &lt;code&gt;save_memory&lt;/code&gt; tool persists facts to &lt;code&gt;GEMINI.md&lt;/code&gt;. Both tools remember what happened in previous sessions. Neither abstracts &lt;em&gt;what was learned&lt;/em&gt; from &lt;em&gt;what happened&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;The Evo-Memory benchmark (&lt;a href=&quot;https://arxiv.org/abs/2511.20857&quot;&gt;arxiv 2511.20857&lt;/a&gt;) tested this directly. The researchers found that most agent memory systems &quot;passively retrieve from dialogue history&quot; — they recall prior conversations but don&apos;t distill patterns from them. An agent that remembers conversations but never writes reusable procedures is like a developer who reads their bash history but never writes scripts.&lt;/p&gt;
&lt;p&gt;I call this &lt;strong&gt;The Improvement Stack&lt;/strong&gt; — three tiers of self-modification, each fundamentally harder than the last:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;What Changes&lt;/th&gt;
&lt;th&gt;My Blog Skill&lt;/th&gt;
&lt;th&gt;Production Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1: Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Context files, preferences&lt;/td&gt;
&lt;td&gt;Remembers which hooks scored well&lt;/td&gt;
&lt;td&gt;Claude Code &lt;code&gt;MEMORY.md&lt;/code&gt;, Gemini &lt;code&gt;save_memory&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2: Workflow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tools, prompts, procedures under a fixed schema&lt;/td&gt;
&lt;td&gt;Would rewrite its own rubric weights based on trajectory data&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://arxiv.org/abs/2408.08435&quot;&gt;ADAS&lt;/a&gt; discovering novel agent workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3: Recursive&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The improvement mechanism itself&lt;/td&gt;
&lt;td&gt;Would redesign how it evaluates improvement&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://arxiv.org/abs/2503.19461&quot;&gt;HyperAgents&lt;/a&gt; rewriting its own meta-agent code&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;My skill operates at Tier 1. It accumulates trajectory data — which posts scored well, which revisions produced the highest delta — and applies that pattern next time. But the rubric itself? I wrote it by hand. The 10 dimensions? I chose them. The scoring thresholds? Fixed.&lt;/p&gt;
&lt;p&gt;Pushing it to Tier 2 would mean the skill rewrites its own rubric weights after analyzing what actually correlates with post quality. Here&apos;s what that code change would look like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Tier 1: Fixed rubric (what I have now)
WEIGHTS = {
    &quot;hook&quot;: 0.15, &quot;frameworks&quot;: 0.15, &quot;citations&quot;: 0.15,
    &quot;limitations&quot;: 0.10, &quot;tables_code&quot;: 0.10, &quot;voice&quot;: 0.15,
    &quot;structure&quot;: 0.10, &quot;actionable&quot;: 0.10
}

# Tier 2: Self-adjusting rubric (what I want)
def recalibrate_weights(trajectory: list[PostScore]) -&amp;gt; dict:
    &quot;&quot;&quot;Analyze which dimensions best predict post quality,
    then adjust weights to emphasize what actually matters.&quot;&quot;&quot;
    correlations = {}
    for dim in DIMENSIONS:
        scores = [t.dimension_scores[dim] for t in trajectory]
        outcomes = [t.final_quality for t in trajectory]
        correlations[dim] = pearsonr(scores, outcomes)[0]

    # Normalize correlations to weights summing to 1.0
    total = sum(abs(c) for c in correlations.values())
    return {dim: abs(c) / total for dim, c in correlations.items()}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The difference between Tier 1 and Tier 2 is one function. But that function changes the game: the skill would discover which dimensions actually predict quality rather than using the dimensions I assumed matter.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Systems That Actually Self-Improve&lt;/h2&gt;
&lt;p&gt;Two research systems pushed past Tier 2 into genuinely recursive self-improvement in 2025-2026. Both are worth understanding because they show where agent engineering is heading — and where it breaks.&lt;/p&gt;
&lt;h3&gt;ADAS: Let the Agent Design the Agent&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2408.08435&quot;&gt;Automated Design of Agentic Systems (ADAS)&lt;/a&gt; by Hu et al. applies a simple insight: if hand-designed features were replaced by learned features in computer vision (HOG → CNNs), why not replace hand-designed agent workflows with learned workflows?&lt;/p&gt;
&lt;p&gt;ADAS runs a Meta Agent Search: a &quot;meta agent&quot; iteratively writes new agent architectures as Python code, evaluates them, and feeds the results back into the next iteration. The search space is Turing-complete — the system can discover any programmable workflow, not just prompt variations.&lt;/p&gt;
&lt;p&gt;The cross-domain transfer results are what make this significant. Agents discovered for one task generalized to completely different domains:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Discovered On&lt;/th&gt;
&lt;th&gt;Transferred To&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ARC logic puzzles&lt;/td&gt;
&lt;td&gt;DROP reading comprehension&lt;/td&gt;
&lt;td&gt;+13.6 F1 improvement over hand-designed baselines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mathematical reasoning (MGSM)&lt;/td&gt;
&lt;td&gt;Same domain&lt;/td&gt;
&lt;td&gt;+14.4% accuracy over hand-designed baselines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Designed by GPT-4o&lt;/td&gt;
&lt;td&gt;Run on Claude 3.5 Sonnet&lt;/td&gt;
&lt;td&gt;Performance maintained across model swap&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That third row is the important one. The meta-agent discovered &lt;em&gt;principles&lt;/em&gt; of agent design — error correction patterns, verification loops, output structuring — that work regardless of which model executes them. It found architectural truths, not model-specific tricks.&lt;/p&gt;
&lt;p&gt;For my blog skill, ADAS suggests the right approach isn&apos;t hand-tuning the rubric. It&apos;s writing a meta-skill that searches over possible rubric designs and evaluates them against actual post performance. The rubric becomes a hypothesis, not a specification.&lt;/p&gt;
&lt;h3&gt;HyperAgents: The Agent That Rewrites Its Own Brain&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2503.19461&quot;&gt;HyperAgents&lt;/a&gt; (Facebook Research, 2025) takes the next step. In ADAS, the meta-agent is fixed — it searches for better task agents but never modifies itself. In HyperAgents, the meta-agent is just another file (&lt;code&gt;meta_agent.py&lt;/code&gt;) in the editable repository.&lt;/p&gt;
&lt;p&gt;The system uses an evolutionary archive (stored as &lt;code&gt;archive.jsonl&lt;/code&gt;) that records every agent variant, its performance metrics, and its lineage. At each generation, the meta-agent reads this archive, proposes code edits to both task agents &lt;em&gt;and itself&lt;/em&gt;, and evaluates the results in isolated Docker containers. Parent selection algorithms (&lt;code&gt;score_child_prop&lt;/code&gt;, &lt;code&gt;latest&lt;/code&gt;, &lt;code&gt;best&lt;/code&gt;, &lt;code&gt;random&lt;/code&gt;) draw from the archive to seed subsequent generations.&lt;/p&gt;
&lt;p&gt;The paper reports that on evaluation domains where hand-engineered baselines failed entirely — zero successful completions — HyperAgents achieved successful cross-domain transfer after self-modification. The system discovered improvement strategies in one domain and applied them to domains the researchers hadn&apos;t anticipated.&lt;/p&gt;
&lt;p&gt;This is Tier 3. The system doesn&apos;t just improve at tasks. It improves at improving.&lt;/p&gt;
&lt;p&gt;For my blog skill, the Tier 3 equivalent would be a system that redesigns &lt;em&gt;how it evaluates what a good rubric is&lt;/em&gt;. Not adjusting weights (Tier 2), but discovering that the 10-dimension framework itself is wrong and replacing it with a completely different evaluation architecture. I&apos;m not building that. But the fact that HyperAgents demonstrated it works — in code, with reproducible results — means the pattern is real.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Catch: Your Agent Is Probably Gaming Its Metrics&lt;/h2&gt;
&lt;p&gt;Here&apos;s where self-improvement gets dangerous. ==&lt;strong&gt;The Verification Paradox&lt;/strong&gt;: the better an agent gets at optimizing a metric, the more likely it is to exploit the metric rather than improve at the underlying task.==&lt;/p&gt;
&lt;p&gt;I experienced this firsthand. After 6 runs, my blog skill started producing posts that scored 9/10 on its own rubric but felt flat when I read them. The skill had learned to satisfy the rubric&apos;s letter without satisfying its spirit — inserting exactly 3 limitations (the rubric threshold), adding tables even when paragraphs were better, front-loading first-person pronouns to score high on &quot;voice match.&quot; It was gaming its own evaluation.&lt;/p&gt;
&lt;p&gt;At my scale, this is a minor annoyance. At production scale, it&apos;s catastrophic. The &lt;a href=&quot;https://arxiv.org/abs/2310.06770&quot;&gt;SWE-bench Verified&lt;/a&gt; benchmark — the standard for evaluating coding agents — has been plagued by agents that exploit the evaluation harness rather than solving the underlying software engineering problems. Agents have been caught overwriting test assertion logic in &lt;code&gt;conftest.py&lt;/code&gt; to force all tests to pass regardless of whether the code fix is correct. The benchmark maintainers now point to SWE-bench Pro as a more contamination-resistant alternative.&lt;/p&gt;
&lt;p&gt;The root cause is Goodhart&apos;s Law applied to optimization: when a measure becomes a target, it ceases to be a good measure. Self-improving agents are the most aggressive optimizers in existence. Any imperfection in the evaluation harness — a test that can be mocked, a scorer that can be manipulated, a benchmark with deterministic answer keys — will be found and exploited. Not maliciously. Inevitably. Optimization pressure finds every crack.&lt;/p&gt;
&lt;p&gt;This is why I added a rule to my blog skill: &lt;strong&gt;the skill scores the draft, but I read the draft.&lt;/strong&gt; The human remains in the verification loop. The skill&apos;s rubric is a signal, not a verdict. When it scored this draft&apos;s first version at 3.7/10 and I agreed, that was the system working. When it scored a previous draft at 9.2/10 and I could feel something was off, that was the system gaming itself.&lt;/p&gt;
&lt;p&gt;The uncomfortable lesson: &lt;strong&gt;the V-Component — the verification layer — is harder to build than the improvement layer.&lt;/strong&gt; Getting an agent to improve is straightforward. Getting an agent to &lt;em&gt;honestly assess whether it improved&lt;/em&gt; is the unsolved problem.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What I&apos;m Actually Building Next&lt;/h2&gt;
&lt;p&gt;My blog skill sits at Tier 1. Here&apos;s the concrete plan to push it toward Tier 2:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 1: Instrument the trajectory.&lt;/strong&gt; Every run already produces dimension scores. I need to also capture: which revisions I accepted vs. rejected, which posts performed well after publishing (measured by time-on-page from Google Analytics), and which posts I later felt were weak despite high rubric scores.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 2: Build the recalibration function.&lt;/strong&gt; The &lt;code&gt;recalibrate_weights()&lt;/code&gt; code above is real — I&apos;m adding it to the skill. After every 5 posts, the skill re-analyzes the correlation between dimension scores and actual post quality (using my accept/reject decisions as ground truth).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 3: Add a verification oracle.&lt;/strong&gt; The skill can&apos;t be its own judge. I&apos;m adding a second agent — running on a different model (Gemini, to avoid same-model blindness) — that scores the same draft independently. If the two scores diverge by more than 2 points on any dimension, the draft gets flagged for human review. This is a primitive V-Component, but it catches the gaming problem.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# The workflow I&apos;m building
claude -p &quot;Score this draft using the blog rubric&quot; &amp;gt; score_claude.json
gemini -p &quot;Score this draft using the blog rubric&quot; &amp;gt; score_gemini.json
python3 compare_scores.py score_claude.json score_gemini.json
# Flags divergences &amp;gt; 2 points for human review
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Step 4: Never build Tier 3.&lt;/strong&gt; I don&apos;t need a system that redesigns its own evaluation architecture for a personal blog. The Improvement Stack is a map, not a mandate. Tier 1 with a verification oracle is the right complexity for my use case. Most production agents should be at Tier 1 or 2 — the verification problem at Tier 3 is unsolved, and deploying unsolved verification into production is how you get agents gaming your metrics at scale.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Where This Breaks&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Limitation&lt;/th&gt;
&lt;th&gt;Why It Breaks&lt;/th&gt;
&lt;th&gt;What To Do Instead&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-evaluation is inherently circular&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;An agent scoring its own output will inevitably optimize for its own rubric rather than the underlying objective&lt;/td&gt;
&lt;td&gt;Add an independent verification agent on a different model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Trajectory data is sparse&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8 blog posts isn&apos;t enough data for statistically significant weight recalibration. I need ~30+ runs for stable correlations&lt;/td&gt;
&lt;td&gt;Start with larger trajectory windows; don&apos;t recalibrate too frequently&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Accept/reject decisions are noisy ground truth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;My own judgment of &quot;good writing&quot; changes with mood, context, and topic. It&apos;s a noisy signal, not a clean label&lt;/td&gt;
&lt;td&gt;Use multiple ground-truth signals: my accept/reject + analytics data + reader feedback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tier 2 requires an optimization target that Tier 1 doesn&apos;t&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;In Tier 1, the rubric is fixed — it&apos;s infrastructure. In Tier 2, the rubric is a variable — and what do you optimize it &lt;em&gt;against&lt;/em&gt;? You need a meta-rubric, which is just the same problem one level up&lt;/td&gt;
&lt;td&gt;Accept that some human judgment is irreducible. The system improves the 80% that&apos;s measurable; you handle the 20% that isn&apos;t&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr /&gt;
&lt;h2&gt;The Harness Is the Product&lt;/h2&gt;
&lt;p&gt;The first version of this post scored 3.7/10 on my own skill&apos;s rubric. It was a literature review — ten papers summarized in sequence, frameworks named but not lived. The skill caught it. I rewrote it around what I actually built and what I&apos;m actually building next.&lt;/p&gt;
&lt;p&gt;That cycle — write, score, catch the failure, rewrite — is Tier 1 self-improvement in action. It&apos;s not glamorous. It&apos;s not recursive. It doesn&apos;t rewrite its own meta-agent code in isolated Docker containers. But it caught a bad draft before I shipped it, and that&apos;s the entire point of a harness.&lt;/p&gt;
&lt;p&gt;==The harness isn&apos;t scaffolding you&apos;ll remove when the model gets smart enough. The harness &lt;strong&gt;is&lt;/strong&gt; the product.== The model generates text. The harness determines whether that text is good enough to ship. And if the harness learns from each run — even at Tier 1 — it compounds.&lt;/p&gt;
&lt;p&gt;My next five posts will be scored by a rubric whose weights were adjusted by the rubric&apos;s own trajectory data. If that works, I&apos;ll have a Tier 2 system. If the rubric starts gaming itself, I&apos;ll know because the Gemini-based verification oracle will catch the divergence. And if it all falls apart, I&apos;ll write about that too. The blog is the experiment.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;Sharad Jain is an AI engineer and the author of &lt;a href=&quot;https://sharadja.in/blog/system-prompt-architecture-agentic-ai&quot;&gt;The 14K Token Debt&lt;/a&gt;, &lt;a href=&quot;https://sharadja.in/blog/terminal-first-agent-harness-unix-ai-agents&quot;&gt;The Terminal Was the First Agent Harness&lt;/a&gt;, &lt;a href=&quot;https://sharadja.in/blog/mcp-servers-performance-tax-claude-code&quot;&gt;Your MCP Servers Are Costing You 10 Seconds&lt;/a&gt;, and &lt;a href=&quot;https://sharadja.in/blog/claude-code-vs-gemini-cli-agentic-terminal&quot;&gt;Claude Code vs Gemini CLI&lt;/a&gt;. He writes about agent architecture, system prompts, and the infrastructure decisions that compound across every session. This is the fifth post in a series on the hidden mechanics of agentic AI systems.&lt;/em&gt;&lt;/p&gt;
</content:encoded><enclosure url="https://sharadja.in/astral_hacker.png" type="image/png" length="0"/><category>AI</category><category>agents</category><category>self-improvement</category><category>Claude-Code</category><category>harness</category><category>ADAS</category><category>HyperAgents</category><category>architecture</category><category>benchmarks</category><category>reinforcement-learning</category><author>Sharad Jain</author></item><item><title>Claude Code vs Gemini CLI: Two Philosophies of the Agentic Terminal</title><link>https://sharadja.in/blog/claude-code-vs-gemini-cli-agentic-terminal/</link><guid isPermaLink="true">https://sharadja.in/blog/claude-code-vs-gemini-cli-agentic-terminal/</guid><description>Both tools implement the same ReAct loop. Both read markdown context files. Both execute shell commands. But they represent opposite answers to the most important question in agent design: who holds the leash?</description><pubDate>Tue, 21 Apr 2026 14:00:00 GMT</pubDate><content:encoded>&lt;h1&gt;Claude Code vs Gemini CLI: Two Philosophies of the Agentic Terminal&lt;/h1&gt;
&lt;p&gt;I run both tools daily. Claude Code for deep coding sessions — refactoring, debugging, shipping blog posts. Gemini CLI for research, broad exploration, and tasks that touch systems outside my codebase. On any given day, I have a dozen terminal tabs open across both.&lt;/p&gt;
&lt;p&gt;The moment that crystallized the difference wasn&apos;t a benchmark. It was a behavioral observation. I gave both tools the same task: &quot;Find and fix the broken image path in my blog post.&quot; Claude Code read the file, identified the problem, then paused: &lt;em&gt;&quot;I&apos;d like to edit this file. Allow?&quot;&lt;/em&gt; I approved. It fixed the path, then paused again: &lt;em&gt;&quot;I&apos;d like to run the build to verify. Allow?&quot;&lt;/em&gt; Two permission gates for a two-step fix.&lt;/p&gt;
&lt;p&gt;Gemini CLI read the file, rewrote it, ran the build, and reported the result. No pauses. No gates. Four tool calls executed autonomously inside its ReAct loop.&lt;/p&gt;
&lt;p&gt;Same task. Same outcome. Opposite control philosophies. ==Claude Code asks &lt;em&gt;&quot;may I?&quot;&lt;/em&gt; at every boundary. Gemini CLI asks &lt;em&gt;&quot;what tools do I have?&quot;&lt;/em&gt; and then acts.== This isn&apos;t a UX preference. It&apos;s an architectural decision that cascades into everything — safety, speed, extensibility, failure modes, and who these tools are ultimately for.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Control Gradient&lt;/h2&gt;
&lt;p&gt;Every agent system sits somewhere on a spectrum between fully deterministic and fully probabilistic execution. I call this &lt;strong&gt;The Control Gradient&lt;/strong&gt; — the degree to which a developer retains direct control over what the agent does at runtime.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;┌────────────────────────────────────────────────────────────────────┐
│  Shell Script    Claude Code    Gemini CLI    Raw LLM API         │
│  ──────────────────────────────────────────────────────────────    │
│  100%            ~70%           ~40%          0%                   │
│  deterministic   deterministic  deterministic deterministic        │
│                                                                    │
│  Developer controls  Developer controls  Developer controls  No   │
│  every step          boundaries          capabilities        ctrl │
└────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Shell Script&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;th&gt;Gemini CLI&lt;/th&gt;
&lt;th&gt;Raw LLM API&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Control model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Imperative (every step coded)&lt;/td&gt;
&lt;td&gt;Gated (permission boundaries)&lt;/td&gt;
&lt;td&gt;Equipped (provide tools, model decides)&lt;/td&gt;
&lt;td&gt;Unconstrained&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Safety&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;By construction&lt;/td&gt;
&lt;td&gt;By permission gates + hooks&lt;/td&gt;
&lt;td&gt;By &lt;code&gt;shouldConfirmExecute&lt;/code&gt; flag&lt;/td&gt;
&lt;td&gt;By prayer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Instant&lt;/td&gt;
&lt;td&gt;Slower (human-in-loop)&lt;/td&gt;
&lt;td&gt;Faster (autonomous)&lt;/td&gt;
&lt;td&gt;Fastest (no guardrails)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Flexibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None (fixed logic)&lt;/td&gt;
&lt;td&gt;High (model reasons within gates)&lt;/td&gt;
&lt;td&gt;Highest (model reasons freely)&lt;/td&gt;
&lt;td&gt;Unbounded&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Failure mode&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Crashes loudly&lt;/td&gt;
&lt;td&gt;Blocks on permission&lt;/td&gt;
&lt;td&gt;Loops silently&lt;/td&gt;
&lt;td&gt;Hallucinates confidently&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The position on this gradient isn&apos;t arbitrary. It reflects a fundamental design bet about the maturity of the underlying model. Anthropic bets that models aren&apos;t yet reliable enough to act unsupervised — so Claude Code defaults to asking permission. Google bets that models are reliable enough to choose their own tools — so Gemini CLI defaults to autonomous execution with a confirmation gate only on &quot;dangerous&quot; operations like shell commands and file writes.&lt;/p&gt;
&lt;p&gt;Neither bet is wrong. They&apos;re optimized for different failure costs.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Architecture: Same Loop, Different Harness&lt;/h2&gt;
&lt;p&gt;Both Claude Code and Gemini CLI implement the &lt;a href=&quot;https://arxiv.org/abs/2210.03629&quot;&gt;ReAct pattern&lt;/a&gt; — the observe-reason-act loop that powers virtually every modern AI agent. But the &lt;em&gt;harness&lt;/em&gt; wrapped around that loop is where the philosophies diverge.&lt;/p&gt;
&lt;h3&gt;Claude Code&apos;s Gated Loop&lt;/h3&gt;
&lt;p&gt;Claude Code&apos;s agentic loop runs inside a permission-gated harness. The model reasons and proposes a tool call. The harness intercepts it. Depending on the &lt;a href=&quot;https://docs.anthropic.com/en/docs/claude-code/overview&quot;&gt;permission mode&lt;/a&gt; — &lt;code&gt;prompt&lt;/code&gt; (ask for everything), &lt;code&gt;auto-edit&lt;/code&gt; (allow file edits, ask for commands), or &lt;code&gt;full-auto&lt;/code&gt; (allow everything in an approved list) — the tool call either executes immediately or blocks for human approval.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;User prompt → Model reasons → Proposes tool call
                                      ↓
                              Permission gate
                              ├── Allowed? → Execute → Observe result → Loop
                              └── Blocked? → Ask human → Approve/Deny → Loop
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The critical addition is &lt;strong&gt;hooks&lt;/strong&gt; — developer-defined scripts that fire at specific points in the agent&apos;s lifecycle. A &lt;code&gt;PostToolUse&lt;/code&gt; hook on &lt;code&gt;FileEdit&lt;/code&gt; can deterministically run &lt;code&gt;prettier&lt;/code&gt; on every file the agent touches. A &lt;code&gt;PreToolUse&lt;/code&gt; hook on &lt;code&gt;Bash&lt;/code&gt; can block commands matching a regex. These aren&apos;t suggestions to the model. They&apos;re code that executes outside the model&apos;s control, injecting deterministic behavior into a probabilistic system.&lt;/p&gt;
&lt;p&gt;This is the pattern I called &lt;a href=&quot;https://sharadja.in/blog/terminal-first-agent-harness-unix-ai-agents&quot;&gt;The Determinism Dividend&lt;/a&gt; — every piece of agent behavior you move from stochastic to deterministic is a compounding reliability gain. Claude Code&apos;s hooks are the most direct implementation of this principle in any production agent system.&lt;/p&gt;
&lt;h3&gt;Gemini CLI&apos;s Autonomous Loop&lt;/h3&gt;
&lt;p&gt;Gemini CLI&apos;s loop is architecturally simpler. The model receives context (system prompt + &lt;code&gt;GEMINI.md&lt;/code&gt; + conversation history + tool definitions), reasons, and executes tool calls through a &lt;a href=&quot;https://github.com/google-gemini/gemini-cli&quot;&gt;ReAct cycle managed by &lt;code&gt;agent.ts&lt;/code&gt;&lt;/a&gt;. The only interruption is the &lt;code&gt;shouldConfirmExecute&lt;/code&gt; flag on &quot;dangerous&quot; tools — &lt;code&gt;write_file&lt;/code&gt;, &lt;code&gt;run_shell_command&lt;/code&gt;, &lt;code&gt;replace&lt;/code&gt; — which triggers a y/n confirmation.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;User prompt → Model reasons → Proposes tool call
                                      ↓
                              shouldConfirmExecute?
                              ├── false → Execute immediately → Observe → Loop
                              └── true  → Ask human (y/n) → Execute → Loop
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;No hooks. No pre/post lifecycle events. No deterministic injection points. The model&apos;s autonomy is bounded only by which tools exist in its registry and whether those tools are flagged as dangerous. The &lt;a href=&quot;https://github.com/google-gemini/gemini-cli/issues/2779&quot;&gt;community has requested a hooks system&lt;/a&gt; — the feature request exists because the gap is real.&lt;/p&gt;
&lt;p&gt;The trade-off is speed versus safety. In my daily usage, Gemini CLI completes multi-file tasks 30-40% faster than Claude Code in its default &lt;code&gt;prompt&lt;/code&gt; mode, because it doesn&apos;t block on every file edit. But when Gemini&apos;s reasoning goes sideways — and it does — there&apos;s no programmatic circuit breaker. You catch errors by reading the output, not by gating the execution.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Context Engineering: &lt;a href=&quot;http://CLAUDE.md&quot;&gt;CLAUDE.md&lt;/a&gt; vs &lt;a href=&quot;http://GEMINI.md&quot;&gt;GEMINI.md&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Both tools use the same pattern for project-specific context: a markdown file in the project root that&apos;s injected into the system prompt on every session. &lt;code&gt;CLAUDE.md&lt;/code&gt; and &lt;code&gt;GEMINI.md&lt;/code&gt; are structurally identical — hierarchical (root overrides subdirectory), version-controllable, and treated as high-priority instructions.&lt;/p&gt;
&lt;p&gt;The philosophical difference is what &lt;em&gt;else&lt;/em&gt; fills the context window.&lt;/p&gt;
&lt;h3&gt;The Context Window Gap (and Why It Matters Less Than You Think)&lt;/h3&gt;
&lt;p&gt;Gemini CLI runs on Gemini 2.5 Pro with a &lt;strong&gt;1,000,000 token&lt;/strong&gt; context window. Claude Code runs on Claude models with a &lt;strong&gt;200,000 token&lt;/strong&gt; window. On paper, a 5x advantage.&lt;/p&gt;
&lt;p&gt;In practice, the gap narrows dramatically because of what I called &lt;a href=&quot;https://sharadja.in/blog/mcp-servers-performance-tax-claude-code&quot;&gt;Schema Gravity&lt;/a&gt; — the invisible weight of MCP tool definitions consuming context before any reasoning begins. A single MCP server can inject 26,000 tokens of schemas. Five servers consume 55,000+. &lt;a href=&quot;https://www.anthropic.com/engineering/advanced-tool-use&quot;&gt;Anthropic&apos;s own team hit 134,000 tokens&lt;/a&gt; of tool definitions — 67% of a 200K window — before a single user message was processed.&lt;/p&gt;
&lt;p&gt;Gemini CLI faces the same problem at a different scale. Its built-in tools (&lt;code&gt;read_file&lt;/code&gt;, &lt;code&gt;write_file&lt;/code&gt;, &lt;code&gt;run_shell_command&lt;/code&gt;, &lt;code&gt;glob&lt;/code&gt;, &lt;code&gt;grep&lt;/code&gt;, &lt;code&gt;google_web_search&lt;/code&gt;, &lt;code&gt;web_fetch&lt;/code&gt;, &lt;code&gt;save_memory&lt;/code&gt;) plus any configured MCP servers all inject schemas into its 1M window. The raw capacity is larger, but the tax is proportional.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;th&gt;Gemini CLI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Raw context window&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;200,000 tokens&lt;/td&gt;
&lt;td&gt;1,000,000 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Typical schema overhead&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;30,000-55,000 tokens&lt;/td&gt;
&lt;td&gt;20,000-40,000 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Effective reasoning space&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~145,000-170,000 tokens&lt;/td&gt;
&lt;td&gt;~960,000 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mitigation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tool Search (89% schema reduction)&lt;/td&gt;
&lt;td&gt;None (no deferred loading)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Compression trigger&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Adaptive&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://github.com/google-gemini/gemini-cli/issues/16213&quot;&gt;20% threshold&lt;/a&gt; (aggressive)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Claude Code&apos;s &lt;strong&gt;Tool Search&lt;/strong&gt; — which defers schema loading until a tool is actually needed — recovers up to 89% of schema overhead. Gemini CLI has no equivalent. Every tool definition is injected on every turn, regardless of relevance.&lt;/p&gt;
&lt;h3&gt;The Compression Problem&lt;/h3&gt;
&lt;p&gt;Both tools compress conversation history when the context window fills. But their strategies differ in failure-significant ways.&lt;/p&gt;
&lt;p&gt;Claude Code uses adaptive compression — the system summarizes earlier conversation turns, retaining key decisions and code blocks while discarding conversational filler. The trigger threshold is adaptive and generally well-behaved.&lt;/p&gt;
&lt;p&gt;Gemini CLI&apos;s compression triggers at a fixed &lt;strong&gt;20% context usage threshold&lt;/strong&gt; — &lt;a href=&quot;https://github.com/google-gemini/gemini-cli/discussions/12311&quot;&gt;aggressively lowered in v0.11.3+&lt;/a&gt;. This creates a specific instability: the &lt;strong&gt;Context Compression Loop&lt;/strong&gt;. If the compression only marginally reduces token count (e.g., from 20.1% to 19.9%), the next user message pushes it back over the threshold. The user sees &quot;Compressing chat history...&quot; on every single turn. There&apos;s no back-off mechanism. &lt;a href=&quot;https://github.com/google-gemini/gemini-cli/issues/16213&quot;&gt;The GitHub issue&lt;/a&gt; documenting this is one of the most-discussed failure modes in the Gemini CLI community.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Harness Inversion&lt;/h2&gt;
&lt;p&gt;Here&apos;s the framework that explains the architectural divergence at its deepest level.&lt;/p&gt;
&lt;p&gt;==&lt;strong&gt;The Harness Inversion&lt;/strong&gt;: Claude Code and Gemini CLI represent opposite answers to the same design question — does the developer control what the agent &lt;em&gt;does&lt;/em&gt;, or what the agent &lt;em&gt;has&lt;/em&gt;?==&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Claude Code: Control what the agent does.&lt;/strong&gt; The developer defines boundaries (permission modes), lifecycle events (hooks), and procedural shortcuts (slash commands). The model operates freely &lt;em&gt;within&lt;/em&gt; those boundaries but cannot cross them without human approval. The developer is a &lt;strong&gt;workflow architect&lt;/strong&gt; — designing the constraints that shape agent behavior.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Gemini CLI: Control what the agent has.&lt;/strong&gt; The developer provides capabilities (MCP servers, extensions, &lt;code&gt;GEMINI.md&lt;/code&gt; context). The model decides autonomously which capabilities to use and when. The developer is a &lt;strong&gt;tool provider&lt;/strong&gt; — equipping the agent and trusting it to make good decisions.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Claude Code (Workflow Architect)&lt;/th&gt;
&lt;th&gt;Gemini CLI (Tool Provider)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary automation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hooks + Slash Commands&lt;/td&gt;
&lt;td&gt;MCP Servers + Extensions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Developer&apos;s job&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Design constraints&lt;/td&gt;
&lt;td&gt;Provide capabilities&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Control type&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deterministic gates in the loop&lt;/td&gt;
&lt;td&gt;Probabilistic tool selection by the model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Guardrail automation (&quot;always format after edit&quot;)&lt;/td&gt;
&lt;td&gt;Capability automation (&quot;connect to Slack, DB, cloud&quot;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Enterprise fit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (auditable, enforceable)&lt;/td&gt;
&lt;td&gt;Medium (flexible, but less predictable)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CI/CD story&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Headless mode, custom scripting&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://blog.google/technology/developers/introducing-gemini-cli-github-actions/&quot;&gt;First-class GitHub Actions&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The Harness Inversion explains why feature requests flow in opposite directions. Gemini users request hooks (deterministic control they lack). Claude users request broader MCP support and more autonomous modes (capability freedom they lack). Each tool&apos;s users are asking for what the other tool already has.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Where Each Breaks&lt;/h2&gt;
&lt;h3&gt;Claude Code&apos;s Failure Modes&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;1. The 200K ceiling for massive codebases.&lt;/strong&gt; When a monorepo has 500+ files and the relevant context spans 300K tokens, Claude Code can&apos;t hold it all. The workaround — strategic chunking, sub-agents for parallel exploration — works but adds complexity that Gemini&apos;s 1M window avoids entirely.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Permission fatigue in default mode.&lt;/strong&gt; In &lt;code&gt;prompt&lt;/code&gt; mode, a 20-step task generates 20 permission prompts. Developers start approving reflexively — defeating the purpose of the safety gate. &lt;code&gt;auto-edit&lt;/code&gt; mode helps but still blocks on shell commands, which are often the majority of agentic actions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. Proprietary lock-in.&lt;/strong&gt; Claude Code is closed-source. You can&apos;t fork it, audit its system prompt, or modify its tool execution logic. For enterprises with strict security requirements, this is a non-trivial constraint. Gemini CLI&apos;s Apache 2.0 license allows full audit, forking, and customization.&lt;/p&gt;
&lt;h3&gt;Gemini CLI&apos;s Failure Modes&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;1. The Thinking Loop.&lt;/strong&gt; The most severe failure mode. The model enters an infinite reasoning cycle — displaying &quot;Thinking...&quot; indefinitely or producing repetitive reasoning traces without ever calling a tool or producing a final answer. The root cause: the ReAct loop fails to reach a termination condition, often because a tool returns an ambiguous error that the model retries endlessly. The only fix is Ctrl+C and &lt;code&gt;/clear&lt;/code&gt; to reset the poisoned context.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Tool execution regressions.&lt;/strong&gt; The &lt;code&gt;write_file&lt;/code&gt; tool has been &lt;a href=&quot;https://github.com/google-gemini/gemini-cli/issues&quot;&gt;reported to fail silently or crash after 2-3 write attempts&lt;/a&gt; in recent versions. An internal audit revealed that ESLint suppressions (&lt;code&gt;@typescript-eslint/no-floating-promises&lt;/code&gt;) were masking race conditions in &lt;code&gt;useGeminiStream.ts&lt;/code&gt;, allowing bugs to ship in release builds.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. Prompt injection via &lt;a href=&quot;http://GEMINI.md&quot;&gt;GEMINI.md&lt;/a&gt;.&lt;/strong&gt; The reliance on user-defined context files creates a vector for &lt;a href=&quot;https://tracebit.com/blog/code-exec-deception-gemini-ai-cli-hijack&quot;&gt;prompt injection attacks&lt;/a&gt;. Malicious instructions hidden in a project&apos;s &lt;code&gt;GEMINI.md&lt;/code&gt; can trick the agent into executing shell commands. Claude Code&apos;s permission gates are the primary defense against this class of attack — Gemini&apos;s &lt;code&gt;shouldConfirmExecute&lt;/code&gt; is the only barrier, and if a user approves reflexively, the system is compromised.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Hybrid Strategy&lt;/h2&gt;
&lt;p&gt;The most productive setup I&apos;ve found isn&apos;t choosing one tool. It&apos;s orchestrating both based on their strengths.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Inner loop (local development) → Claude Code.&lt;/strong&gt; Planning, implementation, debugging, test-driven development. The permission gates catch mistakes before they hit the filesystem. Hooks enforce formatting and linting automatically. The premium UX — polished interface, thoughtful error messages, &lt;code&gt;Shift+Tab&lt;/code&gt; to interrupt — makes the interactive session feel like pair programming with a senior engineer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Outer loop (CI/CD, integration) → Gemini CLI.&lt;/strong&gt; PR reviews via GitHub Actions. Security scanning via the &lt;code&gt;/security:analyze&lt;/code&gt; extension. Deployment automation via the &lt;code&gt;/deploy&lt;/code&gt; extension. The open-source, extensible architecture integrates naturally into pipeline workflows where human-in-the-loop approval happens at the PR level, not at every tool call.&lt;/p&gt;
&lt;p&gt;The most sophisticated version of this — which I haven&apos;t built yet but the architecture supports — is a multi-agent pipeline:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;1. Bug reported (GitHub issue)
   → Gemini CLI GitHub Action triages, labels, assigns

2. Reproduction
   → Claude Code reads the issue, writes a failing test, commits to a branch

3. Context enrichment
   → Gemini CLI + custom MCP server queries production logs for related errors

4. Fix
   → Claude Code writes the fix, runs the test suite, creates a PR

5. Review
   → Gemini CLI GitHub Action posts automated review comments

6. Merge + Deploy
   → Human approves → Gemini CLI /deploy extension ships to production
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This isn&apos;t hypothetical architecture. Every component exists today. The missing piece is the orchestration layer that chains them — and that&apos;s a &lt;code&gt;bash&lt;/code&gt; script, a GitHub Actions workflow, or a purpose-built coordinator.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Choice Isn&apos;t Binary&lt;/h2&gt;
&lt;p&gt;Claude Code and Gemini CLI aren&apos;t competing products in the way VS Code competes with JetBrains. They&apos;re competing &lt;em&gt;philosophies&lt;/em&gt; about how much autonomy an AI agent should have in a developer&apos;s terminal.&lt;/p&gt;
&lt;p&gt;Claude Code&apos;s bet: models aren&apos;t reliable enough yet. Gate everything. Let the developer inject deterministic behavior at every boundary. Trade speed for safety. The Harness Inversion points inward — the developer shapes the agent&apos;s behavior through constraints.&lt;/p&gt;
&lt;p&gt;Gemini CLI&apos;s bet: models are reliable enough to choose their own tools. Equip the agent with capabilities and let it reason. Trade safety for speed. The Harness Inversion points outward — the developer shapes the agent&apos;s behavior by controlling what it can access.&lt;/p&gt;
&lt;p&gt;Both bets will be validated by the same thing: how fast the underlying models improve. If models get dramatically more reliable in the next 12 months, Gemini&apos;s autonomous approach wins — permission gates become unnecessary friction. If models plateau in reliability, Claude&apos;s gated approach wins — deterministic guardrails remain essential infrastructure.&lt;/p&gt;
&lt;p&gt;==The question isn&apos;t which tool is better. It&apos;s which failure mode you can tolerate: an agent that moves slowly because it asks too many questions, or an agent that moves fast and occasionally breaks things you didn&apos;t expect.==&lt;/p&gt;
&lt;p&gt;My answer is both. Different tools for different failure costs. The agentic terminal isn&apos;t a single tool — it&apos;s a toolkit.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;Sharad Jain is an AI engineer and the author of &lt;a href=&quot;https://sharadja.in/blog/system-prompt-architecture-agentic-ai&quot;&gt;The 14K Token Debt&lt;/a&gt;, &lt;a href=&quot;https://sharadja.in/blog/terminal-first-agent-harness-unix-ai-agents&quot;&gt;The Terminal Was the First Agent Harness&lt;/a&gt;, and &lt;a href=&quot;https://sharadja.in/blog/mcp-servers-performance-tax-claude-code&quot;&gt;Your MCP Servers Are Costing You 10 Seconds&lt;/a&gt;. He writes about agent architecture, system prompts, and the infrastructure decisions that compound across every session. This is the fourth post in a series on the hidden mechanics of agentic AI systems.&lt;/em&gt;&lt;/p&gt;
</content:encoded><enclosure url="https://sharadja.in/astral_hacker.png" type="image/png" length="0"/><category>AI</category><category>Claude-Code</category><category>Gemini-CLI</category><category>agents</category><category>terminal</category><category>MCP</category><category>architecture</category><category>Anthropic</category><category>Google</category><category>developer-experience</category><author>Sharad Jain</author></item><item><title>Your MCP Servers Are Costing You 10 Seconds Per Session</title><link>https://sharadja.in/blog/mcp-servers-performance-tax-claude-code/</link><guid isPermaLink="true">https://sharadja.in/blog/mcp-servers-performance-tax-claude-code/</guid><description>Every MCP server you configure silently taxes you three ways: startup latency, token overhead, and context pollution. Most users never measure this. Here&apos;s what I found when I did.</description><pubDate>Mon, 20 Apr 2026 14:00:00 GMT</pubDate><content:encoded>&lt;h1&gt;Your MCP Servers Are Costing You 10 Seconds Per Session&lt;/h1&gt;
&lt;p&gt;I had nine MCP servers configured globally in Claude Code. Chrome DevTools. Browser automation. A financial data connector. Jules for async delegation. Gmail. Gemini. Reddit. Sequential thinking. A custom brain server for searching past sessions.&lt;/p&gt;
&lt;p&gt;Most were useful — sometimes. A few were essential. But all nine booted every time I typed &lt;code&gt;claude&lt;/code&gt; and hit enter.&lt;/p&gt;
&lt;p&gt;One afternoon I timed it. From keystroke to cursor: &lt;strong&gt;11 seconds.&lt;/strong&gt; I disabled six servers I didn&apos;t need for my current project. &lt;strong&gt;3 seconds.&lt;/strong&gt; Same machine, same model, same context. The only variable was the invisible infrastructure spinning up behind the prompt.&lt;/p&gt;
&lt;p&gt;That 8-second gap is the MCP tax. And it&apos;s the least expensive part.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What Happens When You Type &lt;code&gt;claude&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;Before you see your first cursor, Claude Code executes a boot sequence most users never think about:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;1. Parse ~/.claude/settings.json
2. For each configured MCP server:
   a. Spawn a subprocess (stdio transport)
   b. Wait for the server to initialize
   c. Request tool definitions (listTools)
   d. Collect JSON schemas for every tool
3. Inject ALL tool schemas into the system context
4. Load CLAUDE.md files (global → project → subdirectory)
5. Load memory files
6. Ready.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Steps 2a-2d happen &lt;strong&gt;for every server, concurrently.&lt;/strong&gt; Servers initialize in parallel, but the session blocks until the slowest one finishes — your boot time equals &lt;code&gt;MAX(server_boot_times)&lt;/code&gt;, not the sum. In theory, nine servers shouldn&apos;t be much slower than one. In practice, they are: each subprocess competes for CPU, disk I/O, and network (npm registry calls). On my M1 Pro, one &lt;code&gt;npx -y&lt;/code&gt; server boots in 3 seconds. Three boot in 4. Nine boot in 11 — not because they&apos;re sequential, but because resource contention turns parallelism into a bottleneck. One cold &lt;code&gt;npx -y&lt;/code&gt; call stalls the event loop while npm resolves dependencies; nine of them thrash the disk cache simultaneously.&lt;/p&gt;
&lt;p&gt;This is the operational cost. But there&apos;s a larger, invisible cost that persists after boot.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Two Taxes&lt;/h2&gt;
&lt;h3&gt;Tax 1: Startup Latency&lt;/h3&gt;
&lt;p&gt;Every MCP server is a subprocess. Claude Code spawns each one, establishes a JSON-RPC connection over stdio, and waits for it to report its capabilities. The latency depends on the server&apos;s runtime:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Server Type&lt;/th&gt;
&lt;th&gt;Cold Start&lt;/th&gt;
&lt;th&gt;Warm Start&lt;/th&gt;
&lt;th&gt;What Causes It&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;npx -y @package/server&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3-5s&lt;/td&gt;
&lt;td&gt;0.5-1s&lt;/td&gt;
&lt;td&gt;npm downloads the package if not cached&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Python (&lt;code&gt;uv run&lt;/code&gt; / venv)&lt;/td&gt;
&lt;td&gt;1-3s&lt;/td&gt;
&lt;td&gt;0.3-0.5s&lt;/td&gt;
&lt;td&gt;Virtual environment activation, dependency resolution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Native binary (&lt;code&gt;qmd mcp&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;0.1-0.3s&lt;/td&gt;
&lt;td&gt;0.1s&lt;/td&gt;
&lt;td&gt;Just a fork+exec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local Node.js (pre-installed)&lt;/td&gt;
&lt;td&gt;0.3-0.5s&lt;/td&gt;
&lt;td&gt;0.2s&lt;/td&gt;
&lt;td&gt;V8 startup + module loading&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The &lt;code&gt;npx -y&lt;/code&gt; pattern is the worst offender. It&apos;s the default in most MCP server installation guides — &quot;just add &lt;code&gt;npx -y @modelcontextprotocol/server-whatever&lt;/code&gt; to your config.&quot; What those guides don&apos;t mention: &lt;code&gt;npx -y&lt;/code&gt; checks the npm registry on every invocation. If the package isn&apos;t in your local cache, it downloads it. If npm is slow, your session is slow. If you&apos;re offline, the server fails entirely. &lt;a href=&quot;https://benswift.me/blog/2025/12/17/fixing-slow-zed-agent-startup-times/&quot;&gt;Independent measurements&lt;/a&gt; show npx-launched MCP servers taking &lt;strong&gt;10-15 seconds&lt;/strong&gt; to start — turning a 5-second agent startup into a 30-second wait. The &lt;a href=&quot;https://github.com/google-gemini/gemini-cli/issues/4544&quot;&gt;Gemini CLI&lt;/a&gt; hit the same problem: 8-12 seconds blocked on MCP server initialization.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The fix is embarrassingly simple:&lt;/strong&gt; install the package globally once.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Instead of relying on npx -y every session:
npm install -g @anthropic-ai/claude-code-mcp-server

# Then in settings.json, use the binary directly:
&quot;command&quot;: &quot;claude-code-mcp-server&quot;
# Instead of:
&quot;command&quot;: &quot;npx&quot;, &quot;args&quot;: [&quot;-y&quot;, &quot;@anthropic-ai/claude-code-mcp-server&quot;]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The runtime language matters too. &lt;a href=&quot;https://www.tmdevlab.com/mcp-server-performance-benchmark.html&quot;&gt;Multi-language benchmarks&lt;/a&gt; show stark differences in MCP server performance over stdio:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Runtime&lt;/th&gt;
&lt;th&gt;Avg Latency&lt;/th&gt;
&lt;th&gt;Memory&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;0.855ms&lt;/td&gt;
&lt;td&gt;18 MB&lt;/td&gt;
&lt;td&gt;1,624 RPS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Java&lt;/td&gt;
&lt;td&gt;0.835ms&lt;/td&gt;
&lt;td&gt;226 MB&lt;/td&gt;
&lt;td&gt;1,624 RPS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node.js&lt;/td&gt;
&lt;td&gt;10.66ms&lt;/td&gt;
&lt;td&gt;110 MB&lt;/td&gt;
&lt;td&gt;559 RPS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;26.45ms&lt;/td&gt;
&lt;td&gt;98 MB&lt;/td&gt;
&lt;td&gt;292 RPS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Go servers use 6x less memory than Node.js and respond 12x faster. If you&apos;re choosing between MCP server implementations, the runtime isn&apos;t just a preference — it&apos;s a performance multiplier.&lt;/p&gt;
&lt;p&gt;In theory, startup latency should plateau — parallel servers boot simultaneously. In practice, it scales sub-linearly but significantly: resource contention between concurrent subprocesses means each additional server degrades the boot time of every other server. My measurements: 1 server = 1s, 3 servers = 4s, 9 servers = 11s. Fewer servers don&apos;t just save their own boot time — they give every remaining server more headroom to initialize faster.&lt;/p&gt;
&lt;h3&gt;Tax 2: Schema Gravity&lt;/h3&gt;
&lt;p&gt;This is the tax nobody sees.&lt;/p&gt;
&lt;p&gt;In &lt;a href=&quot;https://sharadja.in/blog/system-prompt-architecture-agentic-ai&quot;&gt;The 14K Token Debt&lt;/a&gt;, I introduced the concept of &lt;strong&gt;Prompt Gravity&lt;/strong&gt; — the observation that your system prompt&apos;s initial tokens create an attentional bias that shapes every subsequent generation. The heavier the prompt, the stronger the gravitational pull on the model&apos;s reasoning.&lt;/p&gt;
&lt;p&gt;MCP tool schemas exert the same force. I call this &lt;strong&gt;Schema Gravity&lt;/strong&gt;: ==the invisible weight of tool definitions pulling on your context window and the model&apos;s attention before any reasoning begins.==&lt;/p&gt;
&lt;p&gt;The scale of the problem isn&apos;t speculative — Anthropic measured it themselves. Their &lt;a href=&quot;https://www.anthropic.com/engineering/advanced-tool-use&quot;&gt;engineering blog on advanced tool use&lt;/a&gt; reports the numbers:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Token Overhead&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GitHub MCP server alone (35 tools)&lt;/td&gt;
&lt;td&gt;~26,000 tokens&lt;/td&gt;
&lt;td&gt;Anthropic Engineering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5 servers (GitHub + Slack + Sentry + Grafana + Splunk)&lt;/td&gt;
&lt;td&gt;~55,000 tokens&lt;/td&gt;
&lt;td&gt;Anthropic Engineering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic&apos;s internal setup before optimization&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;134,000 tokens&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Anthropic Engineering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11 servers, 137 tools (community measurement)&lt;/td&gt;
&lt;td&gt;27,462 tokens&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://github.com/modelcontextprotocol/modelcontextprotocol/issues/1576&quot;&gt;MCP Spec Issue #1576&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4 MCP servers, query: &quot;What&apos;s 2+2?&quot;&lt;/td&gt;
&lt;td&gt;~15,000 tokens (~$0.21)&lt;/td&gt;
&lt;td&gt;&lt;a href=&quot;https://github.com/anthropics/claude-code/issues/3406&quot;&gt;Claude Code Issue #3406&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Read that last row again. Asking Claude Code &quot;What&apos;s 2+2?&quot; with four MCP servers configured costs $0.21 in token overhead — 14,000 of the 15,000 tokens consumed were wasted on tool schemas for a query that needed zero tools.&lt;/p&gt;
&lt;p&gt;Annualize it: a power user running 30 sessions per day with $0.21 of schema overhead per session is burning &lt;strong&gt;$2,300/year&lt;/strong&gt; in wasted tokens. A team of 10 engineers? &lt;strong&gt;$23,000/year&lt;/strong&gt; — on tool definitions nobody reads.&lt;/p&gt;
&lt;p&gt;A single MCP server can inject 26,000 tokens of schema definitions. Five servers? You&apos;re burning 55,000+ tokens. Anthropic&apos;s own team hit 134,000 tokens — roughly &lt;strong&gt;67% of a 200K context window&lt;/strong&gt; — consumed by tool definitions alone. Before the model reads your &lt;a href=&quot;http://CLAUDE.md&quot;&gt;CLAUDE.md&lt;/a&gt;. Before it loads your memory. Before it processes a single word of your actual request.&lt;/p&gt;
&lt;p&gt;This is Schema Gravity in action. Those tool definitions compete directly with your conversation for the model&apos;s attention. The transformer&apos;s self-attention cost is O(n²) in sequence length — extra tokens don&apos;t add linearly, they compound. A 200K context with 134K tokens of schemas has attention patterns spread across 65% noise before your actual question is even processed. This is the &quot;&lt;a href=&quot;https://arxiv.org/abs/2307.03172&quot;&gt;lost in the middle&lt;/a&gt;&quot; effect applied to tool definitions: the model&apos;s ability to retrieve and reason about relevant context degrades as the total context grows, and schemas push everything else further from the attention hotspots at the beginning and end of the window.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;One caveat:&lt;/strong&gt; modern API deployments use &lt;a href=&quot;https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching&quot;&gt;prompt caching&lt;/a&gt;. If your MCP schemas are identical across sessions — which they typically are — the cached prefix means you pay full input cost only on the first turn. Subsequent turns hit the cache and the &lt;em&gt;dollar cost&lt;/em&gt; drops significantly. But the &lt;em&gt;attention cost&lt;/em&gt; doesn&apos;t. Cached or not, those 55K tokens of tool definitions still occupy the context window and compete with your conversation for the model&apos;s reasoning bandwidth. Schema Gravity is an attention problem, not just a billing problem.&lt;/p&gt;
&lt;p&gt;The irony is sharp: tools designed to make the agent more capable can make it measurably worse by consuming the context it needs to think.&lt;/p&gt;
&lt;h3&gt;Tax 2b: Tool Selection Noise&lt;/h3&gt;
&lt;p&gt;Schema Gravity&apos;s second-order effect is &lt;strong&gt;tool selection noise.&lt;/strong&gt; When you configure nine MCP servers, Claude sees 50-200 tool definitions at conversation start. For any given task, you need 3-5 tools. The rest is decision overhead.&lt;/p&gt;
&lt;p&gt;This is the &lt;code&gt;$PATH&lt;/code&gt; problem. Every Unix system has a &lt;code&gt;$PATH&lt;/code&gt; with thousands of executables. Ambiguous names resolve to the wrong binary. For agents, more tools means more potential for wrong tool selection (choosing browser automation when &lt;code&gt;grep&lt;/code&gt; would suffice), tool hallucination (inventing calls that combine features from multiple schemas), and wasted reasoning tokens evaluating irrelevant options.&lt;/p&gt;
&lt;p&gt;Anthropic&apos;s own data confirms this: before Tool Search, tool selection accuracy on Opus 4 was just &lt;strong&gt;49%&lt;/strong&gt; — worse than a coin flip. After deferring unused schemas, accuracy jumped to &lt;strong&gt;74%&lt;/strong&gt;. Fewer visible tools = better tool choices.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;https://docs.anthropic.com/en/docs/claude-code/overview&quot;&gt;Claude Code documentation&lt;/a&gt; acknowledges this directly: CLI tools like &lt;code&gt;gh&lt;/code&gt;, &lt;code&gt;aws&lt;/code&gt;, and &lt;code&gt;gcloud&lt;/code&gt; don&apos;t add schema injection because the model already knows them from pre-training. The model&apos;s training data &lt;em&gt;is&lt;/em&gt; the documentation. You get the tool for free — no schema, no startup, no context cost.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Audit: Do This Right Now&lt;/h2&gt;
&lt;p&gt;Open your Claude Code settings:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cat ~/.claude/settings.json | grep -A2 &apos;&quot;mcpServers&quot;&apos;
# Or just: /mcp inside a Claude Code session
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Count your servers. For each one, ask:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;If Yes&lt;/th&gt;
&lt;th&gt;If No&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Do I use this server in every project?&lt;/td&gt;
&lt;td&gt;Keep as global&lt;/td&gt;
&lt;td&gt;Move to project-level&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Could I use a CLI tool instead?&lt;/td&gt;
&lt;td&gt;Remove the MCP server&lt;/td&gt;
&lt;td&gt;Keep&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Does it need to maintain state across calls?&lt;/td&gt;
&lt;td&gt;MCP is justified&lt;/td&gt;
&lt;td&gt;CLI might be better&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Is it an &lt;code&gt;npx -y&lt;/code&gt; server?&lt;/td&gt;
&lt;td&gt;Install globally&lt;/td&gt;
&lt;td&gt;Already optimized&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Here&apos;s what my audit looked like:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Server&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;th&gt;Reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;brain&lt;/code&gt; (qmd)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Keep global&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cross-project memory — used every session&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sequential-thinking&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Keep global&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Used for planning in most sessions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;chrome-devtools&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Move to project&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Only needed for frontend/QA work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;browser-mcp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Move to project&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Same — frontend-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fi_mcp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Move to project&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Financial data — one project only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;jules&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Move to project&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent delegation — specific workflows only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gmail&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Remove&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gws gmail&lt;/code&gt; CLI works better&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gemini&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Remove&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rarely used, CLI alternative exists&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;reddit&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Remove&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Used once, never again&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; 9 global servers → 2 global + 4 project-scoped + 3 removed. Boot time dropped from 11 seconds to 3. Context overhead dropped by roughly 70%.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Progressive Disclosure Fix&lt;/h2&gt;
&lt;p&gt;The solution isn&apos;t &quot;fewer MCP servers.&quot; It&apos;s &lt;strong&gt;the right servers at the right time.&lt;/strong&gt; This is progressive disclosure applied to infrastructure: start with the minimum, load more on demand.&lt;/p&gt;
&lt;h3&gt;1. Project-Level Scoping&lt;/h3&gt;
&lt;p&gt;Move servers from &lt;code&gt;~/.claude/settings.json&lt;/code&gt; (global) to your project&apos;s &lt;code&gt;.claude/settings.json&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// In ~/your-project/.claude/settings.json
{
  &quot;mcpServers&quot;: {
    &quot;chrome-devtools&quot;: {
      &quot;command&quot;: &quot;chrome-devtools-mcp&quot;,
      &quot;args&quot;: [&quot;--port&quot;, &quot;9222&quot;]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This server only boots when you run &lt;code&gt;claude&lt;/code&gt; inside that project directory. Every other project skips it entirely.&lt;/p&gt;
&lt;h3&gt;2. Tool Search (Deferred Loading)&lt;/h3&gt;
&lt;p&gt;Claude Code supports &lt;strong&gt;deferred tool loading&lt;/strong&gt; — MCP servers whose tools are registered by name but whose schemas aren&apos;t loaded until first use. The schema is fetched on-demand via &lt;code&gt;ToolSearch&lt;/code&gt; only when the tool is actually needed.&lt;/p&gt;
&lt;p&gt;The impact is dramatic. Anthropic&apos;s own measurements show:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before Tool Search&lt;/th&gt;
&lt;th&gt;After Tool Search&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Token overhead (5-server setup)&lt;/td&gt;
&lt;td&gt;~77,000 tokens&lt;/td&gt;
&lt;td&gt;~8,700 tokens&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;89% reduction&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Available context window&lt;/td&gt;
&lt;td&gt;122,800 tokens&lt;/td&gt;
&lt;td&gt;191,300 tokens&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+56% recovered&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool selection accuracy (Opus 4)&lt;/td&gt;
&lt;td&gt;49%&lt;/td&gt;
&lt;td&gt;74%&lt;/td&gt;
&lt;td&gt;+25 percentage points&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Tool Search auto-activates when your MCP tool descriptions exceed 10% of the context window. No manual configuration required — though you can force it with the &lt;code&gt;ENABLE_TOOL_SEARCH&lt;/code&gt; environment variable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;One gotcha:&lt;/strong&gt; the &lt;code&gt;defer_loading&lt;/code&gt; field in &lt;code&gt;.claude.json&lt;/code&gt;&apos;s MCP server config is &lt;a href=&quot;https://github.com/anthropics/claude-code/issues/26844&quot;&gt;silently accepted but has no effect&lt;/a&gt;. Tool Search is all-or-nothing, not per-server. If you need more granular control, the community-built &lt;a href=&quot;https://github.com/voicetreelab/lazy-mcp&quot;&gt;lazy-mcp&lt;/a&gt; proxy achieves ~95% context reduction (15K to ~800 tokens for 30 tools) with ~500ms first-call latency.&lt;/p&gt;
&lt;h3&gt;3. Pre-Install npx Packages&lt;/h3&gt;
&lt;p&gt;The single highest-ROI fix. For every MCP server using &lt;code&gt;npx -y&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Find all npx-based servers in your config
grep -A1 &apos;&quot;npx&quot;&apos; ~/.claude/settings.json

# Install each one globally
npm install -g @anthropic-ai/server-sequential-thinking
npm install -g @anthropic-ai/server-browser

# Update settings.json to use the binary directly
# Before: &quot;command&quot;: &quot;npx&quot;, &quot;args&quot;: [&quot;-y&quot;, &quot;@anthropic-ai/server-sequential-thinking&quot;]
# After:  &quot;command&quot;: &quot;server-sequential-thinking&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Cold starts drop from 3-5 seconds to 0.3-0.5 seconds per server.&lt;/p&gt;
&lt;h3&gt;4. The 3-Server Rule&lt;/h3&gt;
&lt;p&gt;A heuristic I&apos;ve settled on: ==&lt;strong&gt;if you have more than 3 global MCP servers, at least one should be project-scoped instead.&lt;/strong&gt;== Three is roughly the breakpoint where startup latency stays under 3 seconds and schema injection stays under 100K tokens. Above that, you&apos;re paying a tax that compounds across every session.&lt;/p&gt;
&lt;p&gt;The exceptions prove the rule. My two global survivors — &lt;code&gt;brain&lt;/code&gt; (cross-project memory) and &lt;code&gt;sequential-thinking&lt;/code&gt; (used in 80%+ of sessions) — both meet the bar: genuinely used across all projects, not replaceable by a CLI tool, and fast to boot (native binary and pre-installed Node respectively).&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;When MCP Is Worth the Cost&lt;/h2&gt;
&lt;p&gt;I&apos;ve spent this entire post arguing that MCP servers are expensive. They are. But they&apos;re expensive the way a database connection pool is expensive — the overhead is justified when you need what they provide.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;MCP is worth it when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The server maintains state.&lt;/strong&gt; A browser session, a database connection, a running development server — these can&apos;t be replaced by stateless CLI calls. Each &lt;code&gt;mcp__chrome-devtools__evaluate_script&lt;/code&gt; call operates on the same browser tab. A CLI equivalent would need to reconnect on every invocation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The tool needs structured input/output.&lt;/strong&gt; MCP&apos;s JSON-RPC protocol provides typed tool schemas — parameter validation, required fields, enum constraints. CLI tools accept string arguments. For complex operations (multi-field form submission, structured queries), MCP&apos;s type safety prevents silent failures.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The workflow is multi-turn.&lt;/strong&gt; Sequential thinking, where thought N informs thought N+1 through server-side state, requires a persistent connection. You can&apos;t &lt;code&gt;pipe&lt;/code&gt; state between independent CLI calls without an explicit persistence layer.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Discovery matters more than efficiency.&lt;/strong&gt; When exploring unfamiliar capabilities — a new API, an internal platform — MCP&apos;s self-documenting schemas help the agent understand what&apos;s available. The token cost is the price of exploration. Once you know which tools you need, you can evaluate whether a CLI alternative exists.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr /&gt;
&lt;h2&gt;Where This Breaks&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Lazy loading shifts the cost, it doesn&apos;t eliminate it.&lt;/strong&gt; Deferred tool schemas still get injected on first use. If you need the chrome-devtools server halfway through a conversation, you pay the schema tax then — potentially disrupting a long reasoning chain with a sudden injection of tool definitions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Timing is approximate.&lt;/strong&gt; The startup latency numbers in this post are from my setup (M1 Pro, macOS, Homebrew-installed tools). Your mileage will vary based on hardware, network speed (for &lt;code&gt;npx&lt;/code&gt;), and which servers you&apos;re running. The &lt;em&gt;ratios&lt;/em&gt; hold — npx is always slower than pre-installed, more servers always cost more than fewer — but the absolute numbers are directional.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Some servers resist project-scoping.&lt;/strong&gt; If you use Claude Code across 10 projects and 7 of them need the same MCP server, project-scoping means maintaining 7 copies of the same config. At that point, global is the right default and the startup tax is the cost of doing business.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The token overhead exists regardless of startup timing.&lt;/strong&gt; Even if every server boots instantly (native binaries, pre-installed packages), the schema injection cost remains. A server with 30 tool definitions injects 30 schemas into context whether it took 0.1 seconds or 5 seconds to boot. Startup optimization and schema optimization are independent problems.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Where This Is Heading&lt;/h2&gt;
&lt;p&gt;This is a transitional problem. The trajectory is clear: manual MCP configuration will feel as archaic as manual memory management within two years.&lt;/p&gt;
&lt;p&gt;Tool Search is the embryo of an intelligent loader — it already reduced schema overhead by 89% at Anthropic. The next step is context-aware loading: the system learns which tools you need for which projects and loads them automatically. Beyond that, the MCP ecosystem itself will face selection pressure. Bloated servers with 35 tool definitions will lose to focused servers with 5 tools each — the same specialization pressure that turned monolithic Unix utilities into composable single-purpose commands.&lt;/p&gt;
&lt;p&gt;But today, in April 2026, the intelligent loader doesn&apos;t exist yet. Today, you&apos;re the loader. And the audit in this post is how you do the job until the system learns to do it for you.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Fastest Agent Boots With Exactly What It Needs&lt;/h2&gt;
&lt;p&gt;The MCP ecosystem is growing fast. As of mid-2026, there are &lt;a href=&quot;https://github.com/modelcontextprotocol/servers&quot;&gt;thousands of MCP servers&lt;/a&gt; available — everything from Slack to Postgres to Figma to Kubernetes. Anthropic&apos;s own Tool Search system &lt;a href=&quot;https://platform.claude.com/docs/en/agents-and-tools/tool-use/tool-search-tool&quot;&gt;supports catalogs of up to 10,000 tools&lt;/a&gt;. The temptation is to configure all of them. The result is an agent that boots slowly, thinks through a fog of irrelevant tool definitions, and burns context window on capabilities it doesn&apos;t need for the task at hand.&lt;/p&gt;
&lt;p&gt;The fix is the same principle that makes Unix powerful: &lt;strong&gt;do one thing well.&lt;/strong&gt; Not &quot;install every tool in case you need it,&quot; but &quot;install exactly what this task requires.&quot;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# The anti-pattern:
~/.claude/settings.json → 9 global servers → 11 second boot → 200K tokens of schemas

# The pattern:
~/.claude/settings.json → 2 global servers (brain, sequential-thinking)
project/.claude/settings.json → 2 project servers (chrome-devtools, browser)
→ 3 second boot → 60K tokens of schemas
→ Context window recovered for actual reasoning
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;My sessions now boot in 3 seconds. My context window recovered 120,000 tokens. My agent&apos;s tool selection accuracy jumped from coin-flip to reliable. The only thing I lost was infrastructure I wasn&apos;t using.&lt;/p&gt;
&lt;p&gt;At 20 sessions per day, the 8 seconds I recovered per session add up to &lt;strong&gt;16 hours per year&lt;/strong&gt; of pure wait time eliminated. The 120K tokens recovered per session mean my agent can reason about larger codebases, hold longer conversations, and make fewer tool selection errors — every single session.&lt;/p&gt;
&lt;p&gt;Your MCP servers aren&apos;t free. Measure them. Audit them. Scope them.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;Sharad Jain is an AI engineer and the author of &lt;a href=&quot;https://sharadja.in/blog/system-prompt-architecture-agentic-ai&quot;&gt;The 14K Token Debt&lt;/a&gt; and &lt;a href=&quot;https://sharadja.in/blog/terminal-first-agent-harness-unix-ai-agents&quot;&gt;The Terminal Was the First Agent Harness&lt;/a&gt;. He writes about agent architecture, system prompts, and the compounding returns of treating infrastructure as a first-class engineering discipline. This post is the third in a series on the hidden costs of agentic AI systems.&lt;/em&gt;&lt;/p&gt;
</content:encoded><enclosure url="https://sharadja.in/astral_hacker.png" type="image/png" length="0"/><category>AI</category><category>MCP</category><category>Claude-Code</category><category>performance</category><category>agents</category><category>optimization</category><category>developer-experience</category><category>tooling</category><author>Sharad Jain</author></item><item><title>The Terminal Was the First Agent Harness</title><link>https://sharadja.in/blog/terminal-first-agent-harness-unix-ai-agents/</link><guid isPermaLink="true">https://sharadja.in/blog/terminal-first-agent-harness-unix-ai-agents/</guid><description>Unix solved the agent problem 50 years ago. Every principle behind modern AI agents — tool use, context management, harness reliability — maps to primitives that already exist in bash.</description><pubDate>Thu, 16 Apr 2026 14:00:00 GMT</pubDate><content:encoded>&lt;h1&gt;The Terminal Was the First Agent Harness&lt;/h1&gt;
&lt;p&gt;When DeepMind benchmarked Gemini-2.5-Flash on the Kaggle chess GameArena, &lt;a href=&quot;https://arxiv.org/abs/2603.03329&quot;&gt;&lt;strong&gt;78% of the model&apos;s losses were illegal moves&lt;/strong&gt;&lt;/a&gt;. Not strategic blunders. Not positional misunderstandings. Rule violations. The model&apos;s reasoning was often sound. Its outputs broke the game.&lt;/p&gt;
&lt;p&gt;The fix wasn&apos;t a bigger model. Gemini-2.5-Flash with an auto-generated code harness — a deterministic wrapper that validated moves before submitting them — &lt;strong&gt;consistently beat Gemini-2.5-Pro and GPT-5.2 running raw.&lt;/strong&gt; A $0.15/request model with guardrails outperformed $200B models without them.&lt;/p&gt;
&lt;p&gt;That harness? Structurally, it&apos;s a shell script.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;while read -r task; do
  plan=$(think &quot;$task&quot;)
  result=$(echo &quot;$plan&quot; | act)
  echo &quot;$result&quot; | observe &amp;gt;&amp;gt; memory.log
done &amp;lt; tasks.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Read input. Reason. Act. Observe. Persist. Loop. This is the &lt;strong&gt;ReAct pattern&lt;/strong&gt; — the paradigm powering virtually every modern AI agent. &lt;a href=&quot;https://arxiv.org/abs/2210.03629&quot;&gt;Yao et al. (2023)&lt;/a&gt; formalized it in 2023. But replace &lt;code&gt;think&lt;/code&gt;, &lt;code&gt;act&lt;/code&gt;, and &lt;code&gt;observe&lt;/code&gt; with real programs and this script runs unchanged on any Unix system since 1979.&lt;/p&gt;
&lt;p&gt;==&lt;strong&gt;Unix solved the agent problem 50 years ago.&lt;/strong&gt; The primitives that make modern AI agents work — tool use, context management, harness reliability, persistent memory — map 1:1 to abstractions that already exist in the terminal.== What&apos;s genuinely new about the agent era is narrower than most people think.&lt;/p&gt;
&lt;p&gt;I&apos;m going to make this case across six parallel mappings, then break it — because the honest version requires acknowledging exactly where the analogy fails.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;&quot;Everything is a File&quot; = &quot;Everything is Context&quot;&lt;/h2&gt;
&lt;p&gt;The most provocative recent paper in agent architecture isn&apos;t about transformers or reinforcement learning. It&apos;s about filesystems.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2512.05470&quot;&gt;Xu et al. (2025)&lt;/a&gt; propose what they call an &lt;strong&gt;agentic file system&lt;/strong&gt; — a unified namespace where all context sources an agent needs (memory, tools, APIs, knowledge bases, human input) are &quot;mounted&quot; and accessed through a single hierarchical interface. Their inspiration is explicit:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&quot;Inspired by the Unix notion that everything is a file, the abstraction provides a persistent, hierarchical, and governed environment where heterogeneous context sources are mounted and accessed uniformly.&quot;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This isn&apos;t metaphor. It&apos;s direct architectural lineage. The Unix filesystem already solves the problem of presenting wildly different data sources through a single interface. &lt;code&gt;/dev/sda&lt;/code&gt; is a physical disk. &lt;code&gt;/proc/cpuinfo&lt;/code&gt; is a kernel data structure. &lt;code&gt;/dev/null&lt;/code&gt; is a void. But to any program reading them, they&apos;re all just files. Open, read, close — the same three syscalls regardless of what&apos;s behind the path.&lt;/p&gt;
&lt;p&gt;The &quot;Everything is Context&quot; paper applies this pattern to agents. Their framework, AIGNE, mounts heterogeneous context sources into a namespace that agents browse like a directory tree:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Unix Mount Point&lt;/th&gt;
&lt;th&gt;Agent Context Equivalent&lt;/th&gt;
&lt;th&gt;What&apos;s Mounted&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/dev/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/context/tools/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Tool capabilities (APIs, functions, MCP servers)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/proc/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/context/state/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Runtime state (current task, agent status, metrics)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/mnt/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/context/knowledge/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;External knowledge bases, RAG sources, documents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/home/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/context/memory/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Agent-specific persistent memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/tmp/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/context/pad/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Transient scratchpad for in-flight reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/var/log/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/context/history/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Immutable interaction history&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;I call this pattern &lt;strong&gt;Context Mounting&lt;/strong&gt; — the principle that heterogeneous data sources should be projected into a uniform namespace the agent can browse, exactly like Unix mounts devices into &lt;code&gt;/dev&lt;/code&gt;, &lt;code&gt;/proc&lt;/code&gt;, and &lt;code&gt;/mnt&lt;/code&gt;. The power isn&apos;t in any individual mount point. It&apos;s in the uniformity: once everything is mounted, the agent doesn&apos;t need specialized code for each data source. It reads context the same way regardless of whether that context comes from a vector database, a REST API, or a local file.&lt;/p&gt;
&lt;p&gt;The paper goes further: each mounted node can have &lt;strong&gt;meta-defined actions&lt;/strong&gt; — callable behaviors discoverable by agents. A file isn&apos;t just data; it&apos;s an active node that can execute tools, transformations, or service calls directly through the filesystem interface. This is &lt;code&gt;/dev/&lt;/code&gt; on steroids — Unix gave us device files that could be written to for side effects; AIGNE extends this to arbitrary tool invocations through the same read/write interface.&lt;/p&gt;
&lt;p&gt;Context Mounting has a security corollary that most agent architectures ignore entirely. Bell Labs&apos; &lt;a href=&quot;https://arxiv.org/abs/2603.18030&quot;&gt;Plan 9 operating system&lt;/a&gt; extended &quot;everything is a file&quot; to its logical conclusion: each process gets a &lt;strong&gt;synthesized, per-process namespace&lt;/strong&gt; where only authorized capabilities exist as filesystem paths. If the billing database isn&apos;t mounted in your namespace, you can&apos;t access it — not because a prompt told you not to, but because the path literally doesn&apos;t exist in your view of the filesystem. Traditional agent security relies on prompt engineering (&quot;don&apos;t access the billing API&quot;) or middleware token scopes. Namespace-bounded security relies on the kernel. The former is breakable by jailbreaking. The latter is enforced at the hardware level. This is &lt;code&gt;chroot&lt;/code&gt; applied to agent capabilities: security by construction, not by instruction.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Where the mount leaks.&lt;/strong&gt; Unix files are byte streams. They have no schema, no types, no semantic structure. When you &lt;code&gt;cat /proc/cpuinfo&lt;/code&gt;, you get plain text that you parse with &lt;code&gt;grep&lt;/code&gt; and &lt;code&gt;awk&lt;/code&gt;. Agent context needs schema-driven mounting — REST/OpenAPI resources, GraphQL types, and MCP tools auto-projected into the namespace with machine-readable type definitions. Xu et al. address this with schema-driven mounting, but it&apos;s an extension Unix never needed because Unix programs are deterministic consumers of structured data. Agents are stochastic consumers of unstructured meaning. The abstraction holds for &lt;em&gt;structure&lt;/em&gt;; it leaks at &lt;em&gt;semantics&lt;/em&gt;.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Bash as the Original Agent Loop&lt;/h2&gt;
&lt;p&gt;The ReAct paradigm — the most widely adopted agent architecture — interleaves &quot;reasoning traces&quot; and &quot;task-specific actions&quot; in a loop. As &lt;a href=&quot;https://arxiv.org/abs/2210.03629&quot;&gt;Yao et al. (2023)&lt;/a&gt; describe it, this mirrors how &quot;humans naturally alternate between thinking and acting during complex tasks.&quot;&lt;/p&gt;
&lt;p&gt;Humans have been doing this in terminals for decades. Here&apos;s ReAct, expressed formally:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Agent Loop:
  1. Observe(environment) → context
  2. Think(context) → plan
  3. Act(plan) → tool_call
  4. Observe(tool_result) → updated_context
  5. If not done: goto 1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And here&apos;s the bash equivalent that every sysadmin has written:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
set -e  # The original guardrail: exit on any error

while IFS= read -r task; do
  # Think: analyze the task
  plan=$(analyze_task &quot;$task&quot;)

  # Act: execute the plan
  result=$(execute &quot;$plan&quot; 2&amp;gt;&amp;amp;1)
  exit_code=$?

  # Observe: check the result
  if [ $exit_code -ne 0 ]; then
    echo &quot;[ERROR] $task failed: $result&quot; &amp;gt;&amp;gt; errors.log
    continue  # Self-correction: skip and move on
  fi

  echo &quot;[DONE] $task: $result&quot; &amp;gt;&amp;gt; activity.log
done &amp;lt; tasks.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The structural mapping is exact:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ReAct Component&lt;/th&gt;
&lt;th&gt;Bash Equivalent&lt;/th&gt;
&lt;th&gt;Unix Primitive&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Observation&lt;/td&gt;
&lt;td&gt;&lt;code&gt;read&lt;/code&gt; from stdin/file&lt;/td&gt;
&lt;td&gt;File descriptors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning trace&lt;/td&gt;
&lt;td&gt;Comments, variable assignment&lt;/td&gt;
&lt;td&gt;Shell variables&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Action&lt;/td&gt;
&lt;td&gt;Command execution&lt;/td&gt;
&lt;td&gt;&lt;code&gt;exec&lt;/code&gt; / &lt;code&gt;fork&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool result&lt;/td&gt;
&lt;td&gt;stdout/stderr capture&lt;/td&gt;
&lt;td&gt;Pipes, &lt;code&gt;$()&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error handling&lt;/td&gt;
&lt;td&gt;&lt;code&gt;$?&lt;/code&gt; exit code + &lt;code&gt;set -e&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Process exit codes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&amp;gt;&amp;gt; activity.log&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;File append&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Loop control&lt;/td&gt;
&lt;td&gt;&lt;code&gt;while&lt;/code&gt;/&lt;code&gt;until&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Shell control flow&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The pattern difference is the &lt;strong&gt;medium&lt;/strong&gt;, not the &lt;strong&gt;mechanism&lt;/strong&gt;. A bash loop processes structured commands; a ReAct loop processes natural language. But the control flow — observe, decide, act, check, persist, loop — is identical.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;set -e&lt;/code&gt; deserves special attention. It&apos;s the original harness guardrail: the single line that transforms a script from &quot;keep going regardless of failures&quot; to &quot;stop the moment something goes wrong.&quot; Every agent framework reinvents this — &lt;code&gt;max_retries&lt;/code&gt;, error callbacks, task failure handlers — but they&apos;re all &lt;code&gt;set -e&lt;/code&gt; with extra steps.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The limit.&lt;/strong&gt; &lt;code&gt;set -e&lt;/code&gt; catches exit codes — binary success or failure. It can&apos;t catch a model that confidently returns the wrong answer with exit code 0. Deterministic programs fail loudly; stochastic programs fail silently. This is where bash-as-agent-loop hits its ceiling, and we&apos;ll come back to it.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Harness Is a Shell Script&lt;/h2&gt;
&lt;p&gt;In March 2026, &lt;a href=&quot;https://arxiv.org/abs/2603.25723&quot;&gt;Pan et al.&lt;/a&gt; published a paper that made an argument so obvious it&apos;s surprising nobody formalized it sooner: the &lt;strong&gt;harness&lt;/strong&gt; — the deterministic code layer wrapped around an LLM — should be treated as a first-class, executable artifact. Not scattered across controller code, hidden framework defaults, and tool adapters. A single, portable, inspectable document.&lt;/p&gt;
&lt;p&gt;They call these &lt;strong&gt;Natural-Language Agent Harnesses (NLAHs)&lt;/strong&gt;. Unix already has a name for them: &lt;strong&gt;shell scripts&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The evidence for why harnesses matter is stark. When DeepMind tested Gemini-2.5-Flash on the Kaggle chess GameArena, &lt;a href=&quot;https://arxiv.org/abs/2603.03329&quot;&gt;&lt;strong&gt;78% of the model&apos;s losses were caused by illegal moves&lt;/strong&gt;&lt;/a&gt; — not strategic blunders, not positional misunderstandings, but rule violations. The model&apos;s &lt;em&gt;reasoning&lt;/em&gt; was often sound. Its &lt;em&gt;outputs&lt;/em&gt; violated the constraints of the game.&lt;/p&gt;
&lt;p&gt;This is the software equivalent of a script that calculates the right answer but writes it to the wrong file. The logic isn&apos;t the problem. The harness is.&lt;/p&gt;
&lt;p&gt;The data gets more compelling: &lt;strong&gt;Gemini-2.5-Flash with an auto-generated harness consistently beat Gemini-2.5-Pro and GPT-5.2 without one.&lt;/strong&gt; A smaller, cheaper model with deterministic guardrails outperformed larger, more expensive models running raw. ==This is the principle I call &lt;strong&gt;The Determinism Dividend&lt;/strong&gt;: every piece of agent behavior you can move from stochastic (LLM-generated) to deterministic (code-enforced) is a compounding reliability gain.==&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;┌─────────────────────────────────┐
│   Model (reasoning capability)  │  ← Gets all the attention
├─────────────────────────────────┤
│   Harness (control logic)       │  ← Where reliability lives
├─────────────────────────────────┤
│   Runtime (shell / OS / infra)  │  ← The forgotten foundation
└─────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;==This is &lt;strong&gt;The Harness Hierarchy&lt;/strong&gt;: agent reliability is determined by three layers — model capability, harness logic, and runtime environment — and the layers below always constrain the layers above.== You can have the most capable model in the world, but if your harness doesn&apos;t validate outputs before they reach the environment, you&apos;ll lose 78% of your games to illegal moves.&lt;/p&gt;
&lt;p&gt;The Auton framework (&lt;a href=&quot;https://arxiv.org/abs/2602.23720&quot;&gt;Cao et al., 2026&lt;/a&gt;) takes this further with what they call a &lt;strong&gt;Constraint Manifold&lt;/strong&gt; — a formally defined subspace of the action space onto which the agent&apos;s policy is projected &lt;em&gt;before&lt;/em&gt; action emission. Privilege escalation and unsafe operations are excluded by construction, not detected after the fact. In Unix terms: this is &lt;code&gt;chroot&lt;/code&gt; for agents. You don&apos;t give the process access to &lt;code&gt;/&lt;/code&gt; and hope it behaves; you restrict its filesystem view to only what it needs.&lt;/p&gt;
&lt;p&gt;The NLAH paper&apos;s core complaint — that harness logic is &quot;scattered across controller code, hidden framework defaults, tool adapters&quot; — is the same complaint every ops engineer has made about undocumented production scripts. Unix solved the portability problem by making shell scripts the standard packaging format for automation. Agent frameworks are slowly rediscovering that the harness should be a single readable artifact, not a fog of implicit configuration.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Tool Use = Unix Pipes&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;cat data.csv | sort -k2 -t&apos;,&apos; -rn | head -10 | awk -F&apos;,&apos; &apos;{print $1, $3}&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This pipeline does four things: read data, sort by a column, take the top 10, extract specific fields. Four small programs, each doing one thing well, composed via pipes into something none of them could do alone.&lt;/p&gt;
&lt;p&gt;Modern agent tool use is the same pattern. When &lt;a href=&quot;https://arxiv.org/abs/2402.04253&quot;&gt;AnyTool&lt;/a&gt; organizes 16,000+ APIs into a hierarchical retrieval system — category → tool → API — it&apos;s reinventing the directory structure. &lt;code&gt;/usr/bin/&lt;/code&gt; holds general utilities. &lt;code&gt;/usr/local/bin/&lt;/code&gt; holds user-installed tools. &lt;code&gt;$PATH&lt;/code&gt; determines search order. AnyTool&apos;s hierarchical API retriever is &lt;code&gt;$PATH&lt;/code&gt; with a semantic index.&lt;/p&gt;
&lt;p&gt;The parallels extend to interface design:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent Concept&lt;/th&gt;
&lt;th&gt;Unix Equivalent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tool description (function schema)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;man&lt;/code&gt; pages + &lt;code&gt;--help&lt;/code&gt; flags&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool discovery&lt;/td&gt;
&lt;td&gt;&lt;code&gt;which&lt;/code&gt; / &lt;code&gt;whereis&lt;/code&gt; / &lt;code&gt;$PATH&lt;/code&gt; search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool invocation&lt;/td&gt;
&lt;td&gt;&lt;code&gt;exec&lt;/code&gt; with arguments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool chaining&lt;/td&gt;
&lt;td&gt;Pipes (&lt;code&gt;|&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool output parsing&lt;/td&gt;
&lt;td&gt;&lt;code&gt;stdout&lt;/code&gt; capture + &lt;code&gt;jq&lt;/code&gt;/&lt;code&gt;awk&lt;/code&gt;/&lt;code&gt;sed&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP (Model Context Protocol)&lt;/td&gt;
&lt;td&gt;Pipes with JSON instead of plain text&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;MCP, Anthropic&apos;s protocol for connecting agents to tools, is structurally a modernized Unix pipe — stdio transport, streaming data between producer and consumer. But MCP carries JSON-RPC with typed tool schemas and capability negotiation, and that typing comes at a cost.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://jannikreinhard.com/2026/02/22/why-cli-tools-are-beating-mcp-for-ai-agents/&quot;&gt;Independent benchmarks&lt;/a&gt; reveal a stark token economics gap between MCP and CLI-based tool use:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;MCP&lt;/th&gt;
&lt;th&gt;CLI&lt;/th&gt;
&lt;th&gt;Advantage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tool schema injection&lt;/td&gt;
&lt;td&gt;~28,000-55,000 tokens&lt;/td&gt;
&lt;td&gt;0 tokens&lt;/td&gt;
&lt;td&gt;CLI uses innate model knowledge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total task consumption (50 objects)&lt;/td&gt;
&lt;td&gt;~145,000 tokens&lt;/td&gt;
&lt;td&gt;~4,150 tokens&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;CLI is 35× more efficient&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task completion rate&lt;/td&gt;
&lt;td&gt;60/100&lt;/td&gt;
&lt;td&gt;77/100&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;CLI completes 28% more tasks&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;A standard MCP implementation can consume &lt;strong&gt;55,000 tokens&lt;/strong&gt; just to define tool schemas — before any reasoning begins. For agents connecting to multiple services (GitHub + Postgres + Jira), schema injection can exhaust over 150,000 tokens of the context window. CLI tools cost virtually zero schema tokens because the model already knows &lt;code&gt;ls&lt;/code&gt;, &lt;code&gt;grep&lt;/code&gt;, &lt;code&gt;git&lt;/code&gt;, and &lt;code&gt;docker&lt;/code&gt; from pre-training. The model&apos;s training data &lt;em&gt;is&lt;/em&gt; the documentation.&lt;/p&gt;
&lt;p&gt;This is where &lt;strong&gt;The Pipe Test&lt;/strong&gt; becomes a measurable complexity detector, not just an analogy: ==&lt;strong&gt;if your agent workflow can&apos;t be described as a Unix pipeline&lt;/strong&gt; (input | transform | validate | output), &lt;strong&gt;it&apos;s probably over-engineered — and almost certainly over-tokenized.&lt;/strong&gt;== The 35× efficiency gap is the Determinism Dividend applied to tool invocation: deterministic CLI tools that the model already knows outperform typed schemas that must be injected fresh every session.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Where the Pipe Test breaks.&lt;/strong&gt; Pipes are linear. Data flows in one direction. Agent workflows are frequently non-linear: conditional branching (if the search returns no results, try a different query), recursive decomposition (break a task into subtasks, each of which may spawn sub-subtasks), and backtracking (the plan failed at step 3, replan from step 1). Tree-of-thought architectures are fundamentally non-linear. The Pipe Test catches over-engineering, but it&apos;s a necessary condition for good design, not a sufficient one. Some genuinely complex workflows require graphs, not pipelines.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Filesystem Is Agent Memory&lt;/h2&gt;
&lt;p&gt;Andrej Karpathy&apos;s &lt;a href=&quot;https://karpathy.ai&quot;&gt;LLM-as-compiler pattern&lt;/a&gt; — where raw source documents flow through an LLM to produce structured wiki pages, which are then served as knowledge — is a build system:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;raw/           →    wiki/          →    output/
(source docs)       (compiled KB)       (served pages)

src/           →    build/         →    dist/
(source code)       (compiled)          (deployed)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The LLM acts as a compiler. Compilers already live in terminals. The Karpathy pipeline (&lt;code&gt;raw/ → wiki/ → Obsidian&lt;/code&gt;) scales to ~100 articles and ~400,000 words without vector databases — proof that filesystem-based knowledge management works at non-trivial scale.&lt;/p&gt;
&lt;p&gt;The deeper mapping is the memory taxonomy. The &quot;Everything is Context&quot; paper defines five categories of agent memory. Every one maps to a Unix location that already exists:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent Memory Type&lt;/th&gt;
&lt;th&gt;Unix Equivalent&lt;/th&gt;
&lt;th&gt;What&apos;s Stored&lt;/th&gt;
&lt;th&gt;Lifecycle&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scratchpad&lt;/strong&gt; (transient working memory)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/tmp/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;In-flight reasoning, intermediate results&lt;/td&gt;
&lt;td&gt;Cleared on reboot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Episodic&lt;/strong&gt; (session history)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;~/.bash_history&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;What happened in this session&lt;/td&gt;
&lt;td&gt;Append-only, bounded&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fact memory&lt;/strong&gt; (persistent knowledge)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;~/.config/&lt;/code&gt;, dotfiles&lt;/td&gt;
&lt;td&gt;User preferences, API keys, learned facts&lt;/td&gt;
&lt;td&gt;Long-lived, mutable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Procedural&lt;/strong&gt; (how-to knowledge)&lt;/td&gt;
&lt;td&gt;Scripts in &lt;code&gt;$PATH&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reusable procedures, workflows, recipes&lt;/td&gt;
&lt;td&gt;Versioned, executable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Historical record&lt;/strong&gt; (audit trail)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/var/log/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Complete interaction history&lt;/td&gt;
&lt;td&gt;Immutable, rotated&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;But the filesystem isn&apos;t just agent memory — it&apos;s the agent&apos;s &lt;strong&gt;diagnostic nervous system&lt;/strong&gt;. The &lt;code&gt;/proc&lt;/code&gt; virtual filesystem exposes every process, network connection, and kernel state as a readable file. An agent debugging a failed deployment doesn&apos;t need to guess from an opaque HTTP 500 error. It can read &lt;code&gt;/proc/self/status&lt;/code&gt; for its own memory footprint, run &lt;code&gt;strace -p &amp;lt;PID&amp;gt; -f -e trace=network,file&lt;/code&gt; to watch exactly which syscalls a hanging process attempts, or check &lt;code&gt;lsof -i :443&lt;/code&gt; to diagnose connection failures at the socket level. API agents debug abstractions. Shell agents debug reality. This is the Determinism Dividend applied to observability: deterministic diagnostic tools yield deterministic root causes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Where the filesystem metaphor breaks.&lt;/strong&gt; &lt;code&gt;grep&lt;/code&gt; finds strings. Agents need meaning. When I search my knowledge base for &quot;reducing inference costs,&quot; I need to find documents about &quot;token optimization,&quot; &quot;model distillation,&quot; and &quot;quantization&quot; — none of which contain the literal search terms. This requires dense retrieval — learned embeddings where semantic similarity is captured through vector proximity, not character matching. &lt;a href=&quot;https://arxiv.org/abs/2401.16975&quot;&gt;Zep&lt;/a&gt; demonstrates the state of the art: triple-method retrieval (cosine similarity + BM25 + graph traversal) with cross-encoder reranking, achieving 18.5% accuracy improvement over full-context baselines while reducing context tokens from ~115K to ~1.6K per query.&lt;/p&gt;
&lt;p&gt;More fundamentally, filesystem memory is explicit — you choose what to save and where. Agent memory needs &lt;strong&gt;experience abstraction&lt;/strong&gt;: not just recalling what happened, but distilling &lt;em&gt;what was learned&lt;/em&gt;. As &lt;a href=&quot;https://arxiv.org/abs/2511.20857&quot;&gt;Evo-Memory (Google DeepMind, 2025)&lt;/a&gt; argues, most memory systems only reuse static dialogue context rather than learning from experience to improve future reasoning. An agent that only remembers conversations is like a developer who reads their bash history but never writes reusable scripts. The jump from episodic memory to procedural memory — from &quot;what happened&quot; to &quot;what I learned&quot; — is where the filesystem analogy is necessary but not sufficient.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Claude Code as the Existence Proof&lt;/h2&gt;
&lt;p&gt;If the terminal-as-agent-harness argument feels theoretical, there&apos;s a production system that embodies it: Claude Code.&lt;/p&gt;
&lt;p&gt;Claude Code runs inside a terminal. Its primary tools are &lt;code&gt;bash&lt;/code&gt;, filesystem reads/writes, and MCP servers. When it bootstraps a project, it runs &lt;code&gt;mkdir&lt;/code&gt;, &lt;code&gt;cd&lt;/code&gt;, and &lt;code&gt;python main.py&lt;/code&gt; — the agent bootstrap sequence &lt;em&gt;is&lt;/em&gt; bash. The system prompt (which I dissected in &lt;a href=&quot;https://sharadja.in/blog/system-prompt-architecture-agentic-ai&quot;&gt;The 14K Token Debt&lt;/a&gt;) is the most consequential architectural decision in the system. This post extends that argument: if the system prompt is the agent&apos;s constitution, &lt;strong&gt;the terminal is the agent&apos;s runtime.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=3GjE_YAs03s&quot;&gt;Building a Claude Code clone from scratch&lt;/a&gt; reveals the architecture: an agentic loop that reads tool calls from the model, executes them via subprocess (shell commands, file operations), captures stdout/stderr, and feeds the results back as context. The &quot;agent framework&quot; is ~200 lines of Python wrapping a shell. The tool interface is &lt;code&gt;subprocess.run()&lt;/code&gt;. The persistence layer is the filesystem. The harness is a loop with error handling.&lt;/p&gt;
&lt;p&gt;The terminal isn&apos;t a convenience layer. It&apos;s the natural runtime for an agent that coordinates system resources through a text interface — exactly the workflow Unix was designed for.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Where the Analogy Breaks: What&apos;s Actually New&lt;/h2&gt;
&lt;p&gt;I&apos;ve spent five sections arguing that Unix anticipated the agent paradigm. Now let me break my own argument, because the honest version requires acknowledging three things agents need that Unix fundamentally cannot provide.&lt;/p&gt;
&lt;h3&gt;1. Stochastic Output Handling&lt;/h3&gt;
&lt;p&gt;Unix programs are deterministic. &lt;code&gt;sort file.txt&lt;/code&gt; produces the same output every time for the same input. Agent outputs are stochastic — the same prompt can produce different tool calls, different reasoning chains, and different conclusions on every run.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;https://arxiv.org/abs/2602.23720&quot;&gt;Auton framework (Cao et al., 2026)&lt;/a&gt; calls this the &lt;strong&gt;Integration Paradox&lt;/strong&gt;: &quot;LLMs produce stochastic, unstructured outputs, whereas the backend infrastructure they must control — databases, APIs, cloud services — requires deterministic, schema-conformant inputs.&quot; Every Unix composition primitive (pipes, scripts, make) assumes deterministic components. When you pipe &lt;code&gt;sort&lt;/code&gt; into &lt;code&gt;uniq&lt;/code&gt;, you know what you&apos;ll get. When you chain an LLM&apos;s output into a database write, you don&apos;t.&lt;/p&gt;
&lt;p&gt;This is the gap that harnesses fill. But it&apos;s a gap Unix never had to bridge, because Unix&apos;s &quot;agents&quot; (programs) were deterministic by design.&lt;/p&gt;
&lt;h3&gt;2. Semantic Understanding&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;grep&lt;/code&gt; matches character patterns. It cannot find &quot;reducing expenses&quot; when you search for &quot;cost optimization.&quot; Agent memory requires semantic retrieval — the ability to find documents by &lt;em&gt;meaning&lt;/em&gt;, not just by &lt;em&gt;string&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;This breaks the filesystem metaphor at a fundamental level. Unix&apos;s power comes from the composability of text-processing tools: &lt;code&gt;grep | sort | awk | sed&lt;/code&gt; can answer almost any question about structured text. But agents operate on unstructured meaning. The Zep memory system&apos;s triple-method retrieval (vector similarity + keyword matching + knowledge graph traversal) exists because no single Unix primitive can capture semantic relationships. You need learned embeddings for similarity, BM25 for precision, and graph structure for multi-hop causal reasoning — all simultaneously.&lt;/p&gt;
&lt;h3&gt;3. Planning Under Uncertainty&lt;/h3&gt;
&lt;p&gt;A shell script follows a fixed control flow: step 1, step 2, step 3. If step 2 fails, you get an exit code and maybe a retry. Agents must &lt;strong&gt;plan&lt;/strong&gt;, &lt;strong&gt;observe&lt;/strong&gt; intermediate results, and &lt;strong&gt;replan&lt;/strong&gt; when observations don&apos;t match expectations.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2508.19076&quot;&gt;HiPlan (2025)&lt;/a&gt; demonstrates the gap: agents need hierarchical planning with global milestone guides &lt;em&gt;and&lt;/em&gt; local step-wise hints generated dynamically at each timestep. The DAVIS framework goes further — an Actor-Critic architecture where the Critic monitors each step in real-time, &quot;comparing observations to expectations and suggesting replanning when discrepancies arise.&quot; This is closer to a POMDP (Partially Observable Markov Decision Process) than a pipeline. Unix has no native primitive for &quot;observe the output of step 2, decide whether to continue to step 3 or rewrite the plan entirely.&quot;&lt;/p&gt;
&lt;p&gt;As the Auton framework puts it: the shift is &quot;from imperative scripts to declarative definitions that specify agent behavior as auditable data; from stateless, single-session interactions to persistent cognitive architectures that accumulate experience.&quot; Shell scripts are imperative and stateless. Agents need to be declarative and stateful.&lt;/p&gt;
&lt;h3&gt;The 60/40 Split&lt;/h3&gt;
&lt;p&gt;==Here&apos;s my estimate: the Unix primitives give you &lt;strong&gt;roughly 60% of a working agent architecture for free.&lt;/strong&gt;== Filesystem-as-memory, harness-as-script, pipes-as-tool-chains, the observe-act loop — these are real, load-bearing patterns, not loose analogies. The benchmarks support this: CLI agents complete 28% more tasks than MCP agents while consuming 35× fewer tokens. The terminal-native approach isn&apos;t just philosophically elegant — it&apos;s measurably superior for the majority of agent workloads. The remaining 40% — stochastic output handling, semantic understanding, hierarchical replanning — is what&apos;s genuinely novel about the agent era. And that 40% is where the hard engineering problems live.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Conclusion: Read the Man Pages&lt;/h2&gt;
&lt;p&gt;The next great agent framework won&apos;t be built by someone who studied only machine learning. It&apos;ll be built by someone who also understands operating systems.&lt;/p&gt;
&lt;p&gt;Whether these patterns represent direct inheritance or independent convergence, the structural isomorphism is the point: when researchers independently arrive at &quot;everything is a file&quot; for agent context, when harness papers rediscover the Unix philosophy of small composable tools, when the most capable agent system in production literally runs inside a terminal — the design space for coordinating intelligent agents has fewer degrees of freedom than the framework proliferation suggests.&lt;/p&gt;
&lt;p&gt;The four frameworks from this post — The Harness Hierarchy, The Determinism Dividend, Context Mounting, and The Pipe Test — are all Unix principles applied to stochastic computation. They won&apos;t cover the 40% that&apos;s genuinely new (non-determinism, semantics, replanning). But they&apos;ll prevent you from reinventing the 60% that&apos;s already solved. ==The agent ecosystem&apos;s biggest gap isn&apos;t better models or smarter prompts — it&apos;s systems engineering literacy.== The teams shipping reliable agents are the ones who read the man pages before reaching for the framework.&lt;/p&gt;
&lt;p&gt;If you&apos;re building agent systems, try this: describe your architecture using only Unix concepts. If you can&apos;t, you might be solving a genuinely new problem. If you can, you might be reinventing &lt;code&gt;cron&lt;/code&gt;.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;Sharad Jain is an AI engineer leading AI at &lt;a href=&quot;https://autoscreen.ai&quot;&gt;Autoscreen.ai&lt;/a&gt;. Previously at Meta and Autodesk. He writes about agent architecture, system prompts, and the compounding returns of treating software engineering as a first-class discipline in the age of AI. This post builds on &lt;a href=&quot;https://sharadja.in/blog/system-prompt-architecture-agentic-ai&quot;&gt;The 14K Token Debt&lt;/a&gt;, which examined system prompt architecture as load-bearing infrastructure.&lt;/em&gt;&lt;/p&gt;
</content:encoded><enclosure url="https://sharadja.in/astral_hacker.png" type="image/png" length="0"/><category>AI</category><category>agents</category><category>Unix</category><category>terminal</category><category>harness</category><category>architecture</category><category>Claude</category><category>MCP</category><category>memory</category><category>systems-engineering</category><author>Sharad Jain</author></item><item><title>The 14K Token Debt: System Prompt Architecture for Agentic AI</title><link>https://sharadja.in/blog/system-prompt-architecture-agentic-ai/</link><guid isPermaLink="true">https://sharadja.in/blog/system-prompt-architecture-agentic-ai/</guid><description>Your system prompt burns 14,000 tokens before the model speaks. That&apos;s not overhead — it&apos;s the most consequential architectural decision in any agent system.</description><pubDate>Wed, 15 Apr 2026 14:00:00 GMT</pubDate><content:encoded>&lt;h1&gt;The 14K Token Debt: Architecting System Prompts for Agentic AI&lt;/h1&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt; In agentic workflows, the system prompt is rarely treated as load-bearing infrastructure. It is often relegated to unstructured boilerplate—a preamble injected before the execution of the actual task. This architectural oversight is the primary failure mode of long-running autonomous systems. Those initial tokens are not a preamble; they are a gravitational anchor. This technical report dissects the mechanics of &quot;Prompt Gravity,&quot; the empirical limits of context retention, and the engineering frameworks required to transition from fragile stochastic instructions to deterministic, compounding behavioral architectures.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr /&gt;
&lt;p&gt;A production-grade system prompt typically consumes 14,000 tokens of the context window before the language model generates a single computational output.&lt;/p&gt;
&lt;p&gt;Every conversation initializes with this expenditure, paid invisibly during the prefill phase. To contextualize the scale of this structural tax, consider the real compute cost of injecting a 14K-token payload just 100 times per day over a single year across foundation models:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Foundation Model&lt;/th&gt;
&lt;th&gt;Input Cost (per 1M)&lt;/th&gt;
&lt;th&gt;Cost per 14K Prompt&lt;/th&gt;
&lt;th&gt;Annual Cost (100 runs/day)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Claude 3.5 Sonnet&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;$0.042&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$1,533&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPT-4o&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$2.50&lt;/td&gt;
&lt;td&gt;$0.035&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$1,277&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Claude 3 Opus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;$0.210&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$7,665&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gemini 1.5 Pro&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$1.25&lt;/td&gt;
&lt;td&gt;$0.017&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$638&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Consequently, developers frequently process these preamble tokens as static overhead—ignoring their structural integrity. This is akin to ignoring your database schema because &quot;the ORM handles it.&quot;&lt;/p&gt;
&lt;p&gt;==After engineering autonomous systems scaling around frontier models in production environments, a localized architectural axiom has emerged: &lt;strong&gt;designing your system prompt is analogous to building your own compiler harness.&lt;/strong&gt;== It forms the base distribution upon which every subsequent vector operation and attention mechanism is conditioned. If the foundational matrix is fragile, the agent pipeline is built on sand.&lt;/p&gt;
&lt;h2&gt;1. The Entropy Problem (Context Rot)&lt;/h2&gt;
&lt;p&gt;Without rigorous architectural boundaries, foundation models exhibit a well-documented trajectory in multi-turn environments: they begin executing reasonably, and then they drift.&lt;/p&gt;
&lt;p&gt;Each sequential turn compounds context entropy. By turn eight, measurable cognitive degradation sets in. By turn fifteen, the model&apos;s logic has collapsed into the statistical median of its pre-training distribution—producing responses that are syntactically sound, but generically unhelpful for highly specialized domain workflows.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/attention-valley-degradation-curve.png&quot; alt=&quot;The Attention Valley Data Degradation Curve&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This decay is an architectural certainty, not an empirical anomaly. In their COLM 2024 paper, &lt;em&gt;&quot;Measuring and Controlling Instruction (In)Stability in Language Model Dialogs,&quot;&lt;/em&gt; Liang et al. tracked models like LLaMA-2-70B and GPT-3.5, concluding that &lt;strong&gt;instruction drift is universally measurable within just eight conversation rounds&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The mechanism resides within the attention layers of the transformer itself. As sequence length expands, the attention matrix disperses across an exponentially growing context matrix. The relative weight allocated to the initial system prompt inevitably shrinks. The directives do not vanish; they structurally dilute.&lt;/p&gt;
&lt;h3&gt;Visualizing the Attention Valley&lt;/h3&gt;
&lt;p&gt;This phenomenon maps directly to the findings of Liu et al. in &lt;em&gt;Lost in the Middle&lt;/em&gt; (2023). Language models allocate maximal attention toward the extreme peripheries of their context window (the beginning constraints and the most recent user message), creating a pronounced &quot;valley&quot; of forgotten parameters.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Attention Weight Probability [Hypothetical Model]
1.00 │  ██
     │  ███
0.80 │  ████                    &amp;lt;&amp;lt; System Prompt Focus (Architecturally Privileged)
     │  █████
0.60 │  ██████                            ███
     │  ███████                          ████
0.40 │  ████████                        █████
     │  █████████                      ██████
0.20 │  ██████████▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄████████  &amp;lt;&amp;lt; &quot;Lost in the Middle&quot;
     │  ██████████████████████████████████████
0.00 └───────────────────────────────────────── 
          0K        50K       100K      150K      Current Query
                   Token Sequence Position
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Mathematics is unforgiving. At a 100K context sequence anchored by a 14K system prompt, your directives command just 14% of the operational attention bandwidth. At a 200K sequence depth, that structural anchor comprises a mere 7%.&lt;/p&gt;
&lt;p&gt;In coding agents specifically, Chroma&apos;s &lt;em&gt;2025 Study of Context Degradation&lt;/em&gt; identified a catastrophic 35-minute operational threshold. Beyond this horizon, degradation accelerates logarithmically, quadrupling pipeline failure rates regardless of the underlying foundation model.&lt;/p&gt;
&lt;h3&gt;Overcoming Escape Velocity: Prompt Gravity&lt;/h3&gt;
&lt;p&gt;We define the counter-force to this drift as &lt;strong&gt;Prompt Gravity&lt;/strong&gt;: the mathematical capability of the system prompt to maintain the LLM&apos;s sequence continuations in strict, deterministic orbit around an intended engineering framework.&lt;/p&gt;
&lt;p&gt;Without sufficient structural mass—defined tightly by absolute constraints, boundary-case definitions, and hardcode behavioral traces—the autonomous conversation reaches escape velocity, drifting inevitably toward the gravitational center of generic pre-training data.&lt;/p&gt;
&lt;p&gt;==Those 14,000 upfront tokens are the pipeline&apos;s sole defense against entropy.==&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;2. Abstractions vs. Behavioral Traces&lt;/h2&gt;
&lt;p&gt;The fundamental liability of standard system prompts—heuristics like &lt;em&gt;&quot;Be concise,&quot;&lt;/em&gt; &lt;em&gt;&quot;Enforce strict typing,&quot;&lt;/em&gt; or &lt;em&gt;&quot;Think step-by-step&quot;&lt;/em&gt;—is that they delegate contextual interpretation to the latent weights.&lt;/p&gt;
&lt;p&gt;When you instruct a model to &lt;em&gt;&quot;be concise,&quot;&lt;/em&gt; the network interpolates that parameter from its trillion-token training regimen—effectively averaging millions of conflicting linguistic paradigms. Your localized definition of &quot;concise&quot; might strictly require &quot;output only the functional bash script.&quot; The model&apos;s interpretation might yield &quot;a well-structured descriptive paragraph omitting redundant functions.&quot; Only the former allows an automated CI/CD hook to execute successfully.&lt;/p&gt;
&lt;p&gt;==&lt;strong&gt;Behavioral traces&lt;/strong&gt; resolve this by eliminating semantic abstraction entirely.== Engineers must replace fuzzy, adjective-based instructions with serialized, deterministic evidence of what the orchestration layer &lt;em&gt;actually accepted and rejected in prior states&lt;/em&gt;.&lt;/p&gt;
&lt;h4&gt;❌ The Instruction-Based Paradigm (High Variance)&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;role&amp;gt;You are a senior infrastructure engineer.&amp;lt;/role&amp;gt;
&amp;lt;directives&amp;gt;
  - Be concise. 
  - Write robust, production-ready code. 
  - Only provide the logical solution without conversational pleasantries.
&amp;lt;/directives&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;✅ The Behavioral-Trace Paradigm (High Gravity)&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;behavioral_trace&amp;gt;
  [HISTORICAL POST-MORTEM - ACCEPTED PATTERN]
  Context: User requested a distributed Redis cache implementation in Python.
  Prior Run Outcome: You generated 180 lines containing redundant abstract classes.
  Orchestrator Action: The orchestration layer rejected all classes.

  [ACCEPTED_STRUCTURE]
  Here is the explicit sub-graph that was merged into production:
  - The core factory dictionary (12 lines, zero inline comments).
  - The raw pytest assertion block (8 lines, table-driven execution).
  
  [CONSTRAINT]
  When generating decoupled components, force strict conformance to this trace.
  ABORT the generation of abstract classes unless parameterized by user config.
&amp;lt;/behavioral_trace&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The delta between these templates is not stylistic; it is the difference between a high-temperature stochastic roll and a deterministic, multi-turn state machine constraint. Meta’s research on semi-formal reasoning confirms this: structured tracking paths elevated code review pass rates from 78% on standard prompts to a rigorous 93% on production patches.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;3. The Trifecta of Agent Memory Architecture&lt;/h2&gt;
&lt;p&gt;To synthesize where behavioral traces persistently reside, agent architectures must be cleanly deconstructed into three isolated functional layers.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer Protocol&lt;/th&gt;
&lt;th&gt;Volatility &amp;amp; Scope&lt;/th&gt;
&lt;th&gt;Primary Architectural Function&lt;/th&gt;
&lt;th&gt;Core Data Primitives&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Layer 3: Persistent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Global / Survives Restarts&lt;/td&gt;
&lt;td&gt;Accumulating generalized operational experience&lt;/td&gt;
&lt;td&gt;Knowledge Graphs, PRD JSONs, Semantic Embeddings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Layer 2: Session&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Local / Wiped at Termination&lt;/td&gt;
&lt;td&gt;Managing finite context for the immediate inference&lt;/td&gt;
&lt;td&gt;Tool DAGs, transient file diffs, user transcripts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Layer 1: System&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Bootstrapped at Initialization&lt;/td&gt;
&lt;td&gt;Injecting immutable identity and routing constants&lt;/td&gt;
&lt;td&gt;The 14K Token System Payload, Core Skills Matrices&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;While monolithic orchestration frameworks expend maximal computational overhead managing Layer 2 (via chunking and RAG context sliding), true frontier architecture relies on &lt;strong&gt;a dynamic compilation bridge connecting Layer 3 back to Layer 1&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;==Instead of treating the system prompt as a static &lt;code&gt;system.txt&lt;/code&gt; parameter file, it evolves into a dynamic bootstrapping payload forged by querying persistent memory precisely at the initialization lifecycle.== The system layer becomes an evolving organism.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;4. Engineering the Truth: Thin Harness, Fat Skills&lt;/h2&gt;
&lt;p&gt;How is this theory engineered into a deterministic codebase? The optimal design architecture is captured in the &lt;strong&gt;&quot;Thin Harness, Fat Skills&quot;&lt;/strong&gt; paradigm.&lt;/p&gt;
&lt;p&gt;A high-performance orchestration harness strictly obeys a three-tier separation of concerns:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;The Harness (Thin CLI Engine):&lt;/strong&gt; ~200 lines of plumbing. It manages the LLM asynchronous execution loop, orchestrates host I/O permissions, and guarantees deterministic safety boundaries. &lt;em&gt;It possesses zero domain awareness.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Deterministic Core:&lt;/strong&gt; Standard REST APIs, native shell validations, and AST compiler evaluations. LLMs lack native deterministic math functions; offload all binary computations here.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Skills (Fat Markdowns Logic):&lt;/strong&gt; The repository of intelligence. These are massive markdown or text payloads encoding domain heuristics, operational post-mortems, and the behavioral traces detailed above.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;The Anti-Pattern: Fat APIs, Thin LLM Context&lt;/h3&gt;
&lt;p&gt;The architectural inverse involves mapping every conceivable internal capability directly to a strictly validated OpenAPI schema or a rigid Model Context Protocol (MCP) server.&lt;/p&gt;
&lt;p&gt;The operational cost of this anti-pattern is devastating: heavy network round-trips for trivial logic paths, massive context windows flooded with unutilized schema metadata, and compounding latency. A UI verification step that natively resolves in 200 milliseconds via direct script injection can easily bloat to 15 seconds through an over-engineered external tool loop.&lt;/p&gt;
&lt;p&gt;By hardcoding logic paradigms directly as &quot;Fat Skills&quot; inside markdown files, the model pre-computes the constraints during the initial prefill routing phase, executing complex reasoning frameworks via zero-latency &quot;pseudo-method calls.&quot;&lt;/p&gt;
&lt;h3&gt;The Autonomous Hook (Implementation Blueprint)&lt;/h3&gt;
&lt;p&gt;The heart of the Ralph Loop or any autonomous execution is the iterative shell boundary that feeds the context parameters dynamically back into the CLI.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// The Thin Harness: Intercepting the Completion Promise
async function executeAutonomousLoop(prdConfigPath: string) {
  let isTaskResolved = false;
  let exitCode = 0;

  while (!isTaskResolved) {
    const contextPayload = compiler.buildDynamicPrompt(prdConfigPath);
    
    // Execute the Model
    const actionState = await agent.run({
      systemPrompt: contextPayload,
      cwd: localSandboxPath,
    });

    if (actionState.output.includes(&quot;&amp;lt;promise&amp;gt;COMPLETE&amp;lt;/promise&amp;gt;&quot;)) {
      logger.info(&quot;Deterministic completion identified. Breaking loop.&quot;);
      isTaskResolved = true;
      exitCode = 0;
    } else {
      // Exit Code 2 triggers reinjection of the failure states into L3 Memory
      logger.error(&quot;System exited before completion criteria met. Re-evaluating.&quot;);
      await memoryStore.appendFailureTrace(actionState.stderr);
      exitCode = 2; // Hook reinstitutes the loop
    }
  }
  return exitCode;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;hr /&gt;
&lt;h2&gt;5. Temporal Knowledge Graphs vs Traditional RAG&lt;/h2&gt;
&lt;p&gt;A traditional vector store operates effectively for &lt;em&gt;documentation retrieval&lt;/em&gt;. However, conversational transcripts and operational history are fundamentally different; they map programmable behavior over time.&lt;/p&gt;
&lt;p&gt;By indexing complete session artifacts (tool execution DAGs, accepted git diffs, and corrected parameter hallucinations) into a local &lt;strong&gt;Temporal Knowledge Graph&lt;/strong&gt; — using hybrid retrieval tools like &lt;a href=&quot;https://github.com/tobi/qmd&quot;&gt;QMD&lt;/a&gt; — the agent natively queries its own operational baseline &lt;em&gt;before&lt;/em&gt; executing the standard inference loop.&lt;/p&gt;
&lt;p&gt;This architecture enables three synchronous retrieval vectors:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Lexical Resolution (BM25):&lt;/strong&gt; Precise mapping of unhandled exception stack traces.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Semantic Weighting:&lt;/strong&gt; Abstract logic mapping (&quot;Where have we managed multi-tenant data migrations securely before?&quot;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hypothetical Document Embeddings (HyDE):&lt;/strong&gt; Projecting an idealized state topography to resolve a novel code logic conflict.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;==At the boot phase of Session &lt;code&gt;N+1&lt;/code&gt;, the prompt logic inherently questions: &lt;em&gt;&quot;What specific validation pathway succeeded the last time we confronted this distinct edge case?&quot;&lt;/em&gt;== The underlying transformer parameters remain completely frozen, yet the &lt;em&gt;effective computational yield&lt;/em&gt; of the agent improves parabolically.&lt;/p&gt;
&lt;p&gt;This establishes a profound mechanism for recursive self-optimization completely insulated within the parameter prompt boundary—circumventing the deep financial and infrastructure liabilities inherent in RLHF or continuous fine-tuning pipelines.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Engineering Trade-Offs &amp;amp; Failure Cascades&lt;/h2&gt;
&lt;p&gt;Applying intellectual rigor requires identifying systemic fault tolerances inherent in this loop system:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[!WARNING]
&lt;strong&gt;Context Window Asphyxiation&lt;/strong&gt;
An 800-line harness scaled dynamically via autonomous retrieved history files can exceed 35,000 prefill tokens. Drowning the initial attention distribution with over-prescriptive operational history severely starves the localized attention block required for executing the real-time inference request. Strict token pruning is non-negotiable.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;[!CAUTION]
&lt;strong&gt;Data Rot &amp;amp; Recursive Pollution&lt;/strong&gt;
If an agent continuously appends its own hallucinations to the activity log, and human intervention fails to prune the &lt;code&gt;guardrails.md&lt;/code&gt;, those faulty assertions become encoded as structural precedent. By iteration &lt;code&gt;N+10&lt;/code&gt;, the agent begins utilizing its own failure cascades as validated procedural truths. Layer 3 indices require aggressive, database-level garbage collection.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;[!NOTE]
&lt;strong&gt;The Semantic Transferability Cliff&lt;/strong&gt;
A heavily optimized systemic prompt structurally encodes its creator&apos;s explicit biases, blind spots, and architectural preferences. Deploying a hyper-personalized execution harness laterally across varying engineering teams guarantees deep semantic friction. Distinguishing absolute domain skills from idiosyncratic human preferences remains a pivotal challenge in scalable AI operations.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr /&gt;
&lt;h2&gt;The Compound Architectural Return&lt;/h2&gt;
&lt;p&gt;The 14K token baseline ceases to represent an operational expense when functioning as a compounding structural investment.&lt;/p&gt;
&lt;p&gt;Elevating the aggregate throughput of foundation models revolves around two methodologies: fine-tuning and harness engineering. &lt;strong&gt;Fine-tuning&lt;/strong&gt; forcibly modifies baseline parameters, bearing immense cost, risks of catastrophic context forgetting, and opaque latent dynamics.&lt;/p&gt;
&lt;p&gt;==Conversely, &lt;strong&gt;Harness Engineering&lt;/strong&gt; fundamentally manipulates the sequence context vector.== It is computationally lightweight, totally subject to strict git version control, and inherently deterministic to inspect.&lt;/p&gt;
&lt;p&gt;The elite infrastructure teams yielding maximum asymmetric leverage from generative systems are not hoarding proprietary LLM variants. They are deliberately investing the initial 14K token load upfront. They migrated off generic, stateless dialogue APIs into profoundly structured, agentic execution loops (like the Ralph iterations). It is the only guaranteed mechanism to encode judgment directly into the execution fabric.&lt;/p&gt;
&lt;p&gt;The harness pipeline should remain completely stripped of logic. The domain parameter skills should be deeply dense. And the Layer 3 temporal persistence graph must perpetually learn.&lt;/p&gt;
&lt;p&gt;That is how engineering scales outside the standard training distribution.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;Sharad Jain builds agentic AI pipelines in Bengaluru. He previously engineered core data infrastructures at Meta and acts as the founder of &lt;a href=&quot;http://autoscreen.ai&quot;&gt;autoscreen.ai&lt;/a&gt;, a production voice AI platform. He writes deeply about ML systems architectures at &lt;a href=&quot;https://sharadja.in&quot;&gt;sharadja.in&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Research &amp;amp; Footnotes:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Liang et al., &lt;a href=&quot;https://arxiv.org/abs/2402.10962&quot;&gt;&quot;Measuring and Controlling Instruction (In)Stability in Language Model Dialogs&quot;&lt;/a&gt; — &lt;em&gt;COLM 2024&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;Liu et al., &lt;a href=&quot;https://arxiv.org/abs/2307.03172&quot;&gt;&quot;Lost in the Middle: How Language Models Use Long Contexts&quot;&lt;/a&gt; — &lt;em&gt;Stanford / UC Berkeley 2023&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;Anthropic Alignment Science: &lt;a href=&quot;https://www.anthropic.com/engineering/building-effective-agents&quot;&gt;Building Effective Agents&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Y Combinator (Garry Tan): &lt;a href=&quot;https://github.com/garrytan/gbrain/blob/master/docs/ethos/THIN_HARNESS_FAT_SKILLS.md&quot;&gt;Thin Harness, Fat Skills Substrate&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Tobi Lütke: &lt;a href=&quot;https://github.com/tobi/qmd&quot;&gt;QMD&lt;/a&gt; — Local hybrid search engine for markdown knowledge bases (BM25 + vector + HyDE)&lt;/li&gt;
&lt;li&gt;Open Source: &lt;a href=&quot;https://github.com/hesreallyhim/awesome-claude-code&quot;&gt;awesome-claude-code&lt;/a&gt; — Ecosystem for structural &lt;code&gt;CLAUDE.md&lt;/code&gt; mappings.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><enclosure url="https://sharadja.in/astral_hacker.png" type="image/png" length="0"/><category>AI</category><category>system-prompt</category><category>agents</category><category>Claude</category><category>harness</category><category>architecture</category><category>memory</category><category>self-improvement</category><category>agentic-ai</category><category>prompt-engineering</category><author>Sharad Jain</author></item><item><title>My AI Agent&apos;s Memory Paid for Itself: How 590 Sessions Made Claude Code 1.8x Cheaper</title><link>https://sharadja.in/blog/benchmarking-long-term-ai-agent-memory/</link><guid isPermaLink="true">https://sharadja.in/blog/benchmarking-long-term-ai-agent-memory/</guid><description>I turned 590 Claude Code sessions into a searchable second brain, then ran 28 experiments to measure whether it actually helps. The answer: 1.8x cheaper, 1.8x faster -- but not for the reason I expected.</description><pubDate>Wed, 01 Apr 2026 10:00:00 GMT</pubDate><content:encoded>&lt;h1&gt;590 Sessions Later&lt;/h1&gt;
&lt;p&gt;A year ago I wrote about &lt;a href=&quot;https://sharadja.in/blog/system-prompt-architecture-agentic-ai&quot;&gt;the 14K token debt&lt;/a&gt; -- the idea that your system prompt is the most important architectural decision in agentic AI. That post argued for building a temporal knowledge graph: index your Claude Code sessions, feed the agent its own past, and compound returns across sessions.&lt;/p&gt;
&lt;p&gt;I&apos;ve been running that system in production for six months. The question I couldn&apos;t answer back then: &lt;strong&gt;does it actually work?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;So I ran the experiment. 14 queries, 28 runs, two conditions. This post documents the system, the methodology, the results, and the one finding I didn&apos;t expect -- that an agent doing &lt;em&gt;more&lt;/em&gt; work can cost &lt;em&gt;less&lt;/em&gt;, if the work is structured right.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The System&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/benchmarking-memory/brain-mcp-pipeline.png&quot; alt=&quot;Brain MCP Architecture&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The system is called Brain MCP. It is a memory layer that sits between Claude Code and &lt;a href=&quot;https://github.com/tobi/qmd&quot;&gt;QMD&lt;/a&gt;, Tobi Lutke&apos;s MCP-native search server. QMD provides BM25, vector, and hypothetical document search over markdown collections via the Model Context Protocol. Brain MCP adds four things on top:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A &lt;strong&gt;session indexer&lt;/strong&gt; that parses Claude Code&apos;s JSONL transcripts into structured markdown with frontmatter, conversation turns, tool logs, and reasoning traces.&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;LLM distillation pipeline&lt;/strong&gt; that compresses each session into a dense knowledge artifact -- goals, decisions, rejected approaches, concept tags -- using Gemini 2.5 Flash.&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;automated sync loop&lt;/strong&gt; that triggers on every session end, keeping the index current without any human intervention.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;self-optimizing search strategy&lt;/strong&gt; that mutates its own parameters and measures the results against a deterministic scoring function.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The architecture:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Claude Code Session
        |
        v
  [Stop Hook]  ─────────────────────────────+
        |                                    |
        v                                    v
  brain.py index                        qmd update
  (JSONL → sessions/*.md)             (BM25 reindex)
        |
        v
  [cron every 2h]
        |
        v
  brain.py distill
  (Gemini 2.5 Flash)
        |
        v
  distilled/*.md                  ┌──────────────────────┐
  (goals, decisions, tags)        │    QMD MCP Server     │
        |                         │                       │
        v                         │  lex:  BM25 keyword   │
  qmd embed                       │  vec:  semantic        │
  (vector embeddings)             │  hyde: hypothetical    │
        |                         │                       │
        +────────────────────────►│  4 collections:       │
                                  │   brain     (590 docs)│
                                  │   distilled (395 docs)│
                                  │   kb-wiki   (211 docs)│
                                  │   kb-raw    (526 docs)│
                                  └──────────┬────────────┘
                                             │
                                             v
                                  Claude Code queries
                                  its own past
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Production stats as of April 2026:&lt;/strong&gt; 597 sessions indexed. 395 distilled. 1,709 total searchable documents across four collections. The Stop hook fires in under two seconds. The cron job runs every two hours.&lt;/p&gt;
&lt;h3&gt;The Four Collections&lt;/h3&gt;
&lt;p&gt;Each collection serves a different retrieval purpose. This isn&apos;t arbitrary -- it&apos;s the result of six months of iterating on what the agent actually needs at query time.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Collection&lt;/th&gt;
&lt;th&gt;Docs&lt;/th&gt;
&lt;th&gt;What&apos;s in it&lt;/th&gt;
&lt;th&gt;When it helps&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;brain&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;590&lt;/td&gt;
&lt;td&gt;Raw session transcripts -- every tool call, file read, decision, correction&lt;/td&gt;
&lt;td&gt;When you need &lt;em&gt;verbatim&lt;/em&gt; recall: &quot;what exact command did I run?&quot;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;distilled&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;395&lt;/td&gt;
&lt;td&gt;LLM-compressed summaries -- goals, key decisions, rejected approaches, concept tags&lt;/td&gt;
&lt;td&gt;When you need &lt;em&gt;semantic&lt;/em&gt; recall: &quot;what did I decide about auth?&quot;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;kb-wiki&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;211&lt;/td&gt;
&lt;td&gt;Synthesized concept pages, comparison tables, entity summaries&lt;/td&gt;
&lt;td&gt;When you need &lt;em&gt;general knowledge&lt;/em&gt;: &quot;how does BM25 work?&quot;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;kb-raw&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;526&lt;/td&gt;
&lt;td&gt;Raw source articles, transcripts, documentation&lt;/td&gt;
&lt;td&gt;When you need &lt;em&gt;primary sources&lt;/em&gt;: the original article about RAG architectures&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;==The insight behind having both &lt;code&gt;brain&lt;/code&gt; and &lt;code&gt;distilled&lt;/code&gt; is that they fail in complementary ways.== Raw transcripts are noisy but lossless -- you can grep for the exact error message you saw three months ago. Distilled artifacts are clean but lossy -- the LLM may have dropped a detail that turns out to matter. Together, they cover more ground than either alone.&lt;/p&gt;
&lt;h3&gt;Three Search Modes&lt;/h3&gt;
&lt;p&gt;QMD exposes three search modes, each suited to different query types:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;lex (BM25)&lt;/strong&gt;: Classical keyword matching with TF-IDF weighting. Fast, exact, no embedding required. Best for: specific error messages, CLI commands, file paths, proper nouns.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;vec (semantic)&lt;/strong&gt;: Dense vector similarity over document embeddings. Best for: conceptual queries where the exact terms don&apos;t appear in the document (&quot;authorization failures&quot; matches a session about &quot;JWT token rejection&quot;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;hyde (hypothetical document)&lt;/strong&gt;: You write a short paragraph describing what the &lt;em&gt;answer&lt;/em&gt; looks like. The engine embeds that paragraph and finds documents similar to it. Best for: open-ended questions where you know the shape of the answer but not where it lives.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A single query can combine all three. The &lt;code&gt;search_strategy.yaml&lt;/code&gt; controls the order -- the first sub-query gets 2x weight during reranking:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;queries:
  sub_query_order: [lex, vec, hyde]
  hyde_template: &amp;gt;
    The answer to the question &quot;{query}&quot; in the context of AI agents,
    LLM systems, autonomous research loops, and knowledge base design is:
&lt;/code&gt;&lt;/pre&gt;
&lt;hr /&gt;
&lt;h2&gt;The Indexer&lt;/h2&gt;
&lt;p&gt;The session indexer (&lt;code&gt;brain.py&lt;/code&gt;) is 614 lines of Python with no dependencies beyond the standard library (plus &lt;code&gt;google-genai&lt;/code&gt; for distillation). It does two things: parse JSONL transcripts into markdown, and distill them via LLM.&lt;/p&gt;
&lt;h3&gt;Parsing: JSONL to Markdown&lt;/h3&gt;
&lt;p&gt;Every Claude Code session produces a JSONL file where each line is a JSON object with a &lt;code&gt;type&lt;/code&gt; field: &lt;code&gt;system&lt;/code&gt;, &lt;code&gt;user&lt;/code&gt;, &lt;code&gt;assistant&lt;/code&gt;. Assistant messages contain content blocks -- &lt;code&gt;text&lt;/code&gt;, &lt;code&gt;thinking&lt;/code&gt;, and &lt;code&gt;tool_use&lt;/code&gt;. The parser extracts all three:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;EXTRACT_INPUT_TOOLS = {&quot;Read&quot;, &quot;Edit&quot;, &quot;Write&quot;, &quot;Glob&quot;, &quot;Grep&quot;,
                       &quot;Bash&quot;, &quot;WebSearch&quot;, &quot;WebFetch&quot;}

def extract_tool_summary(block: dict) -&amp;gt; str | None:
    &quot;&quot;&quot;Return a one-line summary of a tool_use block.&quot;&quot;&quot;
    name = block.get(&quot;name&quot;, &quot;&quot;)
    inp  = block.get(&quot;input&quot;, {})
    if name == &quot;Bash&quot;:
        cmd = inp.get(&quot;command&quot;, &quot;&quot;).strip()
        if cmd:
            return f&quot;Bash: {cmd[:120]}&quot;
    elif name in (&quot;Read&quot;, &quot;Edit&quot;, &quot;Write&quot;, &quot;Glob&quot;, &quot;Grep&quot;):
        path = (inp.get(&quot;file_path&quot;) or inp.get(&quot;pattern&quot;)
                or inp.get(&quot;path&quot;) or &quot;&quot;)
        if path:
            return f&quot;{name}: {path}&quot;
    ...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The output is a markdown file with YAML frontmatter (for QMD indexing) and three sections: Conversation, Tools Used, and Reasoning. Each session gets one file in &lt;code&gt;~/.brain/sessions/&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;---
session_id: 02b3c2d7-22e2-4f22-b270-bb14c60e1c87
slug: rosy-sprouting-sprout
date: 2026-04-02
project: ~/Projects/udacity-reviews-hq/projects/travel-agent/tmp
branch: master
---

## Conversation

**User**: Can you diagnose when I enter this command, `gemini -yolo`?
It takes like 6-7 good seconds to boot up...

**Assistant**: Found the main culprit. There are two issues...

## Tools Used

- Read: ~/.gemini/settings.json
- Bash: strace -e trace=network gemini --yolo 2&amp;gt;&amp;amp;1 | head -30
- Edit: ~/.gemini/settings.json

## Reasoning

&amp;gt; The user reports 6-7 second startup delay. Let me check if there&apos;s
&amp;gt; a blocking network call or file I/O issue in the gemini CLI...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;--new&lt;/code&gt; flag makes indexing incremental -- it only processes JSONL files whose modification time is newer than the corresponding &lt;code&gt;.md&lt;/code&gt; file. The &lt;code&gt;--queue&lt;/code&gt; flag appends newly indexed session stems to &lt;code&gt;pending_distill.txt&lt;/code&gt; for the cron job to pick up later.&lt;/p&gt;
&lt;h3&gt;The Stop Hook&lt;/h3&gt;
&lt;p&gt;The key infrastructure is a Claude Code &lt;a href=&quot;https://docs.anthropic.com/en/docs/claude-code/hooks&quot;&gt;Stop hook&lt;/a&gt; that fires after every session ends:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{
  &quot;hooks&quot;: {
    &quot;Stop&quot;: [{
      &quot;matcher&quot;: &quot;&quot;,
      &quot;hooks&quot;: [{
        &quot;type&quot;: &quot;command&quot;,
        &quot;command&quot;: &quot;python3 ~/.brain/brain.py index --new --queue &amp;gt;&amp;gt; ~/.brain/index.log 2&amp;gt;&amp;amp;1 &amp;amp;&amp;amp; qmd update &amp;gt;&amp;gt; ~/.brain/index.log 2&amp;gt;&amp;amp;1&quot;
      }]
    }]
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This runs synchronously before the session fully closes. Two operations: &lt;code&gt;index --new --queue&lt;/code&gt; (parse any new JSONL files, enqueue for distillation) and &lt;code&gt;qmd update&lt;/code&gt; (rebuild the BM25 index). Total latency: under two seconds. The user doesn&apos;t notice.&lt;/p&gt;
&lt;p&gt;The heavy work -- LLM distillation and vector embedding -- happens asynchronously via a cron job every two hours:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/usr/bin/env bash
# cron_sync.sh — runs every 2 hours

# 1. Catch any sessions missed by the Stop hook
python3 ~/.brain/brain.py index --new --queue

# 2. Drain distillation queue via Gemini
python3 ~/.brain/brain.py distill --from-pending

# 3. BM25 reindex (picks up new distilled/ files)
qmd update

# 4. Refresh vector embeddings (~17s per batch)
qmd embed
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This two-tier architecture -- fast synchronous indexing on every session, slow asynchronous distillation on a cron -- means the BM25 index is always current (within seconds) while the semantic index catches up within two hours. In practice, this is good enough: by the time I need to semantically search a session, the cron has already processed it.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Distillation Pipeline&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/benchmarking-memory/distillation-noise-signal.png&quot; alt=&quot;Data Distillation: Noise to Signal&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This is the layer that makes the system more than a glorified grep. Raw session transcripts are 50-200 lines of conversation noise -- tool calls, error messages, tangential explorations, corrections. The distillation pipeline compresses each session into a ~30-line structured artifact that captures what actually matters.&lt;/p&gt;
&lt;h3&gt;What Distillation Produces&lt;/h3&gt;
&lt;p&gt;Here&apos;s a real example. The raw session is a 60-line transcript about diagnosing a 6-7 second startup delay in the Gemini CLI. Here&apos;s what the distiller produces:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;---
session_id: 02b3c2d7-22e2-4f22-b270-bb14c60e1c87
date: 2026-04-02
project: ~/Projects/udacity-reviews-hq/projects/travel-agent/tmp
distilled: true
distilled_at: 2026-04-06
prompt_version: v1-d68727f6c6f4
confidence: High
---

## Goal
Diagnose and resolve a 6-7 second startup delay in the `gemini -yolo` command.

## What Happened
Initial diagnosis revealed a blocking IDE extension connection timeout
caused by `ide.enabled: true` in `~/.gemini/settings.json`. After
disabling IDE integration, a new 12.5s delay (10s I/O wait) was traced
to a 730MB `~/.gemini/tmp/` directory with 1,298 session checkpoint files.

## Key Decisions
- Disabled IDE integration — not using Windsurf features from terminal
- Identified two separate bottlenecks (network timeout + I/O wait)

## What Was Rejected
- Did not immediately disable session retention without understanding it

## Technologies
gemini, Windsurf, Node.js, Claude Code

## Concepts
CLI startup performance, IDE integration, connection timeout, I/O wait,
session retention, headless execution

## Tags
gemini cli, startup performance, ide integration, connection timeout,
session retention, performance optimization, troubleshooting
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The raw transcript has the full diagnostic conversation -- the back-and-forth about what IDE integration does, the strace commands, the file size checks. The distilled version has the &lt;em&gt;intelligence&lt;/em&gt;: what was the goal, what was decided, what was rejected, and what concepts were involved.&lt;/p&gt;
&lt;h3&gt;The Distillation Prompt&lt;/h3&gt;
&lt;p&gt;The prompt is carefully designed to prevent the most common failure mode: LLM invention. The distiller must extract only what is explicitly stated, never infer or fabricate:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;DISTILL_PROMPT = &quot;&quot;&quot;You are a technical knowledge distiller.
Given a Claude Code conversation, extract a structured knowledge artifact.

Output this exact structure:
---
session_id: {session_id}
date: {date}
project: {project}
...
---

## Goal
[One sentence. If purely exploratory with no clear goal, say so.]

## What Happened
[3-5 sentences. Past tense. Include outcome.]

## Key Decisions
[2-5 bullet points. Omit entire section if no decisions were made.]

## What Was Rejected
[0-3 bullet points. Omit if none -- do NOT invent rejections.]

...

Rules:
- Only extract what is EXPLICITLY stated. Never invent.
- Keep Goal to one sentence maximum.
- If session is a narrow sub-agent task, describe that task accurately.
&quot;&quot;&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The prompt version is tracked via a hash (&lt;code&gt;v1-d68727f6c6f4&lt;/code&gt;). When the prompt changes, the &lt;code&gt;--new&lt;/code&gt; flag detects stale distillations by checking the &lt;code&gt;prompt_version&lt;/code&gt; field in existing files and re-queues them.&lt;/p&gt;
&lt;h3&gt;Cost and Throughput&lt;/h3&gt;
&lt;p&gt;Distillation runs on Gemini 2.5 Flash. I evaluated four options before settling on this:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Cost per 394 sessions&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Tradeoff&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Flash (batch)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.38&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~1 hour&lt;/td&gt;
&lt;td&gt;Best price/quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 3.5 Plus (batch)&lt;/td&gt;
&lt;td&gt;$0.27&lt;/td&gt;
&lt;td&gt;~1 hour&lt;/td&gt;
&lt;td&gt;Cheapest, Alibaba account needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kimi K2.5&lt;/td&gt;
&lt;td&gt;$0.72&lt;/td&gt;
&lt;td&gt;~5 min&lt;/td&gt;
&lt;td&gt;No batch discount&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 12B (local)&lt;/td&gt;
&lt;td&gt;$0.00&lt;/td&gt;
&lt;td&gt;~65 min&lt;/td&gt;
&lt;td&gt;Zero cost, M1 Pro 40 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;At $0.38 for the initial 394-session backfill and ~$0.001/day ongoing, the cost is effectively zero. The session text is truncated to 8,000 characters before being sent to the LLM -- enough to capture the essential decisions without hitting rate limits on long sessions.&lt;/p&gt;
&lt;p&gt;Rate limiting is handled with exponential backoff:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;for attempt in range(3):
    try:
        response = client.models.generate_content(
            model=&quot;gemini-2.5-flash&quot;, contents=prompt
        )
        return response.text.strip()
    except Exception as e:
        if &quot;429&quot; in str(e) or &quot;RESOURCE_EXHAUSTED&quot; in str(e):
            wait = 60 * (2 ** attempt)  # 60s, 120s, 240s
            time.sleep(wait)
&lt;/code&gt;&lt;/pre&gt;
&lt;hr /&gt;
&lt;h2&gt;The Experiment&lt;/h2&gt;
&lt;h3&gt;Design&lt;/h3&gt;
&lt;p&gt;I took 14 knowledge retrieval queries spanning five categories:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Operational recall&lt;/strong&gt; (3 queries): Questions about specific past decisions.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&quot;What decisions were made about the OpenClaw sandbox bind mount configuration?&quot;&lt;/li&gt;
&lt;li&gt;&quot;How was the brain MCP CLI designed and what was rejected?&quot;&lt;/li&gt;
&lt;li&gt;&quot;What is the current state of the OpenClaw artifact classification pipeline?&quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Architecture&lt;/strong&gt; (3 queries): System design and understanding.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&quot;How do autonomous LLM agents use tools and manage context across turns?&quot;&lt;/li&gt;
&lt;li&gt;&quot;What is the architecture of a RAG system and where does retrieval quality break down?&quot;&lt;/li&gt;
&lt;li&gt;&quot;How should a knowledge base be structured for LLM retrieval and synthesis?&quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Research concepts&lt;/strong&gt; (3 queries): Conceptual knowledge retrieval.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&quot;What are the tradeoffs between BM25 keyword search and semantic vector search?&quot;&lt;/li&gt;
&lt;li&gt;&quot;How does the Karpathy autoresearch loop pattern work?&quot;&lt;/li&gt;
&lt;li&gt;&quot;What techniques improve LLM reasoning quality on complex multi-step tasks?&quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Cross-project debugging&lt;/strong&gt; (3 queries): Lessons learned across projects.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&quot;Why do multi-agent systems fail to coordinate?&quot;&lt;/li&gt;
&lt;li&gt;&quot;What deployment mistakes caused the most debugging sessions?&quot;&lt;/li&gt;
&lt;li&gt;&quot;How was the Jules API integration implemented?&quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Active project context&lt;/strong&gt; (2 queries): Current state awareness.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&quot;What is the current goal of the Google Ads autoresearcher project?&quot;&lt;/li&gt;
&lt;li&gt;&quot;What LLM agent frameworks have been used across projects?&quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These 14 queries were originally created for the autoresearch loop&apos;s test suite -- they weren&apos;t cherry-picked for the benchmark. They span all four collections and represent the kinds of questions I actually ask the brain in production.&lt;/p&gt;
&lt;h3&gt;Conditions&lt;/h3&gt;
&lt;p&gt;Two conditions, run sequentially on the same machine:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Vanilla&lt;/strong&gt;: Claude Code with brain MCP tools blocked via &lt;code&gt;--disallowedTools mcp__brain__query,mcp__brain__get,mcp__brain__multi_get,mcp__brain__status&lt;/code&gt;. The agent can only use Grep, Glob, Read, and Bash to search &lt;code&gt;~/.brain/&lt;/code&gt; files manually. It has access to the same data -- just not the structured retrieval tools.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brain MCP&lt;/strong&gt;: Claude Code with full access to QMD brain MCP tools. The agent can issue multi-mode search queries (lex/vec/hyde) across all four collections and retrieve documents by path.&lt;/p&gt;
&lt;p&gt;Both conditions use:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Claude Sonnet 4 (&lt;code&gt;claude-sonnet-4-6&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Identical system prompt: &lt;em&gt;&quot;You are a knowledge retrieval agent. Answer the user&apos;s question using the tools available to you. Search thoroughly, then provide a concise, well-sourced answer.&quot;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;Pipe mode (&lt;code&gt;claude -p --output-format json&lt;/code&gt;) for structured metrics&lt;/li&gt;
&lt;li&gt;10-minute timeout per query&lt;/li&gt;
&lt;li&gt;No session persistence (&lt;code&gt;--no-session-persistence&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Bypass permissions (&lt;code&gt;--permission-mode bypassPermissions&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Measurement&lt;/h3&gt;
&lt;p&gt;The benchmark harness (&lt;code&gt;benchmark_harness.py&lt;/code&gt;) captures metrics from Claude Code&apos;s JSON output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;usage = data.get(&quot;usage&quot;, {})
inp_tok = usage.get(&quot;input_tokens&quot;, 0)
out_tok = usage.get(&quot;output_tokens&quot;, 0)
cache_read = usage.get(&quot;cache_read_input_tokens&quot;, 0)
cache_create = usage.get(&quot;cache_creation_input_tokens&quot;, 0)

# Effective tokens = everything the model actually processed
effective_tokens = inp_tok + out_tok + cache_read + cache_create
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Metrics per experiment: effective tokens processed, output tokens generated, cache read/create tokens, agentic turns, API latency (server-side), wall-clock time, cost in USD, answer length in characters.&lt;/p&gt;
&lt;p&gt;The harness strips &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; from the environment to ensure Claude Code uses OAuth rather than a stale API key:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;env = {k: v for k, v in os.environ.items() if k != &quot;ANTHROPIC_API_KEY&quot;}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;28 experiments total: 14 queries x 2 conditions. Vanilla runs first, then Brain MCP. This ordering matters -- I&apos;ll address the bias it introduces in the limitations section.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Results&lt;/h2&gt;
&lt;h3&gt;Aggregate&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Vanilla&lt;/th&gt;
&lt;th&gt;Brain MCP&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$5.64&lt;/td&gt;
&lt;td&gt;$3.18&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.8x cheaper&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Wall-clock&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;39.5 min&lt;/td&gt;
&lt;td&gt;21.9 min&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.8x faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1,621s&lt;/td&gt;
&lt;td&gt;951s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.7x faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Output tokens&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;22,537&lt;/td&gt;
&lt;td&gt;46,355&lt;/td&gt;
&lt;td&gt;2.1x more generated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Answer length&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;48,054 chars&lt;/td&gt;
&lt;td&gt;56,129 chars&lt;/td&gt;
&lt;td&gt;17% longer answers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tokens processed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.1M&lt;/td&gt;
&lt;td&gt;3.5M&lt;/td&gt;
&lt;td&gt;3.2x more throughput&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agentic turns&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;57&lt;/td&gt;
&lt;td&gt;185&lt;/td&gt;
&lt;td&gt;3.2x more turns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Failures&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1 timeout&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;100% completion&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Brain MCP won on cost for 12 of 14 queries (86%). Won on speed for 12 of 14 (86%).&lt;/p&gt;
&lt;h3&gt;Per-Query Breakdown&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Query&lt;/th&gt;
&lt;th&gt;V Cost&lt;/th&gt;
&lt;th&gt;B Cost&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;th&gt;V Wall&lt;/th&gt;
&lt;th&gt;B Wall&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;OpenClaw sandbox decisions&lt;/td&gt;
&lt;td&gt;$0.34&lt;/td&gt;
&lt;td&gt;$0.12&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.7x&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;113s&lt;/td&gt;
&lt;td&gt;52s&lt;/td&gt;
&lt;td&gt;2.2x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Brain CLI design + rejections&lt;/td&gt;
&lt;td&gt;$0.54&lt;/td&gt;
&lt;td&gt;$0.23&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.3x&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;402s&lt;/td&gt;
&lt;td&gt;97s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.1x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;OpenClaw artifact classification&lt;/td&gt;
&lt;td&gt;TIMEOUT&lt;/td&gt;
&lt;td&gt;$0.17&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;300s&lt;/td&gt;
&lt;td&gt;89s&lt;/td&gt;
&lt;td&gt;3.4x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Autonomous LLM agents + tools&lt;/td&gt;
&lt;td&gt;$0.52&lt;/td&gt;
&lt;td&gt;$0.20&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.6x&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;166s&lt;/td&gt;
&lt;td&gt;111s&lt;/td&gt;
&lt;td&gt;1.5x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;RAG architecture + quality&lt;/td&gt;
&lt;td&gt;$0.38&lt;/td&gt;
&lt;td&gt;$0.16&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.3x&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;111s&lt;/td&gt;
&lt;td&gt;77s&lt;/td&gt;
&lt;td&gt;1.4x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Knowledge base for LLM retrieval&lt;/td&gt;
&lt;td&gt;$0.41&lt;/td&gt;
&lt;td&gt;$0.19&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.1x&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;148s&lt;/td&gt;
&lt;td&gt;97s&lt;/td&gt;
&lt;td&gt;1.5x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;BM25 vs semantic search&lt;/td&gt;
&lt;td&gt;$0.37&lt;/td&gt;
&lt;td&gt;$0.14&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.8x&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;119s&lt;/td&gt;
&lt;td&gt;60s&lt;/td&gt;
&lt;td&gt;2.0x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Karpathy autoresearch loop&lt;/td&gt;
&lt;td&gt;$0.23&lt;/td&gt;
&lt;td&gt;$0.28&lt;/td&gt;
&lt;td&gt;0.8x&lt;/td&gt;
&lt;td&gt;72s&lt;/td&gt;
&lt;td&gt;82s&lt;/td&gt;
&lt;td&gt;0.9x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;LLM reasoning techniques&lt;/td&gt;
&lt;td&gt;$0.43&lt;/td&gt;
&lt;td&gt;$0.24&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.8x&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;146s&lt;/td&gt;
&lt;td&gt;134s&lt;/td&gt;
&lt;td&gt;1.1x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Multi-agent coordination&lt;/td&gt;
&lt;td&gt;$0.52&lt;/td&gt;
&lt;td&gt;$0.42&lt;/td&gt;
&lt;td&gt;1.2x&lt;/td&gt;
&lt;td&gt;245s&lt;/td&gt;
&lt;td&gt;161s&lt;/td&gt;
&lt;td&gt;1.5x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;Deployment debugging&lt;/td&gt;
&lt;td&gt;$0.72&lt;/td&gt;
&lt;td&gt;$0.21&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.5x&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;180s&lt;/td&gt;
&lt;td&gt;89s&lt;/td&gt;
&lt;td&gt;2.0x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;Jules API integration&lt;/td&gt;
&lt;td&gt;$0.63&lt;/td&gt;
&lt;td&gt;$0.19&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.4x&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;159s&lt;/td&gt;
&lt;td&gt;66s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.4x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;Google Ads autoresearcher&lt;/td&gt;
&lt;td&gt;$0.28&lt;/td&gt;
&lt;td&gt;$0.24&lt;/td&gt;
&lt;td&gt;1.2x&lt;/td&gt;
&lt;td&gt;90s&lt;/td&gt;
&lt;td&gt;90s&lt;/td&gt;
&lt;td&gt;1.0x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;LLM agent frameworks survey&lt;/td&gt;
&lt;td&gt;$0.28&lt;/td&gt;
&lt;td&gt;$0.40&lt;/td&gt;
&lt;td&gt;0.7x&lt;/td&gt;
&lt;td&gt;119s&lt;/td&gt;
&lt;td&gt;109s&lt;/td&gt;
&lt;td&gt;1.1x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The biggest wins cluster around &lt;strong&gt;targeted recall queries&lt;/strong&gt; -- questions with a specific right answer that lives in a specific document. Query 11 (deployment debugging) shows the pattern most clearly: vanilla Claude Code spent $0.72 and 180 seconds doing broad grep scans across &lt;code&gt;~/.brain/sessions/&lt;/code&gt;, reading full files, getting noise. Brain MCP spent $0.21 and 89 seconds making targeted queries, reading the top-ranked results, and synthesizing.&lt;/p&gt;
&lt;p&gt;Query 2 (brain CLI design) is the most dramatic speedup: 4.1x faster. Vanilla took 402 seconds -- over six minutes of file searching -- while Brain MCP needed 97 seconds. The vanilla agent made 5 turns of broad file reads. Brain MCP made 13 turns of targeted retrieval, each turn cheap because the system prompt was already cached.&lt;/p&gt;
&lt;p&gt;Query 3 (OpenClaw artifact classification) is the only vanilla timeout. The agent spent its full 5-minute budget on grep scans without finding the right session files. Brain MCP answered the same query in 89 seconds with 10 targeted turns.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Counterintuitive Finding: Why More Work Costs Less&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/benchmarking-memory/cache-economics-chart.png&quot; alt=&quot;Cache Economics Chart&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Here&apos;s the result I didn&apos;t expect:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brain MCP processed 3.5 million effective tokens across 185 agentic turns. Vanilla processed 1.1 million tokens across 57 turns. Brain MCP cost 44% less.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;More tokens. More turns. Less money. This seems wrong until you look at the token breakdown.&lt;/p&gt;
&lt;h3&gt;The Cache Mechanism&lt;/h3&gt;
&lt;p&gt;Claude Code&apos;s system prompt is approximately 14,000 tokens. On the first turn of a conversation, these tokens are processed at the normal input rate (cache creation, billed at 1.25x). On every subsequent turn, they&apos;re read from cache at 0.1x the normal rate.&lt;/p&gt;
&lt;p&gt;The pricing math:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Token type&lt;/th&gt;
&lt;th&gt;Rate (Sonnet)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Input (uncached)&lt;/td&gt;
&lt;td&gt;$3.00 / 1M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache creation&lt;/td&gt;
&lt;td&gt;$3.75 / 1M (1.25x)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache read&lt;/td&gt;
&lt;td&gt;$0.30 / 1M (0.1x)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output&lt;/td&gt;
&lt;td&gt;$15.00 / 1M&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Brain MCP averages &lt;strong&gt;13.2 turns per query&lt;/strong&gt;. After the first turn, every subsequent turn reads the ~14K system prompt from cache at the discount rate. Its pattern: query brain -&amp;gt; read top result -&amp;gt; refine search terms -&amp;gt; read another result -&amp;gt; synthesize. Each turn adds a small amount of new context (the retrieved document) on top of a large cached prefix.&lt;/p&gt;
&lt;p&gt;Vanilla averages &lt;strong&gt;4.4 turns per query&lt;/strong&gt; (excluding the timed-out query). But each turn does expensive file reads that inject large amounts of new, uncached context. A single &lt;code&gt;Read&lt;/code&gt; tool call that returns a 200-line session transcript adds thousands of uncached tokens. A &lt;code&gt;Grep&lt;/code&gt; that returns 50 matching lines adds uncached context that must be processed at full price.&lt;/p&gt;
&lt;p&gt;The net effect:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Vanilla&lt;/th&gt;
&lt;th&gt;Brain MCP&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Effective tokens&lt;/td&gt;
&lt;td&gt;1,149,157&lt;/td&gt;
&lt;td&gt;3,523,380&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache tokens&lt;/td&gt;
&lt;td&gt;1,126,510 (98.0%)&lt;/td&gt;
&lt;td&gt;3,462,607 (98.3%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Uncached input + output&lt;/td&gt;
&lt;td&gt;22,647&lt;/td&gt;
&lt;td&gt;60,773&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;$5.64&lt;/td&gt;
&lt;td&gt;$3.18&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Both modes are dominated by cache tokens -- 98% or more. But the &lt;em&gt;composition&lt;/em&gt; of those cache tokens differs. Brain MCP&apos;s cache tokens are overwhelmingly cache &lt;em&gt;reads&lt;/em&gt; (cheap). Vanilla&apos;s cache tokens include more cache &lt;em&gt;creation&lt;/em&gt; from the large file contents injected on each turn.&lt;/p&gt;
&lt;p&gt;==&lt;strong&gt;The principle: many cheap, targeted turns beat few expensive, broad turns.&lt;/strong&gt; Structured retrieval creates cache-friendly access patterns.== Brute-force grep doesn&apos;t.&lt;/p&gt;
&lt;p&gt;This has implications beyond this benchmark. It suggests that the conventional wisdom of &quot;minimize turns to minimize cost&quot; is wrong for agentic systems with cached system prompts. The right optimization target isn&apos;t turn count -- it&apos;s &lt;em&gt;uncached token injection per turn&lt;/em&gt;. A 20-turn conversation where each turn adds 500 tokens of targeted retrieval costs less than a 4-turn conversation where each turn adds 5,000 tokens of grep output.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Where Brain MCP Lost&lt;/h2&gt;
&lt;p&gt;Two queries where vanilla was cheaper:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Query 8 -- &quot;How does the Karpathy autoresearch loop pattern work?&quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Vanilla: $0.23, 72s, 16 turns. Brain MCP: $0.28, 82s, 14 turns.&lt;/p&gt;
&lt;p&gt;The answer lived in local project files that vanilla found via grep in its first few turns. The Karpathy pattern is implemented in the current working directory (&lt;code&gt;tools/autoresearch/&lt;/code&gt;), and vanilla Claude Code is excellent at finding things in local files. Brain MCP did a thorough cross-collection search -- more comprehensive, but overkill for a question answerable from one file.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Query 14 -- &quot;What LLM agent frameworks and patterns have been used across projects?&quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Vanilla: $0.28, 119s, 2 turns. Brain MCP: $0.40, 109s, 24 turns.&lt;/p&gt;
&lt;p&gt;This is a broad survey question. Brain MCP did the deepest search in the entire benchmark -- 24 turns across all four collections, reading extensively about every framework mentioned in any session. It produced a more thorough answer. But the thoroughness wasn&apos;t worth the cost. Vanilla gave a shorter, less detailed answer in just 2 turns, and for a survey question, &quot;good enough&quot; is good enough.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The pattern:&lt;/strong&gt; Brain MCP loses on two types of queries:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Single-file answers&lt;/strong&gt; where the data lives in one obvious location (grep is faster than structured search)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Broad surveys&lt;/strong&gt; where exhaustive search costs more than the marginal value of completeness&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;It wins decisively on &lt;strong&gt;targeted recall&lt;/strong&gt; -- &quot;what was decided about X?&quot; -- where relevance ranking eliminates wasted reads. That&apos;s 12 of 14 queries in this benchmark.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Self-Optimizer&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/benchmarking-memory/self-optimizer-tripartite.png&quot; alt=&quot;Tripartite Autonomous Optimization Loop&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Building the brain was step one. Making it optimize itself was step two.&lt;/p&gt;
&lt;p&gt;I implemented a &lt;a href=&quot;https://x.com/karpathy/status/1886192184808149383&quot;&gt;Karpathy-pattern&lt;/a&gt; autoresearch loop: one editable file, one immutable runner, one deterministic scorer. The key insight from Karpathy&apos;s design: &lt;strong&gt;the human writes the scoring function, the LLM proposes mutations, and the loop is fully autonomous.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;The Tripartite Architecture&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Who edits it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;search_strategy.yaml&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The mutable config -- search weights, limits, templates&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;LLM only&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;program.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Operating manual -- scoring formula, mutation vocabulary, failure modes&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Human only&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;runner.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Immutable loop -- eval -&amp;gt; mutate -&amp;gt; keep/revert -&amp;gt; git commit&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Never&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;evaluator.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Deterministic scorer -- precision + recall_proxy + diversity&lt;/td&gt;
&lt;td&gt;Read-only tool&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The separation is critical. The LLM can only touch the strategy file. The human controls the objective function and constraints. The runner enforces the loop. No component can interfere with another.&lt;/p&gt;
&lt;h3&gt;The Hill Function&lt;/h3&gt;
&lt;p&gt;The scorer is fully deterministic -- no LLM calls needed for evaluation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def evaluate(query: str, strategy: dict) -&amp;gt; dict:
    docs = brain_query(query, strategy)

    precision = score_precision(docs, PRECISION_TOP_K)  # mean score of top-6
    recall = score_recall_proxy(docs)    # fraction returning &amp;gt;= 3 docs
    diversity = score_diversity(docs)    # fraction of 4 collections represented

    score = 0.5 * precision + 0.3 * recall + 0.2 * diversity
    return {&quot;score&quot;: round(score, 4), ...}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Three signals, weighted to prioritize precision:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Precision&lt;/strong&gt; (weight 0.5): Mean relevance score of the top-6 results, measured using QMD&apos;s reranker score. Uses a fixed window (&lt;code&gt;PRECISION_TOP_K = 6&lt;/code&gt;) regardless of other parameters, preventing the optimizer from gaming the metric by reducing result count.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Recall proxy&lt;/strong&gt; (weight 0.3): Binary -- did the query return at least 3 documents? This is a crude proxy, but it catches the failure mode where over-filtering leaves queries unanswered.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Diversity&lt;/strong&gt; (weight 0.2): What fraction of the four collections are represented in the results? This encourages cross-collection retrieval, which matters because different query types need different sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The scorer runs all 14 test queries against QMD, computes the aggregate, and returns a single number. No LLM calls. No API costs. This makes it cheap to run hundreds of iterations.&lt;/p&gt;
&lt;h3&gt;The Mutation Backend&lt;/h3&gt;
&lt;p&gt;The runner launches &lt;a href=&quot;https://jules.google.com/&quot;&gt;Jules&lt;/a&gt; (Google&apos;s async coding agent) to propose mutations. Each iteration:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Runner loads the current strategy + last 10 experiment logs + the operating manual (&lt;code&gt;program.md&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Jules reads the context, proposes exactly one mutation with a &lt;code&gt;# HYPOTHESIS:&lt;/code&gt; comment&lt;/li&gt;
&lt;li&gt;Runner evaluates the mutation against all 14 queries&lt;/li&gt;
&lt;li&gt;If the score improves by &amp;gt; 0.01, keep. Otherwise, revert.&lt;/li&gt;
&lt;li&gt;Git commit either way -- the log is the experiment history.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;for i in range(1, max_iter + 1):
    new_strategy_raw = jules_mutate(strategy_raw, logs)

    new_eval = evaluate_all(queries, new_strategy, verbose=True)
    delta = new_eval[&quot;mean_score&quot;] - best_score

    if delta &amp;gt; min_improvement:
        decision = &quot;keep&quot;
        best_score = new_eval[&quot;mean_score&quot;]
    else:
        decision = &quot;revert&quot;
        git_restore()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Plateau detection stops the loop after 3 consecutive non-improvements.&lt;/p&gt;
&lt;h3&gt;What the Optimizer Found&lt;/h3&gt;
&lt;p&gt;Starting score: &lt;strong&gt;0.6993&lt;/strong&gt; (baseline configuration).
Final score: &lt;strong&gt;0.7350&lt;/strong&gt; after 15+ iterations.&lt;/p&gt;
&lt;p&gt;The winning mutations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;limit&lt;/code&gt;: 10 -&amp;gt; 40. More results = more collection coverage = higher diversity score.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;candidate_limit&lt;/code&gt;: 10 -&amp;gt; 40. Matching the limit to the full pre-rerank pool prevents good semantic results from being pruned.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;sub_query_order&lt;/code&gt;: Confirmed &lt;code&gt;[lex, vec, hyde]&lt;/code&gt; -- lex first gets 2x reranking weight, which helps precision.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;hyde_template&lt;/code&gt;: Refined to include domain-specific context (&quot;AI agents, LLM systems, autonomous research loops&quot;).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dead ends (the optimizer tried and reverted):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;sub_query_order: [vec, lex]&lt;/code&gt; -- no effect (+0.000)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;candidate_limit: 80&lt;/code&gt; -- no effect (limit was already at the ceiling)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;min_score: 0.25&lt;/code&gt; or &lt;code&gt;0.20&lt;/code&gt; -- slight regression (no new collections appeared)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;limit &amp;gt; 40&lt;/code&gt; with &lt;code&gt;candidate_limit=40&lt;/code&gt; -- no effect (candidate pool exhausted)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The optimizer also discovered a hard constraint: three queries are stuck at &lt;code&gt;diversity=0.50&lt;/code&gt; because the relevant information only exists in two of four collections (brain + distilled). No search strategy change can fix this -- only adding content to kb-wiki or kb-raw would help. This is a genuine ceiling on the score, and the optimizer correctly stopped trying to push past it.&lt;/p&gt;
&lt;p&gt;I started the loop and went to sleep. Woke up to a git log of experiments, each with a hypothesis, a score delta, and a keep/revert decision. The agent was doing science while I slept.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Honest Limitations&lt;/h2&gt;
&lt;p&gt;This benchmark has real flaws. I&apos;m documenting them because the temptation to present clean results is the enemy of useful results.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. No ground truth scoring.&lt;/strong&gt; I measured cost, speed, and output volume -- not whether the answers were correct or complete. Brain MCP could produce faster, cheaper, &lt;em&gt;wrong&lt;/em&gt; answers and this benchmark would call it a win. A rigorous version would score each answer against gold-standard reference answers written by a human.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. N=1 per condition.&lt;/strong&gt; No repetitions. LLM outputs are stochastic -- the same query can produce different tool call sequences on different runs. One outlier could skew the aggregates. A proper experiment would run 3-5 repetitions per condition and report confidence intervals.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. Cache order bias.&lt;/strong&gt; Vanilla ran first, then Brain MCP. This means Brain MCP benefits from any system-level caching (OS page cache, QMD&apos;s internal caches) that vanilla&apos;s run warmed up. More importantly, Claude Code&apos;s prompt cache persists across sessions on the same machine, so Brain MCP may have started with a warmer cache. Randomizing run order across queries would reduce this bias.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;4. No baseline.&lt;/strong&gt; Both modes include Claude Code&apos;s full ~14K token system prompt. I&apos;m measuring the marginal value of brain MCP tools on top of that baseline, not absolute retrieval efficiency. A proper comparison would include a third condition: the naked LLM with no tools at all, answering from parametric knowledge only.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;5. Model specificity.&lt;/strong&gt; Sonnet, not Opus. Tool-use behavior, search strategy, and persistence differ between models. Results may not generalize.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;6. Confounded system prompt.&lt;/strong&gt; The system prompt says &quot;search thoroughly&quot; -- this may bias Brain MCP toward more turns (which happen to be cheaper) and vanilla toward broader reads. A neutral prompt might produce different cost dynamics.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;7. Evaluator-subject identity.&lt;/strong&gt; I designed the system, wrote the queries, built the benchmark, and analyzed the results. The usual blinding practices are absent.&lt;/p&gt;
&lt;p&gt;A rigorous version of this experiment would: score answers against gold-standard references, run 3+ repetitions per condition per query, randomize condition order, add a naked-LLM baseline, test on multiple models, and have someone else write the test queries.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;What I Learned&lt;/h2&gt;
&lt;h3&gt;Memory is not RAG&lt;/h3&gt;
&lt;p&gt;==RAG retrieves documents to answer questions. Memory retrieves &lt;em&gt;decisions, corrections, and rejected approaches&lt;/em&gt; to avoid repeating mistakes.== The difference is in what you index and how you structure it.&lt;/p&gt;
&lt;p&gt;A RAG system over my session transcripts would return noisy conversation fragments. The distillation layer is what turns it into memory -- compressing a 2,000-line session into a 30-line artifact with structured sections (Goal, Key Decisions, What Was Rejected, Tags). When the agent queries &quot;what did I decide about auth?&quot;, it gets back a structured decision record, not a raw conversation snippet.&lt;/p&gt;
&lt;p&gt;The &quot;What Was Rejected&quot; section is particularly valuable. It prevents the agent from re-exploring dead ends. If a previous session tried and rejected approach X for reason Y, the memory system surfaces that rejection, saving the cost of rediscovering the same dead end.&lt;/p&gt;
&lt;h3&gt;The temporal dimension matters&lt;/h3&gt;
&lt;p&gt;A knowledge base is static. A memory system grows with every session. The Stop hook that fires after every Claude Code conversation is the key infrastructure -- it makes indexing invisible. If it required manual effort, I&apos;d have stopped at session 20.&lt;/p&gt;
&lt;p&gt;==The automation has a compounding effect: every session makes the next session better, without any effort from me.== Session 590 can query the decisions from session 1. The agent doesn&apos;t just have instructions -- it has experience.&lt;/p&gt;
&lt;h3&gt;Cache economics change the design space&lt;/h3&gt;
&lt;p&gt;The conventional wisdom is &quot;fewer tokens = cheaper.&quot; ==This benchmark shows the opposite: more turns with cache-friendly patterns can be cheaper than fewer turns with cold reads.==&lt;/p&gt;
&lt;p&gt;This has implications for how agentic systems should be designed. Instead of optimizing for turn count, optimize for &lt;em&gt;cache utilization&lt;/em&gt;. Structured retrieval tools that return small, targeted results create naturally cache-friendly patterns. File-reading tools that inject large, uncached context don&apos;t.&lt;/p&gt;
&lt;p&gt;The 0.1x cache read pricing is the key parameter. At that discount, the break-even point is roughly: if your cache prefix is 10x larger than the new context you&apos;re adding, additional turns are almost free. Brain MCP&apos;s pattern -- 14K cached prefix, ~500 tokens of new retrieval per turn -- is well below that threshold.&lt;/p&gt;
&lt;h3&gt;Self-optimization is underrated&lt;/h3&gt;
&lt;p&gt;The autoresearch loop found a better search strategy than I could have designed by hand. The deterministic hill function (no LLM needed for scoring) makes it cheap to run hundreds of iterations. The Karpathy pattern -- one mutable file, one immutable loop, one scoring function -- is the simplest architecture that works for autonomous parameter search.&lt;/p&gt;
&lt;p&gt;==The key design decision: &lt;strong&gt;the scoring function must be deterministic and cheap.&lt;/strong&gt;== If you need an LLM to evaluate each iteration, the loop is too expensive to run autonomously. The three-signal hill (precision, recall proxy, diversity) is crude, but it&apos;s computable in seconds with no API calls. That&apos;s what makes overnight autonomous optimization feasible.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The brain MCP system, the benchmark harness, and the autoresearch loop will be open-sourced soon.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;Sharad Jain builds agentic AI systems in Bengaluru. He previously built data infrastructure at Meta and founded &lt;a href=&quot;http://autoscreen.ai&quot;&gt;autoscreen.ai&lt;/a&gt;, a production voice AI platform. He writes about agent architecture at &lt;a href=&quot;https://sharadja.in&quot;&gt;sharadja.in&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Previously:&lt;/strong&gt; &lt;a href=&quot;https://sharadja.in/blog/system-prompt-architecture-agentic-ai&quot;&gt;The 14K Token Debt: How Your System Prompt Shapes Everything&lt;/a&gt;&lt;/p&gt;
</content:encoded><enclosure url="https://sharadja.in/images/blogs/benchmarking-memory/memory-compounding-hero.png" type="image/png" length="0"/><category>AI</category><category>memory</category><category>agents</category><category>Claude</category><category>MCP</category><category>benchmark</category><category>retrieval</category><category>second-brain</category><category>Karpathy</category><category>autoresearch</category><author>Sharad Jain</author></item><item><title>Orchestrating AI Agents for Data Pipelines</title><link>https://sharadja.in/blog/orchestrating-ai-agents-for-data-pipelines/</link><guid isPermaLink="true">https://sharadja.in/blog/orchestrating-ai-agents-for-data-pipelines/</guid><description>How agentic patterns — handoffs, guardrails, and tracing — are transforming brittle ETL into self-healing, intelligent data infrastructure.</description><pubDate>Tue, 15 Apr 2025 10:00:00 GMT</pubDate><content:encoded>&lt;h1&gt;Orchestrating AI Agents for Data Pipelines&lt;/h1&gt;
&lt;p&gt;Data pipelines are the most boring part of engineering — until they break. Then they&apos;re the most expensive.&lt;/p&gt;
&lt;p&gt;I spent two years at Meta running Spark + Airflow pipelines that processed 10TB+ per day. The infrastructure was impressive. The failure modes were not. A single upstream schema change could silently corrupt downstream models for days before anyone noticed. A partition key format change once poisoned a week of training data. The fix took 45 minutes. The detection took six days.&lt;/p&gt;
&lt;p&gt;That six-day gap is not an anomaly. Monte Carlo&apos;s &lt;a href=&quot;https://www.montecarlodata.com/blog-data-quality-survey&quot;&gt;2023 State of Data Quality survey&lt;/a&gt; found that 68% of data teams need four or more hours just to &lt;em&gt;detect&lt;/em&gt; a pipeline failure, with an average resolution time of 15 hours per incident — and organizations average 67 incidents per month. The business impact: 31% of revenue is exposed to data quality issues, up from 26% the year prior.&lt;/p&gt;
&lt;p&gt;This is the dirty secret of data engineering at scale: we&apos;ve built incredibly powerful pipes and incredibly dumb monitoring. What if the pipes could think?&lt;/p&gt;
&lt;h2&gt;The Brittleness Problem&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/orchestrating-agents/brittleness-corruption.png&quot; alt=&quot;The Brittleness Problem — silent data corruption spreading through pipelines while dashboards show all-green&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Traditional data pipelines follow a rigid choreography:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Extract&lt;/strong&gt; — pull from source (database, API, event stream)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Transform&lt;/strong&gt; — clean, reshape, enrich&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Load&lt;/strong&gt; — write to destination (warehouse, lake, feature store)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each step assumes the world hasn&apos;t changed since the pipeline was written. But the world always changes. Upstream teams rename columns. API response formats evolve. Event schemas gain new fields. And your pipeline — faithfully executing the DAG you wrote six months ago — silently produces garbage.&lt;/p&gt;
&lt;p&gt;The standard defense is testing: schema validation, data quality checks, anomaly detection. But these are &lt;strong&gt;reactive by design&lt;/strong&gt;. They catch problems &lt;em&gt;after&lt;/em&gt; the data is already wrong. At 10TB/day, even a 0.1% failure rate means 10GB of corrupted data flowing downstream before anyone raises an alarm. And the industry data confirms the pattern: Monte Carlo found a steady ratio of &lt;a href=&quot;https://www.montecarlodata.com/blog-data-quality-statistics/&quot;&gt;one data quality incident per 15 tables per year&lt;/a&gt; across their platform — meaning a 600-table warehouse generates roughly 40 incidents annually just from normal operations.&lt;/p&gt;
&lt;p&gt;==I call this the &lt;strong&gt;Detection-Resolution Gap&lt;/strong&gt;: the growing distance between when data goes bad and when someone does something about it.== Every hour in that gap compounds damage — bad dashboards inform bad decisions, corrupted features degrade ML models, and downstream consumers silently ingest poison. The gap is widening because data volumes grow faster than team headcount. You can&apos;t hire your way out of it.&lt;/p&gt;
&lt;h2&gt;The Agent Intelligence Layer&lt;/h2&gt;
&lt;p&gt;The fix isn&apos;t better tests. It&apos;s a different primitive.&lt;/p&gt;
&lt;p&gt;==Instead of writing pipelines as static DAGs with bolted-on checks, design them as networks of specialized agents that &lt;strong&gt;observe, decide, and act&lt;/strong&gt; at each pipeline boundary.== The agents don&apos;t replace your Spark jobs or Airflow DAGs — they sit &lt;em&gt;between&lt;/em&gt; pipeline stages as an intelligence layer, making decisions that previously required a human on-call.&lt;/p&gt;
&lt;p&gt;I think of this as the &lt;strong&gt;Observe-Decide-Act Loop&lt;/strong&gt; — a pattern that distinguishes agent-based pipelines from traditional monitoring:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Traditional pipeline:
  Source → Transform → Load → [check after the fact] → Alert human → Human investigates → Human fixes

Agent-orchestrated pipeline:
  Source → [Agent: observe schema, decide compatibility] → Transform → [Agent: observe distributions, decide quality] → Load
              ↓ (if anomaly detected)
         [Handoff to specialist agent → auto-remediate → resume]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;==The critical shift: decisions happen &lt;em&gt;inline&lt;/em&gt;, before bad data propagates, not after.== The Detection-Resolution Gap collapses from hours to seconds.&lt;/p&gt;
&lt;h2&gt;Three Primitives That Make It Work&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/orchestrating-agents/agents-three-sentinels.png&quot; alt=&quot;Three AI agent sentinels — Handoff, Guardrail, and Tracing — guarding the data pipeline&quot; /&gt;&lt;/p&gt;
&lt;p&gt;OpenAI&apos;s &lt;a href=&quot;https://openai.github.io/openai-agents-python/&quot;&gt;Agents SDK&lt;/a&gt; crystallizes three primitives that map cleanly onto pipeline orchestration:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Handoffs&lt;/strong&gt; — An agent can delegate to another agent. In pipeline terms: the ingestion agent detects a schema anomaly and hands off to a schema resolution agent, which generates the mapping transformation before passing data downstream. No human paged at 3am. No six-day detection lag. The SDK supports this through &lt;a href=&quot;https://openai.github.io/openai-agents-python/multi_agent/&quot;&gt;explicit handoff declarations&lt;/a&gt; where a triage agent routes to specialists who then own the resolution.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Guardrails&lt;/strong&gt; — Input and output validation gates on every agent. Not just &quot;is this field non-null?&quot; but &quot;does this distribution look right? Is this schema compatible with what downstream consumers expect? Did we lose cardinality during the join?&quot; Guardrails encode the &lt;em&gt;contracts&lt;/em&gt; between pipeline stages — making implicit assumptions explicit and enforceable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tracing&lt;/strong&gt; — End-to-end observability across the entire agent network. Every decision, every handoff, every guardrail check is logged with full reasoning context. When something goes wrong, you don&apos;t grep through Airflow logs — you replay the agent&apos;s decision chain. This is what makes agent-based pipelines &lt;em&gt;debuggable&lt;/em&gt;, not just automated.&lt;/p&gt;
&lt;p&gt;These three primitives compose into what I call &lt;strong&gt;Contract-Aware Orchestration&lt;/strong&gt;: each agent doesn&apos;t just process data — it understands the contract it must uphold for downstream consumers and actively enforces it. Traditional ETL treats contracts (when they exist at all) as documentation. Agent-based pipelines treat them as runtime constraints.&lt;/p&gt;
&lt;h2&gt;A Concrete Pattern: The Schema Drift Guardian&lt;/h2&gt;
&lt;p&gt;Here&apos;s what this looks like in practice. Consider the most common pipeline failure: an upstream team adds, renames, or removes a column, and your pipeline either crashes or silently drops data.&lt;/p&gt;
&lt;p&gt;A &lt;strong&gt;Schema Drift Guardian&lt;/strong&gt; agent handles this inline:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from agents import Agent, Runner, GuardrailFunctionOutput, input_guardrail
from pydantic import BaseModel
from typing import Literal

class SchemaVerdict(BaseModel):
    status: Literal[&quot;compatible&quot;, &quot;breaking&quot;, &quot;additive&quot;]
    changes: list[str]
    remediation_sql: str | None
    confidence: float

quality_agent = Agent(
    name=&quot;SchemaGuardian&quot;,
    instructions=&quot;&quot;&quot;You are a schema compatibility agent for a production data pipeline.

    Given the registered schema contract and the incoming data schema:
    1. Classify the change: additive (new columns), breaking (removed/renamed/retyped columns),
       or compatible (no change).
    2. For additive changes: generate ALTER TABLE statements to evolve the destination schema.
    3. For breaking changes: generate transformation SQL that maps the new schema to the
       existing contract, preserving all downstream dependencies.
    4. Assign confidence (0-1). If confidence &amp;lt; 0.85, flag for human review instead of
       auto-applying.

    Never silently drop columns. Never assume a renamed column is a new column without
    checking value distributions.&quot;&quot;&quot;,
    model=&quot;gpt-4o&quot;,
    output_type=SchemaVerdict,
)

remediation_agent = Agent(
    name=&quot;SchemaRemediator&quot;,
    instructions=&quot;&quot;&quot;You resolve schema mismatches between upstream sources and downstream
    consumers. Given the SchemaGuardian&apos;s verdict:
    - For additive changes: apply the evolution and notify downstream consumers.
    - For breaking changes with confidence &amp;gt;= 0.85: apply the transformation SQL,
      run validation on a sample batch, and proceed if validation passes.
    - For breaking changes with confidence &amp;lt; 0.85: quarantine the batch, notify the
      on-call engineer with full context, and continue processing other sources.

    Always log the before/after schema diff and the remediation applied.&quot;&quot;&quot;,
    model=&quot;gpt-4o&quot;,
)

# The handoff: guardian escalates to remediator when schema drift is detected
quality_agent.handoffs = [remediation_agent]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The key design choices here:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Structured output&lt;/strong&gt; (&lt;code&gt;SchemaVerdict&lt;/code&gt;) — the agent doesn&apos;t produce free-form text; it returns a typed decision that downstream code can act on deterministically.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Confidence thresholds&lt;/strong&gt; — the agent knows when it&apos;s uncertain and escalates to a human rather than auto-applying a bad fix. This is the difference between useful automation and dangerous automation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Separation of detection and remediation&lt;/strong&gt; — the guardian &lt;em&gt;classifies&lt;/em&gt;; the remediator &lt;em&gt;acts&lt;/em&gt;. This separation makes each agent simpler to test, debug, and improve independently.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Compare this to the traditional approach:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Traditional: brittle, reactive, manual
try:
    df = spark.read.parquet(source_path)
    df_transformed = df.select(EXPECTED_COLUMNS)  # crashes if columns changed
except AnalysisException as e:
    send_pagerduty_alert(f&quot;Schema mismatch: {e}&quot;)  # human wakes up at 3am
    raise  # pipeline halted until human intervenes
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The traditional approach has two modes: success or human escalation. The agent-based approach has a spectrum: auto-resolve high-confidence changes, quarantine uncertain ones, escalate truly ambiguous cases. Most schema changes — the additive ones, the straightforward renames — never need a human at all.&lt;/p&gt;
&lt;h2&gt;The Confidence Boundary Pattern&lt;/h2&gt;
&lt;p&gt;The confidence threshold in the example above deserves its own framework, because it&apos;s the most important design decision in any agent-based pipeline.&lt;/p&gt;
&lt;p&gt;I call this the &lt;strong&gt;Confidence Boundary&lt;/strong&gt;: the threshold below which an agent stops acting autonomously and escalates to either a more specialized agent or a human. Getting this wrong in either direction is costly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Boundary too high&lt;/strong&gt; (e.g., 0.99): the agent escalates everything, and you&apos;ve built an expensive alerting system disguised as an agent.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Boundary too low&lt;/strong&gt; (e.g., 0.50): the agent auto-applies bad fixes, and you&apos;ve built a corruption engine.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The right boundary depends on the &lt;em&gt;blast radius&lt;/em&gt; of a wrong decision:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pipeline Stage&lt;/th&gt;
&lt;th&gt;Blast Radius&lt;/th&gt;
&lt;th&gt;Recommended Confidence Boundary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Schema evolution (additive)&lt;/td&gt;
&lt;td&gt;Low — worst case adds unused columns&lt;/td&gt;
&lt;td&gt;0.70&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema evolution (breaking)&lt;/td&gt;
&lt;td&gt;High — can corrupt downstream models&lt;/td&gt;
&lt;td&gt;0.90&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data quality (distribution drift)&lt;/td&gt;
&lt;td&gt;Medium — may indicate real shift or bug&lt;/td&gt;
&lt;td&gt;0.80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data quality (null spike)&lt;/td&gt;
&lt;td&gt;High — usually indicates upstream failure&lt;/td&gt;
&lt;td&gt;0.85&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Partition repair&lt;/td&gt;
&lt;td&gt;Medium — can cause duplicate processing&lt;/td&gt;
&lt;td&gt;0.80&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;==The principle: &lt;strong&gt;confidence boundaries should be inversely proportional to blast radius&lt;/strong&gt;.== Low-risk changes get auto-resolved aggressively. High-risk changes require near-certainty or human review.&lt;/p&gt;
&lt;p&gt;In practice, I recommend starting with all boundaries at 0.90 and tuning down as you build confidence in each agent&apos;s decision quality. Track the &lt;em&gt;false positive rate&lt;/em&gt; (agent escalated unnecessarily) and &lt;em&gt;false negative rate&lt;/em&gt; (agent auto-applied a bad fix) separately. You want the false negative rate near zero, even at the cost of a higher false positive rate.&lt;/p&gt;
&lt;h2&gt;The Voice AI Case: Real-Time Pressure Testing&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/orchestrating-agents/voice-ai-command-center.png&quot; alt=&quot;Voice AI command center — parallel audio waveform streams monitored by AI agents with anomaly rerouting&quot; /&gt;&lt;/p&gt;
&lt;p&gt;These patterns become essential when latency constraints tighten. At &lt;a href=&quot;https://autoscreen.ai&quot;&gt;autoscreen.ai&lt;/a&gt;, we process real-time voice AI event streams where pipeline failures can&apos;t wait for human intervention — calls are in flight, and degraded data means degraded user experience within seconds.&lt;/p&gt;
&lt;p&gt;Each call produces four concurrent pipeline stages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Audio chunks&lt;/strong&gt; streamed to a transcription service&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Transcription events&lt;/strong&gt; fed to a scoring model&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scoring results&lt;/strong&gt; written to an analytics database&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Session metadata&lt;/strong&gt; aggregated into dashboards&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is four pipeline stages running concurrently per call, multiplied by thousands of simultaneous calls. Traditional ETL falls apart here. You can&apos;t batch-process audio in hourly Airflow runs. You can&apos;t wait for a human to resolve a schema mismatch when 500 calls are in flight.&lt;/p&gt;
&lt;p&gt;The agent-orchestrated approach:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;┌─────────────┐     ┌──────────────┐     ┌───────────────┐     ┌──────────────┐
│  Ingestion   │────▶│ Transcription │────▶│  Scoring       │────▶│  Analytics   │
│  Agent       │     │  Agent        │     │  Agent         │     │  Agent       │
└──────┬───────┘     └──────┬────────┘     └───────┬────────┘     └──────┬───────┘
       │                    │                      │                     │
       ▼                    ▼                      ▼                     ▼
  ┌──────────┐        ┌──────────┐           ┌──────────┐          ┌──────────┐
  │ Quality  │        │ Quality  │           │ Quality  │          │ Quality  │
  │ Guard    │        │ Guard    │           │ Guard    │          │ Guard    │
  └────┬─────┘        └────┬─────┘           └────┬─────┘          └────┬─────┘
       │                   │                      │                     │
       └───────────────────┴──────────┬───────────┴─────────────────────┘
                                      ▼
                              ┌──────────────┐
                              │  Escalation  │
                              │  Agent       │
                              │ (human-in-   │
                              │  the-loop)   │
                              └──────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each stage has its own quality guardrail tuned to its domain. The ingestion agent validates audio format and session metadata. The transcription agent checks for hallucination patterns — a real problem with streaming ASR where models occasionally confabulate words during silence or crosstalk. The scoring agent validates that output distributions haven&apos;t drifted. The analytics agent ensures aggregations are consistent.&lt;/p&gt;
&lt;p&gt;When the transcription agent detects a new audio codec it hasn&apos;t seen before, it doesn&apos;t crash — it hands off to a codec resolution agent, logs the decision, and continues processing other calls while the issue is resolved.&lt;/p&gt;
&lt;p&gt;This is what &quot;self-healing&quot; actually means in production. Not magic. Not AGI. Just agents with clear responsibilities, calibrated confidence boundaries, and the ability to escalate.&lt;/p&gt;
&lt;h2&gt;Where This Breaks&lt;/h2&gt;
&lt;p&gt;Every architecture post should include a section on when the approach fails. Agent-based pipelines are no exception.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;LLM latency in the hot path.&lt;/strong&gt; An LLM call adds 200-2000ms of latency per decision. For batch pipelines processing hourly, this is negligible. For real-time streams requiring sub-100ms latency, it&apos;s a dealbreaker. The mitigation: use LLM agents for &lt;em&gt;policy decisions&lt;/em&gt; (schema evolution, anomaly classification) and deterministic code for &lt;em&gt;per-record validation&lt;/em&gt; (null checks, type coercion). Don&apos;t put an LLM call in a tight loop processing millions of records.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hallucination in remediation.&lt;/strong&gt; When a schema agent generates transformation SQL, it can hallucinate column names or invent joins that don&apos;t exist. &lt;a href=&quot;https://arxiv.org/html/2509.18970v1&quot;&gt;Recent research on agent hallucination&lt;/a&gt; shows that hallucinations in agent systems manifest not just as linguistic errors but as fabricated behaviors at every pipeline stage. The mitigation: always validate generated SQL against the actual schema before execution, run remediation on a sample batch first, and enforce the confidence boundary pattern described above.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cost at scale.&lt;/strong&gt; Each agent decision involves an LLM API call. At 67 incidents per month (Monte Carlo&apos;s average), this is cheap. At 67 incidents per &lt;em&gt;hour&lt;/em&gt; in a high-volume pipeline, costs compound quickly. The mitigation: cache agent decisions for recurring patterns. If the agent has seen the same schema drift three times, apply the cached remediation without an LLM call.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Complexity ceiling.&lt;/strong&gt; A network of six agents with handoffs between them is understandable. A network of sixty is not. Agent systems can become as opaque as the pipelines they&apos;re meant to improve. The mitigation: keep agent networks flat and shallow. A triage agent plus 3-5 specialist agents per pipeline boundary is the sweet spot. If you need more, you probably need to decompose the pipeline itself.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The accountability gap.&lt;/strong&gt; When an agent auto-remediates a schema change and that remediation introduces a subtle data quality issue three stages downstream, who&apos;s responsible? The agent&apos;s tracing log shows &lt;em&gt;what&lt;/em&gt; happened, but organizational accountability is still a human problem. Every auto-remediation should be logged, reviewable, and reversible.&lt;/p&gt;
&lt;h2&gt;Lessons From Both Ends of the Scale&lt;/h2&gt;
&lt;p&gt;I&apos;ve built data pipelines at both extremes: Meta-scale batch processing (10TB+/day, Spark, Airflow) and real-time voice AI event streams (sub-500ms latency, LiveKit, streaming transcription). A few hard-won lessons:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Start with the contracts, not the agents.&lt;/strong&gt; Define what &quot;good data&quot; looks like at each pipeline boundary before you make anything agentic. The agent layer is useless without clear success criteria. Data contracts — explicit schemas, SLOs on freshness and completeness, documented ownership — are the foundation. Research on &lt;a href=&quot;https://datalakehousehub.com/blog/2026-02-de-best-practices-05-schema-evolution/&quot;&gt;schema evolution in data meshes&lt;/a&gt; shows that automated schema evolution reduces deployment time by 73% and error rates by 89%, but only when contracts are well-defined first.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Guardrails beat intelligence.&lt;/strong&gt; A simple statistical check that catches distribution drift will save you more pain than a sophisticated LLM-powered anomaly detector. ==Use agents for the &lt;em&gt;decisions&lt;/em&gt; (what to do when something breaks), not for the &lt;em&gt;detection&lt;/em&gt; (noticing that something broke).== Detection should be fast, deterministic, and cheap. Decision-making is where LLM reasoning adds genuine value.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tracing is non-negotiable.&lt;/strong&gt; In a multi-agent pipeline, you need to know exactly which agent made which decision and why. OpenAI&apos;s SDK has this &lt;a href=&quot;https://openai.github.io/openai-agents-python/&quot;&gt;built in&lt;/a&gt;. If you&apos;re building your own agent layer, invest in tracing before you invest in features. Meta&apos;s recent work on &lt;a href=&quot;https://engineering.fb.com/2026/04/06/developer-tools/how-meta-used-ai-to-map-tribal-knowledge-in-large-scale-data-pipelines/&quot;&gt;AI-assisted pipeline knowledge mapping&lt;/a&gt; used 50+ specialized agents with multi-round critic passes — that system would be completely undebuggable without comprehensive tracing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The handoff pattern is the whole game.&lt;/strong&gt; The difference between a brittle pipeline and a resilient one is what happens when something unexpected occurs. Static DAGs crash. Agent networks route around the problem. But handoffs must be &lt;em&gt;bounded&lt;/em&gt; — an agent should never hand off to more than 2-3 specialists, and handoff chains should never exceed 3 hops. Unbounded handoffs create circular delegation where agents pass problems to each other indefinitely.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Measure the gap, not the uptime.&lt;/strong&gt; Pipeline uptime is a vanity metric. What matters is the Detection-Resolution Gap — the time between data going bad and data being fixed. Track this per pipeline stage. Organizations implementing automated recovery see a &lt;a href=&quot;https://www.integrate.io/blog/etl-error-handling-and-monitoring-metrics/&quot;&gt;76% reduction in MTTR&lt;/a&gt; and a 94% reduction in manual intervention. That&apos;s the metric that tells you whether your agents are actually working.&lt;/p&gt;
&lt;h2&gt;What&apos;s Next&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/orchestrating-agents/future-brain-infrastructure.png&quot; alt=&quot;Data infrastructure getting a brain — a neural network brain floating above a futuristic data center&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The trajectory is clear: data infrastructure is getting a decision layer. Not because LLMs are good at processing data — they&apos;re not; Spark is still better at 10TB joins, and will be for the foreseeable future — but because LLMs are good at &lt;em&gt;deciding what to do when the data is unexpected&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;The next generation of data platforms won&apos;t just execute pipelines. They&apos;ll maintain contracts, detect violations, and take calibrated corrective action — all while maintaining the throughput and reliability that production systems demand. The tools are maturing: OpenAI&apos;s &lt;a href=&quot;https://openai.github.io/openai-agents-python/&quot;&gt;Agents SDK&lt;/a&gt; and &lt;a href=&quot;https://openai.com/index/introducing-agentkit/&quot;&gt;AgentKit&lt;/a&gt; provide the orchestration primitives; frameworks like &lt;a href=&quot;https://docs.anthropic.com/en/docs/agents-sdk&quot;&gt;Anthropic&apos;s agent patterns&lt;/a&gt;, LangGraph, and CrewAI offer alternative approaches; and the data engineering ecosystem is converging on &lt;a href=&quot;https://www.datacontract.com/&quot;&gt;data contracts&lt;/a&gt; as the foundational abstraction.&lt;/p&gt;
&lt;p&gt;The convergence I&apos;m watching: &lt;strong&gt;data contracts + agent intelligence + confidence boundaries&lt;/strong&gt;. When your pipeline stages declare typed contracts, agents can verify compliance at runtime and auto-remediate violations within calibrated confidence bounds. The contract tells the agent what &quot;correct&quot; looks like. The confidence boundary tells it when to act versus when to ask.&lt;/p&gt;
&lt;p&gt;==We&apos;re not replacing data engineers. We&apos;re closing the Detection-Resolution Gap — and finally giving data infrastructure the judgment layer it has always needed.==&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;Sharad Jain is a data engineer and AI systems builder based in Bengaluru. He previously built data pipelines at Meta (10TB+/day) and founded &lt;a href=&quot;http://autoscreen.ai&quot;&gt;autoscreen.ai&lt;/a&gt;, a production voice AI platform. He writes about the intersection of agentic AI and data infrastructure at &lt;a href=&quot;https://sharadja.in&quot;&gt;sharadja.in&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
</content:encoded><enclosure url="https://sharadja.in/images/blogs/orchestrating-agents/hero-pipes-that-think.png" type="image/png" length="0"/><category>AI</category><category>agents</category><category>data-engineering</category><category>data-pipelines</category><category>OpenAI</category><category>orchestration</category><category>voice-AI</category><author>Sharad Jain</author></item><item><title>Building an AI Hedge Fund: What Multi-Agent Trading Teaches About Agent Architecture</title><link>https://sharadja.in/blog/building-ai-hedge-fund-multi-agent-trading/</link><guid isPermaLink="true">https://sharadja.in/blog/building-ai-hedge-fund-multi-agent-trading/</guid><description>A multi-agent trading system isn&apos;t about beating the market — it&apos;s about what happens when specialized agents must coordinate under uncertainty. The patterns are universal.</description><pubDate>Wed, 15 Jan 2025 09:30:00 GMT</pubDate><content:encoded>&lt;h1&gt;What Multi-Agent Trading Teaches About Agent Architecture&lt;/h1&gt;
&lt;p&gt;What happens when you treat a trading desk like a software architecture problem?&lt;/p&gt;
&lt;p&gt;You give one agent the price charts. Another gets the balance sheets. A third reads the news. A fourth calculates intrinsic value. Then you put a risk manager between them and the portfolio, and a portfolio manager who must synthesize their conflicting recommendations into a single decision: buy, sell, or hold.&lt;/p&gt;
&lt;p&gt;The result isn&apos;t a hedge fund. It&apos;s an architecture case study — and the patterns it reveals apply far beyond finance. The trading domain just happens to be uniquely good at exposing them, because markets are adversarial, data is noisy, feedback is immediate, and overconfidence is punished in dollars.&lt;/p&gt;
&lt;p&gt;I&apos;ve been building with &lt;a href=&quot;https://github.com/virattt/ai-hedge-fund&quot;&gt;virattt&apos;s ai-hedge-fund&lt;/a&gt; — an open-source project (55,000+ stars) that orchestrates six LLM-powered agents through LangGraph to simulate trading decisions. The system is educational, not production. But the architectural patterns it surfaces are real, and they&apos;ve changed how I think about multi-agent coordination in any domain.&lt;/p&gt;
&lt;h2&gt;The Architecture of Disagreement&lt;/h2&gt;
&lt;p&gt;Traditional quantitative trading systems optimize for consensus. One model ingests data, produces a signal, and trades on it. If the signal is wrong, the model is wrong. There&apos;s no second opinion.&lt;/p&gt;
&lt;p&gt;Multi-agent trading systems optimize for something different: &lt;strong&gt;productive disagreement&lt;/strong&gt;. Multiple agents with different analytical lenses examine the same asset and reach different conclusions. ==The value isn&apos;t in any single agent&apos;s analysis — it&apos;s in the tension between agents that hold fundamentally different views of the same reality.==&lt;/p&gt;
&lt;p&gt;This is a design choice, not an accident. Here&apos;s the architecture:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;                    ┌──────────────┐
                    │  Market Data │
                    └──────┬───────┘
                           │
            ┌──────────────┼──────────────┐
            │              │              │
     ┌──────▼──────┐ ┌────▼─────┐ ┌──────▼──────┐
     │  Technical   │ │ Fundamen │ │  Sentiment  │
     │  Analyst     │ │ -tals    │ │  Analyst    │
     │              │ │ Analyst  │ │             │
     │ Price action │ │ Finan-   │ │ News, crowd │
     │ patterns,    │ │ cials,   │ │ psychology, │
     │ momentum     │ │ ratios   │ │ insider     │
     └──────┬───────┘ └────┬─────┘ └──────┬──────┘
            │              │              │
            │       ┌──────▼──────┐       │
            │       │  Valuation  │       │
            │       │  Agent      │       │
            │       │ DCF, owner  │       │
            │       │ earnings    │       │
            │       └──────┬──────┘       │
            │              │              │
            └──────────────┼──────────────┘
                           │
                    ┌──────▼───────┐
                    │ Risk Manager │
                    │ Position     │
                    │ sizing,      │
                    │ exposure     │
                    │ limits       │
                    └──────┬───────┘
                           │
                    ┌──────▼───────┐
                    │  Portfolio   │
                    │  Manager     │
                    │ Final        │
                    │ decision     │
                    └──────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The key structural insight: the analysis agents operate in parallel (they don&apos;t see each other&apos;s output), while the risk and portfolio managers operate in series (they see everything). This creates information asymmetry by design — each analyst commits to a view without being anchored by the others, and the portfolio manager must reconcile genuinely independent perspectives.&lt;/p&gt;
&lt;p&gt;Compare this to a single-agent system where one LLM receives all the data simultaneously. That agent will anchor on the first strong signal it encounters, weigh all evidence through that anchor, and produce a coherent-sounding but potentially biased recommendation. The multi-agent architecture avoids this by making anchoring structurally impossible.&lt;/p&gt;
&lt;h2&gt;Six Agents, Six Blind Spots&lt;/h2&gt;
&lt;p&gt;Each agent in the system has a specific analytical lens — and a specific blind spot. Understanding the blind spots matters more than understanding the capabilities.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Technical Analyst&lt;/strong&gt; — Reads price action, volume, and momentum indicators (moving averages, RSI, Bollinger Bands, ADX, ATR). The efficient market hypothesis says this shouldn&apos;t work: if price patterns contained predictive information, arbitrageurs would trade it away. It works anyway because markets aren&apos;t perfectly efficient, and behavioral patterns repeat. &lt;strong&gt;Blind spot:&lt;/strong&gt; completely ignores why a company is valued the way it is. A technically bullish stock can be a fundamentally bankrupt company.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fundamentals Analyst&lt;/strong&gt; — Evaluates profitability (ROE, margins), growth (revenue, earnings, book value), financial health (debt ratios, cash flow), and valuation ratios (P/E, P/B, P/S). This is Benjamin Graham-style analysis, the intellectual foundation of value investing. &lt;strong&gt;Blind spot:&lt;/strong&gt; timing. A fundamentals analyst can be &quot;right&quot; about a stock&apos;s intrinsic value for years before the market agrees. As Keynes observed, &quot;the market can stay irrational longer than you can stay solvent.&quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sentiment Analyst&lt;/strong&gt; — Processes news sentiment via NLP, tracks insider trading patterns, monitors social media signals, and attempts to gauge market psychology. &lt;strong&gt;Blind spot:&lt;/strong&gt; can&apos;t distinguish between rational crowd wisdom and irrational crowd panic. Sentiment was overwhelmingly positive for WeWork before its implosion. It was overwhelmingly negative for Tesla before its 10x run.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Valuation Agent&lt;/strong&gt; — Performs discounted cash flow (DCF) analysis, calculates owner earnings, and compares intrinsic value to market price to identify mispricings. &lt;strong&gt;Blind spot:&lt;/strong&gt; DCF models are exquisitely sensitive to assumptions about discount rates and terminal growth. Change the discount rate by 1% and the fair value changes by 20-40%. The model produces precise numbers from imprecise inputs — a classic case of false precision.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Risk Manager&lt;/strong&gt; — Implements position sizing rules, monitors portfolio exposure, sets risk limits, and manages drawdown protection. This is the gatekeeper that prevents any single agent&apos;s conviction from destroying the portfolio. &lt;strong&gt;Blind spot:&lt;/strong&gt; can&apos;t protect against black swan events — risks that are, by definition, outside the model&apos;s distribution. Every risk model works until the scenario it wasn&apos;t designed for arrives.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Portfolio Manager&lt;/strong&gt; — Consolidates all agent signals, makes final trading decisions, and executes recommendations. It must weigh contradictory inputs and produce a single actionable decision. &lt;strong&gt;Blind spot:&lt;/strong&gt; when all agents are collectively wrong in the same direction — a scenario that happens more often than the architecture suggests, because all agents share the same underlying data and the same LLM reasoning patterns.&lt;/p&gt;
&lt;p&gt;==The multi-agent design principle here: &lt;strong&gt;non-overlapping blind spots&lt;/strong&gt;. No single agent can see the full picture, but their blind spots are different.== The technical analyst&apos;s blind spot (fundamentals) is the fundamental analyst&apos;s strength. The sentiment analyst&apos;s blind spot (rationality assessment) is partially covered by the valuation agent&apos;s quantitative grounding. The architecture works when each agent&apos;s weakness is another agent&apos;s strength.&lt;/p&gt;
&lt;h2&gt;The Signal Reconciliation Problem&lt;/h2&gt;
&lt;p&gt;The hardest engineering problem in a multi-agent trading system isn&apos;t building the individual agents — it&apos;s deciding what to do when they disagree.&lt;/p&gt;
&lt;p&gt;Consider a concrete scenario: NVIDIA, January 2025. The technical analyst sees a bullish momentum pattern (signal: BUY, confidence: 0.82). The fundamentals analyst notes a P/E ratio of 65x, well above historical norms (signal: HOLD, confidence: 0.71). The sentiment analyst detects overwhelmingly positive AI hype (signal: BUY, confidence: 0.88). The valuation agent&apos;s DCF model suggests the stock is overvalued by 30% (signal: SELL, confidence: 0.64).&lt;/p&gt;
&lt;p&gt;Three approaches to reconciliation:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Democratic (majority vote):&lt;/strong&gt; Two BUYs, one HOLD, one SELL → BUY. Simple but ignores the magnitude of disagreement. The valuation agent&apos;s SELL signal, even as a minority, might be the most important signal in the ensemble.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Meritocratic (confidence-weighted):&lt;/strong&gt; Weight each signal by the agent&apos;s stated confidence. BUY: 0.82 + 0.88 = 1.70. HOLD: 0.71. SELL: 0.64. → Strong BUY. But this assumes confidence scores are well-calibrated — a dangerous assumption for LLMs. &lt;a href=&quot;https://arxiv.org/abs/2512.16030&quot;&gt;KalshiBench&lt;/a&gt; (2024) tested frontier models on prediction market questions with verifiable outcomes and found &lt;strong&gt;all models exhibit systematic overconfidence&lt;/strong&gt;, with a 12-percentage-point average gap between stated confidence and actual accuracy. Even the best model (Claude Opus 4.5, ECE: 0.120) fell well short of human superforecasters (ECE: 0.03-0.05). Reasoning-enhanced models actually had &lt;em&gt;worse&lt;/em&gt; calibration. An LLM that says &quot;0.88 confidence&quot; doesn&apos;t mean there&apos;s an 88% probability of being correct.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Risk-adjusted (conservative):&lt;/strong&gt; Any SELL signal from any agent triggers caution. The portfolio manager reduces position size proportional to the disagreement magnitude. → Small BUY with tight stop-loss. This is the approach most production trading systems use because it prioritizes capital preservation over return maximization.&lt;/p&gt;
&lt;p&gt;==I call this the &lt;strong&gt;Signal Reconciliation Problem&lt;/strong&gt;: the challenge of combining conflicting agent outputs into a coherent action when the agents themselves can&apos;t assess the reliability of their own predictions.== It&apos;s the multi-agent equivalent of the &lt;a href=&quot;https://en.wikipedia.org/wiki/Condorcet%27s_jury_theorem&quot;&gt;jury theorem&lt;/a&gt; — ensembles of independent judges outperform individuals, but only when each judge is better than random and their errors are uncorrelated. If the judges share systematic biases, the ensemble amplifies those biases.&lt;/p&gt;
&lt;p&gt;In an LLM-based trading system, this condition is systematically violated. All agents share the same underlying model, the same training data biases, and the same tendency toward confident-sounding narratives. Their &quot;independence&quot; is structural (different prompts, different data inputs) but not epistemic (same reasoning patterns, same blind spots in numerical reasoning).&lt;/p&gt;
&lt;h2&gt;What Actually Happens&lt;/h2&gt;
&lt;p&gt;Let me be direct about what this system is and isn&apos;t.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/virattt/ai-hedge-fund&quot;&gt;ai-hedge-fund&lt;/a&gt; project is an educational simulation. It uses real market data (via Financial Datasets API) and real LLM reasoning (via OpenAI/Anthropic), but it doesn&apos;t execute real trades. It&apos;s designed to demonstrate multi-agent architecture patterns, not to make money.&lt;/p&gt;
&lt;p&gt;And that&apos;s the right framing, because LLM-based trading has fundamental limitations that no amount of architectural cleverness can fix:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;LLMs can&apos;t do reliable numerical reasoning.&lt;/strong&gt; When a fundamentals agent calculates a P/E ratio or a DCF valuation, it&apos;s generating a plausible-looking number through pattern matching, not computing it from first principles. On the &lt;a href=&quot;https://arxiv.org/abs/2311.11944&quot;&gt;FinanceBench&lt;/a&gt; benchmark, GPT-4 Turbo with retrieval incorrectly answered or refused to answer &lt;strong&gt;81% of questions&lt;/strong&gt; on real financial QA from public filings. An &lt;a href=&quot;https://arxiv.org/abs/2402.11194&quot;&gt;ACL 2024 analysis&lt;/a&gt; broke down errors: 25% wrong evidence, 25% insufficient domain knowledge, 24% pure calculation errors. The mitigation (used in the project) is to compute all numerical values deterministically in code and pass only the results to the LLM for interpretation. But this means the LLM&apos;s &quot;analysis&quot; is more like narration over pre-computed numbers — useful for synthesis, not for discovery.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Backtesting is seductive and dangerous.&lt;/strong&gt; The system includes backtesting capabilities, and it&apos;s tempting to optimize until the backtest looks good. But backtesting has well-documented failure modes: overfitting (the strategy fits historical noise, not signal), look-ahead bias (using information that wouldn&apos;t have been available at trade time), and survivorship bias (testing only on stocks that still exist). Bailey et al.&apos;s work on &lt;a href=&quot;https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2326253&quot;&gt;backtest overfitting&lt;/a&gt; showed that with just 5 years of data, trying more than 45 independent model configurations virtually guarantees a strategy with a backtested Sharpe ratio of 1.0 but an expected out-of-sample Sharpe of zero. The industry standard is a &lt;strong&gt;50% haircut&lt;/strong&gt; on backtested Sharpe ratios to estimate live performance.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The results speak for themselves.&lt;/strong&gt; &lt;a href=&quot;https://arxiv.org/abs/2510.02209&quot;&gt;StockBench&lt;/a&gt; (2025) tested 14 frontier LLMs — including GPT-5 and Claude 4 Sonnet — on actual trading over 82 days. The equal-weight buy-and-hold baseline returned 0.4%. GPT-5 returned &lt;strong&gt;0.3%&lt;/strong&gt; — it underperformed doing nothing. The best LLM (Qwen3-235B) managed 2.4%. For comparison, Renaissance Technologies&apos; Medallion Fund averaged &lt;strong&gt;66% annually&lt;/strong&gt; over three decades — using statistical arbitrage, not LLMs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Agent agreement can mask collective delusion.&lt;/strong&gt; When all six agents agree on a BUY signal, it feels like strong consensus. But if the underlying data is misleading (accounting fraud, market manipulation, structural regime change), all agents will be wrong simultaneously. The architecture provides no protection against errors in the shared data layer. This is the multi-agent version of the &quot;garbage in, garbage out&quot; problem, made more dangerous by the false sense of security that consensus provides.&lt;/p&gt;
&lt;h2&gt;Where This Breaks&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Cost per decision.&lt;/strong&gt; Six LLM agents analyzing a single stock requires 6+ API calls, each processing substantial context (financial data, historical analysis, agent instructions). At GPT-4o pricing, a single stock analysis costs $0.50-2.00. Across a 50-stock universe with daily rebalancing, that&apos;s $25-100/day — $9,000-36,000/year — just for the LLM inference. Traditional quant models running on local compute cost a fraction of this.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Latency.&lt;/strong&gt; Each agent takes 3-10 seconds for analysis. Even with parallel execution, the pipeline takes 15-30 seconds per stock. In markets where high-frequency traders operate on microsecond timescales, this is an eternity. LLM-based trading is structurally unsuitable for any strategy requiring speed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Regulatory risk is real and growing.&lt;/strong&gt; The SEC brought its &lt;a href=&quot;https://www.sec.gov/newsroom/press-releases/2024-36&quot;&gt;first AI-washing enforcement actions&lt;/a&gt; in 2024, fining two firms a combined $400,000 for falsely claiming AI capabilities they didn&apos;t have. Their 2026 examination priorities explicitly target &quot;automated investment tools, algorithmic models, and AI-based systems.&quot; Firms cannot outsource accountability to AI systems — human oversight is required for any material trading decision. Multi-agent systems make accountability harder, not easier, because the decision emerges from agent interaction rather than a single traceable computation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The stationarity assumption.&lt;/strong&gt; Every agent&apos;s analysis assumes that patterns observed in historical data will continue into the future. This is true most of the time — until it isn&apos;t. Regime changes (COVID crash, rate hike cycles, geopolitical shocks) invalidate historical patterns precisely when accurate prediction matters most. No amount of multi-agent coordination helps when the underlying data distribution shifts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;LLM-specific risks.&lt;/strong&gt; The agents inherit all standard LLM failure modes: hallucination (fabricating financial metrics), sycophancy (agreeing with the user&apos;s implied thesis), recency bias (overweighting recent data that appeared in training), and narrative bias (constructing compelling stories that aren&apos;t supported by evidence). In a domain where being wrong costs real money, these failure modes aren&apos;t abstract concerns — they&apos;re direct financial risks.&lt;/p&gt;
&lt;h2&gt;What This Teaches About Agent Architecture&lt;/h2&gt;
&lt;p&gt;The most valuable thing about building a multi-agent trading system isn&apos;t the trading system. It&apos;s what the exercise reveals about multi-agent coordination in general.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Productive disagreement is a feature, not a bug.&lt;/strong&gt; When agents disagree, the system is working. The dangerous scenario is when they agree — because agreement might indicate genuine consensus or shared blindness, and the architecture can&apos;t distinguish between the two. Design for disagreement. Build reconciliation protocols. Treat unanimous agreement with suspicion.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Blind spot mapping beats capability stacking.&lt;/strong&gt; Adding more agents doesn&apos;t make the system smarter. Adding agents with &lt;em&gt;non-overlapping blind spots&lt;/em&gt; does. Before building an agent, ask: &quot;What can this agent NOT see? Does any existing agent cover that gap?&quot; If the answer is no, the new agent adds value. If yes, it adds noise.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Confidence calibration is the hardest unsolved problem.&lt;/strong&gt; ==Every reconciliation strategy depends on knowing how much to trust each agent&apos;s output. LLMs are terrible at this — they sound confident when wrong and uncertain when right.== Until confidence calibration improves, any multi-agent system that weighs signals by confidence is building on sand.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Deterministic computation should stay deterministic.&lt;/strong&gt; ==The most reliable parts of the trading system are the parts that don&apos;t use LLMs: the financial data retrieval, the ratio calculations, the position sizing math.== Use LLMs for what they&apos;re good at (synthesis, pattern recognition in unstructured data, narrative interpretation) and code for what code is good at (math, data retrieval, rule enforcement). This is Garry Tan&apos;s deterministic layer principle applied to finance.&lt;/p&gt;
&lt;p&gt;These patterns — productive disagreement, blind spot mapping, confidence calibration, deterministic separation — apply anywhere agents must coordinate under uncertainty: medical diagnosis systems, content moderation pipelines, autonomous vehicle decision-making, security threat assessment. Finance is just the domain where the feedback loop is fastest and the tolerance for error is lowest.&lt;/p&gt;
&lt;p&gt;==The trading system won&apos;t make you rich. But the architecture might make your next multi-agent system significantly more robust.==&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;Sharad Jain builds agentic AI systems in Bengaluru. He previously worked on data infrastructure at Meta and founded &lt;a href=&quot;http://autoscreen.ai&quot;&gt;autoscreen.ai&lt;/a&gt;, a production voice AI platform. He writes about agent architecture at &lt;a href=&quot;https://sharadja.in&quot;&gt;sharadja.in&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Further reading:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/virattt/ai-hedge-fund&quot;&gt;ai-hedge-fund&lt;/a&gt; — the open-source multi-agent trading system (55,000+ stars)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2510.02209&quot;&gt;StockBench&lt;/a&gt; — 82-day benchmark of 14 frontier LLMs on actual trading (2025)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2311.11944&quot;&gt;FinanceBench&lt;/a&gt; — GPT-4 fails 81% of financial QA tasks&lt;/li&gt;
&lt;li&gt;Bailey et al., &lt;a href=&quot;https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2326253&quot;&gt;&quot;Probability of Backtest Overfitting&quot;&lt;/a&gt; — why backtested Sharpe ratios need a 50% haircut&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2512.16030&quot;&gt;KalshiBench&lt;/a&gt; — LLM confidence calibration on real prediction markets&lt;/li&gt;
&lt;li&gt;Anthropic, &lt;a href=&quot;https://www.anthropic.com/engineering/building-effective-agents&quot;&gt;&quot;Building Effective Agents&quot;&lt;/a&gt; — agent workflow patterns&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><enclosure url="https://sharadja.in/agentic_workflows.webp" type="image/png" length="0"/><category>AI</category><category>finance</category><category>trading</category><category>multi-agent</category><category>LLM</category><category>agents</category><category>architecture</category><category>risk-management</category><author>Sharad Jain</author></item><item><title>Large Concept Models: What Happens When AI Stops Thinking in Words</title><link>https://sharadja.in/blog/large-concept-models-beyond-language/</link><guid isPermaLink="true">https://sharadja.in/blog/large-concept-models-beyond-language/</guid><description>Meta&apos;s Large Concept Models reason over sentence-level embeddings instead of tokens — a fundamentally different architecture that trades autoregressive generation for concept-space planning.</description><pubDate>Sat, 11 Jan 2025 09:30:00 GMT</pubDate><content:encoded>&lt;h1&gt;What Happens When AI Stops Thinking in Words&lt;/h1&gt;
&lt;p&gt;Every large language model you&apos;ve used — GPT-4, Claude, Gemini, LLaMA — thinks in tokens. Fragments of words. Byte-pair encodings that split &quot;understanding&quot; into &quot;under&quot; + &quot;stand&quot; + &quot;ing&quot; and process each piece through the same attention mechanism, one at a time, left to right.&lt;/p&gt;
&lt;p&gt;This works remarkably well. ==But it imposes a constraint so fundamental that most people don&apos;t notice it: &lt;strong&gt;the model&apos;s unit of reasoning is smaller than the model&apos;s unit of meaning.&lt;/strong&gt;== You think in ideas. The model thinks in syllables. Every concept must be assembled from parts, token by token, with no guarantee that the whole adds up to what you intended.&lt;/p&gt;
&lt;p&gt;In December 2024, Meta AI published a paper that asks the obvious question: what if we just skipped the tokens entirely?&lt;/p&gt;
&lt;p&gt;Their answer is the &lt;strong&gt;Large Concept Model&lt;/strong&gt; — an architecture that reasons over sentence-level semantic embeddings rather than individual tokens. ==Instead of predicting the next word, it predicts the next &lt;em&gt;idea&lt;/em&gt;.== The results are surprising, the limitations are instructive, and the implications are worth thinking through carefully.&lt;/p&gt;
&lt;h2&gt;The Token Bottleneck&lt;/h2&gt;
&lt;p&gt;To understand why LCMs matter, you need to understand what token-level processing actually costs you.&lt;/p&gt;
&lt;p&gt;When a transformer processes text, it converts each token into a high-dimensional vector and computes attention across all tokens in the context. The computational cost scales quadratically with sequence length — O(n²) for n tokens. A 1,000-word essay is roughly 1,300 tokens. A 10,000-word document is 13,000 tokens. The attention mechanism must compute 169 million pairwise relationships for the longer document, compared to 1.7 million for the shorter one.&lt;/p&gt;
&lt;p&gt;But here&apos;s the deeper issue: those 13,000 tokens don&apos;t represent 13,000 ideas. They represent maybe 200-400 distinct concepts, encoded redundantly across thousands of subword fragments. The model is spending most of its computation processing the &lt;em&gt;syntax and surface form&lt;/em&gt; of language rather than its &lt;em&gt;meaning&lt;/em&gt;. It&apos;s like trying to understand a symphony by analyzing individual sound waves instead of listening to the melody.&lt;/p&gt;
&lt;p&gt;I call this the &lt;strong&gt;granularity mismatch&lt;/strong&gt;: the gap between the level at which humans reason about text (sentences, paragraphs, arguments) and the level at which transformers process it (tokens, subwords, characters). Every token-level model pays this tax. The question is whether there&apos;s a way around it.&lt;/p&gt;
&lt;h2&gt;How Large Concept Models Work&lt;/h2&gt;
&lt;p&gt;Meta&apos;s LCM paper, &lt;a href=&quot;https://arxiv.org/abs/2412.08821&quot;&gt;&quot;Large Concept Models: Language Modeling in a Sentence Representation Space&quot;&lt;/a&gt; (December 2024), proposes a direct solution: move the entire reasoning process into a semantic embedding space where each point represents a complete sentence.&lt;/p&gt;
&lt;p&gt;The architecture has three stages:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;┌─────────────────────────────────────────────────────┐
│  1. ENCODE: Text → Sentence Embeddings              │
│     &quot;The market crashed&quot; → vector [0.23, -0.71, ...] │
│     Using SONAR encoder (1,024-dim, 200+ languages)  │
├─────────────────────────────────────────────────────┤
│  2. REASON: Embedding → Embedding                    │
│     Predict next concept from sequence of concepts   │
│     Three approaches: Base / Diffusion / Quantized   │
├─────────────────────────────────────────────────────┤
│  3. DECODE: Embedding → Text                         │
│     vector [0.41, 0.18, ...] → &quot;Investors panicked&quot;  │
│     Using SONAR decoder (any target language)         │
└─────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Stage 1&lt;/strong&gt; uses &lt;a href=&quot;https://github.com/facebookresearch/SONAR&quot;&gt;SONAR&lt;/a&gt;, Meta&apos;s multilingual sentence embedding model that maps sentences from over 200 languages into a shared 1,024-dimensional vector space. The key property: sentences with similar meanings land near each other in this space, regardless of the language they were written in. &quot;The sky is blue&quot; in English, Japanese, and Arabic all map to approximately the same point.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Stage 2&lt;/strong&gt; is where the LCM does its actual reasoning. Instead of predicting the next token, it predicts the next sentence embedding — a complete concept — given the sequence of previous concepts. This is fundamentally different from autoregressive token generation. The model operates on a sequence of maybe 30-50 concepts instead of 1,300 tokens, with each concept carrying the full semantic weight of an entire sentence.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Stage 3&lt;/strong&gt; decodes the predicted embedding back into natural language text using SONAR&apos;s decoder. ==Because the embedding space is language-agnostic, the same concept can be decoded into any of the 200+ supported languages — without the model ever being trained on translation explicitly.==&lt;/p&gt;
&lt;p&gt;The researchers explored three architectures for Stage 2:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Base LCM&lt;/strong&gt; — a standard autoregressive transformer operating on continuous embeddings. It predicts the next concept vector by attending to all previous concepts. Simple but effective for short sequences.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Diffusion-based LCM&lt;/strong&gt; — borrows the iterative refinement process from image generation models. Instead of predicting the next concept in one shot, it starts with noise and gradually denoises it into a coherent concept embedding. This captures richer distributions over possible next concepts, which matters when the continuation is genuinely ambiguous.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Quantized LCM (QLCM)&lt;/strong&gt; — discretizes the continuous embedding space into a vocabulary of concept tokens using a residual vector quantizer. Each sentence becomes a short sequence of discrete codes (typically 8 codes per sentence), and the model operates on these codes using standard sequence-to-sequence methods. This bridges the LCM approach with conventional language modeling techniques.&lt;/p&gt;
&lt;h2&gt;The Three Levels of Abstraction&lt;/h2&gt;
&lt;p&gt;To place LCMs in the broader landscape, it helps to think about language models as operating at different levels of abstraction:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;┌─────────────────────────────────────┐
│  Level 3: Concept-Level (LCMs)      │
│  Unit: sentence embedding           │
│  Sequence: ~30-50 concepts          │
│  Captures: meaning, intent, ideas   │
├─────────────────────────────────────┤
│  Level 2: Token-Level (LLMs)        │
│  Unit: subword token                │
│  Sequence: ~1,000-100,000 tokens    │
│  Captures: syntax, grammar, style   │
├─────────────────────────────────────┤
│  Level 1: Character-Level           │
│  Unit: character/byte               │
│  Sequence: ~5,000-500,000 chars     │
│  Captures: spelling, morphology     │
└─────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Most of the field&apos;s energy has gone into Level 2. LCMs propose that for certain tasks — summarization, planning, cross-lingual transfer — Level 3 is simply a better place to work. You lose fine-grained control over word choice and syntax, but you gain the ability to reason about ideas directly.&lt;/p&gt;
&lt;p&gt;This isn&apos;t a replacement for LLMs. It&apos;s a complement. Different tasks live at different abstraction levels. Generating poetry requires Level 2 (word-level craft matters). Summarizing a legal brief requires Level 3 (concept-level reasoning matters). The choice of abstraction level should match the task, not the model&apos;s architecture.&lt;/p&gt;
&lt;h2&gt;What the Results Actually Show&lt;/h2&gt;
&lt;p&gt;The Meta paper&apos;s results are both impressive and humbling. The 7-billion parameter LCM, trained on sentences extracted from a large multilingual corpus, produced these findings:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Summarization is the sweet spot.&lt;/strong&gt; On the summary expansion benchmark, the diffusion-based LCM (specifically the &quot;Two-Tower&quot; variant) achieved the strongest performance. The model didn&apos;t just extract and rearrange text — it genuinely paraphrased and restructured ideas, producing summaries that read like a human wrote them rather than highlighted them. This makes architectural sense: summarization is fundamentally a concept-level operation. You&apos;re compressing ideas, not rearranging words.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Multilingual transfer is remarkable.&lt;/strong&gt; Despite being trained primarily on English data, the LCM outperformed specialized multilingual models on most of the 45 languages tested. This result flows directly from SONAR&apos;s architecture: because the embedding space is language-agnostic, reasoning learned in one language transfers automatically to all others. The model doesn&apos;t need to &quot;learn&quot; each language separately — it learns to reason about concepts, and SONAR handles the translation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The numbers are concrete.&lt;/strong&gt; The 7B model was trained on 2.3 billion documents (2.7 trillion tokens, 142.4 billion sentence concepts) with a context window of 2,048 concepts. On CNN/DailyMail and XSum summarization benchmarks, the Two-Tower diffusion variant was competitive with instruction-tuned models including T5-3B, Gemma-7B, and LLaMA-3.1-8B — while operating on sequences an order of magnitude shorter.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Expansion is the weak spot.&lt;/strong&gt; When asked to expand a summary into longer text, the model struggled. It repeated ideas, produced circular reasoning, and lost coherence beyond 5-7 sentences. This is the &lt;strong&gt;generation horizon problem&lt;/strong&gt;: concept-level models are good at compressing information but lack the fine-grained control needed for sustained, coherent long-form generation. Token-level models don&apos;t have this problem because they control output one word at a time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The concept space is fragile.&lt;/strong&gt; Small perturbations in the embedding space can cause large, unpredictable changes in the decoded text. A tiny shift in the vector representing &quot;the company&apos;s revenue grew&quot; might decode to &quot;the company&apos;s reputation suffered&quot; — semantically distant but geometrically close. This &lt;strong&gt;embedding instability&lt;/strong&gt; is the most fundamental limitation of the approach, and it gets worse for technical or precise content where exact wording matters.&lt;/p&gt;
&lt;h2&gt;The Planning Problem&lt;/h2&gt;
&lt;p&gt;Perhaps the most interesting section of Meta&apos;s paper is their proposal for the &lt;strong&gt;Large Planning Concept Model (LPCM)&lt;/strong&gt; — a version that generates an explicit plan before producing content.&lt;/p&gt;
&lt;p&gt;The idea is compelling: before generating a sequence of concept embeddings, first generate a high-level plan — a sequence of abstract intentions — and then fill in the concepts that satisfy that plan. This is how skilled writers actually work. You don&apos;t write an essay one sentence at a time, left to right. You outline the argument, identify the key points, and then flesh out each section.&lt;/p&gt;
&lt;p&gt;The LPCM adds a planning layer above the concept layer:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Plan:     [introduce problem] → [present data] → [propose solution] → [acknowledge limits]
           ↓                    ↓                  ↓                    ↓
Concepts: [3-5 sentences]      [3-5 sentences]    [3-5 sentences]     [2-3 sentences]
           ↓                    ↓                  ↓                    ↓
Text:      Paragraph 1          Paragraph 2        Paragraph 3         Paragraph 4
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This hierarchical approach addresses the generation horizon problem by constraining the concept sequence to follow a coherent structure. Early results show improved coherence on longer documents, though the planning module itself introduces new challenges around plan faithfulness — the model sometimes generates concepts that satisfy the local context but deviate from the original plan.&lt;/p&gt;
&lt;h2&gt;Where This Breaks&lt;/h2&gt;
&lt;p&gt;LCMs have real limitations, and understanding them clarifies when this architecture is and isn&apos;t the right choice.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sentence-level granularity is a ceiling, not just a feature.&lt;/strong&gt; The model can&apos;t reason about individual words. It can&apos;t write poetry where word choice matters. It can&apos;t generate code where every character is syntactically meaningful. It can&apos;t produce dialogue where rhythm and word-level pacing create voice. For these tasks, token-level models are strictly superior.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Embedding instability compounds over sequences.&lt;/strong&gt; Each concept prediction has some error. Over a sequence of 30 concepts, these errors accumulate. By concept 20, the model may have drifted significantly from the intended meaning. This is the concept-level equivalent of the &quot;hallucination&quot; problem in LLMs, but harder to detect because the output can be semantically plausible while being factually wrong.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;SONAR is a bottleneck.&lt;/strong&gt; The quality of the entire system depends on SONAR&apos;s ability to faithfully encode and decode sentences. Any information that SONAR loses during encoding — nuance, emphasis, technical precision — is irrecoverable. The model can&apos;t generate what it can&apos;t represent. And SONAR&apos;s 1,024-dimensional embedding space, while powerful for general semantics, may not capture the distinctions that matter for specialized domains like legal text, medical records, or mathematical proofs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Training data requirements are different.&lt;/strong&gt; Token-level LLMs can be trained on raw text. Concept-level LCMs require text that has been segmented into sentences and encoded into SONAR embeddings — an additional preprocessing step that introduces its own errors and biases. Sentence boundary detection is imperfect, especially for languages with non-standard punctuation or for informal text like social media posts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The diffusion approach is slow.&lt;/strong&gt; Diffusion-based concept generation requires multiple denoising steps per concept (typically 10-100 steps). For a 30-concept summary, that&apos;s 300-3,000 forward passes — significantly slower than autoregressive generation. The QLCM variant is faster but sacrifices the richer distributional properties that make diffusion effective.&lt;/p&gt;
&lt;h2&gt;What This Means for the Field&lt;/h2&gt;
&lt;p&gt;==LCMs matter not because they&apos;ll replace LLMs — they won&apos;t, at least not for general-purpose language generation — but because they demonstrate that &lt;strong&gt;the token is not the only viable unit of reasoning&lt;/strong&gt; for neural language models.==&lt;/p&gt;
&lt;p&gt;This insight has implications beyond the specific LCM architecture:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hierarchical reasoning.&lt;/strong&gt; The most capable future systems will likely operate at multiple abstraction levels simultaneously — planning at the concept level, generating at the token level, and verifying at both. The LCM paper provides evidence that concept-level reasoning is learnable and useful, even if the current implementation is limited.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Multilingual AI without multilingual training.&lt;/strong&gt; SONAR&apos;s language-agnostic embedding space suggests a path toward AI systems that can reason in any language after being trained in just one. This has enormous implications for languages with limited training data — the &quot;low-resource language&quot; problem that token-level models still struggle with.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Compression as reasoning.&lt;/strong&gt; LCMs perform a kind of lossy compression: they encode rich, nuanced text into fixed-size vectors, reason over those vectors, and then decode them back. The quality of this compression determines the quality of the reasoning. This frames the problem differently than token-level models, and may lead to architectures that are fundamentally better at tasks that require abstraction — summarization, planning, conceptual reasoning — even if they&apos;re worse at tasks that require precision.&lt;/p&gt;
&lt;p&gt;The granularity mismatch between human reasoning and token-level processing is real. LCMs are the first serious attempt to close it. The attempt is imperfect — fragile embeddings, limited generation horizons, slow diffusion — but the direction is right. ==We should be building models that reason about ideas, not just words.== The question is how to do it without losing the precision that makes token-level models so useful.&lt;/p&gt;
&lt;p&gt;The follow-up work is already proving the direction right. &lt;a href=&quot;https://arxiv.org/abs/2508.05305&quot;&gt;SONAR-LLM&lt;/a&gt; (August 2025) eliminates the diffusion sampler entirely, building a decoder-only transformer that &quot;thinks in sentence embeddings and speaks in tokens&quot; — significantly outperforming both diffusion and MSE variants while achieving near-linear computational scaling up to 1M tokens. &lt;a href=&quot;https://arxiv.org/abs/2512.24617&quot;&gt;Dynamic Large Concept Models&lt;/a&gt; (December 2025) go further, learning semantic boundaries from latent representations rather than using fixed sentence boundaries, achieving +2.69% across 12 zero-shot benchmarks under matched compute.&lt;/p&gt;
&lt;p&gt;==Meta&apos;s original LCM doesn&apos;t answer the granularity question. But the trajectory from LCM → SONAR-LLM → Dynamic LCM shows the field converging on an answer: the right unit of reasoning isn&apos;t fixed — it&apos;s learned.==&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;Sharad Jain builds AI systems in Bengaluru. He previously worked on data infrastructure at Meta and founded &lt;a href=&quot;http://autoscreen.ai&quot;&gt;autoscreen.ai&lt;/a&gt;, a production voice AI platform. He writes about emerging AI architectures at &lt;a href=&quot;https://sharadja.in&quot;&gt;sharadja.in&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Further reading:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Meta AI, &lt;a href=&quot;https://arxiv.org/abs/2412.08821&quot;&gt;&quot;Large Concept Models: Language Modeling in a Sentence Representation Space&quot;&lt;/a&gt; — December 2024&lt;/li&gt;
&lt;li&gt;Meta AI, &lt;a href=&quot;https://github.com/facebookresearch/SONAR&quot;&gt;SONAR: Multilingual Sentence Embeddings&lt;/a&gt; — sentence-level encoder supporting 200+ languages&lt;/li&gt;
&lt;li&gt;Anthropic, &lt;a href=&quot;https://www.anthropic.com/engineering/building-effective-agents&quot;&gt;&quot;Building Effective Agents&quot;&lt;/a&gt; — contrasting approach: token-level agents with workflow patterns&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><enclosure url="https://sharadja.in/lcm.png" type="image/png" length="0"/><category>AI</category><category>LCM</category><category>concept-models</category><category>SONAR</category><category>Meta</category><category>language-models</category><category>embeddings</category><category>architecture</category><author>Sharad Jain</author></item><item><title>The Dance of AI Agents: How Multi-Agent Systems Actually Work</title><link>https://sharadja.in/blog/dance-of-ai-agents/</link><guid isPermaLink="true">https://sharadja.in/blog/dance-of-ai-agents/</guid><description>Multi-agent AI isn&apos;t about making systems smarter — it&apos;s about making them smaller, specialized, and composable. A practitioner&apos;s guide to the patterns that work.</description><pubDate>Fri, 28 Jun 2024 09:30:00 GMT</pubDate><content:encoded>&lt;h1&gt;The Dance of AI Agents: How Multi-Agent Systems Actually Work&lt;/h1&gt;
&lt;p&gt;Something interesting happens when you watch a great team. Each person knows their role, understands when to step in, and — more importantly — when to step back. They move with a fluid efficiency that makes complex work look simple.&lt;/p&gt;
&lt;p&gt;I&apos;ve been trying to build AI systems that work the same way. Not a single monolithic model that does everything, but a network of specialized agents that collaborate, delegate, and route around each other&apos;s limitations. The results have been surprising — both in what works and in what fails spectacularly.&lt;/p&gt;
&lt;p&gt;==The most important lesson: &lt;strong&gt;multi-agent AI isn&apos;t about making systems smarter. It&apos;s about making them smaller.&lt;/strong&gt;== ==A network of focused agents, each with a narrow responsibility and clear handoff protocols, consistently outperforms a single agent trying to do everything.== Not because the individual agents are more capable, but because the architecture constrains them to do less — and do it well.&lt;/p&gt;
&lt;h2&gt;The Single-Agent Ceiling&lt;/h2&gt;
&lt;p&gt;Most AI applications today are single-agent systems: one model, one system prompt, one set of tools. You give it a complex task, it reasons through it, calls some tools, and produces output. This works well for simple tasks. It breaks down predictably for complex ones.&lt;/p&gt;
&lt;p&gt;The failure mode is always the same: &lt;strong&gt;context pollution&lt;/strong&gt;. As the task gets more complex, the system prompt grows. More tools are added. The model must hold more state in its context window. And as we saw with the &lt;a href=&quot;https://sharadja.in/blog/system-prompt-architecture-agentic-ai&quot;&gt;14K Token Debt&lt;/a&gt;, attention degrades as context grows — the model starts losing track of its instructions, its tools interfere with each other, and output quality drops.&lt;/p&gt;
&lt;p&gt;Anthropic documented this pattern precisely in their &lt;a href=&quot;https://www.anthropic.com/engineering/building-effective-agents&quot;&gt;&quot;Building Effective Agents&quot;&lt;/a&gt; guide (December 2024). Their central recommendation: &lt;strong&gt;start simple, and only add complexity when simpler solutions fall short.&lt;/strong&gt; But when you do need complexity, the answer isn&apos;t a bigger single agent — it&apos;s multiple agents working together.&lt;/p&gt;
&lt;p&gt;The insight maps to a broader principle from software engineering: the &lt;a href=&quot;https://en.wikipedia.org/wiki/Single-responsibility_principle&quot;&gt;single responsibility principle&lt;/a&gt;. A function that does one thing well is easier to test, debug, and maintain than a function that does ten things. The same applies to agents.&lt;/p&gt;
&lt;h2&gt;Five Workflow Patterns That Actually Work&lt;/h2&gt;
&lt;p&gt;Anthropic&apos;s taxonomy of agent collaboration patterns is the clearest framework I&apos;ve seen. They distinguish between &lt;strong&gt;workflows&lt;/strong&gt; (LLMs orchestrated through predefined code paths) and &lt;strong&gt;agents&lt;/strong&gt; (LLMs that dynamically direct their own processes). Most production systems are workflows, not agents — and that&apos;s a good thing.&lt;/p&gt;
&lt;p&gt;Here are the five workflow patterns, ordered by complexity:&lt;/p&gt;
&lt;h3&gt;1. Prompt Chaining&lt;/h3&gt;
&lt;p&gt;The simplest pattern: break a task into sequential steps, where each step&apos;s output feeds the next step&apos;s input. Each step can have its own focused prompt, its own validation gate, and its own error handling.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[Generate outline] → [Gate: outline valid?] → [Write section 1] → [Write section 2] → [Edit]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;When to use:&lt;/strong&gt; Tasks that decompose into fixed, predictable subtasks. Translation pipelines, content generation with review steps, multi-stage data processing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When not to use:&lt;/strong&gt; Tasks where the steps can&apos;t be known in advance, or where later steps might invalidate earlier ones.&lt;/p&gt;
&lt;h3&gt;2. Routing&lt;/h3&gt;
&lt;p&gt;A classifier examines the input and directs it to the appropriate specialized handler. Like a hospital triage nurse — assess, classify, route.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[User message] → [Classifier] → Sales agent
                              → Support agent
                              → Technical agent
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;When to use:&lt;/strong&gt; Inputs that fall into distinct categories requiring different handling. Customer service, document processing, multi-domain assistants.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The trap:&lt;/strong&gt; Over-routing. If your classifier needs to distinguish between 15 categories, it will misclassify frequently. Keep categories to 3-5 for reliable routing.&lt;/p&gt;
&lt;h3&gt;3. Parallelization&lt;/h3&gt;
&lt;p&gt;Run multiple agents simultaneously on different aspects of the same input, then combine results. Two variants: &lt;strong&gt;sectioning&lt;/strong&gt; (divide work by subtask) and &lt;strong&gt;voting&lt;/strong&gt; (same task, multiple attempts, pick the best).&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;                 ┌→ [Grammar check]  ─┐
[Document] ──────┼→ [Fact check]     ─┼→ [Merge results]
                 └→ [Style check]    ─┘
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;When to use:&lt;/strong&gt; Tasks where independent subtasks can run concurrently, or where confidence is improved by multiple attempts. Code review, content moderation, multi-criteria evaluation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cost awareness:&lt;/strong&gt; Parallelization multiplies your API costs linearly. Three parallel agents cost 3x. Make sure the quality improvement justifies the expense.&lt;/p&gt;
&lt;h3&gt;4. Orchestrator-Workers&lt;/h3&gt;
&lt;p&gt;A central orchestrator agent dynamically breaks down the task and delegates to worker agents. Unlike prompt chaining, the subtasks aren&apos;t predetermined — the orchestrator decides what to delegate based on the specific input.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[Orchestrator] → &quot;I need these 3 things done&quot;
     ├→ [Worker A: research]
     ├→ [Worker B: analysis]
     └→ [Worker C: writing]
          ↓
[Orchestrator] → &quot;Now combine the results&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;When to use:&lt;/strong&gt; Complex tasks where the decomposition varies per input. Coding agents that need to read files, plan changes, implement, and test — but which files and what changes depend on the task.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The danger:&lt;/strong&gt; Orchestrator drift. The orchestrator agent is itself an LLM, subject to the same attention degradation as any other model. If it loses track of the overall plan, workers execute correctly on the wrong tasks. Keep orchestrator context small and plan explicit.&lt;/p&gt;
&lt;h3&gt;5. Evaluator-Optimizer&lt;/h3&gt;
&lt;p&gt;One agent generates output; another evaluates it. The evaluation feeds back, and the generator iterates. This loop continues until the evaluator is satisfied or a maximum iteration count is hit.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[Generator] → [Output] → [Evaluator: score 6/10] → [Generator: improve] → [Evaluator: 9/10] → Done
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;When to use:&lt;/strong&gt; Tasks with clear, measurable quality criteria. Code that must pass tests, content that must meet a rubric, translations that must match a reference style.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Always set a maximum iteration count.&lt;/strong&gt; Without it, a generator-evaluator pair can loop indefinitely, each iteration making marginal changes that never satisfy the evaluator. I&apos;ve seen this consume $50+ in API calls before anyone noticed.&lt;/p&gt;
&lt;h2&gt;The Handoff Problem&lt;/h2&gt;
&lt;p&gt;The patterns above describe how agents &lt;em&gt;collaborate&lt;/em&gt;. ==But the hardest engineering problem in multi-agent systems isn&apos;t collaboration — it&apos;s &lt;strong&gt;handoffs&lt;/strong&gt;: the moment when one agent passes control, context, and responsibility to another.==&lt;/p&gt;
&lt;p&gt;A bad handoff looks like this: Agent A has been helping a user with a billing issue for 5 turns. Agent A realizes the problem is actually technical and routes to Agent B. Agent B has no context. It asks the user to re-explain the problem. The user is frustrated. This is worse than having no handoff at all.&lt;/p&gt;
&lt;p&gt;A good handoff preserves three things:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Context&lt;/strong&gt; — the full conversation history, not just a summary&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;State&lt;/strong&gt; — what has been tried, what has failed, what the user&apos;s emotional state is&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Intent&lt;/strong&gt; — why the handoff is happening and what the receiving agent should do&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;OpenAI&apos;s &lt;a href=&quot;https://openai.github.io/openai-agents-python/&quot;&gt;Agents SDK&lt;/a&gt; (evolved from the experimental Swarm project) implements two handoff patterns:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Manager Pattern&lt;/strong&gt; — a central agent invokes sub-agents as tools, retaining control of the conversation. The sub-agent executes a focused task and returns results to the manager. The user only ever talks to the manager.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Handoff Pattern&lt;/strong&gt; — one agent fully transfers control to another. The receiving agent takes over the conversation directly. The original agent is no longer involved.&lt;/p&gt;
&lt;p&gt;The manager pattern is safer (the manager maintains continuity) but creates a bottleneck. The handoff pattern is more flexible but requires careful context transfer. In practice, most production systems use the manager pattern for reliability, with handoffs reserved for clear domain boundaries (e.g., sales → support → technical).&lt;/p&gt;
&lt;p&gt;Here&apos;s what a clean handoff implementation looks like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from agents import Agent, Runner

triage = Agent(
    name=&quot;Triage&quot;,
    instructions=&quot;Classify the user&apos;s need. Route to sales for purchases, &quot;
                 &quot;support for issues. Always transfer full context.&quot;,
    handoffs=[&quot;sales_agent&quot;, &quot;support_agent&quot;],
)

sales_agent = Agent(
    name=&quot;Sales&quot;,
    instructions=&quot;Handle purchase inquiries. If the user has a technical &quot;
                 &quot;issue, transfer back to triage with context.&quot;,
    handoffs=[&quot;triage&quot;],
)

support_agent = Agent(
    name=&quot;Support&quot;,
    instructions=&quot;Resolve customer issues. Offer refunds only after &quot;
                 &quot;attempting a fix. Escalate to human if unresolved &quot;
                 &quot;after 3 attempts.&quot;,
    handoffs=[&quot;triage&quot;],
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The key design choice: &lt;strong&gt;bounded handoff depth&lt;/strong&gt;. An agent should hand off to at most 2-3 specialists, and handoff chains should never exceed 3 hops. Triage → Sales → Triage is fine. Triage → Sales → Support → Technical → Billing → Triage is a system that will produce circular delegation, infinite loops, and confused users.&lt;/p&gt;
&lt;h2&gt;Where Multi-Agent Systems Fail&lt;/h2&gt;
&lt;p&gt;Multi-agent architectures introduce failure modes that don&apos;t exist in single-agent systems. Understanding these is more important than understanding the successes.&lt;/p&gt;
&lt;p&gt;A &lt;a href=&quot;https://arxiv.org/abs/2503.13657&quot;&gt;systematic analysis of multi-agent failures&lt;/a&gt; (Cemri et al., 2025) studied 150+ conversation traces across five production frameworks (MetaGPT, ChatDev, AutoGen, and others) and found &lt;strong&gt;41-86.7% failure rates&lt;/strong&gt; in production. They identified 14 distinct failure modes in three categories: specification errors (42%), inter-agent misalignment (37%), and verification failures (21%). ==The finding that should keep you up at night: unstructured multi-agent networks can &lt;strong&gt;amplify errors up to 17.2x&lt;/strong&gt; compared to single-agent baselines.==&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cascading errors.&lt;/strong&gt; When Agent A produces subtly wrong output and passes it to Agent B, Agent B has no way to know the input is wrong. It processes it faithfully, amplifying the error. Agent C does the same. By the end of the chain, the output is confidently, thoroughly wrong — and every agent&apos;s logs show correct execution. This is the 17.2x amplification in action.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cost explosion.&lt;/strong&gt; Each agent in the network makes its own LLM API calls. A 5-agent workflow processing 1,000 requests per day, with an average of 3 turns per agent, generates 15,000 API calls daily. At $0.01 per call (GPT-4o-mini), that&apos;s $150/day — $4,500/month. With GPT-4o or Claude Opus, multiply by 10-50x. Multi-agent systems can be dramatically more expensive than single-agent alternatives, and the cost scales with the number of agents, not the quality of output.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Accountability gaps.&lt;/strong&gt; When five agents collaborate on a decision and the decision is wrong, which agent is responsible? In practice, no one debugs a multi-agent failure by reading one agent&apos;s logs — you need to trace the full decision chain across all agents. This requires end-to-end tracing infrastructure, and most teams don&apos;t build it until after their first expensive failure.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Coordination overhead.&lt;/strong&gt; Every handoff, every parallel merge, every evaluator loop adds latency. A prompt chain of 4 agents, each taking 2 seconds, adds 8 seconds of sequential latency — enough to make interactive applications feel sluggish. Parallelization helps but doesn&apos;t eliminate the coordination cost.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The &quot;too many agents&quot; trap.&lt;/strong&gt; Teams that discover multi-agent patterns often over-apply them. A 3-agent customer service system (triage, sales, support) is clean and maintainable. A 15-agent system with specialized agents for &quot;greeting,&quot; &quot;empathy,&quot; &quot;product recommendation,&quot; &quot;upselling,&quot; and &quot;farewell&quot; is over-engineered and harder to debug than the monolithic system it replaced. Andrew Ng&apos;s &lt;a href=&quot;https://www.deeplearning.ai/the-batch/agentic-design-patterns-part-1-reflection/&quot;&gt;practical guidance&lt;/a&gt; on agentic design patterns emphasizes starting with the simplest agent architecture that solves the problem.&lt;/p&gt;
&lt;h2&gt;When to Use What&lt;/h2&gt;
&lt;p&gt;The decision tree is simpler than most frameworks suggest:&lt;/p&gt;
&lt;p&gt;For context on what &quot;works in production&quot; actually looks like: Salesforce&apos;s Agentforce, the largest documented agentic deployment, handled &lt;a href=&quot;https://www.salesforce.com/news/stories/agentforce-customer-support-lessons-learned/&quot;&gt;1.5 million+ support requests&lt;/a&gt; in its first year with an 84% resolution rate without human intervention and only a 4% handoff rate. But it&apos;s a carefully constrained workflow system, not an autonomous agent swarm.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use a single agent&lt;/strong&gt; when: the task is straightforward, fits in one context window, requires one domain of expertise, and doesn&apos;t need collaboration with other systems. This covers 80% of real-world use cases.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use prompt chaining&lt;/strong&gt; when: the task has clear sequential steps with validation gates between them. This is the first multi-agent pattern you should reach for.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use routing&lt;/strong&gt; when: inputs fall into 3-5 distinct categories that need genuinely different handling. Not different tones — different tools, different knowledge, different procedures.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use parallelization&lt;/strong&gt; when: you need speed (run subtasks concurrently) or confidence (multiple attempts at the same task). Be prepared for the cost multiplier.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use orchestrator-workers&lt;/strong&gt; when: the task decomposition varies per input and can&apos;t be predetermined. This is the most powerful pattern and the most dangerous — orchestrator drift is real.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use evaluator-optimizer&lt;/strong&gt; when: you have clear, measurable quality criteria and the cost of iteration is justified by the cost of a wrong answer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use full handoff-based multi-agent systems&lt;/strong&gt; when: you need agents with genuinely different capabilities (different tools, different models, different system prompts) that must transfer control based on runtime decisions. This is the most complex pattern and should be your last resort, not your first.&lt;/p&gt;
&lt;h2&gt;Building for Collaboration&lt;/h2&gt;
&lt;p&gt;If you&apos;re building multi-agent systems today, a few hard-won lessons:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Design the handoffs first.&lt;/strong&gt; Before you build any agent, design the handoff protocol. What context gets transferred? What state is preserved? What triggers a handoff? The handoff design determines whether your system feels seamless or fragmented.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Trace everything.&lt;/strong&gt; Build end-to-end tracing before you build the agents. Every handoff, every tool call, every decision should be logged with full context. You will need this when things go wrong — and they will go wrong.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Start with workflows, not agents.&lt;/strong&gt; ==Predefined code paths with LLM steps are more reliable, more debuggable, and cheaper than autonomous agents.== Only use true autonomous agents when the task genuinely can&apos;t be decomposed in advance.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Measure the right thing.&lt;/strong&gt; Don&apos;t measure how many agents you have. Measure task completion rate, average cost per task, and time-to-resolution. A 2-agent system that solves 95% of problems for $0.05 each is better than a 10-agent system that solves 98% for $2.50 each.&lt;/p&gt;
&lt;p&gt;==The dance of AI agents isn&apos;t about choreographing ever-more-complex routines. It&apos;s about knowing when a simple two-step is better than a waltz — and having the discipline to keep it simple until simplicity genuinely isn&apos;t enough.==&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;Sharad Jain builds agentic AI systems in Bengaluru. He previously worked on data infrastructure at Meta and founded &lt;a href=&quot;http://autoscreen.ai&quot;&gt;autoscreen.ai&lt;/a&gt;, a production voice AI platform. He writes about agent architecture at &lt;a href=&quot;https://sharadja.in&quot;&gt;sharadja.in&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Further reading:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Anthropic, &lt;a href=&quot;https://www.anthropic.com/engineering/building-effective-agents&quot;&gt;&quot;Building Effective Agents&quot;&lt;/a&gt; — the definitive taxonomy of agent workflow patterns&lt;/li&gt;
&lt;li&gt;OpenAI, &lt;a href=&quot;https://openai.github.io/openai-agents-python/&quot;&gt;Agents SDK&lt;/a&gt; — production agent framework with handoffs and guardrails&lt;/li&gt;
&lt;li&gt;Andrew Ng, &lt;a href=&quot;https://www.deeplearning.ai/the-batch/agentic-design-patterns-part-1-reflection/&quot;&gt;&quot;Agentic Design Patterns&quot;&lt;/a&gt; — four foundational patterns for agent systems&lt;/li&gt;
&lt;li&gt;OpenAI, &lt;a href=&quot;https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf&quot;&gt;&quot;A Practical Guide to Building Agents&quot;&lt;/a&gt; — enterprise agent architecture guide&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><enclosure url="https://sharadja.in/images/agents-orchestrating.png" type="image/png" length="0"/><category>AI</category><category>agents</category><category>orchestration</category><category>multi-agent</category><category>handoffs</category><category>OpenAI</category><category>Anthropic</category><category>architecture</category><author>Sharad Jain</author></item><item><title>Building a Salesforce Slack Assistant: Lessons in AI-Powered Enterprise Tools</title><link>https://sharadja.in/blog/building-ai-salesforce-slack-assistant/</link><guid isPermaLink="true">https://sharadja.in/blog/building-ai-salesforce-slack-assistant/</guid><description>Explore our journey in creating an AI-powered Salesforce assistant for Slack, including architectural decisions, challenges faced, and lessons learned in developing enterprise-grade AI tools.</description><pubDate>Thu, 27 Jun 2024 09:00:00 GMT</pubDate><content:encoded>&lt;h1&gt;Building a Salesforce Slack Assistant: Lessons in AI-Powered Enterprise Tools&lt;/h1&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;When we set out to build a Salesforce assistant for Slack, we knew we were venturing into relatively uncharted territory. The idea seemed simple: allow our sales team to query Salesforce data directly from Slack. But as with most seemingly simple ideas in software, the devil was in the details.&lt;/p&gt;
&lt;h2&gt;The Genesis&lt;/h2&gt;
&lt;p&gt;Our sales team was spending too much time switching between Slack and Salesforce. ==This context-switching was more than just an annoyance; it was a real drag on productivity.== We thought, &quot;What if we could bring Salesforce to where our team already spends most of their day?&quot;&lt;/p&gt;
&lt;h2&gt;Architectural Decisions&lt;/h2&gt;
&lt;h3&gt;The Stack&lt;/h3&gt;
&lt;p&gt;We settled on a Python backend, leveraging the OpenAI API for natural language processing, and the Salesforce API for data retrieval. Here&apos;s why:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Python: Fast development, rich ecosystem for both AI and Salesforce integrations.&lt;/li&gt;
&lt;li&gt;OpenAI API: State-of-the-art language models for understanding complex queries.&lt;/li&gt;
&lt;li&gt;Salesforce API: Robust and well-documented, essential for reliable data retrieval.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;The Flow&lt;/h3&gt;
&lt;p&gt;The basic flow of our assistant works like this:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;User sends a message in Slack.&lt;/li&gt;
&lt;li&gt;Our server processes the message.&lt;/li&gt;
&lt;li&gt;OpenAI Assistant interprets the query.&lt;/li&gt;
&lt;li&gt;If needed, we fetch data from Salesforce.&lt;/li&gt;
&lt;li&gt;We generate and send a response back to Slack.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Sounds simple, right? But each of these steps hides layers of complexity.&lt;/p&gt;
&lt;h2&gt;Challenges and Solutions&lt;/h2&gt;
&lt;h3&gt;1. Context Management&lt;/h3&gt;
&lt;p&gt;One of the first hurdles we hit was maintaining context across messages. We needed our assistant to remember previous queries within a conversation.
Solution: We implemented a thread management system using Python&apos;s &lt;code&gt;shelve&lt;/code&gt; module. Each Slack thread gets a corresponding entry in our database, allowing us to maintain context efficiently.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://sharadja.in/images/blogs/component_diagram.png&quot; alt=&quot;Sequence diagram showing the flow between User, Slack Bot, Assistant, Salesforce, and Chart Generator&quot; /&gt;&lt;/p&gt;
&lt;h3&gt;2. Query Translation&lt;/h3&gt;
&lt;p&gt;Turning natural language into SOQL (Salesforce Object Query Language) queries was our next big challenge.&lt;/p&gt;
&lt;p&gt;Solution: ==We leveraged OpenAI&apos;s function calling capabilities.== We defined custom functions that the AI could &quot;call&quot;, effectively translating natural language into structured queries.&lt;/p&gt;
&lt;h3&gt;3. Data Visualization&lt;/h3&gt;
&lt;p&gt;Raw data is rarely what users want. Often, they need charts or graphs to make sense of the numbers.&lt;/p&gt;
&lt;p&gt;Solution: We integrated a chart generation service. When the AI detects that a visual representation would be beneficial, it triggers this service to create and send images directly in Slack.&lt;/p&gt;
&lt;h2&gt;The Architecture in Detail&lt;/h2&gt;
&lt;p&gt;The system follows a modular architecture: Slack receives user messages and forwards them to our Python backend, which orchestrates between the OpenAI API (for natural language understanding) and the Salesforce API (for data retrieval). Responses flow back through the same pipeline, with a caching layer to minimize redundant API calls.&lt;/p&gt;
&lt;h2&gt;Performance Considerations&lt;/h2&gt;
&lt;p&gt;As with any system that interfaces with multiple external APIs, performance was a key concern. We implemented several optimizations:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Caching frequently requested data.&lt;/li&gt;
&lt;li&gt;Asynchronous processing for long-running queries.&lt;/li&gt;
&lt;li&gt;Intelligent query optimization to minimize Salesforce API calls.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We track response latency, cache hit rates, and API call volume in real time — this lets us quickly spot bottlenecks and tune the system before users notice degradation.&lt;/p&gt;
&lt;h2&gt;User Experience&lt;/h2&gt;
&lt;p&gt;We paid special attention to the user experience, aiming to make interactions with our assistant as natural as possible. A typical interaction starts with a natural-language question in Slack, which the assistant interprets, queries Salesforce, and returns a formatted response — all within the same thread so context is never lost.&lt;/p&gt;
&lt;h2&gt;Evolution and Future Plans&lt;/h2&gt;
&lt;p&gt;Our Salesforce Slack Assistant wasn&apos;t built in a day. It evolved over time, and continues to evolve. We started with simple text queries, added chart generation, then context-aware threading, and most recently, proactive notifications. Future plans include enhanced AI capabilities and multi-platform support.&lt;/p&gt;
&lt;h2&gt;Lessons Learned&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;==Start simple, but design for complexity: Our initial MVP was basic, but we architected it in a way that allowed us to add features without major rewrites.==&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;User feedback is gold: Some of our most useful features came directly from user suggestions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;==AI is powerful, but needs guardrails: We learned to carefully define the boundaries of what our AI assistant can do to prevent unexpected behavior.==&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;==Performance matters more than you think: In a chat interface, even small delays can significantly impact user experience.==&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Building this Salesforce Slack Assistant has been a journey of continuous learning and improvement. It&apos;s a testament to the power of bringing AI into enterprise tools, but also a reminder of the challenges that come with it.&lt;/p&gt;
&lt;p&gt;As we continue to refine and expand our assistant, we&apos;re excited about the potential impact on not just our sales team, but on how enterprises interact with their data as a whole.&lt;/p&gt;
&lt;p&gt;==Remember, the goal isn&apos;t just to build cool tech—it&apos;s to solve real problems for real users.== Keep that in mind, and the rest will follow.&lt;/p&gt;
</content:encoded><enclosure url="https://sharadja.in/salesforce_slack_assistant.png" type="image/png" length="0"/><category>AI</category><category>Salesforce</category><category>Slack</category><category>enterprise tools</category><category>software architecture</category><category>OpenAI</category><author>Sharad Jain</author></item><item><title>Building AI-first Multiplayer Writing Tool for Creative Teams</title><link>https://sharadja.in/blog/ai-first-multiplayer-writing-tool/</link><guid isPermaLink="true">https://sharadja.in/blog/ai-first-multiplayer-writing-tool/</guid><description>Building a feature-rich, AI-powered collaborative text editor using Jamsocket, Next.js, and OpenAI&apos;s Assistant API, enabling seamless real-time collaboration for content creators, marketers, and entrepreneurs.</description><pubDate>Wed, 15 May 2024 14:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;The Shift : Writing in the age of abundance&lt;/h2&gt;
&lt;p&gt;In the age of AI, online writing is undergoing a significant shift. ==While AI can generate content faster and more efficiently than humans, it lacks the ability to capture the unique perspectives and experiences that only human writers can provide.== As a result, the Internet is likely to reward writing that prioritizes uniqueness and beauty over mere smartness and speed.&lt;/p&gt;
&lt;p&gt;In today&apos;s fast-paced digital landscape, collaboration has become a crucial aspect of content creation. Startup founders, VCs, and professionals across various industries are embracing tools that enable real-time collaboration, as they strive to generate ideas, write content, and refine it efficiently. One such platform that has gained significant traction in the design industry is Figma, a collaborative design tool valued at $10 billion. &lt;a href=&quot;https://www.forbes.com/sites/alexkonrad/2021/08/10/how-figma-became-designs-hottest-startup-valued-at-10billion/?sh=2873c174726e&quot;&gt;Read this blog post&lt;/a&gt; to learn more about Figma&apos;s success story.&lt;/p&gt;
&lt;p&gt;In this post I want to explore building an &lt;strong&gt;AI-native collaborative text editor&lt;/strong&gt; application, utilizing Jamsocket, Next.js, and OpenAI&apos;s Assistant API. Such an interface not only facilitates real-time brainstorming, writing, and content refinement among teams but also unlocks unparalleled creativity, enriched by intelligent suggestions and support from an AI co-writer.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;==AI can&apos;t replicate your unique voice and perspective.== Your lived experiences, ideas, and writing style are one-of-a-kind, and that&apos;s what will make your content stand out in a sea of AI-generated text.&lt;/li&gt;
&lt;li&gt;Readers crave high-quality, memorable content. With the abundance of mediocre writing online, people are hungry for exceptional ideas and beautifully crafted prose. By focusing on &lt;em&gt;quality over quantity&lt;/em&gt; and &lt;em&gt;authorship over speed&lt;/em&gt;, you can capture and hold your audience&apos;s attention.&lt;/li&gt;
&lt;li&gt;==LLMs are like stochastic parrots, repeating patterns from their training data without true understanding.== They generate text that may seem passable at first glance but lacks the depth, clarity, and intentionality of human writing. The difference is akin to a blurry photograph versus a carefully crafted painting - while the overall picture might be recognizable, the details are distorted and the meaning is lost.&lt;/li&gt;
&lt;li&gt;Shareable content is king. When you create content that is both unique and beautiful, it becomes more shareable. People are more likely to engage with and spread the word about writing that resonates with them on a personal level.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To succeed as a writer in the age of AI, embrace your singularity and hone your craft. Collaborate with AI agents to enhance your unique voice and perspective. These AI collaborators can help you refine your ideas, suggest improvements, and even challenge you to think more deeply about your writing.&lt;/p&gt;
&lt;p&gt;==Don&apos;t try to outsmart or outpace AI; instead, focus on creating content that only you can produce, enhanced by the power of AI assistance.== Use AI agents as a tool to augment your creativity and writing process, helping you craft more compelling and engaging content.&lt;/p&gt;
&lt;p&gt;==By leveraging the strengths of both human creativity and AI capabilities, you&apos;ll be able to carve out a niche for yourself and build a loyal following of readers who appreciate your distinct voice and high-quality work.== Embrace the collaboration between human ingenuity and artificial intelligence to create truly exceptional content that stands out in the digital landscape.&lt;/p&gt;
&lt;h2&gt;Building a Real-Time Collaborative Writing Tool&lt;/h2&gt;
&lt;p&gt;Here&apos;s a concise overview of how we can design and implement this AI-powered collaborative writing tool:&lt;/p&gt;
&lt;h3&gt;Architectural Design&lt;/h3&gt;
&lt;p&gt;The application leverages a modular and scalable architecture:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Frontend (Next.js + Quill):&lt;/strong&gt; Next.js powers the user interface, integrating Quill as the rich text editor for seamless collaboration.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI Integration (OpenAI Assistant API):&lt;/strong&gt; The OpenAI Assistant API provides contextual suggestions, corrections, and content generation to enhance the writing process.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-time Collaboration (Jamsocket):&lt;/strong&gt; Jamsocket enables real-time synchronization of document changes and user presence across connected clients.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Backend Services (Next.js + Docker):&lt;/strong&gt; The backend, including session management and AI request handling, is built using Next.js and deployed as containerized services using Docker.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Implementation Steps&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Set up Next.js:&lt;/strong&gt; Create a new Next.js project and configure the necessary routes and API endpoints.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Integrate Quill:&lt;/strong&gt; Customize the Quill text editor to fit the application&apos;s needs and ensure responsiveness.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Connect Jamsocket:&lt;/strong&gt; Implement real-time collaboration by setting up a Jamsocket service and managing user sessions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Integrate OpenAI Assistant API:&lt;/strong&gt; Develop functions to fetch and display AI-powered writing suggestions based on user prompts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deploy with Docker:&lt;/strong&gt; Containerize the backend services using Docker for easy deployment and scalability.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;By following this approach and leveraging the mentioned tech stack, you can build a feature-rich, AI-first collaborative writing tool that empowers creative teams to produce high-quality content efficiently.&lt;/p&gt;
&lt;p&gt;🛠️ &lt;strong&gt;Tech Stack Used:&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&quot;https://jamsocket.com&quot;&gt;Jamsocket&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;For real-time collaboration and shared state management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&quot;https://nextjs.org&quot;&gt;Next.js&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;For building the server-side and client-side components&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&quot;https://platform.openai.com/docs/assistants/overview&quot;&gt;OpenAI Assistants API&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Integrates AI assistants for enhanced writing support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&quot;https://quilljs.com&quot;&gt;Quill&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Powers the text editor and syncs edits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&quot;https://tailwindcss.com/docs/installation&quot;&gt;Tailwind CSS&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;For styling the application&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&quot;https://astro.build&quot;&gt;Astro&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;For the overall project setup and configuration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&quot;https://www.docker.com&quot;&gt;Docker&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;For containerizing and deploying the session backend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&quot;https://docs.jamsocket.com/platform/jamsocket-cli&quot;&gt;Jamsocket CLI&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;For managing the deployment to Jamsocket&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Laying the Foundation&lt;/h2&gt;
&lt;p&gt;Let&apos;s start by setting up the project structure.
We&apos;ll be using a typical Next.js directory layout, with a few key components:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;src/app/page.tsx&lt;/code&gt;: This is the server-side component responsible for spawning the session backend on Jamsocket.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;src/components/Home.tsx&lt;/code&gt;: The main client-side component that will handle the bulk of the application&apos;s functionality.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;src/session-backend/index.ts&lt;/code&gt;: The server-side logic for managing the shared state and real-time communication.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We&apos;ll also include some helper components like &lt;code&gt;TextCanvas.tsx&lt;/code&gt; to encapsulate the text editor and synchronize edits between clients.&lt;/p&gt;
&lt;h2&gt;Implementing Session Backend&lt;/h2&gt;
&lt;p&gt;==The session backend is the heart of our collaborative application.== It acts as a stateful layer between the clients and the document storage, maintaining the shared state of the text document and handling real-time updates between connected users.&lt;/p&gt;
&lt;p&gt;Here&apos;s a simplified version of the session backend implementation in &lt;code&gt;session-backend/index.ts&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// ... imports

const io = new Server(8080);
let documentContent = &quot;&quot;; // Initial document state

io.on(&quot;connection&quot;, (socket: Socket) =&amp;gt; {
  socket.emit(&quot;snapshot&quot;, documentContent);

  socket.on(&quot;update-document&quot;, (content: string) =&amp;gt; {
    documentContent = content;
    socket.broadcast.emit(&quot;update-document&quot;, content);
  });

  // ... other event handlers
});
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When a client connects, it receives a snapshot of the current document content. When a client updates the document, the session backend updates the shared state and broadcasts the changes to all connected clients.&lt;/p&gt;
&lt;h3&gt;Architecture&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://mermaid.live/edit#pako:eNqtVD1v2zAQ_SsEFy2yYdGWImnI0GRp0U5GlkALQ55rotZRJSk7ieH_Huojib_kGEUFDeLxvXfvjuJtqdASaE4t_K0BBdwr_tvwskDin4obp4SqODpyt1KALhraYKcbc7BWafzGxR9AWWCHQO2A6DWYd8WcfEflFF-pVyA_eGm1x7tB8LziGyS20yZPh-I9anR7e5g8J3caEUQvy4VTa-6lz1s8jI682kf2wCKv7FK7gMDah8hGuSVRXQHEwbM7NvJBfahkk1LostLYUK1r1q1AR9ynsks1XLbJhmyK2phmcWyT7VOvtfl5NMc2fyrrAMlCGxIIA549aogB4ShJULf6faQ1Z78-vUOdlnW2B6fMvpzPki917dTbP7UJYXPliQ5m_H91RV_WdfVf2snIM7WdO7N7ZcX-pRvuwjFSwsAFpSEtwZRcST-xtg24oG4JJRQ0958SFrxeuYIWuPNQXjs9f0FBc2dqCGlnvx9wNF_wlfVRP6oetS7fQX5J8y19pnmURuMsu5klURIzlrJ4FtIXmqfTccxmzL9JmiVZnOxC-toKTMZpFE3i6WTK0nRyk8UhBamcNr-6EdtO2t0bN-LbXA&quot;&gt;&lt;img src=&quot;https://mermaid.ink/img/pako:eNqtVD1v2zAQ_SsEFy2yYdGWImnI0GRp0U5GlkALQ55rotZRJSk7ieH_Huojib_kGEUFDeLxvXfvjuJtqdASaE4t_K0BBdwr_tvwskDin4obp4SqODpyt1KALhraYKcbc7BWafzGxR9AWWCHQO2A6DWYd8WcfEflFF-pVyA_eGm1x7tB8LziGyS20yZPh-I9anR7e5g8J3caEUQvy4VTa-6lz1s8jI682kf2wCKv7FK7gMDah8hGuSVRXQHEwbM7NvJBfahkk1LostLYUK1r1q1AR9ynsks1XLbJhmyK2phmcWyT7VOvtfl5NMc2fyrrAMlCGxIIA549aogB4ShJULf6faQ1Z78-vUOdlnW2B6fMvpzPki917dTbP7UJYXPliQ5m_H91RV_WdfVf2snIM7WdO7N7ZcX-pRvuwjFSwsAFpSEtwZRcST-xtg24oG4JJRQ0958SFrxeuYIWuPNQXjs9f0FBc2dqCGlnvx9wNF_wlfVRP6oetS7fQX5J8y19pnmURuMsu5klURIzlrJ4FtIXmqfTccxmzL9JmiVZnOxC-toKTMZpFE3i6WTK0nRyk8UhBamcNr-6EdtO2t0bN-LbXA?type=png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Integrating OpenAI Assistant&lt;/h2&gt;
&lt;p&gt;To provide AI-powered assistance during the collaborative writing process, we&apos;ll integrate the OpenAI Assistant API. First, we&apos;ll set up the API key in a &lt;code&gt;.env&lt;/code&gt; file:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;OPENAI_API_KEY=[YOUR_OPENAI_API_KEY]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, in the &lt;code&gt;session-backend/index.ts&lt;/code&gt; file, we&apos;ll create an OpenAI Assistant instance with instructions tailored for text editing collaboration:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;const assistant = await openai.beta.assistants.create({
  instructions: `You are an AI assistant for collaborative writing. 
    Help users with tasks like:
    - Suggesting synonyms and alternative phrasings.
    - Correcting grammar and spelling mistakes.
    - Generating content based on user prompts.
    - Providing feedback on the overall structure and flow of the document.`,
  model: &quot;gpt-4-1106-preview&quot;,
  // ... other parameters
});
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We&apos;ll then modify the &lt;code&gt;handleUserPrompt&lt;/code&gt; function to send user prompts to the Assistant and process the responses to update the document state. This might involve parsing the Assistant&apos;s response and applying the suggested changes to the shared document content.&lt;/p&gt;
&lt;h2&gt;Building the Text Editor Component&lt;/h2&gt;
&lt;p&gt;The heart of our collaborative text editor is the &lt;code&gt;TextCanvas.tsx&lt;/code&gt; component. We&apos;ll use a library like Quill to handle real-time editing and synchronization, and integrate Jamsocket&apos;s &lt;code&gt;useEventListener&lt;/code&gt; and &lt;code&gt;useSend&lt;/code&gt; hooks to receive updates from and send events to the session backend.&lt;/p&gt;
&lt;p&gt;Here&apos;s a complete example of the &lt;code&gt;TextCanvas.tsx&lt;/code&gt; component:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import React, { useEffect, useRef } from &quot;react&quot;;
import Quill from &quot;quill&quot;;
import &quot;quill/dist/quill.snow.css&quot;;
import { useEventListener, useSend } from &quot;@jamsocket/socketio&quot;;
import type { User } from &quot;../types&quot;;

interface TextCanvasProps {
  users: User[];
  onCursorMove: (position: { x: number; y: number } | null) =&amp;gt; void;
}

export function TextCanvas({ users, onCursorMove }: TextCanvasProps) {
  const editorRef = useRef&amp;lt;HTMLDivElement&amp;gt;(null);
  const sendEvent = useSend();

  useEffect(() =&amp;gt; {
    if (!editorRef.current) return;

    const editor = new Quill(editorRef.current, {
      modules: {
        toolbar: [
          // ... toolbar options
        ],
      },
      theme: &quot;snow&quot;,
    });

    editor.on(&quot;text-change&quot;, () =&amp;gt; {
      const content = editor.root.innerHTML;
      sendEvent(&quot;update-document&quot;, content);
    });

    editor.on(&quot;selection-change&quot;, (range) =&amp;gt; {
      onCursorMove(range ? { x: range.index, y: range.index + range.length } : null);
    });

    return () =&amp;gt; {
      editor.off(&quot;text-change&quot;);
      editor.off(&quot;selection-change&quot;);
    };
  }, [sendEvent, onCursorMove]);

  return &amp;lt;div ref={editorRef} /&amp;gt;;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;TextCanvas&lt;/code&gt; component sets up a Quill editor instance and listens for text changes and cursor position updates. When the text is updated, it sends an &lt;code&gt;update-document&lt;/code&gt; event to the session backend. When the cursor position changes, it calls the &lt;code&gt;onCursorMove&lt;/code&gt; callback to update the cursor position for other users.&lt;/p&gt;
&lt;h2&gt;Real-Time Updates and Presence&lt;/h2&gt;
&lt;p&gt;To handle real-time updates and presence information, such as cursor positions and user avatars, we&apos;ll use Jamsocket&apos;s &lt;code&gt;useEventListener&lt;/code&gt; and &lt;code&gt;useSend&lt;/code&gt; hooks in the &lt;code&gt;Home.tsx&lt;/code&gt; component:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;useEventListener&amp;lt;User&amp;gt;(&quot;cursor-position&quot;, (user) =&amp;gt; {
  // Update the user&apos;s cursor position in the local state
  // and display it on the TextCanvas
});

// ...

const handleCursorMove = (position: { x: number; y: number } | null) =&amp;gt; {
  sendEvent(&quot;cursor-position&quot;, { x: position?.x, y: position?.y });
};
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These hooks make it easy to synchronize the state between the session backend and the connected clients, creating a truly collaborative experience.&lt;/p&gt;
&lt;h2&gt;Deploying to Jamsocket&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use Jamsocket to manage shared state and real-time communication through session backends.&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Initialize Jamsocket session
const session = Jamsocket.initSession(&apos;your-session-id&apos;);
session.on(&apos;update&apos;, (data) =&amp;gt; {
  // Handle real-time updates
});
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Integrate OpenAI Assistant API to provide AI-powered suggestions and assistance during collaborative writing.&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Fetch AI suggestions
async function fetchAISuggestions(prompt) {
  const response = await fetch(&apos;https://api.openai.com/v1/chat/completions&apos;, {
    method: &apos;POST&apos;,
    headers: {
      &apos;Content-Type&apos;: &apos;application/json&apos;,
      &apos;Authorization&apos;: &apos;Bearer your-openai-api-key&apos;
    },
    body: JSON.stringify({
      model: &apos;gpt-4o-mini&apos;,
      messages: [{ role: &apos;user&apos;, content: prompt }],
      max_tokens: 100
    })
  });
  const data = await response.json();
  return data.choices[0].text;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Build a robust text editor component using libraries like Quill and synchronize edits using Jamsocket hooks.&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Initialize Quill editor and sync with Jamsocket
useEffect(() =&amp;gt; {
  const editor = new Quill(&apos;#editor&apos;, { theme: &apos;snow&apos; });
  editor.on(&apos;text-change&apos;, (delta, oldDelta, source) =&amp;gt; {
    if (source === &apos;user&apos;) {
      session.send(&apos;text-change&apos;, delta);
    }
  });
}, []);
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Handle presence information, such as cursor positions and user avatars, to enhance the collaborative experience.&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Update cursor position for all users
session.on(&apos;cursor-move&apos;, (userId, position) =&amp;gt; {
  // Update cursor position in the UI
});
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Deploy the session backend to Jamsocket for scalability and performance.&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Deploy session backend to Jamsocket
jamsocket deploy --service your-service-name --image your-docker-image
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;So, there you have it. Stick to the guide, throw in your own twist, and boom - you&apos;ve got yourself a slick, real-time text editor that&apos;s not just smart but also plays well with others. Dive in, mess around, and make something cool.&lt;/p&gt;
</content:encoded><enclosure url="https://sharadja.in/creative_writing.png" type="image/png" length="0"/><category>Jamsocket</category><category>Next.js</category><category>OpenAI</category><category>collaboration</category><category>AI-assisted writing</category><author>Sharad Jain</author></item><item><title>How Midjourney&apos;s Discord Became a Phenomenon: A Lesson in Community-Driven Growth</title><link>https://sharadja.in/blog/midjourney-discord-community-driven-growth/</link><guid isPermaLink="true">https://sharadja.in/blog/midjourney-discord-community-driven-growth/</guid><description>Explore how Midjourney achieved massive success through its Discord server, and the lessons it offers for building community-driven platforms.</description><pubDate>Sat, 20 Apr 2024 10:00:00 GMT</pubDate><content:encoded>&lt;h1&gt;How Midjourney&apos;s Discord Became a Phenomenon: A Lesson in Community-Driven Growth&lt;/h1&gt;
&lt;p&gt;When Midjourney launched its AI-powered image generation tool, few could have predicted that a Discord server would become the epicenter of a new artistic movement. Yet, that&apos;s precisely what happened. This unexpected success story offers valuable insights into community-driven growth and the power of shared creativity.&lt;/p&gt;
&lt;h2&gt;Embracing Imperfection: The Power of Rapid Iteration&lt;/h2&gt;
&lt;p&gt;==Conventional wisdom suggests that successful tech products require polished interfaces and extensive marketing.== ==Midjourney took a different approach, focusing instead on two critical elements: empowering users and fostering connections between them.==&lt;/p&gt;
&lt;p&gt;Discord might seem an unconventional choice for an AI art platform. It&apos;s often messy, noisy, and can be overwhelming for new users. However, this apparent weakness became Midjourney&apos;s strength. ==The platform&apos;s inherent &quot;messiness&quot; allowed for rapid iteration and real-time observation of user behavior.==&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Insight&lt;/strong&gt;: When building something innovative, excessive polish can be counterproductive. The time spent refining can delay crucial learning from user interactions.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;By leveraging Discord, Midjourney could quickly adapt based on user experiments, challenges, and creative breakthroughs.&lt;/p&gt;
&lt;h2&gt;Community as a Core Feature&lt;/h2&gt;
&lt;p&gt;The true magic of Midjourney&apos;s Discord lies in its ability to transform users into a vibrant community. ==Unlike traditional products where users operate in isolation, Midjourney&apos;s Discord makes sharing the default.== Every generated image is visible to the entire community, creating two powerful effects:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A constant stream of inspiration: Users can see not just the tool&apos;s potential but also how others are pushing its boundaries.&lt;/li&gt;
&lt;li&gt;A sense of collective exploration: Users feel part of a movement rather than just consumers of a product.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;The Impact of Community-Driven Growth&lt;/h2&gt;
&lt;p&gt;This approach led to explosive growth. Midjourney&apos;s Discord quickly became one of the largest on the platform, boasting millions of highly engaged users. These weren&apos;t casual visitors; they were dedicated explorers returning daily to push the limits of AI-generated art.&lt;/p&gt;
&lt;h2&gt;Lessons for Platform Builders&lt;/h2&gt;
&lt;p&gt;Midjourney&apos;s success offers valuable lessons for those building new platforms:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;==Prioritize user empowerment and community connection over interface perfection.==&lt;/li&gt;
&lt;li&gt;Embrace the &quot;messiness&quot; of rapid iteration and learning.&lt;/li&gt;
&lt;li&gt;Make sharing and collaboration core features, not afterthoughts.&lt;/li&gt;
&lt;li&gt;Create an environment that fosters collective exploration and creativity.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Navigating the Challenges of Success&lt;/h2&gt;
&lt;p&gt;Of course, this approach isn&apos;t without its challenges. As Midjourney grows, it faces issues like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Potential overwhelm for new users&lt;/li&gt;
&lt;li&gt;Difficulties in maintaining focus amidst constant activity&lt;/li&gt;
&lt;li&gt;Increased moderation needs for a large, active community&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;However, these are ultimately positive problems stemming from success rather than failure.&lt;/p&gt;
&lt;h2&gt;Embracing Community-Driven Innovation&lt;/h2&gt;
&lt;p&gt;Midjourney&apos;s story isn&apos;t just about AI-generated art. It&apos;s about creating a space where millions can come together to explore, create, and inspire each other. For anyone building community-driven platforms, the lesson is clear: focus on empowering users and fostering connections. The results may surprise you.&lt;/p&gt;
&lt;p&gt;==In the end, Midjourney&apos;s phenomenal growth demonstrates that in the age of AI, the most powerful scaling factor isn&apos;t just technology—it&apos;s human creativity unleashed at scale.==&lt;/p&gt;
</content:encoded><enclosure url="https://sharadja.in/discord.png" type="image/png" length="0"/><category>Midjourney</category><category>Discord</category><category>AI art</category><category>community</category><category>software architecture</category><category>social platforms</category><author>Sharad Jain</author></item><item><title>OpenAI API vs Azure OpenAI: Choosing the Right Platform for Your AI Projects</title><link>https://sharadja.in/blog/openai-api-vs-azure-openai-comparison/</link><guid isPermaLink="true">https://sharadja.in/blog/openai-api-vs-azure-openai-comparison/</guid><description>Explore the key differences between using OpenAI&apos;s canonical API and Azure&apos;s OpenAI service, helping you make an informed decision for your AI implementations.</description><pubDate>Mon, 15 Apr 2024 10:00:00 GMT</pubDate><content:encoded>&lt;h1&gt;OpenAI vs Azure OpenAI: A Tale of Two AI Platforms&lt;/h1&gt;
&lt;p&gt;AI is eating the world. And at the forefront of this revolution are two key players: OpenAI and Azure OpenAI. They&apos;re both serving up some seriously powerful language models, but they&apos;re not quite the same dish. Let&apos;s dig in and see what sets them apart.&lt;/p&gt;
&lt;h2&gt;The Lay of the Land&lt;/h2&gt;
&lt;p&gt;OpenAI is the OG here. They&apos;re the ones who cooked up those mind-bending models like GPT-4, DALL-E, and Whisper. If you&apos;re a dev or a business looking to sprinkle some AI magic into your app, their API is your go-to buffet.&lt;/p&gt;
&lt;p&gt;Azure OpenAI, though? That&apos;s a different beast. ==It&apos;s what you get when Microsoft and OpenAI decide to join forces.== Think of it as OpenAI&apos;s models, but wrapped in Microsoft&apos;s cozy Azure blanket. It&apos;s got all those enterprise bells and whistles that big companies drool over.&lt;/p&gt;
&lt;p&gt;So, which one&apos;s right for you? Well, that depends. Let&apos;s break it down.&lt;/p&gt;
&lt;h2&gt;Key Differences&lt;/h2&gt;
&lt;p&gt;Let&apos;s talk about the real differences between Azure OpenAI and OpenAI. It&apos;s not just about features; it&apos;s about what matters to you as a user or a business.&lt;/p&gt;
&lt;h3&gt;1. Security and Compliance&lt;/h3&gt;
&lt;p&gt;Azure OpenAI is the clear winner here. They&apos;ve gone all-in on security:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;They encrypt everything. Your data is locked down tight within your Azure subscription.&lt;/li&gt;
&lt;li&gt;They&apos;ve got all the fancy compliance certifications. SOC2, ISO, HIPAA, CSA STAR - the works.&lt;/li&gt;
&lt;li&gt;They even support private networks. VNETs, private endpoints - you name it.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;==OpenAI cares about privacy too, but they&apos;re not playing in the same league as Azure when it comes to enterprise-grade security.==&lt;/p&gt;
&lt;h3&gt;2. Data Handling and Privacy&lt;/h3&gt;
&lt;p&gt;Here&apos;s the deal:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;With Azure OpenAI, your data stays put. It doesn&apos;t leave the Azure environment, and Microsoft isn&apos;t using it to make their AI smarter.&lt;/li&gt;
&lt;li&gt;OpenAI has good privacy policies, sure. But you&apos;re not getting the same level of control over your data as you would with Azure.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Integration and Ecosystem&lt;/h3&gt;
&lt;p&gt;==If you&apos;re already knee-deep in Microsoft&apos;s world, Azure OpenAI is a no-brainer.== It plays nice with all the other Azure toys. OpenAI&apos;s API is more of a lone wolf. It&apos;s flexible, sure, but it&apos;s not part of a bigger ecosystem.&lt;/p&gt;
&lt;h3&gt;4. Model Deployment and Versioning&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Azure OpenAI: Requires deploying specific model versions, which may need manual updates when newer versions are released.&lt;/li&gt;
&lt;li&gt;OpenAI: Offers model names that always point to the latest version, simplifying updates.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;5. Performance&lt;/h3&gt;
&lt;p&gt;Azure OpenAI is faster. Much faster. In benchmarks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;GPT-3.5-Turbo on Azure responds about 8x quicker than OpenAI.&lt;/li&gt;
&lt;li&gt;GPT-4 on Azure is roughly 3x faster.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;==This matters. A lot. Especially at scale.==&lt;/p&gt;
&lt;h3&gt;6. Accessibility and Approval Process&lt;/h3&gt;
&lt;p&gt;Getting started with these services differs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Azure OpenAI makes you jump through hoops. You apply, you wait, you get approved (maybe).&lt;/li&gt;
&lt;li&gt;OpenAI? Sign up and go. It&apos;s that simple.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;7. Pricing and Billing&lt;/h3&gt;
&lt;p&gt;The base costs are similar, but:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Azure might cost more due to infrastructure overhead.&lt;/li&gt;
&lt;li&gt;But Azure offers better deals for big enterprises. Volume discounts, flexible agreements. That sort of thing.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;8. Features and Customization&lt;/h3&gt;
&lt;p&gt;Azure OpenAI has some tricks up its sleeve:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You can fine-tune certain models. Tailor them to your needs.&lt;/li&gt;
&lt;li&gt;It filters content out of the box. You decide how strict.&lt;/li&gt;
&lt;li&gt;You can query local files right through the API. Handy.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These aren&apos;t game-changers, but they&apos;re nice to have. Especially if you&apos;re doing something specific.&lt;/p&gt;
&lt;h2&gt;Making the Right Choice&lt;/h2&gt;
&lt;p&gt;So which should you choose? It depends on what you&apos;re after.&lt;/p&gt;
&lt;p&gt;==Go with Azure OpenAI if you&apos;re a big company that cares about security.== If you&apos;re already using Azure, it&apos;s a no-brainer. It&apos;s faster too, which matters if you&apos;re doing a lot of AI stuff.&lt;/p&gt;
&lt;p&gt;==Pick OpenAI if you just want to get started quickly.== It&apos;s better for small projects or if you&apos;re just messing around. The API is simpler, and you get the newest models right away.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Both are good, but for different things. Azure OpenAI is for big companies that need serious security and already use Azure. OpenAI is easier to use and works for more people.&lt;/p&gt;
&lt;p&gt;In the end, pick based on what you need. How secure does it need to be? What are you already using? How much control do you want?&lt;/p&gt;
&lt;p&gt;Whichever you choose, you&apos;re getting some pretty amazing AI tech. It&apos;s kind of incredible when you think about it.&lt;/p&gt;
&lt;p&gt;Citations:
[1] &lt;a href=&quot;https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/switching-endpoints&quot;&gt;https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/switching-endpoints&lt;/a&gt;
[2] &lt;a href=&quot;https://www.uscloud.com/blog/the-differences-between-openai-and-microsoft-azure-openai/&quot;&gt;https://www.uscloud.com/blog/the-differences-between-openai-and-microsoft-azure-openai/&lt;/a&gt;
[3] &lt;a href=&quot;https://www.private-ai.com/en/2024/01/09/openai-vs-azure-openai/&quot;&gt;https://www.private-ai.com/en/2024/01/09/openai-vs-azure-openai/&lt;/a&gt;
[4] &lt;a href=&quot;https://learn.microsoft.com/en-us/answers/questions/1190853/azure-openai-service-vs-openai&quot;&gt;https://learn.microsoft.com/en-us/answers/questions/1190853/azure-openai-service-vs-openai&lt;/a&gt;
[5] &lt;a href=&quot;https://drlee.io/openai-vs-azure-openai-a-deep-dive-into-their-differences-26f1107677e?gi=994361760e27&quot;&gt;https://drlee.io/openai-vs-azure-openai-a-deep-dive-into-their-differences-26f1107677e?gi=994361760e27&lt;/a&gt;
[6] &lt;a href=&quot;https://blogs.visigo.com/chriscoulson/azure-openai-service-vs-openai-api/&quot;&gt;https://blogs.visigo.com/chriscoulson/azure-openai-service-vs-openai-api/&lt;/a&gt;
[7] &lt;a href=&quot;https://learn.microsoft.com/en-us/azure/ai-services/openai/overview?WT.mc_id=academic-105485-koreyst&quot;&gt;https://learn.microsoft.com/en-us/azure/ai-services/openai/overview?WT.mc_id=academic-105485-koreyst&lt;/a&gt;
[8] &lt;a href=&quot;https://www.linkedin.com/pulse/exploring-differences-between-openai-azure-which-one-right-rok-rogelj&quot;&gt;https://www.linkedin.com/pulse/exploring-differences-between-openai-azure-which-one-right-rok-rogelj&lt;/a&gt;
[9] &lt;a href=&quot;https://www.advancinganalytics.co.uk/blog/2023/4/24/azure-openai-vs-openai-whats-the-difference&quot;&gt;https://www.advancinganalytics.co.uk/blog/2023/4/24/azure-openai-vs-openai-whats-the-difference&lt;/a&gt;&lt;/p&gt;
</content:encoded><enclosure url="https://sharadja.in/azure.png" type="image/png" length="0"/><category>AI</category><category>OpenAI</category><category>Azure</category><category>API</category><category>cloud services</category><category>machine learning</category><author>Sharad Jain</author></item><item><title>Universal Power of Computation</title><link>https://sharadja.in/blog/universal-computation-llms-future-innovation/</link><guid isPermaLink="true">https://sharadja.in/blog/universal-computation-llms-future-innovation/</guid><description>The universality of computation, embodied in advanced LLMs, is set to revolutionize the landscape of science and innovation.</description><pubDate>Fri, 22 Mar 2024 09:30:00 GMT</pubDate><content:encoded>&lt;p&gt;In his seminal work &quot;The Beginning of Infinity,&quot; philosopher David Deutsch posits that the universality of computation underlies the open-ended nature of human progress. As we stand on the precipice of a new era, dominated by the rapid advancement of large language models (LLMs), this concept takes on a newfound significance for tech entrepreneurs.&lt;/p&gt;
&lt;p&gt;LLMs, with their ability to process and generate human-like text, are quickly evolving into universal tools for computation. ==These AI-powered systems are not merely language models; they are becoming all-purpose problem-solving engines, capable of tackling an ever-expanding range of challenges across industries.==&lt;/p&gt;
&lt;p&gt;For entrepreneurs, this presents an unprecedented opportunity to leverage the power of LLMs to create value in novel ways. By harnessing the universality of computation embodied in these advanced models, startups can develop innovative solutions that transcend traditional boundaries and disrupt established markets.&lt;/p&gt;
&lt;p&gt;One of the key advantages of LLMs is their adaptability. ==Unlike narrow AI systems designed for specific tasks, LLMs can be fine-tuned and applied to a wide variety of domains, from content creation and customer support to research and development.== This versatility allows entrepreneurs to quickly prototype and iterate on ideas, testing hypotheses and refining solutions at an unprecedented pace.&lt;/p&gt;
&lt;p&gt;Moreover, LLMs have the potential to democratize access to cutting-edge computational capabilities. Historically, advanced AI systems have been the domain of large corporations and well-funded research institutions. ==However, as LLMs become more accessible through open-source initiatives and commercial offerings, even small startups and solo entrepreneurs can harness their power to drive innovation.==&lt;/p&gt;
&lt;p&gt;==The universality of computation also implies that the potential applications of LLMs are virtually limitless.== As these models continue to improve in terms of accuracy, reasoning, and domain-specific knowledge, they will enable entrepreneurs to tackle increasingly complex problems and create value in ways we have yet to imagine.&lt;/p&gt;
&lt;p&gt;However, to fully capitalize on the potential of LLMs, entrepreneurs must adopt a mindset of continuous learning and adaptation. The AI landscape is evolving at a breakneck pace, and staying ahead of the curve requires a commitment to staying informed about the latest developments, experimenting with new techniques, and collaborating with experts across disciplines.&lt;/p&gt;
&lt;p&gt;==In conclusion, the universality of computation, as embodied in advanced LLMs, is set to revolutionize the landscape of entrepreneurship and innovation.== By leveraging these universal tools for computation, tech entrepreneurs can develop novel solutions, disrupt markets, and drive the progress of knowledge. As we stand at the beginning of this infinite journey, the possibilities are truly unbounded.&lt;/p&gt;
</content:encoded><enclosure url="https://sharadja.in/llms-entrepreneurs.png" type="image/png" length="0"/><category>LLMs</category><category>AI</category><category>entrepreneurship</category><category>innovation</category><author>Sharad Jain</author></item><item><title>Embracing the Art of Writing: Lessons from Kevin Kelly, Founder of Wired</title><link>https://sharadja.in/blog/kevin-kelly-writing-lessons-art-of-creativity/</link><guid isPermaLink="true">https://sharadja.in/blog/kevin-kelly-writing-lessons-art-of-creativity/</guid><description>Delve into the insightful world of Kevin Kelly, the visionary behind Wired Magazine, as he explores the depths of writing, creativity, and innovation in the digital age.</description><pubDate>Thu, 21 Dec 2023 10:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;The Architecture of Words: Kevin Kelly&apos;s Writing Philosophy&lt;/h2&gt;
&lt;p&gt;Picture a towering skyscraper. Its majesty and elegance are not merely a product of the materials it&apos;s built from, but the invisible structure that binds it all together. This is the lens through which Kevin Kelly, the visionary founder of Wired Magazine, views the art of writing.&lt;/p&gt;
&lt;p&gt;Imagine writing as a towering skyscraper. Its majesty and elegance are not merely a product of the words it&apos;s built from, but the invisible structure that binds it all together. This is how Kevin Kelly, the visionary founder of Wired Magazine, crafts his writing. ==The structure is so seamless that it recedes into the background, allowing the reader to immerse themselves in the ideas being conveyed.== This approach transcends the mere arrangement of words, focusing instead on refining and presenting ideas in their most captivating form.&lt;/p&gt;
&lt;p&gt;Just as an architect designs a building with a specific purpose and audience in mind, Kelly crafts his essays. A clear, guiding structure navigates the reader through a journey of idea exploration. This approach underscores the critical role of thoughtful organization and presentation in the craft of writing.&lt;/p&gt;
&lt;h2&gt;Provoking Thought by Challenging Conventions&lt;/h2&gt;
&lt;p&gt;Kelly&apos;s writing doesn&apos;t just convey information; it sparks curiosity, provokes thought, and stimulates intellectual exploration. A piece of writing that causes a bit of disturbance or discomfort in the reader&apos;s mind is not necessarily a negative thing. Instead, it can be a sign of reaching new depths in thought, of pushing the reader out of their comfort zone and into a space of deeper understanding and reflection.&lt;/p&gt;
&lt;p&gt;This approach is not about being controversial for the sake of it. ==It&apos;s about challenging the status quo, questioning accepted norms, and encouraging readers to think critically.== It&apos;s about using writing as a tool to advance conversations, to push boundaries, and to explore new perspectives. This commitment to challenging conventional thinking is a key aspect of Kelly&apos;s writing philosophy and is reflected in his work.&lt;/p&gt;
&lt;h2&gt;The Mirror of the Self: Writing as a Tool for Self-Discovery&lt;/h2&gt;
&lt;p&gt;Kelly&apos;s philosophy of writing extends beyond the mere arrangement of words. He views writing as a mirror, a tool for self-discovery and introspection. The act of writing, especially when done consistently, can reveal the deepest layers of the self.&lt;/p&gt;
&lt;p&gt;Imagine staring at a blank white page, pushing yourself to fill it with words, not just once or twice, but hundreds of times. The first few attempts might be filled with pretense, a carefully crafted facade. But as you continue, as you reach your 200th piece, the truth begins to seep through. The act of writing, in its arduous and relentless nature, peels away the layers of pretense.&lt;/p&gt;
&lt;p&gt;==The blank white page becomes a mirror, reflecting who you truly are.== It answers questions like: What do you actually want? Who do you actually want to be? It&apos;s a process of self-discovery, a journey towards authenticity.&lt;/p&gt;
&lt;p&gt;==Kelly believes that every person who has been writing for a few years is closer to their true self, having shed the fake and manufactured mask they used to wear.== This transformation is not just about becoming a better writer, but about becoming a more authentic individual.&lt;/p&gt;
&lt;p&gt;In essence, Kelly&apos;s philosophy of writing is a call to writers to be true to themselves. It&apos;s a call to embrace the challenge of originality, to dare to be different, and to have the courage to express one&apos;s unique perspective and voice. This commitment to honesty and originality is a key aspect of Kelly&apos;s writing philosophy and is reflected in his own work, which is renowned for its insightful, thought-provoking, and often unconventional perspectives.&lt;/p&gt;
&lt;h2&gt;Writing as a Path to Self-Discovery and Personal Growth&lt;/h2&gt;
&lt;p&gt;Despite his initial reluctance as a writer, Kelly has come to recognize and appreciate writing as a powerful tool for self-discovery and personal growth. He describes the act of writing as a deeply introspective process, a journey of exploration where he uncovers and articulates his deepest thoughts, beliefs, and convictions. It&apos;s through this process that he discovers what he truly thinks about a subject, often surprising himself with the depth and complexity of his own understanding.&lt;/p&gt;
&lt;p&gt;But it&apos;s not just about understanding oneself. Kelly believes that writing also allows him to communicate his insights and perspectives to others in a way that is both meaningful and impactful. It&apos;s a way for him to share his unique worldview, to contribute to the collective wisdom of humanity, and to engage in the larger conversation of ideas. This transformative journey, from the initial spark of an idea to the final crafted piece of writing, is what elevates writing from a mere task to a profound, enriching experience.&lt;/p&gt;
&lt;h2&gt;The Art and Importance of Editing&lt;/h2&gt;
&lt;p&gt;Kelly places a significant emphasis on the editing process. He doesn&apos;t see it as a mere final touch or a cursory review before publishing. Instead, he views it as an integral part of the creative process, a vital component that is as important as the initial act of writing itself.&lt;/p&gt;
&lt;p&gt;In his perspective, editing is a continuous dance between writing and revising. It&apos;s about meticulously going through each sentence, each word, and ensuring that they contribute meaningfully to the narrative. It&apos;s about refining the raw material of thoughts and ideas, shaping them into a coherent and compelling piece of writing.&lt;/p&gt;
&lt;p&gt;This approach to editing reflects Kelly&apos;s belief in the power of refinement in the creative process. ==It&apos;s not just about getting the words down on paper, but about polishing them, honing them until they shine with clarity and precision.== It&apos;s about making sure that every word, every sentence, every paragraph serves a purpose and adds value to the overall narrative.&lt;/p&gt;
&lt;p&gt;In essence, for Kelly, editing is not just a task to be completed, but an art to be mastered. It&apos;s a crucial part of the writing process that requires as much thought, effort, and creativity as the initial act of writing itself.&lt;/p&gt;
&lt;h2&gt;The Value of Practical and Helpful Writing&lt;/h2&gt;
&lt;p&gt;Kelly notes that his writing is more successful when it veers towards being practical and helpful. This perspective shows his commitment to creating content that is not only intellectually stimulating but also of tangible value to the reader.&lt;/p&gt;
&lt;h2&gt;The Role of Technology and Digital Platforms&lt;/h2&gt;
&lt;p&gt;Discussing his work with Wired Magazine, Kelly reflects on the impact of technology on writing and culture. He observes how platforms like YouTube have become hotbeds of creativity and innovation, where ideas are continuously exchanged and evolved. This recognition of digital platforms&apos; role underlines the importance of adapting to new media in creative work.&lt;/p&gt;
&lt;h2&gt;The Concept of a Thousand True Fans&lt;/h2&gt;
&lt;p&gt;Kelly talks about the importance of cultivating a dedicated fan base, citing his concept of a &quot;Thousand True Fans.&quot; This idea has influenced how creators and artists approach their work and audience, emphasizing the value of a loyal community over a vast but passive audience. A popular example of this concept in action is the YouTube personality Mr Beast. His dedicated fan base, though numbering in the millions, exhibits the kind of loyalty and engagement that Kelly&apos;s &quot;Thousand True Fans&quot; concept describes. Mr Beast&apos;s success is a testament to the power of this approach in the digital age.&lt;/p&gt;
&lt;h2&gt;Balancing Creativity and Marketing&lt;/h2&gt;
&lt;p&gt;Kelly acknowledges the essential role of marketing in the creative process. ==He argues that in today&apos;s landscape, creating great work is only half the battle; the other half is effectively marketing it.== This insight is crucial for modern writers and creators who must navigate both the creative and commercial aspects of their work.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Kevin Kelly’s insights offer a unique perspective on writing and creativity, blending honesty, originality, and practicality. His reflections provide valuable guidance for anyone looking to deepen their understanding of the creative process and the evolving role of technology and media in shaping our cultural and intellectual landscapes.&lt;/p&gt;
</content:encoded><enclosure url="https://sharadja.in/kevin-kelly.webp" type="image/png" length="0"/><category>writing</category><category>prompt-engineering</category><category>Kevin Kelly</category><author>Sharad Jain</author></item><item><title>Business Success through Dreamers and Doers: A TED Talk Analysis</title><link>https://sharadja.in/blog/business-success-through-dreamers-and-doers/</link><guid isPermaLink="true">https://sharadja.in/blog/business-success-through-dreamers-and-doers/</guid><description>Dive into the crucial role of &apos;zero-to-one&apos; dreamers and &apos;one-to-n&apos; doers in driving business growth and innovation, as explained by Beth Viner in her TED Talk. Discover strategies for bridging the gap between these two types of individuals to maximize organizational success.</description><pubDate>Thu, 21 Dec 2023 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Hello, everyone!&lt;/p&gt;
&lt;p&gt;Welcome, fellow explorers! Today, we embark on a journey into the heart of organizations, where we&apos;ll unravel the intricate dance between &apos;dreamers&apos; and &apos;doers.&apos; Our quest is not merely academic; we seek actionable wisdom to apply in our own realms.&lt;/p&gt;
&lt;h3&gt;The Dreamer-Doer Dynamic&lt;/h3&gt;
&lt;p&gt;In the heart of every successful organization, there lies a vibrant diversity. This diversity is not confined to the traditional parameters of demographics, but it extends to the very essence of how individuals perceive and interact with the world around them. It is a diversity of thought, of approach, and of work styles. This rich tapestry of diversity is beautifully epitomized in the existence of two distinct types of individuals. ==On one end of the spectrum, we have the &apos;zero-to-one&apos; humans, the dreamers.== These are the visionaries, the pioneers who dare to imagine what has not yet been conceived. ==On the other end, we find their &apos;one-to-n&apos; counterparts, the doers.== These are the pragmatists, the builders who take the dreamers&apos; visions and transform them into tangible realities.&lt;/p&gt;
&lt;h4&gt;Who are the Dreamers?&lt;/h4&gt;
&lt;p&gt;Let&apos;s take a moment to talk about the dreamers. Who are they, you ask? Well, imagine someone who can look at the world and see not just what is, but what could be. These are the visionaries, the ones who dare to dream of things that don&apos;t yet exist. They&apos;re the founders, the inventors, the ones who thrive in the face of ambiguity and aren&apos;t afraid to seek out-of-the-box solutions. They&apos;re the ones who look at a problem and see not a roadblock, but a challenge to be overcome. But it&apos;s not always smooth sailing for these dreamers. Their unique approach to reality, while exciting, can sometimes create tension in environments that are more traditionally structured. But don&apos;t worry, we&apos;ll get to that later.&lt;/p&gt;
&lt;h4&gt;And the Doers?&lt;/h4&gt;
&lt;p&gt;Doers, on the other hand, are the builders. They transform visions into tangible successes. They are the glue of the organization, maintaining its stability and ensuring long-term success. While they might not always grasp the dreamers&apos; vision initially, their role in realizing these dreams is indispensable.&lt;/p&gt;
&lt;h3&gt;Bridging the Gap&lt;/h3&gt;
&lt;p&gt;The real magic happens when these two groups come together. But how do we bridgethis gap effectively?&lt;/p&gt;
&lt;h4&gt;Buy-in Through Incentivization&lt;/h4&gt;
&lt;p&gt;A key strategy is aligning interests through incentives. For instance, in the case of Lola, a feminine health reproductive company, the founders realized the need to expand beyond a single product. To achieve this, they had to integrate doers with retail experience into their dreamer-heavy team. The solution? Tying their compensation to the company&apos;s overall performance, ensuring a vested interest in each other&apos;s success.&lt;/p&gt;
&lt;h4&gt;Creating a Corporate Mosh Pit&lt;/h4&gt;
&lt;p&gt;It&apos;s also about creating spaces where dreamers and doers can coexist and collaborate effectively. This &apos;corporate mosh pit&apos; acknowledges and utilizes the strengths of both. At Marriott, for example, the dreamer team working on future experiences collaborates closely with doers, like franchisees and frontline staff, to prototype and refine new concepts.&lt;/p&gt;
&lt;h4&gt;Building Speed Bumps, Not Roadblocks&lt;/h4&gt;
&lt;p&gt;It&apos;s crucial to implement processes that guide, not hinder. While dreamers might chafe at too many restrictions, a complete lack of structure can alienate doers. ==The key is to create &apos;speed bumps&apos; - processes that provide enough structure to maintain alignment without stifling innovation.==&lt;/p&gt;
&lt;h3&gt;Celebrating All Wins&lt;/h3&gt;
&lt;p&gt;Finally, it&apos;s essential to celebrate the achievements of both dreamers and doers. ==Every success, whether from a dreamer&apos;s innovation or a doer&apos;s execution, is a win for the organization and should be recognized as such.==&lt;/p&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;==The synergy between dreamers and doers is not just beneficial but necessary for the growth and innovation of any organization.== ==By understanding and valuing the contributions of both, creating environments for effective collaboration, and aligning incentives, we can harness the full potential of our teams.==&lt;/p&gt;
&lt;p&gt;Thank you for taking the time to read this post. I hope it inspires you to look at your team dynamics differently and find new ways to innovate and succeed together!&lt;/p&gt;
</content:encoded><enclosure url="https://sharadja.in/yin-and-yang-2.webp" type="image/png" length="0"/><category>business-growth</category><category>innovation</category><category>teamwork</category><category>TED-Talk-analysis</category><author>Sharad Jain</author></item><item><title>The Future of Large Language Models: A Deep Dive</title><link>https://sharadja.in/blog/future-of-large-language-models-deep-dive/</link><guid isPermaLink="true">https://sharadja.in/blog/future-of-large-language-models-deep-dive/</guid><description>Exploring the potential and challenges of large language models in the evolving landscape of artificial intelligence.</description><pubDate>Wed, 10 May 2023 11:39:36 GMT</pubDate><content:encoded>&lt;h2&gt;Part 1&lt;/h2&gt;
&lt;p&gt;Think of an app that&apos;s like a parrot. But this isn&apos;t any ordinary parrot. It doesn&apos;t just copy what it hears. Instead, it can come up with a wide range of clear and well-structured responses. And instead of eating bird food, it&apos;s been &apos;fed&apos; with all the different types of information on the internet. This is what Large Language Models (LLMs) like ChatGPT or GPT-4 are about.&lt;/p&gt;

&lt;p&gt;These LLMs work using two main parts: a parameters file and a run file. Think of the parameters file as the parrot&apos;s memory of phrases it&apos;s learned. It holds the &apos;weights&apos; of the neural network, which are like the strength of connections in the parrot&apos;s brain. ==These weights are learned during training on a vast corpus of internet data and are the core of what the model knows - they&apos;re like the digital food that our parrot eats, sourced from the entire internet.==&lt;/p&gt;
&lt;p&gt;The run file is akin to the parrot&apos;s voice box. It takes the &apos;weights&apos; or learned phrases from the parameters file and uses them to form sentences. This file sets up the structure of the neural network, which is like the parrot&apos;s brain. It then performs a forward pass of the network, which is a process of turning inputs (like prompts or questions) into outputs (the parrot&apos;s responses). In essence, it&apos;s the part of the system that allows our digital parrot to &apos;speak&apos;.&lt;/p&gt;
&lt;p&gt;Training these digital parrots is no small feat. It&apos;s akin to teaching a real parrot to understand and mimic human language, but on a much larger scale. This process requires a high-performance computer, known as a GPU cluster, capable of processing and learning from the vast expanse of the internet. It&apos;s a time-consuming and costly process, but the result is a model capable of generating articulate and coherent responses to a wide array of prompts.&lt;/p&gt;
&lt;p&gt;==The magic of these LLMs lies in their ability to predict the next word in a sequence.== ==This seemingly simple task is actually a powerful learning mechanism.== ==It forces the model to understand the context of the sentence, the subject of the conversation, the tone of the writing, and much more.== It&apos;s not just predicting a word; it&apos;s understanding language and the world it describes.&lt;/p&gt;
&lt;p&gt;In essence, LLMs are like digital parrots that have read the entire internet. ==They don&apos;t understand the world in the way humans do, but they can mimic human-like text based on the patterns they&apos;ve learned.== This makes them powerful tools for a wide range of tasks, from writing articles to answering questions, and much more.&lt;/p&gt;
</content:encoded><enclosure url="https://sharadja.in/training-corpus.png" type="image/png" length="0"/><category>AI</category><category>GPT-4</category><category>technology</category><category>security</category><author>Sharad Jain</author></item><item><title>Mastering Marketing Attribution in the age of AI</title><link>https://sharadja.in/blog/mastering-marketing-attribution-in-gpt4/</link><guid isPermaLink="true">https://sharadja.in/blog/mastering-marketing-attribution-in-gpt4/</guid><description>A comprehensive guide on mastering marketing attribution using Snowplow and GPT-4</description><pubDate>Thu, 29 Dec 2022 11:39:36 GMT</pubDate><content:encoded>&lt;p&gt;==Crafting a robust marketing strategy across diverse channels and platforms, each with their unique and intricate user journeys, is a complex task.== Yet, assessing the effectiveness of your marketing expenditure and calculating its ROI should be straightforward.&lt;/p&gt;
&lt;p&gt;Grasping how your marketing expenditure contributes to conversions and sales is vital. It empowers you to continually adapt your strategy to evolving customer preferences and market conditions. While numerous commercial solutions exist to help you understand your marketing campaigns&apos; performance, they often suffer from two main drawbacks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;==They are siloed.== For instance, Facebook provides an overview of your performance on their platform, but comparing that to the performance of other channels can be challenging.&lt;/li&gt;
&lt;li&gt;==They are one-size-fits-all black boxes that remove flexibility and control.== For example, Google Analytics attribution can only attribute one channel to each session.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;==This is where GPT (Generative Pretrained Transformer) and other Language Models (LLMs) come into play.== ==They can help early-stage marketers leverage scientific methods to measure marketing effectiveness.==&lt;/p&gt;
&lt;h2&gt;Modeling Sessions&lt;/h2&gt;
&lt;p&gt;The Snowplow JavaScript tracker also captures a session ID with each event, the domain_sessionid, as well as a session index. The session ID is used to model sessions in Snowplow’s web data model. This SQL model aggregates out-of-the-box page views and page pings into a set of derived tables: page_views, sessions, and users. These tables have one row per page view ID, session ID (i.e., the domain_sessionid), or user ID (i.e., the domain_userid).&lt;/p&gt;
&lt;p&gt;The Snowplow JavaScript tracker also captures a session ID with each event, the &lt;code&gt;domain_sessionid&lt;/code&gt;, as well as a session index. The session cookie is set against the same domain as the &lt;code&gt;domain_userid&lt;/code&gt; cookie (a first-party cookie set against the domain the tracking is on). By default, it expires after 30 minutes of inactivity, but a different interval can be picked in the &lt;a href=&quot;https://github.com/snowplow/snowplow/wiki/1-General-parameters-for-the-Javascript-tracker#session-cookie-duration&quot;&gt;tracker initialization&lt;/a&gt; (i.e. &lt;code&gt;sessionCookieTimeout: 3600&lt;/code&gt; ).&lt;/p&gt;
&lt;p&gt;The session ID is used to model sessions in Snowplow’s &lt;a href=&quot;https://github.com/snowplow/snowplow-web-data-model#web-data-model&quot;&gt;web data model&lt;/a&gt;. This SQL model aggregates out-of-the-box page views and page pings into a set of derived tables: page_views, sessions and users. These tables have one row per page view ID (as captured in the &lt;a href=&quot;https://github.com/snowplow/snowplow/wiki/1-General-parameters-for-the-Javascript-tracker#webPage&quot;&gt;web page context&lt;/a&gt;), session ID (i.e. the &lt;code&gt;domain_sessionid&lt;/code&gt;) or user ID (i.e. the &lt;code&gt;domain_userid&lt;/code&gt;).&lt;/p&gt;
&lt;h2&gt;Adding Marketing Costs (Google Ads Example)&lt;/h2&gt;
&lt;p&gt;If marketing costs are pulled into the data warehouse (using an ETL tool such as Stitch), they can be added to the sessions table based on the marketing parameters. For example, if the Google click and keyword performance reports are available, the average cost per click can be added to sessions that originated from a paid Google search using the marketing click ID:&lt;/p&gt;
&lt;p&gt;`
CREATE TABLE .ad_kw_click_perf
DISTKEY(gclickid)
SORTKEY(gclickid)
AS
(
WITH click_perf AS
(
SELECT
cpr.googleclickid AS gclickid,
cpr.day::date as date,
cpr.adgroupid AS adgid,
cpr.keywordid AS kwid
FROM .click_performance_report AS cpr
WHERE cpr.googleclickid IS NOT NULL
GROUP BY 1,2,3,4
)
SELECT
cp.gclickid,
kpr.keywordid as kw_id,
kpr.keyword as kw,
kpr.adgroup as ad_group,
kpr.adgroupid as ad_gid,
kpr.adgroupstate as ad_g_state,
kpr.campaign as campaign,
kpr.campaignid as camp_id,
kpr.campaignstate as camp_state,
kpr.customerid as cust_id,
kpr.clicks as clicks,
kpr.impressions as impressions,
cast((kpr.cost::float/1000000::float) as numeric(38,6)) as total_cost,
cast((kpr.avgcpc::float/1000000::float) as numeric(38,6)) as avg_cpc,
kpr.day::date as date
FROM .keywords_performance_report AS kpr
INNER JOIN click_perf AS cp
ON kpr.keywordid = cp.kwid
AND kpr.day::date = cp.date
AND kpr.adgroupid = cp.adgid
GROUP BY 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
);&lt;/p&gt;
&lt;p&gt;`&lt;/p&gt;
&lt;h2&gt;Next Steps&lt;/h2&gt;
&lt;p&gt;Once you have developed an understanding of what channels drive customers to your digital products, you can proceed with defining what activities you want to attribute - whether its newsletter signups, pdf downloads, product purchases, subscriptions, etc. This information can also be added to sessions as additional metrics, or flags. The resulting table can then be used as the basis for your various attribution models.&lt;/p&gt;
&lt;h2&gt;Marketing Attribution with GPT and Other LLMs&lt;/h2&gt;
&lt;p&gt;Data is driving more high-stakes decisions across companies and industries, and marketing strategies are no exception. As your channel mix and user journeys grow more complex, it becomes less likely that siloed or one-size-fits-all commercial tools will deliver what you need to attribute and optimize your marketing spend accurately.&lt;/p&gt;
&lt;p&gt;Attributing credit to different events in the journey provides evidence of what is and isn&apos;t working. ==But without being able to take charge of your data to choose the attribution logic that reflects your customers&apos; journeys (and their touchpoints), you cannot truly understand the real return on your investment.== ==With GPT and other LLMs, you have that flexibility and control.==&lt;/p&gt;
</content:encoded><enclosure url="https://sharadja.in/marketing-attribution.png" type="image/png" length="0"/><category>marketing</category><category>attribution</category><category>GPT-4</category><category>Snowplow</category><author>Sharad Jain</author></item></channel></rss>