· 6 min read ·

Before I Run the Next Three Experiments, I'm Pre-Registering Them

Most agent benchmarks publish results without naming what would have falsified them in advance. I'm doing the other thing. Three experiments, three hypotheses, three falsification criteria, in writing, before any data exists. Hold me to it.

Before I Run the Next Three Experiments, I’m Pre-Registering Them

Earlier today I published the synthesis post on long-horizon agents and pre-committed to three follow-up experiments. The polite move would be to run them quietly, publish the wins, and bury the losses.

I am not doing that.

The single highest-leverage move available to a person publishing work in this field is to name what would have falsified the hypothesis before the data exists. Vendor benchmarks do the opposite — they run the experiments, see the results, then write the post. The reader has no way to tell whether the hypothesis was modified to fit the data. This post is the pre-registration. It exists so that the next three posts in this series are constrained by what I claimed before I knew.

If any of the three hypotheses below fail, I publish the failure post. Same series, same voice, same week. That is the deal.


Experiment 1 — Self-Conditioning Replication ± External Memory

H1 (replication). Sinha, Arun, Goel (ICLR 2026) found that per-step execution accuracy degrades as a function of step number on the same task, same model, even when the plan is provided. This should reproduce on Opus 4.7, Sonnet 4.6, and Haiku 4.5.

H2 (mechanism). The degradation is context-mediated. Pulling load-bearing trajectory state out of the autoregressive trace and into an external memory tool — keeping the total context-token budget constant — flattens the per-step accuracy slope. This is the load-bearing claim of the synthesis post.

Falsifies H1: Flat or non-monotonic per-step accuracy on tasks of length ≥ 50, with overlapping 95% CIs between step 5 and step 50.

Falsifies H2: The slope of (per-step accuracy vs step number) is statistically indistinguishable between the memory-tool condition and the inline-trace condition at matched total context length.

A clean replication of H1 with falsification of H2 is the most important result this experiment could produce. It would mean the synthesis post is wrong about the mechanism and I would publish the correction.


Experiment 2 — 7-Day Discontinuous Autonomous Lifecycle Agent

Not a scientific experiment. This is an operational and economic shakedown. I am pre-registering it anyway, because the publication shape matters.

Claim. For a workload with a duty cycle below 30%, a serialize-and-cron lifecycle beats continuous running on total cost without reducing task throughput, and a public 7-day trace will surface failure modes the protocol did not anticipate.

Falsifies the claim: DAL total cost ≥ continuous cost on the same workload over 7 days, OR DAL task completion rate is materially worse than continuous (lower by >5 percentage points at matched workload).

What this experiment cannot show: that DAL closes the Resumption Gap. It does not. It only verifies that a serialize/wake/cron cycle is operationally tractable. If the published post claims more than that, hold me to this paragraph.


Experiment 3 — REM-Sleep Consolidation as Knowledge Distillation

This is the experiment that closes the thesis loop. It is also the one I am most worried about, which is exactly why pre-registration matters here.

H3. A consolidation pass run during dormancy — compressing a raw N-step trajectory into a structured artifact (goal, decisions, open branches, rejected paths) — produces a resumption context that the next-wake agent can use to continue the task correctly at a smaller token budget than either (a) the raw trajectory or (b) a naive summarization at matched budget.

The strong sub-claim — structured consolidation outperforms naive summarization — is what separates this experiment from “you just need RAG.” Lose this and the REM-sleep framing is decoration.

Falsifies H3:

  • Structured consolidation does not outperform raw trajectory at matched token budget, OR
  • Structured consolidation does not outperform naive summarization (this is the critical sub-claim), OR
  • The advantage disappears at trajectory lengths < 50 steps.

The benchmark this needs: the Resumption Benchmark v0 I named in the synthesis post and did not define. The operational definition is in the experiment’s scaffolding now (gold_facts_to_preserve and gold_facts_to_forget, geometric-mean scoring across continuation correctness, preservation recall, and forgetting precision). The benchmark spec is public; the seed episodes (drawn from real Brain trajectories) stay private for the obvious reasons.

This experiment is, structurally, Hinton, Vinyals, and Dean (2015) applied to agent trajectories rather than network outputs. Same operation: a noisy teacher distilled into a structured student. The literature has a 9-year head start on what should work; the question is whether the same intuitions transfer.


The order I’m running them in

The synthesis post said 1 → 2 → 3. I’m running them 1 → 3 → 2.

Experiment 1 is the only one that can falsify the synthesis post’s mechanism. Experiment 3 is conditional on Experiment 1 establishing that there is a context-mediated self-conditioning effect to consolidate out of. Experiment 2 is a week-long operational shakedown that doesn’t add scientific evidence the other two wouldn’t. The reorder is the right scientific call. The original order would have been the right marketing call. I’m choosing the first one.


What I’m pre-committing to in writing

CommitmentWhat it means
Every protocol is public before the data existsThe four PROTOCOL files exist in the repo, with hypotheses and falsification criteria. I will not edit them after seeing results.
Failed hypotheses ship as postsIf H1 reproduces but H2 fails, I publish “the synthesis post was wrong about the mechanism.” If H3 fails, I publish “structured consolidation does not beat naive summarization.” Same series, same voice.
Token-matched controls on every “with vs without memory” comparisonOtherwise I am measuring context size, not the mechanism I claim to be testing.
The Resumption Benchmark spec is published before Experiment 3 runsThe benchmark definition cannot be moved to fit the result.
Cohen’s κ ≥ 0.6 on every LLM-judge scoring componentOr I do not publish the score.

This is the standard the field’s vendor benchmarks routinely fail to meet. It is not a high standard. It is the minimum that lets a reader trust the result.

The next post in this series will be Experiment 1’s pre-registration in action: data first, hypothesis last, result either way.


Series: The 14K Token DebtThe Terminal Was the First Agent HarnessI Built an AI Skill That Started Improving Itself91.55% on LongMemEval, and the Benchmark I’m Building InsteadBrilliant but Amnesiac: The Coherence Cliff → this post. Next: Experiment 1 results.

Read next