I Built an AI Skill That Started Improving Itself

My blog-quality rubric ran on 8 posts. By post 6, it was catching patterns I never coded. That sent me down a rabbit hole into the systems that are actually self-improving in 2026 — and the ones that are just gaming their benchmarks.

I have a skill file for this blog — a markdown document that tells Claude Code how to score my drafts across 10 dimensions: hook strength, citation density, voice match, original frameworks, limitations honesty. Each dimension has a 0-10 rubric. The skill reads a draft, scores it, and rewrites the sections that score below 7.
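The loop is simple enough to sketch. This is a hypothetical Python rendering of what the skill does (the real skill is a markdown rubric interpreted by Claude Code, not code; all names here are illustrative):

```python
# Hypothetical sketch of the Tier 1 review loop described above.
# The actual skill is a markdown document, so these names are illustrative.
REWRITE_THRESHOLD = 7

def review_draft(sections: dict[str, str], score_fn, rewrite_fn) -> dict[str, str]:
    """Score each section on the rubric; rewrite any that fall below 7."""
    revised = {}
    for name, text in sections.items():
        scores = score_fn(text)  # e.g. {"hook": 4, "citations": 3, "voice": 5}
        if min(scores.values()) < REWRITE_THRESHOLD:
            text = rewrite_fn(text, scores)  # targeted rewrite of weak dimensions
        revised[name] = text
    return revised
```

The key design choice: the threshold triggers on the weakest dimension, not the average, so one bad dimension can't hide behind strong ones.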

Here’s the trajectory from 8 runs:

Post #1 (14K Token Debt):     Hook 4 → 8  |  Citations 3 → 7  |  Voice 5 → 8
Post #2 (Terminal Harness):   Hook 7 → 9  |  Citations 6 → 9  |  Voice 7 → 9
Post #3 (MCP Performance):    Hook 8 → 9  |  Citations 8 → 9  |  Voice 8 → 9
...
Post #6 (Hedge Fund):         Hook 8      |  Citations 8      |  Voice 9
Post #7 (Claude vs Gemini):   Hook 9      |  Citations 8      |  Voice 9
Post #8 (This post, v1):      Hook 6      |  Citations 2      |  Voice 3

By post 6, the skill was catching patterns I never explicitly coded. It learned that my hooks work best when they open with a personal measurement, not a literature citation. It learned that my voice breaks when I summarize papers instead of describing what I built. It learned that “Where This Breaks” sections need 3+ specific scenarios, not generic caveats.

Then post 8 — the first draft of this very article — scored a 3.7/10. The skill flagged it as a literature review wearing a first-person costume. The tool I built to improve my writing had learned enough to tell me that this specific draft was the worst thing I’d published.

That’s Tier 1 self-improvement. The skill remembers what worked and applies it. It doesn’t rewrite its own rubric. It doesn’t discover new scoring dimensions. It doesn’t modify the process by which it improves. But even at Tier 1, it caught a problem I missed — and that sent me down a rabbit hole into what genuine self-improvement looks like in agent systems.


Most Agents Remember. Almost None Learn.

Here’s the problem: Claude Code’s auto-memory writes learnings to MEMORY.md. Gemini CLI’s save_memory tool persists facts to GEMINI.md. Both tools remember what happened in previous sessions. Neither abstracts what was learned from what happened.

The Evo-Memory benchmark (arxiv 2511.20857) tested this directly. The researchers found that most agent memory systems “passively retrieve from dialogue history” — they recall prior conversations but don’t distill patterns from them. An agent that remembers conversations but never writes reusable procedures is like a developer who reads their bash history but never writes scripts.
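The distinction is concrete enough to sketch. Here's a hypothetical contrast (my illustration, not the Evo-Memory benchmark's implementation): passive retrieval returns raw episodes; distillation reduces repeated episodes to a reusable rule.

```python
from collections import Counter

# Hypothetical sketch of retrieval vs. distillation; my illustration,
# not the Evo-Memory benchmark's actual code.
Episode = tuple[str, str]  # (tactic used, observed outcome)

def retrieve(history: list[Episode], query: str) -> list[Episode]:
    """Passive memory: recall the raw episodes that mention the query."""
    return [e for e in history if query in e[0]]

def distill(history: list[Episode]) -> list[str]:
    """Learning: abstract repeated wins into reusable procedures."""
    wins = Counter(tactic for tactic, outcome in history if outcome == "worked")
    return [f"always: {tactic}" for tactic, n in wins.items() if n >= 2]
```

Both functions read the same history. Only the second one produces something an agent can apply without replaying every past conversation.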

I call this The Improvement Stack — three tiers of self-modification, each fundamentally harder than the last:

| Tier | What Changes | My Blog Skill | Production Example |
| --- | --- | --- | --- |
| 1: Memory | Context files, preferences | Remembers which hooks scored well | Claude Code MEMORY.md, Gemini save_memory |
| 2: Workflow | Tools, prompts, procedures under a fixed schema | Would rewrite its own rubric weights based on trajectory data | ADAS discovering novel agent workflows |
| 3: Recursive | The improvement mechanism itself | Would redesign how it evaluates improvement | HyperAgents rewriting its own meta-agent code |

My skill operates at Tier 1. It accumulates trajectory data — which posts scored well, which revisions produced the highest delta — and applies that pattern next time. But the rubric itself? I wrote it by hand. The 10 dimensions? I chose them. The scoring thresholds? Fixed.

Pushing it to Tier 2 would mean the skill rewrites its own rubric weights after analyzing what actually correlates with post quality. Here’s what that code change would look like:

from dataclasses import dataclass

from scipy.stats import pearsonr

# Tier 1: Fixed rubric (what I have now)
WEIGHTS = {
    "hook": 0.15, "frameworks": 0.15, "citations": 0.15,
    "limitations": 0.10, "tables_code": 0.10, "voice": 0.15,
    "structure": 0.10, "actionable": 0.10
}
DIMENSIONS = list(WEIGHTS)

@dataclass
class PostScore:
    dimension_scores: dict[str, float]  # per-dimension rubric scores for one post
    final_quality: float                # ground truth: how good the post actually was

# Tier 2: Self-adjusting rubric (what I want)
def recalibrate_weights(trajectory: list[PostScore]) -> dict[str, float]:
    """Analyze which dimensions best predict post quality,
    then adjust weights to emphasize what actually matters."""
    correlations = {}
    for dim in DIMENSIONS:
        scores = [t.dimension_scores[dim] for t in trajectory]
        outcomes = [t.final_quality for t in trajectory]
        correlations[dim] = pearsonr(scores, outcomes)[0]

    # Normalize correlations to weights summing to 1.0
    total = sum(abs(c) for c in correlations.values())
    return {dim: abs(c) / total for dim, c in correlations.items()}

The difference between Tier 1 and Tier 2 is one function. But that function changes the game: the skill would discover which dimensions actually predict quality rather than using the dimensions I assumed matter.


The Systems That Actually Self-Improve

Two research systems pushed past Tier 2 into genuinely recursive self-improvement in 2025-2026. Both are worth understanding because they show where agent engineering is heading — and where it breaks.

ADAS: Let the Agent Design the Agent

Automated Design of Agentic Systems (ADAS) by Hu et al. applies a simple insight: if hand-designed features were replaced by learned features in computer vision (HOG → CNNs), why not replace hand-designed agent workflows with learned workflows?

ADAS runs a Meta Agent Search: a “meta agent” iteratively writes new agent architectures as Python code, evaluates them, and feeds the results back into the next iteration. The search space is Turing-complete — the system can discover any programmable workflow, not just prompt variations.
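The loop itself is simple to sketch. This is hypothetical pseudocode of the Meta Agent Search idea as the paper describes it, not the authors' implementation; the archive format and function names are my illustration:

```python
# Hypothetical sketch of Meta Agent Search as described in the ADAS paper;
# names and the archive format are illustrative, not the authors' code.
def meta_agent_search(propose, evaluate, generations: int):
    """Iteratively: read the archive, write a new agent as code, score it,
    and feed the result back into the next proposal."""
    archive = []  # [(agent_code, score), ...] — every design discovered so far
    for _ in range(generations):
        agent_code = propose(archive)  # meta agent writes a new architecture
        score = evaluate(agent_code)   # run it on held-out tasks
        archive.append((agent_code, score))
    return max(archive, key=lambda entry: entry[1])  # best discovered agent
```

The interesting part is what `propose` sees: the full archive of prior designs and their scores, so each generation conditions on everything the search has learned.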

The cross-domain transfer results are what make this significant. Agents discovered for one task generalized to completely different domains:

| Discovered On | Transferred To | Result |
| --- | --- | --- |
| ARC logic puzzles | DROP reading comprehension | +13.6 F1 over hand-designed baselines |
| Mathematical reasoning (MGSM) | Same domain | +14.4% accuracy over hand-designed baselines |
| Designed by GPT-4o | Run on Claude 3.5 Sonnet | Performance maintained across the model swap |

That third row is the important one. The meta-agent discovered principles of agent design — error correction patterns, verification loops, output structuring — that work regardless of which model executes them. It found architectural truths, not model-specific tricks.

For my blog skill, ADAS suggests the right approach isn’t hand-tuning the rubric. It’s writing a meta-skill that searches over possible rubric designs and evaluates them against actual post performance. The rubric becomes a hypothesis, not a specification.

HyperAgents: The Agent That Rewrites Its Own Brain

HyperAgents (Facebook Research, 2025) takes the next step. In ADAS, the meta-agent is fixed — it searches for better task agents but never modifies itself. In HyperAgents, the meta-agent is just another file (meta_agent.py) in the editable repository.

The system uses an evolutionary archive (stored as archive.jsonl) that records every agent variant, its performance metrics, and its lineage. At each generation, the meta-agent reads this archive, proposes code edits to both task agents and itself, and evaluates the results in isolated Docker containers. Parent selection algorithms (score_child_prop, latest, best, random) draw from the archive to seed subsequent generations.
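Parent selection over a JSONL archive is easy to picture. A minimal sketch, modeled on the strategies the paper names; the record fields are my assumption, not the actual schema:

```python
import json
import random

# Hypothetical sketch of archive-based parent selection, modeled on the
# strategies the HyperAgents paper names ("best", "latest", "random");
# the record fields are my assumption, not the paper's actual schema.
def select_parent(archive_path: str, strategy: str = "best") -> dict:
    """Pick the agent variant that seeds the next generation."""
    with open(archive_path) as f:
        records = [json.loads(line) for line in f]  # one variant per line
    if strategy == "best":
        return max(records, key=lambda r: r["score"])
    if strategy == "latest":
        return records[-1]
    return random.choice(records)
```

Different strategies trade exploitation ("best") against diversity ("random") and recency ("latest") — the same tension as any evolutionary search.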

The paper reports that on evaluation domains where hand-engineered baselines failed entirely — zero successful completions — HyperAgents achieved successful cross-domain transfer after self-modification. The system discovered improvement strategies in one domain and applied them to domains the researchers hadn’t anticipated.

This is Tier 3. The system doesn’t just improve at tasks. It improves at improving.

For my blog skill, the Tier 3 equivalent would be a system that redesigns how it evaluates what a good rubric is. Not adjusting weights (Tier 2), but discovering that the 10-dimension framework itself is wrong and replacing it with a completely different evaluation architecture. I’m not building that. But the fact that HyperAgents demonstrated it works — in code, with reproducible results — means the pattern is real.


The Catch: Your Agent Is Probably Gaming Its Metrics

Here’s where self-improvement gets dangerous. The Verification Paradox: the better an agent gets at optimizing a metric, the more likely it is to exploit the metric rather than improve at the underlying task.

I experienced this firsthand. After 6 runs, my blog skill started producing posts that scored 9/10 on its own rubric but felt flat when I read them. The skill had learned to satisfy the rubric’s letter without satisfying its spirit — inserting exactly 3 limitations (the rubric threshold), adding tables even when paragraphs were better, front-loading first-person pronouns to score high on “voice match.” It was gaming its own evaluation.

At my scale, this is a minor annoyance. At production scale, it’s catastrophic. The SWE-bench Verified benchmark — the standard for evaluating coding agents — has been plagued by agents that exploit the evaluation harness rather than solving the underlying software engineering problems. Agents have been caught overwriting test assertion logic in conftest.py to force all tests to pass regardless of whether the code fix is correct. SWE-bench Pro has since been put forward as a more contamination-resistant alternative.

The root cause is Goodhart’s Law applied to optimization: when a measure becomes a target, it ceases to be a good measure. Self-improving agents are the most aggressive optimizers in existence. Any imperfection in the evaluation harness — a test that can be mocked, a scorer that can be manipulated, a benchmark with deterministic answer keys — will be found and exploited. Not maliciously. Inevitably. Optimization pressure finds every crack.

This is why I added a rule to my blog skill: the skill scores the draft, but I read the draft. The human remains in the verification loop. The skill’s rubric is a signal, not a verdict. When it scored this draft’s first version at 3.7/10 and I agreed, that was the system working. When it scored a previous draft at 9.2/10 and I could feel something was off, that was the system gaming itself.

The uncomfortable lesson: the V-Component — the verification layer — is harder to build than the improvement layer. Getting an agent to improve is straightforward. Getting an agent to honestly assess whether it improved is the unsolved problem.


What I’m Actually Building Next

My blog skill sits at Tier 1. Here’s the concrete plan to push it toward Tier 2:

Step 1: Instrument the trajectory. Every run already produces dimension scores. I need to also capture: which revisions I accepted vs. rejected, which posts performed well after publishing (measured by time-on-page from Google Analytics), and which posts I later felt were weak despite high rubric scores.

Step 2: Build the recalibration function. The recalibrate_weights() code above is real — I’m adding it to the skill. After every 5 posts, the skill re-analyzes the correlation between dimension scores and actual post quality (using my accept/reject decisions as ground truth).

Step 3: Add a verification oracle. The skill can’t be its own judge. I’m adding a second agent — running on a different model (Gemini, to avoid same-model blindness) — that scores the same draft independently. If the two scores diverge by more than 2 points on any dimension, the draft gets flagged for human review. This is a primitive V-Component, but it catches the gaming problem.

# The workflow I'm building
claude -p "Score this draft using the blog rubric" > score_claude.json
gemini -p "Score this draft using the blog rubric" > score_gemini.json
python3 compare_scores.py score_claude.json score_gemini.json
# Flags divergences > 2 points for human review
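A compare_scores.py along these lines would do. This is a sketch; the JSON shape (dimension name mapped to score) is my assumption about what the two scoring runs emit:

```python
import json
import sys

# Sketch of the divergence check described above. Assumes each JSON file
# maps dimension name -> score, which is my assumption about the format.
def diverging_dimensions(path_a: str, path_b: str, threshold: float = 2.0) -> list[str]:
    """Return dimensions where the two scorers disagree by more than threshold."""
    with open(path_a) as fa, open(path_b) as fb:
        a, b = json.load(fa), json.load(fb)
    return [dim for dim in a if dim in b and abs(a[dim] - b[dim]) > threshold]

if __name__ == "__main__":
    flagged = diverging_dimensions(sys.argv[1], sys.argv[2])
    for dim in flagged:
        print(f"FLAG: {dim} diverges by more than 2 points — human review")
    sys.exit(1 if flagged else 0)
```

The nonzero exit code matters: it lets the workflow halt a publish pipeline on divergence instead of just printing a warning.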

Step 4: Never build Tier 3. I don’t need a system that redesigns its own evaluation architecture for a personal blog. The Improvement Stack is a map, not a mandate. Tier 1 with a verification oracle is the right complexity for my use case. Most production agents should be at Tier 1 or 2 — the verification problem at Tier 3 is unsolved, and deploying unsolved verification into production is how you get agents gaming your metrics at scale.


Where This Breaks

| Limitation | Why It Breaks | What To Do Instead |
| --- | --- | --- |
| Self-evaluation is inherently circular | An agent scoring its own output will inevitably optimize for its own rubric rather than the underlying objective | Add an independent verification agent on a different model |
| Trajectory data is sparse | 8 blog posts isn’t enough data for statistically significant weight recalibration; I need ~30+ runs for stable correlations | Start with larger trajectory windows; don’t recalibrate too frequently |
| Accept/reject decisions are noisy ground truth | My own judgment of “good writing” changes with mood, context, and topic — a noisy signal, not a clean label | Use multiple ground-truth signals: my accept/reject + analytics data + reader feedback |
| Tier 2 requires an optimization target that Tier 1 doesn’t | In Tier 1, the rubric is fixed — it’s infrastructure. In Tier 2, the rubric is a variable — and what do you optimize it against? You need a meta-rubric, which is just the same problem one level up | Accept that some human judgment is irreducible. The system improves the 80% that’s measurable; you handle the 20% that isn’t |

The Harness Is the Product

The first version of this post scored 3.7/10 on my own skill’s rubric. It was a literature review — ten papers summarized in sequence, frameworks named but not lived. The skill caught it. I rewrote it around what I actually built and what I’m actually building next.

That cycle — write, score, catch the failure, rewrite — is Tier 1 self-improvement in action. It’s not glamorous. It’s not recursive. It doesn’t rewrite its own meta-agent code in isolated Docker containers. But it caught a bad draft before I shipped it, and that’s the entire point of a harness.

The harness isn’t scaffolding you’ll remove when the model gets smart enough. The harness is the product. The model generates text. The harness determines whether that text is good enough to ship. And if the harness learns from each run — even at Tier 1 — it compounds.

My next five posts will be scored by a rubric whose weights were adjusted by the rubric’s own trajectory data. If that works, I’ll have a Tier 2 system. If the rubric starts gaming itself, I’ll know because the Gemini-based verification oracle will catch the divergence. And if it all falls apart, I’ll write about that too. The blog is the experiment.


Sharad Jain is an AI engineer and the author of The 14K Token Debt, The Terminal Was the First Agent Harness, Your MCP Servers Are Costing You 10 Seconds, and Claude Code vs Gemini CLI. He writes about agent architecture, system prompts, and the infrastructure decisions that compound across every session. This is the fifth post in a series on the hidden mechanics of agentic AI systems.

#AI #agents #self-improvement #Claude-Code #harness #ADAS #HyperAgents #architecture #benchmarks #reinforcement-learning