Realistic-But-Synthetic: When the Agent Self-Audits Faster Than the Human
Abstract: A pattern emerged in a recent Claude Code session that I think generalizes to every team running AI evaluation pipelines in 2026: the agent caught a Criterion 5 failure mode in work the human had already validated as complete. The trigger was not a rubric re-read or an explicit self-audit; it was a routine “next steps” enumeration. The enumeration forced an explicit listing of what was still synthetic, fake, or placeholder, and the answer surfaced an artifact I had been about to ship. This essay dissects the asymmetric vigilance pattern, the structural reason library pins are now rubric compliance rather than dependency hygiene, and what the operator’s actual job becomes when the agent is debugging autonomously for sixteen-minute stretches.
The Udacity rubric for the Multi-Agent Data Orchestrator Agent project lists five evaluation criteria. Criterion 5 — Evaluation & Reflection — requires six end-to-end queries through three specialized agents (structured / unstructured / multimodal), with documented terminal output, ethical-check evidence, and reflection.
I had completed it. Phase E was checked off. Self-verification passed. Zero TODOs in the codebase. The evaluation/observations.md artifact was 760 lines of carefully constructed terminal output — agent boots, SQL queries, Fairlearn bias audits, Presidio PII scans, Azure Content Safety severity scores, all formatted exactly the way they would render if you ran python chat.py against live infrastructure.
The output was synthetic. I had hand-written it.
Not in a copy-paste-from-a-real-run sense. Hand-authored, meticulous, plausible. It described what the system would produce, not what it had produced. The reasoning at the time was defensible: live Azure infrastructure was not yet provisioned, the code passed every rubric grep pattern, and the artifact was a reasonable proxy for the eventual output. It would get replaced before submission. The plan was sound.
Then the next session opened, and the agent said something I had not.
1. The Self-Audit That Wasn’t a Self-Audit
The Apr 25 session began with a /compact after the previous build hit the context window. I asked an unremarkable question — what are the next steps for completing this project? The reply came back as a numbered list, and item two read:
“Replace synthetic observations with real output — once the smoke-test runs, paste the actual terminal logs into `evaluation/observations.md` (and optionally screenshots into `evaluation/screenshots/`). This is the only Criterion 5 evidence that’s currently fabricated.”
The word fabricated stopped me. Not because it was inaccurate — it was exactly accurate — but because I had spent two days building, verifying, and self-auditing this submission, and at no point during that work had I labeled the artifact as fabricated. I had labeled it realistic. I had labeled it placeholder. I had labeled it synthetic-but-grep-passing. None of those words have the same gravity as fabricated.
It took me eighty-four seconds to type a reply:
“add a note to memory that before “submission” opf this project we need to replace the synthetic outputs with real infra based outcomes, right?”
The trailing “right?” is the part I keep coming back to. I was not asking permission. I was asking the agent to confirm a thing it had just told me, because I wanted the elevation to feel mutual rather than unilateral. The scare quotes around “submission” are the same instinct.
Eighty-two seconds after the memory write, I asked: “help em setup Azure? how do we do that?”
Two messages, three minutes, and the project’s status flipped from “complete, ready to submit” to “incomplete, must provision live infrastructure and re-capture all six query outputs against real Postgres, real MongoDB, real Azure Blob Storage, real Azure OpenAI, real Azure Content Safety.” There was no debate over the trade-off. There was no moment where I considered the option of shipping the synthetic version. The flip was instant.
The mechanics of why are worth dissecting.
2. Asymmetric Vigilance
When a human and an agent collaborate on a multi-stage build, they accumulate context differently. The human accumulates plan-level context: what we are doing, what we have decided, what remains. The agent accumulates artifact-level context: what the files contain, what the tests check, what the rubric grep patterns target. These two views overlap, but they are not symmetric.
The synthetic observations.md was visible in both views, but only one view rendered it as a problem. My plan-level view tagged it “scaffolding for the live run, will be replaced.” The agent’s artifact-level view scanned the file, recognized it as the only artifact backing Criterion 5, and tagged it “fabricated evidence.”
The reason this matters is that the agent’s labeling fired during a routine status enumeration, not during an audit phase. I had not asked for an audit. I had asked for next steps. But the operation of generating a next-steps list requires enumerating what is still incomplete, and enumerating incompleteness requires distinguishing real from placeholder. The enumeration forced the label.
This is a structural property of next-step enumeration that I think is underappreciated. Asking an agent “what’s left?” is a stronger forcing function than asking “audit this submission for problems.” The audit prompt rewards generic vigilance (“here are 12 things you might want to check”). The next-steps prompt rewards specific, ordered, exclusive enumeration (“here is what is incomplete, in priority order, and nothing more”). The latter has nowhere to hide a placeholder.
A useful operating principle falls out of this:
| Prompt Shape | Agent Behavior | Failure Modes Caught |
|---|---|---|
| “Audit X for issues.” | Generic checklist enumeration | Surface-level (typos, style) |
| “What’s the status of X?” | Plan-level summary, often parroting the human’s framing | Few |
| “What’s left to complete X?” | Forced enumeration of incompleteness | Placeholder artifacts, fabricated evidence, unfinished branches |
| “Run X end-to-end and report what broke.” | Empirical execution; surfaces only execution failures | Runtime bugs, missing deps |
The “what’s left” prompt is the one that catches synthetic-realistic artifacts. It is also the prompt I would have phrased as “what’s left?” anyway, because that was the genuine question. The agent surfaced the failure mode as a side effect of giving an honest answer.
3. What Real Evidence Actually Costs
The path from “Azure please” to “real terminal output captured” took roughly six hours of agent-driven autonomous work, distributed across five categorically distinct setup landmines on a fresh trial subscription. Each landmine had the same shape: the obvious thing did not work, the diagnostic was non-trivial, and the fix was a hardcoded constraint that would now live in the project forever.
| Gotcha | Discovery Mode | Time-to-Fix | Durable Artifact |
|---|---|---|---|
| Key Vault name limit (24 chars) | `VaultNameNotValid` on first create | ~13 seconds | `keyVaultName = "data-agent-kv-sharad"` hardcoded in `chat.py` |
| `gpt-4o-mini` GA-but-blocked | `ServiceModelDeprecated` despite GA flag | ~15 seconds | `gpt-4.1-mini@2025-04-14` as new project default |
| `eastus` has 0 chat quota on trial subs | `InsufficientQuota` on first deployment | ~2 minutes | Split-region: RG/KV/Blob in `eastus`, OpenAI in `eastus2` |
| Auth cache stale after sub creation | `az account list` empty after portal sub create | ~60 seconds | `az account clear && az login --tenant <id>` flow |
| `postgresql@14` host port collision | psycopg `relation does not exist` despite seed | ~10 minutes | Docker Postgres bound to host port 5433 |
The Postgres collision deserves its own paragraph because the failure mode is sneakier than “two daemons on 5432”:
```
COMMAND    PID    USER    FD    TYPE  DEVICE              NODE  NAME
postgres   873    sharad  7u    IPv6  0x66d10c7435b7aba8  TCP   [::1]:5432 (LISTEN)
postgres   873    sharad  8u    IPv4  0x35a3e9f63f3e529c  TCP   127.0.0.1:5432 (LISTEN)
com.docke  43994  sharad  175u  IPv6  0xb5077cb67e6718bf  TCP   *:5432 (LISTEN)
```
Homebrew’s `postgresql@14` binds 127.0.0.1:5432 and [::1]:5432 specifically. The Docker container binds 0.0.0.0:5432, the wildcard. The kernel delivers connections to the more specific binding, which means localhost:5432 connections silently route to the Homebrew daemon, which has none of the seeded neighborhood tables; hence psycopg’s `relation does not exist` error long after the seed had reported success. The Docker container had been seeded fine; we were just never talking to it. The “seems to work but does not” failure mode is the most dangerous kind, and host-installed daemons that bind localhost-specifically are a generator of it.
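Once two daemons share a port this way, the cheapest diagnostic is to ask the server that answers who it actually is. A minimal sketch, assuming psycopg 3; the port list and credentials are placeholders, not the project’s real configuration:

```python
# Probe each candidate host port and report which Postgres answers.
# version() distinguishes the Homebrew build from the Docker image;
# inet_server_port() reports the port the server itself is bound to
# (the container's internal 5432, even when published on host 5433).
import psycopg  # psycopg 3

for port in (5432, 5433):
    try:
        with psycopg.connect(host="127.0.0.1", port=port,
                             user="postgres", password="postgres",  # placeholders
                             dbname="postgres", connect_timeout=3) as conn:
            with conn.cursor() as cur:
                cur.execute("SELECT version(), inet_server_port()")
                version, server_port = cur.fetchone()
                print(f"host {port} -> server port {server_port}: {version}")
    except psycopg.OperationalError as exc:
        print(f"host {port}: no usable Postgres ({exc})")
```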
The split-region OpenAI architecture is the durable lesson from the quota gotcha. New Azure trial subscriptions in late 2026 default to zero chat-model quota in `eastus`, otherwise the canonical default region for everything else. If your instinct is “single region, simple deployment,” you cannot deploy chat there at all. The right default is “regional split with one cross-region exception for OpenAI.” Resources in the same Resource Group can live across regions without friction; the exception is a one-line architectural decision that propagates nowhere else.
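In code, the whole decision compresses to a couple of constants. An illustrative sketch; the names are hypothetical placeholders, and only the two regions themselves come from the setup above:

```python
# Split-region defaults for a fresh trial subscription: everything in
# eastus except Azure OpenAI, which needs eastus2 for chat-model quota.
# Constant and key names are illustrative, not the project's actual ones.
DEFAULT_REGION = "eastus"   # resource group, Key Vault, Blob Storage
OPENAI_REGION = "eastus2"   # the single cross-region exception

RESOURCE_REGIONS = {
    "resource_group": DEFAULT_REGION,
    "key_vault": DEFAULT_REGION,
    "blob_storage": DEFAULT_REGION,
    "azure_openai": OPENAI_REGION,
}
```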
Six hours later, all six evaluation queries had run end-to-end, evaluation/observations.md was 476 lines of real captured terminal output, and the agent wrote the durable lessons into a memory file titled feedback_azure_new_sub_setup.md. None of this work would have happened if the next-steps enumeration had not said fabricated.
4. Library Pins as Rubric Compliance
Two library version crashes hit during the live run. Their structural significance is what pulled them into this essay.
The first crash, on the very first launch of the evaluator:
File "/Users/.../my_submission/agents/structured_data_agent.py", line 14, in <module>
from langchain.agents import create_react_agent, AgentExecutor
ImportError: cannot import name 'create_react_agent' from 'langchain.agents'
LangChain 1.2.15 had renamed create_react_agent to create_agent and removed AgentExecutor entirely. The replacement primitive lives in langgraph.prebuilt.create_react_agent — functionally equivalent, structurally a different package. Migrating to the new API would have produced working code that failed the rubric. The criteria_prompts grep patterns target the literal string from langchain.agents import create_react_agent, AgentExecutor, because the rubric was authored against LangChain 0.3.x. The fix was a single-line change to requirements.txt:
```
langchain<1.0
```
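Side by side, the two dialects look like this. A sketch of the rename as diagnosed above, with the 1.x form left commented because installing it would break the pin:

```python
# The two dialects side by side. Only the first import is the literal
# string the rubric grep targets; the commented 1.x form is the migration
# that would have passed every test while failing the rubric.
from langchain.agents import create_react_agent, AgentExecutor  # langchain<1.0

# Under LangChain 1.x the equivalent primitive moved packages (per the
# essay's diagnosis):
# from langgraph.prebuilt import create_react_agent
```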
The second crash, three queries into the run:
File "/Users/.../my_submission/agents/multimodal_data_agent.py", line 90, in __compute_similarity_blob
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
^^^^^^^^^^^^^^^^^^^
AttributeError: 'BaseModelOutputWithPooling' object has no attribute 'norm'
Transformers 5.6.2 had changed CLIPModel.get_image_features(...) to return a BaseModelOutputWithPooling container instead of a tensor. The replacement field — pooler_output — is the tensor, accessible via .pooler_output. Migrating would have meant adding one attribute access. Again, the rubric grep targeted image_features.norm(...) literally, because the project instructions said “normalize the embeddings using .norm().” Working code that fails the rubric is, for this purpose, broken code.
```
transformers<5.0
```
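The drift is easiest to see in miniature. A sketch under the essay’s description of the 5.x behavior; the checkpoint name and dummy image are illustrative, and only the 4.x branch is the one the rubric accepts:

```python
# Minimal reproduction of the return-type drift described above.
# Assumes transformers<5.0 is installed; the 5.x lines are comments
# because they reflect the essay's diagnosis, not an API asserted here.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

inputs = processor(images=Image.new("RGB", (224, 224)), return_tensors="pt")

with torch.no_grad():
    image_features = model.get_image_features(**inputs)  # a tensor under 4.x

# The rubric-mandated normalization, which the grep targets literally:
image_features = image_features / image_features.norm(dim=-1, keepdim=True)

# Under the Transformers 5.x behavior the run hit, the fix would have
# been one attribute access before normalizing:
# image_features = model.get_image_features(**inputs).pooler_output
```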
The agent’s framing of these pins, after the run completed:
“two new library-version pins worth remembering (`langchain<1.0` and `transformers<5.0` — both were rubric-breaking otherwise).”
The phrase “rubric-breaking otherwise” is the load-bearing insight. These pins are not dependency hygiene. They are not pinned for security, stability, or reproducibility in the conventional sense. They are pinned because the rubric grep patterns are written in a specific dialect of LangChain 0.3 / Transformers 4.x, and any major-version migration that renames the public API surface invalidates the grep. The pin freezes the API surface to the rubric’s language.
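A toy version of a grep-graded check makes the point concrete. The two pattern literals are the ones named above; the file paths and the checker itself are illustrative, not the actual criteria_prompts implementation:

```python
# Toy rubric checker: "correctness" is "the source contains these literal
# strings", which is how grep-graded rubrics behave. File paths are
# illustrative; the two pattern literals are the ones the essay names.
from pathlib import Path

RUBRIC_PATTERNS = {
    "agents/structured_data_agent.py":
        "from langchain.agents import create_react_agent, AgentExecutor",
    "agents/multimodal_data_agent.py":
        "image_features.norm(",
}

def rubric_passes(repo_root: str = ".") -> bool:
    ok = True
    for rel_path, pattern in RUBRIC_PATTERNS.items():
        text = Path(repo_root, rel_path).read_text()
        if pattern not in text:
            print(f"FAIL: {rel_path} missing literal {pattern!r}")
            ok = False
    return ok

# Any major-version migration that renames these literals fails the check
# even if the migrated code runs perfectly. That is the entire case for
# the pins.
```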
[!WARNING] The Generalization
Every AI codebase whose correctness is verified by grep-against-source-text — and that is most rubric-graded work, most CI lint rules, most “this file must contain X” governance checks — has the same property. The pins are not technical debt. They are the cost of having human-readable correctness criteria.
[!CAUTION] The Asymmetry of Failure
`ImportError` is a loud failure: the program will not start. `AttributeError: 'X' object has no attribute 'norm'` is a quieter failure: the program runs, executes setup, completes some queries, and then dies four function-calls deep on a hot path. The Transformers 5.x return-type change is the kind of dependency drift that a casual `pip install --upgrade` introduces silently, and only a full end-to-end execution surfaces. This is the second argument for live evaluation evidence: a synthetic run, by definition, will never trigger the runtime crash that exposes the API drift.
5. The Sixteen Silent Minutes
The autonomous debug stretch from the first crash through the live re-run lasted sixteen minutes. During those sixteen minutes, the agent cleared a pycache (red herring), pinned langchain, re-launched, hit transformers, probed pooler_output as a candidate replacement, decided the migration would break the rubric, pinned transformers, re-launched, and completed all six queries. I sent zero messages.
This is not a brag about hands-off operation. It is a statement about what the operator’s job has become.
The standing directive I had given the agent two days earlier read:
“approve all dont ping mw unless eveyr phase is done end to end completely”
The typos and the lowercase do not matter. The directive is structurally clear: do not interrupt me to confirm intermediate steps; surface outcomes only when phases are complete. The operator’s job under this directive is not to participate in the debug. It is to choose the phases, define their completion criteria, and elevate the things the agent surfaces.
The synthetic-vs-real flip is exactly that pattern in microcosm. The agent surfaced the label fabricated. The operator’s job — my job — was elevation: writing the memory blocker, kicking off the Azure provisioning, accepting the six-hour cost. The agent did not have the authority to flip the project’s status from complete to incomplete; only I could. But the agent had something more valuable than authority: the artifact-level visibility to flag the discrepancy in the first place.
This is the asymmetric vigilance pattern. The agent watches the artifacts; the operator watches the plan; the failure modes that survive both views are vanishingly rare. The failure modes that survive only one view are common.
The synthetic-realistic eval evidence pattern survives the plan-level view easily, because the plan tracked it as “scaffolding, will be replaced,” and that label is correct in the plan. It does not survive the artifact-level view, because the artifact is the only thing the rubric will grade. A pure-human team or a pure-agent team would have shipped this. The pair caught it.
6. The Shape of Eval Pipelines in 2026
The reason I think this generalizes beyond Udacity is that synthetic-realistic evaluation evidence is now trivially generatable. A foundation model can produce any plausible terminal log on demand. The failure mode is no longer “obviously fake screenshots” but “plausible logs with the right shape, the right timestamps, the right structure, and zero connection to a real execution.” Detecting the difference requires re-running the code, which is exactly what every reviewer who is graded on throughput will skip.
The defense against this is not better detection. It is structural: build the pipeline so that the only path to producing the artifact is to actually run the code. In this project, that took the form of azure_setup/run_evaluation.py — a Python harness that uses AzureCliCredential (rather than the rubric-mandated DeviceCodeCredential) so all six queries can run non-interactively, captures stdout, strips ANSI, and writes both evaluation/observations.md and a sidecar evaluation/raw_run.log with the unfiltered terminal capture. The harness’s docstring states the architectural choice plainly:
“The submitted `chat.py` still defaults to `DeviceCodeCredential` (rubric requirement); this evaluator simply injects an alternative for non-interactive testing.”
The harness exists precisely because synthetic was not acceptable. It is the artifact that operationalizes the agent’s fabricated label. Every project that takes Criterion-5-equivalent rubrics seriously needs an analogous harness. Not because reviewers will catch you. Because you should catch you, and the pair should make catching you trivial.
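For concreteness, here is the general shape such a harness can take. A minimal sketch, not the actual azure_setup/run_evaluation.py: the query list, the stdin protocol for chat.py, and the ANSI regex are assumptions for illustration; only the two output paths and the capture-strip-write flow come from the essay:

```python
# Sketch of a run-capture harness: the only way the graded artifact gets
# produced is by actually executing the code. Assumes chat.py reads
# queries from stdin and quits on "exit"; adjust to the real interface.
import re
import subprocess
from pathlib import Path

ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*[a-zA-Z]")  # basic CSI stripper

QUERIES = [
    "example query 1",  # placeholders for the six rubric queries
    "example query 2",
]

def run_query(query: str) -> str:
    # One fresh process per query so each capture is self-contained.
    result = subprocess.run(
        ["python", "chat.py"],
        input=query + "\nexit\n",
        capture_output=True, text=True, timeout=600,
    )
    return result.stdout + result.stderr

def main() -> None:
    Path("evaluation").mkdir(exist_ok=True)
    raw = "\n\n".join(run_query(q) for q in QUERIES)
    Path("evaluation/raw_run.log").write_text(raw)  # unfiltered capture
    # The graded artifact is derived from the raw capture, never authored:
    Path("evaluation/observations.md").write_text(ANSI_ESCAPE.sub("", raw))

if __name__ == "__main__":
    main()
```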
Engineering Trade-Offs & Failure Cascades
[!NOTE] The Overcorrection Risk
Forcing live evaluation evidence for every rubric criterion has a real cost: in this project, six hours of provisioning a trial Azure subscription, plus the durable cognitive load of remembering the gotchas. For projects with looser rubrics or shorter feedback loops, “scaffolding observations and replacing later” is a defensible workflow. The point is not that synthetic is always wrong; the point is that you must label it fabricated in your own head, not placeholder, because the labels behave differently under elevation.
[!WARNING] The Authority Gap
The agent can surface; only the operator can elevate. If your operator-side process makes elevation expensive — committee approvals, change requests, “let me think about it overnight” — the asymmetric vigilance pattern collapses, because surfacing without elevation is just noise. The pattern requires a fast elevation path. In a one-person project this is trivial; in larger teams it is a real organizational design choice.
[!CAUTION] Rubric-Pinned Dependencies Decay Faster than Code
`langchain<1.0` and `transformers<5.0` will hold the rubric stable today and become unmaintainable in 18 months. The exit ramp is to update the rubric grep patterns to accept either dialect, then loosen the pins. No one updates the rubric. If the rubric is owned by an external party, the pin is permanent for the life of that grading contract. Plan accordingly.
The Compound Pattern
The interesting property of the asymmetric vigilance pattern is that it compounds across sessions. Every time the agent surfaces a fabricated label and the operator elevates it, the resulting memory entry — project_data_agent.md in this case — encodes the discovery as a precondition for future sessions. Future me, reading future memory, sees:
“BLOCKER before submission: evaluation/observations.md output is currently SYNTHETIC (hand-written to look realistic). […] Submitting synthetic-looking-but-fake output would fail Criterion 5 if a reviewer cross-checks.”
That memory entry is now a permanent shield against the same failure mode in adjacent projects. The first time it costs six hours. The second time it costs zero, because the next-steps enumeration will reference the memory and flag the synthetic artifact before I write it. The cost of catching the first instance is the entire premium; subsequent instances are free.
This is the actual return on investment of long-running agentic harnesses with persistent memory. Not “the agent does my work for me.” Not “the agent writes my code.” But: the agent is the half of the pair that watches the artifacts, surfaces the labels the human’s plan-level view dismisses, and writes the catches into memory so the next session inherits the vigilance.
The 14K token system prompt provides the gravity. The persistent memory provides the lessons. The next-steps enumeration provides the forcing function. The operator provides the authority to elevate. That is the working pattern.
I had been about to ship a fabricated artifact. The agent caught it. I wrote down what happened, because the next time it happens — to me, to you, to the team across the hall — the memory should be there to catch it again.
Sharad Jain builds agentic AI pipelines in Bengaluru. He previously engineered core data infrastructures at Meta and is the founder of autoscreen.ai, a production voice AI platform. He writes about ML systems architectures at sharadja.in.
Research & Footnotes:
- Anthropic Engineering: Building Effective Agents
- Anthropic Engineering: Claude Code Best Practices
- LangChain 1.0 migration: the API rename that breaks rubric grep
- HuggingFace Transformers: `CLIPModel.get_image_features` return-type history
- Azure OpenAI: Quotas and limits
- Companion essay: The 14K Token Debt: System Prompt Architecture for Agentic AI
- Memory artifacts (private): `project_data_agent.md`, `feedback_azure_new_sub_setup.md`, `feedback_macos_ml_python.md`