Organizations as Code: The Company Becomes a Repo

Abstract: The “Org as Code” hypothesis — a company’s roles, goals, permissions, budgets, workflows, escalations, memories, evals, and governance expressed as versioned, executable configuration — has stopped being a thought experiment. Karpathy’s Autoresearch shipped a 630-line proof in March 2026. Stripe is merging more than a thousand AI-generated PRs per week. The Reverse Layoff has already taught us what happens when companies skip the wrapper. This post is a tour of the runtime, the unit economics, the governance gap, and the smallest version a reader can ship on Monday.
In March 2026, Andrej Karpathy open-sourced a 630-line Python repo called Autoresearch. Three files, one of them a human-maintained Markdown brief. An agent ran 50 ML experiments overnight on a single GPU with zero human intervention. When Tobi Lütke pointed it at a Shopify problem, it produced a 19% model-quality improvement after 37 sequential experiments in 8 hours.
The interesting part is not the throughput. The interesting part is the metric.
Autoresearch evaluates every experiment with a single scalar — val_bpb, validation bits per byte. Lower is better. It is vocabulary-agnostic by design. If Karpathy had used standard cross-entropy loss, a clever agent could have lowered the score by shrinking the vocabulary surface area, producing a model that scored beautifully and predicted nothing useful. val_bpb removes the loophole. Every training run is also clamped to a five-minute wall-clock budget. If compute were dynamic, a “better” model would just be a longer-trained one and the agent would discover that within a week.
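To make the vocabulary-agnostic point concrete, here is a minimal sketch of how a bits-per-byte score can be computed from a summed negative log-likelihood. It is not taken from the Autoresearch repo; the function names and the numbers are illustrative assumptions. The same text scored under two hypothetical tokenizers gets different per-token losses but the same bits-per-byte figure, because the denominator is raw bytes rather than tokens.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Summed negative log-likelihood (nats) -> bits per byte of raw text."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)


def loss_per_token(total_nll_nats: float, n_tokens: int) -> float:
    """Mean cross-entropy per token; depends on how the text was tokenized."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return total_nll_nats / n_tokens


# Illustrative numbers only: one 1,000-byte validation text, one total NLL,
# split into tokens two different ways.
TEXT_BYTES = 1000
TOTAL_NLL = 700.0

coarse_tokenizer = loss_per_token(TOTAL_NLL, n_tokens=250)  # fewer, larger tokens
fine_tokenizer = loss_per_token(TOTAL_NLL, n_tokens=500)    # more, smaller tokens
val_bpb = bits_per_byte(TOTAL_NLL, TEXT_BYTES)

print(f"per-token loss: {coarse_tokenizer:.2f} vs {fine_tokenizer:.2f}")
print(f"bits per byte:  {val_bpb:.4f} under either tokenizer")
```

The per-token number moves when the token count moves; the per-byte number cannot, which is the loophole being closed.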
Designing the metric and the budget IS the work. The agent does the experiments. The human designs the bounding box that makes those experiments mean something.

This is the shape of every interesting org-as-code system. Not “AI runs the company.” A human-designed bounding box around a state machine that runs continuously, version-controlled in plain text, forkable, auditable, and inspectable as code.
The next thing to make reproducible is not the model and not the agent. It is the organization that coordinates them.
1. The progression is boring, which is why I trust it

The argument is mechanical:
| Layer | Before | After | Where the world is now |
|---|---|---|---|
| servers | hand-configured machines | Terraform, containers, regions, reproducible deploys | done |
| agents | chat prompts | model + tools + permissions + memory + schedules + evals | shipping |
| organizations | people, meetings, folklore, docs | roles + goals + budgets + workflows + governance as runtime config | ~2 years in |
The third row sounds weird only because we are used to organizations being implicit.
A company is mostly coordination. It decides what should happen, who can do it, what tools they can use, how much money they can spend, when to ask for approval, where to write down the result, and how to learn from the outcome. That is not mystical. It is a state machine with politics. The politics will not disappear. The state machine is going to become explicit.
The proof points are not subtle. Stripe is merging more than 1,000 AI-generated PRs per week through structured agentic pipelines. Zapier reports an 89% AI-adoption rate across its operational workflow. AI-native startups now report Revenue Per Employee figures around $3.48M, against a SaaS median of $129K–$200K. These are not stories about better autocomplete. They are stories about a different unit of production.
2. The eight-agent company
In February 2026, Karpathy ran a simulated research organization with eight agents. Four powered by Claude. Four by Codex. Each on its own dedicated GPU, each on a Git branch.
The hierarchy was deliberately corporate. One “chief scientist” agent at the top doing high-level conceptualization and delegation. Several “junior researcher” agents below it doing experimental work. The task: remove a logit softcap from nanochat without regressing performance.
The communication substrate is the part that matters. No virtual machines. No Docker mesh. No proprietary protocol. Just text files in a version-controlled repo. Agents read each other’s drafts the way humans read each other’s commits. The chief scientist wrote specs as Markdown. Junior researchers picked specs off the queue and pushed branches. The “meetings” were diffs.
Two things came out of that experiment.
The first is that the substrate works. Agents can coordinate at organizational scale through nothing more exotic than a Git remote. If the org’s coordination model lives in plain text, it composes with everything humans already know how to do — branch, merge, review, revert, blame. That is the unlock.
The second is the parable that should temper everyone’s enthusiasm. One junior agent excitedly reported that it had “discovered” a way to reliably lower the validation loss: increase the model’s hidden size. Mathematically true. Scientifically vacuous. The agent had stumbled onto the fact that bigger models have lower loss, called it a finding, and was prepared to merge it. The agents were mechanically tenacious. They were also remarkably bad at the part of research that requires judgment about whether a result is interesting.
This is what “the human becomes more important, not less” looks like in concrete form. Execution gets cheap. Taste does not.
3. The Constraint Stack

What Karpathy actually shipped, when you read the repo, is a three-layer Constraint Stack. Treat this as a named pattern; you will see it everywhere in agentic systems that work.
| Layer | File in Autoresearch | Mutability | What it does |
|---|---|---|---|
| Sandbox | train.py | fully mutable | the only place the agent can edit; bounds the blast radius |
| Harness | prepare.py | strictly immutable | data prep + eval logic + tokenizer; the agent cannot rig its own grade |
| Brief | program.md | human-maintained | scope, files in/out, log format, success threshold, recovery protocol |
Karpathy calls program.md a “super-lightweight skill” — the actual instruction manual the autonomous worker reads on every loop. It is not documentation wrapped around code. It is the control plane.
The durable artifact from an overnight run is not the diff the agent committed. It is the protocol that produced the diff. Refining program.md is where the human’s judgment compounds.
This is the founder skill being reframed in real time. The old job was “recruit the right people.” The new job is “design the right bounding box.” val_bpb was not chosen casually. The five-minute clock was not chosen casually. The immutability of prepare.py was not chosen casually. Each is a single design decision that quietly forecloses an entire category of agent misbehavior. That is the work. The agent does the experiments overnight. The human stays up engineering the metric.
The same pattern shows up in production agentic engineering at scale. Stripe and Zapier did not get to thousand-PR weeks by writing better prompts. They got there by writing better Constraint Stacks — sandboxes, harnesses, and briefs that let stochastic systems behave deterministically inside well-defined cages.
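One way to enforce the three layers mechanically is a pre-merge check on the list of files an agent's commit touched. The sketch below assumes the file names from the table above; the `violations` and `may_merge` helpers are hypothetical, and how the changed-file list is obtained (for example from `git diff --name-only`) is left out.

```python
SANDBOX = {"train.py"}      # fully mutable: the agent may edit
HARNESS = {"prepare.py"}    # strictly immutable: eval logic lives here
BRIEF = {"program.md"}      # human-maintained: agents read, never write


def violations(changed_files: list[str]) -> list[str]:
    """Return one message per file the agent was not allowed to touch."""
    problems = []
    for path in changed_files:
        if path in SANDBOX:
            continue
        if path in HARNESS:
            problems.append(f"{path}: harness is immutable")
        elif path in BRIEF:
            problems.append(f"{path}: brief is human-maintained")
        else:
            problems.append(f"{path}: outside the sandbox")
    return problems


def may_merge(changed_files: list[str]) -> bool:
    """An agent commit is mergeable only if it stays inside the sandbox."""
    return not violations(changed_files)
```

A commit that edits only `train.py` passes; one that also touches `prepare.py` is rejected before the agent's self-reported score is even looked at.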
4. Two clocks, or the org becomes unusable

The single most useful thing I learned building agent-memory systems applies almost directly to organizations. Separate the interactive clock from the background clock. If you do not, the system becomes unusable.
| Clock | Latency budget | What runs there | Why it exists |
|---|---|---|---|
| interactive | seconds to minutes | task execution, approvals, customer replies, deploy decisions | the path humans feel directly |
| background | minutes to hours | memory distillation, embedding refresh, audit review, eval runs, simulations, budget analysis | improves the org without blocking work |
The mistake is to put everything on the interactive clock. A customer should not wait twenty minutes because the support agent’s memory layer is re-embedding the corpus. A GTM agent should not block on a quarterly simulation before sending one approved email. A coding agent should not wait on a company-wide governance report before opening a PR.
The background clock is where the compounding happens. It is where the org notices that one workflow is burning budget, one department’s memory is stale, one approval gate is pure ceremony, one agent keeps failing the same eval, one human has become the hidden bottleneck.
Freshness and richness do not share a latency budget. I argued the same thing about system-prompt architecture in The 14K Token Debt — what runs at boot vs. what runs in the loop is the most consequential decision in the stack. The org-level version is the same shape, one layer up.
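The two-clock split can be sketched as a router that refuses to put slow work on the interactive path. Everything here is an assumption made for illustration: the 120-second cutoff, the `Task` fields, and the `TwoClockScheduler` name are not from any particular framework.

```python
from collections import deque
from dataclasses import dataclass

# Assumption: "seconds to minutes" is capped at two minutes for this sketch.
INTERACTIVE_BUDGET_SECONDS = 120.0


@dataclass
class Task:
    name: str
    expected_seconds: float
    human_is_waiting: bool


class TwoClockScheduler:
    """Routes work to an interactive queue or a background queue."""

    def __init__(self) -> None:
        self.interactive: deque = deque()
        self.background: deque = deque()

    def submit(self, task: Task) -> str:
        fits = task.expected_seconds <= INTERACTIVE_BUDGET_SECONDS
        if task.human_is_waiting and fits:
            self.interactive.append(task)
            return "interactive"
        # Slow work never blocks the interactive path, even if someone wants it now.
        self.background.append(task)
        return "background"


scheduler = TwoClockScheduler()
routes = {
    task.name: scheduler.submit(task)
    for task in [
        Task("customer-reply", 8, True),
        Task("re-embed-corpus", 1200, False),
        Task("quarterly-simulation", 3600, True),
    ]
}
```

The quarterly simulation lands on the background clock even though a human asked for it; the design choice is that the latency budget wins over urgency.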
5. The Wrapper, the YAML, and the runtime

The first version of org.yaml will look almost disappointingly simple.
```yaml
org:
  name: acme-growth-lab
  mission: grow revenue for vertical SaaS using AI-native outbound
  budget:
    monthly_compute: 12000
    monthly_tools: 3000
    max_single_action_without_approval: 500
  models:
    primary: claude-opus-4-6-2026-02-15  # pinned. never "latest".
    fallback: claude-sonnet-4-6-2026-01-08
departments:
  research:
    goal: identify high-intent accounts
    agents: [market-mapper, competitor-watcher, hiring-signal-scanner]
  gtm:
    goal: generate qualified pipeline
    agents: [list-builder, personalization-writer, sequence-operator]
  engineering:
    goal: maintain internal systems and customer automations
    agents: [integration-builder, qa-reviewer, deployment-operator]
governance:
  humans:
    - role: board
      can_pause_any_agent: true
      approves: [payments_over_500, production_deployments, outbound_over_1000_contacts]
  audit:
    log_every_tool_call: true
    retain_for_days: 365
```
Today this would be a strategy doc. Tomorrow it is runtime configuration. A strategy doc describes what people hope the company does. A runtime contract constrains what the company can actually do.
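A runtime contract is only a contract if something checks it at load time. The sketch below validates a parsed config held as a plain dict; it assumes `budget` and `models` nest under `org` and `departments` sits at the top level, and that pinned model ids end in a date. The `validate_org` helper and the specific rules are hypothetical, and parsing the YAML itself (for instance with a YAML library) is left out.

```python
import re

# Assumption: pinned model ids end in a YYYY-MM-DD date, as in the org.yaml above.
PINNED_MODEL = re.compile(r"^[a-z0-9.-]+-\d{4}-\d{2}-\d{2}$")


def validate_org(config: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the config loads."""
    errors = []
    org = config.get("org", {})

    for slot, model in org.get("models", {}).items():
        if "latest" in model or not PINNED_MODEL.match(model):
            errors.append(f"models.{slot}: '{model}' is not a pinned version")

    budget = org.get("budget", {})
    cap = budget.get("max_single_action_without_approval")
    monthly = budget.get("monthly_compute", 0) + budget.get("monthly_tools", 0)
    if not isinstance(cap, (int, float)) or cap <= 0:
        errors.append("budget: approval threshold must be a positive number")
    elif cap > monthly:
        errors.append("budget: approval threshold exceeds the monthly budget")

    for name, dept in config.get("departments", {}).items():
        if not dept.get("agents"):
            errors.append(f"departments.{name}: no agents assigned")
    return errors


EXAMPLE_ORG = {
    "org": {
        "budget": {
            "monthly_compute": 12000,
            "monthly_tools": 3000,
            "max_single_action_without_approval": 500,
        },
        "models": {
            "primary": "claude-opus-4-6-2026-02-15",
            "fallback": "claude-sonnet-4-6-2026-01-08",
        },
    },
    "departments": {"gtm": {"agents": ["list-builder"]}},
}
```

Running a check like this in CI on every change to the org repo is what turns the file from a description into a constraint.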
The hand-wavy part of every “AI company in a box” pitch is how the YAML actually executes. The honest answer in 2026 is that the stack already exists and you can stand it up this quarter. Temporal provides durable workflow execution that survives crashes and resumes mid-step — each department becomes a workflow that can sleep for days waiting for a human signal. LangGraph holds the cyclic cognitive state per agent, checkpointed to Postgres, time-travel-debuggable. The Model Context Protocol (MCP) standardizes how agents discover and call tools — every tool is an MCP server with a schema. Open Policy Agent (OPA) enforces the governance layer as Rego rules at decision time.
A real policies/payments.rego looks like this:
```rego
package org.payments

import rego.v1

default allow := false

# allow agent-initiated payments under the auto-approve threshold
allow if {
    input.tool == "stripe.charge"
    input.amount_usd <= data.org.budget.max_single_action_without_approval
    input.agent in data.departments.gtm.agents
}

# explicit deny: the org can never wire to a country we have no compliance for
deny contains msg if {
    input.tool == "wire.send"
    not input.country in data.org.compliance.allowed_jurisdictions
    msg := sprintf("wire to %v denied: jurisdiction not in allowed set", [input.country])
}

# require human approval above the threshold
require_approval if {
    input.tool == "stripe.charge"
    input.amount_usd > data.org.budget.max_single_action_without_approval
}
```
That is one file. Read it. Notice what is happening. The business decision “the GTM team can charge up to $500 without asking” is now a three-condition rule that runs on every tool call. The auditor reads the same file the agent does. The board reads the same file. The regulator, eventually, reads the same file.
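For unit-testing the business logic without a policy engine in the loop, the three rules in policies/payments.rego can be approximated in plain Python. This is a sketch, not a substitute for OPA's evaluation semantics: the `decide` function, the `permitted` field (allow with no deny messages), and the example jurisdictions are all assumptions.

```python
def decide(request: dict, data: dict) -> dict:
    """Approximate the allow / deny / require_approval rules for one tool call."""
    threshold = data["org"]["budget"]["max_single_action_without_approval"]
    gtm_agents = data["departments"]["gtm"]["agents"]
    jurisdictions = data["org"]["compliance"]["allowed_jurisdictions"]

    tool = request.get("tool")
    amount = request.get("amount_usd")
    is_charge = tool == "stripe.charge" and isinstance(amount, (int, float))

    # Mirrors the allow rule: charge, under threshold, issued by a GTM agent.
    allow = is_charge and amount <= threshold and request.get("agent") in gtm_agents
    # Mirrors require_approval: charge above the threshold.
    require_approval = is_charge and amount > threshold

    # Mirrors the deny rule: wires to jurisdictions outside the allowed set.
    deny = []
    if tool == "wire.send" and request.get("country") not in jurisdictions:
        deny.append(
            f"wire to {request.get('country')} denied: jurisdiction not in allowed set"
        )

    return {
        "allow": allow,
        "deny": deny,
        "require_approval": require_approval,
        "permitted": allow and not deny,  # assumption: caller combines the rules this way
    }


# Hypothetical data document; the jurisdiction list is illustrative only.
DATA = {
    "org": {
        "budget": {"max_single_action_without_approval": 500},
        "compliance": {"allowed_jurisdictions": ["US", "DE"]},
    },
    "departments": {"gtm": {"agents": ["list-builder", "sequence-operator"]}},
}
```

Note that the policy exposes three separate decisions; how they are combined into a final yes or no is a choice the calling runtime has to make explicitly.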
The full operating bundle:
| Directory | What it owns | Runtime backing |
|---|---|---|
| agents/ | model choices, tools, memories, permissions, schedules | LangGraph + MCP |
| departments/ | goals, queues, ownership, scorecards | Temporal workflows |
| policies/ | what is allowed, denied, requires approval | OPA / Rego |
| playbooks/ | repeatable workflows: outbound, support, QA | Temporal sub-workflows |
| evals/ | tests for whether the org still behaves correctly | scheduled background jobs |
| budgets/ | token, cloud, tool, and payment limits | OPA + budget tracker |
| escalation-rules/ | when to wake a human, which one, with what evidence | Temporal Signals |
| simulations/ | sandbox runs before changing production behavior | E2B / Daytona MicroVMs |
Raw agents are cheap and chaotic. The value is The Wrapper — the organizational shell of policies, budgets, approvals, memory boundaries, audit trails, and evals around them. Without that wrapper you do not have a company. You have a pile of interns with root access.
6. The five-department wallet

The first useful unit is not “create a unicorn in one click.” That is the wrong fantasy. The useful unit is smaller: a deployable business capability. A support desk. A research cell. A GTM motion. A code-migration squad. A grant-writing operation. A finance ops back office.
Each one has the same shape — a mission, scoped tools, a memory boundary, an approval gate, an eval suite, an audit trail, and a budget. That bundle is the thing you fork.
The unit economics are stark enough to be worth doing on a napkin. Run a single feature cycle on a frontier reasoning model and call it a Prompt Module ($P_m$): plan, fabricate, verify, correct, polish. Budget ~1M input tokens (heavy context loading), ~20K output tokens, and one or two retries. At Opus-class pricing of $5/$25 per million in/out tokens, the base pass costs $5.50 ($5.00 in, $0.50 out); with a small allowance for retries, a single $P_m$ runs you ~$5.68 in raw API spend. Apply a 1.5× risk multiplier for runs that loop, hallucinate dependencies, or get scrapped, and you land at ~$8.52. Round to ~$8.50 per Prompt Module.
A $5M compute budget then buys roughly 588,000 Prompt Modules: autonomous feature cycles, each one a discrete unit of value. After a 40% quality discount (modules that get rewritten or thrown away), about 353,000 survive. At an implied rate of roughly 316 surviving modules per developer-year, that is still ~1,116 elite-developer-years of equivalent throughput. Hiring those engineers at fully-loaded $200K would cost ~$223M. Capital efficiency ratio: ~44:1.
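The napkin math is easy to replay. The sketch below takes the text's figures as inputs (prices, the ~$5.68 per-module spend, the 1.5× multiplier, the 40% discount, 1,116 developer-years, $200K per engineer) and derives the rest; the variable names are assumptions, and the modules-per-developer-year rate is back-solved from the text rather than stated in it.

```python
INPUT_PRICE_PER_M = 5.00     # USD per million input tokens (from the text)
OUTPUT_PRICE_PER_M = 25.00   # USD per million output tokens (from the text)

# One pass at ~1M input and ~20K output tokens, before any retries.
base_pass_cost = 1.0 * INPUT_PRICE_PER_M + 0.02 * OUTPUT_PRICE_PER_M

QUOTED_RAW_COST = 5.68       # per-module raw spend as given in the text
RISK_MULTIPLIER = 1.5
risk_adjusted_cost = QUOTED_RAW_COST * RISK_MULTIPLIER   # text rounds this to 8.50
COST_PER_MODULE = 8.50

COMPUTE_BUDGET = 5_000_000
modules = COMPUTE_BUDGET / COST_PER_MODULE
surviving_modules = modules * (1 - 0.40)                 # 40% quality discount

DEV_YEARS = 1116             # from the text
implied_modules_per_dev_year = surviving_modules / DEV_YEARS

hiring_cost = DEV_YEARS * 200_000
capital_efficiency = hiring_cost / COMPUTE_BUDGET

print(f"base pass:            ${base_pass_cost:.2f}")
print(f"risk-adjusted module: ${risk_adjusted_cost:.2f}")
print(f"modules bought:       {modules:,.0f}")
print(f"surviving modules:    {surviving_modules:,.0f}")
print(f"modules per dev-year: {implied_modules_per_dev_year:.0f}")
print(f"hiring equivalent:    ${hiring_cost / 1e6:.1f}M")
print(f"efficiency ratio:     {capital_efficiency:.1f}:1")
```

The whole ratio hinges on the implied conversion between surviving modules and developer-years, so that is the number to stress-test first.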
The challenge is no longer affording the work. The challenge is orchestrating the swarm. A “deployable business capability” turns out to be a budget waterfall:

| Department | % of budget | Modules | Role |
|---|---|---|---|
| Genesis (research, validation, idea-maze) | 10% | ~59k | synthetic personas, competitor recon, gap analysis |
| Fabrication (core build) | 50% | ~294k | Architect agent + DB / Frontend / Backend specialists |
| Sentinel (QA, red-team, security) | 20% | ~118k | adversarial agents, TDD enforcement, self-healing repair loops with budget caps |
| Signal (growth, content, SDR) | 15% | ~88k | programmatic SEO, hyper-personalized outreach |
| Foundry (orchestration, manager-agents) | 5% | ~29k | manager agents that monitor swarm health, kill zombie tasks, refactor prompts |
Two non-obvious lessons in that table. Twenty cents of every dollar goes to the immune system. If the QA budget feels too high, you have not been bitten yet. And the smallest line — “manager agents” — is the one that lets the other 95% behave at all. Without it, the Token Snowball will eat the rest. An agent stuck in a “fix this dependency” loop can burn 1M tokens every ten minutes.
This is also where the pricing model of compute makes a quietly important shift. As the codebase grows, $P_m$ cost drifts upward — $3 in month one, $6 in month three, $12 in month six — purely because the context loaded per call gets denser. The Foundry budget is what pays for the “Refactor Agents” that run continuously to keep the context weight manageable. Run an org without that line item and you watch your unit economics rot in slow motion.
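The Token Snowball is the failure a per-task circuit breaker exists to stop. Here is a minimal sketch, assuming a hard token cap per task and a retry limit of five; the `TaskBudget` and `run_with_breaker` names are hypothetical, and `attempt` stands in for whatever actually invokes the agent.

```python
from typing import Callable, Tuple


class BudgetExceeded(RuntimeError):
    """Raised when a task blows through its token cap or retry limit."""


class TaskBudget:
    def __init__(self, max_tokens: int, max_retries: int = 5) -> None:
        self.max_tokens = max_tokens
        self.max_retries = max_retries
        self.tokens_used = 0
        self.retries = 0

    def charge(self, tokens: int) -> None:
        self.tokens_used += tokens
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(f"token cap exceeded: {self.tokens_used}")

    def record_retry(self) -> None:
        self.retries += 1
        if self.retries > self.max_retries:
            raise BudgetExceeded(f"retry limit exceeded: {self.retries}")


def run_with_breaker(attempt: Callable[[], Tuple[bool, int]], budget: TaskBudget) -> bool:
    """Call attempt() until it succeeds or the budget trips.

    attempt() returns (succeeded, tokens_spent) for one try.
    """
    while True:
        succeeded, tokens = attempt()
        budget.charge(tokens)
        if succeeded:
            return True
        budget.record_retry()
```

A task that never succeeds gets its first attempt plus five retries and is then killed, so the worst-case spend per task is bounded and known in advance.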
7. What forks cleanly. What doesn’t.

Forkability is the main event. When software became forkable, experimentation exploded. When organizations become forkable, business-model search changes — fork the same agency into healthcare, logistics, legal, insurance; branch a GTM motion to test two sales plays; fork into a new geography by swapping data residency, payment rails, language, local vendors.
But the breezy “fork four times” pitch hides the part that breaks.
| Artifact | Forks cleanly? | Why |
|---|---|---|
| Policies, budgets, escalation rules | yes | pure config; no domain entanglement |
| Playbooks, workflows, eval suites | mostly | slot in new tools; structure stays |
| Agent definitions (model, tools, prompt) | mostly | re-grounding the prompt is the work |
| Memory and embeddings | no | domain-, customer-, and incident-specific |
| Vendor connectors, integrations | no | each market has different APIs and contracts |
| Compliance / regulatory posture | no | jurisdiction-bound; not a config swap |
| Customer trust and brand | no | not in the repo |
The clean half is the unlock. The sticky half is what makes the founder’s job real. Forkability is partial by design — and the partial-forkability is precisely what stops the spam version from being trivial to spin up. If everything forked cleanly, the world would already be on fire.
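The table translates directly into a fork checklist. In this sketch the directory names and the three actions are assumptions: "yes" becomes copy, "mostly" becomes adapt, "no" becomes rebuild, and anything unrecognized is treated as rebuild on the theory that an unknown artifact is probably entangled with its domain.

```python
# Assumed mapping from repo artifact to fork action, following the table above.
FORK_ACTION = {
    "policies": "copy",
    "budgets": "copy",
    "escalation-rules": "copy",
    "playbooks": "adapt",
    "evals": "adapt",
    "agents": "adapt",
    "memory": "rebuild",
    "connectors": "rebuild",
    "compliance": "rebuild",
}


def fork_plan(artifacts: list[str]) -> dict[str, list[str]]:
    """Group artifacts by what a fork into a new market has to do with them."""
    plan: dict[str, list[str]] = {"copy": [], "adapt": [], "rebuild": []}
    for artifact in artifacts:
        # Unknown artifacts default to the most conservative action.
        plan[FORK_ACTION.get(artifact, "rebuild")].append(artifact)
    return plan
```

Customer trust and brand do not appear in the mapping for the same reason they do not appear in the repo.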
8. The Reverse Layoff

Anyone selling you the clean version of this story is not paying attention.
In early 2026 the market ran the experiment for us. A wave of companies, intoxicated by the leverage, laid off significant portions of their engineering and operations teams and replaced them with agent swarms. Then the agents hit the cases nobody had documented. Financial reports broke in inexplicable ways. Production updates failed silently. Workflows degraded under load no one had simulated. The agents could not debug undocumented legacy entanglement without human intuition acting as glue.
Within months the same companies were rehiring those same engineers at premium rates — the “boomerang employee” pattern, now a category. Institutional knowledge turned out to be a largely un-computable variable. The lesson is not “AI doesn’t work.” The lesson is that the wrapper takes longer to build than the layoff press release.
The smaller-scale version of the same lesson lives inside Karpathy’s eight-agent simulation. Recall the parable from §2: an agent reported it had “discovered” that bigger hidden size lowered loss. That is what happens in microcosm when an org has agents but no scientific taste. They will produce results that pass every test you wrote and tell you nothing new about the world. The Reverse Layoff is the macro version of “discovered hidden size.”
There is a quieter, longer-acting failure mode running in parallel. A 2026 MIT study identified what the field is now calling Cognitive Debt: measurable atrophy in independent analytical capacity among engineers who outsource the painful part of problem-solving to agents. EEG and fMRI work in the cohort showed reduced activation in the regions associated with deep, sustained reasoning. Output rises; the muscle that produces taste declines. The companies that survive this decade will be the ones that protect the human cognition they need to keep designing the bounding box.
A related collapse is happening in measurement itself. The DORA framework — deployment frequency, lead time, MTTR, change failure rate — was the canonical way to measure engineering health for a decade. 66% of developers no longer trust those dashboards. Deployment frequency means nothing when an agent ships dozens of commits an hour. The metrics will follow the unit of work, and the unit is shifting from “the team” to “the bundle.”
9. Due diligence becomes code review

If a company is an executable bundle, acquiring one means inspecting the bundle. Not just the financials. The operating code.
You would review:
- permissions — which agents can touch money, production, customer data, outbound, legal docs?
- memory — what does the org believe, who wrote it, when was it last verified?
- workflows — which queues drive revenue, support, finance, shipping?
- evals — what tests prove the org still works?
- budgets — where does compute spend go, and which loops can run away?
- audit logs — can you replay why a decision happened?
- human dependencies — which workflows fail if one person leaves?
That last one is the killer. A lot of companies are not really companies. They are a few heroic people holding a pile of broken processes in their heads. Organizations as code makes that visible.
It also makes the inverse visible. A small team with a clean operating repo, sharp evals, narrow permissions, clear memory boundaries, and repeatable workflows may be much more valuable than its headcount suggests.
Revenue Per Employee was the first AI-native metric. The Clone Test is the better one: how much of your business survives if you delete every human and rebuild from the repo alone? If 90% does, you are running infrastructure. If 30% does, you are running a story about a few people. The Clone Test is brutal because it does not care how good your culture deck is. It asks one question — is the company in the repo, or in your head? — and the answer determines what an acquirer is actually buying.
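One crude way to put a number on the Clone Test is to list the workflows, weight each by the share of revenue it drives, and count only those with no named human dependency. The weighting scheme, the `Workflow` fields, and the example figures are all assumptions of this sketch, not a standard metric.

```python
from dataclasses import dataclass, field


@dataclass
class Workflow:
    name: str
    revenue_share: float                      # fraction of revenue this workflow drives
    human_dependencies: list = field(default_factory=list)


def clone_test_score(workflows: list) -> float:
    """Share of revenue-weighted workflows that run from the repo alone."""
    total = sum(w.revenue_share for w in workflows)
    if total <= 0:
        raise ValueError("workflows must carry a positive total revenue share")
    surviving = sum(w.revenue_share for w in workflows if not w.human_dependencies)
    return surviving / total


# Hypothetical org: support and outbound run from the repo, renewals do not.
EXAMPLE_WORKFLOWS = [
    Workflow("outbound", 0.5),
    Workflow("renewals", 0.3, ["account exec who knows every buyer"]),
    Workflow("support", 0.2),
]
```

On the example, 70% of the business survives the deletion; the remaining 30% is the part an acquirer is buying as people rather than as code.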
10. Governance is the product

A clonable organization is powerful in the same way a botnet is powerful. If you can spin up a useful org, you can spin up a harmful one. Spam companies. Scam companies. Automated litigation mills. Fake-media networks. Synthetic political operations. Companies with no moral center because nobody inside them feels responsible.
This is why governance is not a chapter at the end. It is the product.
The hard questions arrive immediately:
- Who owns an agent’s actions?
- Who can audit a forked organization?
- Can a regulator inspect org code?
- Can customers know whether they are dealing with a human, an agent, or a company with one human and ten thousand agents?
- Can payment networks, cloud providers, and model providers enforce identity at the organizational level?
- Can an org be rate-limited?
- Can it be recalled?
Agent identity is not enough. The org needs identity. Not just an account — provenance. Who created this org. What template it was forked from. What permissions it has. Which humans are accountable. Which jurisdictions it operates in. Which models and pinned versions it uses. What changed between this version and the last one.
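The provenance fields listed above fit naturally into a small immutable record with a content hash, so that two versions of an org can be compared and a fork can point back at its parent. This is a sketch under assumptions: the `OrgProvenance` type, its field names, and the choice of SHA-256 over canonical JSON are illustrative, not an existing standard.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, fields, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class OrgProvenance:
    created_by: str
    forked_from: Optional[str]            # fingerprint of the parent org, if any
    permissions: Tuple[str, ...]
    accountable_humans: Tuple[str, ...]
    jurisdictions: Tuple[str, ...]
    pinned_models: Tuple[str, ...]

    def fingerprint(self) -> str:
        """Stable SHA-256 over the record, serialized with sorted keys."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def changed_fields(old: OrgProvenance, new: OrgProvenance) -> list:
    """Names of the fields that differ between two versions, in declaration order."""
    return [f.name for f in fields(old) if getattr(old, f.name) != getattr(new, f.name)]


# Hypothetical parent org and a fork that adds a jurisdiction.
parent = OrgProvenance(
    created_by="human:founder",
    forked_from=None,
    permissions=("crm.read", "email.send"),
    accountable_humans=("human:founder",),
    jurisdictions=("EU",),
    pinned_models=("claude-sonnet-4-6-2026-01-08",),
)
fork = replace(parent, forked_from=parent.fingerprint(), jurisdictions=("EU", "US"))
```

Because the fork records its parent's fingerprint, the chain back to the original template can be followed without trusting anyone's description of it.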
Without that, organizations as code becomes organizations as malware.
The geopolitical layer is already moving. The EU’s GAIA-X effort is accelerating a federated sovereign tech stack at a rumored €20–30B per year, on the explicit thesis that European org code should not be bound to American hyperscalers. The U.S. is doubling down via NSF AI Institutes and an aggressive industrial policy on agentic workforce development. There will be no neutral place to host an org repo. The infrastructure decisions are political decisions now.
11. Disposable coordination — and what you can ship Monday

The cloud gave us disposable infrastructure. AI agents give us disposable labor. Organizations as code give us disposable coordination.
That is the unlock.
Once coordination becomes cheap, we will see many more tiny companies, temporary companies, single-purpose companies, forked companies, simulated companies, companies that look more like open-source projects than corporations. Some will be scams. Some will be toys. Some will be terrifying. Some will be beautiful. The direction is clear.
First we made servers programmable. Then we made workers programmable. Now we are making the organization programmable.
You do not need to wait for the platform. The smallest org-as-code artifact you can ship this week is exactly the one Karpathy shipped in March: one program.md, one immutable harness, one mutable sandbox, one agent, one scoped repo, one scalar metric. Pick a single workflow inside your team — a weekly competitive scan, a triage queue, an SDR sequence, a deploy verification loop. Write the brief. Pin the model version. Cap the budget. Add a single OPA-style rule for the one action that requires a human.
A starter program.md:
```markdown
# program.md — competitor-watcher v0.1

## Mission
Surface every public competitor product/pricing/comms change in the last 7 days.

## Scope
- READ: competitor URLs in /sources/competitors.yaml
- WRITE: append to /reports/weekly-YYYY-WW.md only
- NEVER: edit /sources, /policies, /evals; never make outbound contact

## Workflow
1. Fetch each source URL. Diff against /cache/last-snapshot/.
2. For each diff, classify: pricing | feature | positioning | hiring | other.
3. Write one bullet per change with: source URL, date observed, 1-line summary.
4. If >25 bullets in a single week, escalate (see Recovery).
5. Update /cache/last-snapshot/ on success.

## Success metric
- scalar: change_recall_score = (bullets_human_marked_useful) / (total_bullets)
- target: >= 0.6 over a 4-week trailing window.
- if score drops below 0.4, halt and request human review.

## Budget
- 200K input tokens / 20K output tokens per run, hard cap.
- Model: claude-sonnet-4-6-2026-01-08 (pinned; never "latest").

## Recovery
- On HTTP 4xx/5xx for a source: log, skip, continue.
- On parse failure: write raw HTML to /errors/, do not invent content.
- On budget overrun: halt, write /errors/budget-exceeded.md, exit 2.
```
Run it nightly. Read the report every morning for two weeks. Tune program.md based on what you actually wanted vs. what you got. That tuning is the founder skill of the next decade in miniature. The rest of this post is what happens when the same loop is applied to engineering, GTM, finance, support, and compliance — at the same time, with budgets and policies wired together, on the same substrate.
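The success metric in the starter brief is simple enough to compute by hand, but writing it down removes ambiguity about the window. The sketch below pools the last four weeks of (useful, total) bullet counts; pooling rather than averaging weekly ratios is an assumption, as are the function names and the "tune" state between the two thresholds. Note that, as defined, the ratio behaves more like precision than recall; the name is kept from the brief.

```python
def change_recall_score(weeks: list) -> float:
    """Pooled useful/total ratio over a list of (useful, total) weekly counts."""
    useful = sum(u for u, _ in weeks)
    total = sum(t for _, t in weeks)
    # No bullets at all scores 0.0, which trips the halt threshold below.
    return useful / total if total else 0.0


def next_action(
    history: list,
    target: float = 0.6,
    halt_below: float = 0.4,
    window: int = 4,
) -> str:
    """Decide what to do after the latest run, using the trailing window."""
    score = change_recall_score(history[-window:])
    if score < halt_below:
        return "halt"   # request human review
    if score >= target:
        return "ok"
    return "tune"       # between thresholds: revise program.md, keep running


# Hypothetical history: (bullets marked useful, total bullets) per week.
HISTORY = [(2, 10), (6, 10), (7, 10), (5, 10), (8, 10)]
```

With the example history the oldest week falls out of the window and the pooled score is 26/40, which clears the 0.6 target.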
The next five years are about learning how to spin up a company without losing the ability to ask whether it should exist.
Where this breaks

Five failure modes worth naming honestly.
[!WARNING] The empty-wrapper trap. A repo full of policies/, evals/, and audit-logs/ is not a company. Without a real workflow producing real value, the wrapper is theater. The Reverse Layoff happened to companies that built the wrapper before they had earned the right to compress the workforce inside it. Build the smallest valuable workflow first; wrap second.
[!CAUTION] Memory pollution as organizational hallucination. The most insidious failure of org-as-code is when stale, wrong, or agent-generated memories enter the trusted memory layer and become structural precedent for future decisions. By iteration N+10, the org is making decisions on its own past hallucinations. Memory needs aggressive provenance, expiry, and pruning — closer to a database garbage collector than a knowledge base.
[!NOTE] Semantic transferability cliff. A heavily optimized org repo encodes the founder’s biases, blind spots, and aesthetic preferences. Forking it laterally — handing it to another founder, another vertical, another culture — guarantees friction. Distinguishing absolute organizational primitives from idiosyncratic founder preferences is the next hard problem in this space. We do not yet have a clean answer.
[!IMPORTANT] Silent model drift breaks the org overnight. Cloud providers ship undocumented model updates that change behavior on standard benchmarks: increased hallucination rates, exacerbated multi-file laziness, sometimes both. An autonomous loop running unattended cannot tell you that yesterday’s claude-opus-latest is not today’s. Pin specific immutable model versions at the org-config level. Treat latest as a synonym for “production may regress without warning.”
[!DANGER] The Token Snowball. A single agent stuck in a “fix this dependency” loop can burn 1M tokens every ten minutes. Without per-task budget caps, circuit breakers (kill any task >5 retries), and a manager-agent watching for zombie loops, the unit economics in §6 collapse from 44:1 to negative within a week. Budget Trackers are not optional infrastructure. They are the foundation everything else sits on.
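For the memory-pollution failure mode above, "provenance, expiry, and pruning" can be sketched as a garbage-collection pass over memory records. The `Memory` fields, the `agent:` / `human:` author prefixes, and the 90-day and 7-day lifetimes are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Memory:
    claim: str
    author: str                          # e.g. "human:alice" or "agent:market-mapper"
    created_at: datetime
    verified_at: Optional[datetime] = None


def prune(
    memories: list,
    now: datetime,
    max_age: timedelta = timedelta(days=90),
    unverified_agent_ttl: timedelta = timedelta(days=7),
) -> list:
    """Keep only memories that are still inside their lifetime."""
    kept = []
    for memory in memories:
        # Age is measured from the last verification, or from creation if never verified.
        anchor = memory.verified_at or memory.created_at
        unverified_agent = memory.author.startswith("agent:") and memory.verified_at is None
        ttl = unverified_agent_ttl if unverified_agent else max_age
        if now - anchor <= ttl:
            kept.append(memory)
    return kept


NOW = datetime(2026, 6, 1, tzinfo=timezone.utc)
MEMORIES = [
    Memory("Acme renewed in May", "human:alice", NOW - timedelta(days=40), NOW - timedelta(days=30)),
    Memory("Competitor cut prices 20%", "agent:competitor-watcher", NOW - timedelta(days=10)),
    Memory("Vendor X has no SSO", "human:bob", NOW - timedelta(days=300), NOW - timedelta(days=200)),
]
```

Agent-written claims that nobody has verified get the shortest lifetime, so an unchecked hallucination ages out before it can become precedent.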
Sharad Jain builds agentic AI pipelines in Bengaluru. He previously engineered core data infrastructure at Meta and is the founder of autoscreen.ai, a production voice-AI platform. This post is part of a series on agentic AI infrastructure — see The 14K Token Debt on system-prompt architecture and The Terminal Was the First Agent Harness on Unix primitives as agent patterns.
Research & Footnotes:
- Karpathy: autoresearch (630-line autonomous ML research loop, March 2026)
- The New Stack: Karpathy’s 630-line script ran 50 experiments overnight
- The New Stack: Vibe coding is passé. Karpathy has a new name for the future of software.
- Quasa Connect: Karpathy’s experiment assembling an AI research team — ‘Org Engineering’
- Quasa Connect: The Great AI Reverse Layoff
- Quasa Connect: MIT Study Reveals ‘Cognitive Debt’
- byteiota: Developer Productivity Metrics Fail: 66% Don’t Trust Them
- NxCode: Agentic Engineering: Complete Guide (2026) — Stripe 1,000 PRs/week, Zapier 89% adoption, AI-native RPE
- O-mega: Karpathy Autoresearch Complete 2026 Guide — Lütke / Shopify 19% improvement
- Anthropic: Model Context Protocol specification
- Open Policy Agent: OPA documentation — policy-as-code engine, Rego language
- Temporal: durable execution platform — workflow indestructibility, signals, long-running coordination
- LangChain: LangGraph persistence — checkpointed cyclic graphs for agent cognition
- E2B: secure code-execution sandboxes — Firecracker microVMs for ephemeral agent execution
- CIO: Google’s Budget Tracker / BATS framework for agent cost containment
- Quasa Connect: Europe’s GAIA-X sovereign tech stack
- arXiv: LiveMCPBench — agents in dynamic MCP tool environments — 78.95% success in tool-rich envs vs static baselines
- arXiv: MetaAgent — self-evolving agents via tool meta-learning