AI Coding Still Needs Specs, Contracts, and Verification

When implementation gets cheaper, delivery risk moves into constraints, handoffs, and review discipline.

Jun 26, 2026

After The Spec Trap, one pushback came back fast. A coworker said something a lot of people say right now: natural language is ambiguous, code is precise, therefore code is the only spec that matters.

This is a smart but wrong objection.

Code is more precise than prose. That part is obvious. The real question is where intent, constraints, and verification should live now that agents have made implementation cheap.

Where It Is Right

There is a reason “code is spec” keeps coming back. It points at a real problem.

In Bootstrapping Coding Agents, Martin Monperrus showed a coding agent bootstrapping itself from a 926-word specification and a first implementation produced by Claude Code. The result is clear. If the specification is good enough, another agent can regenerate the implementation. The paper’s conclusion matters most: the specification, not the implementation, is the stable artifact of record.

Monperrus is right on that point.

But notice what this actually supports. It does not say code is the spec. It says a precise enough specification can sit above code and survive code regeneration. That gets close to the opposite claim.

The appeal of “code is spec” is simple. It sounds rigorous without forcing us to do the uncomfortable part. We want the benefits of precision. We just do not want to pay for the artifact that carries it.

Precision Still Has to Live Somewhere

Formal methods research has been hitting the same wall for a long time: the precision does not disappear when you move it out of code. You still have to decide what counts as a valid input, what outputs are acceptable, which edge cases matter, and which invariants survive refactoring.

Recent work says this pretty clearly. Verus-SpecGym frames the new bottleneck as translating informal intent into the right formal specification. The failure modes are ordinary and familiar: generated specs omit important input assumptions, accept incorrect outputs, and reject valid ones. A recent study on natural-language-to-TLA+ generation is harsher still: across 30 models, semantic correctness topped out at 8.6 percent. VeriAct adds another bad detail: verifier-accepted specifications can still be incorrect or incomplete in ways the verifier itself does not catch.
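
Two of those failure modes are easy to reproduce at toy scale. The sketch below is not taken from any of the cited papers; the function names (`weak_spec`, `strong_spec`, `strict_spec`, `lazy_sort`) are hypothetical and chosen only for illustration. It shows a sorting postcondition that accepts an incorrect output, and another that rejects a valid one.

```python
from collections import Counter


def is_sorted(xs):
    """True if xs is in non-decreasing order."""
    return all(a <= b for a, b in zip(xs, xs[1:]))


def weak_spec(inp, out):
    # Checks ordering only, so it accepts any sorted list,
    # including one that has nothing to do with the input.
    return is_sorted(out)


def strong_spec(inp, out):
    # Ordering plus "same elements, same counts" as the input.
    return is_sorted(out) and Counter(inp) == Counter(out)


def strict_spec(inp, out):
    # Demands strictly increasing output, so it rejects
    # correct results whenever the input has duplicates.
    strictly_increasing = all(a < b for a, b in zip(out, out[1:]))
    return strictly_increasing and Counter(inp) == Counter(out)


def lazy_sort(xs):
    # A deliberately wrong "implementation": it drops everything.
    return []
```

Under these assumptions, `weak_spec([3, 1, 2], lazy_sort([3, 1, 2]))` holds even though the output is wrong, while `strict_spec([2, 1, 2], [1, 2, 2])` fails even though the output is right. Both specs look plausible in review, which is roughly the point the studies make at scale.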

The slogan hides that part. Code generation can be cheap. Specification faithfulness is not.

Where Spec Actually Sits

The simple way to think about this is as a stack.

The spec does not live in the vague paragraph at the top, and it does not live in the Java file at the bottom. It sits between intent and implementation. It is the bundle of constraints that lets you say one implementation is acceptable and another one is wrong.

The middle layers carry most of the spec: structured intent plus executable checks. They constrain one realization of code; they do not replace it.

If you collapse all of that into code, you lose the distinction between what the system should do and what this version happens to do. You need that distinction when another agent, another teammate, or your future self has to change it safely.
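
A small sketch of that distinction, with hypothetical names (`satisfies_dedupe_spec`, `dedupe_v1`, `dedupe_v2`) invented for this example: the check states what any acceptable deduplication must do, and two implementations that behave differently both pass it.

```python
def satisfies_dedupe_spec(inp, out):
    """What the system should do: no duplicates, nothing lost, nothing invented."""
    return len(out) == len(set(out)) and set(out) == set(inp)


def dedupe_v1(items):
    # One realization: keeps the first occurrence of each item, in input order.
    return list(dict.fromkeys(items))


def dedupe_v2(items):
    # Another realization: returns the unique items in sorted order.
    return sorted(set(items))
```

For `["b", "a", "b", "c"]`, the first version returns `["b", "a", "c"]` and the second returns `["a", "b", "c"]`. Both satisfy the check. If callers depend on input order, that requirement has to be written into the check; reading `dedupe_v1` alone cannot tell you whether the ordering is a promise or an accident.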

What Language Is Actually Good For

This does not mean natural language is useless. It means natural language has a job, and it is not the same job as code.

In Natural-Language Agent Harnesses, the authors show that high-level harness logic can be externalized in editable natural language and executed through explicit contracts and durable artifacts. That division of labor makes sense.

Natural language is good at naming goals, non-goals, trade-offs, delegation boundaries, and failure semantics. Code is good at deterministic operations, tool interfaces, and enforcement.

If you force natural language to carry every precise constraint, you are halfway back to formal specification anyway. If you force code to carry all the intent, you delete the reason the code exists and keep only the current outcome.

Code records one solution. It does not reliably preserve every rejected alternative, every invariant that must survive the rewrite, or every failure mode that a lazy implementation would miss.

The Empirical Part

The real-world evidence is still a little messier, but it points the same way.

Debt Behind the AI Boom tracked 302.6 thousand verified AI-authored commits across 6,299 GitHub repositories and found 484,366 introduced issues. More importantly, 22.7 percent of those issues still survived in the latest repository revision. Weak constraint layers produce exactly that kind of shipped code.

Another study, Assessing the Quality and Security of AI-Generated Code, ran SonarQube over 4,442 Java tasks and found that even functionally passing solutions still averaged between 1.45 and 2.11 issues per passing task across the five models tested. Passing the tests did not mean the code was clean. It meant the oracle was too small.
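
A toy version of a too-small oracle, not drawn from the study itself: the functions and oracles below are hypothetical, and the defect (a mutable default argument) is just one familiar example of something a static analyzer tends to flag while a single functional test passes.

```python
def make_buggy_add_tag():
    """Build a fresh copy of the buggy function so each oracle starts clean."""
    def add_tag(tag, tags=[]):  # bug: the default list is shared across calls
        tags.append(tag)
        return tags
    return add_tag


def add_tag_fixed(tag, tags=None):
    # Creates a new list per call unless the caller supplies one.
    tags = [] if tags is None else tags
    tags.append(tag)
    return tags


def small_oracle(add_tag):
    # One call, one assertion: the buggy version passes this.
    return add_tag("a") == ["a"]


def larger_oracle(add_tag):
    # Two independent calls: the shared default list now shows up.
    return add_tag("a") == ["a"] and add_tag("b") == ["b"]
```

Here `small_oracle(make_buggy_add_tag())` passes and `larger_oracle(make_buggy_add_tag())` fails, while `add_tag_fixed` passes both. The code did not change between the two results. Only the oracle did.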

On the spec side, the numbers get better when the structure gets tighter. The BMW case study Streamlining Acceptance Test Generation for Mobile Applications Through Large Language Models generated Gherkin scenarios, page objects, and executable UI tests from JIRA tickets in under five minutes per feature. Practitioners reported time savings often around a full developer-day. A separate study, From Law to Gherkin, found that about 92 percent of time-savings ratings for LLM-generated behavioral specifications landed in the top two categories.

These are different studies in different domains. They do not add up to one neat law. They still point the same way. Weakly constrained code creates debt. Structured intent with executable checks makes review cheaper.

Code Is Not Enough

Code is precise about one thing: what a specific implementation does.

Code does a much worse job preserving why this behavior matters, which alternatives were rejected, what must remain true across rewrites, or where a future agent is allowed to improvise. That makes “code is spec” the wrong answer to the right question.

What matters is how much of the system’s intent you can move into artifacts that another human or another agent can check before trusting the implementation.

Sometimes that artifact is a schema. Sometimes it is BDD. Sometimes it is a typed handoff contract plus validators and tests. Sometimes it is a formal specification, because the risk justifies it. But if the only place the truth lives is the code, you usually learn what the code was supposed to mean only after it goes wrong.
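
As one possible shape for a typed handoff contract plus a validator, here is a minimal sketch. The `PatchHandoff` type, its fields, and the rules in `validate_handoff` are all assumptions made up for this example, not a standard or an API from any of the cited work.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PatchHandoff:
    """Hypothetical record an agent must fill in before its work is reviewed."""
    task_id: str
    files_changed: Tuple[str, ...]
    tests_passed: int
    tests_failed: int
    summary: str


def validate_handoff(h: PatchHandoff) -> List[str]:
    """Return the list of contract violations; an empty list means acceptable."""
    errors = []
    if not h.task_id.strip():
        errors.append("task_id must not be empty")
    if not h.files_changed:
        errors.append("files_changed must list at least one file")
    if h.tests_passed < 0 or h.tests_failed < 0:
        errors.append("test counts must not be negative")
    if h.tests_failed > 0:
        errors.append("handoff has failing tests")
    if h.tests_passed + h.tests_failed == 0:
        errors.append("no tests were run")
    return errors
```

A handoff such as `PatchHandoff("T-1", ("src/app.py",), 12, 0, "Add retry")` returns no violations, while an empty one is rejected before anyone reads the diff. The rules are small, but they are checkable by a human or another agent without trusting the implementation first.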

The Spec Trap made the same point. Agents do not remove ambiguity. They make it more expensive to ignore.
