AI Coding Governance for Real Software Teams

Why bounded tasks, shaped tool surfaces, and human verification matter more than model demos when the code has to survive production.

Jun 08, 2026

The slide I care about most from my recent JCon keynote had only five words on it: code is cheap, software isn’t.

If you want the full talk, the recording is here and you can find the slides here. This article is the shorter version I would hand to a team before they let an agent loose in a real repository.

The exciting part of AI coding is obvious. You ask for a REST endpoint, a refactor, a migration sketch, or a first-pass UI, and the blank page disappears. That part is real. I use these tools too. They save time, remove some boring work, and make it easier to try things quickly.

The problem starts when people confuse easier code generation with easier software engineering.

Real systems are full of hidden contracts: authorization rules, deployment assumptions, stale docs, weird integration edges, and business logic nobody wrote down but everybody now depends on. A model can infer some of that from nearby code. It cannot infer all of it just because the prompt sounded confident.

That is still the main lesson for me after the last 18 months. AI can draft a lot of syntax. It does not automatically understand the software around it.

Prompting is not the hard part

I keep seeing teams treat AI coding as a prompting problem. Write better prompts. Add more detail. Find the right magic spell. That helps a bit, but it is not the part I trust.

The bigger lever is context.

If the agent does not know your architecture constraints, your business boundary, your non-negotiable tests, or the parts of the repository it should not touch, it will make something up. That is not a moral failure. That is how the system works. Statistical tools fill gaps with likely-looking answers.

So I would spend less time chasing clever wording and more time building context the model can actually use.

That means repository rules. It means ADRs. It means decision logs. It means tests that define behavior clearly enough that the agent has something better than vibes. It also means being selective. Dumping a pile of stale internal docs into the context window is not context engineering. It is just a different way to confuse the model.

The real question is not “how much context can I fit?” It is “which context changes the quality of the decision?”

Big tasks still fail for familiar reasons

AI has not rescued us from decomposition.

If a change is too messy to explain in two or three clear sentences, it is probably too messy to hand to an agent as one job. The tool may still produce a lot of output. That is not the same as producing a controlled result.

This is one place where the AI discourse sometimes sounds weirdly ahistorical. We already know big-bang rewrites fail. We already know broad “clean this whole area up while you are there” work spreads risk faster than teams can review it. Why would an agent be the magical exception?

The pattern that keeps holding up is still the old one:

Smaller tasks
Tight boundaries
Fast verification
Easy rollback
Explicit ownership

That is less glamorous than “fully autonomous development,” but it is much closer to how production teams stay sane.
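A tight boundary can be made mechanical instead of aspirational. The following is a minimal sketch, not a real tool: `TaskScopeGuard`, the path prefixes, and the file limit are all hypothetical names and values chosen for illustration. It assumes you already have the list of changed paths for the agent's diff (for example from your version control system) and only checks that the change is small and stays inside the agreed area.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical guard: flags a change set that strays outside an agreed task boundary. */
final class TaskScopeGuard {
    private final List<String> allowedPrefixes;
    private final int maxFiles;

    TaskScopeGuard(List<String> allowedPrefixes, int maxFiles) {
        this.allowedPrefixes = List.copyOf(allowedPrefixes);
        this.maxFiles = maxFiles;
    }

    /** Returns the changed paths that match none of the allowed prefixes (plain string prefix check). */
    List<String> outOfScope(List<String> changedPaths) {
        return changedPaths.stream()
                .filter(path -> allowedPrefixes.stream().noneMatch(path::startsWith))
                .collect(Collectors.toList());
    }

    /** A change is acceptable only if it is small and stays inside the boundary. */
    boolean accepts(List<String> changedPaths) {
        return changedPaths.size() <= maxFiles && outOfScope(changedPaths).isEmpty();
    }

    public static void main(String[] args) {
        // Illustrative boundary: the task is allowed to touch the billing module only.
        TaskScopeGuard guard = new TaskScopeGuard(
                List.of("billing/src/main/", "billing/src/test/"), 5);
        List<String> changed = List.of(
                "billing/src/main/java/InvoiceService.java",
                "auth/src/main/java/TokenFilter.java");
        System.out.println(guard.accepts(changed));    // false
        System.out.println(guard.outOfScope(changed)); // [auth/src/main/java/TokenFilter.java]
    }
}
```

A check like this does not judge whether the change is good. It only makes "the agent wandered into the auth module" visible before anyone spends review time on it.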

I also think local git matters more than people admit. When an agent goes off the rails, the difference between a useful experiment and an annoying afternoon is often whether you can inspect the diff quickly, throw it away, and try again with a narrower request.

The useful tooling exposes systems, not just text

One reason I care about MCP and similar tool surfaces is that they shift the interaction away from pure guessing.

A model is much more useful when it can inspect runtime state, read the logs that matter, query the system you are actually changing, or reach structured docs instead of paraphrasing what it half-remembers. That does not make the model magical. It just gives it better ground to stand on.

For me, that is the real promise of this layer. Not “the model can use tools” as a demo trick. The better promise is that the model can stop pretending text alone is enough to understand a live system.

The same rule still applies, though: more tools are not automatically better. Tool sprawl creates its own tax. A giant tool catalog, a huge context payload, and 10 overlapping ways to do the same thing can make the session worse, not better. Good AI workflows need a shaped surface just as much as they need a capable model.
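Here is one way a shaped surface could look in code. This is a sketch under stated assumptions: `ToolSurface`, `Tool`, and the tool and capability names are hypothetical and do not come from MCP or any specific SDK. The idea is simply to hand the agent only the capabilities the task needs, and only one tool per capability.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper that narrows a large tool catalog to a task-specific surface. */
final class ToolSurface {
    /** Illustrative tool descriptor; the name and capability labels are made up. */
    record Tool(String name, String capability) {}

    /**
     * Keeps only tools whose capability the task needs, and at most one tool
     * per capability (the first one listed wins), so overlapping tools are dropped.
     */
    static List<Tool> shape(List<Tool> catalog, Set<String> neededCapabilities) {
        Map<String, Tool> byCapability = new LinkedHashMap<>();
        for (Tool tool : catalog) {
            if (neededCapabilities.contains(tool.capability())) {
                byCapability.putIfAbsent(tool.capability(), tool);
            }
        }
        return List.copyOf(byCapability.values());
    }

    public static void main(String[] args) {
        List<Tool> catalog = List.of(
                new Tool("read_app_logs", "logs"),
                new Tool("tail_logs_legacy", "logs"),
                new Tool("query_orders_db", "database"),
                new Tool("deploy_to_prod", "deploy"));
        // A debugging task needs logs and database access, not deployment.
        List<Tool> surface = shape(catalog, Set.of("logs", "database"));
        surface.forEach(tool -> System.out.println(tool.name()));
        // prints read_app_logs, then query_orders_db
    }
}
```

The "first one listed wins" rule is a deliberate simplification. In practice someone on the team decides which of the overlapping tools is the supported one, and that decision is the shaping.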

The bill comes back during review

This is the part I think teams still underestimate.

AI output is fast to produce and often expensive to verify. That cost does not show up in the demo. It shows up when a senior engineer has to read a clean-looking diff and decide whether the system still deserves trust.

That review load is real work. You are checking hidden assumptions, edge cases, failure paths, auth behavior, operational risk, naming drift, and whether the tests prove anything useful or just mirror the implementation. If the change touches infrastructure, security, or business rules, the cognitive bill gets even higher.
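The difference between a test that mirrors the implementation and one that proves something is easier to see in code. The sketch below uses a hypothetical discount rule (10% off orders of 100.00 or more); the rule, the class name, and the values are invented for illustration. It uses plain checks rather than a test framework so it runs on its own.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

final class DiscountExample {
    /** Hypothetical business rule: 10% off orders of 100.00 or more, rounded to cents. */
    static BigDecimal priceAfterDiscount(BigDecimal total) {
        if (total.compareTo(new BigDecimal("100.00")) >= 0) {
            return total.multiply(new BigDecimal("0.90")).setScale(2, RoundingMode.HALF_UP);
        }
        return total.setScale(2, RoundingMode.HALF_UP);
    }

    static void check(boolean condition, String message) {
        if (!condition) {
            throw new AssertionError(message);
        }
    }

    public static void main(String[] args) {
        BigDecimal total = new BigDecimal("100.00");

        // Mirror check: recomputes the expected value with the same formula as the code.
        // It keeps passing even if 0.90 is the wrong rate for the business.
        BigDecimal mirrored = total.multiply(new BigDecimal("0.90")).setScale(2, RoundingMode.HALF_UP);
        check(priceAfterDiscount(total).equals(mirrored), "mirror check");

        // Independent checks: expected values are written down from the rule itself,
        // including the boundary just below the threshold.
        check(priceAfterDiscount(new BigDecimal("100.00")).equals(new BigDecimal("90.00")), "at threshold");
        check(priceAfterDiscount(new BigDecimal("99.99")).equals(new BigDecimal("99.99")), "just below threshold");

        System.out.println("all checks passed");
    }
}
```

All three checks pass here, which is the point: a green result tells the reviewer nothing about which kind of check produced it. Only the independent ones would catch a wrong rate or an off-by-one threshold.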

This is why I do not trust raw productivity claims that stop at code generation speed. A fast draft plus a slow, exhausting review loop is not automatically a win. Sometimes it is. Sometimes it is just a different queue.

The failure mode is subtle because tired reviewers still look productive from the outside. Files changed. The pull request is large. CI passed. Everybody feels movement. But fatigue is not the same as confidence, and momentum is not the same as understanding.

If I had to keep one professional rule from all of this, it would be simple: if you do not understand an AI-generated change well enough to explain its failure mode, do not merge it.

Java is in a stronger position than the hype suggests

A lot of AI coding conversation still defaults to Python because the AI ecosystem grew up around Python-first tooling. That does not mean Python teams are automatically in a better engineering position.

Java has a very practical advantage here. The code is explicit. Types are visible. Contracts are clearer. Framework conventions are stronger. Build and runtime boundaries are easier to follow than in many loosely structured stacks. That kind of shape helps humans review faster, and it helps models stay closer to the rails.

I do not mean Java makes AI safe. It does not. I mean Java gives both the model and the reviewer more structure to work with, which is one reason I think enterprise Java teams are better positioned for this transition than the public AI narrative suggests.

If anything, Java teams should lean into that advantage on purpose. Strong tests, typed config, explicit boundaries, and boring conventions are not old habits the AI era made irrelevant. They are what make the AI era survivable.
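As a small illustration of typed config with explicit boundaries, here is a sketch using a Java record (Java 16 or later). `PaymentClientConfig`, its fields, the URL, and the limits are hypothetical. The point is that invalid values fail at construction, where both a reviewer and a model can see the rules, instead of deep inside a request path.

```java
import java.time.Duration;
import java.util.Objects;

final class TypedConfigExample {
    /** Hypothetical typed config: the constraints live next to the fields they guard. */
    record PaymentClientConfig(String baseUrl, Duration timeout, int maxRetries) {
        PaymentClientConfig {
            Objects.requireNonNull(baseUrl, "baseUrl");
            Objects.requireNonNull(timeout, "timeout");
            if (!baseUrl.startsWith("https://")) {
                throw new IllegalArgumentException("baseUrl must use https: " + baseUrl);
            }
            if (timeout.isZero() || timeout.isNegative()) {
                throw new IllegalArgumentException("timeout must be positive");
            }
            if (maxRetries < 0 || maxRetries > 5) {
                throw new IllegalArgumentException("maxRetries must be between 0 and 5");
            }
        }
    }

    public static void main(String[] args) {
        PaymentClientConfig ok = new PaymentClientConfig(
                "https://payments.example.internal", Duration.ofSeconds(2), 3);
        System.out.println(ok.timeout().toMillis()); // 2000

        try {
            // Plain http violates the illustrative rule above and is rejected immediately.
            new PaymentClientConfig("http://payments.example.internal", Duration.ofSeconds(2), 3);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```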

What I would actually tell a team

If I had to compress the whole keynote into one short working agreement, it would look like this:

Give the model better context, not just longer prompts.
Keep tasks small enough that verification stays cheap.
Use tools that expose real system state when the task depends on real system state.
Treat review load as part of the cost, not as free cleanup after the model is done.
Keep a human owner attached to every production change.

That is not a revolutionary message. It is mostly software engineering refusing to disappear just because the draft got cheaper.

AI is changing how we build. I do not think that part is controversial anymore. The part people keep relearning is that faster code generation does not remove the expensive parts of software. Intent is still expensive. Architecture is still expensive. Verification is still expensive. Ownership is still expensive.

That is why the sentence on the slide still holds up.

Code is cheap. Software isn’t.
