How to Stay Responsible When AI Writes Part of Your Code

A practical guide to ownership, review, and guardrails when AI starts doing more than autocomplete.

May 18, 2026

Roanne van Voorst, writing about medical AI, uses a weatherman analogy that I have not been able to shake. The people using the system are still expected to answer for the outcome even when the model itself is hard to inspect (Mayo Clinic Proceedings: Digital Health). Move that discomfort into software and the same question becomes easier to understand. If your AI coding assistant produces a Hibernate query that corrupts data or a workflow that quietly leaks access, how well does the human reviewer need to understand it before merge?

The law gives one answer. Part one of this series made the boring but important point that the team shipping the system still owns the system. That is the base. The harder question is what responsible use actually requires you to change before the incident review, before the audit, and before the customer has a reason to care.

This is where the phrase “responsibility gap” shows up. It gives a name to a mess many teams are already building. The problem is not that nobody cares. The problem is that AI-assisted development makes it easier to spread authorship around, thin out judgment, and lose track of who checked what.

The gap is structural, not just ethical

Zhenyu Mao and coauthors describe responsibility gaps in GenAI-enabled software in much simpler terms than most industry decks do: responsibility gets blurry when oversight is weak, control is partial, and the trail of decisions is hard to reconstruct (arXiv:2511.13069). That is a good description of what many engineering teams are building right now without saying it out loud.

The failure mode is not dramatic. It is incremental. One developer asks an assistant for the first draft. Another accepts most of it and fixes the compile errors. A reviewer scans the diff for obvious nonsense. CI passes. The code ships. Weeks later, when something breaks, each person in the chain can honestly say they touched the system but did not really own the key decision. You end up with several people who were involved and nobody who clearly owned the call that mattered.

That is not just philosophical discomfort. Governance data is already pointing in the same direction. IBM’s 2025 Cost of a Data Breach Report found that 63% of breached organizations had no AI governance policies in place, and that high levels of shadow AI added an average of $670,000 to breach costs. When accountability is vague, process gets improvised. When process gets improvised, responsibility gets spread so thin that nobody can point to where the bad call really entered the system.

I do not think this is mostly a culture problem. Culture matters, but this specific failure is structural. If the workflow makes it easy to generate code, easy to merge code, and hard to reconstruct the reasoning behind the code, you have built a responsibility gap on purpose whether you meant to or not.

The junior developer analogy stops helping when the system gets agentic

“Treat AI like a very fast junior developer” is still the most useful simple analogy in our space. For inline suggestions and small code completions, I still think it helps. The model is productive, often plausible, occasionally reckless, and in need of supervision.

The problem is that the analogy starts to break as soon as the tooling moves from helper to part of the delivery system. Sonar’s 2026 State of Code survey found that 64% of developers had already started using AI agents that can take multi-step actions on their own. That is not autocomplete anymore. That is a delivery system with delegated steps.

Once you have planner agents, implementation agents, test-generation agents, review agents, or chat-driven “vibe coding” loops that produce whole slices of behavior at a time, the human role starts drifting from author to approver. That is a different professional posture. You are no longer reviewing a clearly owned piece of work from a junior teammate. You are approving output from a chain of models and prompts that may have made several design decisions before you ever saw the diff.

Java teams can make this even stranger because the generated code is often wrapping another less predictable system. If an assistant generates a Quarkus service that calls an LLM through LangChain4j, the uncertainty doubles. One model is producing the answer. Another model shaped the code that decides how that answer is called, retried, authorized, and interpreted. That kind of stack can go weird in ways unit tests do not fully cover.

This is why the “junior developer” framing flatters the model a bit too much. A junior developer can explain why they made a choice, even if the choice was wrong. An AI agent can generate a very clean implementation of a bad idea without leaving a clear account of why that idea won. Or even where it came from.

Oversight has to be designed, not merely promised

The most useful idea in the recent responsibility-gap literature is also the least glamorous. Mao and coauthors argue for “human oversight requirements” as something you design into the workflow on purpose, not as vague expectations bolted on after the code exists (arXiv:2511.13069). I think that is exactly the right direction.

Most teams treat oversight as a ritual. Somebody reviews the pull request. Somebody signs off. Somebody says “looks good.” That is review theater unless the workflow makes clear where human judgment is required, what that judgment is supposed to evaluate, and how it is captured.

The useful questions are simple and concrete.

Where does a human have veto power? Which steps are allowed to be fully automated, and which ones are not? What kinds of changes require a named reviewer with domain context rather than a general peer review? If an agent can generate a schema migration, expose a new tool, modify authorization logic, or wire an automated approval flow, where exactly is the human checkpoint?

For Java teams building agentic systems, those questions should show up in the design itself. If a Quarkus service gives an agent the ability to call internal tools, mutate records, or kick off downstream workflows, then the human oversight touchpoints are part of the architecture. They are not forms you fill out later.
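One way to make such a checkpoint concrete is a small gate in front of every tool call. What follows is a minimal sketch in plain Java, not Quarkus or LangChain4j API. The names `ApprovalGate`, `ToolCall`, `Action`, and `Decision` are hypothetical, and the set of actions that need a human is an assumption chosen to match the examples above.

```java
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch: these types are illustrative, not Quarkus or LangChain4j API.
final class ApprovalGate {

    // Actions an agent might request; the list is an assumption for this example.
    enum Action { READ_RECORD, MUTATE_RECORD, SCHEMA_MIGRATION, CHANGE_AUTHORIZATION }

    enum Decision { ALLOW, BLOCK_NEEDS_HUMAN }

    // A proposed tool call plus the named human who approved it, if anyone did.
    record ToolCall(String agentId, Action action, Optional<String> approvedBy) {}

    // Actions that must not run without a named human approver.
    private static final Set<Action> NEEDS_HUMAN = Set.of(
            Action.MUTATE_RECORD, Action.SCHEMA_MIGRATION, Action.CHANGE_AUTHORIZATION);

    static Decision check(ToolCall call) {
        // A blank name does not count as an approver.
        boolean hasApprover = call.approvedBy().filter(name -> !name.isBlank()).isPresent();
        if (NEEDS_HUMAN.contains(call.action()) && !hasApprover) {
            return Decision.BLOCK_NEEDS_HUMAN;
        }
        return Decision.ALLOW;
    }

    public static void main(String[] args) {
        ToolCall migration = new ToolCall("impl-agent", Action.SCHEMA_MIGRATION, Optional.empty());
        System.out.println(check(migration)); // prints BLOCK_NEEDS_HUMAN
    }
}
```

The point of the sketch is that the control point has a name and a location in the code. Where the approver's identity comes from, and how it is verified, is left open here.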

That is also why I do not love generic statements like “human in the loop” when they are left hanging in the air. Human in the loop where? Looking at what? With what authority to stop the system? Under what time pressure? Oversight that cannot name a control point is a slogan. For comparison, see also my article about history lessons.

The paper trail has to survive the sprint

One of the more revealing pieces of research here is Obada Kraishan’s preprint on AI attribution practices in open source. Looking at 14,300 GitHub commits across 7,393 repositories, the study judged 95.2% of its sample to show AI use, while only 29.5% said that explicitly (arXiv:2512.00867). That is not a small etiquette gap. It is a tracking problem.

If teams are already using AI heavily but only sometimes leaving a clear trail, then a lot of governance talk is starting from the wrong layer. You cannot audit what you did not record. You cannot learn from incidents you cannot reconstruct. You cannot defend a responsible process that only existed in private chat tabs and half-remembered prompts.

I do not think this means every commit needs a little compliance ceremony attached to it. I do think significant AI involvement needs to survive the sprint. At a minimum, that means making the pull request say which tool materially contributed, which areas were generated or heavily transformed, and who now owns the behavior. In higher-risk contexts, I would go further and record the relevant prompt or task summary, the controls in force, and the human reviewer responsible for the hard parts.
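To show how small that minimum can be, here is a sketch of a provenance record that a merge check could validate. It is an assumption-laden illustration, not an existing tool or format: `AiProvenance`, `Entry`, and the field names are hypothetical, and the rule that high-risk changes also need a task summary is my reading of the paragraph above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a pull request provenance record; not an existing format.
final class AiProvenance {

    // What a pull request would declare about significant AI involvement.
    record Entry(String tool, List<String> generatedAreas, String owner,
                 String taskSummary, boolean highRisk) {}

    // Returns the names of required fields that are missing or blank.
    static List<String> missingFields(Entry entry) {
        List<String> missing = new ArrayList<>();
        if (isBlank(entry.tool())) missing.add("tool");
        if (entry.generatedAreas() == null || entry.generatedAreas().isEmpty()) {
            missing.add("generatedAreas");
        }
        if (isBlank(entry.owner())) missing.add("owner");
        // Assumption: only higher-risk changes must also record the prompt or task summary.
        if (entry.highRisk() && isBlank(entry.taskSummary())) missing.add("taskSummary");
        return missing;
    }

    private static boolean isBlank(String value) {
        return value == null || value.isBlank();
    }

    public static void main(String[] args) {
        Entry entry = new Entry("some-assistant", List.of("persistence layer"), "", null, true);
        System.out.println(missingFields(entry)); // prints [owner, taskSummary]
    }
}
```

A check like this does not prove the declaration is true. It only makes sure the question of ownership gets asked before merge rather than during the outage call.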

People hear “paper trail” and immediately imagine bureaucracy. What I imagine is a future outage call where somebody asks a very plain question: why does this logic exist and who checked it? If your best answer is “I think Cursor did the first pass and then we all kind of edited it,” you do not have a paper trail. You have a bunch of half-drunk teenagers in a room when an adult turns on the light at the end of the party.

Governance belongs in the harness

This is the part where a lot of organizations still hide behind policy PDFs.

A policy that says “all AI-generated code must be carefully reviewed” is not governance. It is aspiration. Real governance lives in the harness: custom instructions, approved dependency sets, rule files, secret scanning, required checks, architectural linting, forbidden patterns, repository templates, and merge gates that force the expensive questions to show up before production.

GitHub’s own Copilot documentation says Copilot’s suggestions may be insecure and should always be reviewed and tested. That warning is honest, but it still leaves the actual enforcement problem with you. The only durable answer is to encode as much of your organization’s judgment as possible near the generation path and the merge path, not in a training deck nobody rereads after onboarding.

This is why I care more about rule systems than slogans. Whether you call them repository instructions, internal guardrails, or harness policy, the point is the same. If you know your estate should never suggest a deprecated crypto primitive, invent a new internet-facing endpoint without auth review, or pull an unapproved library into a regulated service, then make that constraint executable.
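As a rough illustration of what “executable” can mean, the sketch below scans source text for two weak crypto choices and reports file and line. It is deliberately crude: a regex scan, not a real linter, and a production setup would more likely use a static analysis or architecture testing tool. The class name `ForbiddenPatternCheck` and both rules are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: a crude regex-based merge check, not a real linter.
final class ForbiddenPatternCheck {

    record Rule(String name, Pattern pattern) {}

    // Example rules only; a real estate would maintain its own list.
    private static final List<Rule> RULES = List.of(
            new Rule("weak-hash",
                    Pattern.compile("MessageDigest\\.getInstance\\(\\s*\"(MD5|SHA-?1)\"")),
            new Rule("des-cipher",
                    Pattern.compile("Cipher\\.getInstance\\(\\s*\"DES[/\"]")));

    // Returns one finding per matching line and rule, formatted as "file:line rule".
    static List<String> scan(String fileName, String source) {
        List<String> findings = new ArrayList<>();
        String[] lines = source.split("\\R", -1);
        for (int i = 0; i < lines.length; i++) {
            for (Rule rule : RULES) {
                if (rule.pattern().matcher(lines[i]).find()) {
                    findings.add(fileName + ":" + (i + 1) + " " + rule.name());
                }
            }
        }
        return findings;
    }

    public static void main(String[] args) {
        String source = "class Demo {\n  Object d = MessageDigest.getInstance(\"MD5\");\n}";
        System.out.println(scan("Demo.java", source)); // prints [Demo.java:2 weak-hash]
    }
}
```

A non-empty result would fail the build. The value is less in the regex than in the fact that the rule now runs on every change, whether a human or a model wrote it.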

A recent vision paper on trustworthy AI software engineers argues that transparency, a clear trail, and accountability are part of trustworthy practice, not optional extras. I think that is the right way to look at it. The harness is where those properties become visible.

Review has to shift from style to failure modes

AI also changes what good review looks like. When I review human-written code, I still care about maintainability, naming, duplication, and design shape. When I review AI-generated code, I care about those things too, but I care even more about the failure modes the model is statistically likely to get wrong: bad defaults, weak auth assumptions, injection risk, missing back-pressure, brittle retries, accidental data exposure, and cheerful little happy-path tests that prove nothing useful.

Sonar’s 2026 survey results are helpful again here. Developers reported that reviewing AI-generated code often requires more effort than reviewing human-written code, and that AI frequently produces code that looks correct but is not reliable. That sounds true to me. The review burden did not disappear. It moved.

Also read my other piece on where the real burden now sits for senior Java developers: The Hidden Cost of AI Coding for Senior Java Developers (Markus Eisele, Apr 20).

So the review question changes from “does this look clean?” to “how does this fail?” For a Jakarta EE endpoint, that means checking transaction boundaries, validation, retries, timeouts, serialization assumptions, and the ugly path where downstream systems drift or slow down. For AI-generated tests, it means asking whether they probe the behavior that actually scares you or just mirror the implementation back at itself.

This is also where adversarial testing earns its keep. AI is excellent at producing a plausible implementation and an equally plausible test suite that agrees with it. Those two things can be wrong together for a surprisingly long time. We have all seen this already.

The deeper question is whether you delegated work or judgment

I think this is the part most AI coding articles avoid because it is uncomfortable.

There is a real difference between delegating work and delegating judgment. When a senior engineer asks a junior teammate to implement a feature and then reviews the pull request, the work was delegated. The judgment was not. The review step is exactly where judgment stays human.

Agentic AI muddles that line. Once the model is selecting patterns, shaping control flow, choosing abstractions, deciding error handling, and proposing architecture changes, the human reviewer may not be validating a decision they understand. They may be signing off on a decision they inherited.

That is a professional problem, not just a tooling problem. Responsible AI research in software practice keeps landing on a similar point: organizations talk a lot about principles, but accountability and transparency are still under-implemented in day-to-day engineering work (Leca, Bento, and Santos, 2025). We like the ethics language. We are much less enthusiastic about the workflow changes that would make the language real. (“Hey, your AI is writing git notes and making merging so much harder!”)

For me, the line is simple. If I cannot explain what an AI-generated block does, why it is shaped that way, and what I am trusting around it, I have not really reviewed it. I may have approved it. That is not the same thing.

And yes, that implies something unfashionable: sometimes the correct professional move is to rewrite working AI-generated code by hand. Not because hand-written code is morally pure, which would be silly. Because if you cannot read it and explain it, you do not really own it.

What this means for Java teams right now

If you are using AI to generate services, REST endpoints, persistence layers, or integration glue, every generated boundary needs an actual human owner before merge. Not a team-shaped cloud of ownership. One person who can answer for the logic.

If you are exposing internal systems as tools for agents, document the approval points the same way you document the API surface. Who decided this tool was safe to expose? What can it change? What is logged? What requires a second gate? If the answers live only in a demo session, they are gone.

If AI-generated code is calling another AI system, test the bad paths on purpose. Timeouts, malformed model output, prompt injection attempts, partial downstream failure, retry storms, and authorization drift are the interesting parts. The successful first demo call is not interesting.
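Two of those bad paths, timeouts and malformed model output, can be handled with very little code. The sketch below assumes the model call is wrapped in a plain `Supplier<String>` and that the only acceptable answers are the literal strings `APPROVE` and `REJECT`. Both are assumptions for illustration, and `GuardedModelCall` is a hypothetical name, not part of LangChain4j.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical sketch: the model call is any Supplier<String>, not a real client API.
final class GuardedModelCall {

    enum Outcome { APPROVE, REJECT, NEEDS_HUMAN }

    // Runs the call with a deadline; anything unexpected falls back to NEEDS_HUMAN.
    static Outcome classify(Supplier<String> modelCall, Duration timeout) {
        CompletableFuture<String> future = CompletableFuture.supplyAsync(modelCall);
        try {
            return parse(future.get(timeout.toMillis(), TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // Marks the future as cancelled; it does not interrupt a task already running.
            future.cancel(true);
            return Outcome.NEEDS_HUMAN;
        } catch (ExecutionException e) {
            return Outcome.NEEDS_HUMAN; // the call itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Outcome.NEEDS_HUMAN;
        }
    }

    // Accepts only the exact expected tokens; free text is treated as malformed.
    static Outcome parse(String raw) {
        if (raw == null) {
            return Outcome.NEEDS_HUMAN;
        }
        return switch (raw.strip()) {
            case "APPROVE" -> Outcome.APPROVE;
            case "REJECT" -> Outcome.REJECT;
            default -> Outcome.NEEDS_HUMAN;
        };
    }

    public static void main(String[] args) {
        System.out.println(parse("Sure! I approve this request.")); // prints NEEDS_HUMAN
    }
}
```

The design choice worth noticing is the default. When the model is slow, broken, or chatty, the code does not guess. It hands the decision back to a person.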

If your team is using multiple agents, stop talking about them like friendly ghosts in the IDE and start describing them as architecture. Which agent is allowed to do what? What output does each one produce? Where are the human checkpoints? If you cannot sketch the workflow, you probably do not control it.
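Sketching the workflow can be as literal as writing it down as data. The example below is a hypothetical model, not a feature of any agent framework: `AgentWorkflow`, `Stage`, and `Capability` are invented names, and the rule it checks (every stage that may write code needs a human checkpoint at or after it) is one possible policy, assumed for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: a multi-agent pipeline described as data, not a framework API.
final class AgentWorkflow {

    enum Capability { READ_REPO, WRITE_CODE, RUN_TESTS }

    // One agent step: what it may do, what it produces, and whether a human reviews after it.
    record Stage(String agent, Set<Capability> allowed, String output,
                 boolean humanCheckpointAfter) {}

    // Returns agents that may write code with no human checkpoint at or after their stage.
    static List<String> writersWithoutCheckpoint(List<Stage> stages) {
        List<String> unchecked = new ArrayList<>();
        boolean checkpointAhead = false;
        // Walk backwards so each stage knows whether a checkpoint follows it.
        for (int i = stages.size() - 1; i >= 0; i--) {
            Stage stage = stages.get(i);
            checkpointAhead = checkpointAhead || stage.humanCheckpointAfter();
            if (stage.allowed().contains(Capability.WRITE_CODE) && !checkpointAhead) {
                unchecked.add(0, stage.agent());
            }
        }
        return unchecked;
    }

    public static void main(String[] args) {
        List<Stage> pipeline = List.of(
                new Stage("planner", Set.of(Capability.READ_REPO), "task plan", true),
                new Stage("implementer", Set.of(Capability.WRITE_CODE), "diff", false),
                new Stage("tester", Set.of(Capability.RUN_TESTS), "test report", false));
        System.out.println(writersWithoutCheckpoint(pipeline)); // prints [implementer]
    }
}
```

In this example the only human checkpoint sits after planning, so the implementer's output reaches the end of the pipeline without anyone looking at it. Seeing that in a list of three records is easier than discovering it in production.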

And during review, ask one question that feels almost embarrassingly basic: who owns this logic? The fact that it feels basic is the point. Basic questions are what disappear first when velocity becomes the house style.

The job is not to prompt better. It is to own better.

The tools have outrun the old rituals. That is really what all of this comes down to.

We used to get some accountability for free because human authorship, human intent, and human review were tightly coupled most of the time. AI-assisted development loosens that coupling. Code can arrive faster than judgment, and clean diffs can hide weak accountability surprisingly well.

The answer is not to ban the tools or pretend the productivity gains are fake. The answer is to rebuild the workflow so accountability survives the new speed. That means explicit oversight points, a better paper trail, executable guardrails, review calibrated for AI-shaped failure modes, and a willingness to reject code that nobody can honestly defend.

The hard part of this era is not learning to prompt better. It is learning to preserve responsibility when authorship gets weird. The teams that figure that out will not look slower for long. They will look sane.
