Why Smart Developers Get Bad Results from AI Tools
The mental shifts that matter more than the tool itself
You’ve probably already tried it. A colleague demoed GitHub Copilot, or you read enough about Cursor to give it a weekend. You asked it something real, not a toy problem, something from your actual work, and what came back was plausible enough to be dangerous. It looked right. It compiled. It was subtly wrong in a way that would have taken you an hour to debug if you hadn’t caught it first.
So you caught it, closed the tab, and went back to doing the thing properly.
That’s not an irrational response. It’s the response of someone who has high standards and limited patience for tools that create more work than they save. After twenty years of watching the industry cycle through silver bullets, a healthy immune system is a professional asset.
But here’s the thing worth sitting with: the experience you just described isn’t evidence that AI coding tools don’t work. It’s evidence that you used them the way every experienced developer uses them the first time: as a smarter search engine, one that should produce a correct answer in exchange for a well-formed query. That model is wrong, and nobody tells you it’s wrong before you start. The tooling tutorials skip straight to the keyboard shortcuts.
This article is the step they left out.
It’s not about features. It’s about three shifts in how you think about the interaction. The shifts that experienced developers resist precisely because their existing intuitions are so well-calibrated. Your instincts are correct for the tools you’ve spent your career with. They misfire with this one, in specific, learnable ways.
If you’ve already written AI tools off, I’d ask for twenty minutes and one honest experiment before the verdict stands.
The Search Engine Trap
When you need to know something, you know how to ask. You’ve spent years getting good at it. You learned to break down problems, find the right terms, and filter the noise from the results. That skill is real, and it transfers to most tools you’ll encounter.
It doesn’t transfer here.
When you fire a terse, precise query at an AI coding tool, you’re doing exactly what your instincts tell you to do. And the tool will respond. Confidently, fluently, with something that looks like an answer. The problem isn’t that it’s wrong. The problem is that it had almost nothing to work with, so it filled the gaps with reasonable guesses. Reasonable guesses look like answers until they don’t.
Here’s what that looks like in practice. Say you’re working on a service that processes order events from a Kafka topic, and you need to handle duplicate messages. You ask:
“Kafka consumer idempotency Java”
You get back a pattern. It’s a real pattern. It might even be the right one for your situation. But the tool doesn’t know your situation. It doesn’t know your broker version, your consumer group setup, how your offsets are managed, or what “duplicate” actually means in your domain. It made assumptions about all of it. Some of them are probably wrong.
Now try this instead. Before you ask anything, spend three sentences describing the situation:
“I have a Quarkus service consuming order events from Kafka. Messages can be delivered more than once due to retries on the producer side. I need to detect and skip duplicates without adding a database dependency.”
Then ask the same question.
The output changes significantly because it stopped guessing. You gave it the shape of your problem, and it worked inside that shape instead of inventing one.
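To make that concrete: with the no-database constraint stated, the suggestion tends toward a bounded, in-memory duplicate check keyed on a business identifier. Here is a minimal sketch of that idea in plain Java; the class name, the LRU window, and the assumption that each event carries a usable ID are illustrative choices, not a prescription.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Bounded, in-memory duplicate detector: remembers the last N event IDs.
// This fits the briefing's scenario, where duplicates come from producer
// retries and therefore arrive close together. It is not exactly-once
// semantics: the window is lost when the consumer restarts.
class RecentEventDeduplicator {
    private final Set<String> seen;

    RecentEventDeduplicator(int capacity) {
        // Access-ordered LinkedHashMap turned into an LRU set: once the
        // window is full, the oldest ID is evicted to make room.
        this.seen = Collections.newSetFromMap(
            new LinkedHashMap<String, Boolean>(capacity, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                    return size() > capacity;
                }
            });
    }

    /** Returns true the first time an ID is seen, false for a duplicate. */
    synchronized boolean firstTime(String eventId) {
        return seen.add(eventId);
    }
}
```

In the consumer, processing then becomes a guard: skip the event unless `firstTime(event id)` returns true. Note how the sketch also surfaces the tradeoff the briefing implied: the window survives producer retries but not a consumer restart, which is exactly the kind of constraint worth stating in the next message if it matters.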
This is the first shift: the first message is a briefing, not a query. You’re not looking something up. You’re establishing shared context with something that knows a lot about software in general and nothing about your problem in particular.
Senior developers resist this because it feels slow. You’re used to precision paying off fast. Here, a little more upfront investment returns something much closer to useful on the first response. Which ultimately means less time debugging output that looked right but wasn’t.
It won’t always work. But when it doesn’t, you’ll know why, and you’ll know what to add next. That’s a different relationship with the tool than firing queries and grading results.
Building Context Is a Skill
The briefing idea sounds simple, but it’s worth being specific about what good context actually contains. It is different from what you would type into a search engine, and different again from what you’d put in a bug report.
A search query is optimized for keyword matching. You strip out everything that isn’t signal: no sentences, no explanation, just the terms most likely to surface relevant documents. That’s the right move for search. It’s exactly the wrong move here.
A bug report is optimized for reproducibility. You include steps, environment details, expected vs. actual behavior. Useful structure, but still focused on what’s broken rather than what you’re trying to build.
Context for an AI tool is neither of those. It’s closer to what you’d say to a colleague before asking them to pair with you. The minimum they’d need to know not to go in the wrong direction. That usually means four things:
What you’re working with. Not just the language, but the framework, the version, the constraints you’re operating inside. “Java” is almost useless. “Quarkus 3, native compilation target, no reflection” is something to work with.
What you’re trying to achieve. The actual goal, not the intermediate step. Developers often ask about the implementation they’ve already decided on, when the more useful conversation would be about the outcome. If you describe the goal, the tool can sometimes tell you the implementation is unnecessary.
What you’ve already ruled out. This one is underused. If you’ve already decided you don’t want to introduce a new dependency, or you can’t change the database schema, or the solution has to work without a network call, say so upfront. Otherwise you’ll get suggestions that require exactly what you’ve ruled out, and you’ll waste a round correcting it.
What “good” looks like. A working solution isn’t always enough. If the code needs to be readable by junior developers, or needs to fit an existing pattern in the codebase, or has specific performance requirements, that is important context too.
Putting that together, the difference looks something like this. Instead of:
“REST client retry logic Java”
You write:
“I’m building a Quarkus REST client that calls an external shipping API. The API is occasionally slow and returns 503s under load. I need retry logic with exponential backoff. I’m already using SmallRye Fault Tolerance, so I’d prefer to stay in that ecosystem rather than adding a separate library. The solution needs to be easy to configure per-environment via application.properties.”
That’s five sentences. It takes thirty seconds to write. And it eliminates most of the ways the response could go wrong before it starts.
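For a briefing like that, a reasonable answer stays inside SmallRye Fault Tolerance’s declarative retry support. Here is a hedged sketch of the shape, assuming the quarkus-smallrye-fault-tolerance extension is present; the class, method, and domain types are made up, and the config keys in the comment follow the MicroProfile Fault Tolerance override convention, which is worth checking against your version:

```java
import org.eclipse.microprofile.faulttolerance.Retry;
import org.eclipse.microprofile.rest.client.inject.RestClient;

import io.smallrye.faulttolerance.api.ExponentialBackoff;
import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class ShippingGateway {

    @RestClient
    ShippingApi shippingApi; // hypothetical MicroProfile REST client interface

    // MicroProfile @Retry plus SmallRye's @ExponentialBackoff.
    // The annotation values are defaults; MicroProfile Fault Tolerance lets
    // you override them per environment in application.properties, e.g.:
    //   com.example.shipping.ShippingGateway/requestLabel/Retry/maxRetries=5
    //   com.example.shipping.ShippingGateway/requestLabel/Retry/delay=200
    @Retry(maxRetries = 4, delay = 200, jitter = 100)
    @ExponentialBackoff(factor = 2, maxDelay = 5_000)
    public Label requestLabel(Order order) { // Label and Order are placeholders
        return shippingApi.createLabel(order);
    }
}
```

The point is not this exact code. It’s that every constraint in the briefing shows up in the answer: the existing ecosystem, the backoff behavior, the per-environment configuration path.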
The other difference from search: you don’t stop after one exchange. Search is a single transaction: you query, you evaluate, you either use the result or try again with different terms. A conversation with an AI tool is iterative. If the first response is close but not right, you don’t start over. You say what’s wrong and keep going. That’s not a failure mode. That’s how it works.
The Verification Reflex
By this point you might be thinking: fine, the output is better, but I still have to check all of it. And that’s expensive.
You’re right that you have to check it. But “check all of it the same way” is where experienced developers often lose the efficiency gain they could otherwise get.
Here’s the thing: AI output fails in patterns. It’s not randomly wrong. Once you’ve worked with it for a while, you start to recognize where the gaps tend to appear, and that means you can review intelligently instead of uniformly.
It’s generally strong at structure. If you ask it to scaffold a REST endpoint, implement a design pattern, or translate an algorithm into code, the shape of the output is usually right. The overall approach, the method signatures, the flow of control tend to hold up.
It’s weaker at constraints. Anything that requires knowing something specific about your environment, your exact library version, a quirk in your configuration, a behavior that changed between framework releases, is where errors appear. The tool knows the general case. It doesn’t know your particular case unless you told it.
It’s unreliable at the edges of its training. Newer library versions, recently deprecated APIs, framework features that changed in the last major release. This is where you’ll find confident-sounding answers that are simply outdated. It won’t tell you it’s uncertain. It will just answer.
So the verification reflex to develop isn’t “read every line with equal suspicion.” It’s more like: trust the structure, check the specifics, and always verify anything that involves a version, a configuration value, or behavior you can’t immediately explain.
A concrete example: you ask for help implementing MicroProfile JWT validation in a Quarkus application. The scaffolding comes back looking right: the annotations, the injection points, the basic flow. But the specific property names in application.properties, the claim mapping behavior, the way token expiry is handled, those are worth checking against the actual documentation for your version. The structure earned some trust. The details didn’t.
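To make the structure-versus-details split visible, here is roughly what that scaffolding looks like, assuming the quarkus-smallrye-jwt extension; the resource, endpoint, and role name are invented, and the commented property names are precisely the kind of detail to verify against your version’s documentation:

```java
import org.eclipse.microprofile.jwt.JsonWebToken;

import jakarta.annotation.security.RolesAllowed;
import jakarta.inject.Inject;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;

// The "structure" part: annotations, injection point, flow of control.
// This is the part that tends to hold up.
@Path("/orders")
public class OrderResource {

    @Inject
    JsonWebToken jwt; // the verified token, injected by the runtime

    @GET
    @RolesAllowed("orders-reader") // role name is an assumption
    public String list() {
        return "orders for " + jwt.getName();
    }
}

// The "details" part, in application.properties -- check these keys and
// their behavior against your Quarkus version before trusting them:
//   mp.jwt.verify.publickey.location=https://idp.example.com/certs
//   mp.jwt.verify.issuer=https://idp.example.com
```

Everything above the comment block is structure; everything in it is a specific. Review effort belongs on the second half.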
This is a meaningful shift from how you’d evaluate a Stack Overflow answer. There, you’re reading a human response that was written for a specific situation, often with caveats and comments from people who found edge cases. Here, you’re reading output that sounds authoritative regardless of how confident it should be. The calibration signal has to come from you.
The other thing worth naming: a wrong answer isn’t the end of the conversation. When you find an error, the move is to say so specifically and keep going — not to dismiss the session and start fresh. “This won’t work because my version of Hibernate ORM doesn’t support that syntax, it changed in 7.x. Adjust for the new API.” That one sentence usually produces a corrected response faster than restarting with a new query.
Context Is the API
The last shift is the one that feels most backwards if you think about it the wrong way.
Developers work with APIs. You know that the quality of your output depends on writing correct code against a well-defined interface. Precision matters. The right method name, the right parameters, the right types. Ambiguity is a bug.
AI tools invert this. There’s no fixed interface. There are no correct parameter names. What you put in is natural language, and the quality of what comes out scales with how well you describe the situation — not how precisely you phrase the request.
That feels uncomfortable. It feels like the tool is rewarding vagueness and punishing expertise. It isn’t. It’s rewarding relevant context and punishing assumed context. The difference matters.
When you call a library method, the library knows its own internals. You don’t need to explain them. When you talk to an AI tool, it knows software in general. It knows nothing about your system unless you tell it. Every relevant detail you leave out is a gap it will fill with a guess.
The mental model that works: you’re the senior engineer, and you’ve just brought in a capable contractor. They’re good. They’ve worked on systems like yours before. But they walked in this morning, they haven’t read your codebase, and they don’t know your constraints. What do you tell them before they write the first line?
You’d tell them about the system. The patterns you use. The things you’ve already decided. The things you want to avoid. Not because they’re incapable, but because without that information, even a capable person will make reasonable-sounding decisions that don’t fit your situation.
That conversation is your prompt. And unlike a contractor, you can have it in thirty seconds, iterate on it immediately, and throw it away if it goes in the wrong direction.
Here’s what that looks like in practice. You’re adding an observability layer to a Quarkus service. You could ask:
“How do I add metrics to a Quarkus application?”
Or you could write:
“I have a Quarkus 3 service deployed on Kubernetes. I’m using Micrometer with the Prometheus registry. I want to add a custom counter that tracks the number of orders processed, broken down by status code. The metric needs to be visible in our existing Grafana dashboards, which scrape from the standard /q/metrics endpoint. I don’t want to add any new dependencies.”
The first question will get you the getting-started tutorial repackaged. The second will get you something close to what you actually need, because you described what you actually need.
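A hedged sketch of where the second prompt points, staying on the Micrometer API the briefing says is already in place; the class and metric names are assumptions, not established conventions:

```java
import io.micrometer.core.instrument.MeterRegistry;

import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class OrderMetrics {

    private final MeterRegistry registry;

    OrderMetrics(MeterRegistry registry) { // injected by the Quarkus runtime
        this.registry = registry;
    }

    // Micrometer keeps one counter instance per distinct tag value, so calling
    // this per event is cheap. With the Prometheus registry active, it shows
    // up on the existing /q/metrics endpoint as something like
    // orders_processed_total{status="200"} -- no new dependencies involved.
    public void recordProcessed(int statusCode) {
        registry.counter("orders.processed", "status", Integer.toString(statusCode))
                .increment();
    }
}
```

Every sentence of the briefing constrained this answer: the registry that’s already there, the tag breakdown, the scrape endpoint, the no-new-dependencies rule.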
The investment is real. It takes longer to write the second prompt than the first. But compare it to the alternative: write the short prompt, get a generic answer, figure out what’s wrong with it, search for the specific thing it got wrong, find a Stack Overflow thread from three years ago that’s almost right, adapt it to your version, test it. That’s the workflow the short prompt leads to. The longer prompt skips most of it.
Context Doesn’t Have to Die When the Chat Does
There’s a frustrating thing that happens once you start getting real value from these conversations: the session ends, and the context you built up goes with it. Next time you open a chat, the contractor has forgotten everything. You’re back to day one.
This is where most developers either give up on continuity or start pasting the same paragraph at the top of every session. Neither scales.
The better approach is to treat your context as an artifact you maintain alongside your code, not something you rebuild from scratch each time.
The simplest version is a plain text file in your repository. Some tools look for it by convention: Claude Code picks up CLAUDE.md, Cursor reads .cursorrules, GitHub Copilot can be steered through a copilot-instructions.md in your .github folder. The names and locations vary, but the idea is the same. You write down the things that would otherwise live in your opening briefing — the stack, the constraints, the patterns you follow, the things you want to avoid — and the tool loads them automatically when you start a session.
A useful CLAUDE.md for a Quarkus project might contain things like: which version of Quarkus you’re targeting, whether you’re compiling to native, which extensions are already in use, your preferred approach to error handling, and any patterns that are established in the codebase that new code should follow. It doesn’t have to be long. A dozen lines that eliminate the most common wrong turns is worth more than a comprehensive document nobody maintains.
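As an illustration only, with every specific standing in for your project’s own facts, such a file might look like:

```markdown
# Project context

- Quarkus 3.x, compiled to native with GraalVM; avoid reflection-heavy libraries.
- Extensions already in use: RESTEasy Reactive, Hibernate ORM with Panache,
  SmallRye Fault Tolerance.
- Error handling: throw domain exceptions, mapped centrally via ExceptionMapper;
  no catch-and-log in business code.
- New endpoints follow the existing resource/service/repository layering.
- Do not introduce new dependencies without flagging it explicitly.
```

Five lines of constraints like these, accurate for your project, prevent the same five wrong turns in every session.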
The more sophisticated version of this idea is what agentic tools are starting to build on. When you use something like Claude Code or Cursor’s agent mode, the tool isn’t just responding to your messages, it’s reading files, running commands, and building its own picture of your project. The context file becomes a starting point rather than the whole story. You’re still the one who decides what’s in it, but the tool can fill in gaps by looking at the code itself.
This is still early. The tools are inconsistent about what they read, when they read it, and how much weight they give it. But the direction is clear: the context you invest in today doesn’t have to be throwaway. It can accumulate. The contractor analogy holds here too. The goal is to get to a point where starting a new session feels less like day one and more like picking up where you left off.
For now, the practical advice is simple: when you have a session that goes well, before you close it, ask the tool to summarize the decisions made and the constraints that shaped them. Paste that into your context file. It takes two minutes. Over time, that file becomes a working record of how your project thinks — and the next session starts smarter because of it.
One Experiment Before You Decide
If none of this has moved you yet, I’m not going to push harder. Skepticism earned through experience doesn’t dissolve because someone wrote a convincing article.
But here’s a specific thing worth trying before the verdict stands.
Pick something real from your current work. Not a toy problem — something you’d actually spend time on this week. A tricky refactoring, an integration you’ve been putting off, a design decision you’re not sure about. Write a briefing for it: four or five sentences covering what you’re working with, what you’re trying to achieve, what you’ve ruled out, and what good looks like. Then have the conversation, not just the query.
If the output is still wrong, note where it failed. Is it the structure, or is it a specific constraint it didn’t know about? If it’s the latter, add the constraint and try again. That iteration — that’s the workflow.
You don’t have to change how you work based on one session. But if you get one moment where the tool saves you real time on something real, not a demo, not a tutorial, your actual work, that’s worth knowing. Even if the rest of the verdict stays the same.