Why Quarkus Agent Skills Matter More Than Another Model Upgrade
A practical argument for framework-aware skills as a way to reduce broken outputs, shorten recovery loops, and make agent tooling easier to trust.
Phillip Kruger from the Quarkus team pointed me to this quarkus-agent-mcp testing thread (issue #71) when we were talking about Quarkus Agent MCP and Quarkus Skills. I like threads like this because they are hard to polish into marketing. You get the wins, the bad runs, and the places where the tool still needs work.
Everybody loves a token chart. I care more about the app that boots, uses the right framework version, and does not fall into an authentication loop the first time somebody hits a real endpoint.
This thread is interesting for exactly that reason. It is small. It is messy. It is also more honest than most AI tooling claims because it shows failures, retries, and one case where the MCP-assisted path made a weaker model worse. Around May 4, 2026, the runs compared Quarkus app generation with and without Quarkus Agent MCP across three prompts and two Anthropic models. That gives us six live-fire runs, not a benchmark suite. Treat the numbers as indicators. The pattern is still worth paying attention to.
When I say skills here, I mean small framework-specific instruction files plus the tooling around them. The quarkus-agent-mcp README describes quarkus_skills as a way to load extension-specific patterns, testing guidance, and common pitfalls before the agent writes code. That is the useful part. I do not need my coding assistant to rediscover Quarkus from scratch on my branch.
If you want the short version, it is this: basic CRUD barely moved. Multi-extension work changed the picture fast. In the summary comment, the thread ended up at five working results with MCP and three without. For me, that is the headline. Cost matters. Time matters. A working app matters more.
Easy prompts flatten the story
The first prompt was a plain CRUD service: entity, REST endpoints, PostgreSQL, Panache, validation, tests. That is the kind of task a strong model has already seen enough times to fake competence convincingly, and in this case it did more than fake it. Both Opus 4.6 and Sonnet 4.5 produced working results with and without MCP.
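To ground that, here is the rough shape of the task, a minimal sketch rather than the code the thread generated. The Task entity, TaskResource, and field names are mine; the extensions named in the comments are the usual suspects for this stack:

```java
// A minimal sketch of the prompt-one shape, assuming quarkus-hibernate-orm-panache,
// quarkus-jdbc-postgresql, Quarkus REST with Jackson, and quarkus-hibernate-validator.
package org.acme;

import java.util.List;

import io.quarkus.hibernate.orm.panache.PanacheEntity;
import jakarta.persistence.Entity;
import jakarta.transaction.Transactional;
import jakarta.validation.Valid;
import jakarta.validation.constraints.NotBlank;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.POST;
import jakarta.ws.rs.Path;

@Entity
public class Task extends PanacheEntity {

    @NotBlank
    public String title; // validated because the endpoint takes @Valid Task
}

@Path("/tasks")
class TaskResource {

    @GET
    public List<Task> list() {
        return Task.listAll(); // Panache active-record query
    }

    @POST
    @Transactional
    public Task create(@Valid Task task) {
        task.persist();
        return task;
    }
}
```

Nothing in there is exotic. Both models have seen this shape thousands of times, which is why all four runs worked.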
The gains were still there, but they were modest. In the issue thread, Opus with MCP came in 18 percent cheaper than the no-MCP run, while Sonnet with MCP was 11 percent cheaper. Useful, yes. Dramatic, no. I actually like that result because it acts as a control. If an article opens by claiming skills are magic on boring CRUD, its readers should get suspicious immediately.
Generic model knowledge is already pretty good at scaffolding common Java web apps. The interesting question is what happens once the prompt stops looking like training-data comfort food and starts mixing framework behavior that is easy to get wrong.
The expensive part is recovery
The second and third prompts are where the thread starts paying rent.
Prompt two added WebSockets, a scheduler, and a health check. On Opus, the no-MCP run produced an app that would not start. It needed another debugging round before it became usable, which pushed the cost from $2.29 to $3.39 and the wall time from 9 minutes 22 seconds to 11 minutes 28 seconds. The MCP run finished in 5 minutes 21 seconds at $1.86 and worked on the first try.
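For a sense of what that prompt asks one application to hold, here is a sketch under assumed extension names (quarkus-websockets-next, quarkus-scheduler, quarkus-smallrye-health); class names and paths are illustrative, not from the thread:

```java
// Sketch of the prompt-two mix: a WebSockets Next endpoint, a scheduled job,
// and a health check living in one app.
package org.acme;

import io.quarkus.scheduler.Scheduled;
import io.quarkus.websockets.next.OnTextMessage;
import io.quarkus.websockets.next.WebSocket;
import jakarta.enterprise.context.ApplicationScoped;
import org.eclipse.microprofile.health.HealthCheck;
import org.eclipse.microprofile.health.HealthCheckResponse;
import org.eclipse.microprofile.health.Liveness;

@WebSocket(path = "/chat")
public class ChatSocket {

    @OnTextMessage
    public String onMessage(String message) {
        return "echo: " + message; // reply sent back to the client
    }
}

@ApplicationScoped
class HousekeepingJob {

    @Scheduled(every = "10s")
    void purgeStaleSessions() {
        // periodic cleanup; CDI scopes and lifecycles start to matter here
    }
}

@Liveness
@ApplicationScoped
class AppLivenessCheck implements HealthCheck {

    @Override
    public HealthCheckResponse call() {
        return HealthCheckResponse.up("app");
    }
}
```

Three extensions, three lifecycles, one app. Nothing here is hard to read, and yet this is the mix where the no-MCP Opus run produced an app that would not start.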
Prompt three added security, a REST client, and caching. On Sonnet, the no-MCP run got much uglier. It spent $5.31 and 25 minutes, reported success, and still had an authentication loop at runtime. The MCP run spent $2.42, finished in 11 minutes, and delivered a working app with 14 passing tests plus verified runtime behavior.
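Prompt three looks just as innocent on paper. Another sketch, with an invented upstream client (QuoteClient and its config key are hypothetical) and the standard annotations for security, REST client, and caching:

```java
// Sketch of the prompt-three mix: a secured endpoint that calls a typed REST
// client and caches the result. Assumes a security extension, the MicroProfile
// REST client, and quarkus-cache.
package org.acme;

import io.quarkus.cache.CacheResult;
import jakarta.annotation.security.RolesAllowed;
import jakarta.inject.Inject;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;
import org.eclipse.microprofile.rest.client.inject.RegisterRestClient;
import org.eclipse.microprofile.rest.client.inject.RestClient;

@RegisterRestClient(configKey = "quote-api") // hypothetical upstream service
@Path("/quotes")
interface QuoteClient {

    @GET
    String randomQuote();
}

@Path("/quote")
public class QuoteResource {

    @Inject
    @RestClient
    QuoteClient client;

    @GET
    @RolesAllowed("user")              // security
    @CacheResult(cacheName = "quotes") // caching
    public String quote() {
        return client.randomQuote();   // outbound REST client call
    }
}
```

Each annotation is trivial in isolation. Whether the authentication behind @RolesAllowed is actually wired correctly is exactly what “tests passed” does not prove, which is how the no-MCP run could report success and still loop on auth at runtime.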
That recovery spiral is the part I care about most. The real bill is not the first generation pass. The real bill arrives after the first wrong turn, when the model has to dig itself out of a hole it created five turns earlier and you are paying for every extra hop. A broken app is not a near miss. It is more work, more tokens, more reading, and usually less trust.
This is where a lot of AI tooling discussions get annoyingly abstract. People argue about request counts and token prices as if correctness were a decorative extra. It is the other way around. The difference between “done” and “broken but plausible” dominates the rest of the economics very quickly.
Skills start helping when the framework stops being generic
The issue thread gives a few concrete examples of what skills are buying.
The easy one is version freshness. In the CRUD prompt, the no-MCP run used Quarkus 3.17.7. The MCP run used Quarkus 3.35.1. That alone is not some glorious victory parade, but it is still real value. Current framework version, current conventions, current extensions, current APIs. If you are working in a fast-moving ecosystem, stale model memory is a tax even when the code still compiles.
The bigger gain is cross-extension behavior. WebSockets on their own are manageable. Schedulers on their own are manageable. Security, REST clients, and cache usage all look straightforward when you isolate them in docs. Put them together inside one generated app and the model has to navigate lifecycle rules, scope rules, testing loops, and a pile of small Quarkus conventions. That is exactly where a generic assistant starts improvising.
Improvisation is expensive in frameworks because it usually looks reasonable right up until it hits the runtime. A missing annotation, a wrong injection target, an outdated extension name, or a test path that never really exercised the failing code can all hide in “looks fine” territory for a while.
Skills help because they move some of that framework knowledge out of the model’s foggy memory and into explicit, current guidance. They can say which pattern is idiomatic, which one is a trap, which tool to use for tests, and what kind of failure to expect when something goes sideways. That is much better than hoping the model happens to remember a blog post from months ago.
The bad Sonnet run is the useful result
One of the six runs went the wrong way for MCP. Good. Keep it.
On Sonnet 4.5 with prompt two, the MCP-assisted run was worse than the no-MCP run. It cost 93 percent more, took longer, produced more code, and still failed. If I were trying to market skills, that is the result I would be tempted to wave away. It is also the result that makes the whole thread credible.
The reason matters even more than the number. The follow-up analysis in issue #117 and issue #118 turned the failure into two concrete engineering problems.
One was a missing skill warning: the WebSockets Next guidance did not steer the model away from injecting WebSocketConnection into an @ApplicationScoped bean, which led to ContextNotActiveException. The other was a tool failure: the Dev MCP test runner got stuck after mvn clean removed target/test-classes, and there was no useful recovery path from there.
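The scope trap is worth seeing in code, because it looks completely harmless. This is my reconstruction of the failure mode issue #117 describes, not code from the thread, plus one idiomatic way out:

```java
// In WebSockets Next, WebSocketConnection is bound to the connection scope.
// Injecting it into an @ApplicationScoped bean compiles and deploys fine...
package org.acme;

import io.quarkus.websockets.next.OnTextMessage;
import io.quarkus.websockets.next.WebSocket;
import io.quarkus.websockets.next.WebSocketConnection;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;

@ApplicationScoped
class BroadcastService {

    @Inject
    WebSocketConnection connection; // trap: looks reasonable, scope mismatch

    void notifyClients(String msg) {
        // ...and then throws ContextNotActiveException at runtime, because no
        // connection context is active when a plain application bean runs.
        connection.sendTextAndAwait(msg);
    }
}

// One safe shape: keep the connection inside the @WebSocket endpoint, where
// the connection scope is guaranteed to be active.
@WebSocket(path = "/notify")
class NotifySocket {

    @Inject
    WebSocketConnection connection; // fine here

    @OnTextMessage
    void onMessage(String msg) {
        connection.sendTextAndAwait("ack: " + msg);
    }
}
```

In skill terms, the prevention is roughly one warning line: do not inject WebSocketConnection outside the endpoint. Cheap to write, and it removes a whole failure class before the first line of application code is generated.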
That decomposition is the part I like. The bad run did not end as vague “model weakness.” It became framework knowledge and tool backlog. A better skill can prevent the first mistake next time. A better test tool can prevent the second spiral. You can improve both.
This is also why I think framework teams should care about skills. The happy path docs were never the whole problem. The sharp edges are the product. If your framework has scope traps, lifecycle rules, build quirks, and tooling recovery paths, the agent needs those encoded in a form it can actually load at the moment it is about to be wrong.
This is an argument for skills, not for hype
I would avoid turning this into “MCP saves 27 percent” or “skills make coding agents faster.” Those lines are too brittle, and they miss the point anyway.
The thread shows something more useful:
On easy prompts, skills help a bit because they keep the model current and a little leaner.
On harder prompts, skills help more because they reduce wrong turns across extension boundaries.
When the skill is incomplete or the tool stack is brittle, the failure becomes visible fast.
That visibility gives framework maintainers something concrete to fix.
For me, that last point is the real leverage. A plain prompt can fail and teach you almost nothing beyond “the model guessed badly.” A skill-based system can fail in a way that points back to a missing guardrail, an outdated convention, or a broken tool path. That is a much better engineering loop.
There is also a practical model economics angle here. In the issue summary, Sonnet with MCP on prompt three cost less than Opus without MCP on the same prompt and still delivered a working app. I would not generalize that into a universal law. I would absolutely keep it in mind when somebody tells you the only path to better agent results is buying the biggest model available.
Sometimes the cheaper move is giving the model better rails.
What I would tell framework teams
If you build a framework, an extension ecosystem, or internal platform tooling, I think the takeaway is pretty plain.
Ship skills. Ship them with the product. Version them. Review them. Treat them like code that shapes runtime outcomes, because that is what they are now.
Do not stop at the happy path. Put the common footguns in there. Put the testing guidance in there. Put the stale-version traps in there. Put the “this fails after mvn clean” recovery note in there until the tool is fixed. A perfect model will still benefit from that. A cheaper model may depend on it.
I also think tooling teams should pay more attention to recovery loops. Structured exceptions, resettable test runs, dynamic tool discovery, current docs, and framework-aware skills all matter because they keep the agent from burning turns on detective work a local tool already knows how to answer.
That is boring infrastructure work. Good. Boring infrastructure work is usually where the real gains live.
The real ROI
I do not think issue #71 proves a universal savings percentage. It is too small for that, and the thread itself is honest enough to show why. I do think it proves something more useful.
The real ROI of agent skills is fewer bad branches. Fewer stale assumptions. Fewer expensive recovery loops. Fewer “tests passed, runtime broken” victories. When a run does fail, a skill-based setup has a better chance of turning that failure into a concrete improvement instead of another vague complaint about model quality.
That is enough for me.
If I am choosing where to spend energy on agent tooling for a framework, I would spend it on skills long before I spent it on prettier benchmark slides.


