Why Benchmarking Java Frameworks Is Harder Than It Looks
Inside the engineering work behind the new Quarkus benchmark results—and what they teach us about JVM performance testing.
Framework benchmark posts usually follow the same pattern. Someone publishes a chart, someone crops the winning bar, and within a few hours people are treating one screenshot like architecture guidance. That is how benchmark discussions go wrong. The problem is not that numbers are useless. The problem is that numbers without context travel faster than the explanation that makes them meaningful.
That is why the new Quarkus benchmark work deserves attention. Yes, the published results are strong. Holly Cummins reports that, in the benchmark setup the team now publishes openly, Quarkus reaches 19,255 transactions per second compared with 7,238 for Spring Boot, starts in 2.919 seconds instead of 6.569 seconds, and uses 269 MB instead of 583 MB of memory. But the more important story is not the chart. The more important story is that the Quarkus team stopped asking people to trust a vague performance graphic and started showing the work behind it.
And this is the part I find most worth writing about.
Because this was not one person throwing numbers into a blog post. This was a joint engineering effort to make Quarkus performance claims more transparent, more reproducible, and more open to criticism. In Holly’s write-up, Eric Deandrea is credited with years of benchmark and automation work, built on an earlier benchmark from John O’Hara. Holly and Eric then moved that work into a new public repository and wired it to publish results through another repository that acts as a data store. Once the benchmark became open, Francesco Nigro reviewed the setup and identified ways to make the measurements more robust. Holly also writes that Eric, Francesco, and Sanne Grinovero spent significant time digging through logs and profiling to understand configuration issues in the Spring side of the comparison. This matters. It tells you the result is not just a Quarkus marketing story. It is a piece of engineering work that got better because other engineers could inspect it.
That openness is the real win.
The old benchmark problem was not just age. It was trust.
Holly is very direct about what was wrong with the old performance story on the Quarkus site. The numbers were outdated. There was no date on the chart. There was no explanation of the methodology. There was no public source code attached to the benchmark. The compared framework was not even named. That kind of benchmark graphic is fine if all you want is a homepage decoration. It is not fine if you want technical readers to treat the result as evidence. Holly says it plainly: if it is not reproducible, it is not trustworthy.
And benchmarking really is easy to get wrong. Holly says that directly too. Getting numbers is trivial. Getting numbers that actually measure what you think they measure is hard. That line is worth sitting with for a minute, because it is the whole benchmark problem in one sentence. The Java ecosystem has no shortage of charts. What it lacks, much of the time, is benchmark literacy.
The benchmark got better because it was exposed to criticism
This is the second thing that makes the Quarkus benchmark story worth reading. Open sourcing the benchmark did not just make it visible. It made it improvable.
Holly describes one of the most important fixes in the article. The team had used cgroups with cpuset to pin processes to specific CPU cores, which is the right general idea. But both the application under test and the load generator ended up in the same cgroup, competing for those same cores. With the load generator running 16 threads, that introduced significant measurement noise. Separating the core assignments fixed that issue, and once proper isolation was in place, the throughput results shifted. Holly notes something subtle and very important here: before the fix, slower frameworks looked more competitive because their lower load left more breathing room for the other processes on the system. In other words, measurement noise was flattering the weaker result.
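The fix described above is easier to picture with a concrete sketch. The team's actual setup uses cgroups with cpuset; the simpler `taskset` commands below illustrate the same principle of disjoint core assignments. The core ranges, jar path, endpoint, and load-generator settings are all hypothetical, not the team's real configuration.

```shell
# Pin the application under test and the load generator to disjoint core sets
# so they never compete for CPU time (Linux only; core ranges are illustrative).

# Cores 0-7: the service under test (hypothetical jar path)
taskset -c 0-7 java -jar target/quarkus-app/quarkus-run.jar &

# Cores 8-15: the load generator, kept entirely off the service's cores.
# Before the fix, both halves effectively shared one core set, and the
# 16 load-generator threads stole cycles from the framework being measured.
taskset -c 8-15 wrk -t16 -c256 -d60s http://localhost:8080/hello
```

The point of the separation is that the load generator's own CPU appetite no longer shows up as noise in the framework's numbers, which is exactly the distortion Holly describes.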
This is the kind of detail I trust.
Not because mistakes happened. Mistakes happen in any serious benchmark effort. I trust it because the team published enough of the work for someone like Francesco Nigro to say, “this part is wrong,” and for that criticism to make the benchmark better. That is how performance engineering is supposed to work. You do not earn credibility by acting certain. You earn credibility by making your claims vulnerable to inspection.
The benchmark became harder to fake, harder to misread, and easier to challenge.
Fair benchmarks are harder than “out of the box” benchmarks
There is another useful lesson in Holly’s post, and it is one many framework benchmark discussions skip entirely. “Out of the box” and “fair” are not always the same thing.
The Quarkus team initially wanted to replicate a normal user experience and compare frameworks without tuning them to extremes. That is a reasonable goal. It avoids turning the benchmark into a contest between experts who know every knob. But after publishing early results, they got feedback from people who use Spring Boot every day. Those users pointed out differences in open-session-in-view behavior and connection pool sizing. Holly says that the connection pool settings mattered enough that the default Spring configuration suffered serious connection errors under load. If the client cannot connect, throughput drops for reasons that are not very interesting as a framework comparison. So the team investigated, profiled, and introduced a lightly tuned version alongside the out-of-the-box version. The front-page graphics now show the tuned results. Holly also notes that the tuning changed the trade-off slightly, sacrificing some memory for better speed in both frameworks.
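For readers who have not hit this themselves, "lightly tuned" in a Spring Boot context typically means adjusting settings like the two flagged above. The property names below are real Spring Boot and Hikari settings (the pool defaults to 10 connections, open-in-view to true), but the values are placeholders, not the ones the Quarkus team actually used.

```shell
# Lightly tuned run: raise the Hikari pool beyond its default of 10 connections
# and disable open-session-in-view, the two behaviors Spring users flagged.
# The jar path and numeric value here are illustrative only.
java -jar target/app.jar \
  --spring.datasource.hikari.maximum-pool-size=100 \
  --spring.jpa.open-in-view=false
```

An undersized pool under heavy load produces connection timeouts, and those errors depress throughput in a way that says nothing about the framework's request-handling path, which is why this counted as a fairness fix rather than an unfair advantage.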
This is a very mature way to handle benchmark fairness.
A less serious post would have quietly kept the worse default result and called it a day. A more dogmatic post would have disappeared into endless tuning until the benchmark no longer represented normal code. The Quarkus team did something better. They made the compromise visible. They kept the realism goal, but they also listened when framework users said the default behavior was introducing avoidable errors that distorted the comparison. That is the kind of judgment performance work needs.
And again, this is why the people matter. Eric Deandrea, Holly Cummins, Francesco Nigro, Sanne Grinovero, and others are not just names to drop. They represent a broader engineering effort inside Quarkus to make sure the claim survives contact with experts, critics, and real framework behavior.
Throughput finally entered the conversation
For years, a lot of Quarkus performance discussion in the public space centered on startup time and memory footprint. Those are real strengths. But that framing also created a bad assumption: if a framework talks mostly about startup and memory, maybe its throughput story is weak. Holly explicitly calls out that misconception and argues that, for Quarkus compared with alternative JVM frameworks, there is no throughput-versus-startup trade-off in this benchmark. The benchmark was rebuilt in part because the old graphics omitted throughput entirely.
This matters because throughput is the number many production teams care about once the service is warm and handling traffic. Startup time matters for autoscaling, scale-to-zero behavior, cold starts, batch jobs, and dense container environments. Memory matters for cost and packing density. But if your service stays up for a long time and sits behind real user traffic, throughput is not optional context. It is part of the performance picture. Holly’s benchmark story is stronger because it finally brings that dimension into view.
Still, this is where I want to add a bit more framing than the original post does.
Throughput is important, but throughput is not the whole room.
A framework can post a strong throughput number and still leave open questions about p95 and p99 latency, downstream contention, transaction behavior, blocking versus non-blocking paths, or what happens when the database becomes the real bottleneck. Holly is careful about the benchmark principles here. She says the setup is designed to test the framework rather than the infrastructure, which in practice means keeping the experiment CPU-bound. That is the right move for a framework comparison. But it also means you should read the result for what it is: a meaningful framework benchmark, not a full model of your production system.
That distinction is where many readers get lost. They see a real result and then immediately overextend it.
This is where local benchmarking still fools people
This connects directly to something I wrote earlier about benchmarking Quarkus on macOS and Apple Silicon.
That article made a simple point: your laptop is not a production server, and local benchmarks are easy to misread. On macOS, especially on Apple Silicon, scheduler behavior, thermal throttling, mixed performance and efficiency cores, Rosetta surprises, and the lack of Linux-style CPU affinity controls all distort results. Francesco Nigro, quoted there, makes the point very clearly: we should not load test on notebooks or artificial environments. Your laptop has variable clocks, thermal limits, and no real control over CPU quotas or memory isolation.
That older article is useful here because it gives us the missing second half of the benchmark conversation.
The new Quarkus benchmark is better engineering because it uses controlled lab conditions, process isolation, and tools meant for real load generation. Holly says the “expert-approved” home setup uses qDup for orchestration and Hyperfoil to drive load without coordinated omission, and she explicitly warns that the simpler scripts for Mac and Linux do not attempt process isolation and need to be treated with caution. She even says that laptop power management can create wild effects and that you can easily end up measuring some other bottleneck that has nothing to do with Quarkus or Spring.
A lab benchmark and a laptop benchmark solve different problems.
A controlled benchmark helps us compare frameworks under conditions designed to isolate framework behavior. A local benchmark helps a developer build intuition, spot gross regressions, or test a specific application under a repeatable local setup. These are both useful things. But they are not the same thing, and treating them as interchangeable is how benchmark posts turn into nonsense.
I still stand by the rule from the earlier article: on macOS, treat local numbers as relative, not absolute. They are good for learning. They are bad for trophies.
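If you do run local numbers in that relative spirit, a minimal discipline helps. The sketch below assumes `wrk` and a hypothetical local endpoint; the flags are standard `wrk` options (`-t` threads, `-c` connections, `-d` duration), and the specific values are arbitrary.

```shell
# Local, relative-only measurement: warm the JVM first, then measure,
# and only ever compare runs against each other on the same machine.
URL=http://localhost:8080/hello   # hypothetical endpoint

# Warm-up pass: let the JIT compile the hot paths before you look at numbers
wrk -t4 -c64 -d30s "$URL" > /dev/null

# Measured pass: identical settings, same machine, same power state,
# lid open, on mains power, nothing heavy running in the background
wrk -t4 -c64 -d60s "$URL"
```

Even with the warm-up, the caveats from the macOS article still apply: thermal throttling and mixed core types mean the second run is a relative signal, not a lab result.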
The native conversation still needs more discipline
Another place where benchmark discussions go sloppy is native mode.
Holly is refreshingly clear here too. Quarkus in JVM mode should be compared with other JVM frameworks in JVM mode, and Quarkus in native mode should be compared with other frameworks in native mode. Comparing Quarkus native to another framework on the JVM and declaring a winner is, in Holly’s words, silly. She also says that if you compare Quarkus JVM to Quarkus native, the trade-off comes back: native starts much faster and uses less memory, but it pays a throughput penalty. In this benchmark, going native cuts Quarkus throughput in half, and Holly notes that the native penalty is similar for Spring Boot. She also says that, for most applications, this trade-off is not worth it unless you care a lot about very frequent start-stop behavior or very low workloads.
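Comparing like with like is straightforward in practice, because both modes build from the same source. These are the standard Quarkus Maven commands; the wildcard runner name follows Quarkus's default native executable naming convention.

```shell
# Build and run the same Quarkus app in both modes, then benchmark each
# against its counterpart: JVM vs JVM, native vs native — never across modes.

# JVM mode
./mvnw package
java -jar target/quarkus-app/quarkus-run.jar

# Native mode (requires GraalVM or Mandrel locally,
# or add -Dquarkus.native.container-build=true to build inside a container)
./mvnw package -Dnative
./target/*-runner
```

Because both binaries come from identical application code, any difference you measure between them is the runtime trade-off itself, which is exactly the comparison Holly argues is legitimate.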
Native is not the “serious” mode, and JVM is not the “legacy” mode. They are two runtime choices with different cost profiles and different failure boundaries. If you run warm services with sustained traffic, JVM mode often remains the better answer because adaptive JIT optimization helps throughput over time. If you care about bursty startup, scale-to-zero, or fitting more services into tighter memory envelopes, native becomes much more attractive. My earlier benchmarking piece on macOS showed exactly why local testing can invert expectations here. On that machine, the JVM sometimes outperformed native in ways that say more about the host environment than about the runtime’s true production profile.
That is why I like Holly’s framing. It pulls the native discussion back toward workload decisions instead of slogans.
There is another story here: engineering culture
The benchmark itself is important. But what I find more strategically interesting is what the benchmark reveals about Quarkus as a project.
A team that is willing to publish methodology, name the compared framework, accept outside criticism, revisit fairness assumptions, and document where the benchmark changed is showing something deeper than performance. It is showing engineering culture. Holly’s post is full of little signals of that culture: public repos, benchmark principles like parity and realism, acknowledgment of benchmark flaws, and visible collaboration between people doing different parts of the work. Eric Deandrea’s lab automation matters. Holly’s public explanation matters. Francesco Nigro’s criticism matters. Sanne Grinovero’s tuning and log analysis work matters. Feedback from Spring users matters too, because it forced the Quarkus team to re-examine whether “default” was really fair.
That is the thing I would want my readers to take away.
Not “Quarkus won, end of story.”
But: “This benchmark is worth reading because the people behind it behaved like engineers.”
What this benchmark still does not do for you
Even a good benchmark does not remove the need for judgment.
It does not tell you your exact cloud bill. It does not tell you whether your own service, with its own persistence behavior and its own traffic patterns, will land in the same shape. It does not tell you how your p99 latency behaves during a noisy deployment window. It does not tell you what happens when your real production bottleneck is a queue, a remote API, or a badly indexed table. Holly is careful to say the benchmark aims to test the frameworks, not the infrastructure. That is a strength of the exercise. But it is also the limit of the exercise.
This is why benchmark literacy matters more than benchmark fandom.
A good benchmark is not a conclusion. It is a starting point for better questions.
What exactly was measured?
What was isolated?
What was tuned?
What workload shape does it represent?
How close is that to my own production reality?
If your team cannot answer those questions, the chart is decoration.
So how should Java teams read the new Quarkus benchmarks?
Read them as engineering evidence, not as universal truth.
Read them as proof that Quarkus deserves to be taken seriously not just for startup and memory, but for throughput as well in a carefully controlled benchmark. Holly’s numbers support that very clearly. Read them as proof that Quarkus is getting more disciplined about how it presents performance. The project moved from anonymous, stale graphics toward public, inspectable benchmark work. That is a big improvement.
But also read them with the same discipline the Quarkus team had to apply while building them.
Do not turn them into lazy framework wars. Do not treat local laptop numbers as confirmation of lab results. Do not compare native to JVM and pretend that is a serious conclusion. Do not confuse “out of the box” with “fair.” And do not forget that the benchmark improved precisely because it was challenged. That is part of the result too.
Final thoughts
The new Quarkus benchmarks are good news for Quarkus. That part is easy to see.
The deeper reason they matter is that they show what responsible benchmark work looks like in an open source Java project. Eric Deandrea, Holly Cummins, Francesco Nigro, Sanne Grinovero, and many others turned this into more than a homepage graphic. They made it a collaborative engineering effort, exposed it to criticism, fixed what needed fixing, and gave the rest of us something more useful than a trophy chart: a benchmark that is substantially easier to trust because it is easier to inspect.
And that is the real story. Not that Quarkus posted new performance numbers, but that the project showed its work.