"Python builds the prototype, Java builds the system that survive scale."
I can't agree more. This statement should be repeated more often.
Hi Markus,
Thanks for the insightful article.
One question that came up while reading: isn’t one of the core limitations for AI inference on the JVM the lack of native GPU / accelerator support, including efficient GPU–CPU switching and device-level memory management? JVM multithreading and FFM help a lot on the CPU side, but they don’t fundamentally solve GPU offload or scheduling, which is still critical for high-throughput inference.
Related to that, I’m also wondering about loading very large model weights directly into the JVM:
Wouldn’t holding multi-GB weights inside the JVM heap put significant pressure on the Garbage Collector, especially for long-running inference services?
Even with off-heap memory or Panama bindings, do you see GC behavior, memory fragmentation, or pause times becoming a practical bottleneck at scale?
Thanks for your questions, Bartek!
Just take a look at Jlama, for example: https://github.com/tjake/Jlama
We haven't even scratched the surface of what's possible now. Native SIMD support (the Vector API) is coming, and you can already use it today!
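To give a rough idea, here is a minimal sketch of a vectorized dot product with the Vector API. It's still an incubator module, so run it with `--add-modules jdk.incubator.vector` and treat the details as subject to change; libraries like Jlama build on this kind of vectorized code.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class SimdDot {
    // Widest float species the running CPU supports (e.g. 8 lanes on AVX2, 16 on AVX-512).
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float dot(float[] a, float[] b) {
        float sum = 0f;
        int i = 0;
        for (int upper = SPECIES.loopBound(a.length); i < upper; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            sum += va.mul(vb).reduceLanes(VectorOperators.ADD);
        }
        for (; i < a.length; i++) { // scalar tail for the leftover elements
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f, 4f, 5f, 6f, 7f, 8f, 9f};
        float[] b = {9f, 8f, 7f, 6f, 5f, 4f, 3f, 2f, 1f};
        System.out.println(dot(a, b)); // 165.0
    }
}
```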
With regards to memory, there have also been a lot of improvements for large heaps lately, and I personally don't think this is going to be an issue at all.
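And on the multi-GB weights question specifically: with the FFM API you can map a weight file straight into off-heap memory, so the bytes never live on the Java heap and the GC never has to scan or move them. A minimal sketch (the file name is just a placeholder):

```java
import java.io.IOException;
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class OffHeapWeights {
    public static void main(String[] args) throws IOException {
        // "model.safetensors" is a placeholder path for whatever weight file you use.
        try (Arena arena = Arena.ofConfined();
             FileChannel channel = FileChannel.open(Path.of("model.safetensors"), StandardOpenOption.READ)) {
            // The mapped bytes live outside the Java heap, so the GC never scans or copies them.
            MemorySegment weights = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size(), arena);
            float first = weights.get(ValueLayout.JAVA_FLOAT, 0); // zero-copy read, no heap allocation
            System.out.println("Mapped " + weights.byteSize() + " bytes off-heap, first float = " + first);
        }
    }
}
```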
I haven't had a chance to test Jlama's memory behavior at scale yet, but it's on my list!
Hang tight,
Thanks for reading and your feedback!
Best,
Markus
Can you update the example to not use `getValue`? I'd like people to start defaulting to the buffer-based methods, as they are much faster and involve less pointer chasing.
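Something along these lines, as a rough sketch (the model path, input name, and shape are placeholders):

```java
import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;
import java.nio.FloatBuffer;
import java.util.Map;

public class BufferBasedOutput {
    public static void main(String[] args) throws OrtException {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        try (OrtSession session = env.createSession("model.onnx", new OrtSession.SessionOptions());
             // Input created from a flat FloatBuffer instead of a nested float[][] array.
             OnnxTensor input = OnnxTensor.createTensor(env,
                     FloatBuffer.wrap(new float[]{1f, 2f, 3f, 4f}), new long[]{1, 4});
             // "input" stands in for the model's actual input name.
             OrtSession.Result result = session.run(Map.of("input", input))) {

            OnnxTensor output = (OnnxTensor) result.get(0);
            // getValue() would box everything into a nested Object array (slow, lots of pointer chasing);
            // getFloatBuffer() hands the data back in a flat buffer instead.
            FloatBuffer values = output.getFloatBuffer();
            while (values.hasRemaining()) {
                System.out.println(values.get());
            }
        }
    }
}
```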
I've nearly finished MemorySegment support in ONNX Runtime; it should hopefully land at some point this month after a round of reviews.
Done! Thanks for your comment! Please reach out if there's anything you'd like me to cover here. Big fan! Thanks for your effort, and Happy New Year! 🥳
Thanks for updating it. The MemorySegment support is in this PR: https://github.com/microsoft/onnxruntime/pull/26911. I'm not sure if it'll be in time for 1.24, though.