Running TensorFlow from Java 25 Without JNI or Python
How the Foreign Function & Memory API Enables Fast, Local ML Inference on macOS
I will be honest upfront.
I am not excited about working with Python.
I respect the ecosystem. I understand why data scientists love it. But as a Java developer building systems that need to run for years, integrate with other services, and behave predictably under load, Python has never felt like home.
Still, TensorFlow exists. And it exists first and foremost as a native C and C++ runtime. Python is “just” a very good binding layer.
So the question I wanted to answer was simple:
How far can we push modern Java, specifically the Foreign Function and Memory API, before we need JNI or Python at runtime?
This article is the result of that experiment.
We will build a high-performance REST service on macOS that runs TensorFlow inference directly from Java 25.
No JNI.
No embedded Python runtime.
No Docker overhead.
Just Java, Quarkus, and the TensorFlow C API.
This pattern is especially useful for local AI agents and edge-style services where you want cheap, fast inference close to your code. Think classification, routing, embeddings, or heuristics that do not justify a network hop or a managed cloud service.
What We Are Building
By the end of this tutorial, you will have:
A Quarkus REST service written in Java 25
A TensorFlow MNIST model loaded via the C API
Inference executed using Java’s FFM API
Predictable memory management without JNI
A server-side benchmark to compare Java and Python inference
All running natively on macOS.
Prerequisites
This tutorial targets macOS only! If you want to make this run on Windows, you will have to adapt paths and libraries.
Java 25 (Early Access): Download JDK 25 EA.
JExtract (Project Panama Tooling): Download JExtract.
Note: Extract jextract and add its bin folder to your PATH.
Quarkus CLI: brew install quarkusio/tap/quarkus (or use Maven).
Python and TensorFlow.
macOS: Apple Silicon (M1/M2/M3/M4) or Intel. The examples were tested on Apple Silicon.
If you do not want to follow along, grab the example from my GitHub repository.
Prepare TensorFlow for macOS
We only use Python to install TensorFlow and build the model artifact.
Nothing from Python is used at runtime.
# 1. Create the virtual environment (named “venv”)
python3 -m venv venv
# 2. Activate it
source venv/bin/activate
# 3. Install TensorFlow inside the virtual environment
pip install tensorflow==2.13.0
This installs the native TensorFlow libraries into the virtual environment, including the .dylib files we will later load from Java.
We train a very simple MNIST classifier:
Input: 28x28 grayscale image
Output: 10 digit classes
Hidden layer: 256 neurons
Dropout: 0.3
Epochs: 5
Validation enabled
I intentionally do not inline the Python here.
This is a separation-of-concerns decision.
Model training is not application logic. It belongs in a different lifecycle, usually handled by data scientists or ML engineers. As a Java developer, you should consume model artifacts, not maintain training code in your service.
Grab create_model.py from the repository and run it.
When the script finishes, you get two artifacts:
mnist_saved_model/
mnist_frozen_graph.pb
Why We Use the Frozen Graph
SavedModels are directory-based and flexible, but they require more complex C API usage.
Frozen graphs are single .pb files with weights embedded as constants.
They are simpler to load and behave more predictably for low-level integration.
For this tutorial, frozen graphs are the right choice.
Bootstrap the Project
quarkus create app com.acme:tflite-agent-macos \
  --extension rest-jackson \
  --no-code
cd tflite-agent-macos
Oh, and if you are wondering why this project is called tflite: I tried getting TensorFlow Lite (TFLite) to run first, but that was almost impossible, as I could neither build it locally nor find an ARM binary for macOS anywhere.
Update pom.xml: Replace the content with this Java 25-ready configuration.
<maven.compiler.release>25</maven.compiler.release>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>${compiler-plugin.version}</version>
<configuration>
<compilerArgs>
<arg>--enable-preview</arg>
</compilerArgs>
</configuration>
</plugin>
<plugin>
<artifactId>maven-surefire-plugin</artifactId>
<version>${surefire-plugin.version}</version>
<configuration>
<argLine>--enable-native-access=ALL-UNNAMED --enable-preview</argLine>
</configuration>
</plugin>
Java 25 preview features are mandatory for FFM in this form.
Generate Bindings with JExtract
Let’s create the FFM mappings. The raw jextract invocation is comparatively simple, but I created a more robust script for you; check it out in the repository.
# 1. Set the variables
export INCLUDE_PATH="/path/to/tensorflow/include"   # the directory that contains tensorflow/c/c_api.h
export OUTPUT_DIR="src/main/java"
export PACKAGE="tensorflow.ffi"
# 2. Run jextract
jextract \
--output "$OUTPUT_DIR" \
-t "$PACKAGE" \
-l tensorflow \
-I "$INCLUDE_PATH" \
"$INCLUDE_PATH/tensorflow/c/c_api.h"The generate_jextract.sh also exports the DYLD_LIBRARY_PATH which you need to set before running your quarkus application.
The TensorFlow Runtime Wrapper
The heart of this tutorial is TFRuntime.java.
This class wraps the TensorFlow C API using Java’s Foreign Function and Memory API.
Big Picture Architecture
Java Application
↓
TFRuntime.java (FFM API)
↓
TensorFlow C API (libtensorflow_cc.2.dylib)
↓
TensorFlow Core (C++)
Phase 1: Model Loading
The Constructor Pattern
public TFRuntime(String modelPath, int numThreads) {
this(modelPath, numThreads, "inputs:0", "Identity:0");
}
It loads a .pb frozen graph file with default operation names. The constructor initializes native resources that will live for the object’s lifetime.
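The repository contains the full primary constructor; here is a minimal sketch of what it does. The field names for the operation names are assumptions on my side, and the thread count would be applied through the session options, which I leave out:
public TFRuntime(String modelPath, int numThreads, String inputOp, String outputOp) {
    this.inputOpName = inputOp;      // operation feeding the network
    this.outputOpName = outputOp;    // operation producing the logits
    ModelHandle handle = loadFrozenGraph(modelPath);
    this.graph = handle.graph();
    this.session = handle.session();
    this.sessionOptions = handle.sessionOptions();
    this.modelArena = handle.arena();
    // numThreads would be applied via TF_SetConfig on the session options (omitted here).
}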
Loading the Frozen Graph
private ModelHandle loadFrozenGraph(String path) {
    Arena arena = Arena.ofConfined(); // ← FFM memory scope
    MemorySegment status = TF_NewStatus();
    try {
        byte[] graphBytes = Files.readAllBytes(Paths.get(path));
        MemorySegment sessionOptions = TF_NewSessionOptions();
        MemorySegment graph = TF_NewGraph();
        MemorySegment importOptions = TF_NewImportGraphDefOptions();
        // Create TF_Buffer from graph bytes
        MemorySegment buffer = TF_Buffer.allocate(arena);
        MemorySegment data = arena.allocate(JAVA_BYTE, graphBytes.length);
        data.copyFrom(MemorySegment.ofArray(graphBytes));
        TF_Buffer.data(buffer, data);
        TF_Buffer.length(buffer, graphBytes.length);
        // Import graph and create session
        TF_GraphImportGraphDef(graph, buffer, importOptions, status);
        MemorySegment session = TF_NewSession(graph, sessionOptions, status);
        return new ModelHandle(graph, session, sessionOptions, arena);
    } catch (IOException e) {
        arena.close();
        throw new UncheckedIOException("Could not read frozen graph: " + path, e);
    } finally {
        TF_DeleteStatus(status);
    }
}
FFM Concepts in Action
Arena: Scoped Memory Management
Arena arena = Arena.ofConfined();
What: A memory scope that owns all allocations within it
Why: Automatic cleanup when arena closes (like try-with-resources)
Lifetime: Lives as long as the TFRuntime object (stored in `modelArena` field)
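To make the two lifetimes concrete, here is a minimal, self-contained sketch of how this tutorial uses arenas: a long-lived one owned by the runtime and a short-lived one per call.
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;

class ArenaLifetimes {
    // Long-lived arena: owned by the object, freed explicitly (as TFRuntime does in close()).
    private final Arena modelArena = Arena.ofConfined();

    void perRequestWork() {
        // Scoped arena: everything allocated inside is freed when the block exits.
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment scratch = arena.allocate(1024);
            // ... pass `scratch` to native calls ...
        } // all allocations from `arena` are released here
    }

    void shutdown() {
        modelArena.close(); // releases every segment the model arena handed out
    }
}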
MemorySegment: Native Memory Access
MemorySegment data = arena.allocate(JAVA_BYTE, graphBytes.length);
data.copyFrom(MemorySegment.ofArray(graphBytes));
What: A safe view into native memory
Why: Type-safe access to C structures and arrays
How: Copy Java byte array to native memory for TensorFlow
Foreign Function Calls
MemorySegment graph = TF_NewGraph();
What: Direct call to C function `TF_NewGraph()`
Why: No JNI overhead, just a function pointer call
Return: MemorySegment pointing to native TF_Graph structure
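To demystify what jextract produces: a generated call like TF_NewGraph() is essentially a cached downcall method handle. A hand-rolled equivalent would look roughly like this sketch; the library name passed to libraryLookup is an assumption and depends on which .dylib you load:
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

class ManualTensorFlowBinding {
    private static final Linker LINKER = Linker.nativeLinker();
    private static final SymbolLookup TF =
            SymbolLookup.libraryLookup("tensorflow", Arena.global());

    // C signature: TF_Graph* TF_NewGraph(void)
    private static final MethodHandle TF_NEW_GRAPH = LINKER.downcallHandle(
            TF.find("TF_NewGraph").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.ADDRESS));

    static MemorySegment newGraph() throws Throwable {
        // A plain function-pointer call: no JNI stubs, no marshalling layer.
        return (MemorySegment) TF_NEW_GRAPH.invokeExact();
    }
}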
The ModelHandle Record
private record ModelHandle(
MemorySegment graph,
MemorySegment session,
MemorySegment sessionOptions,
Arena arena
) {}
Bundle all native resources that must live together. The arena keeps all allocated memory valid.
Phase 2: Inference (The run Method)
The high-level flow is the following:
public float[] run(float[] input) {
try (Arena arena = Arena.ofConfined()) { // ← Inference scope
// 1. Get operations from graph
// 2. Create input tensor
// 3. Run TensorFlow session
// 4. Extract output tensor
// 5. Convert to Java array
// 6. Clean up output tensor
}
}
Step 1: Get Operations
private MemorySegment getOperation(Arena arena, String opName) {
byte[] nameBytes = (opName + "\0").getBytes(StandardCharsets.UTF_8);
MemorySegment nameSeg = arena.allocate(JAVA_BYTE, nameBytes.length);
nameSeg.copyFrom(MemorySegment.ofArray(nameBytes));
MemorySegment op = TF_GraphOperationByName(graph, nameSeg);
return op;
}
FFM Pattern:
Convert Java String to C string (null-terminated)
Allocate native memory in arena
Copy string bytes to native memory
Call C function with native pointer
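Side note: on current JDKs the first three lines of getOperation can be collapsed into a single call, because Arena.allocateFrom(String) produces a UTF-8 encoded, null-terminated C string for you:
MemorySegment nameSeg = arena.allocateFrom(opName); // allocates and appends the trailing '\0'
return TF_GraphOperationByName(graph, nameSeg);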
Step 2: Create Input Tensor (The Critical Part!)
private MemorySegment createInputTensor(Arena arena, float[] input) {
// Define shape: [1, 28, 28]
MemorySegment dims = arena.allocate(JAVA_LONG.byteSize() * 3);
dims.set(JAVA_LONG, 0, 1L); // batch size
dims.set(JAVA_LONG, 8, 28L); // height
dims.set(JAVA_LONG, 16, 28L); // width
// CRITICAL: Use Arena.ofAuto() for tensor data
Arena tensorArena = Arena.ofAuto();
long dataSize = (long) input.length * Float.BYTES;
MemorySegment data = tensorArena.allocate(dataSize);
data.copyFrom(MemorySegment.ofArray(input));
MemorySegment tensor = TF_NewTensor(
TF_FLOAT, dims, 3, data, dataSize,
NULL, // ← No deallocator
NULL
);
return tensor;
}
Step 3: Run Inference
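The call below uses small native arrays that are prepared in the same inference arena. Here is a hedged sketch of that preparation; the TF_Output accessors follow the same jextract naming pattern as TF_Buffer above, and the repository code may differ in detail:
MemorySegment inputOp  = getOperation(arena, inputOpName);
MemorySegment outputOp = getOperation(arena, outputOpName);

// TF_Output is the C struct { TF_Operation* oper; int index; }
MemorySegment inputs = TF_Output.allocate(arena);
TF_Output.oper(inputs, inputOp);
TF_Output.index(inputs, 0);

MemorySegment outputs = TF_Output.allocate(arena);
TF_Output.oper(outputs, outputOp);
TF_Output.index(outputs, 0);

// One slot each: the input tensor we created, and a slot TensorFlow fills with the output tensor.
MemorySegment inputValues = arena.allocate(ADDRESS, 1);
inputValues.set(ADDRESS, 0, createInputTensor(arena, input));
MemorySegment outputValues = arena.allocate(ADDRESS, 1);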
TF_SessionRun(
session, // TensorFlow session
NULL, // run options (none)
inputs, // input operations
inputValues, // input tensor pointers
1, // number of inputs
outputs, // output operations
outputValues, // output tensor pointers (filled by TF)
1, // number of outputs
NULL, // target operations (none)
0, // number of targets
NULL, // run metadata (none)
status // error status
);
FFM Pattern: Pass pointers to structures. TensorFlow fills outputValues with the result.
Step 4: Extract Output
MemorySegment outputTensor = outputValues.get(ADDRESS, 0);
float[] result = readOutputTensor(outputTensor);
Step 5: Read Tensor Data
private float[] readOutputTensor(MemorySegment tensor) {
    long byteSize = TF_TensorByteSize(tensor);
    int size = (int) (byteSize / Float.BYTES);
    // TF_TensorData returns a zero-length pointer segment; resize it before copying
    MemorySegment data = TF_TensorData(tensor).reinterpret(byteSize);
    float[] result = new float[size];
    MemorySegment.copy(data, JAVA_FLOAT, 0, result, 0, size);
    return result;
}
FFM Pattern:
Get pointer to tensor’s data buffer
Use `MemorySegment.copy()` to bulk-copy native floats to Java array
Type-safe: `JAVA_FLOAT` ensures correct interpretation
Step 6: Critical Cleanup
// CRITICAL: Delete output tensor (TensorFlow expects caller to free it)
TF_DeleteTensor(outputTensor);
// Do NOT delete input tensor (Arena.ofAuto() manages it)
Phase 3: Cleanup
@PreDestroy
public void close() {
if (session != null && !session.equals(NULL)) {
MemorySegment status = TF_NewStatus();
try {
TF_CloseSession(session, status);
TF_DeleteSession(session, status);
} finally {
TF_DeleteStatus(status);
}
}
if (graph != null && !graph.equals(NULL)) {
TF_DeleteGraph(graph);
}
if (sessionOptions != null && !sessionOptions.equals(NULL)) {
TF_DeleteSessionOptions(sessionOptions);
}
if (modelArena != null) {
modelArena.close(); // ← Frees all arena allocations
}
}
Order matters:
Close session (stops using graph)
Delete graph (stops using options)
Delete options
Close arena (frees all memory)
The Service Logic
The MnistService uses Java FFM to wrap the TensorFlow C API for MNIST classification. It implements thread-safe lazy loading for the model (mnist_frozen_graph.pb) and is configured for Apple Silicon performance. The predict method accepts a 784-float input and returns a Prediction record containing the digit, confidence, and inference time. It also handles native library errors explicitly, providing actionable diagnostics if the underlying C dependencies are missing.
Create src/main/java/com/acme/mnist/MnistService.java.
package com.acme.mnist;
import org.jboss.logging.Logger;
import jakarta.enterprise.context.ApplicationScoped;
@ApplicationScoped
public class MnistService {
private static final Logger LOG = Logger.getLogger(MnistService.class);
private volatile TFRuntime runtime;
private volatile String initializationError;
private final Object initLock = new Object();
public Prediction predict(float[] pixels) {
ensureInitialized();
if (runtime == null) {
throw new IllegalStateException(
"TensorFlow runtime not initialized. " +
(initializationError != null ? initializationError : "Unknown initialization error."));
}
long start = System.nanoTime();
float[] logits = runtime.run(pixels);
long duration = System.nanoTime() - start;
int maxIndex = 0;
float maxVal = logits[0];
for (int i = 1; i < logits.length; i++) {
if (logits[i] > maxVal) {
maxVal = logits[i];
maxIndex = i;
}
}
return new Prediction(maxIndex, maxVal, duration / 1_000_000.0);
}
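// For completeness: a minimal sketch of the lazy, thread-safe initialization that
// predict() relies on. The full version lives in the repository; the model path
// and thread count used here are illustrative assumptions.
private void ensureInitialized() {
    if (runtime != null || initializationError != null) {
        return; // fast path: already initialized or already failed
    }
    synchronized (initLock) {
        if (runtime != null || initializationError != null) {
            return; // double-checked locking
        }
        try {
            runtime = new TFRuntime("mnist_frozen_graph.pb", 1);
            LOG.info("TensorFlow runtime initialized");
        } catch (Throwable t) {
            initializationError = t.getMessage();
            LOG.error("Failed to initialize the TensorFlow runtime", t);
        }
    }
}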
public record Prediction(int digit, float confidence, double inferenceMs) {
}
}
The REST Resource
Finally, we dug ourselves out from the trenches and reached the REST endpoint.
Create src/main/java/com/acme/MnistResource.java (the class lives in the com.acme package and imports the service from com.acme.mnist).
package com.acme;
import com.acme.mnist.MnistService;
import io.quarkus.logging.Log;
import jakarta.inject.Inject;
import jakarta.ws.rs.Consumes;
import jakarta.ws.rs.POST;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.MediaType;
@Path("/api/mnist")
public class MnistResource {
@Inject
MnistService service;
public static class Request {
public float[] pixels; // Efficient mapping
}
@POST
@Consumes(MediaType.APPLICATION_JSON)
@Produces(MediaType.APPLICATION_JSON)
public MnistService.Prediction classify(Request req) {
Log.infof("Classifying request: %s", req.toString());
if (req.pixels == null || req.pixels.length != 784) {
throw new IllegalArgumentException("Input must be 784 float pixels");
}
return service.predict(req.pixels);
}
}
Run on macOS
This is the critical step. You must tell Java where to find your .dylib.
export DYLD_LIBRARY_PATH="/Path/to/your/venv/lib/python3.9/site-packages/tensorflow:$DYLD_LIBRARY_PATH"
./mvnw quarkus:dev
Verify
Generate a valid payload of 784 zero pixels, for example with the small helper sketched below, or take my test payload from the repository.
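A tiny throwaway helper (hypothetical, not part of the repository) that writes such a file; run it with java MakePayload.java:
import java.nio.file.Files;
import java.nio.file.Path;

public class MakePayload {
    public static void main(String[] args) throws Exception {
        // 784 zero pixels, wrapped in the JSON shape the Request class expects.
        String zeros = "0,".repeat(783) + "0";
        Files.writeString(Path.of("payload.json"), "{\"pixels\":[" + zeros + "]}");
    }
}
With payload.json in place, test the endpoint: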
curl -s http://localhost:8080/api/mnist \
-H "Content-Type: application/json" \
-d @payload.json | jq .
Expected result:
{
  "digit": 7,
  "confidence": 0.9999286,
  "inferenceMs": 0.579875
}
So, who is faster? Python or Java?
I have added a benchmark endpoint that runs inference on the server side: 10 warmup iterations followed by 100 timed inference calls. A stripped-down sketch of the idea is shown below.
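The real endpoint in the repository returns more statistics (median, percentiles, throughput). This sketch, added to MnistResource, only illustrates the warmup-then-measure loop; it additionally needs jakarta.ws.rs.QueryParam, jakarta.ws.rs.DefaultValue, java.util.Map, and java.util.Arrays imports, and anything not already shown above is an assumption:
@POST
@Path("/benchmark")
@Consumes(MediaType.APPLICATION_JSON)
@Produces(MediaType.APPLICATION_JSON)
public Map<String, Object> benchmark(Request req,
        @QueryParam("iterations") @DefaultValue("100") int iterations,
        @QueryParam("warmup") @DefaultValue("10") int warmup) {
    // Warm up the JIT and TensorFlow's internal caches before measuring.
    for (int i = 0; i < warmup; i++) {
        service.predict(req.pixels);
    }
    double[] samplesMs = new double[iterations];
    for (int i = 0; i < iterations; i++) {
        long start = System.nanoTime();
        service.predict(req.pixels);
        samplesMs[i] = (System.nanoTime() - start) / 1_000_000.0;
    }
    double averageMs = Arrays.stream(samplesMs).average().orElse(0.0);
    return Map.of("iterations", iterations, "warmup", warmup, "averageMs", averageMs);
}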
curl -X POST "http://localhost:8080/api/mnist/benchmark?iterations=100&warmup=10" \
-H "Content-Type: application/json" \
-d @payload.json | jq .
Get endpoint info:
curl http://localhost:8080/api/mnist/benchmark/info | jq .
Running the comparison script produces the following numbers on my machine.
Run the comparison
python3 compare_inference_v2.py
======================================================================
INFERENCE COMPARISON v2: Python vs Java (Server-Side Benchmark)
======================================================================
Iterations: 100
Warmup: 10
This version uses server-side benchmarking for accurate Java measurements
without network overhead.
======================================================================
PYTHON TENSORFLOW INFERENCE TEST
======================================================================
Loading frozen graph model...
Warming up (10 iterations)...
Benchmarking (100 iterations)...
======================================================================
JAVA REST API BENCHMARK (Server-Side)
======================================================================
✓ Server is running
Running server-side benchmark (100 iterations, 10 warmup)...
✓ Benchmark completed
======================================================================
INFERENCE TIME COMPARISON (Pure Inference Only)
======================================================================
Metric Python TensorFlow Java (Server-Side)
----------------------------------------------------------------------
Average 0.061 ms 0.087 ms
Median 0.058 ms 0.079 ms
Min 0.054 ms 0.053 ms
Max 0.106 ms 0.208 ms
Std Dev 0.007 ms 0.033 ms
P95 0.074 ms 0.164 ms
P99 0.105 ms 0.206 ms
----------------------------------------------------------------------
Throughput 16433 pred/sec 11433 pred/sec
Performance Python is 1.44x FASTER than Java
Difference 0.027 ms per inference
======================================================================
PREDICTION VERIFICATION
======================================================================
Method Digit Confidence Match
----------------------------------------------------------------------
Python 7 0.999929 ✓
Java 7 0.999929 ✓
Both methods predict the same digit!
======================================================================
TEST CONFIGURATION
======================================================================
Iterations: 100
Warmup: 10
Java verified: 100 iterations, 10 warmup
======================================================================
COMPARISON COMPLETE
======================================================================
Notes:
MLIR V1 optimization pass is not enabled on Java
I have done zero optimizations with regards to session handling or anything else. This is literally default performance from the generated API.
You should not directly compare benchmarks like this :) I have written about this before, but I really wanted to see a baseline here.
The server-side benchmark reveals that Java’s pure inference performance is much closer to Python than many people think.
Python: 0.061 ms (best for maximum throughput)
Java: 0.087 ms (excellent for production services)
Difference: Only 0.027 ms (27 microseconds!)
Both implementations are excellent for production use. The choice depends on your ecosystem needs rather than raw performance, as the difference is negligible for most applications.
Congratulations.
You built a native TensorFlow inference service using pure Java.
No JNI.
No Python runtime.
No containers.
Java 25 plus FFM gives you direct access to native ML runtimes with predictable performance and memory behavior.
This is how local AI agents should be built.
Simple. Explicit. Boring.