Running TensorFlow from Java 25 Without JNI or Python
How the Foreign Function & Memory API Enables Fast, Local ML Inference on macOS
I will be honest upfront.
I am not excited about working with Python.
I respect the ecosystem. I understand why data scientists love it. But as a Java developer building systems that need to run for years, integrate with other services, and behave predictably under load, Python has never felt like home.
Still, TensorFlow exists. And it exists first and foremost as a native C and C++ runtime. Python is “just” a very good binding layer.
So the question I wanted to answer was simple:
How far can we push modern Java, specifically the Foreign Function and Memory API, before we need JNI or Python at runtime?
This article is the result of that experiment.
We will build a high-performance REST service on macOS that runs TensorFlow inference directly from Java 25.
No JNI.
No embedded Python runtime.
No Docker overhead.
Just Java, Quarkus, and the TensorFlow C API.
This pattern is especially useful for local AI agents and edge-style services where you want cheap, fast inference close to your code. Think classification, routing, embeddings, or heuristics that do not justify a network hop or a managed cloud service.
What We Are Building
By the end of this tutorial, you will have:
A Quarkus REST service written in Java 25
A TensorFlow MNIST model loaded via the C API
Inference executed using Java’s FFM API
Predictable memory management without JNI
A server-side benchmark to compare Java and Python inference
All running natively on macOS.
Prerequisites
This tutorial targets macOS only! If you want to make this run on Windows, you will have to adapt paths and libraries.
Java 25 (Early Access): Download JDK 25 EA.
JExtract (Project Panama Tooling): Download JExtract.
Note: Extract jextract and add its bin folder to your PATH.
Quarkus CLI: brew install quarkusio/tap/quarkus (or use Maven).
Python and TensorFlow.
macOS: Apple Silicon (M1/M2/M3/M4) or Intel. The examples were tested on Apple Silicon.
If you do not want to follow along, grab the example from my GitHub repository.
Prepare TensorFlow for macOS
We only use Python to install TensorFlow and build the model artifact.
Nothing from Python is used at runtime.
# 1. Create the virtual environment (named “venv”)
python3 -m venv venv
# 2. Activate it
source venv/bin/activate
# 3. Install TensorFlow inside the virtual environment
pip install tensorflow==2.13.0
This installs the native TensorFlow libraries into the virtual environment, including the .dylib files we will later load from Java.
We train a very simple MNIST classifier:
Input: 28x28 grayscale image
Output: 10 digit classes
Hidden layer: 256 neurons
Dropout: 0.3
Epochs: 5
Validation enabled
I intentionally do not inline the Python here.
This is a separation-of-concerns decision.
Model training is not application logic. It belongs in a different lifecycle, usually handled by data scientists or ML engineers. As a Java developer, you should consume model artifacts, not maintain training code in your service.
Grab create_model.py from the repository and run it.
When the script finishes, you get two artifacts:
mnist_saved_model/
mnist_frozen_graph.pb
Why We Use the Frozen Graph
SavedModels are directory-based and flexible, but they require more complex C API usage.
Frozen graphs are single .pb files with weights embedded as constants.
They are simpler to load and behave more predictably for low-level integration.
For this tutorial, frozen graphs are the right choice.
Bootstrap the Project
quarkus create app com.acme:tflite-agent-macos \
  --extension rest-jackson \
  --no-code
cd tflite-agent-macos
Oh, and if you are wondering why this project is called tflite: I tried getting TensorFlow Lite (TFLite) to run first, but that was almost impossible, as I could neither build it locally nor find an ARM binary for macOS anywhere.
Update pom.xml: Replace the content with this Java 25-ready configuration.
<maven.compiler.release>25</maven.compiler.release>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>${compiler-plugin.version}</version>
<configuration>
<compilerArgs>
<arg>--enable-preview</arg>
</compilerArgs>
</configuration>
</plugin>
<plugin>
<artifactId>maven-surefire-plugin</artifactId>
<version>${surefire-plugin.version}</version>
<configuration>
<argLine>--enable-native-access=ALL-UNNAMED --enable-preview</argLine>
</configuration>
</plugin>
Java 25 preview features are mandatory for FFM in this form.
Generate Bindings with JExtract
Let’s create the FFM mappings. The raw jextract invocation is comparatively simple, but I created a more robust script for you; check it out in the repository.
# 1. Set the variables
export INCLUDE_PATH="/path/to/tensorflow/include"   # the directory that contains tensorflow/c/c_api.h
export OUTPUT_DIR="src/main/java"
export PACKAGE="tensorflow.ffi"
# 2. Run jextract
jextract \
--output "$OUTPUT_DIR" \
-t "$PACKAGE" \
-l tensorflow \
-I "$INCLUDE_PATH" \
"$INCLUDE_PATH/tensorflow/c/c_api.h"The generate_jextract.sh also exports the DYLD_LIBRARY_PATH which you need to set before running your quarkus application.
The TensorFlow Runtime Wrapper
The heart of this tutorial is TFRuntime.java.
This class wraps the TensorFlow C API using Java’s Foreign Function and Memory API.
Big Picture Architecture
Java Application
↓
TFRuntime.java (FFM API)
↓
TensorFlow C API (libtensorflow_cc.2.dylib)
↓
TensorFlow Core (C++)
Phase 1: Model Loading
The Constructor Pattern
public TFRuntime(String modelPath, int numThreads) {
this(modelPath, numThreads, "inputs:0", "Identity:0");
}
It loads a .pb frozen graph file with default operation names. The constructor initializes native resources that will live for the object’s lifetime.
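The repository contains the full primary constructor; here is a minimal sketch of what it does. The field names for the operation names are assumptions on my side, and the thread count would be applied through the session options, which I leave out:
public TFRuntime(String modelPath, int numThreads, String inputOp, String outputOp) {
    this.inputOpName = inputOp;      // operation feeding the network
    this.outputOpName = outputOp;    // operation producing the logits
    ModelHandle handle = loadFrozenGraph(modelPath);
    this.graph = handle.graph();
    this.session = handle.session();
    this.sessionOptions = handle.sessionOptions();
    this.modelArena = handle.arena();
    // numThreads would be applied via TF_SetConfig on the session options (omitted here).
}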
Loading the Frozen Graph
private ModelHandle loadFrozenGraph(String path) {
    Arena arena = Arena.ofConfined(); // ← FFM memory scope
    MemorySegment status = TF_NewStatus();
    try {
        byte[] graphBytes = Files.readAllBytes(Paths.get(path));
        MemorySegment sessionOptions = TF_NewSessionOptions();
        MemorySegment graph = TF_NewGraph();
        MemorySegment importOptions = TF_NewImportGraphDefOptions();
        // Create TF_Buffer from graph bytes
        MemorySegment buffer = TF_Buffer.allocate(arena);
        MemorySegment data = arena.allocate(JAVA_BYTE, graphBytes.length);
        data.copyFrom(MemorySegment.ofArray(graphBytes));
        TF_Buffer.data(buffer, data);
        TF_Buffer.length(buffer, graphBytes.length);
        // Import graph and create session
        TF_GraphImportGraphDef(graph, buffer, importOptions, status);
        MemorySegment session = TF_NewSession(graph, sessionOptions, status);
        return new ModelHandle(graph, session, sessionOptions, arena);
    } catch (IOException e) {
        arena.close();
        throw new UncheckedIOException("Could not read frozen graph: " + path, e);
    } finally {
        TF_DeleteStatus(status);
    }
}
FFM Concepts in Action
Arena: Scoped Memory Management
Arena arena = Arena.ofConfined();
What: A memory scope that owns all allocations within it
Why: Automatic cleanup when arena closes (like try-with-resources)
Lifetime: Lives as long as the TFRuntime object (stored in `modelArena` field)
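To make the two lifetimes concrete, here is a minimal, self-contained sketch of how this tutorial uses arenas: a long-lived one owned by the runtime and a short-lived one per call.
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;

class ArenaLifetimes {
    // Long-lived arena: owned by the object, freed explicitly (as TFRuntime does in close()).
    private final Arena modelArena = Arena.ofConfined();

    void perRequestWork() {
        // Scoped arena: everything allocated inside is freed when the block exits.
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment scratch = arena.allocate(1024);
            // ... pass `scratch` to native calls ...
        } // all allocations from `arena` are released here
    }

    void shutdown() {
        modelArena.close(); // releases every segment the model arena handed out
    }
}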
MemorySegment: Native Memory Access
MemorySegment data = arena.allocate(JAVA_BYTE, graphBytes.length);
data.copyFrom(MemorySegment.ofArray(graphBytes));
What: A safe view into native memory
Why: Type-safe access to C structures and arrays
How: Copy Java byte array to native memory for TensorFlow
Foreign Function Calls
MemorySegment graph = TF_NewGraph();
What: Direct call to C function `TF_NewGraph()`
Why: No JNI overhead, just a function pointer call
Return: MemorySegment pointing to native TF_Graph structure
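To demystify what jextract produces: a generated call like TF_NewGraph() is essentially a cached downcall method handle. A hand-rolled equivalent would look roughly like this sketch; the library name passed to libraryLookup is an assumption and depends on which .dylib you load:
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

class ManualTensorFlowBinding {
    private static final Linker LINKER = Linker.nativeLinker();
    private static final SymbolLookup TF =
            SymbolLookup.libraryLookup("tensorflow", Arena.global());

    // C signature: TF_Graph* TF_NewGraph(void)
    private static final MethodHandle TF_NEW_GRAPH = LINKER.downcallHandle(
            TF.find("TF_NewGraph").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.ADDRESS));

    static MemorySegment newGraph() throws Throwable {
        // A plain function-pointer call: no JNI stubs, no marshalling layer.
        return (MemorySegment) TF_NEW_GRAPH.invokeExact();
    }
}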
The ModelHandle Record
private record ModelHandle(
MemorySegment graph,
MemorySegment session,
MemorySegment sessionOptions,
Arena arena
) {}
Bundle all native resources that must live together. The arena keeps all allocated memory valid.
Phase 2: Inference (The run Method)
The high-level flow is the following:
public float[] run(float[] input) {
try (Arena arena = Arena.ofConfined()) { // ← Inference scope
// 1. Get operations from graph
// 2. Create input tensor
// 3. Run TensorFlow session
// 4. Extract output tensor
// 5. Convert to Java array
// 6. Clean up output tensor
}
}
Step 1: Get Operations
private MemorySegment getOperation(Arena arena, String opName) {
byte[] nameBytes = (opName + "\0").getBytes(StandardCharsets.UTF_8);
MemorySegment nameSeg = arena.allocate(JAVA_BYTE, nameBytes.length);
nameSeg.copyFrom(MemorySegment.ofArray(nameBytes));
MemorySegment op = TF_GraphOperationByName(graph, nameSeg);
return op;
}
FFM Pattern:
Convert Java String to C string (null-terminated)
Allocate native memory in arena
Copy string bytes to native memory
Call C function with native pointer
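Side note: on current JDKs the first three lines of getOperation can be collapsed into a single call, because Arena.allocateFrom(String) produces a UTF-8 encoded, null-terminated C string for you:
MemorySegment nameSeg = arena.allocateFrom(opName); // allocates and appends the trailing '\0'
return TF_GraphOperationByName(graph, nameSeg);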
Step 2: Create Input Tensor (The Critical Part!)
private MemorySegment createInputTensor(Arena arena, float[] input) {
// Define shape: [1, 28, 28]
MemorySegment dims = arena.allocate(JAVA_LONG.byteSize() * 3);
dims.set(JAVA_LONG, 0, 1L); // batch size
dims.set(JAVA_LONG, 8, 28L); // height
dims.set(JAVA_LONG, 16, 28L); // width
// CRITICAL: Use Arena.ofAuto() for tensor data
Arena tensorArena = Arena.ofAuto();
long dataSize = (long) input.length * Float.BYTES;
MemorySegment data = tensorArena.allocate(dataSize);
data.copyFrom(MemorySegment.ofArray(input));
MemorySegment tensor = TF_NewTensor(
TF_FLOAT, dims, 3, data, dataSize,
NULL, // ← No deallocator
NULL
);
return tensor;
}
Step 3: Run Inference
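The call below uses small native arrays that are prepared in the same inference arena. Here is a hedged sketch of that preparation; the TF_Output accessors follow the same jextract naming pattern as TF_Buffer above, and the repository code may differ in detail:
MemorySegment inputOp  = getOperation(arena, inputOpName);
MemorySegment outputOp = getOperation(arena, outputOpName);

// TF_Output is the C struct { TF_Operation* oper; int index; }
MemorySegment inputs = TF_Output.allocate(arena);
TF_Output.oper(inputs, inputOp);
TF_Output.index(inputs, 0);

MemorySegment outputs = TF_Output.allocate(arena);
TF_Output.oper(outputs, outputOp);
TF_Output.index(outputs, 0);

// One slot each: the input tensor we created, and a slot TensorFlow fills with the output tensor.
MemorySegment inputValues = arena.allocate(ADDRESS, 1);
inputValues.set(ADDRESS, 0, createInputTensor(arena, input));
MemorySegment outputValues = arena.allocate(ADDRESS, 1);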
TF_SessionRun(
session, // TensorFlow session
NULL, // run options (none)
inputs, // input operations
inputValues, // input tensor pointers
1, // number of inputs
outputs, // output operations
outputValues, // output tensor pointers (filled by TF)
1, // number of outputs
NULL, // target operations (none)
0, // number of targets
NULL, // run metadata (none)
status // error status
);
FFM Pattern: Pass pointers to structures. TensorFlow fills outputValues with the result.
Step 4: Extract Output
MemorySegment outputTensor = outputValues.get(ADDRESS, 0);
float[] result = readOutputTensor(outputTensor);
Step 5: Read Tensor Data
private float[] readOutputTensor(MemorySegment tensor) {
    long byteSize = TF_TensorByteSize(tensor);
    int size = (int) (byteSize / Float.BYTES);
    // TF_TensorData returns a zero-length pointer segment; resize it before copying
    MemorySegment data = TF_TensorData(tensor).reinterpret(byteSize);
    float[] result = new float[size];
    MemorySegment.copy(data, JAVA_FLOAT, 0, result, 0, size);
    return result;
}
FFM Pattern:
Get pointer to tensor’s data buffer
Use `MemorySegment.copy()` to bulk-copy native floats to Java array
Type-safe: `JAVA_FLOAT` ensures correct interpretation
Step 6: Critical Cleanup
// CRITICAL: Delete output tensor (TensorFlow expects caller to free it)
TF_DeleteTensor(outputTensor);
// Do NOT delete input tensor (Arena.ofAuto() manages it)
Phase 3: Cleanup
@PreDestroy
public void close() {
if (session != null && !session.equals(NULL)) {
MemorySegment status = TF_NewStatus();
try {
TF_CloseSession(session, status);
TF_DeleteSession(session, status);
} finally {
TF_DeleteStatus(status);
}
}
if (graph != null && !graph.equals(NULL)) {
TF_DeleteGraph(graph);
}
if (sessionOptions != null && !sessionOptions.equals(NULL)) {
TF_DeleteSessionOptions(sessionOptions);
}
if (modelArena != null) {
modelArena.close(); // ← Frees all arena allocations
}
}
Order matters:
Close session (stops using graph)
Delete graph (stops using options)
Delete options
Close arena (frees all memory)
The Service Logic
The MnistService uses Java FFM to wrap the TensorFlow C API for MNIST classification. It implements thread-safe lazy loading for the model (mnist_frozen_graph.pb) and is configured for Apple Silicon performance. The predict method accepts a 784-float input and returns a Prediction record containing the digit, confidence, and inference time. It also handles native library errors explicitly, providing actionable diagnostics if the underlying C dependencies are missing.
Create src/main/java/com/acme/mnist/MnistService.java.
package com.acme.mnist;
import org.jboss.logging.Logger;
import jakarta.enterprise.context.ApplicationScoped;
@ApplicationScoped
public class MnistService {
private static final Logger LOG = Logger.getLogger(MnistService.class);
private volatile TFRuntime runtime;
private volatile String initializationError;
private final Object initLock = new Object();
public Prediction predict(float[] pixels) {
ensureInitialized();
if (runtime == null) {
throw new IllegalStateException(
"TensorFlow runtime not initialized. " +
(initializationError != null ? initializationError : "Unknown initialization error."));
}
long start = System.nanoTime();
float[] logits = runtime.run(pixels);
long duration = System.nanoTime() - start;
int maxIndex = 0;
float maxVal = logits[0];
for (int i = 1; i < logits.length; i++) {
if (logits[i] > maxVal) {
maxVal = logits[i];
maxIndex = i;
}
}
return new Prediction(maxIndex, maxVal, duration / 1_000_000.0);
}
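// For completeness: a minimal sketch of the lazy, thread-safe initialization that
// predict() relies on. The full version lives in the repository; the model path
// and thread count used here are illustrative assumptions.
private void ensureInitialized() {
    if (runtime != null || initializationError != null) {
        return; // fast path: already initialized or already failed
    }
    synchronized (initLock) {
        if (runtime != null || initializationError != null) {
            return; // double-checked locking
        }
        try {
            runtime = new TFRuntime("mnist_frozen_graph.pb", 1);
            LOG.info("TensorFlow runtime initialized");
        } catch (Throwable t) {
            initializationError = t.getMessage();
            LOG.error("Failed to initialize the TensorFlow runtime", t);
        }
    }
}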
public record Prediction(int digit, float confidence, double inferenceMs) {
}
}
The REST Resource
Finally, we dug ourselves out from the trenches and reached the REST endpoint.
Create src/main/java/com/acme/MnistResource.java (the class lives in the com.acme package and imports the service from com.acme.mnist).
package com.acme;
import com.acme.mnist.MnistService;
import io.quarkus.logging.Log;
import jakarta.inject.Inject;
import jakarta.ws.rs.Consumes;
import jakarta.ws.rs.POST;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.MediaType;
@Path("/api/mnist")
public class MnistResource {
@Inject
MnistService service;
public static class Request {
public float[] pixels; // Efficient mapping
}
@POST
@Consumes(MediaType.APPLICATION_JSON)
@Produces(MediaType.APPLICATION_JSON)
public MnistService.Prediction classify(Request req) {
Log.infof("Classifying request: %s", req.toString());
if (req.pixels == null || req.pixels.length != 784) {
throw new IllegalArgumentException("Input must be 784 float pixels");
}
return service.predict(req.pixels);
}
}
Run on macOS
This is the critical step. You must tell Java where to find your .dylib.
export DYLD_LIBRARY_PATH="/Path/to/your/venv/lib/python3.9/site-packages/tensorflow:$DYLD_LIBRARY_PATH"
./mvnw quarkus:dev
Verify
Generate a valid payload of 784 zero pixels, for example with the small helper sketched below, or take my test payload from the repository.
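A tiny throwaway helper (hypothetical, not part of the repository) that writes such a file; run it with java MakePayload.java:
import java.nio.file.Files;
import java.nio.file.Path;

public class MakePayload {
    public static void main(String[] args) throws Exception {
        // 784 zero pixels, wrapped in the JSON shape the Request class expects.
        String zeros = "0,".repeat(783) + "0";
        Files.writeString(Path.of("payload.json"), "{\"pixels\":[" + zeros + "]}");
    }
}
With payload.json in place, test the endpoint: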
curl -s http://localhost:8080/api/mnist \
-H "Content-Type: application/json" \
-d @payload.json | jq .
Expected result:
{
  "digit": 7,
  "confidence": 0.9999286,
  "inferenceMs": 0.579875
}
So, who is faster? Python or Java?
I have added a benchmark endpoint that runs inference on the server side: 10 warmup iterations followed by 100 timed inference calls. A stripped-down sketch of the idea is shown below.
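The real endpoint in the repository returns more statistics (median, percentiles, throughput). This sketch, added to MnistResource, only illustrates the warmup-then-measure loop; it additionally needs jakarta.ws.rs.QueryParam, jakarta.ws.rs.DefaultValue, java.util.Map, and java.util.Arrays imports, and anything not already shown above is an assumption:
@POST
@Path("/benchmark")
@Consumes(MediaType.APPLICATION_JSON)
@Produces(MediaType.APPLICATION_JSON)
public Map<String, Object> benchmark(Request req,
        @QueryParam("iterations") @DefaultValue("100") int iterations,
        @QueryParam("warmup") @DefaultValue("10") int warmup) {
    // Warm up the JIT and TensorFlow's internal caches before measuring.
    for (int i = 0; i < warmup; i++) {
        service.predict(req.pixels);
    }
    double[] samplesMs = new double[iterations];
    for (int i = 0; i < iterations; i++) {
        long start = System.nanoTime();
        service.predict(req.pixels);
        samplesMs[i] = (System.nanoTime() - start) / 1_000_000.0;
    }
    double averageMs = Arrays.stream(samplesMs).average().orElse(0.0);
    return Map.of("iterations", iterations, "warmup", warmup, "averageMs", averageMs);
}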
curl -X POST "http://localhost:8080/api/mnist/benchmark?iterations=100&warmup=10" \
-H "Content-Type: application/json" \
-d @payload.json | jq .
Get endpoint info:
curl http://localhost:8080/api/mnist/benchmark/info | jq .
Running the comparison script produces the following numbers on my machine.
Run the comparison
python3 compare_inference_v2.py
======================================================================
INFERENCE COMPARISON v2: Python vs Java (Server-Side Benchmark)
======================================================================
Iterations: 100
Warmup: 10
This version uses server-side benchmarking for accurate Java measurements
without network overhead.
======================================================================
PYTHON TENSORFLOW INFERENCE TEST
======================================================================
Loading frozen graph model...
Warming up (10 iterations)...
Benchmarking (100 iterations)...
======================================================================
JAVA REST API BENCHMARK (Server-Side)
======================================================================
✓ Server is running
Running server-side benchmark (100 iterations, 10 warmup)...
✓ Benchmark completed
======================================================================
INFERENCE TIME COMPARISON (Pure Inference Only)
======================================================================
Metric Python TensorFlow Java (Server-Side)
----------------------------------------------------------------------
Average 0.061 ms 0.087 ms
Median 0.058 ms 0.079 ms
Min 0.054 ms 0.053 ms
Max 0.106 ms 0.208 ms
Std Dev 0.007 ms 0.033 ms
P95 0.074 ms 0.164 ms
P99 0.105 ms 0.206 ms
----------------------------------------------------------------------
Throughput 16433 pred/sec 11433 pred/sec
Performance Python is 1.44x FASTER than Java
Difference 0.027 ms per inference
======================================================================
PREDICTION VERIFICATION
======================================================================
Method Digit Confidence Match
----------------------------------------------------------------------
Python 7 0.999929 ✓
Java 7 0.999929 ✓
Both methods predict the same digit!
======================================================================
TEST CONFIGURATION
======================================================================
Iterations: 100
Warmup: 10
Java verified: 100 iterations, 10 warmup
======================================================================
COMPARISON COMPLETE
======================================================================
Notes:
MLIR V1 optimization pass is not enabled on Java
I have done zero optimizations with regards to session handling or anything else. This is literally default performance from the generated API.
You should not directly compare benchmarks like this :) I have written about this before, but I really wanted to see a baseline here.
The server-side benchmark reveals that Java’s pure inference performance is much closer to Python than many people think.
Python: 0.061 ms (best for maximum throughput)
Java: 0.087 ms (excellent for production services)
Difference: Only 0.027 ms (27 microseconds!)
Both implementations are excellent for production use. The choice depends on your ecosystem needs rather than raw performance, as the difference is negligible for most applications.
Congratulations.
You built a native TensorFlow inference service using pure Java.
No JNI.
No Python runtime.
No containers.
Java 25 plus FFM gives you direct access to native ML runtimes with predictable performance and memory behavior.
This is how local AI agents should be built.
Simple. Explicit. Boring.