Resilient AI with Quarkus: Fault Tolerance Meets LangChain4j
Build Java microservices that call local LLMs safely with retries, circuit breakers, bulkheads, and fallbacks.
Modern Java services increasingly call local or remote LLMs. These calls are slow, bursty, and occasionally flaky, which makes them perfect material for demonstrating real fault tolerance, not just retries.
In this hands-on tutorial you’ll build a small REST service that summarizes a URL’s content using a local Ollama model via LangChain4j. You’ll wrap the LLM call with SmallRye Fault Tolerance strategies: timeout, retry with backoff, circuit breaker, bulkhead, fallback, and rate limiting. We’ll keep it pragmatic and production-aware.
Quarkus provides SmallRye Fault Tolerance (an implementation of MicroProfile Fault Tolerance) and exposes both the spec features and SmallRye extras like rate limiting.
LangChain4j’s Quarkus extensions make calling Ollama models straightforward and configurable.
Why this matters in the enterprise
LLM calls are networked and compute-heavy. Timeouts and backpressure protect upstreams and your bill.
Retries help with transient model hiccups. Circuit breakers stop runaway cascades.
Fallbacks preserve user experience when AI is down.
Rate limiting and bulkheads keep one hot endpoint from starving the rest of the service.
Prerequisites
Java 17+
Maven 3.9+
Podman 5+ (or Docker). We’ll rely on Quarkus Dev Services to auto-start Ollama in dev, but you can also run a container manually.
Quarkus 3.26.x (latest at the time of writing). If you are on an older minor, update with quarkus update.
Bootstrap the project
Use the Quarkus CLI or Maven:
quarkus create app com.example:resilient-ai \
--extension='smallrye-fault-tolerance,rest-jackson,quarkus-langchain4j-ollama' \
--no-code
cd resilient-ai
Find the source code on my GitHub repository if you don’t want to build it out yourself.
Configuration
We’ll use Llama3 for fast local dev, set a global LLM timeout, and configure fault-tolerance overrides using the official “identifier” format documented in the Quarkus FT guide.
src/main/resources/application.properties:
# --- LangChain4j Ollama ---
# Dev mode can auto-start Ollama and auto-pull the model; pre-pull to speed up.
# Model choices: qwen3:1.7b, llama3, mistral, etc.
quarkus.langchain4j.ollama.chat-model.model-id=llama3
quarkus.langchain4j.ollama.chat-model.temperature=0
# Global inference timeout for LangChain4j (Quarkus extension setting):
quarkus.langchain4j.timeout=45s
# --- Fault Tolerance global toggles ---
quarkus.fault-tolerance.metrics.enabled=true
# Keep SmallRye's sensible non-strict mode (see guide for differences)
quarkus.fault-tolerance.mp-compatibility=false
# --- Per-method FT overrides using <identifier> = ClassName/methodName ---
# We’ll override annotations applied on AiSummarizer.summarize(..)
# Timeout tighter than global LLM timeout
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".timeout.value=10
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".timeout.unit=SECONDS
# Retry with exponential backoff and jitter (SmallRye)
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".retry.max-retries=2
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".retry.delay=200
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".retry.delay-unit=MILLIS
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".retry.jitter=400
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".retry.jitter-unit=millis
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".exponential-backoff.enabled=true
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".exponential-backoff.factor=2
# Circuit breaker tuned for LLM flakiness
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".circuit-breaker.request-volume-threshold=8
quarkus.fault-tolerance."AiSummarizer/summarize".circuit-breaker.failure-ratio=0.5
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".circuit-breaker.delay=10
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".circuit-breaker.delay-unit=SECONDS
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".circuit-breaker.success-threshold=2
# Bulkhead to avoid CPU thrash (synchronous; no queue)
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".bulkhead.value=8
# Rate limit (SmallRye extra) to throttle bursty clients
# 5 requests per second:
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".rate-limit.value=5
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".rate-limit.window=1
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".rate-limit.window-unit=SECONDS
All property names and the <identifier> pattern can be looked up in the Quarkus FT configuration reference. The @RateLimit strategy is a SmallRye feature available in Quarkus.
Core implementation
We’ll implement four parts:
A tiny HTTP client to fetch page text.
An AI service that summarizes the text.
A resource that orchestrates fetch + summarize and applies fault tolerance.
A simple cache fallback when the circuit is open or timeouts hit.
Fetch minimal text
src/main/java/com/example/FetchService.java:
package com.example;

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class FetchService {

    private final HttpClient client = HttpClient.newHttpClient();

    public String fetch(String url) throws IOException, InterruptedException {
        HttpRequest req = HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", "ResilientAI/1.0")
                .GET()
                .build();
        HttpResponse<String> resp = client.send(req, HttpResponse.BodyHandlers.ofString());
        if (resp.statusCode() / 100 != 2) {
            throw new IOException("Non-2xx from upstream: " + resp.statusCode());
        }
        // Naive text cleanup; real systems would use Tika/Docling.
        return resp.body().replaceAll("\\s+", " ").trim();
    }
}
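Context windows are finite, and long prompts are slow. Before handing fetched text to the model, you may want to cap its size. A minimal sketch you could drop into FetchService (the helper name and the 8,000-character budget are assumptions, not part of the tutorial code):

// Hypothetical guard: cap page text so prompts stay small and inference stays fast.
static String capForPrompt(String text) {
    final int maxChars = 8_000; // arbitrary budget; tune to your model's context window
    return text.length() <= maxChars ? text : text.substring(0, maxChars);
}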
The AI service (LangChain4j)
src/main/java/com/example/Assistant.java:
package com.example;

import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;

@RegisterAiService
public interface Assistant {

    @UserMessage("""
            Summarize the following content in 5 bullet points.
            Be concise and factual. Content: {text}
            """)
    String summarize(String text);
}
This uses the Quarkus LangChain4j Ollama integration and the configured model.
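If you want to pin tone and format across every call, LangChain4j AI services also accept a system message. A variant sketch (the interface name and the prompt wording are assumptions):

package com.example;

import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;

// Variant with a system message that constrains every response.
@RegisterAiService
public interface StrictAssistant {

    @SystemMessage("You are a terse technical summarizer. Be factual; never speculate.")
    @UserMessage("Summarize the following content in 5 bullet points. Content: {text}")
    String summarize(String text);
}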
The resilient orchestrator with SmallRye FT
src/main/java/com/example/AiSummarizer.java:
package com.example;

import org.eclipse.microprofile.faulttolerance.Bulkhead;
import org.eclipse.microprofile.faulttolerance.CircuitBreaker;
import org.eclipse.microprofile.faulttolerance.Fallback;
import org.eclipse.microprofile.faulttolerance.Retry;
import org.eclipse.microprofile.faulttolerance.Timeout;

import io.smallrye.faulttolerance.api.ExponentialBackoff;
import io.smallrye.faulttolerance.api.RateLimit;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;

@ApplicationScoped
public class AiSummarizer {

    @Inject
    Assistant assistant;

    @Inject
    FetchService fetch;

    // Apply practical, layered safeguards:
    @Timeout(value = 10000)             // fast abort on slow model calls (ms)
    @Retry(maxRetries = 2, delay = 200) // transient hiccups
    @ExponentialBackoff                 // multiplicative backoff (SmallRye)
    @CircuitBreaker(requestVolumeThreshold = 8, failureRatio = 0.5, delay = 10000, // ms
            successThreshold = 2)
    @Bulkhead(value = 8)                // cap parallelism
    @RateLimit(value = 5)               // 5 requests per window; the 1s window is set in config (SmallRye)
    @Fallback(fallbackMethod = "cachedSummary")
    public String summarize(String url) throws Exception {
        String text = fetch.fetch(url);
        // Keep the prompt compact; we rely on model defaults for length.
        return assistant.summarize(text);
    }

    // Fallback must match the guarded method's parameters and return type.
    String cachedSummary(String url) {
        // Minimal local fallback: deterministic stub or a tiny cached value.
        // In real life, use a proper cache or last-known-good store.
        return "Service is busy. Here is a safe fallback: "
                + "Unable to summarize right now for " + url + ". Try again shortly.";
    }
}
Notes:
@ExponentialBackoff and @RateLimit are SmallRye extensions exposed by Quarkus; they’re documented as part of the SmallRye FT “extras” (smallrye.io).
All annotation parameters can be overridden with the configuration you saw earlier, using the <identifier> format com.example.AiSummarizer/summarize.
REST endpoint
src/main/java/com/example/SummaryResource.java:
package com.example;

import jakarta.inject.Inject;
import jakarta.ws.rs.BadRequestException;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.QueryParam;
import jakarta.ws.rs.core.MediaType;
import jakarta.ws.rs.core.Response;

@Path("/summarize")
@Produces(MediaType.TEXT_PLAIN)
public class SummaryResource {

    @Inject
    AiSummarizer summarizer;

    @GET
    public Response summarize(@QueryParam("url") String url) {
        if (url == null || url.isBlank()) {
            throw new BadRequestException("Missing ?url=");
        }
        try {
            String result = summarizer.summarize(url);
            return Response.ok(result).build();
        } catch (Exception e) {
            // Most failures are handled by FT; this is the last resort.
            return Response.serverError()
                    .entity("Unexpected error: " + e.getMessage())
                    .build();
        }
    }
}
Run and verify
quarkus dev
Quarkus Dev Services can auto-start Ollama and pull the model on first use. If you’re using a natively installed Ollama, pre-pull to speed up: ollama pull llama3.
Try it out
Let’s first try the happy path with a small website so we can see it working:
curl 'http://localhost:8080/summarize?url=https://myfear.com'
You should see 5 bullets. If the model is still downloading, you may hit the fallback briefly.
Now a heavier case that my local machine could not complete within the configured timeouts and retries:
curl 'http://localhost:8080/summarize?url=https://quarkus.io/guides/smallrye-fault-tolerance'
The REST endpoint responds with:
"Service is busy. Here is a safe fallback: Unable to summarize right now for https://quarkus.io/guides/smallrye-fault-tolerance. Try again shortly."
What each strategy buys you
@Timeout: caps latency to protect p95 and threads.
@Retry + @ExponentialBackoff: tames transient failures without creating synchronized retry storms. SmallRye provides the backoff decorator and the jitter configuration.
@CircuitBreaker: sheds load when the model or host is struggling. Parameters like requestVolumeThreshold, failureRatio, delay, and successThreshold are configurable.
@Bulkhead: limits concurrent LLM calls so CPU doesn’t starve other endpoints.
@RateLimit (SmallRye): throttles bursts per time window; combine with an API gateway for global quotas.
@Fallback: ensures graceful degradation to cached or deterministic output.
All of these can be tuned per method via configuration keys, not just annotations, which is ideal for ops-driven tuning.
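For example, Quarkus configuration profiles make those overrides environment-specific without a rebuild. A hypothetical snippet (the values are illustrative, not recommendations):

# Hypothetical per-environment override: more patience in prod, none in dev
%prod.quarkus.fault-tolerance."com.example.AiSummarizer/summarize".retry.max-retries=3
%dev.quarkus.fault-tolerance."com.example.AiSummarizer/summarize".retry.max-retries=0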
Production notes
Pin model and timeouts. Keep quarkus.langchain4j.ollama.chat-model.model-id explicit and set quarkus.langchain4j.timeout based on real latency.
Metrics. If you use Micrometer or SmallRye Metrics, FT metrics are emitted automatically; wire dashboards and alerts on open circuits, retry rates, and timeouts.
Backpressure. Bulkhead + rate limit at the app; add gateway limits at the edge.
Cache your fallbacks. Use a store (Redis, Postgres) for last-known-good summaries; see the sketch after this list.
Tune the breaker. Validate the window size (requestVolumeThreshold) and failureRatio using load tests that mimic real user patterns.
Threading. If you switch to async (Uni), read the SmallRye FT notes on asynchronous behavior and request context propagation in Quarkus.
Config over code. Keep conservative defaults in annotations and override per environment with the documented property names.
SmallRye FT is designed for guarding remote calls; don’t use it as a generic exception-handling replacement.
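To make the "cache your fallbacks" advice concrete, here is a minimal in-memory last-known-good store. The class and its wiring are assumptions for illustration; production would use Redis or Postgres as noted above:

package com.example;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import jakarta.enterprise.context.ApplicationScoped;

// Hypothetical last-known-good store; swap the map for Redis or Postgres in production.
@ApplicationScoped
public class SummaryCache {

    private final Map<String, String> lastGood = new ConcurrentHashMap<>();

    // Called after every successful summarization.
    public void store(String url, String summary) {
        lastGood.put(url, summary);
    }

    // Called from the fallback; null means we have nothing better than the stub text.
    public String lastKnownGood(String url) {
        return lastGood.get(url);
    }
}

AiSummarizer.summarize(..) would call store(url, result) after each success, and cachedSummary(..) would prefer lastKnownGood(url) over the stub text.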
Where to explore next
Use @RetryWhen for result-based retry decisions, or the programmatic Guard API (@ApplyGuard) when you need composition. Quarkus documents both; a @RetryWhen sketch follows after this list.
Add OpenTelemetry to correlate user requests with FT metrics and AI latency.
Swap models per environment by config only. The LangChain4j guide shows the model keys.
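A sketch of @RetryWhen based on the SmallRye FT docs: a predicate class decides whether a result should be retried. The bean, the IsBlank predicate, and the stubbed method body are illustrative assumptions:

package com.example;

import java.util.function.Predicate;

import org.eclipse.microprofile.faulttolerance.Retry;

import io.smallrye.faulttolerance.api.RetryWhen;
import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class StrictSummarizer {

    // Hypothetical predicate: an empty model answer counts as a retryable outcome.
    public static class IsBlank implements Predicate<Object> {
        @Override
        public boolean test(Object value) {
            return value instanceof String s && s.isBlank();
        }
    }

    @Retry(maxRetries = 2)
    @RetryWhen(result = IsBlank.class) // retry on blank results, not only on exceptions
    public String summarize(String text) {
        // ... call the assistant here; a blank reply triggers a retry
        return "";
    }
}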
Full source recap
You now have a complete, runnable Quarkus service that calls a local LLM and remains stable under real-world failure modes using SmallRye Fault Tolerance. The combination is ideal for AI-infused enterprise Java services where reliability matters as much as intelligence.
Build for resilience, not hope.