Resilient AI with Quarkus: Fault Tolerance Meets LangChain4j
Build Java microservices that call local LLMs safely with retries, circuit breakers, bulkheads, and fallbacks.
Modern Java services increasingly call local or remote LLMs. These calls are slow, bursty, and occasionally flaky, which makes them perfect material for demonstrating real fault tolerance, not just retries.
In this hands-on tutorial you’ll build a small REST service that summarizes a URL’s content using a local Ollama model via LangChain4j. You’ll wrap the LLM call with SmallRye Fault Tolerance strategies: timeout, retry with backoff, circuit breaker, bulkhead, fallback, and rate limiting. We’ll keep it pragmatic and production-aware.
Quarkus provides SmallRye Fault Tolerance (an implementation of MicroProfile Fault Tolerance) and exposes both the spec features and SmallRye extras like rate limiting.
LangChain4j’s Quarkus extensions make calling Ollama models straightforward and configurable.
Why this matters in the enterprise
LLM calls are networked and compute-heavy. Timeouts and backpressure protect upstreams and your bill.
Retries help with transient model hiccups. Circuit breakers stop runaway cascades.
Fallbacks preserve user experience when AI is down.
Rate limiting and bulkheads keep one hot endpoint from starving the rest of the service.
Prerequisites
Java 17+
Maven 3.9+
Podman 5+ (or Docker). We’ll rely on Quarkus Dev Services to auto-start Ollama in dev, but you can also run a container manually.
Quarkus 3.26.x (latest at the time of writing). If you are on an older minor, update with quarkus update.
Bootstrap the project
Use the Quarkus CLI or Maven:
quarkus create app com.example:resilient-ai \
--extension='smallrye-fault-tolerance,rest-jackson,quarkus-langchain4j-ollama' \
--no-code
cd resilient-ai
Find the source code on my GitHub repository if you don’t want to build it out yourself.
Configuration
We’ll use Llama3 for fast local dev, set a global LLM timeout, and configure fault-tolerance overrides using the official “identifier” format documented in the Quarkus FT guide.
src/main/resources/application.properties:
# --- LangChain4j Ollama ---
# Dev mode can auto-start Ollama and auto-pull the model; pre-pull to speed up.
# Model choices: qwen3:1.7b, llama3, mistral, etc.
quarkus.langchain4j.ollama.chat-model.model-id=llama3
quarkus.langchain4j.ollama.chat-model.temperature=0
# Global inference timeout for LangChain4j (Quarkus extension setting):
quarkus.langchain4j.timeout=45s
# --- Fault Tolerance global toggles ---
quarkus.fault-tolerance.metrics.enabled=true
# Keep SmallRye's sensible non-strict mode (see guide for differences)
quarkus.fault-tolerance.mp-compatibility=false
# --- Per-method FT overrides using <identifier> = ClassName/methodName ---
# We’ll override annotations applied on AiSummarizer.summarize(..)
# Timeout tighter than global LLM timeout
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".timeout.value=10
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".timeout.unit=SECONDS
# Retry with exponential backoff and jitter (SmallRye)
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".retry.max-retries=2
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".retry.delay=200
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".retry.delay-unit=MILLIS
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".retry.jitter=400
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".retry.jitter-unit=millis
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".exponential-backoff.enabled=true
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".exponential-backoff.factor=2
# Circuit breaker tuned for LLM flakiness
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".circuit-breaker.request-volume-threshold=8
quarkus.fault-tolerance."AiSummarizer/summarize".circuit-breaker.failure-ratio=0.5
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".circuit-breaker.delay=10
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".circuit-breaker.delay-unit=SECONDS
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".circuit-breaker.success-threshold=2
# Bulkhead to avoid CPU thrash (synchronous; no queue)
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".bulkhead.value=8
# Rate limit (SmallRye extra) to throttle bursty clients
# 5 requests per second:
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".rate-limit.value=5
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".rate-limit.window=1
quarkus.fault-tolerance."com.example.AiSummarizer/summarize".rate-limit.window-unit=SECONDS
All property names and the <identifier> pattern can be looked up in the Quarkus FT configuration reference. The @RateLimit strategy is a SmallRye feature available in Quarkus.
Core implementation
We’ll implement four parts:
A tiny HTTP client to fetch page text.
An AI service that summarizes the text.
A resource that orchestrates fetch + summarize and applies fault tolerance.
A simple cache fallback when the circuit is open or timeouts hit.
Fetch minimal text
src/main/java/com/example/FetchService.java:
package com.example;

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class FetchService {

    private final HttpClient client = HttpClient.newHttpClient();

    public String fetch(String url) throws IOException, InterruptedException {
        HttpRequest req = HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", "ResilientAI/1.0")
                .GET()
                .build();
        HttpResponse<String> resp = client.send(req, HttpResponse.BodyHandlers.ofString());
        if (resp.statusCode() / 100 != 2) {
            throw new IOException("Non-2xx from upstream: " + resp.statusCode());
        }
        // Naive text cleanup; real systems would use Tika/Docling.
        return resp.body().replaceAll("\\s+", " ").trim();
    }
}
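Context windows are finite, and long prompts are slow. Before handing fetched text to the model, you may want to cap its size. A minimal sketch you could drop into FetchService (the helper name and the 8,000-character budget are assumptions, not part of the tutorial code):

// Hypothetical guard: cap page text so prompts stay small and inference stays fast.
static String capForPrompt(String text) {
    final int maxChars = 8_000; // arbitrary budget; tune to your model's context window
    return text.length() <= maxChars ? text : text.substring(0, maxChars);
}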
The AI service (LangChain4j)
src/main/java/com/example/Assistant.java:
package com.example;

import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;

@RegisterAiService
public interface Assistant {

    @UserMessage("""
            Summarize the following content in 5 bullet points.
            Be concise and factual. Content: {text}
            """)
    String summarize(String text);
}
This uses the Quarkus LangChain4j Ollama integration and the configured model.
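If you want to pin tone and format across every call, LangChain4j AI services also accept a system message. A variant sketch (the interface name and the prompt wording are assumptions):

package com.example;

import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;

// Variant with a system message that constrains every response.
@RegisterAiService
public interface StrictAssistant {

    @SystemMessage("You are a terse technical summarizer. Be factual; never speculate.")
    @UserMessage("Summarize the following content in 5 bullet points. Content: {text}")
    String summarize(String text);
}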
The resilient orchestrator with SmallRye FT
src/main/java/com/example/AiSummarizer.java:
package com.example;

import org.eclipse.microprofile.faulttolerance.Bulkhead;
import org.eclipse.microprofile.faulttolerance.CircuitBreaker;
import org.eclipse.microprofile.faulttolerance.Fallback;
import org.eclipse.microprofile.faulttolerance.Retry;
import org.eclipse.microprofile.faulttolerance.Timeout;

import io.smallrye.faulttolerance.api.ExponentialBackoff;
import io.smallrye.faulttolerance.api.RateLimit;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;

@ApplicationScoped
public class AiSummarizer {

    @Inject
    Assistant assistant;

    @Inject
    FetchService fetch;

    // Apply practical, layered safeguards:
    @Timeout(value = 10000)             // fast abort on slow model calls (ms)
    @Retry(maxRetries = 2, delay = 200) // transient hiccups
    @ExponentialBackoff                 // multiplicative backoff (SmallRye)
    @CircuitBreaker(requestVolumeThreshold = 8, failureRatio = 0.5, delay = 10000, // ms
            successThreshold = 2)
    @Bulkhead(value = 8)                // cap parallelism
    @RateLimit(value = 5)               // 5 requests per window; the 1s window is set in config (SmallRye)
    @Fallback(fallbackMethod = "cachedSummary")
    public String summarize(String url) throws Exception {
        String text = fetch.fetch(url);
        // Keep the prompt compact; we rely on model defaults for length.
        return assistant.summarize(text);
    }

    // Fallback must match the guarded method's parameters and return type.
    String cachedSummary(String url) {
        // Minimal local fallback: deterministic stub or a tiny cached value.
        // In real life, use a proper cache or last-known-good store.
        return "Service is busy. Here is a safe fallback: "
                + "Unable to summarize right now for " + url + ". Try again shortly.";
    }
}
Notes:
@ExponentialBackoff and @RateLimit are SmallRye extensions exposed by Quarkus; they’re documented as part of the SmallRye FT “extras” (smallrye.io).
All annotation parameters can be overridden with the configuration you saw earlier, using the <identifier> format com.example.AiSummarizer/summarize.
REST endpoint
src/main/java/com/example/SummaryResource.java:
package com.example;

import jakarta.inject.Inject;
import jakarta.ws.rs.BadRequestException;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.QueryParam;
import jakarta.ws.rs.core.MediaType;
import jakarta.ws.rs.core.Response;

@Path("/summarize")
@Produces(MediaType.TEXT_PLAIN)
public class SummaryResource {

    @Inject
    AiSummarizer summarizer;

    @GET
    public Response summarize(@QueryParam("url") String url) {
        if (url == null || url.isBlank()) {
            throw new BadRequestException("Missing ?url=");
        }
        try {
            String result = summarizer.summarize(url);
            return Response.ok(result).build();
        } catch (Exception e) {
            // Most failures are handled by FT; this is the last resort.
            return Response.serverError()
                    .entity("Unexpected error: " + e.getMessage())
                    .build();
        }
    }
}
Run and verify
quarkus dev
Quarkus Dev Services can auto-start Ollama and pull the model on first use. If you’re using a natively installed Ollama, pre-pull to speed up: ollama pull llama3.
Try it out
Let’s first try the happy path with a small website so we can see it working:
curl 'http://localhost:8080/summarize?url=https://myfear.com'
You should see 5 bullets. If the model is still downloading, you may hit the fallback briefly.
Now a heavier case that my local machine could not complete within the configured timeouts and retries:
curl 'http://localhost:8080/summarize?url=https://quarkus.io/guides/smallrye-fault-tolerance'
The REST endpoint responds with:
"Service is busy. Here is a safe fallback: Unable to summarize right now for https://quarkus.io/guides/smallrye-fault-tolerance. Try again shortly."
What each strategy buys you
@Timeout: caps latency to protect p95 and threads.
@Retry + @ExponentialBackoff: tames transient failures without creating synchronized retry storms. SmallRye provides the backoff decorator and the jitter configuration.
@CircuitBreaker: sheds load when the model or host is struggling. Parameters like requestVolumeThreshold, failureRatio, delay, and successThreshold are configurable.
@Bulkhead: limits concurrent LLM calls so CPU doesn’t starve other endpoints.
@RateLimit (SmallRye): throttles bursts per time window; combine with an API gateway for global quotas.
@Fallback: ensures graceful degradation to cached or deterministic output.
All of these can be tuned per method via configuration keys, not just annotations, which is ideal for ops-driven tuning.
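For example, Quarkus configuration profiles make those overrides environment-specific without a rebuild. A hypothetical snippet (the values are illustrative, not recommendations):

# Hypothetical per-environment override: more patience in prod, none in dev
%prod.quarkus.fault-tolerance."com.example.AiSummarizer/summarize".retry.max-retries=3
%dev.quarkus.fault-tolerance."com.example.AiSummarizer/summarize".retry.max-retries=0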
Production notes
Pin model and timeouts. Keep quarkus.langchain4j.ollama.chat-model.model-id explicit and set quarkus.langchain4j.timeout based on real latency.
Metrics. If you use Micrometer or SmallRye Metrics, FT metrics are emitted automatically; wire dashboards and alerts on open circuits, retry rates, and timeouts.
Backpressure. Bulkhead + rate limit at the app; add gateway limits at the edge.
Cache your fallbacks. Use a store (Redis, Postgres) for last-known-good summaries; see the sketch after this list.
Tune the breaker. Validate the window size (requestVolumeThreshold) and failureRatio using load tests that mimic real user patterns.
Threading. If you switch to async (Uni), read the SmallRye FT notes on asynchronous behavior and request context propagation in Quarkus.
Config over code. Keep conservative defaults in annotations and override per environment with the documented property names.
SmallRye FT is designed for guarding remote calls; don’t use it as a generic exception-handling replacement.
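To make the "cache your fallbacks" advice concrete, here is a minimal in-memory last-known-good store. The class and its wiring are assumptions for illustration; production would use Redis or Postgres as noted above:

package com.example;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import jakarta.enterprise.context.ApplicationScoped;

// Hypothetical last-known-good store; swap the map for Redis or Postgres in production.
@ApplicationScoped
public class SummaryCache {

    private final Map<String, String> lastGood = new ConcurrentHashMap<>();

    // Called after every successful summarization.
    public void store(String url, String summary) {
        lastGood.put(url, summary);
    }

    // Called from the fallback; null means we have nothing better than the stub text.
    public String lastKnownGood(String url) {
        return lastGood.get(url);
    }
}

AiSummarizer.summarize(..) would call store(url, result) after each success, and cachedSummary(..) would prefer lastKnownGood(url) over the stub text.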
Where to explore next
Use @RetryWhen for result-based retry decisions, or the programmatic Guard API (@ApplyGuard) when you need composition. Quarkus documents both; a @RetryWhen sketch follows after this list.
Add OpenTelemetry to correlate user requests with FT metrics and AI latency.
Swap models per environment by config only. The LangChain4j guide shows the model keys.
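A sketch of @RetryWhen based on the SmallRye FT docs: a predicate class decides whether a result should be retried. The bean, the IsBlank predicate, and the stubbed method body are illustrative assumptions:

package com.example;

import java.util.function.Predicate;

import org.eclipse.microprofile.faulttolerance.Retry;

import io.smallrye.faulttolerance.api.RetryWhen;
import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class StrictSummarizer {

    // Hypothetical predicate: an empty model answer counts as a retryable outcome.
    public static class IsBlank implements Predicate<Object> {
        @Override
        public boolean test(Object value) {
            return value instanceof String s && s.isBlank();
        }
    }

    @Retry(maxRetries = 2)
    @RetryWhen(result = IsBlank.class) // retry on blank results, not only on exceptions
    public String summarize(String text) {
        // ... call the assistant here; a blank reply triggers a retry
        return "";
    }
}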
Full source recap
You now have a complete, runnable Quarkus service that calls a local LLM and remains stable under real-world failure modes using SmallRye Fault Tolerance. The combination is ideal for AI-infused enterprise Java services where reliability matters as much as intelligence.
Build for resilience, not hope.