Model Routing in Quarkus LangChain4j with Ollama
Build a two-lane Quarkus service that classifies prompts, routes cheap questions to a fast local model, and keeps routing decisions testable and observable.
ForgeCI’s on-call queue looked fine until someone opened the inference bill. “What flag disables cache?” and “why does my pipeline OOM only on cached arm64 builds?” both hit the same model at the same per-token rate. The cheap questions subsidized the expensive ones.
This tutorial fixes that. We build ForgeAssist: a Quarkus service that classifies each incoming prompt, routes it to the cheapest Ollama model that can plausibly handle it, and logs the routing decision without blocking the HTTP response. The sample stays small: one enum, one AI service, one router, one observer, and one REST endpoint.
What you will be able to do
By the end you can:
Configure multiple named Ollama model instances in one Quarkus application.
Declare a structured-output AiService that returns a Java enum from an LLM.
Inject named ChatModel beans programmatically and pick between them at runtime.
Fire a CDI event to make routing decisions observable without polluting the response path.
Write a @QuarkusTest that asserts routing behavior instead of answer quality.
What we build
ForgeAssist exposes POST /assist with a plain-text question body. Every request flows through a classifier on the fast lane (qwen2.5:0.5b), then dispatches to either the fast lane or the power lane (llama3.2) based on a Complexity enum. A CDI observer logs the decision asynchronously.
The request lands on the REST resource, the router asks the classifier on the fast model, then it picks a lane, returns the answer, and fires an async routing event on the way out.
Named model — a second configuration block in application.properties, distinguished by a string identifier.
AiService — a Java interface Quarkus implements at build time by calling an LLM.
Structured output — the return type drives schema-aware parsing; you do not hand-parse enum strings.
@ModelName — a LangChain4j CDI qualifier that selects which named model bean to inject.
CDI event — decoupled notification; fire() is synchronous, fireAsync() is asynchronous.
What you need
You have written @QuarkusTest before and know what a LangChain4j AI service interface looks like.
JDK 25 (java -version)
Maven 3.9+ (mvn -version)
Ollama installed and running (ollama serve)
Models pulled: ollama pull qwen2.5:0.5b and ollama pull llama3.2
Quarkus CLI installed (quarkus --version); optional, since the Maven wrapper works fine
About two ☕️☕️
Project setup
quarkus create app dev.forgeassist:forgeassist-routing \
--package-name=dev.forgeassist \
--extension='quarkus-rest,quarkus-langchain4j-ollama,quarkus-arc' \
--java=25 \
--no-code
cd forgeassist-routing
When you add the Ollama extension, the generator also adds quarkus-langchain4j-bom to dependencyManagement.
Add test dependencies to pom.xml:
<dependency>
<groupId>io.quarkus</groupId>
<artifactId>quarkus-junit-mockito</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>io.rest-assured</groupId>
<artifactId>rest-assured</artifactId>
<scope>test</scope>
</dependency>
Run a quick sanity check before business logic:
quarkus dev
Quarkus should start and Dev UI should be at http://localhost:8080/q/dev. No model call happens yet. We are only confirming the extension graph wires up.
Configure two Ollama models
This block is the spine of the tutorial. Misread it and you will debug phantom CDI issues for an hour.
src/main/resources/application.properties:
quarkus.application.name=forgeassist-routing
# Explicit host Ollama — Dev Services off for teaching clarity
quarkus.langchain4j.ollama.devservices.enabled=false
quarkus.langchain4j.ollama.base-url=http://localhost:11434
# Power lane (default unnamed bean)
quarkus.langchain4j.ollama.chat-model.model-id=llama3.2
quarkus.langchain4j.ollama.chat-model.temperature=0.7
quarkus.langchain4j.timeout=60s
# Fast lane (named "fast")
quarkus.langchain4j.fast.chat-model.provider=ollama
quarkus.langchain4j.ollama.fast.base-url=${quarkus.langchain4j.ollama.base-url}
quarkus.langchain4j.ollama.fast.chat-model.model-id=qwen2.5:0.5b
quarkus.langchain4j.ollama.fast.chat-model.temperature=0.0
quarkus.langchain4j.ollama.fast.timeout=15s
Named model config has two parts: quarkus.langchain4j.<name>.* at the provider level and quarkus.langchain4j.ollama.<name>.* for Ollama-specific settings. Leave <name> out and you configure the default bean. Add fast and CDI gets a second bean exposed through @ModelName("fast"). The quarkus.langchain4j.fast.chat-model.provider=ollama line matters; skip it and the named lane does not resolve cleanly.
Temperature 0.0 on the fast model is deliberate. The classifier must be deterministic; a stochastic classifier that occasionally says SIMPLE when it means COMPLEX defeats the purpose.
Dev Services is off on purpose. Quarkus LangChain4j can auto-start Ollama and pull models in dev mode, which is convenient locally and bad for a tutorial that is supposed to show the real endpoint and the real failure path.
One quick sanity check before you move on: if Dev Services is off and Ollama is unreachable, or qwen2.5:0.5b is missing, what does the first AI request do? Make the prediction, then try it. That is the failure you want to recognize later when the logs are less friendly.
The complexity enum
I use a binary enum rather than SIMPLE / MODERATE / COMPLEX. More tiers need more calibration and prompt tuning, and they only help when you have three meaningfully different models to route to. Start with two lanes; extend later if the economics justify it.
src/main/java/dev/forgeassist/Complexity.java:
package dev.forgeassist;
public enum Complexity {
/**
* Factual lookups, single-step how-tos, definitional questions.
* Examples: "What does --dry-run do?", "List the ForgeCI environment
* variables."
*/
SIMPLE,
/**
* Multi-step reasoning, debugging with context, architectural trade-offs,
* ambiguous or environment-specific problems.
* Examples: "Why does my pipeline OOM only on cached arm64 builds?"
*/
COMPLEX
}
The classifier AiService
Quarkus LangChain4j can return Java types from an AiService method, not just String. With an enum return type, the framework injects the enum schema into the prompt and deserializes the response, so there is no hand-rolled parsing code. The trade-off is still real: a confused model can return something unparseable. For a two-lane classifier with a small context window, I think that risk is low enough.
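If that risk ever bites, one fallback is to ask the model for a raw String and parse it yourself. The sketch below is a hypothetical helper, not something Quarkus LangChain4j ships: LenientComplexityParser and parseOrDefault are names I made up, and the Complexity enum is repeated only so the snippet compiles alone. It strips stray punctuation and treats anything it cannot recognize as COMPLEX, so doubtful output goes to the stronger model.

```java
import java.util.Locale;

/** Hypothetical helper, not part of Quarkus LangChain4j. */
final class LenientComplexityParser {

    private LenientComplexityParser() {}

    /** Parses raw model text; anything unrecognized becomes COMPLEX. */
    static Complexity parseOrDefault(String raw) {
        if (raw == null) {
            return Complexity.COMPLEX;
        }
        // Keep letters only: drops whitespace, quotes, and trailing periods.
        String cleaned = raw.replaceAll("[^A-Za-z]", "").toUpperCase(Locale.ROOT);
        try {
            return Complexity.valueOf(cleaned);
        } catch (IllegalArgumentException e) {
            // Not an enum constant: route up rather than guess.
            return Complexity.COMPLEX;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrDefault(" simple.\n")); // prints SIMPLE
        System.out.println(parseOrDefault("It depends")); // prints COMPLEX
    }
}

/** Mirrors the tutorial's enum so the sketch compiles on its own. */
enum Complexity { SIMPLE, COMPLEX }
```

The enum return type is still the simpler default. A parser like this only earns its place if you see real parse failures in your logs.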
src/main/java/dev/forgeassist/PromptClassifier.java:
package dev.forgeassist;
import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;
@RegisterAiService(modelName = "fast")
public interface PromptClassifier {
@SystemMessage("""
You are a prompt complexity classifier for ForgeCI, a CI/CD platform.
Classify the user's question into exactly one of: SIMPLE, COMPLEX.
SIMPLE: factual lookups, flag definitions, single-step how-tos,
questions answerable from documentation alone.
COMPLEX: debugging with environment context, multi-step reasoning,
architectural trade-offs, questions that require inference
beyond what documentation states.
Respond with ONLY the enum value. No punctuation. No explanation.
""")
Complexity classify(@UserMessage String prompt);
}
modelName = "fast" on the classifier matters. Classification is already the cheap job. Using the power model to decide whether to use the power model is circular and a little silly.
The system prompt uses ForgeCI-flavored examples on purpose. Generic “SIMPLE: short questions” prompts look tidy and perform worse once you hit the awkward edge cases.
At the wire, a simple flag question looks like this:
{
"model": "qwen2.5:0.5b",
"messages": [
{
"role": "system",
"content": "You are a prompt complexity classifier..."
},
{
"role": "user",
"content": "What does the --no-cache flag do in forge build?"
}
]
}
Response: SIMPLE
Routing decision record
We define the record first so the fireAsync line later is moving a real type, not a mystery blob.
src/main/java/dev/forgeassist/RoutingDecision.java:
package dev.forgeassist;
import java.time.Instant;
/**
* Immutable record of a single model routing decision.
* Fired as a CDI event; consumed by observers for logging and metrics.
*/
public record RoutingDecision(
String prompt,
Complexity complexity,
String selectedModel,
long classificationMillis,
Instant timestamp) {
public RoutingDecision(
String prompt, Complexity complexity, String selectedModel, long classificationMillis) {
this(prompt, complexity, selectedModel, classificationMillis, Instant.now());
}
}
Records give you immutability for free. The CDI event keeps the routing decision decoupled from whoever wants to watch it, so the router does not care whether you only log today or add metrics tomorrow.
The routing event observer
src/main/java/dev/forgeassist/RoutingEventObserver.java:
package dev.forgeassist;
import org.jboss.logging.Logger;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.enterprise.event.ObservesAsync;
@ApplicationScoped
public class RoutingEventObserver {
private static final Logger LOG = Logger.getLogger(RoutingEventObserver.class);
public void onRoutingDecision(@ObservesAsync RoutingDecision decision) {
LOG.infof(
"[ROUTING] complexity=%s model=%s classificationMs=%d prompt=\"%s\"",
decision.complexity(),
decision.selectedModel(),
decision.classificationMillis(),
truncate(decision.prompt(), 80));
}
private static String truncate(String s, int max) {
return s.length() <= max ? s : s.substring(0, max) + "…";
}
}
That separation is the whole point: the observer does not know about the router, the router does not know about the observer, and adding a Micrometer counter later does not require another pass through ModelRouter.
The model router
This is the center of the sample, so we build it in layers.
Shell and injections
@ModelName is a Quarkus LangChain4j qualifier — not @Named. Injection without a qualifier resolves to the default (unnamed) bean, which is the power model here.
package dev.forgeassist;
import dev.langchain4j.model.chat.ChatModel;
import io.quarkiverse.langchain4j.ModelName;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.enterprise.event.Event;
import jakarta.inject.Inject;
@ApplicationScoped
public class ModelRouter {
private final PromptClassifier classifier;
private final ChatModel fastModel;
private final ChatModel powerModel;
private final Event<RoutingDecision> routingEvents;
@Inject
public ModelRouter(
PromptClassifier classifier,
@ModelName("fast") ChatModel fastModel,
ChatModel powerModel,
Event<RoutingDecision> routingEvents) {
this.classifier = classifier;
this.fastModel = fastModel;
this.powerModel = powerModel;
this.routingEvents = routingEvents;
}
// route() follows below
}
Classification and timing
At this point in execution, only the fast model has run (via the classifier AiService). The power model has done nothing.
public String route(String userPrompt) {
long start = System.currentTimeMillis();
Complexity complexity = classifier.classify(userPrompt);
long classificationMillis = System.currentTimeMillis() - start;
// dispatch continues below
}
Dispatch and event
Once the target model changes at runtime, drop down to the raw LangChain4j ChatModel.chat(ChatRequest) API instead of wrapping another AiService around it.
ChatModel selected = switch (complexity) {
case SIMPLE -> fastModel;
case COMPLEX -> powerModel;
};
ChatRequest request = ChatRequest.builder().messages(UserMessage.from(userPrompt)).build();
String response = selected.chat(request).aiMessage().text();
routingEvents.fireAsync(
new RoutingDecision(
userPrompt,
complexity,
complexity == Complexity.SIMPLE ? "qwen2.5:0.5b" : "llama3.2",
classificationMillis));
return response;
}
Complete ModelRouter.java with imports:
package dev.forgeassist;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.model.chat.ChatModel;
import dev.langchain4j.model.chat.request.ChatRequest;
import io.quarkiverse.langchain4j.ModelName;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.enterprise.event.Event;
import jakarta.inject.Inject;
@ApplicationScoped
public class ModelRouter {
private final PromptClassifier classifier;
private final ChatModel fastModel;
private final ChatModel powerModel;
private final Event<RoutingDecision> routingEvents;
@Inject
public ModelRouter(
PromptClassifier classifier,
@ModelName("fast") ChatModel fastModel,
ChatModel powerModel,
Event<RoutingDecision> routingEvents) {
this.classifier = classifier;
this.fastModel = fastModel;
this.powerModel = powerModel;
this.routingEvents = routingEvents;
}
public String route(String userPrompt) {
long start = System.currentTimeMillis();
Complexity complexity = classifier.classify(userPrompt);
long classificationMillis = System.currentTimeMillis() - start;
ChatModel selected = switch (complexity) {
case SIMPLE -> fastModel;
case COMPLEX -> powerModel;
};
ChatRequest request = ChatRequest.builder().messages(UserMessage.from(userPrompt)).build();
String response = selected.chat(request).aiMessage().text();
routingEvents.fireAsync(
new RoutingDecision(
userPrompt,
complexity,
complexity == Complexity.SIMPLE ? "qwen2.5:0.5b" : "llama3.2",
classificationMillis));
return response;
}
}
fireAsync keeps the HTTP response path clean: the caller gets the answer immediately and the observer runs asynchronously. That makes sense for diagnostics and lightweight counters. It is a bad fit for side effects you cannot afford to lose unless you also have a plan for seeing and recovering from async observer failures.
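fireAsync returns a CompletionStage, so seeing observer failures can start with a callback attached to that stage. Here is a minimal sketch, assuming a plain CompletableFuture as a stand-in for the stage the router would get back. AsyncFailureVisibility, watch, and FAILED_DELIVERIES are hypothetical names, and the counter stands in for whatever metric you would actually export.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: makes a failed async delivery visible. */
final class AsyncFailureVisibility {

    /** Counts failed deliveries; a real service might export this as a metric. */
    static final AtomicInteger FAILED_DELIVERIES = new AtomicInteger();

    private AsyncFailureVisibility() {}

    /** Attaches a callback so a failed stage is counted and logged, not dropped. */
    static <T> CompletionStage<T> watch(CompletionStage<T> delivery) {
        return delivery.whenComplete((event, failure) -> {
            if (failure != null) {
                FAILED_DELIVERIES.incrementAndGet();
                System.err.println("routing event delivery failed: " + failure);
            }
        });
    }

    public static void main(String[] args) {
        // Stand-in for the stage returned by routingEvents.fireAsync(...).
        CompletableFuture<String> delivery = new CompletableFuture<>();
        watch(delivery);
        delivery.completeExceptionally(new IllegalStateException("observer blew up"));
        System.out.println("failed deliveries: " + FAILED_DELIVERIES.get()); // prints failed deliveries: 1
    }
}
```

In ModelRouter this would wrap the existing call, roughly watch(routingEvents.fireAsync(decision)). It makes failures visible; it does not retry or persist anything.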
REST endpoint
src/main/java/dev/forgeassist/ForgeAssistResource.java:
package dev.forgeassist;
import jakarta.inject.Inject;
import jakarta.ws.rs.Consumes;
import jakarta.ws.rs.POST;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.MediaType;
@Path("/assist")
public class ForgeAssistResource {
private final ModelRouter router;
@Inject
public ForgeAssistResource(ModelRouter router) {
this.router = router;
}
@POST
@Consumes(MediaType.TEXT_PLAIN)
@Produces(MediaType.TEXT_PLAIN)
public String ask(String question) {
return router.route(question);
}
}
Prove it (live Ollama)
With quarkus dev running and Ollama on http://localhost:11434, send one prompt per lane:
# Simple prompt — expect fast lane when classification cooperates
curl -s -X POST http://localhost:8080/assist \
-H "Content-Type: text/plain" \
-d "What does the --no-cache flag do in forge build?"
# Complex prompt — expect power lane
curl -s -X POST http://localhost:8080/assist \
-H "Content-Type: text/plain" \
-d "My ForgeCI pipeline passes locally but fails on cached arm64 runners with an OOM error only when layer cache is warm. What should I investigate first?"
You want a log line like this:
[ROUTING] complexity=SIMPLE model=qwen2.5:0.5b classificationMs=... prompt="What does the --no-cache flag do in forge build?"
The exact classificationMs value moves around, and because the log comes from an async observer it may show up just after the HTTP response. The signal that matters is complexity + model, not exact timings or identical answer text. Small classifiers will mislabel some edge cases. That is normal, and the production section below is where the trade-off starts to matter.
Production risks
Once this works in dev, the pleasant part is over. These are the problems that show up in a real team.
Misclassifying a hard prompt as SIMPLE costs more than over-routing an easy one. The dangerous failure mode is a debugging-heavy prompt stamped SIMPLE and sent to the cheap lane, which then answers confidently and badly. When in doubt, route up, not down.
Logging the raw prompt is a demo convenience. Prompt text can contain tokens, stack traces, customer data, or internal hostnames. Production systems often hash, redact, or sample this field.
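As one way to do the hashing option, here is a small sketch that logs a short SHA-256 fingerprint plus the prompt length instead of the text itself. PromptFingerprint and of are hypothetical names, and the 12-character cut is an arbitrary choice for readable log lines.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: a short fingerprint to log instead of the raw prompt. */
final class PromptFingerprint {

    private PromptFingerprint() {}

    /** Returns the first 12 hex characters of the prompt's SHA-256 digest. */
    static String of(String prompt) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha256.digest(prompt.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.substring(0, 12);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String prompt = "What does the --no-cache flag do in forge build?";
        System.out.println("promptHash=" + of(prompt) + " promptLength=" + prompt.length());
    }
}
```

A fingerprint lets you correlate repeated prompts across log lines. It is not anonymization: short, guessable prompts can be matched by hashing candidates, so treat it as a way to keep raw text out of logs and nothing more.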
Routing adds a second model hop. Latency is now classification plus answer. Keep the classifier on a fast local model, but set explicit timeout boundaries on both lanes and decide what the API does when classification times out.
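One possible answer to "what happens when classification times out" is to stop waiting and route up. The sketch below shows that decision with plain java.util.concurrent and a Function standing in for the classifier. TimedClassification and classifyWithin are hypothetical names, and supplyAsync on the default executor is an assumption that keeps the snippet standalone; inside Quarkus you would more likely hand the work to a container-managed executor.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

/** Hypothetical helper: bounds classification time and routes up on trouble. */
final class TimedClassification {

    private TimedClassification() {}

    static Complexity classifyWithin(
            Function<String, Complexity> classifier, String prompt, long timeoutMillis) {
        return CompletableFuture
                .supplyAsync(() -> classifier.apply(prompt))
                // Still running at the deadline: answer COMPLEX instead of waiting.
                // The classifier call itself is not cancelled by this.
                .completeOnTimeout(Complexity.COMPLEX, timeoutMillis, TimeUnit.MILLISECONDS)
                // A classifier exception also routes up instead of failing the request.
                .exceptionally(failure -> Complexity.COMPLEX)
                .join();
    }

    public static void main(String[] args) {
        Function<String, Complexity> quick = prompt -> Complexity.SIMPLE;
        Function<String, Complexity> stuck = prompt -> {
            sleepQuietly(500); // simulates a classifier that misses the deadline
            return Complexity.SIMPLE;
        };
        System.out.println(classifyWithin(quick, "What does --dry-run do?", 1000));
        System.out.println(classifyWithin(stuck, "What does --dry-run do?", 50));
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

/** Mirrors the tutorial's enum so the sketch compiles on its own. */
enum Complexity { SIMPLE, COMPLEX }
```

Routing up on timeout trades cost for safety: a slow classifier pushes more traffic onto the power lane. Failing the request fast is the other defensible choice. Either way, make the choice deliberately.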
fireAsync() buys latency at the cost of failure visibility. Observer failures are harder to spot than synchronous method failures. If the event becomes business-critical, move that concern to a durable mechanism instead of pretending CDI async delivery is a queue.
Dev Services is a local convenience, not production topology. The app talks to a known Ollama endpoint, models must already exist there, and health checks should reflect that dependency honestly.
Tests
LLM output moves around, so tests that assert answer content age badly. The deterministic part, and the part worth locking down first, is the routing behavior: did the classifier run, and did the router pick the expected lane? The HTTP layer is deterministic too: did POST /assist delegate to the router?
Use @InjectMock with Mockito. ModelRouterTest replaces the classifier and both ChatModel beans so the router never calls Ollama. ForgeAssistResourceTest replaces the router so the HTTP test stays thin and boring, which is exactly what you want from a resource test.
I would not make the first tutorial test about asynchronous CDI delivery. Readers need to trust the lane decision before they care about observer scheduling.
src/test/java/dev/forgeassist/ModelRouterTest.java:
package dev.forgeassist;
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.mockito.ArgumentMatchers.any;
import static org.mockito.Mockito.clearInvocations;
import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.verifyNoInteractions;
import static org.mockito.Mockito.when;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.model.chat.ChatModel;
import dev.langchain4j.model.chat.request.ChatRequest;
import dev.langchain4j.model.chat.response.ChatResponse;
import io.quarkiverse.langchain4j.ModelName;
import io.quarkus.test.InjectMock;
import io.quarkus.test.junit.QuarkusTest;
import jakarta.inject.Inject;
@QuarkusTest
class ModelRouterTest {
@Inject
ModelRouter router;
@InjectMock
PromptClassifier classifier;
@InjectMock
@ModelName("fast")
ChatModel fastModel;
@InjectMock
ChatModel powerModel;
@BeforeEach
void stubModels() {
when(fastModel.chat(any(ChatRequest.class))).thenReturn(response("fast-lane"));
when(powerModel.chat(any(ChatRequest.class))).thenReturn(response("power-lane"));
clearInvocations(fastModel, powerModel);
}
@Test
void simplePromptUsesFastLane() {
String prompt = "What does the --no-cache flag do in forge build?";
when(classifier.classify(prompt)).thenReturn(Complexity.SIMPLE);
clearInvocations(classifier);
String answer = router.route(prompt);
assertEquals("fast-lane", answer);
verify(classifier).classify(prompt);
verify(fastModel).chat(any(ChatRequest.class));
verifyNoInteractions(powerModel);
}
@Test
void complexPromptUsesPowerLane() {
String prompt = "Why does my pipeline OOM only on cached arm64 builds?";
when(classifier.classify(prompt)).thenReturn(Complexity.COMPLEX);
clearInvocations(classifier);
String answer = router.route(prompt);
assertEquals("power-lane", answer);
verify(classifier).classify(prompt);
verify(powerModel).chat(any(ChatRequest.class));
verifyNoInteractions(fastModel);
}
private static ChatResponse response(String text) {
return ChatResponse.builder().aiMessage(AiMessage.from(text)).build();
}
}
src/test/java/dev/forgeassist/ForgeAssistResourceTest.java:
package dev.forgeassist;
import static io.restassured.RestAssured.given;
import static org.hamcrest.Matchers.is;
import static org.mockito.Mockito.when;
import org.junit.jupiter.api.Test;
import io.quarkus.test.InjectMock;
import io.quarkus.test.junit.QuarkusTest;
@QuarkusTest
class ForgeAssistResourceTest {
@InjectMock
ModelRouter router;
@Test
void assistEndpointDelegatesToRouter() {
when(router.route("What does --no-cache do?")).thenReturn("fast-lane");
given()
.contentType("text/plain")
.body("What does --no-cache do?")
.when()
.post("/assist")
.then()
.statusCode(200)
.body(is("fast-lane"));
}
}
Run tests (no running Ollama required for assertions):
./mvnw test
Expect BUILD SUCCESS and 3 tests.
Extension exercises
If you still feel adventurous, here are some homework ideas:
Three-tier routing — Add MODERATE to the enum, introduce a third Ollama model (for example qwen2.5:3b), and update the system prompt and switch expression. How does the classifier prompt need to change to distinguish MODERATE from both SIMPLE and COMPLEX?
Confidence-aware routing — Return a record ClassificationResult(Complexity complexity, int confidencePercent) instead of a bare enum. Route SIMPLE classifications with confidencePercent < 70 to the power model as a fallback.
Micrometer metrics observer — Add a second CDI observer that increments a Counter per complexity tier and expose the metrics endpoint.
Dev Services contrast — If you have Docker available, try Quarkus Dev Services for Ollama and compare startup experience with the explicit host configuration used here.
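For the confidence-aware exercise, the routing rule itself is small. Here is one possible starting point, using the ClassificationResult record and the 70 percent threshold from the exercise text. ConfidenceRouting and effectiveLane are hypothetical names, and the sketch leaves out the harder part, which is getting a small model to report a usable confidence at all.

```java
/** Hypothetical helper: applies the exercise's confidence rule to a result. */
final class ConfidenceRouting {

    /** Threshold from the exercise text: below this, SIMPLE is not trusted. */
    static final int MIN_SIMPLE_CONFIDENCE = 70;

    private ConfidenceRouting() {}

    /** Returns the lane to use after applying the route-up fallback. */
    static Complexity effectiveLane(ClassificationResult result) {
        boolean shakySimple = result.complexity() == Complexity.SIMPLE
                && result.confidencePercent() < MIN_SIMPLE_CONFIDENCE;
        return shakySimple ? Complexity.COMPLEX : result.complexity();
    }

    public static void main(String[] args) {
        System.out.println(effectiveLane(new ClassificationResult(Complexity.SIMPLE, 55))); // prints COMPLEX
        System.out.println(effectiveLane(new ClassificationResult(Complexity.SIMPLE, 90))); // prints SIMPLE
    }
}

/** Mirrors the tutorial's enum so the sketch compiles on its own. */
enum Complexity { SIMPLE, COMPLEX }

/** Record shape taken from the exercise description. */
record ClassificationResult(Complexity complexity, int confidencePercent) {}
```

The rule only ever moves prompts up. A low-confidence COMPLEX stays COMPLEX, which matches the route-up bias from the production risks.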
Close the loop
ForgeAssist classifies each prompt on a cheap local model, routes SIMPLE questions to qwen2.5:0.5b, routes COMPLEX ones to llama3.2, and logs the decision without blocking the caller. That fixes the opening problem: trivial and heavy questions no longer burn the same expensive default.



