Smart Local AI Routing in Java: Build a Hybrid LLM Gateway with Quarkus and Ollama
Use LangChain4j, semantic embeddings, and Quarkus to route prompts to the best local LLM for coding, summarization, or chat without sending data to the cloud.
There’s one truth enterprise architects know too well: no single LLM is perfect for every task. Some are fast but shallow. Others are powerful but expensive or slow. So, what do you do when your application needs flexibility, cost-efficiency, and full control?
You build your own hybrid LLM gateway.
In this hands-on tutorial, you'll learn how to create a smart, local-first gateway that semantically routes user prompts to the best model for the job. Everything runs locally, on your machine. You’ll use Quarkus for its fast startup and developer productivity, LangChain4j for seamless LLM integration, and Ollama to serve local language models like Llama 3.2, CodeLlama, and Mistral. We'll even use embeddings to build a routing layer that understands prompt intent, not just keywords.
Let’s build it.
Prerequisites
Make sure you have the following installed:
Java 17+
Maven 3.8.x+
Podman (or Docker) if you want to use Dev Services, or a local Ollama install on your machine
I suggest you pull the necessary models before you start and use a local Ollama instance to make the experience smoother.
ollama pull llama3.2:latest
ollama pull codellama:7b
ollama pull mistral:latest
ollama pull nomic-embed-text:latest
These four models will power chat, code, summarization, and semantic routing.
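You can double-check that everything downloaded correctly with ollama list, which prints the models in your local library:
ollama list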
Create Your Quarkus Project
Generate your project using the following Maven command:
mvn io.quarkus.platform:quarkus-maven-plugin:create \
    -DprojectGroupId=org.acme \
    -DprojectArtifactId=semantic-llm-router \
    -DclassName="org.acme.RoutingResource" \
    -Dpath="/route" \
    -Dextensions="quarkus-rest-jackson,quarkus-langchain4j-ollama"
cd semantic-llm-router
This creates a ready-to-run Quarkus project with REST support and Ollama integration. It's not a lot of code today, but you can absolutely grab the working example from my GitHub repository!
Configure the Ollama Models
Now define your model setup in src/main/resources/application.properties:
# Default Ollama Configuration
quarkus.langchain4j.ollama.timeout=60s
# Default chat model
quarkus.langchain4j.ollama.chat-model.model-id=llama3.2
quarkus.langchain4j.ollama.chat-model.log-responses=true
# Named model: coder
quarkus.langchain4j.ollama.coder.chat-model.model-id=codellama
quarkus.langchain4j.ollama.coder.log-responses=true
# Named model: summarizer
quarkus.langchain4j.ollama.summarizer.chat-model.model-id=mistral
quarkus.langchain4j.ollama.summarizer.log-responses=true
# Embedding model
quarkus.langchain4j.ollama.embedding-model.model-id=nomic-embed-text
Quarkus will now boot up with all models wired and available for named access.
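"Named access" means the property segment (coder, summarizer) becomes a CDI qualifier. The service in the next section uses exactly this pattern:
@Inject
OllamaChatModel defaultModel; // backed by quarkus.langchain4j.ollama.chat-model.*

@Inject
@ModelName("coder")
OllamaChatModel coderModel; // backed by quarkus.langchain4j.ollama.coder.chat-model.*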
Smart Chat Service with Semantic Routing
Next, let's build the logic that matches a user’s prompt to the best-fit model based on its meaning. The service computes an embedding for each route description at startup and uses cosine similarity to match incoming prompts. Create src/main/java/org/acme/SmartChatService.java:
package org.acme;

import java.util.Map;
import java.util.stream.Collectors;

import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.ollama.OllamaChatModel;
import io.quarkiverse.langchain4j.ModelName;
import jakarta.annotation.PostConstruct;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;

@ApplicationScoped
public class SmartChatService {

    // Model definitions with routing descriptions
    private static final Map<String, ModelDefinition> MODELS = Map.of(
            "default", new ModelDefinition(
                    "llama3.2",
                    "A route for general-purpose conversations, casual chat, and everyday questions"),
            "coder", new ModelDefinition(
                    "codellama",
                    "A route for answering questions about code, programming, software development, and technical tasks"),
            "summarizer", new ModelDefinition(
                    "mistral",
                    "A route for summarizing long texts, articles, documents, and content analysis"));

    // Model aliases for flexible naming
    private static final Map<String, String> ALIASES = Map.of(
            "chat", "default",
            "code", "coder",
            "codellama", "coder",
            "llama3.2", "default",
            "mistral", "summarizer",
            "summary", "summarizer");

    @Inject
    EmbeddingModel embeddingModel;

    @Inject
    OllamaChatModel defaultModel;

    @Inject
    @ModelName("coder")
    OllamaChatModel coderModel;

    @Inject
    @ModelName("summarizer")
    OllamaChatModel summarizerModel;

    // Pre-computed embeddings for routing
    private Map<String, Embedding> modelEmbeddings;

    @PostConstruct
    void initialize() {
        // Pre-compute embeddings for each model's description
        modelEmbeddings = MODELS.entrySet().stream()
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        entry -> embeddingModel.embed(entry.getValue().description()).content()));
    }

    /**
     * Chat with automatic model selection based on prompt content
     */
    public String smartChat(String prompt) {
        if (prompt == null || prompt.trim().isEmpty()) {
            throw new IllegalArgumentException("Prompt cannot be empty");
        }
        String selectedModel = selectModelForPrompt(prompt);
        return executeChat(selectedModel, prompt);
    }

    /**
     * Chat with a specific model
     */
    public String chat(String modelName, String prompt) {
        if (prompt == null || prompt.trim().isEmpty()) {
            throw new IllegalArgumentException("Prompt cannot be empty");
        }
        String resolvedModel = resolveModelName(modelName);
        return executeChat(resolvedModel, prompt);
    }

    /**
     * Get detailed chat result with model selection information
     */
    public ChatResult chatWithDetails(String prompt) {
        if (prompt == null || prompt.trim().isEmpty()) {
            throw new IllegalArgumentException("Prompt cannot be empty");
        }
        String selectedModel = selectModelForPrompt(prompt);
        ModelDefinition modelDef = MODELS.get(selectedModel);
        String response = executeChat(selectedModel, prompt);
        return new ChatResult(prompt, selectedModel, modelDef, response);
    }

    /**
     * Get the model that would be selected for a prompt (without chatting)
     */
    public String getSelectedModel(String prompt) {
        return selectModelForPrompt(prompt);
    }

    /**
     * Semantic routing: select best model based on prompt content
     */
    private String selectModelForPrompt(String prompt) {
        Embedding promptEmbedding = embeddingModel.embed(prompt).content();
        return modelEmbeddings.entrySet().stream()
                .map(entry -> Map.entry(
                        entry.getKey(),
                        cosineSimilarity(promptEmbedding.vector(), entry.getValue().vector())))
                .max(Map.Entry.comparingByValue())
                .map(entry -> {
                    String modelName = entry.getKey();
                    double score = entry.getValue();
                    System.out.printf("Prompt routed to '%s' with score: %.4f%n", modelName, score);
                    return modelName;
                })
                .orElse("default");
    }

    /**
     * Execute chat with resolved model
     */
    private String executeChat(String modelName, String prompt) {
        try {
            OllamaChatModel model = getModelInstance(modelName);
            return model.chat(prompt);
        } catch (Exception e) {
            System.err.println("Error chatting with model " + modelName + ": " + e.getMessage());
            throw new RuntimeException("Failed to get response from model: " + modelName, e);
        }
    }

    /**
     * Get the actual model instance for a resolved model name
     */
    private OllamaChatModel getModelInstance(String modelName) {
        return switch (modelName) {
            case "coder" -> coderModel;
            case "summarizer" -> summarizerModel;
            case "default" -> defaultModel;
            default -> defaultModel;
        };
    }

    /**
     * Resolve model name including aliases to canonical name
     */
    private String resolveModelName(String modelName) {
        if (modelName == null || modelName.trim().isEmpty()) {
            return "default";
        }
        String lowerName = modelName.toLowerCase().trim();
        // Check direct model name
        if (MODELS.containsKey(lowerName)) {
            return lowerName;
        }
        // Check aliases
        String aliasTarget = ALIASES.get(lowerName);
        if (aliasTarget != null) {
            return aliasTarget;
        }
        System.out.println("Unknown model: " + modelName + ". Using default.");
        return "default";
    }

    /**
     * Calculate cosine similarity between two vectors
     */
    private double cosineSimilarity(float[] vectorA, float[] vectorB) {
        double dotProduct = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < vectorA.length; i++) {
            dotProduct += vectorA[i] * vectorB[i];
            normA += vectorA[i] * vectorA[i];
            normB += vectorB[i] * vectorB[i];
        }
        return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Public API methods for information
    public String[] getAvailableModels() {
        return MODELS.keySet().toArray(new String[0]);
    }

    public String[] getAllSupportedNames() {
        var allNames = new java.util.HashSet<String>();
        allNames.addAll(MODELS.keySet());
        allNames.addAll(ALIASES.keySet());
        return allNames.toArray(new String[0]);
    }

    public ModelDefinition getModelInfo(String modelName) {
        String resolved = resolveModelName(modelName);
        return MODELS.get(resolved);
    }

    public boolean isValidModel(String modelName) {
        String resolved = resolveModelName(modelName);
        return MODELS.containsKey(resolved);
    }

    // Data classes
    public record ModelDefinition(String actualModelId, String description) {
    }

    public record ChatResult(
            String prompt,
            String selectedModel,
            ModelDefinition modelInfo,
            String response) {

        public String getFormattedResult() {
            return String.format(
                    "Model: %s (%s)\nDescription: %s\n\nResponse: %s",
                    selectedModel,
                    modelInfo.actualModelId(),
                    modelInfo.description(),
                    response);
        }
    }
}
Because the route embeddings are computed once in @PostConstruct, each routing decision costs a single embedding call for the incoming prompt plus a handful of cosine similarity comparisons.
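If you want a quick sanity check of the routing, here's a minimal test sketch. It assumes the quarkus-junit5 dependency is on your classpath and that Ollama is running with the models pulled; since routing depends on the embedding model's output, treat the exact assertions as illustrative rather than guaranteed:
package org.acme;

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

import io.quarkus.test.junit.QuarkusTest;
import jakarta.inject.Inject;

@QuarkusTest
class SmartChatServiceTest {

    @Inject
    SmartChatService chatService;

    @Test
    void codingPromptRoutesToCoder() {
        // getSelectedModel only embeds the prompt; no chat model is invoked.
        assertEquals("coder",
                chatService.getSelectedModel("Write a Java method that reverses a string"));
    }

    @Test
    void summaryPromptRoutesToSummarizer() {
        assertEquals("summarizer",
                chatService.getSelectedModel("Summarize the main points of this article"));
    }
}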
The REST API Endpoint
Let’s expose this whole gateway via a REST endpoint. Create src/main/java/org/acme/RoutingResource.java:
package org.acme;

import jakarta.inject.Inject;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.QueryParam;
import jakarta.ws.rs.core.MediaType;

@Path("/chat")
public class RoutingResource {

    @Inject
    SmartChatService chatService;

    @GET
    @Path("/models")
    @Produces(MediaType.APPLICATION_JSON)
    public String[] getAvailableModels() {
        return chatService.getAvailableModels();
    }

    @GET
    @Path("/models/all")
    @Produces(MediaType.APPLICATION_JSON)
    public String[] getAllSupportedNames() {
        return chatService.getAllSupportedNames();
    }

    @GET
    @Path("/models/info")
    @Produces(MediaType.TEXT_PLAIN)
    public String getModelsInfo() {
        StringBuilder info = new StringBuilder("Available Models:\n\n");
        for (String modelName : chatService.getAvailableModels()) {
            var modelInfo = chatService.getModelInfo(modelName);
            if (modelInfo != null) {
                info.append(String.format("Model: %s\n", modelName));
                info.append(String.format("  Actual ID: %s\n", modelInfo.actualModelId()));
                info.append(String.format("  Description: %s\n\n", modelInfo.description()));
            }
        }
        return info.toString();
    }

    @GET
    @Produces(MediaType.TEXT_PLAIN)
    public String chat(@QueryParam("prompt") String prompt, @QueryParam("model") String model) {
        if (prompt == null || prompt.trim().isEmpty()) {
            return "Please provide a prompt parameter";
        }
        return chatService.chat(model, prompt);
    }

    @GET
    @Path("/smart")
    @Produces(MediaType.TEXT_PLAIN)
    public String smartChat(@QueryParam("prompt") String prompt) {
        if (prompt == null || prompt.trim().isEmpty()) {
            return "Please provide a prompt parameter";
        }
        return chatService.smartChat(prompt);
    }

    @GET
    @Path("/smart/details")
    @Produces(MediaType.TEXT_PLAIN)
    public String smartChatWithDetails(@QueryParam("prompt") String prompt) {
        if (prompt == null || prompt.trim().isEmpty()) {
            return "Please provide a prompt parameter";
        }
        var result = chatService.chatWithDetails(prompt);
        return result.getFormattedResult();
    }

    @GET
    @Path("/route")
    @Produces(MediaType.TEXT_PLAIN)
    public String routeOnly(@QueryParam("prompt") String prompt) {
        if (prompt == null || prompt.trim().isEmpty()) {
            return "Please provide a prompt parameter";
        }
        String selectedModel = chatService.getSelectedModel(prompt);
        var modelInfo = chatService.getModelInfo(selectedModel);
        return String.format("Prompt: %s\nSelected model: %s (%s)\nDescription: %s",
                prompt, selectedModel, modelInfo.actualModelId(), modelInfo.description());
    }
}
Now every incoming prompt goes through semantic routing and gets a model-generated reply.
Run and Test
Start Ollama (if you’re not using Dev Services):
ollama serve
Then launch your app:
./mvnw quarkus:dev
Here’s an overview of the endpoints you can test with curl:
# Core functionality
GET /chat?prompt=...&model=... # Chat with specific model
GET /chat/smart?prompt=... # Auto-select best model
GET /chat/smart/details?prompt=... # Detailed response with model info
# Model information
GET /chat/models # Available models
GET /chat/models/all # All supported names + aliases
GET /chat/models/info # Detailed model information
GET /chat/route?prompt=... # Show which model would be selected
Each response should be accompanied by a terminal log showing which model was selected and the similarity score.
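For a coding prompt, for instance, you should see a log line like the following in the Quarkus console (the exact score will vary with the embedding model):
Prompt routed to 'coder' with score: 0.6342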
Model Information Endpoints
1. Get Available Models (canonical names)
curl "http://localhost:8080/chat/models"
2. Get All Supported Names (including aliases)
curl "http://localhost:8080/chat/models/all"
3. Get Detailed Model Information
curl "http://localhost:8080/chat/models/info"
Chat Endpoints
The following commands will help you test all the functionality of your semantic LLM router!
4. Chat with Specific Model
# Using default model
curl "http://localhost:8080/chat?prompt=Hello%20there"
# Using coder model
curl "http://localhost:8080/chat?prompt=Write%20a%20Python%20function%20to%20calculate%20factorial&model=coder"
# Using summarizer model
curl "http://localhost:8080/chat?prompt=Summarize%20this%20text%20about%20AI&model=summarizer"
# Using aliases
curl "http://localhost:8080/chat?prompt=Fix%20this%20code&model=code"
curl "http://localhost:8080/chat?prompt=What%20is%20machine%20learning&model=chat"
5. Smart Chat (automatic model selection)
# Should route to coder model
curl "http://localhost:8080/chat/smart?prompt=How%20do%20I%20write%20a%20REST%20API%20in%20Java"
# Should route to summarizer model
curl "http://localhost:8080/chat/smart?prompt=Please%20summarize%20the%20main%20points%20of%20this%20document"
# Should route to default model
curl "http://localhost:8080/chat/smart?prompt=How%20are%20you%20today"
6. Smart Chat with Details
curl "http://localhost:8080/chat/smart/details?prompt=Explain%20machine%20learning%20algorithms"
7. Route Only (show selected model without chatting)
curl "http://localhost:8080/chat/route?prompt=Debug%20my%20JavaScript%20code"
curl "http://localhost:8080/chat/route?prompt=Summarize%20this%20research%20paper"
curl "http://localhost:8080/chat/route?prompt=What%20is%20the%20weather%20like"
Conclusion
You’ve just built a powerful hybrid LLM gateway using only local models and Java. This setup gives you:
Full privacy (your data never leaves your laptop)
Flexibility (use the right model for the task)
Extensibility (add more routes and models easily; see the sketch below)
Enterprise-readiness (CDI, REST, and native image support)
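As a sketch of that extensibility: a hypothetical "translator" route would need just one new named-model property, one MODELS entry with a route description, an injection point qualified with @ModelName("translator"), and one extra case in getModelInstance(). None of the routing logic itself changes:
# Hypothetical named model: translator (reusing the already-pulled mistral)
quarkus.langchain4j.ollama.translator.chat-model.model-id=mistral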
In a world of cloud-bound, opaque LLM APIs, this approach puts control and clarity back into the hands of Java developers. And with LangChain4j and Quarkus doing the heavy lifting, you can focus on building smart apps, not plumbing infrastructure.