Building a Measurable RAG System with Quarkus Easy RAG, LangChain4j, and Ollama
How Java developers can build, evaluate, and trust Retrieval-Augmented Generation using Quarkus, LangChain4j, and local LLMs
Retrieval-Augmented Generation works until it quietly stops working.
Answers still sound confident. Latency looks fine. Nothing crashes.
And yet, the system slowly drifts into hallucinations, partial answers, or irrelevant context.
This tutorial is about preventing that failure mode.
We will build a complete, local RAG system using Quarkus, LangChain4j, and Ollama, and then instrument it with evaluation metrics so you can measure whether your RAG pipeline is still behaving correctly.
Not a simple RAG demo. A system you can reason about.
Why RAG Fails Quietly in Production
Most developers think RAG is about retrieval quality.
Get better embeddings. Tune chunk size. Increase topK. Done.
That mental model fails in real systems.
What actually breaks in production is alignment between retrieval, generation, and grounding. The retriever returns relevant text, but the model ignores half of it. Or it combines multiple documents into a fluent but incorrect answer. Or it answers the wrong question convincingly.
Without evaluation, you only find this out when users complain.
This tutorial assumes a different stance:
RAG is a pipeline, not a feature
Every pipeline needs observability
LLM output must be scored, not trusted
We will build evaluation into the request path itself.
What We’ll Build
RAG System using Quarkus Easy RAG (auto-ingestion of documents)
Evaluation Pipeline with RAGAS, BLEU, and Semantic Similarity metrics
Complete Working Example with a product documentation chatbot
Here is the application flow at a glance:
Documents → Easy RAG ingestion → embedding store
Question → /chat → context retrieval → answer generation (Ollama) → evaluation (RAGAS, BLEU, semantic similarity) → scored JSON response
Prerequisites
Java 17 or higher
Maven 3.8+
Docker (for Ollama)
10GB disk space for models
We deliberately run everything locally. If you cannot reason about RAG locally, you cannot reason about it in Kubernetes.
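If Ollama is not already running, a minimal Docker setup looks like the following (a sketch; the container name, volume, and default port 11434 are conventions you can adapt):
# Run Ollama in a container, exposing its default API port
docker run -d --name ollama -p 11434:11434 -v ollama:/root/.ollama ollama/ollama
# Pull the chat and embedding models used in this tutorial
docker exec ollama ollama pull llama3.2
docker exec ollama ollama pull nomic-embed-text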
Create Quarkus Project with Easy RAG
mvn io.quarkus:quarkus-maven-plugin:create \
-DprojectGroupId=com.example \
-DprojectArtifactId=rag-chatbot-eval \
-Dextensions="quarkus-langchain4j-easy-rag,quarkus-langchain4j-ollama,rest-jackson" \
-DclassName="com.example.ChatbotResource" \
-Dpath="/chat"
cd rag-chatbot-eval

Add Additional Dependencies
Add to your pom.xml:
<!-- For metrics calculation -->
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-text</artifactId>
<version>1.15.0</version>
</dependency>

Create Sample Documents
Create a directory structure:
mkdir -p src/main/resources/documents

Create src/main/resources/documents/banking-products.txt:
Standard Savings Account
Our Standard Savings Account offers competitive interest rates with no monthly fees.
Key features include:
- 2.5% annual interest rate
- No minimum balance requirement
- Free online and mobile banking
- FDIC insured up to $250,000
- Easy access to your funds
- Monthly account statements
This account is perfect for customers who want to start saving with flexibility
and no commitments.

Create src/main/resources/documents/premium-account.txt:
Premium Checking Account
The Premium Checking Account is designed for customers who value premium services.
Features include:
- No ATM fees worldwide
- Free checks and money orders
- Dedicated customer service line
- Travel insurance coverage
- Purchase protection
- 0.5% cashback on debit card purchases
- Monthly fee: $15 (waived with $5,000 minimum balance)
Ideal for frequent travelers and customers who want enhanced banking benefits.

Create src/main/resources/documents/loan-products.txt:
Personal Loan Options
We offer flexible personal loans for various needs:
Home Improvement Loans:
- Rates starting at 5.99% APR
- Loan amounts: $5,000 to $100,000
- Terms: 1 to 7 years
- No prepayment penalties
Auto Loans:
- Rates as low as 3.99% APR for new cars
- Up to 100% financing available
- Terms up to 72 months
- Pre-approval in minutes
All loans subject to credit approval. Visit a branch or apply online for more information.

Configure Application
Create/update src/main/resources/application.properties:
# Easy RAG Configuration
quarkus.langchain4j.easy-rag.path=src/main/resources/documents
quarkus.langchain4j.easy-rag.path-matcher=glob:**.txt
quarkus.langchain4j.easy-rag.max-segment-size=400
quarkus.langchain4j.easy-rag.max-overlap-size=50
quarkus.langchain4j.easy-rag.max-results=3
quarkus.langchain4j.easy-rag.recursive=true
# Reuse embeddings for faster dev mode restarts
quarkus.langchain4j.easy-rag.reuse-embeddings.enabled=true
quarkus.langchain4j.easy-rag.reuse-embeddings.file=banking-embeddings.json
# Ollama Chat Model Configuration
quarkus.langchain4j.ollama.timeout=60s
quarkus.langchain4j.ollama.chat-model.model-id=llama3.2
quarkus.langchain4j.ollama.chat-model.temperature=0.5
quarkus.langchain4j.ollama.chat-model.timeout=60s
# Ollama Embedding Model Configuration
quarkus.langchain4j.ollama.embedding-model.model-id=nomic-embed-text

Create the RAG-Powered Chatbot
Create AI Service Interface
package com.example.service;
import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;
import jakarta.enterprise.context.ApplicationScoped;
@ApplicationScoped
@RegisterAiService // no need to declare a retrieval augmentor here; it is automatically generated and discovered
public interface BankingChatbot {
@SystemMessage("""
You are a helpful banking assistant. Answer questions about our banking products
using the information provided in the context. If you don't know the answer,
say so politely. Be concise but informative.
""")
String chat(@UserMessage String userMessage);
}

Create Domain Models
Create the following files:
// src/main/java/com/example/model/ChatRequest.java
package com.example.model;
public class ChatRequest {
public String question;
public String referenceAnswer; // Optional, for evaluation
public ChatRequest() {
}
public ChatRequest(String question) {
this.question = question;
}
}
// src/main/java/com/example/model/ChatResponse.java
package com.example.model;
import java.util.List;
public class ChatResponse {
public String question;
public String answer;
public List<String> retrievedContexts;
public EvaluationMetrics metrics;
public ChatResponse() {}
}
// src/main/java/com/example/model/EvaluationMetrics.java
package com.example.model;
public class EvaluationMetrics {
public double contextRelevance;
public double faithfulness;
public double answerRelevance;
public double bleuScore;
public double semanticSimilarity;
public double overallScore;
public EvaluationMetrics() {}
@Override
public String toString() {
return String.format(
"Metrics[context=%.2f, faithfulness=%.2f, relevance=%.2f, bleu=%.2f, semantic=%.2f, overall=%.2f]",
contextRelevance, faithfulness, answerRelevance, bleuScore, semanticSimilarity, overallScore
);
}
}

Implement Evaluation Metrics
You can’t improve what you don’t measure. Metrics are the feedback loop that makes iteration possible, and in AI-infused systems especially, you need to run evaluations continuously to monitor application behavior. RAGAS broadly groups its metrics into retrieval-quality metrics (such as context relevance) and generation-quality metrics (such as faithfulness and answer relevance).
For this tutorial, we implement just a few examples from the complete list so you get a feel for the metric categories and their implementations. The RAGAS documentation covers the full set and is worth reading.
BLEU Score Calculator
The BLEU score evaluates the quality of a response by comparing it with a reference, measuring their similarity based on n-gram precision and a brevity penalty. BLEU was originally designed to evaluate machine translation systems, but it is also used in other natural language processing tasks. Scores range from 0 to 1, where 1 indicates a perfect match between response and reference. We use a simplified version below.
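Concretely, the calculator below implements the standard formulation:
BLEU = BP * exp( Σ (n=1..4) w_n * log p_n )
BP = 1 if c > r, otherwise exp(1 - r / c)
where p_n is the modified n-gram precision, each weight w_n is 0.25, c is the candidate (response) length, and r is the reference length in tokens. Note that the geometric mean collapses to 0 if any p_n is 0, so very short answers that share no 4-gram with the reference score 0.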
Create: src/main/java/com/example/metrics/BleuCalculator.java
package com.example.metrics;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import jakarta.enterprise.context.ApplicationScoped;
@ApplicationScoped
public class BleuCalculator {
/**
* Calculate BLEU score with n-gram precision up to 4-grams
*/
public double calculateBleu(String candidate, String reference) {
if (candidate == null || reference == null ||
candidate.trim().isEmpty() || reference.trim().isEmpty()) {
return 0.0;
}
List<String> candidateTokens = tokenize(candidate);
List<String> referenceTokens = tokenize(reference);
if (candidateTokens.isEmpty() || referenceTokens.isEmpty()) {
return 0.0;
}
// Calculate precision for 1-gram to 4-gram
double[] precisions = new double[4];
double[] weights = { 0.25, 0.25, 0.25, 0.25 };
for (int n = 1; n <= 4; n++) {
precisions[n - 1] = calculateNGramPrecision(candidateTokens, referenceTokens, n);
}
// Geometric mean of precisions
double logSum = 0.0;
for (int i = 0; i < 4; i++) {
if (precisions[i] > 0) {
logSum += weights[i] * Math.log(precisions[i]);
} else {
return 0.0; // If any precision is 0, BLEU is 0
}
}
double geometricMean = Math.exp(logSum);
// Brevity penalty
double brevityPenalty = calculateBrevityPenalty(
candidateTokens.size(),
referenceTokens.size());
return brevityPenalty * geometricMean;
}
private double calculateNGramPrecision(List<String> candidate,
List<String> reference,
int n) {
Map<String, Integer> candidateNGrams = getNGrams(candidate, n);
Map<String, Integer> referenceNGrams = getNGrams(reference, n);
int matchCount = 0;
int totalCount = 0;
for (Map.Entry<String, Integer> entry : candidateNGrams.entrySet()) {
String ngram = entry.getKey();
int candidateCount = entry.getValue();
int referenceCount = referenceNGrams.getOrDefault(ngram, 0);
matchCount += Math.min(candidateCount, referenceCount);
totalCount += candidateCount;
}
return totalCount > 0 ? (double) matchCount / totalCount : 0.0;
}
private Map<String, Integer> getNGrams(List<String> tokens, int n) {
Map<String, Integer> ngrams = new HashMap<>();
for (int i = 0; i <= tokens.size() - n; i++) {
String ngram = String.join(" ", tokens.subList(i, i + n));
ngrams.put(ngram, ngrams.getOrDefault(ngram, 0) + 1);
}
return ngrams;
}
private double calculateBrevityPenalty(int candidateLength, int referenceLength) {
if (candidateLength > referenceLength) {
return 1.0;
}
return Math.exp(1.0 - (double) referenceLength / candidateLength);
}
private List<String> tokenize(String text) {
return Arrays.stream(text.toLowerCase()
.replaceAll("[^a-z0-9\\s]", "")
.split("\\s+"))
.filter(s -> !s.isEmpty())
.collect(Collectors.toList());
}
}

RAGAS Evaluator
Ragas provides a set of evaluation metrics for objectively measuring the performance of your LLM application, covering different tasks such as RAG and agentic workflows. For this demo we use only a small subset, in a very lightweight LLM-as-judge version that produces a single score for each of:
Context relevance
Faithfulness
Answer relevance
Create: src/main/java/com/example/metrics/RagasEvaluator.java
package com.example.metrics;
import java.util.List;
import dev.langchain4j.model.chat.ChatModel;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
@ApplicationScoped
public class RagasEvaluator {
@Inject
ChatModel chatModel;
/**
* Context Relevance: How relevant are the retrieved documents to the question?
*/
public double evaluateContextRelevance(String question, List<String> contexts) {
if (contexts == null || contexts.isEmpty()) {
return 0.0;
}
StringBuilder contextText = new StringBuilder();
for (int i = 0; i < contexts.size(); i++) {
contextText.append("Context ").append(i + 1).append(": ")
.append(contexts.get(i)).append("\n\n");
}
String prompt = String.format("""
Question: %s
Retrieved Contexts:
%s
Task: Rate how relevant the provided contexts are to answering the question.
Consider if the contexts contain information needed to answer the question.
Respond with ONLY a number between 0.0 and 1.0, where:
- 0.0 = completely irrelevant
- 1.0 = perfectly relevant
Score:""", question, contextText.toString());
return parseScore(chatModel.chat(prompt));
}
/**
* Faithfulness: Is the answer grounded in the provided context?
*/
public double evaluateFaithfulness(String answer, List<String> contexts) {
if (answer == null || answer.trim().isEmpty() || contexts == null || contexts.isEmpty()) {
return 0.0;
}
StringBuilder contextText = new StringBuilder();
for (String context : contexts) {
contextText.append(context).append("\n\n");
}
String prompt = String.format("""
Retrieved Context:
%s
Generated Answer: %s
Task: Evaluate if the answer is faithful to the context.
Check if all claims in the answer can be verified from the context.
An answer is faithful if it doesn't add information not present in the context.
Respond with ONLY a number between 0.0 and 1.0, where:
- 0.0 = completely unfaithful (hallucinated content)
- 1.0 = perfectly faithful (all claims supported by context)
Score:""", contextText.toString(), answer);
return parseScore(chatModel.chat(prompt));
}
/**
* Answer Relevance: Does the answer actually address the question?
*/
public double evaluateAnswerRelevance(String question, String answer) {
if (question == null || answer == null ||
question.trim().isEmpty() || answer.trim().isEmpty()) {
return 0.0;
}
String prompt = String.format("""
Question: %s
Generated Answer: %s
Task: Rate how well the answer addresses the question.
Consider if the answer is on-topic and provides useful information.
Respond with ONLY a number between 0.0 and 1.0, where:
- 0.0 = completely irrelevant to the question
- 1.0 = perfectly relevant and complete answer
Score:""", question, answer);
return parseScore(chatModel.chat(prompt));
}
private double parseScore(String response) {
try {
// Extract number from response
String cleaned = response.trim()
.replaceAll("[^0-9.]", "")
.trim();
if (cleaned.isEmpty()) {
System.err.println("No numeric value found in response: " + response);
return 0.5;
}
double score = Double.parseDouble(cleaned);
return Math.max(0.0, Math.min(1.0, score)); // Clamp between 0 and 1
} catch (NumberFormatException e) {
System.err.println("Failed to parse score from: " + response);
return 0.5; // Default middle score if parsing fails
}
}
}

Semantic Similarity Calculator
The Semantic Similarity metric evaluates the semantic resemblance between a generated response and a reference (ground truth) answer. It ranges from 0 to 1, with higher scores indicating better alignment between the generated answer and the ground truth.
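Under the hood, we embed both texts and compute the cosine similarity of the resulting vectors:
cos(A, B) = (A · B) / (||A|| * ||B||)
where A and B are the embedding vectors of the response and the reference. Cosine similarity can in principle range from -1 to 1, but with typical sentence embeddings on natural text it lands in the positive range.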
Create: src/main/java/com/example/metrics/SemanticSimilarityCalculator.java
package com.example.metrics;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.model.embedding.EmbeddingModel;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
@ApplicationScoped
public class SemanticSimilarityCalculator {
@Inject
EmbeddingModel embeddingModel;
/**
* Calculate cosine similarity between two texts using embeddings
*/
public double calculateSimilarity(String text1, String text2) {
if (text1 == null || text2 == null ||
text1.trim().isEmpty() || text2.trim().isEmpty()) {
return 0.0;
}
try {
Embedding embedding1 = embeddingModel.embed(text1).content();
Embedding embedding2 = embeddingModel.embed(text2).content();
return cosineSimilarity(embedding1.vector(), embedding2.vector());
} catch (Exception e) {
System.err.println("Error calculating semantic similarity: " + e.getMessage());
return 0.0;
}
}
private double cosineSimilarity(float[] vectorA, float[] vectorB) {
if (vectorA.length != vectorB.length) {
throw new IllegalArgumentException("Vectors must have same dimensions");
}
double dotProduct = 0.0;
double normA = 0.0;
double normB = 0.0;
for (int i = 0; i < vectorA.length; i++) {
dotProduct += vectorA[i] * vectorB[i];
normA += Math.pow(vectorA[i], 2);
normB += Math.pow(vectorB[i], 2);
}
if (normA == 0.0 || normB == 0.0) {
return 0.0;
}
return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
}

Create Evaluation Service with Context Extraction
Context Extraction Utility
Easy RAG wires its retrieval augmentor into the AI service automatically, so the contexts it injects into the prompt are not directly exposed to us. To score them, we re-run the same embedding search against the store; keep the parameters here aligned with the Easy RAG configuration so the extracted segments match what the model actually saw.
Create: src/main/java/com/example/service/ContextExtractor.java
package com.example.service;
import java.util.List;
import java.util.stream.Collectors;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingMatch;
import dev.langchain4j.store.embedding.EmbeddingSearchRequest;
import dev.langchain4j.store.embedding.EmbeddingSearchResult;
import dev.langchain4j.store.embedding.EmbeddingStore;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
@ApplicationScoped
public class ContextExtractor {
@Inject
EmbeddingStore<TextSegment> embeddingStore;
@Inject
EmbeddingModel embeddingModel;
/**
* Extract relevant contexts for a given question
*/
public List<String> extractContexts(String question, int maxResults) {
// Generate embedding for the question
Embedding questionEmbedding = embeddingModel.embed(question).content();
// Search for relevant segments
EmbeddingSearchRequest searchRequest = EmbeddingSearchRequest.builder()
.queryEmbedding(questionEmbedding)
.maxResults(maxResults)
.minScore(0.5) // Optional: filter by relevance score
.build();
EmbeddingSearchResult<TextSegment> searchResult = embeddingStore.search(searchRequest);
// Extract text from matched segments
return searchResult.matches().stream()
.map(EmbeddingMatch::embedded)
.map(TextSegment::text)
.collect(Collectors.toList());
}
}

Complete Evaluation Service
Create: src/main/java/com/example/service/RagEvaluationService.java
package com.example.service;
import java.util.List;
import org.jboss.logging.Logger;
import com.example.metrics.BleuCalculator;
import com.example.metrics.RagasEvaluator;
import com.example.metrics.SemanticSimilarityCalculator;
import com.example.model.ChatRequest;
import com.example.model.ChatResponse;
import com.example.model.EvaluationMetrics;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
@ApplicationScoped
public class RagEvaluationService {
private static final Logger LOG = Logger.getLogger(RagEvaluationService.class);
@Inject
BankingChatbot chatbot;
@Inject
ContextExtractor contextExtractor;
@Inject
RagasEvaluator ragasEvaluator;
@Inject
BleuCalculator bleuCalculator;
@Inject
SemanticSimilarityCalculator semanticCalculator;
/**
* Process a chat request and evaluate the response
*/
public ChatResponse processAndEvaluate(ChatRequest request) {
LOG.infof("Processing question: %s", request.question);
// Step 1: Extract relevant contexts
List<String> contexts = contextExtractor.extractContexts(request.question, 3);
LOG.infof("Retrieved %d context segments", contexts.size());
// Step 2: Generate answer using RAG
String answer = chatbot.chat(request.question);
LOG.infof("Generated answer: %s", answer);
// Step 3: Evaluate the response
EvaluationMetrics metrics = evaluateResponse(
request.question,
answer,
contexts,
request.referenceAnswer);
// Step 4: Build response
ChatResponse response = new ChatResponse();
response.question = request.question;
response.answer = answer;
response.retrievedContexts = contexts;
response.metrics = metrics;
LOG.infof("Evaluation complete: %s", metrics);
return response;
}
/**
* Evaluate a RAG response using multiple metrics
*/
private EvaluationMetrics evaluateResponse(String question,
String answer,
List<String> contexts,
String referenceAnswer) {
EvaluationMetrics metrics = new EvaluationMetrics();
// RAGAS Metrics (always computed)
try {
metrics.contextRelevance = ragasEvaluator.evaluateContextRelevance(question, contexts);
metrics.faithfulness = ragasEvaluator.evaluateFaithfulness(answer, contexts);
metrics.answerRelevance = ragasEvaluator.evaluateAnswerRelevance(question, answer);
} catch (Exception e) {
LOG.errorf("Error computing RAGAS metrics: %s", e.getMessage());
}
// Reference-based metrics (only if reference answer provided)
if (referenceAnswer != null && !referenceAnswer.trim().isEmpty()) {
try {
metrics.bleuScore = bleuCalculator.calculateBleu(answer, referenceAnswer);
metrics.semanticSimilarity = semanticCalculator.calculateSimilarity(answer, referenceAnswer);
} catch (Exception e) {
LOG.errorf("Error computing reference-based metrics: %s", e.getMessage());
}
}
// Calculate overall score
metrics.overallScore = calculateOverallScore(metrics);
return metrics;
}
private double calculateOverallScore(EvaluationMetrics metrics) {
double sum = metrics.contextRelevance + metrics.faithfulness + metrics.answerRelevance;
int count = 3;
if (metrics.bleuScore > 0) {
sum += metrics.bleuScore;
count++;
}
if (metrics.semanticSimilarity > 0) {
sum += metrics.semanticSimilarity;
count++;
}
return count > 0 ? sum / count : 0.0;
}
}

The REST API
Create: src/main/java/com/example/ChatbotResource.java
package com.example;
import org.jboss.logging.Logger;
import com.example.model.ChatRequest;
import com.example.model.ChatResponse;
import com.example.service.RagEvaluationService;
import jakarta.inject.Inject;
import jakarta.ws.rs.Consumes;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.POST;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.MediaType;
@Path("/chat")
@Produces(MediaType.APPLICATION_JSON)
@Consumes(MediaType.APPLICATION_JSON)
public class ChatbotResource {
private static final Logger LOG = Logger.getLogger(ChatbotResource.class);
@Inject
RagEvaluationService evaluationService;
@POST
public ChatResponse chat(ChatRequest request) {
LOG.infof("Received chat request: %s", request.question);
return evaluationService.processAndEvaluate(request);
}
@GET
@Path("/health")
@Produces(MediaType.TEXT_PLAIN)
public String health() {
return "Banking Chatbot with Evaluation is running!";
}
}

Testing the Complete System
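Before the manual curl checks, note that you can also encode a quality floor as a repeatable integration test. Here is a minimal sketch, assuming quarkus-junit5 and rest-assured are on the classpath and Ollama is running; the 0.5 threshold is an illustrative assumption:
package com.example;
import static io.restassured.RestAssured.given;
import static org.hamcrest.Matchers.greaterThan;
import org.junit.jupiter.api.Test;
import io.quarkus.test.junit.QuarkusTest;
@QuarkusTest
class ChatbotResourceTest {
    @Test
    void answerShouldStayGroundedInContext() {
        given()
            .contentType("application/json")
            .body("{\"question\": \"What are the benefits of a standard savings account?\"}")
        .when()
            .post("/chat")
        .then()
            .statusCode(200)
            // Fail if faithfulness drops below an illustrative floor
            .body("metrics.faithfulness", greaterThan(0.5f));
    }
}
Because this exercises the live models, treat it as a slow integration check rather than a unit test.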
Start the Application
# Start Quarkus in dev mode
mvn quarkus:dev

Test with Simple Question
Create test-simple.json:
{
"question": "What are the benefits of a standard savings account?"
}

Test:
curl -X POST http://localhost:8080/chat \
-H "Content-Type: application/json" \
-d @test-simple.json | jq

Expected response:
{
"question": "What are the benefits of a standard savings account?",
"answer": "Our Standard Savings Account offers several benefits, including:\n\n* Competitive interest rate of 2.5% annual interest rate\n* No monthly fees\n* Free online and mobile banking\n* FDIC insurance up to $250,000\n* Easy access to your funds\n* Monthly account statements\n\nThis account is perfect for customers who want to start saving with flexibility and no commitments.",
"retrievedContexts": [
"[...]"
],
"metrics": {
"contextRelevance": 0.5,
"faithfulness": 0.7,
"answerRelevance": 0.5,
"bleuScore": 0.0,
"semanticSimilarity": 0.0,
"overallScore": 0.5666666666666667
}
}

Test with Reference Answer (Full Evaluation)
Create test-with-reference.json:
{
"question": "What is the interest rate for the standard savings account?",
"referenceAnswer": "The standard savings account offers a 2.5% annual interest rate."
}

Test:
curl -X POST http://localhost:8080/chat \
-H "Content-Type: application/json" \
-d @test-with-reference.json | jq

Expected response with all metrics:
{
"question": "What is the interest rate for the standard savings account?",
"answer": "The interest rate for our Standard Savings Account is 2.5% annual interest rate, with no minimum balance requirement and no monthly fees.",
"retrievedContexts": [
"[...]"
],
"metrics": {
"contextRelevance": 0.5,
"faithfulness": 0.8,
"answerRelevance": 0.8,
"bleuScore": 0.161692143534558,
"semanticSimilarity": 0.7843242341979497,
"overallScore": 0.6092032755465016
}
}

Test with Multiple Questions
Create test-batch.sh:
#!/bin/bash
questions=(
"What are the fees for the premium checking account?"
"What types of personal loans do you offer?"
"What is the APR for auto loans?"
"Can I get travel insurance with my account?"
)
for q in "${questions[@]}"; do
echo "Testing: $q"
curl -X POST http://localhost:8080/chat \
-H "Content-Type: application/json" \
-d "{\"question\": \"$q\"}" | jq '.metrics'
echo "---"
sleep 2
done

Run:
chmod +x test-batch.sh
./test-batch.sh

Understanding the Metrics
High relevance + low faithfulness means the model ignored the context.
High faithfulness + low relevance means retrieval failed.
Low everything means your documents are wrong.
Interpreting Scores
Context Relevance (0.0 - 1.0)
0.9-1.0: Excellent - Retrieved docs perfectly match the question
0.7-0.89: Good - Most relevant information is present
0.5-0.69: Fair - Some relevant information, but gaps exist
<0.5: Poor - Retrieved docs don’t match the question
Faithfulness (0.0 - 1.0)
0.9-1.0: Excellent - Answer fully grounded in context, no hallucinations
0.7-0.89: Good - Mostly accurate with minor additions
0.5-0.69: Fair - Some hallucinated content
<0.5: Poor - Significant hallucinations
Answer Relevance (0.0 - 1.0)
0.9-1.0: Excellent - Directly answers the question
0.7-0.89: Good - Addresses the question with minor tangents
0.5-0.69: Fair - Partially relevant
<0.5: Poor - Answer doesn’t address the question
BLEU Score (0.0 - 1.0) - Only with reference answer
0.7-1.0: High lexical overlap with reference
0.4-0.69: Moderate overlap, similar wording
0.2-0.39: Low overlap, different wording
<0.2: Very different from reference
Semantic Similarity (0.0 - 1.0) - Only with reference answer
0.85-1.0: Nearly identical meaning
0.7-0.84: Similar meaning, different expression
0.5-0.69: Related but divergent
<0.5: Different meanings
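These thresholds become actionable once you enforce them in the request path. As a sketch, you could flag weak answers inside RagEvaluationService right after the metrics are computed; the cutoff values are illustrative assumptions:
// After evaluateResponse(...) returns, flag answers whose grounding
// or retrieval quality falls below a floor.
if (metrics.faithfulness < 0.7 || metrics.contextRelevance < 0.5) {
    LOG.warnf("Low-quality RAG answer flagged for review: %s", metrics);
    // Alternatives: return a fallback answer, retry retrieval with a
    // higher max-results, or route the conversation to a human.
}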
Next Steps to Level Up This Demo
Add More Documents: Expand your knowledge base
Implement Reranking: Improve retrieval quality with rerankers
Create Web UI: Build a React/Angular frontend
Add Contextual RAG: Use Claude’s contextual embeddings
Deploy to Production: Use Kubernetes with persistent stores
A/B Testing: Compare different RAG configurations
Multi-Language Support: Test with documents in various languages
We built a RAG system that can explain itself when it fails. The key takeaways:
Easy RAG makes document ingestion effortless - just point to a directory
Ollama enables completely local, cost-free RAG systems
RAGAS provides comprehensive, LLM-based evaluation without references
BLEU + Semantic Similarity add reference-based validation
Quarkus Integration gives you production-ready features out-of-the-box
Combined Metrics provide holistic view of RAG system quality
If you cannot measure it, you do not own it.