Pattern Matching Is Not Moderation: Semantic Toxicity Detection with Quarkus
A hands-on Java tutorial showing why toxicity is a semantic problem and how to enforce AI safety with Guardrails and ONNX models.
Most teams start toxicity filtering the same way they start prompt injection detection: with patterns. They define a list of forbidden words, add a few regular expressions, and maybe layer embeddings on top to “generalize” the match. This approach works briefly, then collapses under real usage.
The core issue is that toxicity is not a lexical property of text. It is semantic, contextual, and multi-dimensional. The same word can be harmless in one context and abusive in another. The same intent can be expressed across languages, dialects, slang, or code-switching without sharing surface structure. Pattern matching fundamentally operates on form. Toxicity lives in meaning.
This tutorial focuses on how to detect toxicity using Guardrails backed by dedicated semantic models, not generic embeddings or keyword lists. The goal is not academic purity, but a design that survives production traffic, multilingual input, and evolving user behavior.
Prereqs in One Sentence
You need Java 21, Maven, optionally the Quarkus CLI, and a machine that can comfortably hold a few hundred MB of ONNX model in memory.
Bootstrapping the Quarkus Project
We’ll build a small Quarkus service that exposes a /chat endpoint, runs a toxicity guardrail before it ever touches the LLM, and returns a structured moderation decision when it blocks.
Create the project with one Quarkus CLI command or grab the code from my GitHub repository.
quarkus create app com.example:guarded-chat-toxicity \
--java=21 \
--no-code \
--extensions="quarkus-rest-jackson,io.quarkiverse.langchain4j:quarkus-langchain4j-ollama"
cd guarded-chat-toxicity
quarkus-rest-jackson, because moderation results should be JSON without ceremony.
quarkus-langchain4j-ollama, because we still want a local chat model for the end-to-end flow, even though toxicity detection will be fully in-process.
Quarkus moves fast, so pin the Quarkus version to something current and stable. As of early 2026, the Quarkus site lists 3.30.6 as the latest release.
The Real Change: Dedicated Toxicity Model, Not Generic Embeddings
For production toxicity detection, we want a model trained specifically on toxicity signals, not an embedding model that just happens to “kind of” separate toxic from non-toxic text. We’ll use an ONNX-exported Detoxify-style multilingual model that produces multiple scores like toxicity, identity_attack, threat, and insult, because production decisions are never one-dimensional.
Add ONNX and Tokenizer
So far we have a plain Quarkus project. Time to add ONNX Runtime and the DJL tokenizer stack to run Hugging Face tokenizers locally. The model itself is ONNX, but tokenization is still the hidden sharp edge, so we keep it explicit and deterministic.
<onnxruntime.version>1.23.2</onnxruntime.version>
<djl.tokenizers.version>0.36.0</djl.tokenizers.version>
<dependency>
<groupId>com.microsoft.onnxruntime</groupId>
<artifactId>onnxruntime</artifactId>
<version>${onnxruntime.version}</version>
</dependency>
<dependency>
<groupId>ai.djl.huggingface</groupId>
<artifactId>tokenizers</artifactId>
<version>${djl.tokenizers.version}</version>
</dependency>
Download the Toxicity Model and Tokenizer
Create the directory:
mkdir -p src/main/resources/models
Download the quantized ONNX model and tokenizer files from the Detoxify ONNX repository. The quantized model is dramatically smaller and is designed to be used with ONNX Runtime. (Hugging Face)
curl -L -o src/main/resources/models/detoxify.quant.onnx \
https://huggingface.co/gravitee-io/detoxify-onnx/resolve/main/model.quant.onnx
curl -L -o src/main/resources/models/tokenizer.json \
https://huggingface.co/gravitee-io/detoxify-onnx/resolve/main/tokenizer.json
curl -L -o src/main/resources/models/tokenizer_config.json \
https://huggingface.co/gravitee-io/detoxify-onnx/resolve/main/tokenizer_config.json
How to Find and Pick the Right Model on Hugging Face
Here’s a practical guide for finding models like the one we’re using:
Start with a Real Use Case
Ask yourself: “What problem am I solving?”
Common examples:
Text classification (spam, sentiment, toxicity ← your case)
Translation
Text generation
Image recognition
Speech-to-text
For toxicity detection, you’d search: “toxicity detection”, “content moderation”, or “hate speech detection”
Search Hugging Face Hub
Go to https://huggingface.co/models and use filters:
Key Filters to Use:
Task: Select your task (e.g., “Text Classification”)
Library: Choose your framework
ONNX ← What you need (for Java/cross-platform)
transformers (Python-only)
TensorFlow, PyTorch, etc.
Language: Filter by supported languages (e.g., “multilingual”)
Sort by: “Most downloads” or “Trending” for popular/reliable models
Evaluate the Model Page
When you find a candidate (like gravitee-io/detoxify-onnx), check:
Model Card (README)
Look for:
What it does: Clear description of the task
Languages supported: English-only vs multilingual
Example code: Shows it’s production-ready
Performance metrics: Accuracy, F1 score, etc.
Files Tab
You need specific files for your use case:
ONNX format: Look for .onnx files (e.g., model.onnx)
Tokenizer files: tokenizer.json (required for preprocessing)
Config files: tokenizer_config.json (helpful for settings)
Model Metadata
Check the model card header:
library_name: transformers
pipeline_tag: text-classification
language: multilingual ← important for our use case!
License
Make sure you can use it:
MIT, Apache 2.0 → Commercial use OK
CC-BY-NC → Non-commercial only
Proprietary → Check restrictions
Why gravitee-io/detoxify-onnx Is a Good Choice
Let’s analyze what makes this model suitable:
ONNX Format
Files and versions
├── model.onnx ← Can run in Java with ONNX Runtime
├── tokenizer.json ← Works with DJL HuggingFace tokenizer
└── tokenizer_config.json
Why it matters: ONNX models are framework-agnostic. You can use them in Java, C#, JavaScript, etc.
Multilingual Support
Based on XLM-RoBERTa (mentioned in the model card):
Supports 100+ languages
Not just English
Clear Use Case
The model card states: “Detoxify is a toxicity detection model” with specific labels:
toxicity
severe_toxicity
obscene
threat
insult
identity_attack
sexual_explicit
This matches your exact need.
Red Flags to Avoid
No Model Files
Some repos only have training code, no actual .onnx or .bin files
Unclear Documentation
If the README doesn’t explain:
What the model does
How to use it
What inputs/outputs look like → Skip it
Wrong Format
If you need ONNX but only see .bin or .pt files:
You’d need to convert it yourself (complex)
Look for pre-converted versions
Abandoned Projects
Check:
Last updated date (avoid models not updated in 2+ years)
Number of downloads (low downloads = untested)
Community activity (discussions, issues)
Alternative: Convert Your Own Model
If you find a good PyTorch model without ONNX:
Option A: Use conversion tools
# Install converter
pip install transformers onnx
# Convert to ONNX
python -m transformers.onnx --model=unitary/toxic-bert export/
Option B: Look for someone who already did it
Search:
"[model-name] onnx"on Hugging FaceOften someone has already converted popular models
Configuration That Reflects Real Moderation Decisions
src/main/resources/application.properties
The model predicts multiple labels, and the model’s own config file documents exactly which labels it outputs.
# --- Ollama chat model for the demo flow ---
quarkus.langchain4j.ollama.chat-model.model-id=llama3.2
# --- Detoxify ONNX model ---
toxicity.model.path=src/main/resources/models/detoxify.quant.onnx
toxicity.tokenizer.path=src/main/resources/models/tokenizer.json
toxicity.tokenizer.config.path=src/main/resources/models/tokenizer_config.json
# --- Dimension-specific thresholds ---
toxicity.threshold.toxicity=0.70
toxicity.threshold.severe_toxicity=0.80
toxicity.threshold.identity_attack=0.70
toxicity.threshold.threat=0.80
toxicity.threshold.insult=0.75
toxicity.threshold.obscene=0.80
toxicity.threshold.sexual_explicit=0.85
# --- Logging ---
quarkus.log.category."com.example".level=INFO
This is the part most teams get wrong at first. You do not want “one toxicity threshold.” You want different tolerances for different harms, because “insult” is not the same risk as “threat,” and “identity attack” is almost always a hard block.
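If the list of individual threshold properties later feels unwieldy, the same keys can be collected into a single map with SmallRye Config. This is an optional sketch, not part of the tutorial’s code; it assumes the property names stay under the toxicity.threshold prefix shown above.
package com.example.toxicity;

import java.util.Map;

import io.smallrye.config.ConfigMapping;

// Optional alternative sketch: every toxicity.threshold.* entry from
// application.properties lands in one map, keyed by label name.
@ConfigMapping(prefix = "toxicity")
public interface ToxicityThresholds {

    // e.g. threshold().get("identity_attack") -> 0.70
    Map<String, Double> threshold();
}
A guardrail could then inject ToxicityThresholds once and iterate over threshold() instead of listing each dimension in its constructor.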
The Toxicity Classifier Running Inside the JVM
We’ll build this in three pieces: a model runner, a score object that keeps labels and probabilities explicit, and a guardrail that turns scores into an enforcement decision.
src/main/java/com/example/toxicity/ToxicityScores.java
package com.example.toxicity;
import java.util.Map;
public record ToxicityScores(Map<String, Double> probabilities) {
public double get(String label) {
return probabilities.getOrDefault(label, 0.0);
}
public String highestLabel() {
return probabilities.entrySet().stream()
.max(Map.Entry.comparingByValue())
.map(Map.Entry::getKey)
.orElse("unknown");
}
public double highestScore() {
return probabilities.values().stream()
.mapToDouble(Double::doubleValue)
.max()
.orElse(0.0);
}
}
This is intentionally boring. Moderation systems fail when they compress everything into a single number and then forget why something was blocked. Keeping labels and scores explicit makes operations and tuning survivable.
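To make the shape concrete, here is a tiny, hypothetical usage of the record; the numbers are invented for illustration and are not real model output.
import java.util.Map;

import com.example.toxicity.ToxicityScores;

public class ToxicityScoresDemo {

    public static void main(String[] args) {
        // Invented scores, purely to show how the record is read.
        ToxicityScores scores = new ToxicityScores(Map.of(
                "toxicity", 0.42,
                "insult", 0.67,
                "threat", 0.05));

        System.out.println(scores.highestLabel());          // insult
        System.out.println(scores.highestScore());          // 0.67
        System.out.println(scores.get("identity_attack"));  // 0.0 for an absent label
    }
}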
src/main/java/com/example/toxicity/ToxicityClassifier.java
package com.example.toxicity;
public interface ToxicityClassifier {
ToxicityScores predict(String text);
void warmup();
}
We keep this as an interface because you will eventually want to swap implementations, either for a different model, a hosted service, or an ensemble approach. The guardrail should not care.
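As a sketch of what that flexibility buys you, here is a hypothetical fallback wrapper (the class name and wiring are made up, not part of the tutorial code). The guardrail keeps depending on ToxicityClassifier and never learns that two models sit behind it.
package com.example.toxicity;

// Hypothetical sketch: delegate to a primary classifier and fall back to a
// secondary one (different model, hosted service, ...) if inference fails.
public class FallbackToxicityClassifier implements ToxicityClassifier {

    private final ToxicityClassifier primary;
    private final ToxicityClassifier fallback;

    public FallbackToxicityClassifier(ToxicityClassifier primary, ToxicityClassifier fallback) {
        this.primary = primary;
        this.fallback = fallback;
    }

    @Override
    public ToxicityScores predict(String text) {
        try {
            return primary.predict(text);
        } catch (RuntimeException e) {
            // e.g. an ONNX session failure: degrade instead of letting moderation go dark.
            return fallback.predict(text);
        }
    }

    @Override
    public void warmup() {
        primary.warmup();
        fallback.warmup();
    }
}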
src/main/java/com/example/toxicity/OnnxDetoxifyClassifier.java
This is the core. It loads the ONNX model once, uses the tokenizer to produce input_ids and attention_mask, runs inference, and converts logits to probabilities via sigmoid. The model card for the repository even shows the same transformation, which is exactly what we want to mirror in Java. (Hugging Face)
package com.example.toxicity;
import java.nio.FloatBuffer;
import java.nio.LongBuffer;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import org.eclipse.microprofile.config.inject.ConfigProperty;
import ai.djl.huggingface.tokenizers.Encoding;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;
import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;
import io.quarkus.logging.Log;
import jakarta.annotation.PostConstruct;
import jakarta.annotation.PreDestroy;
import jakarta.enterprise.context.ApplicationScoped;
@ApplicationScoped
public class OnnxDetoxifyClassifier implements ToxicityClassifier {
@ConfigProperty(name = "toxicity.model.path")
String modelPath;
@ConfigProperty(name = "toxicity.tokenizer.path")
String tokenizerPath;
private OrtEnvironment env;
private OrtSession session;
private HuggingFaceTokenizer tokenizer;
private static final String[] LABELS = {
"toxicity",
"severe_toxicity",
"obscene",
"identity_attack",
"insult",
"threat",
"sexual_explicit",
"male",
"female",
"homosexual_gay_or_lesbian",
"christian",
"jewish",
"muslim",
"black",
"white",
"psychiatric_or_mental_illness"
};
@PostConstruct
void init() {
try {
this.env = OrtEnvironment.getEnvironment();
this.session = env.createSession(Path.of(modelPath).toAbsolutePath().toString(),
new OrtSession.SessionOptions());
// Create options map to configure the tokenizer
Map<String, String> options = new HashMap<>();
options.put("truncation", "true");
options.put("padding", "true");
// HuggingFaceTokenizer automatically discovers tokenizer_config.json
// when tokenizer.json is in the same directory
Path tokenizerPathObj = Path.of(tokenizerPath);
// If tokenizerPath points to a file, use its parent directory so it can find
// all tokenizer files
Path tokenizerDir = tokenizerPathObj.toFile().isFile()
? tokenizerPathObj.getParent()
: tokenizerPathObj;
this.tokenizer = HuggingFaceTokenizer.newInstance(tokenizerDir, options);
Log.info("Detoxify ONNX classifier initialized.");
} catch (Exception e) {
throw new IllegalStateException("Failed to initialize Detoxify ONNX classifier", e);
}
}
@Override
public ToxicityScores predict(String text) {
try {
// Tokenize the input text into token IDs and attention mask using the
// configured tokenizer.
Encoding encoding = tokenizer.encode(text);
// Extract the token IDs (input_ids) that represent the text as a sequence of
// integers.
long[] inputIds = toLongArray(encoding.getIds());
// Extract the attention mask that indicates which tokens are real (1) vs
// padding (0).
long[] attentionMask = toLongArray(encoding.getAttentionMask());
// Define the tensor shape as [batch_size=1, sequence_length] for the model
// input.
long[] shape = new long[] { 1, inputIds.length };
// Create ONNX tensors from the input arrays and run model inference.
try (OnnxTensor inputIdsTensor = tensorOf(shape, inputIds);
OnnxTensor attentionMaskTensor = tensorOf(shape, attentionMask);
OrtSession.Result result = session.run(Map.of(
"input_ids", inputIdsTensor,
"attention_mask", attentionMaskTensor))) {
// Extract the output tensor containing the raw logits from the model.
OnnxTensor outputTensor = (OnnxTensor) result.get(0);
// Get the float buffer from the output tensor for efficient buffer-based
// access.
FloatBuffer logitsBuffer = outputTensor.getFloatBuffer();
// Apply sigmoid activation function to convert logits into probability scores.
double[] probs = sigmoid(logitsBuffer);
// Map each probability score to its corresponding toxicity label.
Map<String, Double> scores = new LinkedHashMap<>();
for (int i = 0; i < Math.min(LABELS.length, probs.length); i++) {
scores.put(LABELS[i], probs[i]);
}
// Return the toxicity scores wrapped in a ToxicityScores object.
return new ToxicityScores(scores);
}
} catch (OrtException e) {
throw new IllegalStateException("Failed to run toxicity inference", e);
}
}
@Override
public void warmup() {
predict("warmup");
}
private OnnxTensor tensorOf(long[] shape, long[] values) throws OrtException {
LongBuffer buffer = LongBuffer.wrap(values);
return OnnxTensor.createTensor(env, buffer, shape);
}
private static long[] toLongArray(long[] source) {
return source;
}
private static double[] sigmoid(FloatBuffer logits) {
int length = logits.limit();
double[] out = new double[length];
for (int i = 0; i < length; i++) {
double x = logits.get(i);
out[i] = 1.0 / (1.0 + Math.exp(-x));
}
return out;
}
@PreDestroy
void shutdown() {
try {
if (session != null)
session.close();
if (env != null)
env.close();
} catch (Exception e) {
Log.warn("Error while shutting down ONNX resources", e);
}
}
}
This is the first major difference from embedding-based toxicity approaches. You are no longer guessing whether a message “feels like toxicity.” You are getting calibrated signals from a model trained to detect it.
Here’s what happens in plain English, step by step:
WTF is going on in predict(String text)
Input: A text string like “You’re an idiot!”
Output: Probability scores (0.0 to 1.0) for different toxicity categories
1. Convert Text to Numbers (Tokenization)
Encoding encoding = tokenizer.encode(text);
The AI model can’t read text. It only understands numbers. The tokenizer breaks your text into pieces (tokens) and converts them to integer IDs.
Example: "You're an idiot" → [345, 1289, 67, 9843]
2. Get Two Important Arrays
long[] inputIds = toLongArray(encoding.getIds());
long[] attentionMask = toLongArray(encoding.getAttentionMask());
inputIds: The actual token numbers representing your text
attentionMask: A list of 1s and 0s telling the model which numbers are real text (1) vs padding/filler (0)
Example with padding to length 8:
inputIds: [345, 1289, 67, 9843, 1, 1, 1, 1] ← padded with 1s
attentionMask: [1, 1, 1, 1, 0, 0, 0, 0] ← “ignore the last 4”
3. Package Data for the Model
long[] shape = new long[] { 1, inputIds.length };
Think of this as specifying the dimensions of a spreadsheet: 1 row × as many columns as there are tokens.
4. Run the AI Model
OrtSession.Result result = session.run(Map.of(
"input_ids", inputIdsTensor,
"attention_mask", attentionMaskTensor))This is like calling an API. You send the prepared data to the trained AI model, and it processes it using billions of learned parameters.
5. Convert Raw Output to Probabilities
FloatBuffer logitsBuffer = outputTensor.getFloatBuffer();
double[] probs = sigmoid(logitsBuffer);
The model returns “logits” (raw scores like -2.5, 3.1, 0.7). These aren’t probabilities yet.
The sigmoid function squashes them into the 0.0–1.0 range:
-2.5 → 0.08 (8% probability)
3.1 → 0.96 (96% probability)
0.7 → 0.67 (67% probability)
6. Label the Results
Map<String, Double> scores = new LinkedHashMap<>();
for (int i = 0; i < LABELS.length; i++) {
scores.put(LABELS[i], probs[i]);
}
Match each probability with its meaning:
"toxicity" → 0.92
"insult" → 0.85
"threat" → 0.12
"sexual_explicit" → 0.03
...
In one sentence: Convert text to numbers → feed to AI model → convert AI’s raw scores to probabilities → attach labels → return results.
It’s basically like sending a text message to a specialized calculator that’s been trained on millions of examples to recognize toxic language patterns.
The Guardrail That Turns Scores Into Enforcement
src/main/java/com/example/guardrails/ToxicityGuardrail.java
We implement an InputGuardrail because this is user input validation. The guardrail pulls thresholds from configuration, evaluates the relevant dimensions, and blocks with a precise reason that includes which dimension tripped and what score it had.
package com.example.guardrails;
import java.util.LinkedHashMap;
import java.util.Map;
import org.eclipse.microprofile.config.inject.ConfigProperty;
import com.example.toxicity.ToxicityClassifier;
import com.example.toxicity.ToxicityScores;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.guardrail.InputGuardrail;
import dev.langchain4j.guardrail.InputGuardrailResult;
import io.quarkus.logging.Log;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
@ApplicationScoped
public class ToxicityGuardrail implements InputGuardrail {
private final ToxicityClassifier classifier;
private final Map<String, Double> thresholds;
@Inject
public ToxicityGuardrail(ToxicityClassifier classifier,
@ConfigProperty(name = "toxicity.threshold.toxicity") double toxicity,
@ConfigProperty(name = "toxicity.threshold.severe_toxicity") double severeToxicity,
@ConfigProperty(name = "toxicity.threshold.identity_attack") double identityAttack,
@ConfigProperty(name = "toxicity.threshold.threat") double threat,
@ConfigProperty(name = "toxicity.threshold.insult") double insult,
@ConfigProperty(name = "toxicity.threshold.obscene") double obscene,
@ConfigProperty(name = "toxicity.threshold.sexual_explicit") double sexualExplicit) {
this.classifier = classifier;
this.thresholds = new LinkedHashMap<>();
thresholds.put("toxicity", toxicity);
thresholds.put("severe_toxicity", severeToxicity);
thresholds.put("identity_attack", identityAttack);
thresholds.put("threat", threat);
thresholds.put("insult", insult);
thresholds.put("obscene", obscene);
thresholds.put("sexual_explicit", sexualExplicit);
}
@Override
public InputGuardrailResult validate(UserMessage userMessage) {
String input = userMessage.singleText();
ToxicityScores scores = classifier.predict(input);
for (var entry : thresholds.entrySet()) {
String label = entry.getKey();
double threshold = entry.getValue();
double score = scores.get(label);
if (score >= threshold) {
Log.warnf("Toxicity blocked: label=%s score=%.4f threshold=%.4f text=%s",
label, score, threshold, input);
return failure("Toxicity detected (" + label + "): " + String.format("%.2f", score));
}
}
Log.infof("Toxicity passed: highest=%s score=%.4f",
scores.highestLabel(), scores.highestScore());
return success();
}
}
This is where multilingual reality shows up. The model does not care whether the insult is in English, German, or Portuguese. It sees token-level patterns learned from training data, and you get probabilities you can act on.
Wiring the Guardrail Into the AI Service
Quarkus LangChain4j runs input guardrails before the LLM call when you annotate the AI method with @InputGuardrails.
src/main/java/com/example/ai/ChatBot.java
package com.example.ai;
import com.example.guardrails.ToxicityGuardrail;
import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;
import dev.langchain4j.service.guardrail.InputGuardrails;
import io.quarkiverse.langchain4j.RegisterAiService;
@RegisterAiService
public interface ChatBot {
@SystemMessage("You are a helpful assistant. Keep responses short and professional.")
@InputGuardrails(ToxicityGuardrail.class)
String chat(@UserMessage String message);
}
This is important operationally. It guarantees you do not have “one endpoint” that forgets to call the guardrail. The AI service becomes the choke point.
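If you later add more input checks, they attach at the same choke point. A minimal sketch, assuming a hypothetical PromptInjectionGuard bean exists; in the LangChain4j guardrail API used here, listed guardrails should run in the declared order.
// Sketch only: PromptInjectionGuard is hypothetical and not part of this tutorial.
@SystemMessage("You are a helpful assistant. Keep responses short and professional.")
@InputGuardrails({ PromptInjectionGuard.class, ToxicityGuardrail.class })
String chat(@UserMessage String message);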
REST Endpoint and Failure Behavior That Doesn’t Kill the Client
When a guardrail fails, LangChain4j throws an exception. If you do not catch and map it, your API turns into a vague 500 and your front-end gets nothing useful.
src/main/java/com/example/api/ChatResource.java
package com.example.api;
import com.example.ai.ChatBot;
import jakarta.inject.Inject;
import jakarta.ws.rs.Consumes;
import jakarta.ws.rs.POST;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.MediaType;
@Path("/chat")
@Consumes(MediaType.APPLICATION_JSON)
@Produces(MediaType.APPLICATION_JSON)
public class ChatResource {
@Inject
ChatBot chatBot;
@POST
public ChatResponse chat(ChatRequest request) {
String reply = chatBot.chat(request.message());
return new ChatResponse("ALLOWED", reply);
}
public record ChatRequest(String message) {
}
public record ChatResponse(String status, String reply) {
}
}
src/main/java/com/example/api/GuardrailExceptionMapper.java
package com.example.api;
import dev.langchain4j.guardrail.InputGuardrailException;
import jakarta.ws.rs.core.Response;
import jakarta.ws.rs.ext.ExceptionMapper;
import jakarta.ws.rs.ext.Provider;
@Provider
public class GuardrailExceptionMapper implements ExceptionMapper<InputGuardrailException> {
@Override
public Response toResponse(InputGuardrailException exception) {
return Response.status(Response.Status.BAD_REQUEST)
.entity(new ErrorResponse("BLOCKED", exception.getMessage()))
.build();
}
public record ErrorResponse(String status, String reason) {
}
}
Now the system behaves like a moderation system should. It blocks explicitly and predictably, without tearing down the request path.
Run It End-to-End
Run your application:
quarkus dev
Send a normal message:
curl -s -X POST http://localhost:8080/chat \
-H "Content-Type: application/json" \
-d '{"message":"Can you summarize Quarkus Dev Services for me?"}'Expected output (similar to below. Watch the status!):
{
"status": "ALLOWED",
"reply": "Quarkus Dev Services is an open-source platform that provides a suite of tools to simplify development, build, test, and deploy Java applications using the Quarkus framework. It includes:\n\n* A CLI (Command-Line Interface) for building and managing Quarkus applications\n* A IDE (Integrated Development Environment) extension for popular editors like Eclipse and IntelliJ IDEA\n* A set of plugins for Jenkins and other CI/CD tools\n\nQuarkus Dev Services aims to make it easier for developers to create, deploy, and manage high-performance, cloud-native Java applications."
}
Now send something that should clearly trip the classifier:
curl -s -X POST http://localhost:8080/chat \
-H "Content-Type: application/json" \
-d '{"message":"You worthless idiot."}'Expected output:
{
"status": "BLOCKED",
"reason": "The guardrail com.example.guardrails.ToxicityGuardrail_ClientProxy failed with this message: Toxicity detected (toxicity): 1.00"
}
The exact score will differ across hardware and model versions, but the shape of the result is stable: you get a category and a probability, not a hand-wavy “matched toxic description.”
Oh, and let’s try German:
curl -s -X POST http://localhost:8080/chat \
-H "Content-Type: application/json" \
-d '{"message":"Du bist ein totaler Vollidiot."}'This means something like: "You are a complete idiot."
Well,
{
"status": "BLOCKED",
"reason": "The guardrail com.example.guardrails.ToxicityGuardrail_ClientProxy failed with this message: Toxicity detected (toxicity): 0.92"
}
The “Moving Target” Problem
Detecting modern slang and cultural idioms is notoriously difficult because these expressions rely on contextual fluidity and rapid evolution, which usually outpace the static datasets used to train AI models. Unlike standard profanity, which is explicit and permanent, slang often repurposes neutral words (like the German Opfer, literally “victim” but used to mean “loser”) or employs “algospeak” to evade filters, effectively creating a moving target that requires a model to understand current cultural intent rather than just dictionary definitions. Furthermore, idioms often lack compositional meaning: analyzing the individual words in a term like “Backpfeifengesicht” (roughly, “a face that invites a slap”) won’t reveal the violence implied, forcing the classifier to rely on deep, region-specific cultural knowledge that is frequently underrepresented in standard training data.
Why Models Struggle (The Technical Breakdown)
Semantic Shift (Repurposing): Models are trained that Word A = Good and Word B = Bad. Slang breaks this by taking a “Good” word (e.g., “basic,” “fruity,” or “snowflake”) and using it as an insult. The model sees the neutral token and clears it, missing the toxicity entirely.
Data Latency (The Time Lag): Training a model takes time. Internet slang evolves in weeks. By the time a model learns that a specific emoji or phrase is currently being used as a slur on TikTok or Reddit, the community may have already moved on to a new one.
Tokenization Limits: Idioms are “non-compositional.” If a model reads the Spanish idiom Me cago en la leche, it tokenizes “cago” (bad) and “leche” (neutral). Depending on how the model weights these, the neutral noun might dilute the toxicity score, causing a “False Negative.”
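One pragmatic countermeasure is a small “drift watch”: keep a list of phrases you currently care about (slang, idioms, algospeak) and log their scores regularly instead of hard-asserting on them, so a shift in model behavior shows up in your logs rather than as a broken build. A minimal sketch, reusing the examples from this section; the watchlist entries and class name are illustrative, not part of the tutorial code.
package com.example.toxicity;

import java.util.List;

import org.junit.jupiter.api.Test;

import io.quarkus.logging.Log;
import io.quarkus.test.junit.QuarkusTest;
import jakarta.inject.Inject;

// Sketch only: records scores for slang/idiom phrases without asserting a block.
@QuarkusTest
class ToxicityDriftWatchTest {

    @Inject
    ToxicityClassifier classifier;

    @Test
    void logWatchlistScores() {
        List<String> watchlist = List.of(
                "Du bist so ein Opfer.",   // German slang: "Opfer" (victim) used as "loser"
                "Me cago en la leche",     // Spanish idiom from the section above
                "Backpfeifengesicht");     // German compound discussed above

        for (String phrase : watchlist) {
            ToxicityScores scores = classifier.predict(phrase);
            Log.infof("drift-watch: \"%s\" -> %s=%.2f",
                    phrase, scores.highestLabel(), scores.highestScore());
        }
    }
}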
Production Hardening: Where This Approach Actually Holds Up
The win here is not just accuracy. It’s operational control. Because the model outputs multiple dimensions, you can tune “identity attack” to be a hard stop while letting mild “toxicity” float higher in an internal tool where people vent. Because it runs in-process, you do not pay network latency per request, and you do not turn moderation into an external dependency that fails at the worst possible time.
You still need to treat this as a system, not a library call. The model is large, warmup matters, and memory pressure is real. You should call warmup() on startup in production so the first user does not pay the cold path. You should log scores at a controlled sampling rate, not for every message, because moderation logs can become sensitive data. You should also accept that thresholds are not “set and forget.” They drift with your user base, your language mix, and your domain.
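A minimal sketch of the warmup wiring, assuming the classifier bean from above; Quarkus fires StartupEvent once the application has started, which is a natural place to pay the cold-start cost.
package com.example.toxicity;

import io.quarkus.logging.Log;
import io.quarkus.runtime.StartupEvent;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.enterprise.event.Observes;
import jakarta.inject.Inject;

// Runs one dummy prediction at startup so the first real request
// does not pay for tokenizer/ONNX initialization and JIT warmup.
@ApplicationScoped
public class ToxicityWarmup {

    @Inject
    ToxicityClassifier classifier;

    void onStart(@Observes StartupEvent event) {
        classifier.warmup();
        Log.info("Toxicity classifier warmed up.");
    }
}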
Most importantly, this approach finally aligns with the multilingual reality that kicked this whole thing off. Toxic intent does not arrive with English phrasing, and production toxicity detection cannot depend on English abstractions. A dedicated classifier gives you a stable signal across languages, and that stability is what makes guardrails feel like engineering instead of superstition.
Verification With a Real Test
Testing guardrails is different from testing business logic.
You are not asserting exact outputs. You are asserting behavior under known conditions. A clearly benign input should pass. A clearly abusive input should fail. Edge cases should be observable and tunable.
Because the classifier runs locally and deterministically, tests are stable. Because guardrail failures are explicit, failures can be surfaced cleanly to clients and logged for analysis.
This is what makes guardrails operational rather than aspirational.
src/test/java/com/example/guardrails/ToxicityGuardrailTest.java
package com.example.guardrails;
import dev.langchain4j.data.message.UserMessage;
import io.quarkus.test.junit.QuarkusTest;
import jakarta.inject.Inject;
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.*;
@QuarkusTest
class ToxicityGuardrailTest {
@Inject
ToxicityGuardrail guardrail;
@Test
void shouldPassWhenAllScoresBelowThresholds() {
var result = guardrail.validate(UserMessage.from("Hello, how are you?"));
assertTrue(result.isSuccess());
}
@Test
void shouldFailWhenToxicityExceedsThreshold() {
var result = guardrail.validate(UserMessage.from("You are a terrible person and I hate you."));
assertFalse(result.isSuccess());
}
@Test
void shouldFailWhenSevereToxicityExceedsThreshold() {
var result = guardrail.validate(UserMessage.from("You are an absolutely disgusting human being."));
assertFalse(result.isSuccess());
}
}
Run it:
./mvnw test
Expected output includes a passing test suite, and your logs will show which dimension tripped for the toxic case.
Conclusion
Toxicity detection cannot be reduced to matching words or comparing embeddings to vague descriptions. It is a semantic classification problem that requires models trained specifically for that purpose.
Guardrails provide the right abstraction for enforcing toxicity policies because they separate detection from generation and policy from implementation.
By using a dedicated toxicity classifier, multi-dimensional scoring, and in-process inference, you get a system that is fast, multilingual by design, and tunable without guesswork.
Once teams internalize that toxicity is semantic, not lexical, the architecture almost designs itself.


