Unicode Defense in Java: The Complete Guide
How invisible characters, homograph attacks, and BiDi overrides break production systems and how to stop them in Quarkus.
Most developers treat text validation as a solved problem. You check length, maybe strip a few characters, and rely on the database to enforce uniqueness. That mental model comes from an ASCII world where one byte equals one character and what you see is what the computer sees.
That model breaks the moment Unicode enters the system. Modern applications accept usernames, filenames, search terms, and free-text input from browsers, APIs, and mobile clients that all speak full Unicode. Characters that look identical can be different at the byte level. Characters that are invisible to humans are still meaningful to machines. Directional control characters can change how text is rendered without changing what the operating system executes.
I have written about Unicode before; see my earlier article on the topic for additional background.
In production, this does not show up as “weird text.” It shows up as duplicated accounts that look identical, blocklists that stop working, spoofed filenames that pass reviews, and audit trails you can no longer reason about. These failures are subtle, persistent, and expensive to clean up after the fact.
In this tutorial, we build a Quarkus application that treats Unicode as a security boundary. We validate and sanitize text at the edge, before it reaches business logic or the database. The goal is not perfect Unicode handling. The goal is predictable behavior under attack.
Prerequisites
You need a working Java and Quarkus setup to follow along.
Java 21 installed
Quarkus CLI available on your path
Basic understanding of REST endpoints and Bean Validation
Project Setup
Create the project, or start from my GitHub repository if you want to jump straight to the source code.
```shell
quarkus create app com.secure.text:unicode-defense \
    --extension=quarkus-rest-jackson,quarkus-hibernate-validator
```

Move into the directory:

```shell
cd unicode-defense
```

We use quarkus-rest-jackson for a modern REST API and quarkus-hibernate-validator to integrate custom validation logic with standard Jakarta Bean Validation. No additional libraries are required. Everything else is pure Java.
Implementing the Unicode Defense Engine
Text handling logic should not be scattered across controllers and validators. We centralize all Unicode-related rules in a single utility class. This gives us one place to reason about guarantees and limits.
The TextSanitizer has two responsibilities. First, it normalizes and cleans input before it is stored or processed. Second, it detects unsafe input so validation can fail fast. These are separate concerns. Sanitization changes data. Validation decides whether data is acceptable at all.
We will tackle four specific categories of Unicode threats.
Threat 1: The Invisible Enemy (Zero-Width Characters)
Unicode includes characters designed to be invisible. They exist to handle formatting (like deciding where a word should break), but malicious actors use them to create visual duplicates that are byte-distinct.
Attack Vector A (Impersonation): An attacker registers the username `admin\u200B` (where `\u200B` is a zero-width space). To a human, it looks exactly like `admin`. To the database, it is a unique string.

Attack Vector B (Filter Bypass): A profanity filter blocks "badword". An attacker submits "bad\u200Bword". The filter sees a mismatch and lets it through, but the browser renders it as the banned word.
The Logic & Regex
We need to explicitly strip these characters. Standard whitespace removal (trim()) misses them, because Java does not treat them as whitespace.
The Regex Pattern:
```java
private static final Pattern INVISIBLE_CHARS = Pattern.compile(
        "[\\u200B\\u200C\\u200D\\uFEFF\\u00AD]"
);
```

Breakdown:

`\u200B` (Zero-Width Space): The classic invisible separator.

`\u200C` (Zero-Width Non-Joiner): Used in Arabic/Persian to prevent letters from connecting.

`\u200D` (Zero-Width Joiner): Used to combine emojis (e.g., 👨 + 🌾 = 👨🌾). While valid in chat apps, it is dangerous in usernames/filenames.

`\uFEFF` (Byte Order Mark): Often appears at the start of files but has no place in a username.

`\u00AD` (Soft Hyphen): Invisible unless the text wraps at the end of a line.
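To see why trim() is not enough, here is a minimal, self-contained sketch applying this pattern to a spoofed username. The class and method names (`InvisibleDemo`, `strip`) are illustrative, not part of the article's TextSanitizer:

```java
import java.util.regex.Pattern;

// Illustrative sketch: strip invisible characters that trim() leaves behind.
public class InvisibleDemo {

    static final Pattern INVISIBLE_CHARS =
            Pattern.compile("[\\u200B\\u200C\\u200D\\uFEFF\\u00AD]");

    static String strip(String input) {
        return INVISIBLE_CHARS.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String spoofed = "admin\u200B"; // renders exactly like "admin"
        System.out.println(spoofed.equals("admin"));        // false
        System.out.println(spoofed.trim().equals("admin")); // false: ZWSP is above U+0020, so trim() keeps it
        System.out.println(strip(spoofed).equals("admin")); // true
    }
}
```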
Threat 2: The “Trojan Horse” (BiDi Overrides)
Computers support both Left-to-Right (Latin) and Right-to-Left (Arabic/Hebrew) text. Unicode control characters allow you to switch direction mid-string. Attackers use this to mask file extensions.
The Attack: An attacker names a file `taxes_cod\u202Efdp.exe`.

The Reality: The `\u202E` character is the Right-to-Left Override (RLO). It tells the computer: "Render everything after me backwards."

The Rendering: The computer renders `fdp.exe` backwards as `exe.pdf`.

The Result: The user sees `taxes_codexe.pdf` (looks like a harmless PDF), but the operating system sees an `.exe` file.
The Logic & Regex
For fields like filenames, usernames, or emails, there is rarely a valid reason to forcibly override text direction. We block the entire range of directional formatting codes.
The Regex Pattern:
```java
private static final Pattern BIDI_OVERRIDE = Pattern.compile(
        "[\\u202A-\\u202E\\u2066-\\u2069]"
);
```

Breakdown:

`\u202A-\u202E`: The classic embedding and override controls (LRE, RLE, PDF, LRO, RLO).

`\u2066-\u2069`: The newer "Isolate" controls introduced in Unicode 6.3, which perform similar functions but isolate the text from its surroundings.
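A short, self-contained sketch showing the pattern catching an RLO-spoofed filename. The class name `BidiDemo` is illustrative:

```java
import java.util.regex.Pattern;

// Illustrative sketch: detect BiDi control characters in a filename.
public class BidiDemo {

    static final Pattern BIDI_OVERRIDE =
            Pattern.compile("[\\u202A-\\u202E\\u2066-\\u2069]");

    static boolean containsBidi(String input) {
        return BIDI_OVERRIDE.matcher(input).find();
    }

    public static void main(String[] args) {
        // U+202E (RLO) makes the trailing "fdp.exe" render as "exe.pdf"
        String spoofed = "taxes_cod\u202Efdp.exe";
        System.out.println(containsBidi(spoofed));     // true
        System.out.println(containsBidi("taxes.pdf")); // false
    }
}
```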
Threat 3: The “Evil Twin” (Homograph Attacks)
Homographs are characters that look the same but are mathematically different. The most common attack involves mixing Latin characters with Cyrillic or Greek characters that appear identical.
Latin ‘a’: U+0061
Cyrillic ‘а’: U+0430
Attack: `pаypal.com` (using Cyrillic 'а'). Visually identical to `paypal.com`.
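You can verify the difference directly in plain Java. This small sketch (the class name `HomographDemo` is illustrative) inspects the underlying code points:

```java
// Illustrative sketch: two visually identical strings, two different code points.
public class HomographDemo {

    public static void main(String[] args) {
        String latin = "paypal";
        String spoofed = "p\u0430ypal"; // Cyrillic U+0430 instead of Latin 'a'

        System.out.println(latin.equals(spoofed)); // false: different code points
        System.out.printf("U+%04X vs U+%04X%n",
                (int) latin.charAt(1), (int) spoofed.charAt(1)); // U+0061 vs U+0430
        // The UnicodeBlock API exposes the script difference
        System.out.println(Character.UnicodeBlock.of(spoofed.charAt(1))); // CYRILLIC
    }
}
```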
The Logic
We cannot simply ban Cyrillic (legitimate users have Cyrillic names). Instead, we detect Script Mixing. It is highly suspicious for a single string to contain both Basic Latin and Cyrillic characters.
The Implementation: We don’t use a simple Regex here because the ranges are massive. We use Java’s Character.UnicodeBlock API.
```java
public static boolean isSuspiciousMixedScript(String input) {
    Set<UnicodeBlock> blocks = new HashSet<>();
    input.codePoints().forEach(cp -> {
        UnicodeBlock block = UnicodeBlock.of(cp);
        // Ignore "Common" blocks like numbers/punctuation/spaces
        // that are valid in all languages.
        if (!isCommonBlock(block)) {
            blocks.add(block);
        }
    });
    // The Logic: "Does this string contain BOTH Latin AND Cyrillic?"
    boolean hasCyrillic = blocks.contains(UnicodeBlock.CYRILLIC);
    boolean hasLatin = blocks.contains(UnicodeBlock.LATIN_EXTENDED_A) ||
            blocks.contains(UnicodeBlock.BASIC_LATIN);
    return hasCyrillic && hasLatin;
}
```

Threat 4: The Shape Shifter (Normalization)
In Unicode, you can write the character “é” in two ways:
Precomposed (NFC): A single code point, `U+00E9`.

Decomposed (NFD): The letter 'e' (`U+0065`) followed by a combining acute accent (`U+0301`).
To a user, they look identical. To Java, "café".equals("cafe\u0301") returns false. This breaks database unique constraints, caching keys, and equality checks.
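A minimal demonstration using only java.text.Normalizer from the standard library (the class name `NormalizeDemo` is illustrative):

```java
import java.text.Normalizer;

// Illustrative sketch: NFC normalization makes the two spellings of "café" equal.
public class NormalizeDemo {

    public static void main(String[] args) {
        String precomposed = "caf\u00E9"; // é as a single code point (NFC form)
        String decomposed = "cafe\u0301"; // 'e' + combining acute accent (NFD form)

        System.out.println(precomposed.equals(decomposed)); // false

        // After NFC normalization, both collapse to the same representation
        String a = Normalizer.normalize(precomposed, Normalizer.Form.NFC);
        String b = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(a.equals(b)); // true
    }
}
```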
The Logic
We must force all incoming text into a single “Canonical Form” before we process it. NFC (Normalization Form C) is the web standard—it prefers the single, precomposed character.
The Implementation:
```java
String normalized = Normalizer.normalize(input, Normalizer.Form.NFC);
```

Implementation
Create src/main/java/com/secure/text/util/TextSanitizer.java:
```java
package com.secure.text.util;

import java.lang.Character.UnicodeBlock;
import java.text.Normalizer;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

public class TextSanitizer {

    // THREAT 1: Invisible Characters
    // Removes ZWSP, ZWNJ, ZWJ, BOM, and Soft Hyphen
    private static final Pattern INVISIBLE_CHARS = Pattern.compile(
            "[\\u200B\\u200C\\u200D\\uFEFF\\u00AD]");

    // THREAT 2: BiDi Overrides
    // Removes Left-to-Right and Right-to-Left overrides that mask extensions
    private static final Pattern BIDI_OVERRIDE = Pattern.compile(
            "[\\u202A-\\u202E\\u2066-\\u2069]");

    // DEFENSE IN DEPTH: Control Characters
    // Removes ASCII control chars (Bell, Escape, etc.) but KEEPS \r\n\t
    private static final Pattern CONTROL_CHARS = Pattern.compile(
            "[\\p{Cc}&&[^\r\n\t]]");

    /**
     * The Master Cleaning Function.
     * Call this BEFORE saving data to your database.
     */
    public static String sanitize(String input) {
        if (input == null) {
            return null;
        }
        // 1. Normalize: Fix the "café" problem (Precomposed vs Decomposed)
        String normalized = Normalizer.normalize(input, Normalizer.Form.NFC);
        // 2. Scrub: Remove invisible/dangerous characters
        normalized = INVISIBLE_CHARS.matcher(normalized).replaceAll("");
        normalized = BIDI_OVERRIDE.matcher(normalized).replaceAll("");
        normalized = CONTROL_CHARS.matcher(normalized).replaceAll("");
        return normalized.trim();
    }

    /**
     * The Validation Function.
     * Call this inside your @SafeText validator.
     */
    public static boolean isSafe(String input) {
        if (input == null) {
            return true;
        }
        // If we find any match, the string is unsafe
        return !INVISIBLE_CHARS.matcher(input).find() &&
                !BIDI_OVERRIDE.matcher(input).find() &&
                !CONTROL_CHARS.matcher(input).find();
    }

    /**
     * THREAT 3: Homograph Detection
     * Checks if input dangerously mixes scripts (e.g. Cyrillic + Latin).
     * Call this explicitly in your Controller for sensitive fields like usernames.
     */
    public static boolean isSuspiciousMixedScript(String input) {
        if (input == null || input.isEmpty()) {
            return false;
        }
        Set<UnicodeBlock> blocks = new HashSet<>();
        // Analyze every code point (character) in the string
        input.codePoints().forEach(cp -> {
            UnicodeBlock block = UnicodeBlock.of(cp);
            if (block != null && !isCommonBlock(block)) {
                blocks.add(block);
            }
        });
        // The Rule: You cannot have Cyrillic AND Latin in the same string.
        boolean hasCyrillic = blocks.contains(UnicodeBlock.CYRILLIC);
        boolean hasLatin = blocks.contains(UnicodeBlock.LATIN_EXTENDED_A) ||
                blocks.contains(UnicodeBlock.BASIC_LATIN);
        return hasCyrillic && hasLatin;
    }

    // Helper to ignore "safe" shared blocks such as shared number forms and
    // Latin-1 punctuation/symbols
    private static boolean isCommonBlock(UnicodeBlock block) {
        return block == UnicodeBlock.COMMON_INDIC_NUMBER_FORMS ||
                block == UnicodeBlock.LATIN_1_SUPPLEMENT;
        // Note: BASIC_LATIN is NOT filtered out because we need to detect
        // homograph attacks where Cyrillic and Latin are mixed
    }
}
```

All text that passes validation and is sanitized is normalized to NFC, free of invisible separators, free of BiDi overrides, and free of unexpected control characters. Equality checks and database uniqueness behave predictably.
This does not make text “safe” for HTML, SQL, or file systems. It also does not prevent every homograph attack. It detects only high-risk script mixing.
Regex alone cannot reliably detect script mixing. Normalization must happen before any comparison. Centralizing this logic avoids inconsistent behavior across endpoints.
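As a condensed, self-contained sketch of the same pipeline (the class name `SanitizePipelineDemo` is illustrative, and the two patterns are merged into one alternation for brevity; this is not the article's TextSanitizer itself):

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Condensed sketch of the normalize-then-scrub pipeline.
public class SanitizePipelineDemo {

    // Invisible chars + BiDi overrides + control chars (keeping \r \n \t)
    static final Pattern UNSAFE = Pattern.compile(
            "[\\u200B\\u200C\\u200D\\uFEFF\\u00AD\\u202A-\\u202E\\u2066-\\u2069]"
                    + "|[\\p{Cc}&&[^\r\n\t]]");

    static String sanitize(String input) {
        if (input == null) {
            return null;
        }
        // Normalize first, then scrub, in the same order as TextSanitizer
        String normalized = Normalizer.normalize(input, Normalizer.Form.NFC);
        return UNSAFE.matcher(normalized).replaceAll("").trim();
    }

    public static void main(String[] args) {
        System.out.println(sanitize("admin\u200B"));                    // admin
        System.out.println(sanitize("cafe\u0301").equals("caf\u00E9")); // true
        System.out.println(sanitize("file \u202Etxt.exe"));             // file txt.exe
    }
}
```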
Integrating with Bean Validation
Validation should be declarative wherever possible. We expose the safety check as a custom constraint so it can be reused across DTOs.
Hibernate Validator is the reference implementation of Jakarta Bean Validation. By creating a custom constraint, we plug Unicode checks into the same lifecycle as @NotBlank or @Email.
Annotation
Create src/main/java/com/secure/text/validation/SafeText.java:
```java
package com.secure.text.validation;

import jakarta.validation.Constraint;
import jakarta.validation.Payload;
import java.lang.annotation.Documented;
import java.lang.annotation.Retention;
import java.lang.annotation.Target;

import static java.lang.annotation.ElementType.FIELD;
import static java.lang.annotation.ElementType.PARAMETER;
import static java.lang.annotation.RetentionPolicy.RUNTIME;

@Target({FIELD, PARAMETER})
@Retention(RUNTIME)
@Constraint(validatedBy = SafeTextValidator.class)
@Documented
public @interface SafeText {

    String message() default "Input contains invalid or dangerous Unicode characters";

    Class<?>[] groups() default {};

    Class<? extends Payload>[] payload() default {};
}
```

Validator
Create src/main/java/com/secure/text/validation/SafeTextValidator.java:
```java
package com.secure.text.validation;

import com.secure.text.util.TextSanitizer;
import jakarta.validation.ConstraintValidator;
import jakarta.validation.ConstraintValidatorContext;

public class SafeTextValidator implements ConstraintValidator<SafeText, String> {

    @Override
    public boolean isValid(String value, ConstraintValidatorContext context) {
        if (value == null) {
            return true;
        }
        return TextSanitizer.isSafe(value);
    }
}
```

Any field annotated with @SafeText fails validation if dangerous Unicode characters are present. Note that it does not sanitize input: validation rejects, while sanitization must still happen explicitly before persistence. Validators should not mutate state. Mixing validation and transformation makes behavior hard to reason about and harder to test.
REST API and DTO Wiring
Now we wire everything together in a realistic API flow.
DTO
Create src/main/java/com/secure/text/dto/UserDto.java:
```java
package com.secure.text.dto;

import com.secure.text.validation.SafeText;
import jakarta.validation.constraints.Email;
import jakarta.validation.constraints.NotBlank;
import jakarta.validation.constraints.Size;

public class UserDto {

    @NotBlank
    @Size(min = 3, max = 20)
    @SafeText(message = "Username contains invisible or control characters")
    public String username;

    @NotBlank
    @Email
    public String email;

    @NotBlank
    @Size(max = 50)
    @SafeText
    public String bio;
}
```

Resource
Create src/main/java/com/secure/text/resource/UserResource.java:
```java
package com.secure.text.resource;

import com.secure.text.dto.UserDto;
import com.secure.text.util.TextSanitizer;
import jakarta.validation.Valid;
import jakarta.ws.rs.Consumes;
import jakarta.ws.rs.POST;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.MediaType;
import jakarta.ws.rs.core.Response;

@Path("/api/users")
@Consumes(MediaType.APPLICATION_JSON)
@Produces(MediaType.APPLICATION_JSON)
public class UserResource {

    @POST
    public Response createUser(@Valid UserDto user) {
        if (TextSanitizer.isSuspiciousMixedScript(user.username)) {
            return Response.status(Response.Status.BAD_REQUEST)
                    .entity("{\"error\":\"Username looks suspicious (mixed scripts detected)\"}")
                    .build();
        }
        user.username = TextSanitizer.sanitize(user.username);
        user.bio = TextSanitizer.sanitize(user.bio);
        return Response.ok(user).build();
    }
}
```

Verification
We verify behavior with integration tests that exercise real HTTP requests.
Create src/test/java/com/secure/text/UserResourceTest.java:
```java
package com.secure.text;

import static io.restassured.RestAssured.given;
import static org.hamcrest.CoreMatchers.containsString;

import org.junit.jupiter.api.Test;

import io.quarkus.test.junit.QuarkusTest;
import io.restassured.http.ContentType;

@QuarkusTest
public class UserResourceTest {

    @Test
    void validUserIsAccepted() {
        given()
            .contentType(ContentType.JSON)
            .body("""
                    {
                        "username": "duke",
                        "email": "duke@java.io",
                        "bio": "I love coffee"
                    }
                    """)
        .when()
            .post("/api/users")
        .then()
            .statusCode(200);
    }

    @Test
    void invisibleCharactersAreRejected() {
        given()
            .contentType(ContentType.JSON)
            .body("""
                    {
                        "username": "duke\\u200B",
                        "email": "evil@test.com",
                        "bio": "Normal bio"
                    }
                    """)
        .when()
            .post("/api/users")
        .then()
            .statusCode(400)
            .body(containsString("Username contains"));
    }

    @Test
    void bidiOverrideIsRejected() {
        given()
            .contentType(ContentType.JSON)
            .body("""
                    {
                        "username": "hacker",
                        "email": "hacker@test.com",
                        "bio": "file \\u202Etxt.exe"
                    }
                    """)
        .when()
            .post("/api/users")
        .then()
            .statusCode(400);
    }

    @Test
    void homographAttackIsRejected() {
        given()
            .contentType(ContentType.JSON)
            .body("""
                    {
                        "username": "p\\u0430ypal",
                        "email": "phish@test.com",
                        "bio": "normal"
                    }
                    """)
        .when()
            .post("/api/users")
        .then()
            .statusCode(400)
            .body(containsString("suspicious"));
    }
}
```

These tests prove that the defenses fail closed. Unsafe input does not silently pass through.
Run:

```shell
mvn test
```

to see the results.
Unicode-Aware Sanitization
We built a Unicode-aware validation and sanitization layer in Quarkus that addresses invisible characters, BiDi overrides, normalization issues, and high-risk homograph attacks. The system is explicit about what it guarantees and where manual decisions are required.
Unicode is not a corner case. It is a production input surface, and treating it as such prevents failures that are otherwise invisible until users exploit them.



