Mastering Unicode in Java: Build World-Ready REST APIs with Quarkus
From hidden string pitfalls to emoji-safe endpoints, learn how to handle text correctly in modern Java applications.
Most Java developers have typed String name = "Hello"; more times than they can count. It works. No surprises. But the illusion of simplicity breaks the moment "こんにちは", "浩宇", or "😉" shows up in your system. Suddenly, that simple String reveals a universe of complexity. Bugs creep in. Data gets corrupted. Users complain their names aren’t stored correctly.
This tutorial will demystify Unicode and show you how to build robust, world-ready Java applications. We’ll cover the theory, expose the gotchas, and then build a “Global Greeting Service” with Quarkus that survives the chaos of real-world text.
By the end, you’ll understand:
Unicode fundamentals and how Java really stores text
Why string length and iteration are trickier than they look
How normalization prevents nasty mismatches
How to configure your REST service and database for safe Unicode handling
Let’s get started.
Unicode Fundamentals: The Bedrock of Modern Text
Unicode is often misunderstood. It’s not an encoding like UTF-8 or UTF-16. It’s a standard. A giant dictionary that assigns a unique number (a code point) to every character and emoji.
The letter “A” → U+0041
The winking face 😉 → U+1F609
Encodings (UTF-8, UTF-16) decide how to store these numbers as bytes. Java's String API works in UTF-16 code units, which introduces some subtle traps.
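To make the difference between a code point and its encoded bytes concrete, here is a minimal sketch that measures the same text in UTF-8 and UTF-16. The class and method names (EncodingDemo, utf8Bytes, utf16Bytes) are made up for illustration and are not part of the greeting service.

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    // Number of bytes needed to store the text as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes needed to store the text as UTF-16 (big-endian, no byte order mark).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String a = "A";                            // U+0041
        String wink = Character.toString(0x1F609); // U+1F609, the winking face

        System.out.println(utf8Bytes(a) + " / " + utf16Bytes(a));       // 1 / 2
        System.out.println(utf8Bytes(wink) + " / " + utf16Bytes(wink)); // 4 / 4
    }
}
```

Same code point, different byte counts: the number is fixed by Unicode, the storage cost depends on the encoding.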
Code Points vs. Code Units vs. Grapheme Clusters
Think of three levels:
Code point: The abstract number from Unicode (U+1F48B = 💋).
Code unit: How encodings represent those points in memory. UTF-16 uses 16-bit units, sometimes one, sometimes two.
Grapheme cluster: What humans see as “a single character.” Could be one code point or several combined (e.g., “e” + combining accent).
Example: "a🚀c"
Grapheme clusters: 3 (a, 🚀, c)
Code points: 3 (U+0061, U+1F680, U+0063)
UTF-16 code units: 4 (🚀 takes two units as a surrogate pair)
This explains why string.length() often lies to you.
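The three levels can be counted side by side. The sketch below uses only JDK classes; the helper names are hypothetical. Grapheme counting relies on java.text.BreakIterator, whose results for complex emoji sequences vary between JDK versions, so treat it as an approximation rather than a guarantee.

```java
import java.text.BreakIterator;

public class TextLengths {

    // UTF-16 code units: what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Unicode code points: a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // User-perceived characters, as far as BreakIterator can tell.
    static int graphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String rocket = "a" + Character.toString(0x1F680) + "c"; // a🚀c
        String accent = "e\u0301";                               // e + combining acute accent

        System.out.println(codeUnits(rocket) + " " + codePoints(rocket) + " " + graphemes(rocket)); // 4 3 3
        System.out.println(codeUnits(accent) + " " + codePoints(accent) + " " + graphemes(accent)); // 2 2 1
    }
}
```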
Normalization Matters
The same visual character can have multiple representations:
"é" = U+00E9 (precomposed)
"e" + "´" = U+0065 + U+0301 (decomposed)
Without normalization, "café" typed one way might not equal "café" typed the other way. Normalization (usually NFC) ensures consistent storage and comparison.
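A quick way to see the mismatch is to build both spellings with escapes and compare them before and after normalization. This is a standalone sketch; the class name and the nfc helper are made up for the example.

```java
import java.text.Normalizer;

public class NormalizationDemo {

    // Convert any input to Unicode Normalization Form C (composed).
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String precomposed = "caf\u00E9";  // é as a single code point
        String decomposed = "cafe\u0301";  // e followed by a combining acute accent

        System.out.println(precomposed.equals(decomposed));           // false
        System.out.println(nfc(precomposed).equals(nfc(decomposed))); // true
        System.out.println(nfc(decomposed).length());                 // 4
    }
}
```

Both strings render identically, yet only the normalized pair compares equal.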
Building the "Global Greeting Service" with Quarkus
Let's put theory into practice. We'll build a simple REST API that stores and retrieves greetings.
Project Setup
You'll need Java (17+), Maven, Podman, and a terminal.
Generate the Quarkus Project:
mvn io.quarkus.platform:quarkus-maven-plugin:create \
-DprojectGroupId=org.acme \
-DprojectArtifactId=unicode-greetings \
-DclassName="org.acme.GreetingResource" \
-Dpath="/greetings" \
-Dextensions="rest-jackson,quarkus-hibernate-orm-panache,quarkus-jdbc-postgresql"
cd unicode-greetings
Delete the generated tests under src/test. (I know 🙄.)
Create the Greeting Entity:
Rename MyEntity.java to src/main/java/org/acme/Greeting.java and replace its contents with the following:
package org.acme;

import io.quarkus.hibernate.orm.panache.PanacheEntity;
import jakarta.persistence.Entity;

@Entity
public class Greeting extends PanacheEntity {
    public String name;
    public String message;
}
Update the GreetingResource:
Replace the contents of src/main/java/org/acme/GreetingResource.java:
package org.acme;

import java.util.List;

import jakarta.transaction.Transactional;
import jakarta.ws.rs.Consumes;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.POST;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.MediaType;
import jakarta.ws.rs.core.Response;

@Path("/greetings")
@Produces(MediaType.APPLICATION_JSON)
@Consumes(MediaType.APPLICATION_JSON)
public class GreetingResource {

    @GET
    public List<Greeting> getAll() {
        return Greeting.listAll();
    }

    @POST
    @Transactional
    public Response add(Greeting greeting) {
        greeting.persist();
        return Response.status(Response.Status.CREATED).entity(greeting).build();
    }
}
Configure the Database:
Update src/main/resources/application.properties for a local PostgreSQL database.
# Database configuration
quarkus.datasource.db-kind=postgresql
# Makes the UTF-8 intent explicit. The PostgreSQL JDBC driver already talks UTF-8 to the server by default.
quarkus.datasource.jdbc.additional-jdbc-properties.charSet=UTF-8
# Drop and create the schema on startup for development
quarkus.hibernate-orm.schema-management.strategy=drop-and-create
Start the Application:
./mvnw quarkus:dev
Quarkus will automatically start a PostgreSQL container for you.
You now have a basic REST service. Let's start breaking it with Unicode.
Java-Specific Challenges: The Gotchas Appear
Our simple service works fine for ASCII. Now let's introduce a name with an emoji and see what happens.
The String.length() Lie
Let's add a "safety check" to our resource to prevent overly long names.
Modify the add method in GreetingResource.java:
// In GreetingResource.java
@POST
@Transactional
public Response add(Greeting greeting) {
    // A seemingly innocent validation check
    if (greeting.name != null && greeting.name.length() > 6) {
        return Response.status(Response.Status.BAD_REQUEST)
                .entity("{\"error\":\"Name cannot exceed 6 characters\"}")
                .build();
    }
    greeting.persist();
    return Response.status(Response.Status.CREATED).entity(greeting).build();
}
Now, try to post a greeting using curl:
curl -X POST http://localhost:8080/greetings \
-H "Content-Type: application/json" \
-d '{ "name": "Team 🚀", "message": "To the moon!" }'
Result: You get a 400 Bad Request!
{"error":"Name cannot exceed 6 characters"}
But "Team 🚀" looks like 6 characters. What gives? As we learned, the rocket emoji requires a surrogate pair in UTF-16. So greeting.name.length() returns 7 (T-e-a-m- -🚀[part1]-🚀[part2]), which is greater than 6. Oops.
The Fix: Use codePointCount() to count code points instead of code units. That matches the user's perception of "characters" far more closely, although a grapheme cluster built from several code points (such as "e" plus a combining accent) still counts as more than one.
// In GreetingResource.java
// ...
if (greeting.name != null && greeting.name.codePointCount(0, greeting.name.length()) > 6) {
// ...
Update the code and try the curl command again. Success! The greeting is created.
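If you want to pull the check out of the resource, it could look roughly like the sketch below. NameValidator and its constant are hypothetical names; the limit of 6 simply mirrors the one used above. The last lines of main show the remaining caveat: an emoji built from several code points joined by zero-width joiners still counts as more than one.

```java
public class NameValidator {

    // Mirrors the limit used in the tutorial's add() method.
    static final int MAX_NAME_CODE_POINTS = 6;

    // Length in Unicode code points rather than UTF-16 code units.
    static int codePointLength(String name) {
        return name.codePointCount(0, name.length());
    }

    static boolean isTooLong(String name) {
        return name != null && codePointLength(name) > MAX_NAME_CODE_POINTS;
    }

    public static void main(String[] args) {
        String team = "Team " + Character.toString(0x1F680); // Team 🚀
        System.out.println(team.length());          // 7
        System.out.println(codePointLength(team));  // 6
        System.out.println(isTooLong(team));        // false

        // Family emoji: man + ZWJ + woman + ZWJ + girl, shown as one glyph on most platforms.
        String family = new StringBuilder()
                .appendCodePoint(0x1F468).appendCodePoint(0x200D)
                .appendCodePoint(0x1F469).appendCodePoint(0x200D)
                .appendCodePoint(0x1F467)
                .toString();
        System.out.println(family.length());         // 8
        System.out.println(codePointLength(family)); // 5
    }
}
```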
Iterating Correctly
Another common mistake is iterating over a String's char array. Let's imagine we want to create a slug from a name by filtering characters.
// Don't do this! This is a demonstration of what NOT to do.
public static String createSlug(String input) {
    StringBuilder slug = new StringBuilder();
    for (char c : input.toCharArray()) {
        if (Character.isLetterOrDigit(c)) {
            slug.append(Character.toLowerCase(c));
        }
    }
    return slug.toString();
}

// In some test method:
System.out.println(createSlug("User-👍-Name"));
// Expected output: "username"
// Actual output: "username" -> It appears to work, but only because both halves of the emoji's surrogate pair happen to be dropped.
When the loop encounters the 👍 emoji, it processes each half of the surrogate pair separately. Character.isLetterOrDigit() returns false for both halves, so they are skipped. This might seem fine, but for other operations, you could end up with half an emoji, which is corrupt data.
The Fix: Use the codePoints() stream. This correctly presents each code point, regardless of whether it's one or two code units.
public static String createSlugProperly(String input) {
    StringBuilder slug = new StringBuilder();
    input.codePoints().forEach(codePoint -> {
        if (Character.isLetterOrDigit(codePoint)) {
            // appendCodePoint writes one or two chars as needed
            slug.appendCodePoint(Character.toLowerCase(codePoint));
        }
    });
    return slug.toString();
}
This version is Unicode-safe. It correctly handles any character from any language or emoji set.
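The "half an emoji" problem mentioned above is easiest to reproduce with truncation. The following sketch contrasts a naive substring cut with one based on offsetByCodePoints(). Both helper names are invented for this example; they are not part of the greeting service.

```java
public class SafeTruncate {

    // Naive: cuts by UTF-16 code units and may split a surrogate pair.
    static String truncateUnits(String s, int maxUnits) {
        return s.length() <= maxUnits ? s : s.substring(0, maxUnits);
    }

    // Safer: cuts after at most maxCodePoints whole code points.
    static String truncateCodePoints(String s, int maxCodePoints) {
        if (s.codePointCount(0, s.length()) <= maxCodePoints) {
            return s;
        }
        return s.substring(0, s.offsetByCodePoints(0, maxCodePoints));
    }

    public static void main(String[] args) {
        String text = "Hi" + Character.toString(0x1F44D) + "!"; // Hi👍!

        String broken = truncateUnits(text, 3);
        // The last char is a lone high surrogate: corrupt data.
        System.out.println(Character.isHighSurrogate(broken.charAt(2))); // true

        String intact = truncateCodePoints(text, 3);
        System.out.println(intact.length());                             // 4
        System.out.println(intact.codePointCount(0, intact.length()));   // 3
    }
}
```

Cutting by code points keeps surrogate pairs whole, though it can still split a grapheme cluster made of several code points.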
Web Development Pain Points
Now let's tackle problems that arise when our service interacts with other systems, like databases and clients.
The Normalization Search Problem
Add a new greeting:
curl -X POST http://localhost:8080/greetings \
-H "Content-Type: application/json" \
-d '{ "name": "José", "message": "Hola!" }'
The name is stored with the precomposed character é (U+00E9). Now, imagine a user with a keyboard that produces the letter e followed by a combining accent ´ searches for "José". Their search term is byte-for-byte different.
Let's add a search endpoint to see this fail.
In GreetingResource.java (add import jakarta.ws.rs.QueryParam; to the imports):
@GET
@Path("/search")
public Response search(@QueryParam("name") String name) {
    if (name == null) {
        return Response.ok(List.of()).build();
    }
    // This is a naive, direct comparison that will fail
    List<Greeting> results = Greeting.list("name", name);
    return Response.ok(results).build();
}
Now, try to search for "Jose".
curl "http://localhost:8080/greetings/search?name=Jose"
It obviously returns an empty result. But even if we could type the version with the combining accent, it would also fail.
The Fix: Normalize all strings to a consistent form. We'll use NFC.
Modify the add method: Normalize the name before saving it.
// In GreetingResource.java's add() method
import java.text.Normalizer;
// ...
greeting.name = Normalizer.normalize(greeting.name, Normalizer.Form.NFC);
// ... then the validation check and persist
Modify the search method: Normalize the search query before looking it up.
// In GreetingResource.java's search() method
@GET
@Path("/search")
public Response search(@QueryParam("name") String name) {
    if (name == null) {
        return Response.ok(List.of()).build();
    }
    String normalizedName = Normalizer.normalize(name, Normalizer.Form.NFC);
    // Compare against the NFC form, matching how names are normalized on save
    List<Greeting> results = Greeting.list("name", normalizedName);
    return Response.ok(results).build();
}
Now, regardless of how "José" is typed, it will be converted to the same canonical form. And the search? It can still fail. Why? Welcome to the world of character encoding.
The problem is URL encoding:
Stored in DB: José (bytes: [74, 111, 115, -61, -87])
Received from query: JosÃ© (bytes: [74, 111, 115, -61, -125, -62, -87])
The é character is effectively double-encoded on its way from curl to the resource. In UTF-8, é (U+00E9) is the two bytes C3 A9, which belong in the URL as %C3%A9. When those bytes are instead read one at a time as separate characters, they become Ã and ©, and that pair takes the four bytes shown above.
URL encoding (also called percent-encoding) converts special characters that aren't safe for URLs into a format that can be safely transmitted. For example, the letter "é" becomes "%C3%A9" because it's represented as two bytes (C3 A9 in hexadecimal) in UTF-8 encoding, and each byte is prefixed with a percent sign. This ensures that characters like spaces, accented letters, and symbols don't interfere with URL parsing or cause issues when transmitted across different systems that might handle character encoding differently.
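Both halves of this story can be reproduced in plain Java. The sketch below percent-encodes "José" with java.net.URLEncoder and then simulates the mix-up by decoding its UTF-8 bytes as ISO-8859-1. The class and method names are made up for illustration, and the Latin-1 decoding is an assumption about where the mangling happens, chosen because it reproduces the byte sequence shown above. Note that URLEncoder implements HTML form encoding, so it turns spaces into + rather than %20.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class UrlEncodingDemo {

    // Percent-encode a query parameter value as UTF-8.
    static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Simulate the bug: UTF-8 bytes read back as if they were ISO-8859-1.
    static String misdecodeAsLatin1(String value) {
        return new String(value.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String name = "Jos\u00E9"; // José

        System.out.println(encode(name)); // Jos%C3%A9

        String mangled = misdecodeAsLatin1(name); // JosÃ©
        System.out.println(Arrays.toString(mangled.getBytes(StandardCharsets.UTF_8)));
        // [74, 111, 115, -61, -125, -62, -87]
    }
}
```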
You can either fix the curl command:
curl "http://localhost:8080/greetings/search?name=Jos%C3%A9"
Or update the handling in the search method to be more permissive.
@GET
@Path("/search")
public Response search(@QueryParam("name") String name) {
    if (name == null) {
        return Response.ok(List.of()).build();
    }
    // Normalize the query the same way names are normalized on save
    String normalizedName = Normalizer.normalize(name, Normalizer.Form.NFC);
    // First attempt: case-insensitive exact match
    List<Greeting> results = Greeting.find("LOWER(name) = LOWER(?1)", normalizedName).list();
    // Fallback: a broad substring match if nothing was found
    if (results.isEmpty()) {
        results = Greeting.find("name LIKE ?1", "%" + normalizedName + "%").list();
    }
    return Response.ok(results).build();
}
Note: A truly user-friendly search would also be case-insensitive and might even strip accents (e.g., so "Jose" finds "José"). This often requires using like in the database query and a separate library for accent stripping, but normalization is the essential first step. We only use a very broad, permissive search as fallback here.
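For simple cases you may not need a separate library. One common approach, sketched here with JDK classes only, is to decompose the text with NFD and then remove the combining marks. The AccentFolding class and fold method are hypothetical names. This handles letters that decompose into a base plus an accent; characters without such a decomposition, like ø or ß, pass through unchanged.

```java
import java.text.Normalizer;
import java.util.Locale;

public class AccentFolding {

    // Produce a search key: decompose, drop combining marks, lower-case.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(fold("Jos\u00E9"));   // jose (precomposed é)
        System.out.println(fold("Jose\u0301"));  // jose (e + combining accent)
        System.out.println(fold("Jose"));        // jose
    }
}
```

Storing such a folded key in an extra column and comparing the folded query against it is one way to let "Jose" find "José".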
Sorting with Collator
Let's add a feature to get a sorted list of greetings.
In GreetingResource.java:
@GET
@Path("/sorted")
public List<Greeting> getSortedByName() {
    return Greeting.list("order by name");
}
Now add these three names to your service: "Zebra", "Ångström", "Aaron".
curl -X POST http://localhost:8080/greetings \
-H "Content-Type: application/json" \
-d '{ "name": "Zebra"}'
curl -X POST http://localhost:8080/greetings \
-H "Content-Type: application/json" \
-d '{ "name": "Ångström"}'
curl -X POST http://localhost:8080/greetings \
-H "Content-Type: application/json" \
-d '{ "name": "Aaron"}'
When you call
curl "http://localhost:8080/greetings/sorted"
you'll likely get this order:
Aaron
Ångström
Zebra
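This order comes from the database's default collation, which typically follows English-style rules and treats "Å" as an accented "A". (A pure byte-value sort would instead put "Å" after "Z".) Either way, the result ignores the reader's language: in Swedish, "Å" is the 27th letter of the alphabet, so "Ångström" should come after "Zebra".
The Fix: For language-sensitive sorting, use java.text.Collator. Because the database applies one fixed collation to every request, we sort in the Java application code.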
// In GreetingResource.java
import java.text.Collator;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

import jakarta.ws.rs.DefaultValue;
import jakarta.ws.rs.QueryParam;
// ...
@GET
@Path("/sorted")
public List<Greeting> getSortedByName(@QueryParam("locale") @DefaultValue("en-US") String localeTag) {
    List<Greeting> greetings = Greeting.listAll();
    Locale locale = Locale.forLanguageTag(localeTag);
    Collator collator = Collator.getInstance(locale);
    // PRIMARY strength compares base letters only, so case is ignored too
    collator.setStrength(Collator.PRIMARY);
    greetings.sort(Comparator.comparing(g -> g.name, collator));
    return greetings;
}
Now, if you call
curl "http://localhost:8080/greetings/sorted?locale=sv-SE"
with the Swedish locale, you will get the correct, culturally expected order:
Aaron
Zebra
Ångström
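You can check the collation behaviour without the REST layer. The sketch below sorts the same three names with an English and a Swedish Collator. CollatorDemo and sortFor are invented names, and the result assumes a standard JDK that ships the Swedish locale data (the jdk.localedata module); a trimmed-down runtime may fall back to default rules.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class CollatorDemo {

    // Return a copy of the names sorted by the rules of the given locale.
    static List<String> sortFor(String localeTag, List<String> names) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag(localeTag));
        collator.setStrength(Collator.PRIMARY);
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(collator);
        return sorted;
    }

    public static void main(String[] args) {
        List<String> names = List.of("Zebra", "\u00C5ngstr\u00F6m", "Aaron");

        System.out.println(sortFor("en-US", names)); // [Aaron, Ångström, Zebra]
        System.out.println(sortFor("sv-SE", names)); // [Aaron, Zebra, Ångström]
    }
}
```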
The Grand Finale: The Kiss Emoji 💋 Endpoint
Let's add a final, fun endpoint that serves as a practical test of our setup. This endpoint will append a kiss emoji to a greeting's message.
In GreetingResource.java:
// Requires: import jakarta.ws.rs.PathParam;
@POST
@Path("/{id}/kiss")
@Transactional
public Response addKiss(@PathParam("id") Long id) {
    Greeting greeting = Greeting.findById(id);
    if (greeting == null) {
        return Response.status(Response.Status.NOT_FOUND).build();
    }
    // U+1F48B is the code point for the kiss mark emoji 💋
    String kissEmoji = new String(Character.toChars(0x1F48B));
    greeting.message = greeting.message + " " + kissEmoji;
    greeting.persist();
    return Response.ok(greeting).build();
}
First, create a greeting to kiss:
curl -X POST http://localhost:8080/greetings \
-H "Content-Type: application/json; charset=utf-8" \
-d '{ "name": "浩宇", "message": "xoxo" }'
(Note the explicit charset=utf-8 in the header. This is a best practice!)
The response will show the newly created greeting with id: 1 (or some other number). Now, use that ID to send a virtual kiss:
curl -s -X POST "http://localhost:8080/greetings/1/kiss" | jq -r .
Result: You should get a perfect JSON response with the emoji correctly rendered.
{
"id": 1,
"name": "浩宇",
"message": "xoxo 💋"
}
This confirms that your entire stack, from the client, through the Quarkus REST layer, to the database, and back, is correctly configured to handle Unicode, including multi-byte characters and emoji.
Conclusion & Best Practices Checklist
Congratulations! You've built a Unicode-aware REST service and tackled some of the most common and frustrating bugs related to text handling in Java.
Keep these in mind for every Unicode-aware service:
Always set charset=utf-8 in Content-Type headers.
Configure databases and connections explicitly for UTF-8.
Use codePointCount() and codePoints() instead of length() and toCharArray().
Normalize user input before storing or comparing.
Use Collator for locale-aware sorting.
Assume Unicode everywhere. ASCII is no longer a safe baseline.
Text is global. Your code should be too.