Counting Millions of Things with Kilobytes
A Hands-On Quarkus Tutorial for Scalable Unique Counting with HyperLogLog
How do you count millions of unique users, sessions, or IPs using just a few kilobytes of memory, without hitting an OutOfMemoryError under load?
In this tutorial, we’ll walk through how to use the HyperLogLog (HLL) data structure inside a Quarkus application to perform scalable, approximate cardinality estimation. We’ll go from a basic Java implementation to a real Quarkus microservice that ingests data from GitHub archives and estimates unique users.
Why HyperLogLog Beats HashSet at Scale
When you need to count unique identifiers in a high-throughput system, reaching for a HashSet seems obvious. But storing millions of UUIDs, IPs, or usernames in memory quickly becomes unsustainable: a HashSet with a million entries can easily consume tens to hundreds of megabytes once object headers, boxing, and hash-table overhead are included. Do this per service instance, and your JVM falls off a memory cliff.
Enter HyperLogLog, a probabilistic data structure that answers “how many unique items have I seen?” with near-constant memory, regardless of input size. You trade a small amount of accuracy (typically a few percent or less, depending on the precision you choose) for massive space savings. It’s the ideal fit for analytics, observability, and stream processing use cases.
How HyperLogLog Works (Without the Math Headache)
Here’s the core intuition:
Hash your inputs: Each input (e.g., a user ID) is hashed into a long binary string.
Count leading zeros: Rare hashes start with more zeros. The number of leading zeros gives a clue about cardinality.
Divide into buckets: Use the first few bits of the hash to choose a bucket. Store the highest zero count observed for each.
Harmonic mean: Combine all bucket estimates using a specialized average (harmonic mean) to compute the final count.
This gives you a small, fixed-size sketch that approximates how many unique elements you've seen. For example, using 1,024 buckets (p = 10) gives you roughly 3.3% standard error with just 1 KB of memory.
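The accuracy follows directly from the register count: the standard error of an HLL sketch with m = 2^p registers is about 1.04 / sqrt(m). A quick standalone snippet (plain Java, purely illustrative) makes the memory/error trade-off concrete:

public class HllErrorTable {
    public static void main(String[] args) {
        // Standard error of a HyperLogLog sketch is ~1.04 / sqrt(m), where m = 2^p registers
        // and each register occupies one byte in the implementation below.
        for (int p = 4; p <= 16; p += 2) {
            int m = 1 << p;
            double stdError = 1.04 / Math.sqrt(m);
            System.out.printf("p=%2d  registers=%6d  memory=%6d bytes  ~%.2f%% error%n",
                    p, m, m, stdError * 100);
        }
    }
}

Run it once and you can pick the smallest p that meets your error budget before writing any service code.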
Build Your Own HyperLogLog in Java
Let’s implement a simplified version in plain Java, or go straight to my GitHub repository and clone the source of the working project.
Define the Structure
package org.example;

public class HyperLogLog {

    private final int p;            // Precision
    private final int m;            // Number of registers, m = 2^p
    private final byte[] registers;
    private final double alphaM;

    public HyperLogLog(int p) {
        if (p < 4 || p > 16) {
            throw new IllegalArgumentException("Precision p must be between 4 and 16.");
        }
        this.p = p;
        this.m = 1 << p; // 2^p
        this.registers = new byte[m];
        this.alphaM = getAlpha(m);
    }

    // Hash function placeholder (in a real app, use Murmur3)
    private int hash(Object o) {
        return o.hashCode();
    }

    // Pre-calculated constant for bias correction
    private double getAlpha(int m) {
        switch (m) {
            case 16:
                return 0.673;
            case 32:
                return 0.697;
            case 64:
                return 0.709;
            default:
                return 0.7213 / (1 + 1.079 / m);
        }
    }

    public long getMemoryUsageBytes() {
        return this.m; // Each register is a byte
    }

    public void add(Object o) {
        int hash = hash(o);
        // 1. Determine the register index from the first 'p' bits
        int registerIndex = hash >>> (Integer.SIZE - p);
        // 2. Get the rest of the hash for counting leading zeros
        int valueForCounting = hash << p;
        // 3. Count leading zeros (+1 because we count from 1)
        byte leadingZeros = (byte) (Integer.numberOfLeadingZeros(valueForCounting) + 1);
        // 4. Update the register with the maximum value seen
        registers[registerIndex] = (byte) Math.max(registers[registerIndex], leadingZeros);
    }

    public double estimate() {
        double sum = 0.0;
        int zeroRegisters = 0;
        for (byte registerValue : registers) {
            if (registerValue == 0) {
                zeroRegisters++;
            }
            sum += 1.0 / (1 << registerValue); // sum of 2^(-R_j)
        }
        double rawEstimate = alphaM * m * m / sum;

        // Small and large range corrections (important for accuracy!)
        if (rawEstimate <= 2.5 * m) {
            if (zeroRegisters > 0) {
                // Small range correction
                return Math.round(m * Math.log((double) m / zeroRegisters));
            } else {
                return Math.round(rawEstimate);
            }
        } else if (rawEstimate > (1.0 / 30.0) * Math.pow(2, 32)) {
            // Large range correction
            return -Math.pow(2, 32) * Math.log(1.0 - rawEstimate / Math.pow(2, 32));
        } else {
            return Math.round(rawEstimate);
        }
    }
}
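The hash placeholder is the weakest part of this class: Object.hashCode() (and String.hashCode() in particular) is not uniformly distributed, which skews both the bucket choice and the leading-zero counts. If you don't want to pull in a Murmur3 library, one cheap option is to run the MurmurHash3 32-bit finalizer ("fmix32") over the hashCode as a bit mixer. A sketch of a drop-in replacement for the hash method above:

    // Mixes hashCode() with the MurmurHash3 fmix32 finalizer so the bits feeding
    // the register index and the leading-zero count are closer to uniform.
    private int hash(Object o) {
        int h = o.hashCode();
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

For production use, a proper 64-bit hash such as Murmur3 (available in common hashing libraries like Guava) is still the better choice.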
Integrate HyperLogLog with Quarkus
Now, let's use our HyperLogLog class in a practical Quarkus microservice. We'll create an endpoint that can directly process a downloaded data file from gharchive.org.
Create a new Quarkus project
mvn io.quarkus.platform:quarkus-maven-plugin:create \
-DprojectGroupId=org.example \
-DprojectArtifactId=hll-tutorial \
-DclassName="org.example.HllResource" \
-Dpath="/stats" \
-Dextensions="rest-jackson"
cd hll-tutorial
Move HyperLogLog.java to src/main/java/org/example/.
The Cardinality Service
This service is our stateful bean that holds the HLL sketches, completely decoupled from the data source.
package org.example;

import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class CardinalityService {

    // p=10 gives ~3.3% standard error and uses 1 KB of memory
    private final HyperLogLog dailyUsers = new HyperLogLog(10);

    // p=12 gives ~1.6% standard error and uses 4 KB of memory
    private final HyperLogLog weeklyRepos = new HyperLogLog(12);

    public void trackUser(String userId) {
        if (userId != null) {
            dailyUsers.add(userId);
        }
    }

    public void trackRepo(String repoName) {
        if (repoName != null) {
            weeklyRepos.add(repoName);
        }
    }

    public double getDailyUserEstimate() {
        return dailyUsers.estimate();
    }

    public double getWeeklyRepoEstimate() {
        return weeklyRepos.estimate();
    }

    public long getMemoryUsageBytes() {
        return dailyUsers.getMemoryUsageBytes() + weeklyRepos.getMemoryUsageBytes();
    }
}
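One caveat worth flagging: the simplified HyperLogLog above is not thread-safe, and an @ApplicationScoped bean can be called from many request threads at once. A minimal way to stay correct without changing the sketch itself is to synchronize the methods that touch it; for example, trackUser and its read counterpart could become:

    // Synchronizing prevents concurrent register updates from racing each other.
    // For very high throughput you would instead back the sketch with an
    // AtomicLongArray, or shard one sketch per thread and merge on read.
    public synchronized void trackUser(String userId) {
        if (userId != null) {
            dailyUsers.add(userId);
        }
    }

    public synchronized double getDailyUserEstimate() {
        return dailyUsers.estimate();
    }

This is only one option: since register updates are monotonic (they only ever grow), a lost update under a race merely nudges the estimate slightly low, so some systems accept the race entirely.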
Reading and Parsing the GitHub Archive Data
This component contains the logic for reading and parsing the gzipped archive file. It streams the file line by line without loading it all into memory; each line is a standalone JSON event, from which we pull the actor's login and the repository name.
package org.example;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;

@ApplicationScoped
public class DataIngestor {

    @Inject
    CardinalityService cardinalityService;

    @Inject
    ObjectMapper mapper;

    public long processGhArchiveFile(Path filePath) throws Exception {
        long linesProcessed = 0;
        try (
                FileInputStream fileInputStream = new FileInputStream(filePath.toFile());
                GZIPInputStream gzipInputStream = new GZIPInputStream(fileInputStream);
                // GH Archive files are UTF-8, so don't rely on the platform default charset
                InputStreamReader inputStreamReader = new InputStreamReader(gzipInputStream, StandardCharsets.UTF_8);
                BufferedReader bufferedReader = new BufferedReader(inputStreamReader)) {
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                JsonNode event = mapper.readTree(line);
                JsonNode actor = event.get("actor");
                if (actor != null && actor.has("login")) {
                    cardinalityService.trackUser(actor.get("login").asText());
                }
                JsonNode repo = event.get("repo");
                if (repo != null && repo.has("name")) {
                    cardinalityService.trackRepo(repo.get("name").asText());
                }
                linesProcessed++;
            }
        }
        return linesProcessed;
    }
}
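The hourly GH Archive files are normally clean newline-delimited JSON, but if you point this at arbitrary dumps it is worth guarding the parse so one malformed line does not abort the whole ingest. A hedged variant of the read loop above (same structure, bad lines are simply skipped and counted separately):

    String line;
    long skipped = 0;
    while ((line = bufferedReader.readLine()) != null) {
        try {
            JsonNode event = mapper.readTree(line);
            JsonNode actor = event.get("actor");
            if (actor != null && actor.has("login")) {
                cardinalityService.trackUser(actor.get("login").asText());
            }
            JsonNode repo = event.get("repo");
            if (repo != null && repo.has("name")) {
                cardinalityService.trackRepo(repo.get("name").asText());
            }
            linesProcessed++;
        } catch (com.fasterxml.jackson.core.JsonProcessingException e) {
            // Skip lines that are not valid JSON instead of failing the whole file
            skipped++;
        }
    }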
REST Endpoints
We'll add a new endpoint, POST /ingest, to the HllResource to kick off the file processing.
package org.example;

import java.nio.file.Paths;
import java.util.Map;

import jakarta.inject.Inject;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.POST;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.MediaType;
import jakarta.ws.rs.core.Response;

@Path("/")
public class HllResource {

    @Inject
    CardinalityService service;

    @Inject
    DataIngestor ingestor;

    @POST
    @Path("/ingest")
    @Produces(MediaType.APPLICATION_JSON)
    public Response ingest(Map<String, String> input) {
        try {
            long count = ingestor.processGhArchiveFile(Paths.get(input.get("path")));
            return Response.ok(Map.of("status", "ok", "lines", count)).build();
        } catch (Exception e) {
            return Response.serverError().entity(Map.of("error", e.getMessage())).build();
        }
    }

    @GET
    @Path("/stats/unique-users/today")
    public Response users() {
        return Response.ok(Map.of("estimate", service.getDailyUserEstimate())).build();
    }

    @GET
    @Path("/stats/unique-repos/weekly")
    public Response repos() {
        return Response.ok(Map.of("estimate", service.getWeeklyRepoEstimate())).build();
    }

    @GET
    @Path("/stats/memory-usage")
    public Response mem() {
        return Response.ok(Map.of("bytes", service.getMemoryUsageBytes())).build();
    }
}
Run and See the Magic
Download an hourly GitHub event file from https://www.gharchive.org/ into /tmp (or anywhere you like, as long as the path matches the ingest request below):
wget -P /tmp https://data.gharchive.org/2025-07-20-15.json.gz
(If you’re on macOS and need wget, run “brew install wget”.)
Start your app:
./mvnw quarkus:dev
Trigger ingestion:
curl -X POST http://localhost:8080/ingest \
-H "Content-Type: application/json" \
-d '{"path": "/tmp/2025-07-20-15.json.gz"}'
And you’ll get a result:
{
  "status": "ok",
  "lines": 157625
}
Query the estimate:
curl http://localhost:8080/stats/unique-users/today
curl http://localhost:8080/stats/memory-usage
Memory vs. Accuracy – The Trade-off
HashSet: ~55 MB for 300k users, 100% accuracy
HyperLogLog (p=12): 4 KB, ~98.4% accuracy
This is the trade-off: you sacrifice a tiny amount of accuracy for a 13,000x reduction in memory usage. For most "how many" questions in monitoring, analytics, and data streaming, this is a phenomenal deal.
You get near-instant, low-cost estimates without worrying about JVM pressure.
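If you would rather verify than trust these numbers, a tiny standalone harness using the HyperLogLog class from this tutorial will do it. Note that with the hashCode() placeholder the measured error can land somewhat above the theoretical ~1.6% for p=12, which is exactly why the fmix32 mixer shown earlier is worth having:

package org.example;

import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

public class AccuracyCheck {
    public static void main(String[] args) {
        HyperLogLog hll = new HyperLogLog(12); // p=12, 4 KB of registers
        Set<String> exact = new HashSet<>();   // exact but memory-hungry baseline
        for (int i = 0; i < 300_000; i++) {
            String id = UUID.randomUUID().toString();
            hll.add(id);
            exact.add(id);
        }
        double estimate = hll.estimate();
        double errorPct = 100.0 * Math.abs(estimate - exact.size()) / exact.size();
        System.out.printf("exact=%d  estimate=%.0f  error=%.2f%%%n",
                exact.size(), estimate, errorPct);
    }
}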
Where to Go Next
Now that you have a working probabilistic counter:
Try using Redis’s native PFADD/PFCOUNT support.
Merge sketches across distributed services for global stats (see the merge sketch below).
Serialize the sketch for durable state or historical analysis.
Build a dashboard in Quarkus + Qute showing estimate convergence.
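On the merging point: two HLL sketches built with the same precision can be combined by taking the register-wise maximum, and the merged sketch estimates the cardinality of the union of both input streams. A minimal method you could add to the HyperLogLog class from this tutorial (assuming both sides use the same p):

    // Merges another sketch into this one. Register-wise max preserves the
    // "largest rank seen" semantics, so the result behaves as if this sketch
    // had observed both input streams.
    public void merge(HyperLogLog other) {
        if (other.p != this.p) {
            throw new IllegalArgumentException("Can only merge sketches with the same precision.");
        }
        for (int i = 0; i < m; i++) {
            registers[i] = (byte) Math.max(registers[i], other.registers[i]);
        }
    }

Each service instance can then keep its own local sketch and periodically ship it to an aggregator that merges them into a global count.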
This is how you count millions with kilobytes. HyperLogLog isn't just a clever trick; it's a serious engineering tool.