Taming Unstructured Data: From PDFs to JSON with Quarkus and Docling
Build a fast, scalable converter to turn business documents into structured data using Quarkus and Docling — perfect for RAG pipelines, search indexing, and LLM prep.
I have updated this article, originally published July 21, 2025, to use the latest API and to verify that everything still works.
Make sure to grab the latest source from my GitHub repository.
Enterprise data rarely arrives in clean, structured formats. In the real world, valuable information is buried in PDFs, Word documents, ELAN files, transcribed field notes, or multilingual glossed text. If you’re building AI-infused applications, such as a chatbot that explains policy documents, a RAG pipeline that pulls context from user manuals, or a smart index for legal archives, you need a reliable, scalable way to convert these formats into structured, machine-consumable data.
In this hands-on tutorial, we’ll use Quarkus and the Docling extension to build a REST API that transforms unstructured documents into clean JSON, Markdown, HTML, or plain text. This data is ready for downstream use: embedding, chunking, vector search, or even linguistic analysis.
Docling and Docling Serve
The Docling Serve project is a lightweight server implementation for the Docling document transformation engine, designed to expose Docling’s powerful format conversion capabilities over a simple HTTP API. It acts as the backend service behind client libraries like the Quarkus Docling extension, enabling developers to convert complex linguistic, annotated, or unstructured document formats, such as ELAN (.eaf), Toolbox, DOCX, or PDF, into structured outputs like JSON, Markdown, or HTML. Docling Serve is ideal for embedding into NLP pipelines, AI backend services, or digital humanities tools where text segmentation, speaker identification, or gloss extraction are required. It runs as a stateless container and is optimized for easy integration and scalability.
Why This Problem Matters
Most enterprise AI projects don’t fail at the model layer. They fail in the messy middle: data preparation. Consider:
Business knowledge lives in PDFs. Contracts, datasheets, manuals, and policies are almost always stored as .pdf.
Word documents dominate collaboration. Internal playbooks, meeting notes, and feedback are often .docx.
Legacy and linguistic projects use ELAN, Toolbox, or FLEx. Parsing these formats reliably is non-trivial.
If you try to shove these formats directly into a vector DB or LLM pipeline, you’ll get garbage in, garbage out. You need semantic segmentation, metadata extraction, and structure preservation before anything else.
That’s what Docling offers. It understands structured annotations, interlinear glossed text, speaker metadata, and other linguistic features. It also handles common business formats like PDF and DOCX and emits clean, chunkable outputs.
Let’s build an application that wraps all of that in a fast Quarkus service.
Prerequisites
To follow along, make sure you have the following installed:
Java 17+
Maven
Podman (with a running Podman Machine)
An IDE (e.g., IntelliJ IDEA or VS Code)
No need to install Docling manually. Quarkus takes care of that for you when it starts the Dev Service. The complete project is in my GitHub repository if you prefer to start from there.
Bootstrap Your Quarkus Project
mvn io.quarkus.platform:quarkus-maven-plugin:create \
-DprojectGroupId=com.ibm.developer \
-DprojectArtifactId=quarkus-docling-converter \
-Dextensions="rest-jackson,quarkus-docling"
cd quarkus-docling-converter
This sets up your project with the Docling extension and Jackson-based JSON support.
Create a REST Endpoint
Starting with quarkus-docling 1.3.1, the extension ships a ready-to-inject DoclingService bean. This high-level service wraps the underlying Docling Serve REST client and provides convenience methods for the most common conversion scenarios — you no longer need to build raw API requests by hand. DoclingService is registered as a CDI bean by the extension automatically, so you inject it directly into your resource class.
Available output formats:
OutputFormat.TEXT, OutputFormat.MARKDOWN, OutputFormat.HTML, OutputFormat.JSON, OutputFormat.DOCTAGS
Rename GreetingResource.java to ConverterResource.java and replace the content with:
package com.ibm.developer;
import java.nio.file.Files;
import org.jboss.resteasy.reactive.RestForm;
import org.jboss.resteasy.reactive.multipart.FileUpload;
import ai.docling.core.DoclingDocument;
import ai.docling.serve.api.convert.request.options.OutputFormat;
import ai.docling.serve.api.convert.response.ConvertDocumentResponse;
import ai.docling.serve.api.convert.response.InBodyConvertDocumentResponse;
import io.quarkiverse.docling.runtime.client.DoclingService;
import jakarta.inject.Inject;
import jakarta.ws.rs.Consumes;
import jakarta.ws.rs.POST;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.MediaType;
import jakarta.ws.rs.core.Response;
@Path("/convert")
@Consumes(MediaType.MULTIPART_FORM_DATA)
@Produces(MediaType.APPLICATION_JSON)
public class ConverterResource {

    @Inject
    DoclingService doclingService;

    @POST
    public Response convert(@RestForm("file") FileUpload file) {
        if (file == null) {
            return Response.status(Response.Status.BAD_REQUEST)
                    .entity("Error: No file uploaded.").build();
        }
        try {
            byte[] fileBytes = Files.readAllBytes(file.uploadedFile());
            ConvertDocumentResponse result = doclingService.convertFromBytes(
                    fileBytes,
                    file.fileName(),
                    OutputFormat.JSON);
            if (!(result instanceof InBodyConvertDocumentResponse inBody)) {
                return Response.status(Response.Status.INTERNAL_SERVER_ERROR)
                        .entity("Unexpected response type: " + result.getResponseType()).build();
            }
            DoclingDocument document = inBody.getDocument().getJsonContent();
            return Response.ok(document).build();
        } catch (java.io.IOException e) {
            return Response.status(Response.Status.BAD_REQUEST)
                    .entity("Failed to read uploaded file: " + e.getMessage())
                    .build();
        }
    }
}
A few things worth noting in this implementation:
File Upload: accepts a file via multipart form data using @RestForm.
Validation: returns HTTP 400 immediately if no file is provided.
Conversion: requests OutputFormat.JSON from DoclingService, which instructs Docling Serve to populate the json_content field on the response with a fully structured DoclingDocument object.
Type check: casts to InBodyConvertDocumentResponse, the concrete subtype that carries the converted document.
Response: returns the DoclingDocument object directly; Jackson serialises it to JSON because the endpoint is annotated @Produces(APPLICATION_JSON).
Why the instanceof cast?
ConvertDocumentResponse is an abstract sealed class. The extension resolves it to one of three concrete subtypes at runtime depending on the configured response mode.
InBodyConvertDocumentResponse is the default — the response payload is embedded directly in the HTTP body. ZipArchiveConvertDocumentResponse is used when the server is configured to stream a ZIP archive. PreSignedUrlConvertDocumentResponse is used when the server returns a pre-signed S3 URL instead of the content itself.
With the default Dev Service and in-body target, you will always get InBodyConvertDocumentResponse. The explicit check makes the code resilient to misconfiguration and documents the assumption clearly.
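To make the reasoning behind the check concrete, here is a minimal sketch of the same pattern on a plain JDK 17+. The types ConversionResult, InBodyResult, ZipArchiveResult, and PreSignedUrlResult are hypothetical stand-ins that I made up for illustration; they only mirror the shape described above and are not the classes from ai.docling.serve.api.

```java
// Hypothetical stand-in types, NOT the real ai.docling classes.
// They only mirror the "abstract sealed class with three subtypes" shape.
class SealedResponseSketch {

    // Returns the embedded document, or fails loudly for any other delivery mode.
    static String extractDocument(ConversionResult result) {
        if (result instanceof InBodyResult inBody) {
            // Pattern matching binds the narrowed type, so no separate cast is needed.
            return inBody.document();
        }
        throw new IllegalStateException(
                "Unexpected response type: " + result.getClass().getSimpleName());
    }

    public static void main(String[] args) {
        System.out.println(extractDocument(new InBodyResult("{\"name\":\"sample\"}")));
        try {
            extractDocument(new ZipArchiveResult());
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}

// The sealed parent lists every permitted subtype explicitly.
abstract sealed class ConversionResult
        permits InBodyResult, ZipArchiveResult, PreSignedUrlResult {
}

// Only this subtype carries the converted content directly.
final class InBodyResult extends ConversionResult {
    private final String document;

    InBodyResult(String document) {
        this.document = document;
    }

    String document() {
        return document;
    }
}

final class ZipArchiveResult extends ConversionResult {
}

final class PreSignedUrlResult extends ConversionResult {
}
```

The point of the sketch is that the document accessor lives only on the in-body subtype, so the narrowing step is required by the type system rather than being a stylistic choice.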
Try It Out
Grab a sample PDF that is simple enough for your local Docling runtime but complex enough to show some of the power of Docling. I just grabbed a random PDF from redhat.com as an example.
Start Quarkus in dev mode (Docling Serve will be pulled and started automatically via Dev Services):
./mvnw quarkus:dev
Send the file to the endpoint using curl:
curl -F "file=@sample.pdf" \
http://localhost:8080/convert
You will see a structured JSON response representing the full DoclingDocument, complete with document body, text items, tables, pictures, page metadata, and provenance information. This is the native Docling format and the most complete representation of the converted document, making it ideal for downstream processing in RAG pipelines or vector stores.
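If you would rather call the endpoint from Java than from curl, the request can be sketched with the JDK's built-in java.net.http API. This is a minimal sketch under a few assumptions: the class and method names are mine, the file part is sent as application/octet-stream, and the example only builds the request without sending it, so it runs even when the service is down.

```java
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical helper class, not part of the tutorial project.
class MultipartUploadSketch {

    // Builds a multipart/form-data body with a single part named "file",
    // matching the @RestForm("file") parameter of the resource.
    static byte[] multipartBody(String boundary, String fileName, byte[] content) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        String partHeader = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"file\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/octet-stream\r\n\r\n";
        out.writeBytes(partHeader.getBytes(StandardCharsets.UTF_8));
        out.writeBytes(content);
        // The closing delimiter ends with two extra dashes.
        out.writeBytes(("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    // Creates the POST request; the boundary in the header must match the body.
    static HttpRequest uploadRequest(URI endpoint, String fileName, byte[] content) {
        String boundary = "sketch-" + UUID.randomUUID();
        return HttpRequest.newBuilder(endpoint)
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(
                        multipartBody(boundary, fileName, content)))
                .build();
    }

    public static void main(String[] args) {
        byte[] placeholder = "not a real PDF".getBytes(StandardCharsets.UTF_8);
        HttpRequest request = uploadRequest(
                URI.create("http://localhost:8080/convert"), "sample.pdf", placeholder);
        System.out.println(request.method() + " " + request.uri());
        // To send it: HttpClient.newHttpClient()
        //         .send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

In a real client you would read the file with Files.readAllBytes(...) instead of using placeholder bytes.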
Each OutputFormat value maps to a dedicated accessor on DocumentResponse. If you want a lighter output, swap the format constant and its matching accessor:
OutputFormat.TEXT → getTextContent(): plain text, no formatting
OutputFormat.MARKDOWN → getMarkdownContent(): Markdown with headings, lists, and tables
OutputFormat.HTML → getHtmlContent(): full HTML document
OutputFormat.JSON → getJsonContent(): structured DoclingDocument object (what this example uses)
OutputFormat.DOCTAGS → getDoctagsContent(): Docling’s own tag-based interchange format
Both changes are always required together: the format constant controls what Docling Serve generates server-side, and the accessor controls which response field you read back.
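One way to keep the two in sync is to store the accessor next to the format constant. The sketch below shows the idea on a plain JDK 17+ with hypothetical stand-in types (Format and ConvertedDocument are names I made up, not extension classes). It also assumes that fields for formats you did not request come back empty, which is modelled here as null.

```java
import java.util.function.Function;

// Hypothetical stand-ins, NOT the real OutputFormat or DocumentResponse types.
class FormatAccessorSketch {

    // One field per output format, loosely mirroring the response document.
    record ConvertedDocument(String text, String markdown, String html) {
    }

    // Each constant carries the accessor that reads its matching field,
    // so the format and the accessor cannot drift apart.
    enum Format {
        TEXT(ConvertedDocument::text),
        MARKDOWN(ConvertedDocument::markdown),
        HTML(ConvertedDocument::html);

        private final Function<ConvertedDocument, String> accessor;

        Format(Function<ConvertedDocument, String> accessor) {
            this.accessor = accessor;
        }

        String read(ConvertedDocument document) {
            return accessor.apply(document);
        }
    }

    public static void main(String[] args) {
        // Simulates a conversion that was asked for MARKDOWN only.
        ConvertedDocument document = new ConvertedDocument(null, "# Title", null);
        System.out.println(Format.MARKDOWN.read(document)); // matching accessor
        System.out.println(Format.TEXT.read(document));     // mismatched accessor
    }
}
```

Reading with the mismatched accessor yields nothing useful in this model, which is the failure mode to watch for when you change only the format constant.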
Migrating from the Original API (Jul 12, 2025)
If you are upgrading an existing project from quarkus-docling ≤ 1.1.0 to 1.3.1+, the extension’s Java API has undergone a complete breaking restructure. All model and client classes moved from the Quarkiverse namespace to the upstream ai.docling library, and the injectable entry point changed entirely.
What changed
Seven things moved or were renamed between the old API and the new one.
Injectable client. The CDI bean you inject changed from io.quarkiverse.docling.runtime.client.api.DoclingApi to io.quarkiverse.docling.runtime.client.DoclingService.
Conversion request. The hand-constructed io.quarkiverse.docling.runtime.client.model.ConversionRequest is gone. DoclingService builds the underlying ai.docling.serve.api.convert.request.ConvertDocumentRequest for you internally.
Output format enum. io.quarkiverse.docling.runtime.client.model.OutputFormat moved to ai.docling.serve.api.convert.request.options.OutputFormat.
Response type. io.quarkiverse.docling.runtime.client.model.ConvertDocumentResponse is replaced by the abstract ai.docling.serve.api.convert.response.ConvertDocumentResponse, which must be cast to its concrete subtype InBodyConvertDocumentResponse to access the document content.
File source model. io.quarkiverse.docling.runtime.client.model.FileSource no longer exists as a public type. DoclingService.convertFromBase64(...) handles base64-encoded file input internally.
HTTP source model. io.quarkiverse.docling.runtime.client.model.HttpSource is similarly eliminated. Use DoclingService.convertFromUri(...) instead.
API method. The low-level doclingApi.processUrlV1alphaConvertSourcePost(request) is replaced by the purpose-named convenience methods doclingService.convertFromUri(...), doclingService.convertFromBytes(...), and doclingService.convertFromBase64(...).
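The convertFromBase64(...) variant takes the file content as a Base64 string. Producing that string needs nothing beyond the JDK, as this minimal sketch shows. The class and method names are placeholders of mine, and the example deliberately stops before the DoclingService call so that it runs without the extension on the classpath.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper class, not part of the extension.
class Base64InputSketch {

    // Encodes raw file bytes with the standard (non-URL) Base64 alphabet.
    static String toBase64(byte[] fileBytes) {
        return Base64.getEncoder().encodeToString(fileBytes);
    }

    public static void main(String[] args) {
        // Placeholder content; a real caller would use Files.readAllBytes(path).
        byte[] fileBytes = "Hello, Docling".getBytes(StandardCharsets.UTF_8);
        String base64Content = toBase64(fileBytes);
        System.out.println(base64Content);

        // Decoding again shows that the encoding is lossless.
        byte[] decoded = Base64.getDecoder().decode(base64Content);
        System.out.println(new String(decoded, StandardCharsets.UTF_8));
    }
}
```

If you already hold the raw bytes, convertFromBytes(...) saves you this step, which is what the resource in this tutorial does.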
Step-by-step migration
1. Update the dependency version
<!-- pom.xml -->
<dependency>
<groupId>io.quarkiverse.docling</groupId>
<artifactId>quarkus-docling</artifactId>
<version>1.3.1</version>
</dependency>
Note: quarkus-docling 1.3.1 requires Quarkus 3.33 or later. Update your quarkus.platform.version in pom.xml accordingly.
2. Remove the passthrough wrapper class and inject DoclingService directly
Earlier versions of this tutorial introduced a Docling.java wrapper that delegated every call straight to DoclingService without adding any logic. That indirection has been removed. If your project has a similar wrapper that does nothing beyond forwarding calls, remove it and inject DoclingService directly:
// Before — unnecessary wrapper
@Inject
Docling docling;
// After — inject the extension's bean directly
@Inject
DoclingService doclingService;
3. Remove hand-built request objects
The old API required you to construct ConversionRequest, ConvertDocumentsOptions, FileSource, and HttpSource manually. All of that is now encapsulated inside DoclingService:
// Before — verbose manual request building
FileSource fileSource = new FileSource();
fileSource.base64String(base64Content);
fileSource.setFilename(filename);
ConversionRequest conversionRequest = new ConversionRequest();
conversionRequest.addFileSourcesItem(fileSource);
ConvertDocumentsOptions opts = new ConvertDocumentsOptions();
opts.setToFormats(List.of(outputFormat));
conversionRequest.options(opts);
return doclingApi.processUrlV1alphaConvertSourcePost(conversionRequest);
// After — one line
return doclingService.convertFromBase64(base64Content, filename, outputFormat);
4. Update response handling
ConvertDocumentResponse is now an abstract sealed class. Calling .getDocument() directly no longer compiles. Use a pattern-match cast to InBodyConvertDocumentResponse:
// Before
String text = result.getDocument().getTextContent();
// After
if (result instanceof InBodyConvertDocumentResponse inBody) {
String text = inBody.getDocument().getTextContent();
}
5. Update all imports
Replace every import from io.quarkiverse.docling.runtime.client.model.* and io.quarkiverse.docling.runtime.client.api.* with the corresponding class from ai.docling.serve.api.*, and import DoclingService from io.quarkiverse.docling.runtime.client:
// Remove these
import io.quarkiverse.docling.runtime.client.api.DoclingApi;
import io.quarkiverse.docling.runtime.client.model.ConversionRequest;
import io.quarkiverse.docling.runtime.client.model.ConvertDocumentResponse;
import io.quarkiverse.docling.runtime.client.model.ConvertDocumentsOptions;
import io.quarkiverse.docling.runtime.client.model.FileSource;
import io.quarkiverse.docling.runtime.client.model.HttpSource;
import io.quarkiverse.docling.runtime.client.model.OutputFormat;
// Add these
import ai.docling.serve.api.convert.request.options.OutputFormat;
import ai.docling.serve.api.convert.response.ConvertDocumentResponse;
import ai.docling.serve.api.convert.response.InBodyConvertDocumentResponse;
import io.quarkiverse.docling.runtime.client.DoclingService;
Why the API changed
The quarkus-docling extension was originally a thin OpenAPI-generated wrapper around Docling Serve’s HTTP endpoints. As Docling Serve matured, the upstream project extracted its type system into a standalone Java library (ai.docling:docling-serve-api) that is now shared between the Quarkus extension, a Spring Boot extension, and other clients. The Quarkus extension now exposes the upstream types directly, ensuring feature parity and reducing the maintenance overhead of a generated shim.
Going Further
There’s a lot more that can be done. For now, I will leave you here to take this further with your own experiments. Keep in mind that this is just the very beginning of a Docling integration with Quarkus. The eventual goal is to unify the DoclingDocument format with LangChain4j’s Document abstraction so that Docling can be used in a LangChain4j RAG pipeline for ingesting data.
What you could do today, if you like:
Add a /formats endpoint to expose available input/output formats
Support bulk conversion from ZIP files
Add integration with LangChain4j to process the output directly
Store converted chunks in a vector DB like Weaviate or pgvector
With Docling and Quarkus, you now have a scalable, flexible foundation for turning unstructured documents into structured inputs for AI. Your models are only as good as the data they see: Make that data count.




Hi Markus, thanks for the great blog post! When I try to convert any document to JSON, I consistently get this error:
Caused by: com.fasterxml.jackson.databind.JsonMappingException: Numeric value (956461376471083675) out of range of int (-2147483648 - 2147483647)
at [Source: REDACTED (`StreamReadFeature.INCLUDE_SOURCE_IN_LOCATION` disabled); line: 1, column: 260] (through reference chain: io.quarkiverse.docling.runtime.client.model.ConvertDocumentResponse["document"]->io.quarkiverse.docling.runtime.client.model.DocumentResponse["json_content"]->io.quarkiverse.docling.runtime.client.model.DoclingDocument["origin"]->io.quarkiverse.docling.runtime.client.model.DocumentOrigin["binary_hash"])
DocumentOrigin.binaryHash is an Integer but the hash doesn't fit into Integer. Can we change the type of binaryHash to Long?
converting documents to JSON and other formats using the docling-ui (via quarkus dev-ui) works as expected.
I tried with jdk24 and get the following error.
curl -F "file=@sample.pdf" http://localhost:8080/convert
500 - Internal Server Error
---------------------------
Details:
Error id b3c5c84f-3826-487b-9a3e-08dd219d0910-3, org.jboss.resteasy.reactive.ClientWebApplicationException: Received: 'Not Found, status code 404' when invoking REST Client method: 'io.quarkiverse.docling.runtime.client.api.DoclingApi#processUrlV1alphaConvertSourcePost'
Stack:
org.jboss.resteasy.reactive.ClientWebApplicationException: Received: 'Not Found, status code 404' when invoking REST Client method: 'io.quarkiverse.docling.runtime.client.api.DoclingApi#processUrlV1alphaConvertSourcePost'
at org.jboss.resteasy.reactive.client.impl.RestClientRequestContext.unwrapException(RestClientRequestContext.java:205)
at org.jboss.resteasy.reactive.common.core.AbstractResteasyReactiveContext.handleException(AbstractResteasyReactiveContext.java:329)
at org.jboss.resteasy.reactive.common.core.AbstractResteasyReactiveContext.run(AbstractResteasyReactiveContext.java:175)
at io.smallrye.context.impl.wrappers.SlowContextualRunnable.run(SlowContextualRunnable.java:19)
at org.jboss.resteasy.reactive.client.handlers.ClientSwitchToRequestContextRestHandler$1$1.handle(ClientSwitchToRequestContextRestHandler.java:38)
at org.jboss.resteasy.reactive.client.handlers.ClientSwitchToRequestContextRestHandler$1$1.handle(ClientSwitchToRequestContextRestHandler.java:35)
at io.vertx.core.impl.ContextInternal.dispatch(ContextInternal.java:270)
at io.vertx.core.impl.ContextInternal.dispatch(ContextInternal.java:252)
at io.vertx.core.impl.ContextInternal.lambda$runOnContext$0(ContextInternal.java:50)
at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:566)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:1447)
Caused by: jakarta.ws.rs.WebApplicationException: Not Found, status code 404
at io.quarkus.rest.client.reactive.runtime.DefaultMicroprofileRestClientExceptionMapper.toThrowable(DefaultMicroprofileRestClientExceptionMapper.java:19)
at io.quarkus.rest.client.reactive.runtime.MicroProfileRestClientResponseFilter.filter(MicroProfileRestClientResponseFilter.java:54)
at org.jboss.resteasy.reactive.client.handlers.ClientResponseFilterRestHandler.handle(ClientResponseFilterRestHandler.java:21)
at org.jboss.resteasy.reactive.client.handlers.ClientResponseFilterRestHandler.handle(ClientResponseFilterRestHandler.java:10)
at org.jboss.resteasy.reactive.common.core.AbstractResteasyReactiveContext.invokeHandler(AbstractResteasyReactiveContext.java:231)
at org.jboss.resteasy.reactive.common.core.AbstractResteasyReactiveContext.run(AbstractResteasyReactiveContext.java:147)
... 14 more