Most Java teams do not need to switch to Python to integrate AI into their applications. If your stack already includes Spring Boot, Quarkus, or Micronaut, you have a clear path to add AI capabilities without rewriting your services. The ecosystem has matured to the point where you can treat model integration as just another backend dependency.
The real challenge is not choosing a model. The real challenge is making AI behave like a reliable service: handling latency, managing failures, measuring cost, and integrating with your existing security and observability stack. That is where LangChain4j with Quarkus provides value.
This article covers how to integrate LangChain4j into Quarkus applications, what the 1.12.1 release adds for enterprise teams, and practical patterns for production workloads.
The Integration Problem Most Teams Face
When teams add AI to Java applications, they typically start with a straightforward HTTP client that calls an OpenAI or Anthropic endpoint. This approach works for prototypes. It breaks in production.
The problems appear in three areas:
First, orchestration complexity. A single model call is easy. Multiple model calls, retrieval pipelines, tool invocations, and fallback chains require structured code patterns. Ad hoc HTTP clients become unmaintainable quickly.
Second, streaming behavior. LLM inference is slow compared to typical REST calls. A simple synchronous request blocks threads while the model generates hundreds of tokens. Under load, this pattern exhausts connection pools and degrades entire services.
Third, observability and control. You need token usage metrics, latency tracking, cost attribution, and the ability to cancel in-progress requests. Raw HTTP clients lack these capabilities.
LangChain4j solves the orchestration layer. Quarkus provides the runtime. Together, they give you a production-ready AI service architecture without leaving the Java ecosystem.
LangChain4j 1.12.1: What Changed for Production Teams
The LangChain4j 1.12.1 release, published in March 2026, includes several features that matter for production deployments.
Hibernate EmbeddingStore Integration
The most significant addition for teams using Hibernate ORM is the HibernateEmbeddingStore. This integration allows you to store and retrieve vector embeddings directly in your existing database, using the same Hibernate session and transaction management you already use for relational data.
According to the release notes, the integration targets Hibernate ORM 7.1 and 7.2. This matters because you do not need to adopt a dedicated vector database to start with retrieval-augmented generation. You can persist embeddings in your existing PostgreSQL or MySQL database using the hibernate-vector module.
The practical implication is straightforward: teams with established Hibernate mappings can add embedding storage without new infrastructure dependencies. Embedding tables follow the same patterns as entity tables. Querying uses the same Session API.
Observability Through Micrometer
The release adds Micrometer integration for metrics collection. The MicrometerChatModelListener provides counters and timers for latency tracking at the model level.
For teams running Quarkus, this is valuable because Quarkus has built-in Micrometer support. You can emit AI metrics to the same observability pipeline that handles HTTP requests, database queries, and cache hits. Token counts, request duration, and error rates become visible in dashboards without custom instrumentation.
The metrics layer includes counters for successful requests, timers for latency measurement, and gauges for active streaming connections. This gives you the visibility needed to set up alerts for anomalous latency or error spikes.
Agent Support and Tool Search
The 1.12.1 release also introduces agent skills and tool search. These features support more complex agentic patterns where a model can discover and select tools dynamically rather than having tools hard-coded.
Tool search allows an agent to find relevant tools based on the user query, reducing the context window consumed by tool descriptions. Agent skills provide a way to define reusable capability groups that can be attached to agents.
For most teams, these features matter only if you are building multi-agent systems or autonomous tool selection. For simpler chat and retrieval use cases, they are optional.
Quarkus Integration: The Practical Setup
The Quarkus extension for LangChain4j uses the OpenAI-compatible interface. This means you can connect to OpenAI, Azure OpenAI, Ollama, or any provider that implements the OpenAI API specification. Anthropic also exposes an OpenAI-compatible endpoint, but Anthropic’s own documentation classifies it as intended for testing and evaluation rather than production workloads. For production use with Anthropic models, prefer the dedicated quarkus-langchain4j-anthropic extension, which maps to the native Anthropic API.
A basic setup involves three dependencies:
<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-openai</artifactId>
</dependency>
<dependency>
    <groupId>io.quarkus</groupId>
    <artifactId>quarkus-rest</artifactId>
</dependency>
<dependency>
    <groupId>io.quarkus</groupId>
    <artifactId>quarkus-micrometer</artifactId>
</dependency>
The quarkus-langchain4j-openai extension is part of the Quarkiverse ecosystem (io.quarkiverse.langchain4j groupId), maintained separately from the Quarkus core. The quarkus-rest and quarkus-micrometer extensions belong to the Quarkus core (io.quarkus groupId). This distinction matters: mixing groupIds incorrectly in your POM will cause build failures.
The extension auto-configures a ChatModel or StreamingChatModel bean based on your application properties. Configuration in application.properties:
quarkus.langchain4j.openai.api-key=${OPENAI_API_KEY}
quarkus.langchain4j.openai.base-url=https://api.openai.com/v1
quarkus.langchain4j.openai.chat-model.model-name=gpt-4o
quarkus.langchain4j.openai.log-requests=true
quarkus.langchain4j.openai.log-responses=true
For local models, swap the base URL. If you use Ollama to run models locally, the endpoint becomes:
quarkus.langchain4j.openai.base-url=http://localhost:11434/v1
quarkus.langchain4j.openai.api-key=ollama
Ollama does not validate API keys. The ollama value is the canonical convention used throughout Ollama’s documentation and satisfies the client initialization without requiring a real key.
Streaming: Why It Matters and How to Implement
Synchronous chat calls block the request thread until the model finishes generating. For small models on fast hardware, this might be acceptable. For larger models, especially when running locally or calling cloud APIs with cold starts, a single request can take ten seconds or more.
Streaming changes the interaction model. Instead of waiting for the complete response, the application receives tokens as they are generated. The user sees output immediately. The perceived latency drops dramatically even if the total generation time remains the same.
LangChain4j provides StreamingChatModel and StreamingChatResponseHandler for this pattern. The handler interface receives callbacks for each token, thinking chunk, tool call, and completion event.
In Quarkus, the idiomatic reactive primitive is Mutiny’s Multi<String>, not Project Reactor’s Flux. The following example bridges LangChain4j’s callback-based handler with Mutiny’s emitter API, which integrates directly with Quarkus REST Server-Sent Events:
import dev.langchain4j.model.chat.StreamingChatModel;
import dev.langchain4j.model.chat.response.ChatResponse;
import dev.langchain4j.model.chat.response.StreamingChatResponseHandler;
import io.smallrye.mutiny.Multi;
import jakarta.inject.Inject;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.QueryParam;
import jakarta.ws.rs.core.MediaType;

@Path("/chat")
public class ChatResource {

    @Inject
    StreamingChatModel streamingChatModel;

    @GET
    @Path("/stream")
    @Produces(MediaType.SERVER_SENT_EVENTS)
    public Multi<String> streamChat(@QueryParam("message") String userMessage) {
        return Multi.createFrom().emitter(emitter ->
            streamingChatModel.chat(userMessage, new StreamingChatResponseHandler() {

                @Override
                public void onPartialResponse(String partialResponse) {
                    // Forward each generated token to the SSE stream
                    emitter.emit(partialResponse);
                }

                @Override
                public void onCompleteResponse(ChatResponse completeResponse) {
                    emitter.complete();
                }

                @Override
                public void onError(Throwable error) {
                    emitter.fail(error);
                }
            })
        );
    }
}
This pattern uses Mutiny’s emitter bridge. The Multi.createFrom().emitter() factory subscribes lazily: the streaming call to the model only starts when a downstream subscriber connects. The endpoint declares @Produces(MediaType.SERVER_SENT_EVENTS), which is the JAX-RS constant for SSE in Quarkus REST. Quarkus handles backpressure and connection lifecycle automatically for Multi-returning endpoints.
If you prefer Project Reactor’s Flux — for example, if you are sharing streaming logic with a Spring module — LangChain4j provides a langchain4j-reactor module (dev.langchain4j:langchain4j-reactor) that makes Flux available as a return type from AI Services. However, this is not the default Quarkus model and requires the additional dependency.
Running Local Models with Ollama
For development and testing, running models locally avoids API costs and network latency. Ollama simplifies this by providing a local model server with an OpenAI-compatible API out of the box.
After installing Ollama, pull and run a model:
ollama run llama3.2
This command downloads the model and starts the server on http://localhost:11434. The OpenAI-compatible endpoint is available at http://localhost:11434/v1. Your Quarkus application points to this endpoint without code changes.
To switch models:
ollama pull gemma3
ollama run gemma3
Ollama supports a wide model library. You can list available local models with ollama list and pull any model from the Ollama registry with ollama pull <model-name>.
On macOS, Ollama uses Metal for GPU acceleration automatically when running natively. No additional configuration is needed. On Linux with NVIDIA GPUs, Ollama detects CUDA automatically. On CPU-only environments, inference is significantly slower — model choice matters more in that context, so prefer smaller quantized models such as llama3.2:3b or qwen2.5:3b for a usable development experience.
Native Ollama Extension
The approach above uses quarkus-langchain4j-openai with a custom base URL, which works because Ollama exposes an OpenAI-compatible endpoint. For tighter integration, the Quarkiverse ecosystem also provides a dedicated extension:
<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-ollama</artifactId>
</dependency>
This extension uses Ollama’s native API rather than the OpenAI-compatible subset, exposes Ollama-specific configuration under quarkus.langchain4j.ollama.*, and integrates with Quarkus Dev Services — meaning Quarkus can automatically start an Ollama container during quarkus dev without any manual setup. If Ollama is your primary local inference target rather than a drop-in replacement for an OpenAI-compatible provider, the dedicated extension is the better fit.
Production Patterns: Resilience and Cost Control
Streaming does not solve all production problems. You still need timeout limits, failure handling, and cost visibility.
Timeouts and Cancellation
LangChain4j supports mid-stream cancellation through PartialResponseContext. When you need to stop generation (for example, if the user clicks a stop button or if your application detects policy violations):
@Override
public void onPartialResponse(PartialResponse partialResponse, PartialResponseContext context) {
    if (shouldCancel()) {
        context.streamingHandle().cancel();
    }
}
Calling cancel() closes the connection to the model provider and stops the stream cleanly.
Prompt Length Limits
Models have fixed context windows. Long prompts consume tokens and increase latency. Production services should validate prompt length before sending it to the model.
A practical approach is to set a character or token limit at the API layer and reject requests that exceed it. This prevents a single large prompt from degrading service for other users.
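As a concrete illustration of this guard, here is a minimal, self-contained sketch. The class name, the character limit, and the rough four-characters-per-token heuristic are all illustrative assumptions, not part of LangChain4j or Quarkus; swap in a real tokenizer library if you need exact counts.

```java
// Hypothetical pre-flight prompt guard (names and limits are illustrative).
class PromptGuard {

    // Tune this per model context window and pricing budget.
    static final int MAX_CHARS = 8_000;

    // Reject null, blank, or oversized prompts before calling the model.
    static boolean isAcceptable(String prompt) {
        return prompt != null && !prompt.isBlank() && prompt.length() <= MAX_CHARS;
    }

    // Rough estimate: ~4 characters per token for English text.
    static int estimatedTokens(String prompt) {
        return (int) Math.ceil(prompt.length() / 4.0);
    }
}
```

In a REST endpoint, this check runs before the model call, returning a 4xx response for oversized prompts instead of spending tokens on them.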
Token Usage Tracking
Token counts are only available in onCompleteResponse. The ChatResponse.metadata().tokenUsage() accessor returns input tokens, output tokens, and total tokens for the request.
For cost attribution, log these values with timestamps and user context. Aggregating them in your observability system gives you per-user or per-endpoint cost visibility.
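The aggregation side can be sketched with a small in-memory ledger. This is illustrative only: the class and method names are hypothetical, and in production you would emit these values to Micrometer or your logging pipeline rather than hold them in process memory.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical in-memory token ledger for per-user cost attribution.
class TokenLedger {

    private final Map<String, LongAdder> totals = new ConcurrentHashMap<>();

    // Called from onCompleteResponse with the reported token usage.
    void record(String userId, int inputTokens, int outputTokens) {
        totals.computeIfAbsent(userId, k -> new LongAdder())
              .add((long) inputTokens + outputTokens);
    }

    long totalFor(String userId) {
        LongAdder adder = totals.get(userId);
        return adder == null ? 0L : adder.sum();
    }
}
```

LongAdder keeps the recording path cheap under concurrent streaming completions, which is why it is used here instead of a plain AtomicLong.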
Model Fallback
If your application uses multiple providers, consider a fallback chain. If OpenAI returns an error, route to Anthropic. If both fail, return a cached response or a graceful degradation message.
This pattern requires provider-agnostic prompt templates and abstraction at the model client level. LangChain4j’s model interfaces support this by abstracting provider-specific APIs into common patterns.
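The control flow of such a chain can be sketched without any provider SDK. In this simplified example, plain functions stand in for model clients so the fallback logic is visible; in a real service each entry would wrap a LangChain4j ChatModel, and the class and method names here are hypothetical.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical fallback chain over provider-agnostic functions.
class FallbackChat {

    private final List<Function<String, String>> providers;
    private final String degradedMessage;

    FallbackChat(List<Function<String, String>> providers, String degradedMessage) {
        this.providers = providers;
        this.degradedMessage = degradedMessage;
    }

    // Try each provider in order; fall through on failure.
    String chat(String prompt) {
        for (Function<String, String> provider : providers) {
            try {
                return provider.apply(prompt);
            } catch (RuntimeException e) {
                // In production: log the failure and record a metric here.
            }
        }
        // All providers failed: graceful degradation instead of a 500.
        return degradedMessage;
    }
}
```

A production version would add per-provider timeouts and a circuit breaker (for example via SmallRye Fault Tolerance) so a slow provider fails fast instead of delaying the fallback.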
Quarkus 3.31 and 3.32: AI-Relevant Improvements
Recent Quarkus releases include features that benefit AI workloads. The items below are organized by the version that introduced them.
Quarkus 3.31:
- OIDC PAR (Pushed Authorization Requests): Support for OAuth 2.0 PAR improves security during the authorization request phase. Relevant for AI services exposed to external clients under compliance requirements.
- TLS trust-store directory support: You can now point quarkus.tls.trust-store.pem.certs to a directory rather than a single file. In Kubernetes environments, certificate rotation becomes simpler because you do not need to rebuild keystores manually.
Quarkus 3.32:
- Reflection-free Jackson serializers: Enabled by default, this reduces serialization overhead and improves native compatibility. For high-throughput AI endpoints that return JSON, this matters.
- AOT-jar packaging: A new packaging type that pushes more initialization to build time, improving cold starts without going fully native. This is useful for containerized workloads where native compilation is not practical.
- OIDC DPoP nonce providers: Custom DPoP (Demonstrating Proof of Possession) nonce providers close compliance gaps for applications requiring strong proof-of-possession semantics.
These improvements are incremental but meaningful for teams running AI services at scale.
Conclusion
At the end of the day, the goal is not just to integrate a model. The goal is to bring AI into your Java service in a way that works in the real world, with observability, resilience, cost control, and an architecture your team can sustain over time. That is where LangChain4j with Quarkus becomes valuable, not as hype, but as an engineering choice.
A polished demo is easy to admire. A reliable production service is what actually matters. Real AI strategy is not about calling a model. It is about fitting AI into the stack your company already uses, operating it securely, measuring what matters, and evolving it without rebuilding everything from scratch. That is what separates an experiment from a solution.
And if you are a Java developer, do not count yourself out. Your background in backend systems, architecture, performance, cloud, and production software already gives you an advantage. Now the move is simple: build, test, measure, and share what you learn. Because it is not enough to know. People need to know that you know. The developers who apply what they learn and communicate it clearly are the ones who become impossible to ignore.
Sources
- LangChain4j 1.12.1 Release Notes: https://github.com/langchain4j/langchain4j/releases/tag/1.12.1
- LangChain4j HibernateEmbeddingStore PR: https://github.com/langchain4j/langchain4j/pull/4622
- LangChain4j Micrometer Integration PR: https://github.com/langchain4j/langchain4j/pull/4556
- Ollama Documentation: https://docs.ollama.com
- Ollama OpenAI Compatibility: https://ollama.com/blog/openai-compatibility
- Quarkus LangChain4j Extension (Quarkiverse): https://github.com/quarkiverse/quarkus-langchain4j
- Quarkus LangChain4j Extension Documentation: https://docs.quarkiverse.io/quarkus-langchain4j/dev/index.html
- Quarkus LangChain4j Ollama Extension: https://quarkus.io/extensions/io.quarkiverse.langchain4j/quarkus-langchain4j-ollama/
- LangChain4j Streaming Response Guide: https://bootcamptoprod.com/langchain4j-streaming-response/
- LangChain4j Response Streaming Docs: https://docs.langchain4j.dev/tutorials/response-streaming/
- Quarkus 3.31/3.32 Hidden Features: https://www.the-main-thread.com/p/quarkus-3-31-3-32-hidden-features-java
- Quarkus 3.31 Release Blog: https://quarkus.io/blog/quarkus-3-31-released/
- Quarkus 3.32 Release Blog: https://quarkus.io/blog/quarkus-3-32-released/
- Quarkus Reflection-Free Jackson PR: https://github.com/quarkusio/quarkus/pull/51802
- Quarkus AOT-jar Packaging PR: https://github.com/quarkusio/quarkus/pull/52224
- Anthropic OpenAI SDK Compatibility: https://docs.anthropic.com/en/api/openai-sdk