Skill

evaluate-langchain4j

Sets up Dokimos evaluation for LangChain4j apps and RAG pipelines with Q&A tasks, faithfulness, relevance, and retrieval checks.

npx claudepluginhub dokimos-dev/dokimos --plugin evaluate-langchain4j

Tool Access

This skill uses the workspace's default tool permissions.

Last Commit: Feb 21, 2026

Evaluate LangChain4j

Set up Dokimos evaluation for a LangChain4j application. The user will describe their application and evaluation goals via $ARGUMENTS.

Where things live

LangChain4j support: dokimos-langchain4j/src/main/java/dev/dokimos/langchain4j/LangChain4jSupport.java
Example: dokimos-examples/src/main/java/dev/dokimos/examples/langchain4j/LangChain4jRAGExample.java
Maven dependency: dev.dokimos:dokimos-langchain4j

Before writing code, read LangChain4jSupport.java to understand the available utilities.

Key utilities

LangChain4jSupport provides:

- asJudge(ChatModel): wraps a LangChain4j ChatModel as a JudgeLM
- simpleTask(ChatModel): creates a Task for simple Q&A evaluation
- ragTask(Function<String, Result<String>>): creates a Task for RAG evaluation that captures both the output and the retrieval context
- ragTask(..., inputKey, outputKey, contextKey): the same RAG task with custom key names
- customTask(Task): pass-through for full control
- extractTexts(List<Content>): extracts text from LangChain4j Content objects
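
To make the idea behind ragTask concrete, here is a minimal stand-alone sketch: it wraps a function that returns an answer plus its retrieved sources, and emits a map holding both the output and the context. It does not use Dokimos or LangChain4j types. `RagTaskSketch`, `Answer`, and the default keys "input" and "output" are hypothetical stand-ins; only the "context" default is stated in this document. Read LangChain4jSupport.java for the real signatures.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-ins for illustration; NOT the Dokimos or LangChain4j API.
final class RagTaskSketch {

    // Stand-in for LangChain4j's Result<String>: an answer plus the retrieved texts.
    record Answer(String content, List<String> sources) {}

    // Wraps a RAG call so each run yields both the output and the retrieval context.
    static Function<Map<String, Object>, Map<String, Object>> ragTask(
            Function<String, Answer> rag, String inputKey, String outputKey, String contextKey) {
        return example -> {
            String question = (String) example.get(inputKey);
            Answer answer = rag.apply(question);
            Map<String, Object> out = new LinkedHashMap<>();
            out.put(outputKey, answer.content());
            out.put(contextKey, answer.sources());
            return out;
        };
    }

    // Overload with assumed default key names ("input" and "output" are guesses).
    static Function<Map<String, Object>, Map<String, Object>> ragTask(Function<String, Answer> rag) {
        return ragTask(rag, "input", "output", "context");
    }

    public static void main(String[] args) {
        // A fake RAG function so the sketch runs without any model or retriever.
        Function<String, Answer> fakeRag =
                q -> new Answer("Paris", List.of("Paris is the capital of France."));
        Map<String, Object> result = ragTask(fakeRag).apply(Map.of("input", "Capital of France?"));
        System.out.println(result);
    }
}
```

Because the context travels with the output, context-aware evaluators can score each answer against exactly what was retrieved for it.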

Evaluation patterns

Simple Q&A evaluation

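// Imports for the LangChain4j and Dokimos integration types used here.
// The Dokimos core types (Task, Experiment, ExperimentResult, Dataset,
// ExactMatchEvaluator) need imports as well.
import dev.dokimos.langchain4j.LangChain4jSupport;
import dev.langchain4j.model.chat.ChatModel;
import dev.langchain4j.model.openai.OpenAiChatModel;
import java.nio.file.Path;

ChatModel model = OpenAiChatModel.builder()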
        .apiKey(System.getenv("OPENAI_API_KEY"))
        .modelName("gpt-4o-mini")
        .build();

Task task = LangChain4jSupport.simpleTask(model);

ExperimentResult result = Experiment.builder()
        .name("QA Evaluation")
        .dataset(Dataset.fromJson(Path.of("datasets/qa.json")))
        .task(task)
        .evaluator(ExactMatchEvaluator.builder().build())
        .build()
        .run();
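
The snippet above reads the API key with System.getenv("OPENAI_API_KEY"), which returns null when the variable is unset. A small preflight helper, sketched below, fails fast with a readable message before any experiment runs. `EnvPreflight` and `requireEnv` are hypothetical names, not part of Dokimos or LangChain4j.

```java
import java.util.Map;

// Hypothetical helper for illustration; not part of Dokimos or LangChain4j.
final class EnvPreflight {

    // Returns the value of the named variable, or throws if it is missing or blank.
    static String requireEnv(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.isBlank()) {
            throw new IllegalStateException("Environment variable " + name + " is not set");
        }
        return value;
    }

    // Convenience overload that reads the real process environment.
    static String requireEnv(String name) {
        return requireEnv(System.getenv(), name);
    }

    public static void main(String[] args) {
        try {
            String apiKey = requireEnv("OPENAI_API_KEY");
            System.out.println("API key found (" + apiKey.length() + " characters)");
        } catch (IllegalStateException e) {
            System.err.println(e.getMessage());
        }
    }
}
```

With this in place, the builder call could read .apiKey(EnvPreflight.requireEnv("OPENAI_API_KEY")).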

RAG evaluation

The RAG task captures both the model output and the retrieved context, enabling evaluators like FaithfulnessEvaluator and ContextualRelevanceEvaluator:

// 1. Build your LangChain4j AiService that returns Result<String>
interface Assistant {
    Result<String> chat(String userMessage);
}

Assistant assistant = AiServices.builder(Assistant.class)
        .chatModel(chatModel)
        .retrievalAugmentor(retrievalAugmentor)
        .build();

// 2. Create the RAG task
Task task = LangChain4jSupport.ragTask(assistant::chat);

// 3. Create a judge for LLM-based evaluators
JudgeLM judge = LangChain4jSupport.asJudge(judgeChatModel);

// 4. Run with RAG-specific evaluators
// ragTask() stores context under "context" key by default.
// FaithfulnessEvaluator and HallucinationEvaluator default to contextKey="context" (matches).
// ContextualRelevanceEvaluator defaults to retrievalContextKey="retrievalContext",
// so set it explicitly to match the ragTask output.
ExperimentResult result = Experiment.builder()
        .name("RAG Evaluation")
        .dataset(dataset)
        .task(task)
        .evaluators(List.of(
                FaithfulnessEvaluator.builder()
                        .judge(judge)
                        .threshold(0.7)
                        .build(),
                ContextualRelevanceEvaluator.builder()
                        .judge(judge)
                        .retrievalContextKey("context")
                        .threshold(0.5)
                        .build(),
                HallucinationEvaluator.builder()
                        .judge(judge)
                        .threshold(0.5)
                        .build()
        ))
        .build()
        .run();
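
The comments in the snippet above come down to one rule: every key an evaluator reads must be a key the task writes. The following stand-alone sketch checks that rule with plain collections. `KeyAlignment` and `missingKeys` are hypothetical helpers, not Dokimos API; the context key names are the defaults quoted in the comments above, and "output" is an assumed name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of Dokimos.
final class KeyAlignment {

    // Returns "evaluator -> key" entries for evaluators whose expected key the task never writes.
    // Entries are sorted by evaluator name so the result is deterministic.
    static List<String> missingKeys(Set<String> taskOutputKeys, Map<String, String> evaluatorKeys) {
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, String> e : new TreeMap<>(evaluatorKeys).entrySet()) {
            if (!taskOutputKeys.contains(e.getValue())) {
                missing.add(e.getKey() + " -> " + e.getValue());
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Keys the RAG task writes ("output" is an assumed name; "context" is from the text).
        Set<String> ragTaskKeys = Set.of("output", "context");

        // Default context keys per evaluator, as described in the comments above.
        Map<String, String> defaults = Map.of(
                "FaithfulnessEvaluator", "context",
                "HallucinationEvaluator", "context",
                "ContextualRelevanceEvaluator", "retrievalContext");

        // prints [ContextualRelevanceEvaluator -> retrievalContext]
        System.out.println(missingKeys(ragTaskKeys, defaults));
    }
}
```

Setting retrievalContextKey("context") on ContextualRelevanceEvaluator, as the experiment above does, is what makes this list empty.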

Dependencies

<dependency>
    <groupId>dev.dokimos</groupId>
    <artifactId>dokimos-langchain4j</artifactId>
    <version>${dokimos.version}</version>
</dependency>

LangChain4j itself is a provided-scope dependency of dokimos-langchain4j, so the application must declare its own LangChain4j version.

Steps

1. Understand from $ARGUMENTS what the LangChain4j application does (Q&A, RAG, chat, etc.).
2. Determine whether it calls for simple Q&A or RAG evaluation. RAG evaluation requires the service method to return Result<String>.
3. Choose appropriate evaluators:
   - Q&A: ExactMatchEvaluator, RegexEvaluator, LLMJudgeEvaluator
   - RAG: FaithfulnessEvaluator, ContextualRelevanceEvaluator, HallucinationEvaluator, PrecisionEvaluator, RecallEvaluator
4. Create a dataset that matches the application's domain.
5. Wire everything together using the LangChain4jSupport utilities.