From evaluate-langchain4j
Sets up Dokimos evaluation for LangChain4j apps and RAG pipelines with Q&A tasks, faithfulness, relevance, and retrieval checks.
Slash command
/evaluate-langchain4j:evaluate-langchain4j
Set up Dokimos evaluation for a LangChain4j application. The user will describe their application and evaluation goals via `$ARGUMENTS`.
- dokimos-langchain4j/src/main/java/dev/dokimos/langchain4j/LangChain4jSupport.java
- dokimos-examples/src/main/java/dev/dokimos/examples/langchain4j/LangChain4jRAGExample.java
- dev.dokimos:dokimos-langchain4j

Before writing code, read LangChain4jSupport.java to understand the available utilities.
LangChain4jSupport provides:
- asJudge(ChatModel): wraps a LangChain4j ChatModel into a JudgeLM
- simpleTask(ChatModel): creates a Task for simple Q&A evaluation
- ragTask(Function<String, Result<String>>): creates a Task for RAG evaluation that captures both output and retrieval context
- ragTask(..., inputKey, outputKey, contextKey): RAG task with custom key names
- customTask(Task): pass-through for full control
- extractTexts(List<Content>): extracts text from LangChain4j Content objects

ChatModel model = OpenAiChatModel.builder()
        .apiKey(System.getenv("OPENAI_API_KEY"))
        .modelName("gpt-4o-mini")
        .build();

Task task = LangChain4jSupport.simpleTask(model);

ExperimentResult result = Experiment.builder()
        .name("QA Evaluation")
        .dataset(Dataset.fromJson(Path.of("datasets/qa.json")))
        .task(task)
        .evaluator(ExactMatchEvaluator.builder().build())
        .build()
        .run();
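
To make the flow above concrete, here is a library-free sketch of the same loop: a dataset is fed through a task, and an evaluator scores each output. Everything in it (the `QaLoopSketch` class, the `Example` record, the lambda standing in for a model) is hypothetical and is not Dokimos's API. It also assumes that "exact match" means string equality after trimming whitespace, which may differ from what ExactMatchEvaluator actually does.

```java
import java.util.List;
import java.util.function.Function;

public class QaLoopSketch {

    // Hypothetical stand-in for one dataset row: a question and its expected answer.
    record Example(String input, String expectedOutput) {}

    // Assumed exact-match rule: equal after trimming surrounding whitespace.
    static boolean exactMatch(String actual, String expected) {
        return actual != null && expected != null && actual.trim().equals(expected.trim());
    }

    // Runs every example through the task and returns the fraction that match exactly.
    static double passRate(List<Example> dataset, Function<String, String> task) {
        if (dataset.isEmpty()) {
            return 0.0;
        }
        long passed = dataset.stream()
                .filter(ex -> exactMatch(task.apply(ex.input()), ex.expectedOutput()))
                .count();
        return (double) passed / dataset.size();
    }

    public static void main(String[] args) {
        List<Example> dataset = List.of(
                new Example("What is 2 + 2?", "4"),
                new Example("Capital of France?", "Paris"));

        // Canned answers stand in for a real model call.
        Function<String, String> fakeModel =
                question -> question.startsWith("What is 2") ? " 4 " : "Lyon";

        System.out.println(passRate(dataset, fakeModel)); // prints 0.5
    }
}
```

In the real setup, the task produced by simpleTask(model) plays the role of `fakeModel`, and the dataset file supplies the examples.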
The RAG task captures both the model output and the retrieved context, enabling evaluators like FaithfulnessEvaluator and ContextualRelevanceEvaluator:
// 1. Build your LangChain4j AiService that returns Result<String>
interface Assistant {
    Result<String> chat(String userMessage);
}

Assistant assistant = AiServices.builder(Assistant.class)
        .chatModel(chatModel)
        .retrievalAugmentor(retrievalAugmentor)
        .build();

// 2. Create the RAG task
Task task = LangChain4jSupport.ragTask(assistant::chat);

// 3. Create a judge for LLM-based evaluators
JudgeLM judge = LangChain4jSupport.asJudge(judgeChatModel);

// 4. Run with RAG-specific evaluators
// ragTask() stores context under "context" key by default.
// FaithfulnessEvaluator and HallucinationEvaluator default to contextKey="context" (matches).
// ContextualRelevanceEvaluator defaults to retrievalContextKey="retrievalContext",
// so set it explicitly to match the ragTask output.
ExperimentResult result = Experiment.builder()
        .name("RAG Evaluation")
        .dataset(dataset)
        .task(task)
        .evaluators(List.of(
                FaithfulnessEvaluator.builder()
                        .judge(judge)
                        .threshold(0.7)
                        .build(),
                ContextualRelevanceEvaluator.builder()
                        .judge(judge)
                        .retrievalContextKey("context")
                        .threshold(0.5)
                        .build(),
                HallucinationEvaluator.builder()
                        .judge(judge)
                        .threshold(0.5)
                        .build()
        ))
        .build()
        .run();
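
The key-matching comments in the example above are easy to get wrong, so here is a library-free sketch of the failure mode. The `KeyMatchingSketch` class and its map-based task output are hypothetical stand-ins, not Dokimos types; only the key names "context" and "retrievalContext" come from the comments above. The sketch assumes an evaluator simply looks up its configured key in the task's output.

```java
import java.util.List;
import java.util.Map;

public class KeyMatchingSketch {

    // Hypothetical stand-in for what a RAG task emits: the answer plus retrieved passages.
    static Map<String, Object> ragOutput(String answer, List<String> passages) {
        return Map.of("output", answer, "context", passages);
    }

    // Hypothetical evaluator-side lookup: returns the passages stored under the
    // configured key, or an empty list when the key is absent.
    @SuppressWarnings("unchecked")
    static List<String> readContext(Map<String, Object> taskOutput, String contextKey) {
        Object value = taskOutput.get(contextKey);
        return value instanceof List ? (List<String>) value : List.of();
    }

    public static void main(String[] args) {
        Map<String, Object> out = ragOutput(
                "Returns are accepted within 30 days.",
                List.of("Our policy allows returns within 30 days of purchase."));

        // Key matches what the task wrote: the evaluator sees one passage.
        System.out.println(readContext(out, "context").size());          // prints 1

        // Mismatched key: the evaluator sees no retrieval context at all.
        System.out.println(readContext(out, "retrievalContext").size()); // prints 0
    }
}
```

How a real evaluator reacts to a missing key depends on Dokimos's implementation. The sketch only shows that with mismatched key names the evaluator never receives the retrieved passages.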
<dependency>
    <groupId>dev.dokimos</groupId>
    <artifactId>dokimos-langchain4j</artifactId>
    <version>${dokimos.version}</version>
</dependency>
LangChain4j itself is a provided-scope dependency, so the user must declare their own LangChain4j version in the build.
- $ARGUMENTS: what the LangChain4j application does (Q&A, RAG, chat, etc.)
- Result<String> return type
- ExactMatchEvaluator, RegexEvaluator, LLMJudgeEvaluator
- FaithfulnessEvaluator, ContextualRelevanceEvaluator, HallucinationEvaluator, PrecisionEvaluator, RecallEvaluator
- LangChain4jSupport utilities

npx claudepluginhub dokimos-dev/dokimos --plugin evaluate-langchain4j