Dokimos
LLM Evaluation Framework for Java
Documentation •
Getting Started •
Examples •
Issues
Dokimos is an evaluation framework for LLM applications in Java and Kotlin. It helps you evaluate responses, track quality over time, and catch regressions before they reach production.
It integrates with JUnit, LangChain4j, Spring AI, and Koog, so you can run evaluations as part of your existing test suite and CI/CD pipeline.
Why Dokimos?
- JUnit integration: Run evaluations as parameterized tests in your existing test suite.
- Framework agnostic: Works with LangChain4j, Spring AI, Koog, or any LLM client, and evaluators can be powered by any LLM.
- Built-in evaluators: Hallucination detection, faithfulness, contextual relevance, LLM-as-a-judge, and more.
- Agent evaluation: Evaluate AI agents with tool call validation, task completion, argument hallucination detection, and tool reliability checks.
- Custom evaluators: Build your own metrics by extending BaseEvaluator or using LLMJudgeEvaluator.
- Dataset support: Load test cases from JSON, CSV, or define them programmatically.
- CI/CD ready: Runs locally or in any CI/CD environment. Fail builds when quality drops.
- Kotlin as a first-class citizen: Compose all tests with a convenient Kotlin DSL.
Quick Start
Add the dependency to your pom.xml (check Maven Central for the latest version):
<dependency>
    <groupId>dev.dokimos</groupId>
    <artifactId>dokimos-core</artifactId>
    <version>${dokimos.version}</version>
</dependency>
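For Gradle builds, the same artifact can be declared with the coordinates from the pom.xml above. A minimal sketch in the Gradle Kotlin DSL (replace the version placeholder with the latest release from Maven Central):

```kotlin
dependencies {
    // Replace <version> with the latest release from Maven Central.
    testImplementation("dev.dokimos:dokimos-core:<version>")
}
```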
Run a standalone evaluator
Evaluate a single response directly:
Java
Evaluator evaluator = ExactMatchEvaluator.builder()
        .name("Exact Match")
        .threshold(1.0)
        .build();
EvalTestCase testCase = EvalTestCase.of("What is 2+2?", "4", "4");
EvalResult result = evaluator.evaluate(testCase);
System.out.println("Passed: " + result.success()); // true
System.out.println("Score: " + result.score()); // 1.0
Kotlin
val evaluator = exactMatch {
    name = "Exact Match"
    threshold = 1.0
}
val testCase = EvalTestCase.of("What is 2+2?", "4", "4")
val result = evaluator.evaluate(testCase)
println("Passed: ${result.success()}") // true
println("Score: ${result.score()}") // 1.0
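Under the hood, an exact-match check reduces to a string comparison plus a threshold test. The sketch below is not Dokimos source code, just a plain-Java illustration of the scoring semantics the examples above rely on (score 1.0 on an exact string match, 0.0 otherwise; a case passes when the score meets the threshold):

```java
// Plain-Java sketch of exact-match scoring semantics; illustrative only,
// not the Dokimos implementation.
public class ExactMatchSketch {

    // Score 1.0 for an exact string match, 0.0 otherwise.
    static double score(String actualOutput, String expectedOutput) {
        return actualOutput.equals(expectedOutput) ? 1.0 : 0.0;
    }

    // A test case passes when its score meets the configured threshold.
    static boolean passes(double score, double threshold) {
        return score >= threshold;
    }

    public static void main(String[] args) {
        double s = score("4", "4");
        System.out.println("Score: " + s + ", Passed: " + passes(s, 1.0));
        // prints: Score: 1.0, Passed: true
    }
}
```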
Write a JUnit test
Use @DatasetSource to run evaluations as parameterized tests:
Java
JudgeLM judgeLM = prompt -> openAiClient.generate(prompt);
Evaluator correctnessEvaluator = LLMJudgeEvaluator.builder()
        .name("Correctness")
        .criteria("Is the answer correct and complete?")
        .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))
        .judge(judgeLM)
        .build();
@ParameterizedTest
@DatasetSource("classpath:datasets/qa.json")
void testQAResponses(Example example) {
    String response = assistant.chat(example.input());
    EvalTestCase testCase = example.toTestCase(response);
    Assertions.assertEval(testCase, correctnessEvaluator);
}
Kotlin
val judgeLM = JudgeLM { prompt -> openAiClient.generate(prompt) }
val correctnessEvaluator = llmJudge(judgeLM) {
    name = "Correctness"
    criteria = "Is the answer correct and complete?"
    params(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT)
}
class QaTests {
    @ParameterizedTest
    @DatasetSource("classpath:datasets/qa.json")
    fun testQAResponses(example: Example) {
        val response = assistant.chat(example.input())
        val testCase = example.toTestCase(response)
        Assertions.assertEval(testCase, correctnessEvaluator)
    }
}
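The dataset file referenced by @DatasetSource might look like the sketch below. The field names here are an assumption inferred from example.input() and example.toTestCase(response) in the tests above; check the Dokimos documentation for the exact schema:

```json
[
  { "input": "What is 2+2?", "expectedOutput": "4" },
  { "input": "What is the capital of France?", "expectedOutput": "Paris" }
]
```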
Evaluate a dataset in bulk
Run experiments across entire datasets with aggregated metrics:
Java
JudgeLM judgeLM = prompt -> openAiClient.generate(prompt);