Scaffold a new Evaluator implementation for the Dokimos LLM evaluation framework. Creates evaluator classes that extend BaseEvaluator and use the builder pattern, supporting both simple evaluators and LLM-judged evaluators via JudgeLM.
Create evaluation datasets for the Dokimos LLM evaluation framework in JSON, CSV, or JSONL format. Supports simple and structured example formats with inputs, expected outputs, and metadata.
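A structured dataset in the JSONL format might look like the sketch below. The field names (`input`, `expected_output`, `metadata`) are illustrative assumptions, not a confirmed Dokimos schema; consult the framework's dataset documentation for the actual keys it expects.

```jsonl
{"input": "What is the capital of France?", "expected_output": "Paris", "metadata": {"category": "geography"}}
{"input": "List three primary colors.", "expected_output": "Red, yellow, blue", "metadata": {"category": "general"}}
```

Each line is one self-contained JSON object, which makes JSONL datasets easy to append to and to stream one example at a time.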
Scaffold eval-driven tests using dokimos-junit. Creates JUnit parameterized tests with @DatasetSource and Assertions.assertEval() for running Dokimos evaluations as unit tests in CI.
Scaffold a Dokimos Experiment that wires together a dataset, task, evaluators, and optional reporter. Supports parallelism, multiple runs for variance reduction, and server-based reporting.
Set up evaluation of AI agents with tool call validation, correctness checks, task completion, and tool reliability using Dokimos. Framework-agnostic: works with any agent framework.
Set up evaluation of Koog AI agents using Dokimos. Wires Koog agents as the system under test or as LLM judges via KoogSupport utilities, with Kotlin DSL support.
Set up evaluation of LangChain4j applications and RAG pipelines using Dokimos. Provides task and judge creation via LangChain4jSupport, with evaluators for faithfulness, contextual relevance, and hallucination.
Set up evaluation of Spring AI applications using Dokimos. Provides judge creation and type conversion via SpringAiSupport, with @SpringBootTest integration for evaluations in CI.
Claude Code marketplace entries for the plugin-safe Antigravity Awesome Skills library and its compatible editorial bundles.
Directory of popular Claude Code extensions, including development tools, productivity plugins, and MCP integrations.
No description available.
Share bugs, ideas, or general feedback.