Dokimos
LLM Evaluation Framework for Java
Documentation •
Getting Started •
Examples •
Issues
Dokimos is an evaluation framework for LLM applications in Java and Kotlin. It helps you evaluate responses, track quality over time, and catch regressions before they reach production.
It integrates with JUnit, LangChain4j, Spring AI, and Koog, so you can run evaluations as part of your existing test suite and CI/CD pipeline.
Why Dokimos?
- JUnit integration: Run evaluations as parameterized tests in your existing test suite.
- Framework agnostic: Works with LangChain4j, Spring AI, Koog, or any LLM client, and evaluators can be powered by any LLM.
- Built-in evaluators: Hallucination detection, faithfulness, contextual relevance, LLM-as-a-judge, and more.
- Agent evaluation: Evaluate AI agents with tool call validation, task completion, argument hallucination detection, and tool reliability checks.
- Custom evaluators: Build your own metrics by extending BaseEvaluator or using LLMJudgeEvaluator.
- Dataset support: Load test cases from JSON, CSV, or define them programmatically.
- CI/CD ready: Runs locally or in any CI/CD environment. Fail builds when quality drops.
- Kotlin as a first-class citizen: Compose all tests with a convenient Kotlin DSL.
Quick Start
Add the dependency to your pom.xml (check Maven Central for the latest version):
<dependency>
    <groupId>dev.dokimos</groupId>
    <artifactId>dokimos-core</artifactId>
    <version>${dokimos.version}</version>
</dependency>
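For Gradle builds, the same artifact can be declared with the coordinates from the pom.xml above. A minimal sketch in the Gradle Kotlin DSL (replace the version placeholder with the latest release from Maven Central):

```kotlin
dependencies {
    // Replace <version> with the latest release from Maven Central.
    testImplementation("dev.dokimos:dokimos-core:<version>")
}
```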
Run a standalone evaluator
Evaluate a single response directly:
Java
Evaluator evaluator = ExactMatchEvaluator.builder()
        .name("Exact Match")
        .threshold(1.0)
        .build();
EvalTestCase testCase = EvalTestCase.of("What is 2+2?", "4", "4");
EvalResult result = evaluator.evaluate(testCase);
System.out.println("Passed: " + result.success()); // true
System.out.println("Score: " + result.score()); // 1.0
Kotlin
val evaluator = exactMatch {
    name = "Exact Match"
    threshold = 1.0
}
val testCase = EvalTestCase.of("What is 2+2?", "4", "4")
val result = evaluator.evaluate(testCase)
println("Passed: ${result.success()}") // true
println("Score: ${result.score()}") // 1.0
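Under the hood, an exact-match check reduces to a string comparison plus a threshold test. The sketch below is not Dokimos source code, just a plain-Java illustration of the scoring semantics the examples above rely on (score 1.0 on an exact string match, 0.0 otherwise; a case passes when the score meets the threshold):

```java
// Plain-Java sketch of exact-match scoring semantics; illustrative only,
// not the Dokimos implementation.
public class ExactMatchSketch {

    // Score 1.0 for an exact string match, 0.0 otherwise.
    static double score(String actualOutput, String expectedOutput) {
        return actualOutput.equals(expectedOutput) ? 1.0 : 0.0;
    }

    // A test case passes when its score meets the configured threshold.
    static boolean passes(double score, double threshold) {
        return score >= threshold;
    }

    public static void main(String[] args) {
        double s = score("4", "4");
        System.out.println("Score: " + s + ", Passed: " + passes(s, 1.0));
        // prints: Score: 1.0, Passed: true
    }
}
```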
Write a JUnit test
Use @DatasetSource to run evaluations as parameterized tests:
Java
JudgeLM judgeLM = prompt -> openAiClient.generate(prompt);
Evaluator correctnessEvaluator = LLMJudgeEvaluator.builder()
        .name("Correctness")
        .criteria("Is the answer correct and complete?")
        .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))
        .judge(judgeLM)
        .build();
@ParameterizedTest
@DatasetSource("classpath:datasets/qa.json")
void testQAResponses(Example example) {
    String response = assistant.chat(example.input());
    EvalTestCase testCase = example.toTestCase(response);
    Assertions.assertEval(testCase, correctnessEvaluator);
}
Kotlin
val judgeLM = JudgeLM { prompt -> openAiClient.generate(prompt) }
val correctnessEvaluator = llmJudge(judgeLM) {
    name = "Correctness"
    criteria = "Is the answer correct and complete?"
    params(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT)
}
class QaTests {
    @ParameterizedTest
    @DatasetSource("classpath:datasets/qa.json")
    fun testQAResponses(example: Example) {
        val response = assistant.chat(example.input())
        val testCase = example.toTestCase(response)
        Assertions.assertEval(testCase, correctnessEvaluator)
    }
}
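The dataset file referenced by @DatasetSource might look like the sketch below. The field names here are an assumption inferred from example.input() and example.toTestCase(response) in the tests above; check the Dokimos documentation for the exact schema:

```json
[
  { "input": "What is 2+2?", "expectedOutput": "4" },
  { "input": "What is the capital of France?", "expectedOutput": "Paris" }
]
```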
Evaluate a dataset in bulk
Run experiments across entire datasets with aggregated metrics:
Java
JudgeLM judgeLM = prompt -> openAiClient.generate(prompt);