evaluate-agent

Sets up Dokimos evaluation for AI agents that use tools, assessing tool call validity, correctness, task completion, argument hallucinations, and tool definition quality.

```bash
npx claudepluginhub dokimos-dev/dokimos --plugin evaluate-agent
```

This skill uses the workspace's default tool permissions.
Set up Dokimos agent evaluation for an AI agent that uses tools. The user will describe their agent and evaluation goals via `$ARGUMENTS`.
Key source files and dependency:

- dokimos-core/src/main/java/dev/dokimos/core/agents/: ToolCall.java, ToolDefinition.java, AgentTrace.java
- dokimos-core/src/main/java/dev/dokimos/core/evaluators/agents/
- dokimos-kotlin/src/main/kotlin/dev/dokimos/kotlin/dsl/evaluators/: EvaluatorDsl.kt and CoreDsl.kt
- dokimos-examples/src/main/java/dev/dokimos/examples/basic/AgentEvaluationExample.java
- dokimos-core/src/test/java/dev/dokimos/core/integration/AgentEvaluatorIT.java
- Dependency: dev.dokimos:dokimos-core (agent evaluation is built in, no extra dependencies)

Before writing code, read the data model files and any relevant evaluator files to understand exact APIs.
| Evaluator | What it checks | LLM required? | Default threshold |
|---|---|---|---|
| ToolCallValidityEvaluator | Tool calls match JSON schema (names, required params, types, enums) | No | 1.0 |
| ToolCorrectnessEvaluator | Agent used the expected set of tools | No | 1.0 |
| TaskCompletionEvaluator | Agent completed the user's tasks | Yes | 0.5 |
| ToolArgumentHallucinationEvaluator | Arguments are grounded in user input | Yes | 0.8 |
| ToolNameReliabilityEvaluator | Tool names follow conventions (snake_case, conciseness, clarity, ordering, intent) | Optional | 0.8 |
| ToolDescriptionReliabilityEvaluator | Tool descriptions are well-crafted (structure, clarity, args documented, examples, usage notes) | Optional | 0.8 |
- ToolCall: a tool invocation record (name, arguments map, optional result string, optional metadata). Create with `ToolCall.of(name, args)` or `ToolCall.builder()` (builder form sketched below the key table).
- ToolDefinition: a tool's contract (name, description, JSON Schema map for arguments). Create with `ToolDefinition.of(name, desc, schema)`. The schema must follow JSON Schema format with "type": "object", "properties", and optional "required".
- AgentTrace: wraps a full agent execution (tool calls, reasoning steps, final response). Build with `AgentTrace.builder().addToolCall(...).finalResponse(...).build()`. Call `toOutputMap()` to get a map with keys "output", "toolCalls", "reasoningSteps" for use in EvalTestCase.

Evaluators read from specific keys in EvalTestCase maps:
| Map | Key | Type | Used by |
|---|---|---|---|
| actualOutputs | "toolCalls" | `List<ToolCall>` | Validity, Correctness, Hallucination |
| actualOutputs | "output" | `String` | Task Completion |
| expectedOutputs | "toolCalls" | `List<ToolCall>` | Correctness |
| metadata | "tools" | `List<ToolDefinition>` | Validity, Name Reliability, Description Reliability |
| metadata | "tasks" | `List<String>` | Task Completion |
| metadata | "constraints" | `String` | Task Completion |
IMPORTANT: In an Experiment, evaluators read metadata from the Example, NOT from the Experiment. Put "tools" and "tasks" in each Example's metadata (in the dataset), not on the Experiment builder.
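A concrete sketch of the builder forms from the data model above; the ToolCall setter names (name, arguments, result) are assumptions inferred from its field list, so verify them against ToolCall.java:

```java
import java.util.Map;

// ToolCall via builder; setter names are assumptions, ToolCall.of(name, args) is the documented shortcut.
ToolCall call = ToolCall.builder()
    .name("search_flights")
    .arguments(Map.of("origin", "JFK", "destination", "CDG"))
    .result("[{\"flight\":\"AF007\",\"price\":412}]") // optional result string
    .build();

// AgentTrace wraps the full execution (documented API).
AgentTrace trace = AgentTrace.builder()
    .addToolCall(call)
    .finalResponse("Found a nonstop flight for $412.")
    .build();

// toOutputMap() returns the keys "output", "toolCalls", "reasoningSteps",
// matching the actualOutputs rows in the table above.
Map<String, Object> outputs = trace.toOutputMap();
```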
Evaluator construction:

```java
// Rule-based — just use defaults or set threshold/strictMode
ToolCallValidityEvaluator.builder().strictMode(true).threshold(1.0).build();
ToolCorrectnessEvaluator.builder().matchMode(ToolCorrectnessEvaluator.MatchMode.NAMES_ONLY).build();

// LLM-based — require a JudgeLM
JudgeLM judge = prompt -> openAiClient.generate(prompt);
TaskCompletionEvaluator.builder().judge(judge).threshold(0.5).build();
ToolArgumentHallucinationEvaluator.builder().judge(judge).threshold(0.8).build();

// Tool reliability — optional JudgeLM for semantic checks
ToolNameReliabilityEvaluator.builder().judge(judge).threshold(0.8).build();
ToolDescriptionReliabilityEvaluator.builder().maxInputArgs(5).maxOptionalArgs(3).judge(judge).build();
```
ToolCorrectnessEvaluator match modes: NAMES_ONLY (default, F1 score), NAMES_AND_ORDER (LCS similarity), NAMES_AND_ARGS (full structural comparison).
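For example, expected [search_flights, book_hotel] against actual [book_hotel, search_flights] scores 1.0 under NAMES_ONLY (both names present, order ignored) but lower under NAMES_AND_ORDER, since the longest common subsequence covers only one of the two calls. A sketch; the expectedOutput(...) builder method on EvalTestCase is an assumption mirroring the Example API shown later:

```java
import java.util.List;
import java.util.Map;

// Same tool names, swapped order: NAMES_ONLY ignores order, NAMES_AND_ORDER penalizes it.
List<ToolCall> expected = List.of(
    ToolCall.of("search_flights", Map.of()),
    ToolCall.of("book_hotel", Map.of()));
List<ToolCall> actual = List.of(
    ToolCall.of("book_hotel", Map.of()),
    ToolCall.of("search_flights", Map.of()));

var testCase = EvalTestCase.builder()
    .input("Plan a Paris trip")
    .actualOutput("toolCalls", actual)
    .expectedOutput("toolCalls", expected) // assumed builder method, see note above
    .build();

var orderAware = ToolCorrectnessEvaluator.builder()
    .matchMode(ToolCorrectnessEvaluator.MatchMode.NAMES_AND_ORDER)
    .build();
var result = orderAware.evaluate(testCase);

// NAMES_AND_ARGS would additionally compare argument maps structurally,
// so the expected ToolCalls must then carry full arguments, not Map.of().
```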
A complete standalone example: define the tool schemas, capture a trace, build a test case, and run a rule-based evaluator.

```java
List<ToolDefinition> tools = List.of(
    ToolDefinition.of("search_flights", "Search for flights", Map.of(
        "type", "object",
        "properties", Map.of(
            "origin", Map.of("type", "string", "description", "Origin airport code"),
            "destination", Map.of("type", "string", "description", "Destination airport code")
        ),
        "required", List.of("origin", "destination")
    ))
);

AgentTrace trace = AgentTrace.builder()
    .addToolCall(ToolCall.of("search_flights", Map.of("origin", "JFK", "destination", "CDG")))
    .finalResponse("Found flights to Paris.")
    .build();

var testCase = EvalTestCase.builder()
    .input("Find flights from NYC to Paris")
    .actualOutput("toolCalls", trace.toolCalls())
    .actualOutput("output", trace.finalResponse())
    .metadata("tools", tools)
    .build();

var result = ToolCallValidityEvaluator.builder().build().evaluate(testCase);
```
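The case above only exercises the rule-based validity check. To drive the LLM-based evaluators too, add the remaining keys from the key table; a sketch, where the expectedOutput(...) builder method is again an assumption mirroring Example's API and the constraint text is illustrative:

```java
JudgeLM judge = prompt -> openAiClient.generate(prompt); // any JudgeLM implementation works

var fullCase = EvalTestCase.builder()
    .input("Find flights from NYC to Paris")
    .actualOutput("toolCalls", trace.toolCalls())
    .actualOutput("output", trace.finalResponse())
    .expectedOutput("toolCalls", List.of(                          // read by ToolCorrectnessEvaluator
        ToolCall.of("search_flights", Map.of("origin", "JFK", "destination", "CDG"))))
    .metadata("tools", tools)                                      // read by validity + reliability checks
    .metadata("tasks", List.of("Find flights from NYC to Paris"))  // read by TaskCompletionEvaluator
    .metadata("constraints", "Prefer nonstop routes")              // also read by TaskCompletionEvaluator
    .build();

var completion = TaskCompletionEvaluator.builder().judge(judge).build().evaluate(fullCase);
```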
Experiment-based evaluation over a dataset (tools and tasks live on each Example, per the note above):

```java
JudgeLM judge = prompt -> openAiClient.generate(prompt);

// Tools and tasks go in each Example's metadata
Dataset dataset = Dataset.builder()
    .name("Agent Evaluation")
    .addExample(Example.builder()
        .input("input", "Find flights to Paris and book a hotel")
        .expectedOutput("toolCalls", List.of(
            ToolCall.of("search_flights", Map.of()),
            ToolCall.of("book_hotel", Map.of())
        ))
        .metadata("tools", tools)
        .metadata("tasks", List.of("Search flights", "Book hotel"))
        .build())
    .build();

ExperimentResult result = Experiment.builder()
    .name("Agent Evaluation")
    .dataset(dataset)
    .task(example -> {
        AgentTrace trace = myAgent.run(example.input());
        return trace.toOutputMap();
    })
    .evaluators(List.of(
        ToolCallValidityEvaluator.builder().build(),
        ToolCorrectnessEvaluator.builder().build(),
        TaskCompletionEvaluator.builder().judge(judge).build(),
        ToolArgumentHallucinationEvaluator.builder().judge(judge).build()
    ))
    .build()
    .run();
```
The same evaluators via the Kotlin DSL:

```kotlin
val judgeLM = JudgeLM { prompt -> openAiClient.generate(prompt) }

// Standalone evaluator
val validity = toolCallValidity { threshold = 1.0 }

// In an experiment
evaluators {
    toolCallValidity { strictMode = true }
    toolCorrectness { matchMode = ToolCorrectnessEvaluator.MatchMode.NAMES_ONLY }
    taskCompletion(judgeLM) { threshold = 0.5 }
    toolArgumentHallucination(judgeLM) { threshold = 0.8 }
    toolNameReliability { judge = judgeLM }
    toolDescriptionReliability { maxInputArgs = 5; maxOptionalArgs = 3 }
}
```
Workflow:

1. Parse `$ARGUMENTS` for what the agent does, what tools it uses, and the evaluation goals.
2. Read the data model files (ToolCall.java, ToolDefinition.java, AgentTrace.java) and any relevant evaluator source files.
3. Create ToolDefinition objects for each tool the agent can use (with JSON Schema for arguments including "type", "properties", "required").
4. Build the dataset, putting metadata("tools", tools) and optionally metadata("tasks", taskList) and expectedOutput("toolCalls", expectedCalls) on each Example.
5. Write the experiment Task using AgentTrace.toOutputMap() to capture tool calls and reasoning.
6. Run the rule-based evaluators (ToolCallValidityEvaluator, ToolCorrectnessEvaluator) first; they don't need an LLM and give fast deterministic feedback. Add LLM-based evaluators once basics pass (a JudgeLM wiring sketch follows below).
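JudgeLM appears throughout as a one-method functional interface from prompt to completion, so any client can back it. A minimal wiring sketch using only the JDK's HttpClient; the endpoint, request shape, naive quote escaping, and raw-body return are all assumptions to adapt to your actual LLM provider:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical endpoint and JSON shape; swap in your provider's real API.
JudgeLM judge = prompt -> {
    try {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://llm.example.com/v1/complete")) // hypothetical URL
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(
                // Naive escaping for illustration only; use a JSON library in practice.
                "{\"prompt\":\"" + prompt.replace("\"", "\\\"") + "\"}"))
            .build();
        // Returning the raw body assumes the endpoint answers with plain text;
        // real providers wrap the completion in JSON you would need to parse.
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    } catch (Exception e) {
        throw new RuntimeException("Judge call failed", e);
    }
};
```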