evaluate-agent

Sets up Dokimos evaluation for AI agents that use tools, assessing tool call validity, correctness, task completion, argument hallucinations, and tool definition quality.

```bash
npx claudepluginhub dokimos-dev/dokimos --plugin evaluate-agent
```

This skill uses the workspace's default tool permissions.
Set up Dokimos agent evaluation for an AI agent that uses tools. The user will describe their agent and evaluation goals via `$ARGUMENTS`.
Key source files and dependency:

- dokimos-core/src/main/java/dev/dokimos/core/agents/: ToolCall.java, ToolDefinition.java, AgentTrace.java
- dokimos-core/src/main/java/dev/dokimos/core/evaluators/agents/
- dokimos-kotlin/src/main/kotlin/dev/dokimos/kotlin/dsl/evaluators/: EvaluatorDsl.kt and CoreDsl.kt
- dokimos-examples/src/main/java/dev/dokimos/examples/basic/AgentEvaluationExample.java
- dokimos-core/src/test/java/dev/dokimos/core/integration/AgentEvaluatorIT.java
- Dependency: dev.dokimos:dokimos-core (agent evaluation is built in, no extra dependencies)

Before writing code, read the data model files and any relevant evaluator files to understand exact APIs.
| Evaluator | What it checks | LLM required? | Default threshold |
|---|---|---|---|
| ToolCallValidityEvaluator | Tool calls match JSON schema (names, required params, types, enums) | No | 1.0 |
| ToolCorrectnessEvaluator | Agent used the expected set of tools | No | 1.0 |
| TaskCompletionEvaluator | Agent completed the user's tasks | Yes | 0.5 |
| ToolArgumentHallucinationEvaluator | Arguments are grounded in user input | Yes | 0.8 |
| ToolNameReliabilityEvaluator | Tool names follow conventions (snake_case, conciseness, clarity, ordering, intent) | Optional | 0.8 |
| ToolDescriptionReliabilityEvaluator | Tool descriptions are well-crafted (structure, clarity, args documented, examples, usage notes) | Optional | 0.8 |
- ToolCall: a tool invocation record (name, arguments map, optional result string, optional metadata). Create with `ToolCall.of(name, args)` or `ToolCall.builder()` (builder form sketched below the key table).
- ToolDefinition: a tool's contract (name, description, JSON Schema map for arguments). Create with `ToolDefinition.of(name, desc, schema)`. The schema must follow JSON Schema format with "type": "object", "properties", and optional "required".
- AgentTrace: wraps a full agent execution (tool calls, reasoning steps, final response). Build with `AgentTrace.builder().addToolCall(...).finalResponse(...).build()`. Call `toOutputMap()` to get a map with keys "output", "toolCalls", "reasoningSteps" for use in EvalTestCase.

Evaluators read from specific keys in EvalTestCase maps:
| Map | Key | Type | Used by |
|---|---|---|---|
| actualOutputs | "toolCalls" | `List<ToolCall>` | Validity, Correctness, Hallucination |
| actualOutputs | "output" | `String` | Task Completion |
| expectedOutputs | "toolCalls" | `List<ToolCall>` | Correctness |
| metadata | "tools" | `List<ToolDefinition>` | Validity, Name Reliability, Description Reliability |
| metadata | "tasks" | `List<String>` | Task Completion |
| metadata | "constraints" | `String` | Task Completion |
IMPORTANT: In an Experiment, evaluators read metadata from the Example, NOT from the Experiment. Put "tools" and "tasks" in each Example's metadata (in the dataset), not on the Experiment builder.
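A concrete sketch of the builder forms from the data model above; the ToolCall setter names (name, arguments, result) are assumptions inferred from its field list, so verify them against ToolCall.java:

```java
import java.util.Map;

// ToolCall via builder; setter names are assumptions, ToolCall.of(name, args) is the documented shortcut.
ToolCall call = ToolCall.builder()
    .name("search_flights")
    .arguments(Map.of("origin", "JFK", "destination", "CDG"))
    .result("[{\"flight\":\"AF007\",\"price\":412}]") // optional result string
    .build();

// AgentTrace wraps the full execution (documented API).
AgentTrace trace = AgentTrace.builder()
    .addToolCall(call)
    .finalResponse("Found a nonstop flight for $412.")
    .build();

// toOutputMap() returns the keys "output", "toolCalls", "reasoningSteps",
// matching the actualOutputs rows in the table above.
Map<String, Object> outputs = trace.toOutputMap();
```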
Evaluator construction:

```java
// Rule-based — just use defaults or set threshold/strictMode
ToolCallValidityEvaluator.builder().strictMode(true).threshold(1.0).build();
ToolCorrectnessEvaluator.builder().matchMode(ToolCorrectnessEvaluator.MatchMode.NAMES_ONLY).build();

// LLM-based — require a JudgeLM
JudgeLM judge = prompt -> openAiClient.generate(prompt);
TaskCompletionEvaluator.builder().judge(judge).threshold(0.5).build();
ToolArgumentHallucinationEvaluator.builder().judge(judge).threshold(0.8).build();

// Tool reliability — optional JudgeLM for semantic checks
ToolNameReliabilityEvaluator.builder().judge(judge).threshold(0.8).build();
ToolDescriptionReliabilityEvaluator.builder().maxInputArgs(5).maxOptionalArgs(3).judge(judge).build();
```
ToolCorrectnessEvaluator match modes: NAMES_ONLY (default, F1 score), NAMES_AND_ORDER (LCS similarity), NAMES_AND_ARGS (full structural comparison).
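For example, expected [search_flights, book_hotel] against actual [book_hotel, search_flights] scores 1.0 under NAMES_ONLY (both names present, order ignored) but lower under NAMES_AND_ORDER, since the longest common subsequence covers only one of the two calls. A sketch; the expectedOutput(...) builder method on EvalTestCase is an assumption mirroring the Example API shown later:

```java
import java.util.List;
import java.util.Map;

// Same tool names, swapped order: NAMES_ONLY ignores order, NAMES_AND_ORDER penalizes it.
List<ToolCall> expected = List.of(
    ToolCall.of("search_flights", Map.of()),
    ToolCall.of("book_hotel", Map.of()));
List<ToolCall> actual = List.of(
    ToolCall.of("book_hotel", Map.of()),
    ToolCall.of("search_flights", Map.of()));

var testCase = EvalTestCase.builder()
    .input("Plan a Paris trip")
    .actualOutput("toolCalls", actual)
    .expectedOutput("toolCalls", expected) // assumed builder method, see note above
    .build();

var orderAware = ToolCorrectnessEvaluator.builder()
    .matchMode(ToolCorrectnessEvaluator.MatchMode.NAMES_AND_ORDER)
    .build();
var result = orderAware.evaluate(testCase);

// NAMES_AND_ARGS would additionally compare argument maps structurally,
// so the expected ToolCalls must then carry full arguments, not Map.of().
```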
A complete standalone example: define the tool schemas, capture a trace, build a test case, and run a rule-based evaluator.

```java
List<ToolDefinition> tools = List.of(
    ToolDefinition.of("search_flights", "Search for flights", Map.of(
        "type", "object",
        "properties", Map.of(
            "origin", Map.of("type", "string", "description", "Origin airport code"),
            "destination", Map.of("type", "string", "description", "Destination airport code")
        ),
        "required", List.of("origin", "destination")
    ))
);

AgentTrace trace = AgentTrace.builder()
    .addToolCall(ToolCall.of("search_flights", Map.of("origin", "JFK", "destination", "CDG")))
    .finalResponse("Found flights to Paris.")
    .build();

var testCase = EvalTestCase.builder()
    .input("Find flights from NYC to Paris")
    .actualOutput("toolCalls", trace.toolCalls())
    .actualOutput("output", trace.finalResponse())
    .metadata("tools", tools)
    .build();

var result = ToolCallValidityEvaluator.builder().build().evaluate(testCase);
```
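The case above only exercises the rule-based validity check. To drive the LLM-based evaluators too, add the remaining keys from the key table; a sketch, where the expectedOutput(...) builder method is again an assumption mirroring Example's API and the constraint text is illustrative:

```java
JudgeLM judge = prompt -> openAiClient.generate(prompt); // any JudgeLM implementation works

var fullCase = EvalTestCase.builder()
    .input("Find flights from NYC to Paris")
    .actualOutput("toolCalls", trace.toolCalls())
    .actualOutput("output", trace.finalResponse())
    .expectedOutput("toolCalls", List.of(                          // read by ToolCorrectnessEvaluator
        ToolCall.of("search_flights", Map.of("origin", "JFK", "destination", "CDG"))))
    .metadata("tools", tools)                                      // read by validity + reliability checks
    .metadata("tasks", List.of("Find flights from NYC to Paris"))  // read by TaskCompletionEvaluator
    .metadata("constraints", "Prefer nonstop routes")              // also read by TaskCompletionEvaluator
    .build();

var completion = TaskCompletionEvaluator.builder().judge(judge).build().evaluate(fullCase);
```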
Experiment-based evaluation over a dataset (tools and tasks live on each Example, per the note above):

```java
JudgeLM judge = prompt -> openAiClient.generate(prompt);

// Tools and tasks go in each Example's metadata
Dataset dataset = Dataset.builder()
    .name("Agent Evaluation")
    .addExample(Example.builder()
        .input("input", "Find flights to Paris and book a hotel")
        .expectedOutput("toolCalls", List.of(
            ToolCall.of("search_flights", Map.of()),
            ToolCall.of("book_hotel", Map.of())
        ))
        .metadata("tools", tools)
        .metadata("tasks", List.of("Search flights", "Book hotel"))
        .build())
    .build();

ExperimentResult result = Experiment.builder()
    .name("Agent Evaluation")
    .dataset(dataset)
    .task(example -> {
        AgentTrace trace = myAgent.run(example.input());
        return trace.toOutputMap();
    })
    .evaluators(List.of(
        ToolCallValidityEvaluator.builder().build(),
        ToolCorrectnessEvaluator.builder().build(),
        TaskCompletionEvaluator.builder().judge(judge).build(),
        ToolArgumentHallucinationEvaluator.builder().judge(judge).build()
    ))
    .build()
    .run();
```
The same evaluators via the Kotlin DSL:

```kotlin
val judgeLM = JudgeLM { prompt -> openAiClient.generate(prompt) }

// Standalone evaluator
val validity = toolCallValidity { threshold = 1.0 }

// In an experiment
evaluators {
    toolCallValidity { strictMode = true }
    toolCorrectness { matchMode = ToolCorrectnessEvaluator.MatchMode.NAMES_ONLY }
    taskCompletion(judgeLM) { threshold = 0.5 }
    toolArgumentHallucination(judgeLM) { threshold = 0.8 }
    toolNameReliability { judge = judgeLM }
    toolDescriptionReliability { maxInputArgs = 5; maxOptionalArgs = 3 }
}
```
Workflow:

1. Parse `$ARGUMENTS` for what the agent does, what tools it uses, and the evaluation goals.
2. Read the data model files (ToolCall.java, ToolDefinition.java, AgentTrace.java) and any relevant evaluator source files.
3. Create ToolDefinition objects for each tool the agent can use (with JSON Schema for arguments including "type", "properties", "required").
4. Build the dataset, putting metadata("tools", tools) and optionally metadata("tasks", taskList) and expectedOutput("toolCalls", expectedCalls) on each Example.
5. Write the experiment Task using AgentTrace.toOutputMap() to capture tool calls and reasoning.
6. Run the rule-based evaluators (ToolCallValidityEvaluator, ToolCorrectnessEvaluator) first; they don't need an LLM and give fast deterministic feedback. Add LLM-based evaluators once basics pass (a JudgeLM wiring sketch follows below).
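JudgeLM appears throughout as a one-method functional interface from prompt to completion, so any client can back it. A minimal wiring sketch using only the JDK's HttpClient; the endpoint, request shape, naive quote escaping, and raw-body return are all assumptions to adapt to your actual LLM provider:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical endpoint and JSON shape; swap in your provider's real API.
JudgeLM judge = prompt -> {
    try {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://llm.example.com/v1/complete")) // hypothetical URL
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(
                // Naive escaping for illustration only; use a JSON library in practice.
                "{\"prompt\":\"" + prompt.replace("\"", "\\\"") + "\"}"))
            .build();
        // Returning the raw body assumes the endpoint answers with plain text;
        // real providers wrap the completion in JSON you would need to parse.
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    } catch (Exception e) {
        throw new RuntimeException("Judge call failed", e);
    }
};
```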