From adk
Write, run, and debug ADK evals: automated conversation tests for agents covering file format, assertions, CLI usage, and patterns for tools, workflows, state, tables.
```sh
npx claudepluginhub botpress/skills --plugin adk
```

This skill uses the workspace's default tool permissions.
Evals are automated conversation tests for ADK agents. Each eval defines a scenario — a sequence of user messages or events — and asserts on what the bot should do: what it says, which tools it calls, how state changes, what gets written to tables, and more.
Evals run against a live dev bot (`adk dev`), so they test the full stack — not mocks.
Use this skill when the developer asks about evals (for example, the `--format json` flag or tagging strategies), or when you are developing an ADK bot and need to write the equivalent of unit/end-to-end tests.
| File | Contents |
|---|---|
| `references/eval-format.md` | Complete file format — all fields, turn types, assertion categories, match operators, setup, outcome, options |
| `references/testing-workflow.md` | Running evals, interpreting output, using traces, the write → test → iterate loop, CI integration |
| `references/test-patterns.md` | Per-primitive patterns for actions, tools, workflows, conversations, tables, and state |
Which reference to consult:

- `eval-format.md` for structure and assertions
- `testing-workflow.md` for CLI commands and output
- `test-patterns.md` for the relevant section
- `testing-workflow.md` (inspect traces) + `eval-format.md` (check assertion syntax)

A minimal eval file:

```ts
import { Eval } from '@botpress/adk'

export default new Eval({
  name: 'greeting',
  type: 'regression',
  tags: ['basic'],
  setup: {
    state: { bot: { welcomeSent: false } },
    workflow: { trigger: 'onboarding', input: { userId: 'test-1' } },
  },
  conversation: [
    {
      user: 'Hi!',
      assert: {
        response: [
          { not_contains: 'error' },
          { llm_judge: 'Response is friendly and offers to help' },
        ],
        tools: [{ not_called: 'createTicket' }],
        state: [{ path: 'conversation.greeted', equals: true }],
      },
    },
  ],
  outcome: {
    state: [{ path: 'conversation.greeted', equals: true }],
  },
  options: {
    idleTimeout: 20000,
    judgePassThreshold: 4,
  },
})
```
| Turn | When to use |
|---|---|
| `user: 'message'` | Standard user message |
| `event: { type, payload }` | Non-message trigger (webhook, integration event) |
| `expectSilence: true` | Assert bot does NOT respond |
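The turn types can be mixed in a single scenario. Here is a sketch (the `payment.failed` event type and its payload fields are hypothetical, chosen for illustration):

```ts
conversation: [
  // standard user message
  { user: 'What happens if my card is declined?' },
  // non-message trigger: simulate an integration event (payload is hypothetical)
  {
    event: { type: 'payment.failed', payload: { reason: 'card_declined' } },
    assert: { response: [{ contains: 'payment' }] },
  },
  // an event the bot should ignore: assert it stays silent
  { event: { type: 'payment.retried' }, expectSilence: true },
]
```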
| Category | What it checks |
|---|---|
| `response` | Bot reply text (`contains`, `matches`, `llm_judge`, `similar_to`) |
| `tools` | Tool calls (`called`, `not_called`, `call_order`, `params`) |
| `state` | Bot/user/conversation state (`equals`, `changed`) |
| `tables` | Table rows (`row_exists`, `row_count`) |
| `workflow` | Workflow execution (`entered`, `completed`) |
| `timing` | Response time in ms (`lte`, `gte`) |
```sh
adk evals                      # run all evals
adk evals <name>               # run one eval
adk evals --tag <tag>          # filter by tag
adk evals --type regression    # filter by type
adk evals --verbose            # show all assertions
adk evals --format json        # JSON output for CI
adk evals runs                 # list recent runs
adk evals runs --latest        # most recent run
adk evals runs --latest -v     # with full details
```
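In CI, a minimal step might look like the following sketch. It assumes the CLI exits non-zero when an eval fails, and the shape of the JSON output depends on your ADK version:

```sh
# Run only regression evals and keep machine-readable results as a build artifact.
# A failing eval fails the step (assumes a non-zero exit code on failure).
adk evals --type regression --format json > eval-results.json
```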
✅ Every turn needs `user` or `event`

```ts
// CORRECT
{ user: 'hello', expectSilence: true }
{ event: { type: 'payment.failed' }, expectSilence: true }
```

❌ `expectSilence` alone is not a valid turn

```ts
// WRONG — missing user or event
{ expectSilence: true }
```
✅ Assert tool params to verify correct extraction

```ts
// CORRECT — verifies the LLM extracted the right values
{ called: 'createTicket', params: { priority: { equals: 'high' } } }
```

❌ Only asserting the tool was called

```ts
// INCOMPLETE — doesn't verify params were correct
{ called: 'createTicket' }
```
✅ Use `outcome` for post-conversation state and table assertions

```ts
// CORRECT — final state checked once after all turns
outcome: {
  state: [{ path: 'conversation.resolved', equals: true }],
  tables: [{ table: 'ticketsTable', row_exists: { status: { equals: 'open' } } }],
}
```

❌ Checking tables in per-turn assertions when the write happens at the end

```ts
// WRONG — table may not be written until after all turns
conversation: [
  {
    user: 'Create a ticket',
    assert: { tables: [{ table: 'ticketsTable', row_exists: { status: { equals: 'open' } } }] },
  },
]
```
✅ Seed state to test conditional behavior without running setup turns

```ts
// CORRECT — start in a known state
setup: {
  state: {
    user: { plan: 'pro' },
    conversation: { phase: 'support' },
  },
}
```

❌ Using conversation turns to set up state (slow and fragile)

```ts
// WRONG — depends on the bot correctly processing setup turns
conversation: [
  { user: 'I am on the pro plan' }, // hoping bot sets user.plan
  { user: 'I need help with billing' }, // actual test turn
]
```
For more detail, consult the reference files:

- Writing evals: `references/eval-format.md`
- Running evals: `references/testing-workflow.md`
- Debugging: `references/testing-workflow.md` and `references/eval-format.md`
- Per-primitive patterns: `references/test-patterns.md`
Match depth to the question: answer directly, showing the relevant table or CLI command, and don't generate a full eval file for an informational question.
When a full eval file is warranted, generate a complete `new Eval({})` call with realistic field values (starting with `import { Eval } from '@botpress/adk'`), run it with `adk evals <name>`, and read the expected / actual diff.