From agentic-usability
Run the full evaluation pipeline (execute, judge, report) for SDK usability benchmarks. Supports resume, status checks, and labeling.
npx claudepluginhub pspdfkit-labs/agentic-usability --plugin agentic-usability
Slash command
/agentic-usability:eval [project-directory] [--resume] [--fresh] [--label name]
Run the complete benchmark pipeline: **execute → judge → report**.
echo "Arguments: $ARGUMENTS"
report.json
--resume: Resume from the last checkpoint of an interrupted pipeline.
--fresh: Only useful with --resume. Resets pipeline state so the run re-executes from scratch in the same run directory. Does NOT delete result files. Without --resume, a new run always starts fresh anyway.
--label <name>: Human-readable label for this run.
--run <runId>: Only used with --resume. Targets a specific run instead of auto-detecting the latest incomplete one.
Before running, you can check whether a pipeline is paused or interrupted by reading the pipeline state file:
Pipeline state location: <project>/results/<runId>/pipeline-state.json
{
  "stage": "execute",
  "startedAt": "2026-04-25T10:30:00.000Z",
  "testCases": 15,
  "completed": {
    "execute": { "node-20": ["TC-001", "TC-002"] },
    "judge": { "node-20": [] }
  }
}
How to check status:
stage is "execute" or "judge" → pipeline is incomplete/paused
stage is "report" → pipeline completed successfully
Compare completed[stage][target].length vs testCases to see progress
No report.json in the run directory → pipeline didn't finish
To find runs, look for directories under results/ containing run.json
Run manifest (results/<runId>/run.json):
{
  "id": "run-2026-04-25T10-30-00-000Z",
  "createdAt": "2026-04-25T10:30:00.000Z",
  "targets": ["node-20"],
  "testCount": 15,
  "label": "baseline v2"
}
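The status checks above can be sketched as a small script. This is a minimal illustration, not part of the plugin: the helper names `run_status` and `list_runs` are hypothetical, and treating a run as completed only when stage is "report" and report.json also exists is an assumption that combines the two rules listed above.

```python
import json
from pathlib import Path


def run_status(run_dir):
    """Summarize one run directory from its pipeline-state.json (hypothetical helper)."""
    run_dir = Path(run_dir)
    state_path = run_dir / "pipeline-state.json"
    if not state_path.exists():
        # No state file: nothing can be said about this run.
        return {"status": "unknown", "stage": None, "progress": {}}

    state = json.loads(state_path.read_text())
    stage = state.get("stage")
    total = state.get("testCases", 0)

    # progress[stage][target] = (completed count, total test cases)
    progress = {
        stage_name: {target: (len(ids), total) for target, ids in targets.items()}
        for stage_name, targets in state.get("completed", {}).items()
    }

    # Assumption: require both stage == "report" and a report.json on disk.
    finished = stage == "report" and (run_dir / "report.json").exists()
    return {
        "status": "completed" if finished else "incomplete",
        "stage": stage,
        "progress": progress,
    }


def list_runs(results_dir):
    """Return the run directories under results/, i.e. those containing run.json."""
    return sorted(p.parent for p in Path(results_dir).glob("*/run.json"))
```

Calling `run_status` on a run directory holding the example state file would report the run as incomplete at the execute stage, with 2 of 15 test cases done for node-20.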
When --resume is passed:
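Auto-detects the latest incomplete run (stage !== "report"), or uses --run <id>
Reads the completed map for each target to continue from the last checkpoint
Run agentic-usability eval -p $ARGUMENTS and monitor the output. If the run is interrupted, suggest --resume to continue.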
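To make the resume selection concrete, here is a rough sketch of how a run could be chosen and how much work is left. It is not the plugin's actual implementation: `pick_run_to_resume` and `remaining_count` are hypothetical names, the input dicts merge fields from run.json and pipeline-state.json for convenience, and ordering runs by `createdAt` is an assumption about what "latest" means.

```python
def pick_run_to_resume(runs, run_id=None):
    """Choose the run to resume (hypothetical helper).

    `runs` is a list of dicts with the keys "id", "createdAt", and "stage".
    """
    if run_id is not None:
        # Mirrors --run <id>: target one specific run.
        for run in runs:
            if run["id"] == run_id:
                return run
        raise ValueError(f"unknown run: {run_id}")

    # A run counts as incomplete while its stage is not "report".
    incomplete = [r for r in runs if r["stage"] != "report"]
    if not incomplete:
        return None

    # Assumption: timestamps share one ISO 8601 UTC format,
    # so comparing them as strings orders them chronologically.
    return max(incomplete, key=lambda r: r["createdAt"])


def remaining_count(state, stage, target):
    """Number of test cases still to process for a stage/target pair."""
    done = state.get("completed", {}).get(stage, {}).get(target, [])
    return state["testCases"] - len(done)
```

With the example state file shown earlier, `remaining_count(state, "execute", "node-20")` gives 13, since 2 of the 15 test cases are already recorded as completed.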
For detailed pipeline internals, see pipeline-guide.md.