From agentic-usability
Displays a terminal scorecard of benchmark results with pass rates, scores by difficulty, and per-test breakdowns. Use when the user asks about benchmark results, scores, or SDK performance.
npx claudepluginhub pspdfkit-labs/agentic-usability --plugin agentic-usability
Slash command
/agentic-usability:report [project-directory] [--json] [--run runId]
Display the benchmark scorecard for the pipeline.
agentic-usability report -p $ARGUMENTS
- `--json`: Output raw structured JSON instead of the colored table
- `--run <runId>`: Show results for a specific run (default: latest)

results/<runId>/
  report.json                    # Aggregate scorecard
  <targetName>/<testId>/
    judge.json                   # Per-test judge scores
    generated-solution.json      # Agent's solution
    agent-notes.md               # Agent's working notes
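If the per-test files need to be read directly rather than through the report command, the layout above can be walked with a short script. This is a minimal sketch, assuming `<targetName>` and `<testId>` are each a single directory level and that `judge.json` is valid JSON. The helper name `load_judge_results` is hypothetical and not part of the agentic-usability CLI.

```python
import json
from pathlib import Path


def load_judge_results(run_dir):
    """Return a list of (target_name, test_id, judge_data) for one run directory."""
    results = []
    # Assumed layout: <runId>/<targetName>/<testId>/judge.json
    for judge_path in sorted(Path(run_dir).glob("*/*/judge.json")):
        test_dir = judge_path.parent
        with judge_path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        results.append((test_dir.parent.name, test_dir.name, data))
    return results
```

Calling `load_judge_results("results/<runId>")` with a real run ID substituted would return one tuple per test that has a `judge.json`.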
| Dimension | Range | What it measures |
|---|---|---|
| apiDiscovery | 0-100 | Found correct SDK endpoints/methods? |
| callCorrectness | 0-100 | API calls constructed correctly? |
| completeness | 0-100 | All requirements handled? |
| functionalCorrectness | 0-100 | Code runs and produces correct output? |
| overallVerdict | boolean | Solution works? (pass/fail) |
The report aggregates these across all test cases and breaks them down by difficulty (easy/medium/hard).
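To make the aggregation concrete, here is a rough sketch of a per-difficulty rollup. The exact formula the report uses is not stated here, so this assumes a plain mean per dimension and a pass rate equal to the fraction of tests with a true `overallVerdict`. The `difficulty` key on each record and the function name `summarize` are assumptions for illustration; the actual files may store difficulty elsewhere.

```python
from collections import defaultdict
from statistics import mean

# The four 0-100 dimensions from the table above.
SCORE_DIMENSIONS = (
    "apiDiscovery",
    "callCorrectness",
    "completeness",
    "functionalCorrectness",
)


def summarize(records):
    """Group records by difficulty; compute pass rate and mean score per dimension."""
    groups = defaultdict(list)
    for rec in records:
        # Assumption: each record carries its difficulty under this key.
        groups[rec["difficulty"]].append(rec)

    summary = {}
    for difficulty, recs in groups.items():
        row = {
            "tests": len(recs),
            "passRate": sum(1 for r in recs if r["overallVerdict"]) / len(recs),
        }
        for dim in SCORE_DIMENSIONS:
            row[dim] = mean(r[dim] for r in recs)
        summary[difficulty] = row
    return summary
```

Each record is expected to be a dict holding the five judge fields plus the assumed `difficulty` value (easy, medium, or hard).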
Runs are stored as subdirectories in results/ containing run.json:
{ "id": "run-2026-04-25T10-30-00-000Z", "createdAt": "...", "targets": [...], "testCount": 15, "label": "..." }
To list all runs, look for results/*/run.json files.
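The lookup described above can be sketched in a few lines. This assumes `createdAt` is an ISO 8601 timestamp string in a consistent format, so sorting the strings also sorts by time. The function name `list_runs` is hypothetical.

```python
import json
from pathlib import Path


def list_runs(results_dir="results"):
    """Return the parsed run.json of every run under results_dir, newest first."""
    runs = []
    for run_file in Path(results_dir).glob("*/run.json"):
        with run_file.open(encoding="utf-8") as fh:
            runs.append(json.load(fh))
    # Assumption: createdAt is an ISO 8601 string, so string order matches time order.
    runs.sort(key=lambda run: run.get("createdAt", ""), reverse=True)
    return runs
```

Under that assumption, the first element is the most recently created run, and its `id` can be passed to `--run` to view an older or specific run.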
Present the results to the user. If they want deeper analysis, suggest using the insights skill.
For detailed file inventory, see pipeline-guide.md.