From agentic-usability
Executes benchmark test cases in sandboxed VMs using AI agents. Spins up containers, scaffolds workspace, runs agent, and extracts solution artifacts.
npx claudepluginhub pspdfkit-labs/agentic-usability --plugin agentic-usability
Slash command
/agentic-usability:execute [project-directory] [--tests TC-001,TC-002] [--run runId]
Run the executor stage of the benchmark pipeline. For each test case and target, this:
- PROBLEM.md with the problem statement
- /workspace/sources/
- /workspace/solution/

echo "Arguments: $ARGUMENTS"
- --tests <ids>: Comma-separated test case IDs to run (e.g., --tests TC-001,TC-003)
- --run <runId>: Target a specific run directory (default: latest run)

Saved to results/<runId>/<target>/<testId>/:
| File | Description |
|---|---|
| generated-solution.json | Agent's solution [{path, content}] |
| agent-notes.md | Agent's self-reported working notes |
| agent-output.log | Raw agent stdout/stderr |
| agent-cmd.log | Exact command executed |
| agent-session.jsonl | Agent conversation log (if available) |
| agent-egress.log.json | Network traffic logs |
| workspace-snapshot.tar.gz | Full sandbox workspace tarball |
| setup.log | Workspace scaffolding log |
| agent-error.log | Error details (only on failure) |
| install-error.log | Agent install failure (only on error) |
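
The table describes generated-solution.json as a list of {path, content} entries. As a minimal sketch, assuming each entry holds a relative file path and its text content, the solution could be written back to disk for inspection like this. The function name write_solution and the output directory are hypothetical and not part of the tool.

```python
import json
from pathlib import Path


def write_solution(solution_json: Path, out_dir: Path) -> list[Path]:
    """Write each {path, content} entry from generated-solution.json under out_dir.

    Assumes the file is a JSON array of objects with "path" and "content" keys,
    as suggested by the table above.
    """
    entries = json.loads(solution_json.read_text(encoding="utf-8"))
    root = out_dir.resolve()
    written = []
    for entry in entries:
        target = (root / entry["path"]).resolve()
        # Refuse entries that would escape out_dir (e.g. "../x" or absolute paths).
        if not target.is_relative_to(root):
            raise ValueError(f"unsafe path in solution: {entry['path']}")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(entry["content"], encoding="utf-8")
        written.append(target)
    return written
```

A call might look like write_solution(Path("results") / run_id / target / test_id / "generated-solution.json", Path("inspect")), where run_id, target, and test_id are whatever values your run produced.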
Progress is tracked in results/<runId>/pipeline-state.json:
- completed.execute["<target>"] lists the test IDs that have finished
- Use agentic-usability eval --resume to continue from where it stopped

To check which tests completed, read the pipeline state:
results/<runId>/pipeline-state.json → completed.execute.<targetName>
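
Reading that key can be sketched as follows. This assumes pipeline-state.json is a JSON object with a nested completed.execute mapping from target name to a list of test IDs, which is how the path above reads. The helper name completed_tests is hypothetical.

```python
import json
from pathlib import Path


def completed_tests(run_dir: Path, target: str) -> list[str]:
    """Return the test IDs recorded under completed.execute[target].

    Returns an empty list when the target (or any parent key) is missing.
    """
    state = json.loads((run_dir / "pipeline-state.json").read_text(encoding="utf-8"))
    return list(state.get("completed", {}).get("execute", {}).get(target, []))
```

For example, completed_tests(Path("results") / run_id, target_name) lists what has finished for one target, with run_id and target_name taken from your own run.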
Failed tests are retried up to 2 times with backoffs of 1s and 3s before being marked as failed.
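
The retry policy can be illustrated with a short sketch. This is not the tool's implementation, only one way to express "up to 2 retries with 1s and 3s backoffs": one initial attempt, then a retry after each delay, with the last failure propagating. The name run_with_retries and the injectable sleep parameter are assumptions made for illustration.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def run_with_retries(
    task: Callable[[], T],
    backoffs: Sequence[float] = (1.0, 3.0),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task; on failure wait for the next backoff and try again.

    With the default backoffs this makes at most 3 attempts in total.
    """
    for delay in backoffs:
        try:
            return task()
        except Exception:
            # Wait before the next attempt.
            sleep(delay)
    # Final attempt: any exception propagates and the test counts as failed.
    return task()
```

Passing sleep as a parameter keeps the sketch easy to exercise without real delays.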
Concurrency is controlled by sandbox.concurrency in config.json; multiple sandboxes run in parallel.
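
To make the setting concrete, here is a hedged sketch of bounded parallelism driven by that value. It assumes config.json is a JSON object with a nested sandbox.concurrency integer; the fallback of 1, the thread pool, and the function names are illustrative choices, not the tool's actual internals.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Sequence


def load_concurrency(config_path: Path, default: int = 1) -> int:
    """Read sandbox.concurrency from config.json (default of 1 is an assumption)."""
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return max(1, int(config.get("sandbox", {}).get("concurrency", default)))


def run_parallel(
    test_ids: Sequence[str],
    run_one: Callable[[str], str],
    concurrency: int,
) -> dict[str, str]:
    """Run run_one for every test ID with at most `concurrency` in flight."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map preserves input order, so results line up with test_ids.
        return dict(zip(test_ids, pool.map(run_one, test_ids)))
```

Here run_one stands in for whatever starts one sandbox and returns its outcome.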
Run agentic-usability execute -p $ARGUMENTS and report the results.
For detailed internals, see pipeline-guide.md.