Run, evaluate, and analyze AI agent benchmark suites for SDK usability. Generates test cases from source code, executes them in sandboxed VMs, scores solutions via LLM judge, and surfaces failure patterns or API design gaps.
npx claudepluginhub pspdfkit-labs/agentic-usability --plugin agentic-usability
Run the full evaluation pipeline (execute, judge, report) for an SDK usability benchmark. Use when running a complete benchmark end-to-end, resuming an interrupted pipeline, or checking pipeline status.
Execute benchmark test cases in sandboxed environments with AI agents. Spins up microsandbox containers for each test case and extracts solutions.
Export a benchmark pipeline as a zip file for sharing or archiving. Excludes cache and large snapshots.
Generate SDK usability test cases by exploring source code. Use when creating benchmark test suites, generating test cases for an SDK, or when the user wants to create evaluation scenarios.
Initialize a new agentic-usability benchmark pipeline project. Use when setting up a new SDK benchmark, creating a config.json, or starting a new evaluation project.
A CLI tool that measures how well AI coding agents (Claude Code, Codex, Gemini CLI, etc.) can use your SDK. It generates programming problems from your SDK source, runs agents in sandboxed environments to solve them, then scores the results using an LLM judge that compares generated solutions against reference implementations.
stateDiagram-v2
generate: Test Suite Generation Agent
executionSandbox: Sandbox Pool
state executionSandbox {
execution: Test Solver Agent
publicInfo: Public Documentation
}
judgeSandbox: Sandbox Pool
state judgeSandbox {
judge: Test Judge Agent
publicInfo2: Public Documentation
privateInfo: Private Source Code
}
insight: Analyzer Agent
generate --> executionSandbox: Test Cases
executionSandbox --> judgeSandbox: Test Solutions
judgeSandbox --> insight: Test Scores
npm install -g @pspdfkit-labs/agentic-usability
Then run commands directly:
agentic-usability init -p pipelines/my-sdk-eval
git clone https://github.com/PSPDFKit-labs/agentic-usability.git
cd agentic-usability
npm install
npm run build
Then run commands via npx:
npx agentic-usability init -p pipelines/my-sdk-eval
This package includes a Claude Code plugin with skills for every CLI command. Once installed, you can run pipeline stages directly from Claude Code (e.g. /agentic-usability:eval).
From within Claude Code:
/plugin marketplace add PSPDFKit-labs/agentic-usability
/plugin install agentic-usability@agentic-usability-marketplace
/reload-plugins
| Skill | Description |
|---|---|
| /agentic-usability:init | Create a new pipeline project |
| /agentic-usability:generate | Generate test suite from SDK source |
| /agentic-usability:execute | Run agents in sandboxes |
| /agentic-usability:judge | LLM judge scoring |
| /agentic-usability:report | Display scorecard |
| /agentic-usability:eval | Full pipeline (execute → judge → report) |
| /agentic-usability:inspect | Open web UI |
| /agentic-usability:insights | AI analysis of results |
| /agentic-usability:export | Export pipeline as zip |
| /agentic-usability:sandbox | Debug shell inside a sandbox |
agentic-usability init -p pipelines/my-sdk-eval
The interactive wizard walks you through configuring the pipeline. It explains each field and provides sensible defaults. You can also cd into a directory and run agentic-usability init without -p.
agentic-usability eval -p pipelines/my-sdk-eval
This runs the evaluation pipeline: execute → judge → report.
Or run stages individually:
agentic-usability generate -p pipelines/my-sdk-eval
agentic-usability execute -p pipelines/my-sdk-eval
agentic-usability judge -p pipelines/my-sdk-eval
agentic-usability report -p pipelines/my-sdk-eval
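If you prefer to chain the individual stages yourself, the following is a minimal sketch that runs execute, judge, and report in order and stops at the first failing stage. The helper name `run_eval_stages` and the `AU_BIN` override variable are hypothetical (they are not part of the CLI), and this does not reproduce anything extra that `eval` may do, such as resuming an interrupted run.

```shell
#!/bin/sh
# Hypothetical helper, not part of agentic-usability.
# Runs the three evaluation stages in order for one pipeline directory
# and stops as soon as a stage exits with a non-zero status.
# AU_BIN is an assumed override so the sequence can be dry-run, e.g. AU_BIN=echo.
run_eval_stages() {
  project_dir=$1
  for stage in execute judge report; do
    "${AU_BIN:-agentic-usability}" "$stage" -p "$project_dir" || {
      echo "stage '$stage' failed for $project_dir" >&2
      return 1
    }
  done
}

# Dry run (prints the arguments instead of invoking the CLI):
#   AU_BIN=echo; run_eval_stages pipelines/my-sdk-eval
```

Setting `AU_BIN=echo` is a cheap way to check the stage order before spending sandbox and judge time on a real run.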
Use --tests to run specific test cases (comma-separated):
agentic-usability execute -p pipelines/my-sdk-eval --tests TC-001,TC-003
agentic-usability judge -p pipelines/my-sdk-eval --tests TC-001,TC-003
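When the test IDs come from a script (for example, a list of cases you want to re-run), the comma-separated value for --tests can be built with a small helper. This is a sketch only: `join_tests` is a hypothetical function name, and the IDs are the ones used in the example above.

```shell
#!/bin/sh
# Hypothetical helper, not part of agentic-usability.
# Joins its arguments with commas: TC-001 TC-003 -> TC-001,TC-003
# The function body runs in a subshell so the IFS change does not leak out.
join_tests() (
  IFS=,
  printf '%s\n' "$*"
)

# Possible usage with the --tests flag described above:
#   agentic-usability execute -p pipelines/my-sdk-eval --tests "$(join_tests TC-001 TC-003)"
```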
Each pipeline project is a self-contained directory. Without -p, the CLI treats the current working directory as the project directory.
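Because each project is a self-contained directory, several pipelines can sit side by side and be processed in a loop. The sketch below lists candidate project directories under a parent folder. It assumes that init writes config.json at the root of the project directory, which the text does not confirm; `list_pipelines` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper, not part of agentic-usability.
# Prints each immediate subdirectory of the given root (default: pipelines)
# that contains a config.json file.
# ASSUMPTION: config.json sits at the root of a pipeline project directory.
list_pipelines() {
  root=${1:-pipelines}
  for dir in "$root"/*/; do
    if [ -f "${dir}config.json" ]; then
      printf '%s\n' "${dir%/}"
    fi
  done
  return 0
}

# Possible usage: print the scorecard for every pipeline found.
#   list_pipelines pipelines | while read -r dir; do
#     agentic-usability report -p "$dir"
#   done
```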