By fastxyz
Benchmark, evaluate, and optimize agent skills across LLMs by authoring Docker-isolated eval workbenches. Define YAML cases, suites, and graders; run CLI-based tests with traces on OpenRouter models; debug issues and generate documentation for reliable performance.
npx claudepluginhub fastxyz/skill-optimizer --plugin skill-optimizerDocker workbench and Agent Skill for running deterministic evals against agent skills.
Use this repo in two ways:
skill-optimizer skill/plugin into your agent so it can author and debug eval suites.Installation differs by agent. The canonical skill is skills/skill-optimizer/SKILL.md; every plugin manifest points at that same file.
Register this repository as a Claude Code plugin marketplace:
/plugin marketplace add fastxyz/skill-optimizer
Then install the plugin:
/plugin install skill-optimizer@skill-optimizer
Register this repository as a Codex plugin marketplace:
codex plugin marketplace add fastxyz/skill-optimizer
Then open the plugin search interface:
/plugins
Select skill-optimizer and install it.
In the Codex app, open Plugins from the sidebar, search for skill-optimizer, and install it from the Coding section.
If it is not listed, install it from Codex CLI first:
codex plugin marketplace add fastxyz/skill-optimizer
Install the skill with the open skills CLI:
npx skills add fastxyz/skill-optimizer --skill skill-optimizer -a cursor -y
Cursor can also import the skill from GitHub via Settings -> Rules -> Project Rules -> Add Rule -> Remote Rule (Github). The Cursor plugin metadata lives at .cursor-plugin/plugin.json.
Tell OpenCode:
Fetch and follow instructions from https://raw.githubusercontent.com/fastxyz/skill-optimizer/refs/heads/main/.opencode/INSTALL.md
Or add the plugin to opencode.json at user or project scope:
{
"plugin": ["skill-optimizer@git+https://github.com/fastxyz/skill-optimizer.git"]
}
Restart OpenCode. See docs/README.opencode.md for details.
Install the Gemini extension from GitHub:
gemini extensions install https://github.com/fastxyz/skill-optimizer
To update:
gemini extensions update skill-optimizer
If you only want the skill files without plugin metadata, use the open skills CLI:
npx skills add fastxyz/skill-optimizer --skill skill-optimizer -a claude-code -a opencode -a codex -a cursor -y
Requirements:
OPENROUTER_API_KEY for real model runsInstall and build:
npm install
npm run build
Only openrouter/... model refs are supported.
Run the suite against the models listed in suite.yml:
npx tsx src/cli.ts run-suite examples/workbench/pdf/suite.yml --trials 1
Run one case directly:
npx tsx src/cli.ts run-case ./case.yml --model openrouter/google/gemini-2.5-flash
CLI help:
npx tsx src/cli.ts --help
npx tsx src/cli.ts run-case --help
npx tsx src/cli.ts run-suite --help
The workbench gives an agent a skill/reference folder, an isolated /work directory, and deterministic graders. It is designed for evals where success can be verified from files, command logs, SQL, generated artifacts, or other local state.
Core concepts:
references/ is copied into /work; this is where the skill under test lives./work, not graders, hidden answers, /case, or /results.$CASE, $WORK, and $RESULTS available.answer.json, trace.jsonl, and result state under $RESULTS.Read docs/workbench.md for the full model: directory layout, Docker phases, graders, outputs, and debugging.
Tracked examples live under examples/workbench/. The PDF example includes positive PDF extraction/splitting/creation cases and a negative case that checks the agent did not read the PDF skill file for a non-PDF task. The MCP example shows a local calculator server started as a hidden Docker service and exposed through the workbench mcp command.
npx tsx src/cli.ts run-suite examples/workbench/pdf/suite.yml --trials 1
npx tsx src/cli.ts run-suite examples/workbench/mcp/suite.yml --trials 1
npm run typecheck
npm test
npm run build
npx tsx src/cli.ts --help
For Docker runner or image changes:
docker build -t skill-optimizer-workbench:local -f docker/workbench-runner.Dockerfile .
Do not commit .skill-eval/, .results/, .env, or credentials.
Analyze and optimize your Agent Skills (SKILL.md) using session data and research-backed static checks. Works with Claude Code, Codex, and any Agent Skills-compatible agent.
Share bugs, ideas, or general feedback.
Create and validate production-grade agent skills with 100-point marketplace grading
Self-evolving skill engine for Claude Code. Creates, scores, repairs, and hardens skills autonomously through recursive improvement cycles.
Professional skill and subagent creation with dual-mode workflow: 12-step fast mode and 15-step full mode with behavioral pressure testing and TDD integration.
Evidence-based agent skills compiler with progressive capability tiers (Quick/Forge/Forge+/Deep).
Evaluate Agent Skill design quality against official specifications and best practices. Use when reviewing, auditing, or improving SKILL.md files and skill packages.