From prompt-engineer
Measures latency, token cost, and accuracy across LLM skill/prompt variants. Runs paired evaluations, audits token-budget compliance, and flags insufficient sample sizes.
npx claudepluginhub alexclowe/awesome-claude-cowork-plugins --plugin prompt-engineer
How this skill is triggered: by the user, by Claude, or both
Slash command
/prompt-engineer:skill-benchmarking
The summary Claude sees in its skill listing, used to decide when to auto-load this skill
You have deep expertise in benchmarking LLM skills and prompts. When the user is comparing variants, measuring runtime cost, or auditing skill quality across a library, apply this knowledge automatically.
You have deep expertise in benchmarking LLM skills and prompts. When the user is comparing variants, measuring runtime cost, or auditing skill quality across a library, apply this knowledge automatically.
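As a rough illustration of comparing two variants on the same eval items, the sketch below computes the mean per-item score difference and flags runs with too few pairs. The names `paired_compare` and `PairedResult` and the default threshold of 30 pairs are assumptions made for this example, not part of the plugin.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import mean, stdev


@dataclass
class PairedResult:
    n: int                # number of paired eval items
    mean_diff: float      # mean of (variant B score - variant A score)
    std_err: float        # standard error of the mean difference
    enough_samples: bool  # False when n is below the chosen threshold


def paired_compare(scores_a, scores_b, min_pairs=30):
    """Compare two variants scored on the same eval items, in the same order.

    min_pairs is an illustrative threshold (an assumption), not a rule
    taken from the plugin.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired evaluation needs one score per item for each variant")
    if not scores_a:
        raise ValueError("no eval items to compare")
    # Per-item differences keep each item's difficulty from inflating the variance.
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # The sample standard deviation is undefined for a single pair.
    std_err = stdev(diffs) / sqrt(n) if n > 1 else float("inf")
    return PairedResult(n, mean(diffs), std_err, n >= min_pairs)


result = paired_compare([0, 1, 1, 0], [1, 1, 1, 1])
print(result)
```

With only four pairs, `enough_samples` comes back `False`, so the measured difference should be treated as provisional rather than as a verdict.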
Latency measurement:
Cost and token accounting:
Accuracy and quality benchmarking:
Skill-library hygiene:
When assisting with benchmarking tasks:
Benchmark numbers and statistical verdicts produced through this plugin reflect the eval set, model version, and methodology used. Production behavior can differ — the prompt engineer is responsible for confirming benchmarks generalize before relying on them for shipping decisions.
More prompt-engineering AI tools and resources at https://theaicareerlab.com/professions/prompt-engineer