Automated quality assurance for Claude Code agents using LLM-as-judge evaluation.
We built AI agents at BrandCast for SEO optimization, content publishing, and weekly planning, alongside our technical agent fleet. They needed rigorous quality checks and continuous improvement, but manual testing was time-consuming and inconsistent.
So we built an automated benchmarking system using AI to evaluate AI.
We're still very early, but the approach shows promise. We're open-sourcing what we've built so far.
✅ Slash command - /benchmark-agent for one-command testing
✅ Test suite creator - Generate your first benchmark in < 1 hour
✅ LLM-as-judge - Automated, objective scoring
✅ Performance tracking - JSON-based history over time
✅ Test rotation - Keep agents challenged with fresh tests
✅ Complete examples - 2 production-tested benchmark suites
# 1. Install via Claude Code Marketplace
/plugin add https://github.com/BrandCast-Signage/agent-benchmark-kit
# 2. Create your first benchmark
/benchmark-agent --create my-agent
# 3. Answer 5 questions about your agent
# [Interactive prompts guide you through test creation]
# 4. Run the benchmark
/benchmark-agent my-agent
# 5. View results and iterate
# Results show score breakdown and recommendations
We use this framework internally at BrandCast for 7 production agents:
| Agent | Baseline | Current | Improvement |
|---|---|---|---|
| SEO Specialist | 88/100 | 90/100 | +2.3% in 8 days |
| Content Publisher | 97.5/100 | 97.5/100 | Excellent baseline |
| Weekly Planner | 85/100 | 87/100 | Tracked over 12 weeks |
These aren't toy examples. These are production agents serving real users.
graph TD
A[Create Test Suite] --> B[Define Test Cases]
B --> C[Set Ground Truth]
C --> D[Run Benchmarks]
D --> E[Judge Scores Results]
E --> F[Track Performance]
F --> G[Iterate & Improve]
G --> D
Define inputs that test your agent's capabilities. The test-suite-creator agent helps you design 5 diverse, challenging tests.
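As a rough illustration, a test case pairs an input with the scenario it is meant to exercise. The field names below are hypothetical, not the plugin's required schema; this is just a sketch of what a test case for an SEO agent could look like:

```json
{
  "test_id": "seo-03-missing-meta",
  "description": "Article draft with no meta description and a duplicate H1",
  "input": {
    "title": "How Digital Signage Boosts Retail Sales",
    "body": "...",
    "meta_description": null
  },
  "focus": ["metadata validation", "heading structure"]
}
```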
Define expected outputs in JSON format. What should the agent detect? What decisions should it make?
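Continuing the hypothetical SEO test above, the ground truth spells out what a correct run should detect and decide. Again, these keys are illustrative assumptions, not the plugin's actual format:

```json
{
  "test_id": "seo-03-missing-meta",
  "expected": {
    "issues_detected": ["missing meta description", "duplicate H1"],
    "should_block_publish": true,
    "suggested_fixes": 2
  }
}
```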
Execute tests via the /benchmark-agent command. Your agent processes each test case.
The benchmark-judge agent compares actual output to ground truth, scoring objectively (0-100).
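Conceptually, the judge returns a per-test score alongside what matched and what was missed. A hedged sketch of what that output might look like (the real schema may differ):

```json
{
  "test_id": "seo-03-missing-meta",
  "score": 85,
  "matched": ["missing meta description", "should_block_publish"],
  "missed": ["duplicate H1"],
  "reasoning": "Agent caught the missing meta description and correctly blocked publish, but did not flag the duplicate H1."
}
```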
Results stored in performance-history.json. See trends over time, detect regressions.
Use data to guide prompt improvements. Re-run to validate changes.
Problem: Creating test cases manually is hard and time-consuming.
Solution: Answer 5 questions about your agent, get a complete benchmark suite.
/benchmark-agent --create my-agent
# Questions you'll answer:
# 1. What does your agent do?
# 2. What validations does it perform?
# 3. What are common edge cases?
# 4. What would perfect output look like?
# 5. What would failing output look like?
# Generates:
# ✓ 5 diverse test cases
# ✓ Ground truth expectations (JSON)
# ✓ Scoring rubric (METRICS.md)
# ✓ Complete documentation
Time to first benchmark: < 1 hour
Consistent, objective scoring using AI to evaluate AI output.
The benchmark-judge agent compares each output to its ground truth and assigns an objective 0-100 score.
Agreement rate with manual scoring: 95%+
Track improvements over time with JSON-based history.
{
  "seo-specialist": {
    "baseline": { "version": "v1", "score": 88 },
    "current": { "version": "v2", "score": 90 },
    "trend": "improving",
    "runs": [...]
  }
}
See at a glance which agents are improving, which are holding steady, and where regressions appear.
Keep benchmarks challenging with automated test rotation.
When an agent scores 95+ on all tests, or scores 100 three times on the same test, rotation swaps in fresh tests so the benchmark stays challenging.