# Agent Benchmark Kit

Automated quality assurance for Claude Code agents using LLM-as-judge evaluation.

## Why This Exists

We built AI agents at BrandCast for SEO optimization, content publishing, and weekly planning, alongside our technical agent fleet. They needed rigorous quality checks and continuous improvement, but manual testing was time-consuming and inconsistent.

So we built an automated benchmarking system that uses AI to evaluate AI.

We're still very early, but the approach shows promise, so we're open-sourcing what we've built so far.
## What You Get

- ✅ **Slash command** - `/benchmark-agent` for one-command testing
- ✅ **Test suite creator** - generate your first benchmark in < 1 hour
- ✅ **LLM-as-judge** - automated, objective scoring
- ✅ **Performance tracking** - JSON-based history over time
- ✅ **Test rotation** - keep agents challenged with fresh tests
- ✅ **Complete examples** - 2 production-tested benchmark suites
## Quick Start

```bash
# 1. Install via Claude Code Marketplace
/plugin add https://github.com/BrandCast-Signage/agent-benchmark-kit

# 2. Create your first benchmark
/benchmark-agent --create my-agent

# 3. Answer 5 questions about your agent
#    [Interactive prompts guide you through test creation]

# 4. Run the benchmark
/benchmark-agent my-agent

# 5. View results and iterate
#    Results show score breakdown and recommendations
```
## Real-World Results

We use this framework internally at BrandCast for 7 production agents. A sample:

| Agent | Baseline | Current | Improvement |
|---|---|---|---|
| SEO Specialist | 88/100 | 90/100 | +2.3% in 8 days |
| Content Publisher | 97.5/100 | 97.5/100 | Excellent baseline |
| Weekly Planner | 85/100 | 87/100 | Tracked over 12 weeks |

These aren't toy examples; they are production agents serving real users.
## How It Works

```mermaid
graph TD
    A[Create Test Suite] --> B[Define Test Cases]
    B --> C[Set Ground Truth]
    C --> D[Run Benchmarks]
    D --> E[Judge Scores Results]
    E --> F[Track Performance]
    F --> G[Iterate & Improve]
    G --> D
```
### 1. Create Test Cases

Define inputs that test your agent's capabilities. The test-suite-creator agent helps you design 5 diverse, challenging tests.
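As a sketch, a test case pairs an input with a description of what it exercises. The schema below is illustrative, not the kit's exact format:

```json
{
  "id": "seo-edge-case-03",
  "description": "Page with duplicate title tags and a missing meta description",
  "input": "<html><head><title>Home</title><title>Home</title></head></html>",
  "capabilities_tested": ["duplicate-tag detection", "missing-metadata detection"]
}
```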
### 2. Set Ground Truth

Define expected outputs in JSON format. What should the agent detect? What decisions should it make?
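For the hypothetical test case above, a ground-truth file might look like this (again, the field names are illustrative assumptions, not the kit's exact schema):

```json
{
  "test_id": "seo-edge-case-03",
  "expected": {
    "issues_detected": ["duplicate_title_tag", "missing_meta_description"],
    "decision": "flag_for_review"
  }
}
```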
### 3. Run Benchmarks

Execute tests via the `/benchmark-agent` command. Your agent processes each test case.
### 4. Judge Scores Results

The benchmark-judge agent compares actual output to ground truth, scoring objectively on a 0-100 scale.
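A judge report, sketched here with illustrative field names, records the score alongside what was caught and what was missed:

```json
{
  "test_id": "seo-edge-case-03",
  "score": 85,
  "matched": ["duplicate_title_tag"],
  "missed": ["missing_meta_description"],
  "false_positives": [],
  "feedback": "Detected the duplicate title but overlooked the absent meta description."
}
```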
### 5. Track Performance

Results are stored in `performance-history.json`, so you can see trends over time and detect regressions.
### 6. Iterate & Improve

Use the data to guide prompt improvements, then re-run the benchmark to validate your changes.
## Key Features

### 🎯 Interactive Test Suite Creator

**Problem:** Creating test cases manually is hard and time-consuming.

**Solution:** Answer 5 questions about your agent and get a complete benchmark suite.
```bash
/benchmark-agent --create my-agent

# Questions you'll answer:
# 1. What does your agent do?
# 2. What validations does it perform?
# 3. What are common edge cases?
# 4. What would perfect output look like?
# 5. What would failing output look like?

# Generates:
# ✓ 5 diverse test cases
# ✓ Ground truth expectations (JSON)
# ✓ Scoring rubric (METRICS.md)
# ✓ Complete documentation
```

**Time to first benchmark:** < 1 hour
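The generated suite might be laid out along these lines (directory names are illustrative; check the kit's actual output for the real structure):

```
benchmarks/my-agent/
├── test-cases/       # 5 input scenarios
├── ground-truth/     # expected outputs (JSON)
├── METRICS.md        # scoring rubric
└── README.md         # suite documentation
```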
### 📊 LLM-as-Judge Evaluation

Consistent, objective scoring that uses AI to evaluate AI output.

The benchmark-judge agent:
- Compares actual output to expected results
- Scores using your custom rubric (0-100 scale)
- Identifies false positives and missed issues
- Provides detailed feedback

**Agreement rate with manual scoring:** 95%+
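A rubric gives the judge concrete point allocations. As an illustrative sketch of what a METRICS.md might contain (the weights and categories are assumptions, not the kit's defaults):

```markdown
## Scoring Rubric (100 points)

- **Detection accuracy (40 pts)** - each expected issue found; deduct for misses
- **False positives (20 pts)** - deduct points per issue flagged that isn't real
- **Decision quality (25 pts)** - final recommendation matches ground truth
- **Output format (15 pts)** - valid JSON with all required fields present
```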
### 📈 Performance Tracking

Track improvements over time with JSON-based history:

```json
{
  "seo-specialist": {
    "baseline": { "version": "v1", "score": 88 },
    "current": { "version": "v2", "score": 90 },
    "trend": "improving",
    "runs": [...]
  }
}
```
See at a glance:
- Current score vs. baseline
- Trend (improving/stable/regressing)
- Individual test performance
- Prompt changes and their impact
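As a minimal sketch, assuming the history file matches the structure shown above, you could flag regressions in CI with a few lines of Python:

```python
import json

# Load the benchmark history (structure as shown above).
with open("performance-history.json") as f:
    history = json.load(f)

# Flag any agent whose current score has fallen below its baseline.
for agent, data in history.items():
    baseline = data["baseline"]["score"]
    current = data["current"]["score"]
    if current < baseline:
        print(f"REGRESSION: {agent} dropped from {baseline} to {current}")
    else:
        print(f"OK: {agent} at {current} (baseline {baseline}, trend {data['trend']})")
```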
### 🔄 Intelligent Test Rotation

Keep benchmarks challenging with automated test rotation.

When an agent scores 95+ on all tests:
- Add new challenging test cases
- Keep the agent from "gaming" the tests

When an agent scores 100 three times:
- Retire the test (the agent has mastered it)
- Focus effort on the remaining challenges
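A minimal sketch of those two rules, using hypothetical per-test score histories (the kit's real rotation logic may differ):

```python
# Hypothetical per-test score histories, most recent run last.
scores = {
    "edge-case-01": [92, 97, 100, 100, 100],
    "edge-case-02": [88, 95, 96, 98, 97],
}

# Rule 1: retire a test once the agent has scored 100 on it three times.
retired = [test for test, s in scores.items() if s.count(100) >= 3]

# Rule 2: if every remaining test scores 95+, add fresh challenges.
active = {test: s for test, s in scores.items() if test not in retired}
if active and all(s[-1] >= 95 for s in active.values()):
    print("All active tests at 95+: add new challenging test cases")

for test in retired:
    print(f"Retiring {test}: mastered (scored 100 three times)")
```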