Automated quality assurance for Claude Code agents using LLM-as-judge evaluation.
We built AI agents at BrandCast for SEO optimization, content publishing, and weekly planning, alongside our technical agent fleet. They needed rigorous quality checks and continuous improvement, but manual testing was time-consuming and inconsistent.
So we built an automated benchmarking system using AI to evaluate AI.
We're still very early, but the approach shows promise. We're open-sourcing what we've built so far.
✅ Slash command - /benchmark-agent for one-command testing
✅ Test suite creator - Generate your first benchmark in < 1 hour
✅ LLM-as-judge - Automated, objective scoring
✅ Performance tracking - JSON-based history over time
✅ Test rotation - Keep agents challenged with fresh tests
✅ Complete examples - 2 production-tested benchmark suites
# 1. Install via Claude Code Marketplace
/plugin add https://github.com/BrandCast-Signage/agent-benchmark-kit
# 2. Create your first benchmark
/benchmark-agent --create my-agent
# 3. Answer 5 questions about your agent
# [Interactive prompts guide you through test creation]
# 4. Run the benchmark
/benchmark-agent my-agent
# 5. View results and iterate
# Results show score breakdown and recommendations
We use this framework internally at BrandCast for 7 production agents:
| Agent | Baseline | Current | Improvement |
|---|---|---|---|
| SEO Specialist | 88/100 | 90/100 | +2.3% in 8 days |
| Content Publisher | 97.5/100 | 97.5/100 | Excellent baseline |
| Weekly Planner | 85/100 | 87/100 | Tracked over 12 weeks |
These aren't toy examples. These are production agents serving real users.
graph TD
A[Create Test Suite] --> B[Define Test Cases]
B --> C[Set Ground Truth]
C --> D[Run Benchmarks]
D --> E[Judge Scores Results]
E --> F[Track Performance]
F --> G[Iterate & Improve]
G --> D
Define inputs that test your agent's capabilities. The test-suite-creator agent helps you design 5 diverse, challenging tests.
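As a rough illustration, a test case pairs an input with the scenario it is meant to exercise. The field names below are hypothetical, not the plugin's required schema; this is just a sketch of what a test case for an SEO agent could look like:

```json
{
  "test_id": "seo-03-missing-meta",
  "description": "Article draft with no meta description and a duplicate H1",
  "input": {
    "title": "How Digital Signage Boosts Retail Sales",
    "body": "...",
    "meta_description": null
  },
  "focus": ["metadata validation", "heading structure"]
}
```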
Define expected outputs in JSON format. What should the agent detect? What decisions should it make?
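Continuing the hypothetical SEO test above, the ground truth spells out what a correct run should detect and decide. Again, these keys are illustrative assumptions, not the plugin's actual format:

```json
{
  "test_id": "seo-03-missing-meta",
  "expected": {
    "issues_detected": ["missing meta description", "duplicate H1"],
    "should_block_publish": true,
    "suggested_fixes": 2
  }
}
```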
Execute tests via the /benchmark-agent command. Your agent processes each test case.
The benchmark-judge agent compares actual output to ground truth, scoring objectively (0-100).
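Conceptually, the judge returns a per-test score alongside what matched and what was missed. A hedged sketch of what that output might look like (the real schema may differ):

```json
{
  "test_id": "seo-03-missing-meta",
  "score": 85,
  "matched": ["missing meta description", "should_block_publish"],
  "missed": ["duplicate H1"],
  "reasoning": "Agent caught the missing meta description and correctly blocked publish, but did not flag the duplicate H1."
}
```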
Results stored in performance-history.json. See trends over time, detect regressions.
Use data to guide prompt improvements. Re-run to validate changes.
Problem: Creating test cases manually is hard and time-consuming.
Solution: Answer 5 questions about your agent, get a complete benchmark suite.
/benchmark-agent --create my-agent
# Questions you'll answer:
# 1. What does your agent do?
# 2. What validations does it perform?
# 3. What are common edge cases?
# 4. What would perfect output look like?
# 5. What would failing output look like?
# Generates:
# ✓ 5 diverse test cases
# ✓ Ground truth expectations (JSON)
# ✓ Scoring rubric (METRICS.md)
# ✓ Complete documentation
Time to first benchmark: < 1 hour
Consistent, objective scoring using AI to evaluate AI output.
The benchmark-judge agent compares each output to its ground truth and assigns an objective 0-100 score.
Agreement rate with manual scoring: 95%+
Track improvements over time with JSON-based history.
{
  "seo-specialist": {
    "baseline": { "version": "v1", "score": 88 },
    "current": { "version": "v2", "score": 90 },
    "trend": "improving",
    "runs": [...]
  }
}
See at a glance which agents are improving, which are holding steady, and where regressions appear.
Keep benchmarks challenging with automated test rotation.
When an agent scores 95+ on all tests, or scores 100 three times on the same test, rotation swaps in fresh tests so the benchmark stays challenging.