# Agent Benchmark Kit

Automated quality assurance for Claude Code agents using LLM-as-judge evaluation.

## Why This Exists

We built AI agents at BrandCast for SEO optimization, content publishing, and weekly planning, alongside our technical agent fleet. They needed rigorous quality checks and continuous improvement, but manual testing was time-consuming and inconsistent.

So we built an automated benchmarking system that uses AI to evaluate AI.

We're still very early, but the approach shows promise, so we're open-sourcing what we've built so far.
## What You Get

- ✅ **Slash command** - `/benchmark-agent` for one-command testing
- ✅ **Test suite creator** - generate your first benchmark in < 1 hour
- ✅ **LLM-as-judge** - automated, objective scoring
- ✅ **Performance tracking** - JSON-based history over time
- ✅ **Test rotation** - keep agents challenged with fresh tests
- ✅ **Complete examples** - 2 production-tested benchmark suites
## Quick Start

```bash
# 1. Install via Claude Code Marketplace
/plugin add https://github.com/BrandCast-Signage/agent-benchmark-kit

# 2. Create your first benchmark
/benchmark-agent --create my-agent

# 3. Answer 5 questions about your agent
#    [Interactive prompts guide you through test creation]

# 4. Run the benchmark
/benchmark-agent my-agent

# 5. View results and iterate
#    Results show score breakdown and recommendations
```
## Real-World Results

We use this framework internally at BrandCast for 7 production agents. A sample:

| Agent | Baseline | Current | Improvement |
|---|---|---|---|
| SEO Specialist | 88/100 | 90/100 | +2.3% in 8 days |
| Content Publisher | 97.5/100 | 97.5/100 | Excellent baseline |
| Weekly Planner | 85/100 | 87/100 | Tracked over 12 weeks |

These aren't toy examples; they are production agents serving real users.
## How It Works

```mermaid
graph TD
    A[Create Test Suite] --> B[Define Test Cases]
    B --> C[Set Ground Truth]
    C --> D[Run Benchmarks]
    D --> E[Judge Scores Results]
    E --> F[Track Performance]
    F --> G[Iterate & Improve]
    G --> D
```
### 1. Create Test Cases

Define inputs that test your agent's capabilities. The test-suite-creator agent helps you design 5 diverse, challenging tests.
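As a sketch, a test case pairs an input with a description of what it exercises. The schema below is illustrative, not the kit's exact format:

```json
{
  "id": "seo-edge-case-03",
  "description": "Page with duplicate title tags and a missing meta description",
  "input": "<html><head><title>Home</title><title>Home</title></head></html>",
  "capabilities_tested": ["duplicate-tag detection", "missing-metadata detection"]
}
```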
### 2. Set Ground Truth

Define expected outputs in JSON format. What should the agent detect? What decisions should it make?
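For the hypothetical test case above, a ground-truth file might look like this (again, the field names are illustrative assumptions, not the kit's exact schema):

```json
{
  "test_id": "seo-edge-case-03",
  "expected": {
    "issues_detected": ["duplicate_title_tag", "missing_meta_description"],
    "decision": "flag_for_review"
  }
}
```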
### 3. Run Benchmarks

Execute tests via the `/benchmark-agent` command. Your agent processes each test case.
### 4. Judge Scores Results

The benchmark-judge agent compares actual output to ground truth, scoring objectively on a 0-100 scale.
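A judge report, sketched here with illustrative field names, records the score alongside what was caught and what was missed:

```json
{
  "test_id": "seo-edge-case-03",
  "score": 85,
  "matched": ["duplicate_title_tag"],
  "missed": ["missing_meta_description"],
  "false_positives": [],
  "feedback": "Detected the duplicate title but overlooked the absent meta description."
}
```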
### 5. Track Performance

Results are stored in `performance-history.json`, so you can see trends over time and detect regressions.
### 6. Iterate & Improve

Use the data to guide prompt improvements, then re-run the benchmark to validate your changes.
## Key Features

### 🎯 Interactive Test Suite Creator

**Problem:** Creating test cases manually is hard and time-consuming.

**Solution:** Answer 5 questions about your agent and get a complete benchmark suite.
```bash
/benchmark-agent --create my-agent

# Questions you'll answer:
# 1. What does your agent do?
# 2. What validations does it perform?
# 3. What are common edge cases?
# 4. What would perfect output look like?
# 5. What would failing output look like?

# Generates:
# ✓ 5 diverse test cases
# ✓ Ground truth expectations (JSON)
# ✓ Scoring rubric (METRICS.md)
# ✓ Complete documentation
```

**Time to first benchmark:** < 1 hour
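The generated suite might be laid out along these lines (directory names are illustrative; check the kit's actual output for the real structure):

```
benchmarks/my-agent/
├── test-cases/       # 5 input scenarios
├── ground-truth/     # expected outputs (JSON)
├── METRICS.md        # scoring rubric
└── README.md         # suite documentation
```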
### 📊 LLM-as-Judge Evaluation

Consistent, objective scoring that uses AI to evaluate AI output.

The benchmark-judge agent:
- Compares actual output to expected results
- Scores using your custom rubric (0-100 scale)
- Identifies false positives and missed issues
- Provides detailed feedback

**Agreement rate with manual scoring:** 95%+
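A rubric gives the judge concrete point allocations. As an illustrative sketch of what a METRICS.md might contain (the weights and categories are assumptions, not the kit's defaults):

```markdown
## Scoring Rubric (100 points)

- **Detection accuracy (40 pts)** - each expected issue found; deduct for misses
- **False positives (20 pts)** - deduct points per issue flagged that isn't real
- **Decision quality (25 pts)** - final recommendation matches ground truth
- **Output format (15 pts)** - valid JSON with all required fields present
```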
### 📈 Performance Tracking

Track improvements over time with JSON-based history:

```json
{
  "seo-specialist": {
    "baseline": { "version": "v1", "score": 88 },
    "current": { "version": "v2", "score": 90 },
    "trend": "improving",
    "runs": [...]
  }
}
```
See at a glance:
- Current score vs. baseline
- Trend (improving/stable/regressing)
- Individual test performance
- Prompt changes and their impact
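As a minimal sketch, assuming the history file matches the structure shown above, you could flag regressions in CI with a few lines of Python:

```python
import json

# Load the benchmark history (structure as shown above).
with open("performance-history.json") as f:
    history = json.load(f)

# Flag any agent whose current score has fallen below its baseline.
for agent, data in history.items():
    baseline = data["baseline"]["score"]
    current = data["current"]["score"]
    if current < baseline:
        print(f"REGRESSION: {agent} dropped from {baseline} to {current}")
    else:
        print(f"OK: {agent} at {current} (baseline {baseline}, trend {data['trend']})")
```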
### 🔄 Intelligent Test Rotation

Keep benchmarks challenging with automated test rotation.

When an agent scores 95+ on all tests:
- Add new challenging test cases
- Keep the agent from "gaming" the tests

When an agent scores 100 three times:
- Retire the test (the agent has mastered it)
- Focus effort on the remaining challenges
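A minimal sketch of those two rules, using hypothetical per-test score histories (the kit's real rotation logic may differ):

```python
# Hypothetical per-test score histories, most recent run last.
scores = {
    "edge-case-01": [92, 97, 100, 100, 100],
    "edge-case-02": [88, 95, 96, 98, 97],
}

# Rule 1: retire a test once the agent has scored 100 on it three times.
retired = [test for test, s in scores.items() if s.count(100) >= 3]

# Rule 2: if every remaining test scores 95+, add fresh challenges.
active = {test: s for test, s in scores.items() if test not in retired}
if active and all(s[-1] >= 95 for s in active.values()):
    print("All active tests at 95+: add new challenging test cases")

for test in retired:
    print(f"Retiring {test}: mastered (scored 100 three times)")
```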