Plugin

harness-eval

Name: harness-eval
Author: whchoi98

Run 3-tier (quick/standard/full) multi-agent evaluations of Claude Code harness projects, scoring across 12 dimensions like safety, completeness, and design quality, with history comparison, trend analysis, and bilingual report generation.

code-quality

testing

security

What's Inside

Slash Commands6

Commands Module

/CLAUDE

User-facing slash commands for evaluation. Each `.md` file in this directory is auto-discovered by Claude Code as a `/harness-eval:<name>` command.

Compare

/compare

Compare two harness evaluations side by side

Full

/full

Full harness evaluation — multi-agent comprehensive review (~5-10min)

Usage

/harness-eval

Evaluate Claude Code harness engineering quality

Quick

/quick

Quick harness evaluation — checklist-based scoring (~30s)

Agents6

Agents Module

/CLAUDE

Subagents for Full mode evaluation. Spawned in parallel by `skills/full.md` to perform qualitative analysis of target projects.

collector

/collector

Scans project structure and collects harness artifacts for evaluation. Produces a structured project overview consumed by evaluator agents.

completeness-evaluator

/completeness-evaluator

Evaluates harness actionability, testability, and contract-based testing. Assesses whether components are usable, tested, and have clear interfaces.

design-evaluator

/design-evaluator

Evaluates harness architecture quality based on Anthropic's harness design patterns. Analyzes agent communication, context management, feedback loops, and evolvability.

safety-evaluator

/safety-evaluator

Evaluates harness safety posture and cost efficiency. Deep analysis of tool permissions, deny lists, secret patterns, and model/tool cost optimization.

Skills4

compare

/compare

Compare harness evaluation history — shows score trends, per-tier deltas, diminishing returns detection, and next grade projection.

full

/full

Full harness evaluation — multi-agent deep analysis across 12 dimensions (safety, completeness, design quality) with parallel evaluators and synthesized report. Takes 5-10 minutes. Produces a comprehensive scored report with executive summary and improvement roadmap.

quick

/quick

Quick harness evaluation — checklist-based scoring in ~30 seconds. Runs deterministic checks against the target project and produces a score, grade, and improvement suggestions.

standard

/standard

Standard harness evaluation — static analysis, dynamic testing, and checklist scoring in 2-3 minutes. Produces a detailed report with findings and improvement roadmap.

Hooks1

Event Hooks

1 hook across 1 event

Stats

Version0.1.0

LanguageShell

Stars16

Forks1

MaintenanceGood

LicenseMIT

Last CommitApr 6, 2026

AddedApr 16, 2026

Actions

View on GitHub View README Plugin Marketplace JSON

Available In

harness-eval17

Safety Signals

Caution

Uses power tools

Uses Bash, Write, or Edit tools

harness-eval

A Claude Code plugin for systematic 3-tier evaluation of harness engineering quality Claude Code 하네스 엔지니어링 품질을 체계적으로 평가하는 플러그인

English

Overview

harness-eval is a Claude Code plugin that systematically evaluates the engineering quality of Claude Code harness configurations. It combines deterministic script-based quantitative checks with AI agent-powered qualitative reviews through a 3-tier evaluation system (Quick / Standard / Full).

The plugin scores projects across 6 dimensions — correctness, safety, completeness, actionability, consistency, and testability — producing structured reports with letter grades (A+ through F) and improvement roadmaps.

Features

3-Tier Evaluation System — Choose evaluation depth: Quick (< 30s checklist), Standard (static + dynamic analysis), or Full (multi-agent parallel review)
Multi-Agent Analysis — Full mode dispatches 5 specialized agents (collector, safety-evaluator, completeness-evaluator, design-evaluator, synthesizer) for comprehensive assessment
Evaluation History Tracking — Save, list, and compare evaluation results over time with trend analysis and delta reporting
Badge Generation — Automatically generate score-based badges (A+ through F) in SVG and Markdown formats
4-Level Test Fixtures — Validate scoring accuracy across minimal, functional, robust, and production maturity levels

Prerequisites

Bash 4+
jq 1.6+
Python 3.6+
Git
Claude Code CLI

Installation

Install

From the terminal (shell):

claude plugin marketplace add https://github.com/whchoi98/harness-eval
claude plugin install harness-eval@harness-eval

Or from inside a Claude Code session:

/plugin marketplace add https://github.com/whchoi98/harness-eval
/plugin install harness-eval@harness-eval

Verify

From the terminal:

claude plugin list

Or from inside a Claude Code session:

/plugin list

Update

From the terminal:

claude plugin marketplace refresh
claude plugin install harness-eval@harness-eval

Or from inside a Claude Code session:

/plugin marketplace refresh
/plugin install harness-eval@harness-eval

Uninstall

From the terminal:

claude plugin remove harness-eval
claude plugin marketplace remove harness-eval

Or from inside a Claude Code session:

/plugin remove harness-eval
/plugin marketplace remove harness-eval

From Source (For Development)

git clone https://github.com/whchoi98/harness-eval.git
cd harness-eval/plugins/harness-eval
bash scripts/setup.sh

Usage

Run evaluations from inside a Claude Code session:

/harness-eval:quick              # Checklist-based scoring (~30s)
/harness-eval:standard           # Static + dynamic analysis (~2-3min)
/harness-eval:full               # Multi-agent comprehensive review (~5-10min)
/harness-eval:compare            # Compare with previous evaluation
/harness-eval:harness-eval full  # Argument style also works

Run evaluation scripts directly:

# Score a target project
HARNESS_EVAL_ROOT=$(pwd) bash scripts/scoring.sh /path/to/target-project
# Output: {"score": 7.2, "grade": "B", "checks": [...]}

# Run static analysis
HARNESS_EVAL_ROOT=$(pwd) bash scripts/static-analysis.sh /path/to/target-project
# Output: {"summary": {"pass": 12, "warn": 1, "fail": 0, "total": 13}, ...}

# View evaluation history
HARNESS_EVAL_ROOT=$(pwd) bash scripts/history.sh list /path/to/target-project
# Output: [{"id": "eval-2026-04-06-001", "score": 7.2, ...}]

# Generate badge
bash scripts/badge.sh /path/to/target-project
# Output: Badge SVG/Markdown

Configuration

Variable	Description	Default
`HARNESS_EVAL_ROOT`	Path to the harness-eval plugin root directory	(required)
`CLAUDE_NOTIFY_WEBHOOK`	Webhook URL for evaluation completion notifications	(empty, disabled)

harness-eval

What's Inside

harness-eval

Popularity

What's Inside

Confidence

README

harness-eval

English

Overview

Features

Prerequisites

Installation

Install

Verify

Update

Uninstall

From Source (For Development)

Usage

Configuration

Project Structure

Similar Plugins

Harness Score

harness-session

quality-engineering

code-autopsy

More by whchoi98

aws-skills-for-claude-code

harness-eval

English

Overview

Features

Prerequisites

Installation

Install

Verify

Update

Uninstall

From Source (For Development)

Usage

Configuration

Project Structure

Popularity

Health & Quality

More by whchoi98

aws-skills-for-claude-code

Similar Plugins

Harness Score

harness-session

quality-engineering

code-autopsy

harness-engineering

code-quality-orchestrator