By whchoi98
Run quick, standard, or full multi-agent evaluations on Claude Code harness projects to score engineering quality across safety, completeness, design, and 9 other dimensions; generate bilingual Markdown reports with improvement roadmaps; compare histories for trends, deltas, and projections; produce badges.
npx claudepluginhub whchoi98/harness-eval --plugin harness-eval
User-facing slash commands for evaluation. Each `.md` file in this directory is auto-discovered by Claude Code as a `/harness-eval:<name>` command.
Compare two harness evaluations side by side
Full harness evaluation — multi-agent comprehensive review (~5-10min)
Evaluate Claude Code harness engineering quality
Quick harness evaluation — checklist-based scoring (~30s)
Standard harness evaluation — static + dynamic analysis (~2-3min)
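The naming convention for the slash commands listed above (one `.md` file per command, exposed as `/harness-eval:<name>`) can be illustrated with a minimal sketch. This is not Claude Code's actual discovery logic. The file names used here (`quick.md`, `full.md`) are assumptions for illustration, and `command_name` and `discover` are hypothetical helpers.

```python
from pathlib import Path

# Plugin name as it appears in the install command.
PLUGIN_NAME = "harness-eval"


def command_name(md_file: str, plugin: str = PLUGIN_NAME) -> str:
    """Map a command file name to its slash-command name (hypothetical helper)."""
    # Path.stem drops the directory part and the ".md" extension.
    return f"/{plugin}:{Path(md_file).stem}"


def discover(filenames: list[str]) -> list[str]:
    """Return slash-command names for every .md file, sorted alphabetically."""
    return sorted(command_name(f) for f in filenames if f.endswith(".md"))


if __name__ == "__main__":
    # File names below are illustrative assumptions, not a listing of the repo.
    for name in discover(["quick.md", "full.md", "notes.txt"]):
        print(name)
```

Files without the `.md` extension are ignored in this sketch, which mirrors the statement that only `.md` files become commands.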
Subagents for Full mode evaluation. Spawned in parallel by `skills/full.md` to perform qualitative analysis of target projects.
Scans project structure and collects harness artifacts for evaluation. Produces a structured project overview consumed by evaluator agents.
Evaluates harness actionability, testability, and contract-based testing. Assesses whether components are usable, tested, and have clear interfaces.
Evaluates harness architecture quality based on Anthropic's harness design patterns. Analyzes agent communication, context management, feedback loops, and evolvability.
Evaluates harness safety posture and cost efficiency. Deep analysis of tool permissions, deny lists, secret patterns, and model/tool cost optimization.
Aggregates evaluation results from all agents into a comprehensive 12-dimension report with weighted scoring, grade assignment, and prioritized improvement roadmap.
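The weighted scoring and grade assignment mentioned for the aggregation step might look roughly like the sketch below. The listing does not publish the plugin's weights or grade thresholds, so the dimension weights, the 0-100 scale, and the letter bands here are all assumptions chosen for illustration.

```python
# Hypothetical grade bands: (minimum score, letter). Not taken from the plugin.
GRADE_BANDS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-dimension scores, each assumed to be on a 0-100 scale."""
    total_weight = sum(weights[dim] for dim in scores)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(scores[dim] * weights[dim] for dim in scores) / total_weight


def grade(score: float) -> str:
    """Map a numeric score to a letter grade using the assumed bands."""
    for threshold, letter in GRADE_BANDS:
        if score >= threshold:
            return letter
    return "F"


if __name__ == "__main__":
    # Three of the twelve dimensions, with made-up scores and weights.
    scores = {"safety": 80, "completeness": 90, "design": 70}
    weights = {"safety": 2, "completeness": 1, "design": 1}
    total = weighted_score(scores, weights)
    print(total, grade(total))
```

Normalizing by the sum of the weights keeps the result on the same scale as the inputs, so the weights can be changed without rescaling the grade bands.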
Compare harness evaluation history — shows score trends, per-tier deltas, diminishing returns detection, and next grade projection.
Full harness evaluation — multi-agent deep analysis across 12 dimensions (safety, completeness, design quality) with parallel evaluators and synthesized report. Takes 5-10 minutes. Produces a comprehensive scored report with executive summary and improvement roadmap.
Quick harness evaluation — checklist-based scoring in ~30 seconds. Runs deterministic checks against the target project and produces a score, grade, and improvement suggestions.
Standard harness evaluation — static analysis, dynamic testing, and checklist scoring in 2-3 minutes. Produces a detailed report with findings and improvement roadmap.
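The history comparison described above (score trends, deltas, diminishing returns detection, next grade projection) can be sketched for a single overall score as follows. The plugin's real heuristics are not documented in this listing. The `min_gain` and `window` parameters and the linear projection are assumptions, and per-tier deltas are left out for brevity.

```python
import math
from typing import Optional


def deltas(history: list[float]) -> list[float]:
    """Score change between each pair of consecutive evaluations."""
    return [later - earlier for earlier, later in zip(history, history[1:])]


def diminishing_returns(history: list[float], min_gain: float = 1.0, window: int = 2) -> bool:
    """True when each of the last `window` gains is below `min_gain` (assumed rule)."""
    recent = deltas(history)[-window:]
    return len(recent) == window and all(d < min_gain for d in recent)


def runs_to_reach(history: list[float], target: float) -> Optional[int]:
    """Linear projection: evaluations needed to reach `target` at the average gain."""
    if history and history[-1] >= target:
        return 0
    steps = deltas(history)
    if not steps:
        return None  # fewer than two evaluations, so no trend to project
    average_gain = sum(steps) / len(steps)
    if average_gain <= 0:
        return None  # flat or falling trend never reaches the target
    return math.ceil((target - history[-1]) / average_gain)


if __name__ == "__main__":
    history = [60, 70, 76, 78]  # made-up overall scores from four runs
    print(deltas(history))
    print(diminishing_returns(history))
    print(runs_to_reach(history, 80))
```

A linear projection is the simplest choice. When gains are shrinking, as in the sample history, it will tend to underestimate the number of runs needed.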