From harness-eval
Compares harness evaluation history: shows score trends, per-tier deltas, diminishing returns detection, grade projections, bilingual reports, and ASCII charts. Useful after 2+ evaluations.
npx claudepluginhub whchoi98/harness-eval --plugin harness-eval
How this skill is triggered (by the user, by Claude, or both):
Slash command
/harness-eval:compare
The summary Claude sees in its skill listing, used to decide when to auto-load this skill:
You are performing a harness evaluation comparison. This analyzes evaluation history to show trends and improvements.
Get evaluation history: Run:
HARNESS_EVAL_ROOT="${CLAUDE_PLUGIN_ROOT}" bash "${CLAUDE_PLUGIN_ROOT}/scripts/history.sh" "$(pwd)" list
This returns a JSON array of past evaluations.
Check minimum history: If fewer than 2 evaluations exist, inform the user:
"Not enough evaluation history to compare. Run /harness-eval quick or /harness-eval standard at least twice to enable comparison."
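The minimum-history check can be sketched as follows. This is a minimal Python sketch, assuming only what is stated above: the `list` output is a JSON array with one element per evaluation. The helper name `has_enough_history` and the `id` field in the sample records are hypothetical; the real record fields are not documented here.

```python
import json

MIN_EVALUATIONS = 2  # threshold stated in the step above


def has_enough_history(list_output: str) -> bool:
    """Return True when the JSON array of evaluations holds at least two entries."""
    evaluations = json.loads(list_output)
    if not isinstance(evaluations, list):
        raise ValueError("expected a JSON array of evaluations")
    return len(evaluations) >= MIN_EVALUATIONS


# Hypothetical sample output; only the array shape matters for this check.
sample = '[{"id": "eval-04-06-001"}, {"id": "eval-04-06-002"}]'
print(has_enough_history(sample))  # True
print(has_enough_history("[]"))    # False
```

When the function returns False, the comparison stops and the user sees the message quoted above.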
Get comparison data: Run:
HARNESS_EVAL_ROOT="${CLAUDE_PLUGIN_ROOT}" bash "${CLAUDE_PLUGIN_ROOT}/scripts/history.sh" "$(pwd)" compare
This returns the delta between the current and previous evaluations.
Present bilingual comparison report (English first, then ---, then Korean):
# Harness Evaluation Comparison
## Current vs Previous
| Metric | Previous | Current | Delta |
|--------|----------|---------|-------|
| Score | {prev_score}/10 | {curr_score}/10 | {delta} |
| Grade | {prev_grade} | {curr_grade} | {changed?} |
## Per-Tier Changes
| Tier | Previous | Current | Delta |
|------|----------|---------|-------|
| Basic | X/Y | X/Y | ↑/↓/→ |
| Functional | X/Y | X/Y | ↑/↓/→ |
| Robust | X/Y | X/Y | ↑/↓/→ |
| Production | X/Y | X/Y | ↑/↓/→ |
---
# 하네스 평가 비교
## 현재 vs 이전
| 지표 | 이전 | 현재 | 변화 |
|------|------|------|------|
| 점수 | {prev_score}/10 | {curr_score}/10 | {delta} |
| 등급 | {prev_grade} | {curr_grade} | {changed?} |
## 단계별 변화
| 단계 | 이전 | 현재 | 변화 |
|------|------|------|------|
| 기본 | X/Y | X/Y | ↑/↓/→ |
| 기능적 | X/Y | X/Y | ↑/↓/→ |
| 견고 | X/Y | X/Y | ↑/↓/→ |
| 프로덕션 | X/Y | X/Y | ↑/↓/→ |
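Filling the Delta column amounts to a subtraction plus a direction marker. The following is a minimal sketch, assuming scores are plain numbers on the 10-point scale. `format_delta` is a hypothetical helper, and the signed one-decimal format is an assumption, since the template only shows `{delta}` and the ↑/↓/→ markers.

```python
def format_delta(previous: float, current: float) -> str:
    """Return a signed delta with an arrow: ↑ improved, ↓ regressed, → unchanged."""
    delta = round(current - previous, 1)  # one decimal, matching scores like 7.2
    if delta > 0:
        return f"↑ +{delta:.1f}"
    if delta < 0:
        return f"↓ {delta:.1f}"
    return "→ 0.0"


print(format_delta(7.2, 7.9))  # ↑ +0.7
print(format_delta(8.5, 7.9))  # ↓ -0.6
print(format_delta(7.9, 7.9))  # → 0.0
```

Because tables are identical in both language sections, the same formatted string can be reused in the English and Korean tables.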
Score history chart: If 3+ evaluations exist, show an ASCII bar chart:
## Score History
eval-04-06-001 ████████░░ 7.2 B
eval-04-06-002 █████████░ 7.9 B
eval-04-06-003 █████████░ 8.5 A-
Use █ for filled cells and ░ for empty cells, with a total bar width of 10 characters.
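The chart format above can be reproduced with a few lines of code. This is a sketch under stated assumptions: the source does not give a rounding rule for turning a score into filled cells, so the code rounds half up to the nearest cell, and its bars may differ by one cell from the illustration above. `render_bar` and `render_history` are hypothetical names.

```python
FILLED, EMPTY, WIDTH = "█", "░", 10


def render_bar(score: float, max_score: float = 10.0) -> str:
    """Render a score as a fixed-width bar of filled and empty cells."""
    clamped = max(0.0, min(score, max_score))
    # Assumption: round half up to the nearest cell; the source states no rule.
    filled = int(clamped / max_score * WIDTH + 0.5)
    return FILLED * filled + EMPTY * (WIDTH - filled)


def render_history(rows: list[tuple[str, float, str]]) -> str:
    """Format (evaluation id, score, grade) rows as one chart line each."""
    return "\n".join(
        f"{eval_id} {render_bar(score)} {score:.1f} {grade}"
        for eval_id, score, grade in rows
    )


print(render_history([
    ("eval-04-06-001", 7.2, "B"),
    ("eval-04-06-002", 7.9, "B"),
    ("eval-04-06-003", 8.5, "A-"),
]))
```

Clamping keeps the bar at exactly 10 characters even if a score falls outside the 0 to 10 range.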
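Trend analysis: Describe the score trend across evaluations, flag diminishing returns, and project the likely next grade.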
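The source does not define how diminishing returns are detected. One possible reading, sketched below purely as an assumption, is to compute the gain between consecutive scores and flag the trend when the latest gain is still positive but smaller than the one before it. Both function names are hypothetical.

```python
def score_gains(scores: list[float]) -> list[float]:
    """Return the change between each pair of consecutive scores, oldest first."""
    return [round(b - a, 2) for a, b in zip(scores, scores[1:])]


def has_diminishing_returns(scores: list[float]) -> bool:
    """Assumed heuristic: still improving, but by less than in the previous step."""
    gains = score_gains(scores)
    if len(gains) < 2:
        return False  # needs 3+ evaluations to compare two gains
    previous_gain, latest_gain = gains[-2], gains[-1]
    return 0 < latest_gain < previous_gain


history = [7.2, 7.9, 8.5]  # scores from the chart illustration above
print(score_gains(history))              # [0.7, 0.6]
print(has_diminishing_returns(history))  # True
```

A real implementation might use a minimum threshold or a longer window instead; this heuristic only illustrates the idea.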
Recommendations: Based on the comparison, suggest the highest-impact actions to continue improving.
Save reports to files: Save the English and Korean comparison reports as separate files:
mkdir -p .harness-eval/reports
.harness-eval/reports/eval-{YYYY-MM-DD}-{NNN}-compare-en.md
.harness-eval/reports/eval-{YYYY-MM-DD}-{NNN}-compare-ko.md
Use the Write tool to create each file. Inform the user of the saved file paths.
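The two report paths follow one pattern and differ only in the language suffix. Below is a minimal sketch of building them, assuming `{NNN}` is a zero-padded three-digit sequence number (the source does not say how it is chosen). `report_paths` is a hypothetical helper, and the date in the usage lines is an arbitrary example.

```python
from datetime import date
from pathlib import Path

REPORT_DIR = Path(".harness-eval/reports")


def report_paths(day: date, sequence: int) -> dict[str, Path]:
    """Build the English and Korean comparison report paths for one evaluation."""
    stem = f"eval-{day.isoformat()}-{sequence:03d}-compare"
    return {lang: REPORT_DIR / f"{stem}-{lang}.md" for lang in ("en", "ko")}


paths = report_paths(date(2025, 4, 6), 3)  # arbitrary example date and sequence
print(paths["en"].as_posix())  # .harness-eval/reports/eval-2025-04-06-003-compare-en.md
print(paths["ko"].as_posix())  # .harness-eval/reports/eval-2025-04-06-003-compare-ko.md
```

Deriving both paths from one stem keeps the English and Korean files paired under the same evaluation identifier.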
Be analytical and forward-looking. Focus on trajectory and momentum, not just current state.
Always produce the report in both English and Korean. English section first, then a horizontal rule (---), then the Korean section. Tables, scores, and charts are identical in both sections — only the prose text (analysis, recommendations, warnings) differs.