Skill

compare

Compares harness evaluation history: reports score trends and per-tier deltas, detects diminishing returns, projects grades, and produces bilingual reports with ASCII charts. Useful once at least two evaluations exist.

Bash

Markdown

code-quality

testing

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/harness-eval:compare

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You are performing a harness evaluation comparison. This analyzes evaluation history to show trends and improvements.

SKILL.md

106 lines · ~956 tokens

Stats

Language: Shell

Parent stars: 16

Parent forks: 1

Maintenance: Good

Last Commit: Apr 6, 2026

Steps

Get evaluation history: Run:

HARNESS_EVAL_ROOT="${CLAUDE_PLUGIN_ROOT}" bash "${CLAUDE_PLUGIN_ROOT}/scripts/history.sh" "$(pwd)" list

This returns a JSON array of past evaluations.

Check minimum history: If fewer than 2 evaluations exist, inform the user: "Not enough evaluation history to compare. Run /harness-eval quick or /harness-eval standard at least twice to enable comparison."
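The minimum-history check can be sketched as a small shell helper. This is an illustration only: `check_min_history` is a hypothetical name, not part of the plugin, and the commented usage assumes `jq` is installed and that the `list` command prints its JSON array on stdout.

```shell
# Hypothetical helper (not part of the plugin): gate the comparison on history size.
# $1 = number of stored evaluations.
check_min_history() {
  if [ "$1" -lt 2 ]; then
    echo "Not enough evaluation history to compare. Run /harness-eval quick or /harness-eval standard at least twice to enable comparison."
    return 1
  fi
  echo "Found $1 evaluations; comparison can proceed."
}

# Usage sketch, assuming jq is available and `list` prints a JSON array:
#   count="$(HARNESS_EVAL_ROOT="${CLAUDE_PLUGIN_ROOT}" bash "${CLAUDE_PLUGIN_ROOT}/scripts/history.sh" "$(pwd)" list | jq 'length')"
#   check_min_history "$count" || exit 0
```

Taking the count as an argument keeps the helper independent of how the history JSON is parsed.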

Get comparison data: Run:

HARNESS_EVAL_ROOT="${CLAUDE_PLUGIN_ROOT}" bash "${CLAUDE_PLUGIN_ROOT}/scripts/history.sh" "$(pwd)" compare

This returns the delta between the current and the previous evaluation.

Present bilingual comparison report (English first, then ---, then Korean):

# Harness Evaluation Comparison

## Current vs Previous

| Metric | Previous | Current | Delta |
|--------|----------|---------|-------|
| Score | {prev_score}/10 | {curr_score}/10 | {delta} |
| Grade | {prev_grade} | {curr_grade} | {changed?} |

## Per-Tier Changes

| Tier | Previous | Current | Delta |
|------|----------|---------|-------|
| Basic | X/Y | X/Y | ↑/↓/→ |
| Functional | X/Y | X/Y | ↑/↓/→ |
| Robust | X/Y | X/Y | ↑/↓/→ |
| Production | X/Y | X/Y | ↑/↓/→ |

---

# 하네스 평가 비교

## 현재 vs 이전

| 지표 | 이전 | 현재 | 변화 |
|------|------|------|------|
| 점수 | {prev_score}/10 | {curr_score}/10 | {delta} |
| 등급 | {prev_grade} | {curr_grade} | {changed?} |

## 단계별 변화

| 단계 | 이전 | 현재 | 변화 |
|------|------|------|------|
| 기본 | X/Y | X/Y | ↑/↓/→ |
| 기능적 | X/Y | X/Y | ↑/↓/→ |
| 견고 | X/Y | X/Y | ↑/↓/→ |
| 프로덕션 | X/Y | X/Y | ↑/↓/→ |
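One way the Delta cells in the tables above might be filled in is sketched below. `delta_cell` is a hypothetical helper, and combining the signed delta with the arrow in a single cell is an assumption about formatting; the template itself only shows `{delta}` and the ↑/↓/→ placeholders.

```shell
# Hypothetical helper (not part of the plugin): format a score change.
# $1 = previous score, $2 = current score.
delta_cell() {
  awk -v p="$1" -v c="$2" 'BEGIN {
    # Round to one decimal first so the printed sign and the arrow always agree.
    d = sprintf("%.1f", c - p) + 0
    if (d > 0)      arrow = "↑"
    else if (d < 0) arrow = "↓"
    else            arrow = "→"
    printf "%+.1f %s\n", d, arrow
  }'
}

delta_cell 7.2 7.9   # prints: +0.7 ↑
```

Shell arithmetic is integer-only, so the subtraction is delegated to `awk`.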

Score history chart: If 3+ evaluations exist, show an ASCII bar chart:

## Score History

eval-04-06-001  ████████░░  7.2  B
eval-04-06-002  █████████░  7.9  B
eval-04-06-003  █████████░  8.5  A-

Use █ for filled, ░ for empty, 10 chars total width.
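A 10-cell bar can be rendered with a short shell function. This is a minimal sketch: `render_bar` is a hypothetical name, and rounding the score to the nearest whole cell is an assumption, since the source does not specify how fractional scores map to cells.

```shell
# Hypothetical helper (not part of the plugin): draw a 10-cell bar for a 0-10 score.
# Assumption: one cell per point, rounded to the nearest cell (half rounds up).
render_bar() {
  local filled bar="" i=0
  filled="$(awk -v s="$1" 'BEGIN {
    n = int(s + 0.5)
    if (n < 0) n = 0
    if (n > 10) n = 10
    print n
  }')"
  while [ "$i" -lt 10 ]; do
    if [ "$i" -lt "$filled" ]; then bar="${bar}█"; else bar="${bar}░"; fi
    i=$((i + 1))
  done
  printf '%s\n' "$bar"
}

# One chart line: evaluation id, bar, score, grade.
printf '%s  %s  %s  %s\n' "eval-04-06-003" "$(render_bar 8.5)" "8.5" "A-"
# prints: eval-04-06-003  █████████░  8.5  A-
```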

Trend analysis:
- Diminishing returns: If the last 3+ deltas show shrinking improvements (e.g., +0.7, +0.6, +0.5), warn: "Score improvements are shrinking — further gains will require infrastructure investments (CI/CD, integration tests, performance benchmarks)."
- Grade projection: Based on current score and trend, estimate when the next grade threshold will be reached.
- Stalled areas: Identify tiers that haven't improved across evaluations.

Recommendations: Based on the comparison, suggest the highest-impact actions to continue improving.
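The diminishing-returns check under Trend analysis can be sketched as follows. `detect_diminishing` is a hypothetical helper; it looks only at the last three deltas and assumes scores carry one decimal place, as in the sample chart. Grade projection is left out because the grade thresholds are not given here.

```shell
# Hypothetical helper (not part of the plugin).
# Arguments: scores from oldest to newest, e.g. 6.0 6.7 7.3 7.8
detect_diminishing() {
  if [ "$#" -lt 4 ]; then
    echo "insufficient-history"   # three deltas need at least four scores
    return 0
  fi
  printf '%s\n' "$@" | awk '
    # Delta in tenths, rounded, so float noise cannot reorder equal deltas.
    function tenths(a, b) { return sprintf("%.0f", (a - b) * 10) + 0 }
    { s[NR] = $1 }
    END {
      d1 = tenths(s[NR-2], s[NR-3])
      d2 = tenths(s[NR-1], s[NR-2])
      d3 = tenths(s[NR], s[NR-1])
      # Still improving, but each gain is smaller than the one before it.
      if (d3 > 0 && d1 > d2 && d2 > d3) print "diminishing"
      else print "not-diminishing"
    }'
}

detect_diminishing 6.0 6.7 7.3 7.8   # deltas +0.7, +0.6, +0.5; prints: diminishing
```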

Save reports to files: Save the English and Korean comparison reports as separate files:

```
mkdir -p .harness-eval/reports
```

- English report: .harness-eval/reports/eval-{YYYY-MM-DD}-{NNN}-compare-en.md
- Korean report: .harness-eval/reports/eval-{YYYY-MM-DD}-{NNN}-compare-ko.md

Use the Write tool to create each file. Inform the user of the saved file paths.
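The two report paths can be derived from a date and a sequence number, as in the sketch below. `report_paths` is a hypothetical helper, and how `{NNN}` is assigned is not specified here, so the sequence number is taken as a plain integer argument. The sketch only computes the paths; the files themselves are still created with the Write tool.

```shell
# Hypothetical helper (not part of the plugin): print the English and Korean report paths.
# $1 = sequence number as a plain integer (e.g. 3), $2 = optional date, default today.
report_paths() {
  local seq="$1"
  local day="${2:-$(date +%Y-%m-%d)}"
  local dir=".harness-eval/reports"
  printf '%s/eval-%s-%03d-compare-en.md\n' "$dir" "$day" "$seq"
  printf '%s/eval-%s-%03d-compare-ko.md\n' "$dir" "$day" "$seq"
}

report_paths 3 2026-04-06
# prints:
#   .harness-eval/reports/eval-2026-04-06-003-compare-en.md
#   .harness-eval/reports/eval-2026-04-06-003-compare-ko.md
```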

Error Handling

- If history.sh fails: suggest running an evaluation first.
- If only 1 evaluation exists: show the current score and suggest running again after improvements.
- If compare returns an error: display it clearly.
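The failure cases above can be handled by a thin wrapper around the history command. This is a sketch under assumptions: `run_history` is a hypothetical name, and it assumes the script signals failure through a non-zero exit status.

```shell
# Hypothetical wrapper (not part of the plugin): run a command, pass its output
# through on success, and show the error plus a suggestion on failure.
run_history() {
  local out
  if out="$("$@" 2>&1)"; then
    printf '%s\n' "$out"
  else
    printf 'History command failed: %s\n' "$out" >&2
    printf 'Try running an evaluation first.\n' >&2
    return 1
  fi
}

# Usage sketch:
#   run_history env HARNESS_EVAL_ROOT="${CLAUDE_PLUGIN_ROOT}" \
#     bash "${CLAUDE_PLUGIN_ROOT}/scripts/history.sh" "$(pwd)" compare
```

Passing the command as arguments keeps the wrapper reusable for both `list` and `compare`.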

Tone

Be analytical and forward-looking. Focus on trajectory and momentum, not just current state.

Language

Always produce the report in both English and Korean. English section first, then a horizontal rule (---), then the Korean section. Tables, scores, and charts are identical in both sections — only the prose text (analysis, recommendations, warnings) differs.
