From harness-kit
Reads execution traces from docs/harness-history/traces/, computes composite scores per skill chain, identifies Pareto frontier, and recommends optimal harness configuration for next session.
Slash command: `/harness-kit:harness-evaluator`
You are a **harness performance analyst**. Your responsibility is to read the accumulated execution traces in `docs/harness-history/`, compute objective scores per harness configuration, identify the Pareto frontier, and produce actionable recommendations.
Aggregate session data from docs/harness-history/traces/ and produce a ranked comparison of harness configurations (skill chains). This is the scoring phase of the Meta-Harness loop: it converts raw traces into the signal that the meta-harness proposer uses to diagnose and improve skills.
1. Verify history exists: check that `docs/harness-history/traces/` exists and contains at least one session.
   - If it does not, run `/harness-kit:tdd-orchestrator` first to generate sessions.
2. Read `config.md`: load score weights from `docs/harness-history/config.md`.
   - If it is missing, consult `harness-tracer/SKILL.md`, then proceed.
3. Count available sessions: list all `session-*` directories under `traces/`.
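The checks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `list_sessions` and `preflight` are hypothetical, and the demo builds a throwaway directory tree instead of touching a real `docs/harness-history/`.

```python
from pathlib import Path
import tempfile


def list_sessions(history_root: Path) -> list[Path]:
    """Return the session-* directories under <history_root>/traces, sorted by name."""
    traces = history_root / "traces"
    if not traces.is_dir():
        return []
    return sorted(p for p in traces.glob("session-*") if p.is_dir())


def preflight(history_root: Path) -> dict:
    """Report the preconditions without modifying anything on disk."""
    sessions = list_sessions(history_root)
    return {
        "has_sessions": len(sessions) > 0,
        "has_config": (history_root / "config.md").is_file(),
        "session_count": len(sessions),
    }


# Demo on a temporary tree so the sketch runs anywhere (hypothetical session names).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "harness-history"
    (root / "traces" / "session-2026-05-18-001").mkdir(parents=True)
    (root / "traces" / "session-2026-05-20-002").mkdir()
    demo_status = preflight(root)

print(demo_status)
```

In the demo, `has_config` is false because no `config.md` was created, which corresponds to the fallback branch of the second check.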
Execute steps in order. Do not skip steps.
For each session-*/ directory in docs/harness-history/traces/:
- `metadata.md` → extract: skill_used, agent, task_type, date, featureId
- `score.md` → extract all raw metrics (including reworksCount)
- `verdict.md` → extract Hypothesis and Recommended Change

Store all data in memory as a table:
session_id | featureId | skill_chain | task_type | tdd_cycles | iterations_to_pass | reworksCount | grumpy_open_points | context_docs_read | deviations
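As a rough sketch of how one row of that table could be assembled, the snippet below parses `key: value` lines from the two files. The file layout is an assumption: it borrows the `- **key:** value` shape of the Computed Score block, which may not match the real `metadata.md` and `score.md`. The helper names are hypothetical, and mapping `skill_used` to the `skill_chain` column is also a guess.

```python
import re

# Assumed line shape (borrowed from the "Computed Score" block): "- **key:** value".
FIELD_RE = re.compile(r"^\s*(?:-\s*)?(?:\*\*)?(\w+):(?:\*\*)?\s*(.*?)\s*$")

METRICS = (
    "tdd_cycles", "iterations_to_pass", "reworksCount",
    "grumpy_open_points", "context_docs_read", "deviations",
)


def parse_fields(markdown_text: str) -> dict:
    """Collect key/value pairs from lines that match the assumed shape."""
    fields = {}
    for line in markdown_text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


def build_row(session_id: str, metadata_md: str, score_md: str) -> dict:
    """Build one table row; metrics missing from score.md are stored as None."""
    meta = parse_fields(metadata_md)
    score = parse_fields(score_md)
    row = {
        "session_id": session_id,
        "featureId": meta.get("featureId"),
        "skill_chain": meta.get("skill_used"),  # assumption: skill_used holds the chain
        "task_type": meta.get("task_type"),
    }
    for name in METRICS:
        row[name] = float(score[name]) if name in score else None
    return row


# Made-up sample content, for illustration only.
sample_metadata = (
    "- **skill_used:** tdd-orchestrator → test-driven-development\n"
    "- **task_type:** feature\n"
    "- **featureId:** F-1\n"
)
sample_score = "- **tdd_cycles:** 2\n- **iterations_to_pass:** 1\n- **reworksCount:** 0\n"
row = build_row("session-2026-05-20-002", sample_metadata, sample_score)
print(row)
```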
Group sessions by skill_chain value from metadata.md.
For each group, compute:
- `n_sessions`: count of sessions
- Treat groups with `n_sessions < 3` as insufficient data

Apply the formula from `config.md` to each session's raw metrics.
Compute per-group: mean_score, best_score, worst_score.
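The grouping and per-group aggregation might look like the sketch below. It assumes each session already carries a `composite_score` computed with the formula from `config.md` (that formula is not reproduced here), and the function name `summarize_groups` and the sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean

MIN_SESSIONS = 3  # groups below this size are flagged as insufficient data


def summarize_groups(rows: list) -> dict:
    """Group rows by skill_chain and compute n_sessions, mean, best and worst score."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["skill_chain"]].append(row["composite_score"])

    summary = {}
    for chain, scores in groups.items():
        summary[chain] = {
            "n_sessions": len(scores),
            "insufficient_data": len(scores) < MIN_SESSIONS,
            "mean_score": mean(scores),
            "best_score": max(scores),
            "worst_score": min(scores),
        }
    return summary


# Made-up scores, for illustration only.
rows = [
    {"skill_chain": "chain-a", "composite_score": 0.5},
    {"skill_chain": "chain-a", "composite_score": 0.75},
    {"skill_chain": "chain-a", "composite_score": 1.0},
    {"skill_chain": "chain-b", "composite_score": 0.5},
    {"skill_chain": "chain-b", "composite_score": 0.25},
]
summary = summarize_groups(rows)
for chain, stats in summary.items():
    print(chain, stats)
```

Here `chain-b` has only two sessions, so it is reported but flagged rather than silently dropped.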
Document the computation transparently:
Group: tdd-orchestrator → test-driven-development → systematic-debugging → project-memory
Sessions: 7
tdd_cycles: mean=2.1, std=0.8
iterations_to_pass: mean=1.4, std=0.6
reworksCount: mean=0.4, std=0.2
grumpy_open_points: mean=4.2, std=1.1
context_docs_read: mean=5.1, std=1.0
deviations: mean=0.3, std=0.5
──────────────────────────────
mean_score: 0.74
best_score: 0.89 (session-2026-05-20-002)
worst_score: 0.51 (session-2026-05-18-001)
Identify harness configurations that are not dominated by any other configuration. Configuration A dominates B if A scores higher on ALL metrics simultaneously.
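A minimal sketch of this dominance test follows. It implements the rule exactly as worded above (strictly higher on every metric); the more common textbook definition is weaker, requiring at least as good on all metrics and strictly better on one. The sketch assumes every metric has been oriented so that higher is better (for example by negating counts where lower is better), and the configuration names and values are made up.

```python
def dominates(a: tuple, b: tuple) -> bool:
    """True if a is strictly higher than b on every metric (the rule as stated above)."""
    return all(x > y for x, y in zip(a, b))


def pareto_frontier(configs: dict) -> list:
    """Return the names of configurations that no other configuration dominates."""
    frontier = []
    for name, vector in configs.items():
        dominated = any(
            dominates(other, vector)
            for other_name, other in configs.items()
            if other_name != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical metric vectors, already oriented so that higher is better.
demo_configs = {
    "A": (0.9, 0.8),
    "B": (0.7, 0.6),   # dominated by A on both metrics
    "C": (0.95, 0.5),  # beats A on the first metric, loses on the second
}
frontier = pareto_frontier(demo_configs)
print(frontier)
```

With the strict rule, a tie on any single metric is enough to keep a configuration on the frontier, so the frontier can be larger than under the textbook definition.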
Also compute:
- `mean_score`
- `skill_chain_length` (fewer skills = simpler)

Update `score.md` for Each Session (backfill)

For any session whose `score.md` has `[Leave blank]` in the Computed Score field, fill it in now:
## Computed Score
- **composite_score:** {value}
- **rank:** {N} of {total_sessions} sessions
- **computed_at:** {date}
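One way to produce the values for this block is sketched below. The two-decimal score format, the ISO date format, and the tie handling (ties keep their input order) are assumptions, and the helper names are hypothetical. The session ids and scores in the demo are the ones from the example report above; the date is made up.

```python
from datetime import date


def rank_sessions(scores: dict) -> dict:
    """Map session_id to rank, with rank 1 for the highest composite score."""
    ordered = sorted(scores, key=lambda sid: scores[sid], reverse=True)
    return {sid: position for position, sid in enumerate(ordered, start=1)}


def render_computed_score(score: float, rank: int, total: int, computed_at: date) -> str:
    """Render the Computed Score block in the template's layout."""
    return (
        "## Computed Score\n"
        f"- **composite_score:** {score:.2f}\n"
        f"- **rank:** {rank} of {total} sessions\n"
        f"- **computed_at:** {computed_at.isoformat()}\n"
    )


scores = {
    "session-2026-05-18-001": 0.51,
    "session-2026-05-20-002": 0.89,
}
ranks = rank_sessions(scores)
block = render_computed_score(
    scores["session-2026-05-20-002"],
    ranks["session-2026-05-20-002"],
    len(scores),
    date(2026, 5, 21),  # hypothetical run date
)
print(block)
```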
Update `pareto-frontier.md`

Overwrite `docs/harness-history/pareto-frontier.md` with:
# Pareto Frontier — Best Harness Candidates
Last updated: {date}
Total sessions analyzed: {N}
Skill chains compared: {K}
## Frontier
| Rank | Skill Chain | Mean Score | Best Score | Sessions | Verdict |
|------|-------------|------------|------------|----------|---------|
| 1 | [chain] | 0.87 | 0.94 | 12 | ✅ Recommended |
| 2 | [chain] | 0.81 | 0.89 | 7 | 🔶 Challenger |
| 3 | [chain] | 0.72 | 0.85 | 5 | ⚠️ Insufficient data |
## Analysis Notes
### Best Configuration
**Skill chain:** [full chain]
**Why it wins:** [specific metric where it excels]
**Known weakness:** [metric where it underperforms]
### Patterns Observed
[Up to 5 bullet points describing cross-session patterns extracted from verdicts]
### Hypotheses for Improvement
[Top 3 hypotheses extracted from session verdict.md files, ranked by frequency]
1. [Most repeated hypothesis across sessions]
2. [Second most repeated]
3. [Third]
## Recommendation
For the next session, use:
**`{best_skill_chain}`**
Rationale: [one sentence]
To run a meta-harness improvement cycle based on this data:
`/harness-kit:meta-harness`
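The "Hypotheses for Improvement" section of the template ranks hypotheses by how often they recur across `verdict.md` files. A small sketch of that ranking, assuming hypotheses count as identical once case and whitespace are normalized (a simplification; real verdicts may need fuzzier matching). The function name and the sample strings are made up.

```python
from collections import Counter


def top_hypotheses(hypotheses: list, limit: int = 3) -> list:
    """Return the most frequent hypotheses as (text, count) pairs."""
    # Normalize case and collapse whitespace so trivial variants are counted together.
    normalized = [" ".join(h.lower().split()) for h in hypotheses]
    return Counter(normalized).most_common(limit)


# Made-up hypotheses, for illustration only.
sample = [
    "Add a lint step",
    "add a lint step ",
    "Split large tests",
    "Add a lint step",
    "Split large tests",
    "Cache fixtures",
]
top = top_hypotheses(sample)
for text, count in top:
    print(f"{count}x {text}")
```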
Output:
Harness Evaluator — Analysis Completed
Sessions analyzed: {N}
Configurations compared: {K}
Best current configuration:
{best_skill_chain}
Average score: {mean_score} | Best score: {best_score}
Points of attention:
- {pattern_1}
- {pattern_2}
Most frequent hypotheses for improvement:
1. {hypothesis_1}
2. {hypothesis_2}
Full report: docs/harness-history/pareto-frontier.md
Next step — automatic optimization cycle:
/harness-kit:meta-harness
- Overwrite `pareto-frontier.md` with fresh data on every run.
- Backfill `score.md` for sessions that were not yet scored.
- Do not touch `candidates/`; that is meta-harness territory.
- … `score.md` files.

Install: `npx claudepluginhub romabeckman/harness-kit --plugin harness-kit`