Skill

harness-audit

Scores a project's agent harness across 5 subsystems (Instructions, State, Verification, Scope, Lifecycle) and produces a prioritized improvement plan. Use when assessing long-run readiness or debugging agent failures.

Anthropic

developer-tools

Invocation

Slash command: /claude-code-config:harness-audit

User invocable

Model invocable

Inline context

Default effort

When to use

Trigger on phrases like: "audit my harness", "evaluate my agent setup", "score my CLAUDE.md", "is my project ready for long-run", "5-subsystem assessment", "what's missing from my project setup", "/harness-audit". Run proactively when joining an unfamiliar codebase that has agent artifacts (CLAUDE.md, .claude/, AGENTS.md) but obvious gaps. Skip for single-file scripts and pure exploration.

Supporting Files

references/checklist-per-subsystem.md
references/scoring-rubric.md

SKILL.md

175 lines · ~2.2k tokens

Stats

Language: Python

Stars: 138

Forks: 20

Maintenance: Excellent

Last Commit: Jul 22, 2026

Harness Audit

Score a project's agent harness across five subsystems and tell the user which one to fix first.

Source: Five-subsystem framework adapted from Learn Harness Engineering (walkinglabs, MIT). Adapted to our concrete stack: CLAUDE.md, .claude/rules/, PROBLEMS.md, feature_list.json, init.sh, hooks, handoffs, chronicles.

What This Skill Does

Given a project directory, produces a scorecard like this:

=== Harness Audit: project-xyz ===

Instructions  4/5  ✓ CLAUDE.md present, modular rules in .claude/rules/
                   ✗ No project-level REVIEW.md for PR review guidance
State         2/5  ✓ .claude/handoffs/ exists (3 files)
                   ✗ No PROBLEMS.md - issues scattered in handoffs
                   ✗ No feature_list.json - scope state not machine-readable
Verification  3/5  ✓ Tests run, pytest configured
                   ✗ No init.sh - new sessions take 15+ min to bootstrap
                   ✗ 3-layer gate not documented in CLAUDE.md
Scope         3/5  ✓ no-pre-existing-evasion principle in CLAUDE.md
                   ✗ No WIP=1 (no feature_list.json to enforce it)
                   ✗ Definition of Done not explicit
Lifecycle     2/5  ✗ No SessionStart hook (no .claude/settings.json)
                   ✗ No Stop hook for clean-state check
                   ~ Manual cleanup convention exists but not enforced

Bottleneck: State (2/5) — lack of structured progress tracking

Top 3 improvements (in order):
1. Create PROBLEMS.md (1h)   ↗ State 2→4
   Template: claude-code-skills/templates/long-run-project/ has examples
2. Create feature_list.json + init.sh (30min)   ↗ State 2→5, Verification 3→4
   Drop-in: claude-code-skills/templates/long-run-project/
3. Add Stop hook stop-test-gate.py (15min)   ↗ Lifecycle 2→4
   Source: claude-code-skills/hooks/stop-test-gate.py

After top 3: Instructions 4 + State 5 + Verification 4 + Scope 3 + Lifecycle 4 = 20/25 (was 14/25)

The skill does not make changes. It produces the scorecard. The user decides whether to apply recommendations.

The Five Subsystems (Our Adaptation)

| Subsystem | Concrete files/conventions in our stack |
| --- | --- |
| Instructions | `CLAUDE.md` (root + `~/.claude/`), `.claude/rules/*.md` (project), `~/.claude/rules/*.md` (global), optional `REVIEW.md` |
| State | `PROBLEMS.md`, `feature_list.json`, `.claude/handoffs/`, `.claude/chronicles/` |
| Verification | `init.sh`, tests configured, 3-Layer Validation Gate referenced in CLAUDE.md, Proof Loop usage |
| Scope | `no-pre-existing-evasion.md` rule applied, WIP=1 enforced (one `in-progress` in feature_list.json), explicit Definition of Done |
| Lifecycle | SessionStart hooks, Stop hooks (stop-test-gate, check-problems-md), cleanup convention |

See references/checklist-per-subsystem.md for per-subsystem concrete checks. See references/scoring-rubric.md for how to interpret 1-5 scores.
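As an illustration of what a concrete Scope check might look like, here is a minimal sketch of the WIP=1 test. The schema of feature_list.json is not specified in this skill, so the sketch assumes a top-level JSON list of objects with a `status` field; the function names are hypothetical. It also assumes that zero items in progress is acceptable, which the skill does not state.

```python
import json


def count_in_progress(feature_list_text):
    """Count features whose status is "in-progress".

    Assumes (hypothetically) that feature_list.json is a JSON list of
    objects, each with a "status" key.
    """
    features = json.loads(feature_list_text)
    return sum(1 for feature in features if feature.get("status") == "in-progress")


def wip_limit_respected(feature_list_text):
    """WIP=1: at most one feature may be in progress (zero allowed by assumption)."""
    return count_in_progress(feature_list_text) <= 1


example = json.dumps([
    {"id": "login", "status": "done"},
    {"id": "export", "status": "in-progress"},
    {"id": "search", "status": "todo"},
])
print(wip_limit_respected(example))
```

Because the audit only reads metadata, a check like this parses the file and never modifies it.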

How to Run an Audit

Phase 1 — Gather

Read these files in order (skip silently if missing):

CLAUDE.md in project root
AGENTS.md in project root (some projects use this name)
.claude/rules/*.md (project-level rules)
.claude/settings.json and .claude/settings.local.json (hooks config)
PROBLEMS.md in root
feature_list.json in root
init.sh in root (and Makefile / package.json scripts as fallback)
.claude/handoffs/ (count files, check INDEX.md existence)
.claude/chronicles/ (count files)
Sample test config: pytest.ini / package.json test script / Cargo.toml

Use Glob + Read. Don't grep across the entire codebase: this is a metadata audit, not a code review.
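The gather step can be approximated outside the agent with a few glob patterns. The sketch below is illustrative only: the labels and the `gather` function are hypothetical names, and the patterns are simply the Phase 1 list rewritten as globs relative to the project root. Missing artifacts map to an empty list, mirroring "skip silently if missing".

```python
from pathlib import Path

# Labels are hypothetical; patterns mirror the Phase 1 list, relative to the root.
GATHER_PATTERNS = {
    "CLAUDE.md": ["CLAUDE.md"],
    "AGENTS.md": ["AGENTS.md"],
    "rules": [".claude/rules/*.md"],
    "settings": [".claude/settings.json", ".claude/settings.local.json"],
    "PROBLEMS.md": ["PROBLEMS.md"],
    "feature_list.json": ["feature_list.json"],
    "bootstrap": ["init.sh", "Makefile", "package.json"],
    "handoffs": [".claude/handoffs/*"],
    "chronicles": [".claude/chronicles/*"],
    "test config": ["pytest.ini", "package.json", "Cargo.toml"],
}


def gather(project_root):
    """Return {label: sorted relative paths}; absent artifacts yield an empty list."""
    root = Path(project_root)
    found = {}
    for label, patterns in GATHER_PATTERNS.items():
        matches = []
        for pattern in patterns:
            matches.extend(p.relative_to(root).as_posix() for p in root.glob(pattern))
        found[label] = sorted(matches)
    return found


if __name__ == "__main__":
    for label, paths in gather(".").items():
        print(f"{label}: {len(paths)} found")
```

Only file names and counts are collected here; file contents are read later, and only for the artifacts that exist.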

Phase 2 — Score

For each subsystem, run the checks in references/checklist-per-subsystem.md. Each check is a binary pass/fail. Score:

5 = all checks pass + documented + consistently followed
4 = most checks pass, 1-2 gaps
3 = covers basics, missing polish
2 = weak, several checks fail
1 = missing or actively harmful

For each subsystem, list:

✓ what's present and working
✗ what's missing or broken
~ partial / unclear
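To keep scores consistent across audits, the mapping from binary checks to a 1-5 score can be written down explicitly. The sketch below is one possible reading of the scale above, not the rubric itself: the thresholds (what counts as "most" or "several") and the `documented_and_followed` flag are assumptions, and references/scoring-rubric.md remains the authority.

```python
def score_subsystem(checks, documented_and_followed=False):
    """Map {check_name: passed} to a 1-5 score.

    Thresholds are illustrative assumptions, not taken from the rubric file.
    """
    passed = sum(1 for ok in checks.values() if ok)
    failed = len(checks) - passed
    if passed == 0:
        return 1  # missing entirely (also covers an empty checklist)
    if failed == 0:
        # A 5 requires more than passing checks: documented and followed.
        return 5 if documented_and_followed else 4
    if failed <= 2 and passed > failed:
        return 4  # most checks pass, 1-2 gaps
    if passed >= failed:
        return 3  # covers basics
    return 2  # weak, several checks fail


state_checks = {
    "handoffs directory exists": True,
    "PROBLEMS.md exists": False,
    "feature_list.json exists": False,
}
print(score_subsystem(state_checks))
```

The function cannot detect an "actively harmful" setup; that part of a score of 1 stays a judgment call.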

Phase 3 — Identify Bottleneck

The lowest-scoring subsystem is the bottleneck. Even if another subsystem fails more checks in absolute terms, fix the lowest-scoring one first, because it limits the value of the rest.

Tie-breaker (multiple subsystems at same low score): pick the one whose improvement unlocks progress in others. State usually wins ties because feature_list.json + PROBLEMS.md unlock Verification and Scope checks.
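The selection rule can be sketched as a minimum with a tie-break order. Only "State usually wins ties" comes from the text above; the rest of `TIE_BREAK_ORDER` is a hypothetical ordering chosen for illustration, and in a real audit the tie should be resolved by asking which fix unlocks the most checks elsewhere.

```python
# State first, per the tie-breaker above; the remaining order is an assumption.
TIE_BREAK_ORDER = ["State", "Verification", "Scope", "Instructions", "Lifecycle"]


def find_bottleneck(scores):
    """Return (subsystem, score) for the lowest score, breaking ties by TIE_BREAK_ORDER."""
    lowest = min(scores.values())
    tied = [name for name, score in scores.items() if score == lowest]

    def rank(name):
        # Unknown subsystem names sort after the known ones.
        return TIE_BREAK_ORDER.index(name) if name in TIE_BREAK_ORDER else len(TIE_BREAK_ORDER)

    return min(tied, key=rank), lowest


scores = {"Instructions": 4, "State": 2, "Verification": 3, "Scope": 3, "Lifecycle": 2}
print(find_bottleneck(scores))  # ('State', 2): State and Lifecycle tie, State wins
```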

Phase 4 — Prioritized Improvement Plan

Output exactly 3 next steps in order, each with:

Effort estimate (15min / 30min / 1h / 1d)
Subsystem(s) it improves and by how much (2→4, etc.)
Pointer to a template or example in claude-code-skills/ if available

The 3 steps must:

Address the bottleneck first
Be independently shippable (no step depends on a later one)
Together raise the total score by at least 4 points (out of 25)

Do not give more than 3. Three is enough scope for one focused session.
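The mechanical parts of these constraints can be checked before the plan is shown to the user. The following is a sketch under assumptions: `Step` and `check_plan` are hypothetical names, and each step's `gains` maps a subsystem to the score projected after that step. Whether a step is independently shippable cannot be verified this way and still needs judgment.

```python
from dataclasses import dataclass


@dataclass
class Step:
    title: str
    effort: str  # e.g. "15min", "30min", "1h", "1d"
    gains: dict  # subsystem name -> projected score after this step


def check_plan(steps, scores, bottleneck):
    """Return (problems, projected_scores) for a proposed improvement plan."""
    problems = []
    if len(steps) != 3:
        problems.append(f"expected exactly 3 steps, got {len(steps)}")
    if steps and bottleneck not in steps[0].gains:
        problems.append(f"first step does not address the bottleneck ({bottleneck})")
    projected = dict(scores)
    for step in steps:
        for subsystem, after in step.gains.items():
            projected[subsystem] = max(projected.get(subsystem, 0), after)
    gain = sum(projected.values()) - sum(scores.values())
    if gain < 4:
        problems.append(f"total gain is {gain}, need at least 4")
    return problems, projected


# The example scorecard from this skill, expressed as data.
scores = {"Instructions": 4, "State": 2, "Verification": 3, "Scope": 3, "Lifecycle": 2}
plan = [
    Step("Create PROBLEMS.md", "1h", {"State": 4}),
    Step("Create feature_list.json + init.sh", "30min", {"State": 5, "Verification": 4}),
    Step("Add Stop hook stop-test-gate.py", "15min", {"Lifecycle": 4}),
]
problems, projected = check_plan(plan, scores, "State")
print(problems, sum(projected.values()))  # [] 20 (was 14)
```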

Output Format

Use the visual scorecard format shown at the top of this skill. Sections:

Header: === Harness Audit: <project-name> === (one line)
Scorecard: one entry per subsystem (5 total), each with score + ✓/✗/~ findings
Bottleneck: one line naming the subsystem and score
Top 3 improvements: numbered list with effort + impact + pointer
Projected total: optional, only if user asked for "after" state

Keep the entire output under 50 lines. The user is scanning for next steps, not reading an essay. Detail goes into the per-subsystem checklist file, not the audit output.
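A small renderer makes the layout and the 50-line budget concrete. This is a sketch, not the skill's implementation: `render_audit` is a hypothetical name, the column widths are inferred from the example scorecard, and improvements are passed in as preformatted strings.

```python
def render_audit(project, scores, findings, bottleneck, improvements, max_lines=50):
    """Render the audit as text and enforce the under-50-lines budget."""
    lines = [f"=== Harness Audit: {project} ===", ""]
    for name, score in scores.items():
        prefix = f"{name:<14}{score}/5  "
        items = findings.get(name, [])
        if not items:
            lines.append(prefix.rstrip())
        for i, item in enumerate(items):
            # Continuation lines are indented to align under the first finding.
            lines.append((prefix if i == 0 else " " * len(prefix)) + item)
    lines += ["", f"Bottleneck: {bottleneck} ({scores[bottleneck]}/5)", ""]
    lines.append("Top 3 improvements (in order):")
    lines += [f"{n}. {text}" for n, text in enumerate(improvements, start=1)]
    if len(lines) >= max_lines:
        raise ValueError(f"audit output is {len(lines)} lines, must stay under {max_lines}")
    return "\n".join(lines)


print(render_audit(
    "project-xyz",
    {"State": 2, "Lifecycle": 2},
    {"State": ["✓ .claude/handoffs/ exists (3 files)", "✗ No PROBLEMS.md"]},
    "State",
    ["Create PROBLEMS.md (1h)"],
))
```

Raising on overflow, rather than truncating, forces findings to be trimmed deliberately instead of cutting off the improvement list.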

What This Skill Is NOT

Not a code review — does not look at source code quality
Not a security audit — does not check for vulnerabilities (use /security-review instead)
Not a test runner — does not execute init.sh or tests, just checks existence
Not a fix tool — produces recommendations only, user applies them
Not for short-term projects — if the project is <5 features or <5 sessions, the harness overhead is not yet warranted; say so and skip the audit
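The short-term-project exclusion in the last item is an "or" over two thresholds, which is easy to invert by accident. A minimal sketch, assuming the feature and session counts are supplied by the caller (the skill does not say how to measure them) and using a hypothetical function name:

```python
def audit_is_warranted(feature_count, session_count, minimum=5):
    """Skip the audit if the project has fewer than 5 features OR fewer than 5 sessions."""
    return feature_count >= minimum and session_count >= minimum


print(audit_is_warranted(feature_count=12, session_count=3))  # False: too few sessions
```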

Honest Tradeoffs

The 5-subsystem framework is opinionated. A project can be perfectly functional with 3 of 5 strong and 2 weak (e.g., a research repo with no lifecycle needs).
Scoring is subjective at the margins. A 3 vs 4 for "covers basics" is a judgment call. Use the checklist to keep it consistent across audits, not to claim numeric precision.
The skill assumes our stack conventions. For projects using completely different tooling (e.g., AGENTS.md without .claude/), translate concepts before scoring — don't fail the project on naming.

Related

Principle 27 (feature-tracking): full framework explanation
Principle 01 (harness-design): Generator-Evaluator pattern, source of "subsystems" thinking
Templates (templates/long-run-project/): drop-in files for fixing State + Verification gaps
Rule (rules/long-run-harness.md): convention this audit checks against

Quick Self-Audit (for skill development)

This skill is itself a [LONG-RUN]-style artifact. To audit the audit:

Instructions: SKILL.md is this file (✓)
State: Scoring decisions are reproducible from references/scoring-rubric.md (✓)
Verification: 5 example evals in references/example-audits.md (TODO: not yet added)
Scope: Clear "what this skill is NOT" section (✓)
Lifecycle: No hooks needed — this is a query skill, not a continuous one (N/A)
