Use when building complex applications autonomously - orchestrates a generator-evaluator iteration loop with planner, generator, and evaluator agents for long-running multi-hour coding sessions
From cc-harness. Install with `npx claudepluginhub jinsong-zhou/cc-harness --plugin cc-harness`. This skill uses the workspace's default tool permissions.
Bundled references: references/evaluation-examples.md, references/sprint-contract-examples.md
Orchestrates autonomous application development using a three-agent architecture inspired by GANs: a planner expands prompts into specs, a generator builds features iteratively, and a skeptical evaluator tests and grades the work.
Single-agent approaches hit two persistent failure modes:
Context Window Deterioration — Models lose coherence on lengthy tasks as the context fills. Earlier instructions get compressed, patterns get forgotten, naming becomes inconsistent.
Self-Evaluation Bias — "When asked to evaluate work they've produced, agents tend to respond by confidently praising the work — even when, to a human observer, the quality is obviously mediocre." Separating creation from evaluation is a strong lever. Tuning a standalone evaluator to be skeptical is far more tractable than making a generator critical of its own work.
User Prompt (1-4 sentences)
│
▼
┌──────────────┐
│ PLANNER │──→ SPEC.md (ambitious product spec)
│ (read-only) │ with visual design language
└──────────────┘
│
▼
┌──────────────┐ harness/sprint-contract.md ┌──────────────┐
│ GENERATOR │◄─────────────────────────────────►│ EVALUATOR │
│ (builds) │ harness/sprint-result.md │ (tests) │
└──────────────┘ └──────────────┘
│ │
▼ │
git commit harness/qa-feedback.md │
▲ │
└──────────────────────────────────────────────────┘
(iterate if FAIL)
SPEC.md — an ambitious product spec with 10-20 features, including a visual design language.

For each feature in the spec:
Step 1: Contract — Generator writes harness/sprint-contract.md defining scope, success criteria, and verification plan.
Step 2: Build — Generator implements the feature using the specified stack.
Step 3: Commit — Generator commits working code to git.
Step 4: Handoff — Generator writes harness/sprint-result.md with what was built and how to test it.
Step 5: Evaluate — Evaluator interacts with the live application (Playwright MCP, curl, browser), grades against criteria, writes harness/qa-feedback.md.
Step 6: Iterate or Proceed. On FAIL, the generator reads harness/qa-feedback.md, fixes the issues, and the sprint repeats from Step 2; on PASS, the iteration is logged to harness/iteration-log.md and work moves to the next feature.
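The per-feature loop above can be sketched as a small orchestration script. This is an illustrative sketch, not the skill's actual implementation: the headless `claude -p` invocation, the prompt wording, and a `VERDICT: PASS` marker in qa-feedback.md are all assumptions.

```python
import subprocess
from pathlib import Path

HARNESS = Path("harness")

def run_agent(prompt: str) -> None:
    # Hypothetical: one headless agent turn (e.g. `claude -p <prompt>`).
    subprocess.run(["claude", "-p", prompt], check=True)

def evaluator_passed() -> bool:
    # Assumption: the evaluator ends qa-feedback.md with an explicit verdict line.
    return "VERDICT: PASS" in (HARNESS / "qa-feedback.md").read_text()

def run_sprint(feature: str, max_iterations: int = 5) -> bool:
    for attempt in range(max_iterations):
        # Steps 1-4: generator writes the contract, builds, commits, hands off.
        run_agent(
            f"Implement '{feature}' per SPEC.md. Write harness/sprint-contract.md "
            "first, commit working code, then write harness/sprint-result.md. "
            "If harness/qa-feedback.md exists, fix its issues before anything else."
        )
        # Step 5: evaluator tests the live app and grades against the contract.
        run_agent(
            "Evaluate the feature described in harness/sprint-result.md against "
            "harness/sprint-contract.md. Write harness/qa-feedback.md ending with "
            "'VERDICT: PASS' or 'VERDICT: FAIL'."
        )
        # Step 6: iterate on FAIL, proceed on PASS.
        if evaluator_passed():
            return True
    return False
```

The loop caps retries so a stubbornly failing feature stalls rather than burning the whole budget; the troubleshooting table below covers that case.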
Agents communicate through files, not conversation context:
| File | Writer | Reader | Lifecycle |
|---|---|---|---|
| SPEC.md | Planner | Generator, Evaluator | Once |
| harness/sprint-contract.md | Generator | Evaluator | Overwritten per feature |
| harness/sprint-result.md | Generator | Evaluator | Overwritten per feature |
| harness/qa-feedback.md | Evaluator | Generator | Overwritten per evaluation |
| harness/iteration-log.md | Orchestrator | All | Append-only (via log-iteration script) |
| harness/context-handoff.md | Any | Any | On context reset |
For worked examples, read references/sprint-contract-examples.md and references/evaluation-examples.md.
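The append-only lifecycle of harness/iteration-log.md is what the log-iteration script in the table provides: history survives context resets because entries are only ever appended. A minimal sketch, with the entry format assumed rather than taken from the skill:

```python
from datetime import datetime, timezone
from pathlib import Path

LOG = Path("harness/iteration-log.md")

def log_iteration(feature: str, attempt: int, verdict: str, notes: str = "") -> None:
    """Append one entry; earlier entries are never rewritten."""
    LOG.parent.mkdir(exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    entry = f"\n## {stamp} | {feature} (attempt {attempt}): {verdict}\n{notes}\n"
    with LOG.open("a") as f:  # opened in append mode, so the log is append-only
        f.write(entry)
```

Because the orchestrator is the only writer and everyone reads it, the log doubles as a session timeline when an agent resumes after a reset.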
Functionality-focused rubric:

| Criterion | Weight | What It Measures |
|---|---|---|
| Product Depth | HIGH | Real functionality vs facades. Can users complete the workflow? |
| Functionality | HIGH | Does it actually work? Bugs? Edge cases? Error handling? |
| Visual Design | MEDIUM | Polished, consistent, full-viewport UI with coherent identity |
| Code Quality | LOW | Competence check — broken fundamentals only |
Design-focused rubric:

| Criterion | Weight | What It Measures |
|---|---|---|
| Design Quality | HIGH | Coherent whole vs collection of parts — mood and identity |
| Originality | HIGH | Custom decisions vs AI patterns / template defaults |
| Craft | MEDIUM | Typography hierarchy, spacing, color harmony, contrast |
| Functionality | MEDIUM | Usability independent of aesthetics |
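The weights can be made operational as a scoring gate. A minimal sketch, assuming a 1-5 score per criterion and a hard rule that any critical issue forces FAIL; the numeric weights and pass threshold are illustrative, not part of the skill:

```python
# Illustrative weight values; the skill only specifies HIGH/MEDIUM/LOW labels.
WEIGHTS = {"HIGH": 3, "MEDIUM": 2, "LOW": 1}

def grade(scores: dict[str, tuple[str, int]], critical_issues: list[str],
          pass_threshold: float = 3.5) -> str:
    """scores maps criterion -> (weight label, 1-5 score).
    Any critical issue is an automatic FAIL, regardless of scores."""
    if critical_issues:
        return "FAIL"
    total = sum(WEIGHTS[w] * s for w, s in scores.values())
    weight_sum = sum(WEIGHTS[w] for w, _ in scores.values())
    return "PASS" if total / weight_sum >= pass_threshold else "FAIL"
```

The hard gate on critical issues matters more than the arithmetic: it removes the evaluator's room to average away a showstopper, which is exactly the rationalization failure described next.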
The evaluator must be actively tuned. The most common failure: the evaluator identifies real issues and then rationalizes approving the work anyway.
Calibration loop:
For detailed calibration guidance, use the harness-tuning skill.
| Problem | Cause | Solution |
|---|---|---|
| Evaluator passes everything | Default LLM generosity | Tighten criteria — add "never give 5 unless genuinely impressed", "FAIL if any critical issue" |
| Evaluator keeps failing the same feature | Spec too ambitious for one feature, or generator ignoring feedback | Split the feature into smaller pieces, or check that the generator is reading qa-feedback.md |
| Generator ignores QA feedback | Generator context doesn't include the feedback file | Ensure generator reads harness/qa-feedback.md before starting fixes |
| Planner produces too-ambitious spec | This is by design — but if 20+ features, consider scoping down | Ask planner to target 8-12 features, or manually trim the spec |
| Quality degrades mid-session | Context deterioration | Use context-management skill — try compaction, then reset if needed |
| Agent wraps up prematurely | Context anxiety (mainly Sonnet 4.5) | Switch to context resets or use Opus 4.5+ |
| File communication breaks | Agent writes to wrong path or doesn't read handoff | Check harness/ directory exists, verify agents reference correct file paths |
| Approach | Duration | Cost | Quality |
|---|---|---|---|
| Solo agent (no harness) | ~20 min | ~$9 | Broken core features |
| Full harness (with sprints) | ~6 hours | ~$200 | Working, polished |
| Simplified harness (no sprints) | ~4 hours | ~$125 | Working, comparable |
The full harness costs ~20x more than the solo agent but produces applications that actually work.