Use when building complex applications autonomously - orchestrates a generator-evaluator iteration loop with planner, generator, and evaluator agents for long-running multi-hour coding sessions
From cc-harness. Install with `npx claudepluginhub jinsong-zhou/cc-harness --plugin cc-harness`. This skill uses the workspace's default tool permissions.
Bundled references: references/evaluation-examples.md, references/sprint-contract-examples.md
Orchestrates autonomous application development using a three-agent architecture inspired by GANs: a planner expands prompts into specs, a generator builds features iteratively, and a skeptical evaluator tests and grades the work.
Single-agent approaches hit two persistent failure modes:
Context Window Deterioration — Models lose coherence on lengthy tasks as the context fills. Earlier instructions get compressed, patterns get forgotten, naming becomes inconsistent.
Self-Evaluation Bias — "When asked to evaluate work they've produced, agents tend to respond by confidently praising the work — even when, to a human observer, the quality is obviously mediocre." Separating creation from evaluation is a strong lever. Tuning a standalone evaluator to be skeptical is far more tractable than making a generator critical of its own work.
User Prompt (1-4 sentences)
│
▼
┌──────────────┐
│ PLANNER │──→ SPEC.md (ambitious product spec)
│ (read-only) │ with visual design language
└──────────────┘
│
▼
┌──────────────┐ harness/sprint-contract.md ┌──────────────┐
│ GENERATOR │◄─────────────────────────────────►│ EVALUATOR │
│ (builds) │ harness/sprint-result.md │ (tests) │
└──────────────┘ └──────────────┘
│ │
▼ │
git commit harness/qa-feedback.md │
▲ │
└──────────────────────────────────────────────────┘
(iterate if FAIL)
SPEC.md — an ambitious product spec with 10-20 features, including a visual design language.

For each feature in the spec:
Step 1: Contract — Generator writes harness/sprint-contract.md defining scope, success criteria, and verification plan.
Step 2: Build — Generator implements the feature using the specified stack.
Step 3: Commit — Generator commits working code to git.
Step 4: Handoff — Generator writes harness/sprint-result.md with what was built and how to test it.
Step 5: Evaluate — Evaluator interacts with the live application (Playwright MCP, curl, browser), grades against criteria, writes harness/qa-feedback.md.
Step 6: Iterate or Proceed. On FAIL, the generator reads harness/qa-feedback.md, fixes the issues, and the sprint repeats from Step 2; on PASS, the iteration is logged to harness/iteration-log.md and work moves to the next feature.
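The per-feature loop above can be sketched as a small orchestration script. This is an illustrative sketch, not the skill's actual implementation: the headless `claude -p` invocation, the prompt wording, and a `VERDICT: PASS` marker in qa-feedback.md are all assumptions.

```python
import subprocess
from pathlib import Path

HARNESS = Path("harness")

def run_agent(prompt: str) -> None:
    # Hypothetical: one headless agent turn (e.g. `claude -p <prompt>`).
    subprocess.run(["claude", "-p", prompt], check=True)

def evaluator_passed() -> bool:
    # Assumption: the evaluator ends qa-feedback.md with an explicit verdict line.
    return "VERDICT: PASS" in (HARNESS / "qa-feedback.md").read_text()

def run_sprint(feature: str, max_iterations: int = 5) -> bool:
    for attempt in range(max_iterations):
        # Steps 1-4: generator writes the contract, builds, commits, hands off.
        run_agent(
            f"Implement '{feature}' per SPEC.md. Write harness/sprint-contract.md "
            "first, commit working code, then write harness/sprint-result.md. "
            "If harness/qa-feedback.md exists, fix its issues before anything else."
        )
        # Step 5: evaluator tests the live app and grades against the contract.
        run_agent(
            "Evaluate the feature described in harness/sprint-result.md against "
            "harness/sprint-contract.md. Write harness/qa-feedback.md ending with "
            "'VERDICT: PASS' or 'VERDICT: FAIL'."
        )
        # Step 6: iterate on FAIL, proceed on PASS.
        if evaluator_passed():
            return True
    return False
```

The loop caps retries so a stubbornly failing feature stalls rather than burning the whole budget; the troubleshooting table below covers that case.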
Agents communicate through files, not conversation context:
| File | Writer | Reader | Lifecycle |
|---|---|---|---|
| SPEC.md | Planner | Generator, Evaluator | Once |
| harness/sprint-contract.md | Generator | Evaluator | Overwritten per feature |
| harness/sprint-result.md | Generator | Evaluator | Overwritten per feature |
| harness/qa-feedback.md | Evaluator | Generator | Overwritten per evaluation |
| harness/iteration-log.md | Orchestrator | All | Append-only (via log-iteration script) |
| harness/context-handoff.md | Any | Any | On context reset |
For worked examples, read references/sprint-contract-examples.md and references/evaluation-examples.md.
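The append-only lifecycle of harness/iteration-log.md is what the log-iteration script in the table provides: history survives context resets because entries are only ever appended. A minimal sketch, with the entry format assumed rather than taken from the skill:

```python
from datetime import datetime, timezone
from pathlib import Path

LOG = Path("harness/iteration-log.md")

def log_iteration(feature: str, attempt: int, verdict: str, notes: str = "") -> None:
    """Append one entry; earlier entries are never rewritten."""
    LOG.parent.mkdir(exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    entry = f"\n## {stamp} | {feature} (attempt {attempt}): {verdict}\n{notes}\n"
    with LOG.open("a") as f:  # opened in append mode, so the log is append-only
        f.write(entry)
```

Because the orchestrator is the only writer and everyone reads it, the log doubles as a session timeline when an agent resumes after a reset.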
Functionality-focused rubric:

| Criterion | Weight | What It Measures |
|---|---|---|
| Product Depth | HIGH | Real functionality vs facades. Can users complete the workflow? |
| Functionality | HIGH | Does it actually work? Bugs? Edge cases? Error handling? |
| Visual Design | MEDIUM | Polished, consistent, full-viewport UI with coherent identity |
| Code Quality | LOW | Competence check — broken fundamentals only |
Design-focused rubric:

| Criterion | Weight | What It Measures |
|---|---|---|
| Design Quality | HIGH | Coherent whole vs collection of parts — mood and identity |
| Originality | HIGH | Custom decisions vs AI patterns / template defaults |
| Craft | MEDIUM | Typography hierarchy, spacing, color harmony, contrast |
| Functionality | MEDIUM | Usability independent of aesthetics |
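The weights can be made operational as a scoring gate. A minimal sketch, assuming a 1-5 score per criterion and a hard rule that any critical issue forces FAIL; the numeric weights and pass threshold are illustrative, not part of the skill:

```python
# Illustrative weight values; the skill only specifies HIGH/MEDIUM/LOW labels.
WEIGHTS = {"HIGH": 3, "MEDIUM": 2, "LOW": 1}

def grade(scores: dict[str, tuple[str, int]], critical_issues: list[str],
          pass_threshold: float = 3.5) -> str:
    """scores maps criterion -> (weight label, 1-5 score).
    Any critical issue is an automatic FAIL, regardless of scores."""
    if critical_issues:
        return "FAIL"
    total = sum(WEIGHTS[w] * s for w, s in scores.values())
    weight_sum = sum(WEIGHTS[w] for w, _ in scores.values())
    return "PASS" if total / weight_sum >= pass_threshold else "FAIL"
```

The hard gate on critical issues matters more than the arithmetic: it removes the evaluator's room to average away a showstopper, which is exactly the rationalization failure described next.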
The evaluator must be actively tuned. The most common failure: the evaluator identifies real issues and then rationalizes approving the work anyway.
Calibration loop:
For detailed calibration guidance, use the harness-tuning skill.
| Problem | Cause | Solution |
|---|---|---|
| Evaluator passes everything | Default LLM generosity | Tighten criteria — add "never give 5 unless genuinely impressed", "FAIL if any critical issue" |
| Evaluator keeps failing the same feature | Spec too ambitious for one feature, or generator ignoring feedback | Split the feature into smaller pieces, or check that the generator is reading qa-feedback.md |
| Generator ignores QA feedback | Generator context doesn't include the feedback file | Ensure generator reads harness/qa-feedback.md before starting fixes |
| Planner produces too-ambitious spec | This is by design — but if 20+ features, consider scoping down | Ask planner to target 8-12 features, or manually trim the spec |
| Quality degrades mid-session | Context deterioration | Use context-management skill — try compaction, then reset if needed |
| Agent wraps up prematurely | Context anxiety (mainly Sonnet 4.5) | Switch to context resets or use Opus 4.5+ |
| File communication breaks | Agent writes to wrong path or doesn't read handoff | Check harness/ directory exists, verify agents reference correct file paths |
| Approach | Duration | Cost | Quality |
|---|---|---|---|
| Solo agent (no harness) | ~20 min | ~$9 | Broken core features |
| Full harness (with sprints) | ~6 hours | ~$200 | Working, polished |
| Simplified harness (no sprints) | ~4 hours | ~$125 | Working, comparable |
The full harness costs ~20x more than the solo agent but produces applications that actually work.