From cc-harness
Use when building complex applications autonomously - orchestrates a generator-evaluator iteration loop with planner, generator, and evaluator agents for long-running multi-hour coding sessions
Install: npx claudepluginhub jinsong-zhou/cc-harness --plugin cc-harness
Slash command: /cc-harness:harness-loop
Orchestrates autonomous application development using a three-agent architecture inspired by GANs: a **planner** expands prompts into specs, a **generator** builds features iteratively, and a **skeptical evaluator** tests and grades the work.
Single-agent approaches hit two persistent failure modes:
Context Window Deterioration — Models lose coherence on lengthy tasks as the context fills. Earlier instructions get compressed, patterns get forgotten, naming becomes inconsistent.
Self-Evaluation Bias — "When asked to evaluate work they've produced, agents tend to respond by confidently praising the work — even when, to a human observer, the quality is obviously mediocre." Separating creation from evaluation is a strong lever. Tuning a standalone evaluator to be skeptical is far more tractable than making a generator critical of its own work.
    User Prompt (1-4 sentences)
           │
           ▼
    ┌──────────────┐
    │   PLANNER    │──→ SPEC.md (ambitious product spec)
    │ (read-only)  │    with visual design language
    └──────────────┘
           │
           ▼
    ┌──────────────┐    harness/sprint-contract.md     ┌──────────────┐
    │  GENERATOR   │◄─────────────────────────────────►│  EVALUATOR   │
    │   (builds)   │     harness/sprint-result.md      │   (tests)    │
    └──────────────┘                                   └──────────────┘
           │                                                  │
           ▼                                                  │
      git commit          harness/qa-feedback.md              │
           ▲                                                  │
           └──────────────────────────────────────────────────┘
                             (iterate if FAIL)
The planner expands the user prompt into SPEC.md, an ambitious product spec with 10-20 features, including a visual design language.

For each feature in the spec:
Step 1: Contract — Generator writes harness/sprint-contract.md defining scope, success criteria, and verification plan.
Step 2: Build — Generator implements the feature using the specified stack.
Step 3: Commit — Generator commits working code to git.
Step 4: Handoff — Generator writes harness/sprint-result.md with what was built and how to test it.
Step 5: Evaluate — Evaluator interacts with the live application (Playwright MCP, curl, browser), grades against criteria, writes harness/qa-feedback.md.
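Step 6: Iterate or Proceed. On PASS, log the result to harness/iteration-log.md and move to the next feature. On FAIL, the generator reads harness/qa-feedback.md, fixes the issues, and the evaluator re-tests.

After all features: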
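
The per-feature loop in Steps 1-6 can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual orchestrator: `run_feature`, `Verdict`, the three callables, and the `max_iterations` cap are all hypothetical names and assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    """Outcome of one evaluation pass (hypothetical structure)."""
    passed: bool
    feedback: str


def run_feature(
    feature: str,
    build: Callable[[str, str], None],    # stand-in for the generator (Steps 1-4)
    evaluate: Callable[[str], Verdict],   # stand-in for the evaluator (Step 5)
    log: Callable[[str], None],           # stand-in for the iteration log writer
    max_iterations: int = 3,              # assumed cap; the source does not state one
) -> bool:
    """Run build/evaluate rounds until the feature passes or the cap is hit."""
    feedback = ""
    for attempt in range(1, max_iterations + 1):
        build(feature, feedback)          # contract, build, commit, handoff
        verdict = evaluate(feature)       # grade the live application
        status = "PASS" if verdict.passed else "FAIL"
        log(f"{feature}: attempt {attempt}: {status}")
        if verdict.passed:
            return True                   # Step 6: proceed to the next feature
        feedback = verdict.feedback       # Step 6: iterate with QA feedback
    return False
```

Keeping the generator and evaluator behind separate callables mirrors the separation of creation from evaluation described earlier: the loop never asks the builder to grade itself.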
Agents communicate through files, not conversation context:
| File | Writer | Reader | Lifecycle |
|---|---|---|---|
| SPEC.md | Planner | Generator, Evaluator | Once |
| harness/sprint-contract.md | Generator | Evaluator | Overwritten per feature |
| harness/sprint-result.md | Generator | Evaluator | Overwritten per feature |
| harness/qa-feedback.md | Evaluator | Generator | Overwritten per evaluation |
| harness/iteration-log.md | Orchestrator | All | Append-only (via log-iteration script) |
| harness/context-handoff.md | Any | Any | On context reset |
For worked examples, read references/sprint-contract-examples.md and references/evaluation-examples.md.
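
The two lifecycles in the table (overwritten versus append-only) could look like this in practice. This is a sketch under assumptions: `write_handoff` and `append_log` are hypothetical helpers, the timestamped log line format is invented for illustration, and `append_log` is only a stand-in for the plugin's log-iteration script, whose real behavior is not shown in this document.

```python
from datetime import datetime, timezone
from pathlib import Path

HARNESS_DIR = Path("harness")  # directory name taken from the table above


def write_handoff(name: str, body: str, root: Path = Path(".")) -> Path:
    """Overwrite a per-feature file such as sprint-contract.md."""
    path = root / HARNESS_DIR / name
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(body, encoding="utf-8")  # replaces any previous content
    return path


def append_log(entry: str, root: Path = Path(".")) -> Path:
    """Append one line to iteration-log.md, leaving earlier entries intact."""
    path = root / HARNESS_DIR / "iteration-log.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with path.open("a", encoding="utf-8") as fh:
        fh.write(f"- {stamp} {entry}\n")  # assumed line format
    return path
```

Because each agent only reads and writes files, a fresh context can pick up where the previous one stopped by reading the current contract, result, and feedback files.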
| Criterion | Weight | What It Measures |
|---|---|---|
| Product Depth | HIGH | Real functionality vs facades. Can users complete the workflow? |
| Functionality | HIGH | Does it actually work? Bugs? Edge cases? Error handling? |
| Visual Design | MEDIUM | Polished, consistent, full-viewport UI with coherent identity |
| Code Quality | LOW | Competence check — broken fundamentals only |

A second set of criteria, weighted toward design:

| Criterion | Weight | What It Measures |
|---|---|---|
| Design Quality | HIGH | Coherent whole vs collection of parts — mood and identity |
| Originality | HIGH | Custom decisions vs AI patterns / template defaults |
| Craft | MEDIUM | Typography hierarchy, spacing, color harmony, contrast |
| Functionality | MEDIUM | Usability independent of aesthetics |
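
One way to turn weighted criteria into a verdict is sketched below, using the first table's criteria. Several things here are assumptions rather than the skill's documented behavior: the 1-5 score scale, the numeric mapping of HIGH/MEDIUM/LOW to 3/2/1, the 3.5 pass threshold, and the function name `grade`.

```python
WEIGHTS = {"HIGH": 3, "MEDIUM": 2, "LOW": 1}  # assumed numeric mapping

CRITERIA = {
    "Product Depth": "HIGH",
    "Functionality": "HIGH",
    "Visual Design": "MEDIUM",
    "Code Quality": "LOW",
}


def grade(
    scores: dict[str, int],
    critical_issues: list[str],
    threshold: float = 3.5,  # assumed pass mark on a 1-5 scale
) -> tuple[str, float]:
    """Return (verdict, weighted average) for one evaluation."""
    for name in CRITERIA:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} score must be between 1 and 5")
    total_weight = sum(WEIGHTS[level] for level in CRITERIA.values())
    weighted = sum(
        scores[name] * WEIGHTS[level] for name, level in CRITERIA.items()
    ) / total_weight
    if critical_issues:
        return "FAIL", weighted  # any critical issue overrides the average
    return ("PASS" if weighted >= threshold else "FAIL"), weighted
```

Making a critical issue an unconditional FAIL guards against an evaluator that finds a real problem and then averages it away.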
The evaluator must be actively tuned. Common failure: evaluator identifies real issues then rationalizes approving the work.
Calibration loop:
For detailed calibration guidance, use the harness-tuning skill.
| Problem | Cause | Solution |
|---|---|---|
| Evaluator passes everything | Default LLM generosity | Tighten criteria — add "never give 5 unless genuinely impressed", "FAIL if any critical issue" |
| Evaluator keeps failing the same feature | Spec too ambitious for one feature, or generator ignoring feedback | Split the feature into smaller pieces, or check that the generator is reading qa-feedback.md |
| Generator ignores QA feedback | Generator context doesn't include the feedback file | Ensure generator reads harness/qa-feedback.md before starting fixes |
| Planner produces too-ambitious spec | Ambition is by design, but a spec with 20+ features is likely too large | Ask planner to target 8-12 features, or manually trim the spec |
| Quality degrades mid-session | Context deterioration | Use context-management skill — try compaction, then reset if needed |
| Agent wraps up prematurely | Context anxiety (mainly Sonnet 4.5) | Switch to context resets or use Opus 4.5+ |
| File communication breaks | Agent writes to wrong path or doesn't read handoff | Check harness/ directory exists, verify agents reference correct file paths |
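
For the last row, a quick path check can narrow down where file communication broke. This is a hypothetical helper, not part of the plugin: it assumes SPEC.md sits at the project root and that a full sprint has already run, so all three per-feature files should exist.

```python
from pathlib import Path

# Paths from the file table; SPEC.md at the project root is an assumption.
EXPECTED = [
    "SPEC.md",
    "harness/sprint-contract.md",
    "harness/sprint-result.md",
    "harness/qa-feedback.md",
]


def missing_handoff_files(root: Path = Path(".")) -> list[str]:
    """List expected handoff files that are absent under root."""
    if not (root / "harness").is_dir():
        return ["harness/"]  # nothing else can exist without the directory
    return [rel for rel in EXPECTED if not (root / rel).is_file()]
```

An empty result means the paths are in place, so the remaining suspect is an agent that is not reading the file it was handed.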
| Approach | Duration | Cost | Quality |
|---|---|---|---|
| Solo agent (no harness) | ~20 min | ~$9 | Broken core features |
| Full harness (with sprints) | ~6 hours | ~$200 | Working, polished |
| Simplified harness (no sprints) | ~4 hours | ~$125 | Working, comparable |
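The full harness costs roughly 20x more than a solo agent (~$200 vs ~$9) but produces working applications. The simplified harness reaches comparable quality at about 14x the solo cost (~$125).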
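
The cost multiples can be checked directly from the table. The figures below are the approximate values quoted above; the dictionary and function names are illustrative only.

```python
# Approximate cost in USD per run, copied from the comparison table.
COST_USD = {
    "Solo agent (no harness)": 9,
    "Full harness (with sprints)": 200,
    "Simplified harness (no sprints)": 125,
}


def cost_multiple(approach: str, baseline: str = "Solo agent (no harness)") -> float:
    """Cost of an approach relative to the baseline run."""
    return COST_USD[approach] / COST_USD[baseline]


print(f"{cost_multiple('Full harness (with sprints)'):.1f}x")      # 22.2x
print(f"{cost_multiple('Simplified harness (no sprints)'):.1f}x")  # 13.9x
```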