Use when optimizing or simplifying a multi-agent harness - guides systematic removal of components as models improve, evaluator calibration, and prompt engineering for grading criteria
Every component in a harness encodes an assumption about what the model can't do on its own. This skill guides systematic tuning to match current model capabilities.
"The space of interesting harness combinations doesn't shrink as models improve. Instead, it moves, and the interesting work for AI engineers is to keep finding the next novel combination."
The evaluator is the most critical and most fragile component.
"Out of the box, Claude is a poor QA agent. In early runs, I watched it identify legitimate issues, then talk itself into deciding they weren't a big deal and approve the work anyway."
| Symptom | Cause | Fix |
|---|---|---|
| Passes everything | Default LLM generosity | Add "never give 5 unless genuinely impressed" |
| Identifies bugs then approves | Conflict avoidance | Add "if ANY critical issue exists, assessment is FAIL" |
| Scores inflate over iterations | Anchoring to previous scores | Reset score context each evaluation |
| Misses functional bugs | Only reads code | REQUIRE browser/API interaction before scoring |
| Too harsh on style | Over-indexing on craft | Reweight: functionality > originality > craft |
| Generic praise before issues | Politeness pattern | Add "lead with issues, not praise" |
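The fixes in the table can be baked into the evaluator's system prompt as explicit calibration rules. A minimal sketch of assembling such a prompt; the rule strings and `build_evaluator_prompt` helper are illustrative, not part of this skill:

```python
# Sketch: assemble an evaluator system prompt that counters the
# failure modes in the table above. All strings are illustrative.
CALIBRATION_RULES = [
    "Never give a 5 unless genuinely impressed.",
    "If ANY critical issue exists, the assessment is FAIL.",
    "Do not reference scores from previous evaluations.",
    "Interact with the running app (browser/API) before scoring.",
    "Weight functionality over originality, and originality over craft.",
    "Lead with issues, not praise.",
]

def build_evaluator_prompt(task_description: str) -> str:
    # One rule per line so individual rules can be added or removed
    # as the model's default behavior improves.
    rules = "\n".join(f"- {r}" for r in CALIBRATION_RULES)
    return (
        "You are a strict QA evaluator.\n"
        f"Task under review: {task_description}\n"
        "Calibration rules:\n"
        f"{rules}"
    )

print(build_evaluator_prompt("login form with validation"))
```

Keeping the rules as a list makes each one a removable assumption: when a new model stops needing a rule, delete that line and re-audit.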
Provide worked examples matching your quality bar. See references/audit-template.md for the format.
Calibrated FAIL example:
- Product Depth: 2 — Form renders but validation is display-only; invalid submissions go through
- Functionality: 1 — Core feature is broken
- Visual Design: 3 — Looks acceptable but generic
- Code Quality: 2 — No input sanitization
- Overall: FAIL — Cannot ship a login that doesn't validate

Calibrated PASS example:
- Product Depth: 4 — Full CRUD workflow completes end-to-end, including edge cases
- Functionality: 4 — All happy paths work; error handling covers expected failures
- Visual Design: 3 — Clean but unremarkable; consistent spacing and colors
- Code Quality: 4 — Well-structured, appropriate error handling, no security issues
- Overall: PASS — Solid implementation, ready for next feature
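One way to make the PASS/FAIL verdict robust against evaluator leniency is to derive it deterministically from the per-criterion scores instead of trusting the model's self-reported overall assessment. A sketch, assuming a 1-5 scale where any criterion at 2 or below forces a FAIL; that threshold is consistent with the two worked examples above, but your bar may differ:

```python
# Sketch: derive the overall verdict from criterion scores rather
# than trusting the evaluator's self-reported PASS/FAIL.
# The threshold of 2 is an assumption matching the examples above.
FAIL_THRESHOLD = 2

def overall_verdict(scores: dict[str, int]) -> str:
    # Any single criterion at or below the threshold is disqualifying,
    # mirroring "if ANY critical issue exists, assessment is FAIL".
    if any(s <= FAIL_THRESHOLD for s in scores.values()):
        return "FAIL"
    return "PASS"

fail_run = {"Product Depth": 2, "Functionality": 1,
            "Visual Design": 3, "Code Quality": 2}
pass_run = {"Product Depth": 4, "Functionality": 4,
            "Visual Design": 3, "Code Quality": 4}

print(overall_verdict(fail_run))  # FAIL
print(overall_verdict(pass_run))  # PASS
```

This keeps the LLM responsible for scoring, where it is strong, and removes its ability to talk itself into approving work it has already flagged.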
The wording of grading criteria directly shapes generator behavior, independent of evaluator feedback.
| Criterion Wording | Effect on Output |
|---|---|
| "Museum quality" | Pushes toward polished but visually convergent designs |
| "Penalize AI-generated patterns" | Encourages aesthetic risk-taking |
| "A human designer should recognize deliberate creative choices" | Drives away from template defaults |
| "Usability independent of aesthetics" | Keeps functionality grounded during experimental design |
| "Purple gradients over white cards fail here" | Specifically steers away from common AI visual clichés |
When a new model arrives, systematically test each component: every one encodes an assumption about the previous model's limits, and the new model may no longer need it. Use the audit template in references/audit-template.md to record findings for each component.
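The audit can be framed as an ablation: run the harness with a component removed, then compare quality and cost against the full harness. A sketch of the decision rule; the metrics, the tolerance value, and the example numbers (echoing the progression table below) are assumptions, and the `RunMetrics` records stand in for whatever your pipeline actually logs:

```python
from dataclasses import dataclass

@dataclass
class RunMetrics:
    quality: float   # e.g. mean evaluator score on a 1-5 scale
    cost_usd: float
    minutes: float

# How much quality you are willing to trade for savings (assumption).
QUALITY_TOLERANCE = 0.2

def should_remove(baseline: RunMetrics, ablated: RunMetrics) -> bool:
    """Remove the component if quality holds within tolerance and
    the ablated run is cheaper or faster."""
    quality_holds = ablated.quality >= baseline.quality - QUALITY_TOLERANCE
    saves = (ablated.cost_usd < baseline.cost_usd
             or ablated.minutes < baseline.minutes)
    return quality_holds and saves

# Illustrative numbers in the spirit of the full vs. simplified runs:
full = RunMetrics(quality=4.1, cost_usd=200, minutes=360)
no_sprints = RunMetrics(quality=4.0, cost_usd=125, minutes=230)

print(should_remove(full, no_sprints))  # True: quality holds, cheaper
```

Running this rule once per component per model release turns "stress testing assumptions" into a repeatable checklist rather than a judgment call.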
A real progression showing how the harness simplified as models improved:
| Approach | Duration | Cost | Quality |
|---|---|---|---|
| Solo agent | ~20 min | ~$9 | Superficially impressive, core features broken |
| Full harness (sprints) | ~6 hr | ~$200 | Working features, polished UI, real functionality |
| Simplified harness (no sprints) | ~4 hr | ~$125 | Working features, comparable quality |
Phase breakdown of the simplified-harness run:
| Phase | Duration | Cost |
|---|---|---|
| Planner | 4.7 min | $0.46 |
| Build Round 1 | 2 hr 7 min | $71.08 |
| QA Round 1 | 8.8 min | $3.24 |
| Build Round 2 | 1 hr 2 min | $36.89 |
| QA Round 2 | 6.8 min | $3.09 |
| Build Round 3 | 10.9 min | $5.88 |
| QA Round 3 | 9.6 min | $4.06 |
| Total | 3 hr 50 min | $124.70 |
Most time goes to the builder. QA rounds are cheap (~7-10 min, ~$3-4 each) but catch real issues.
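Per-phase totals like those in the table are worth computing from run logs rather than by hand. A sketch that reproduces the table's totals (phase names and numbers are taken directly from the table above):

```python
# Reproduce the totals from the phase table above.
# Each entry: (phase name, minutes, cost in USD).
phases = [
    ("Planner",       4.7,   0.46),
    ("Build Round 1", 127.0, 71.08),
    ("QA Round 1",    8.8,   3.24),
    ("Build Round 2", 62.0,  36.89),
    ("QA Round 2",    6.8,   3.09),
    ("Build Round 3", 10.9,  5.88),
    ("QA Round 3",    9.6,   4.06),
]

total_min = sum(m for _, m, _ in phases)
total_cost = sum(c for _, _, c in phases)
print(f"{int(total_min // 60)} hr {total_min % 60:.0f} min, ${total_cost:.2f}")
# -> 3 hr 50 min, $124.70
```

Logging per-phase cost this way also makes the builder/QA split obvious at a glance when deciding which component to ablate next.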
"Every component in a harness encodes an assumption about what the model can't do on its own, and those assumptions are worth stress testing."