From cc-harness
Use when optimizing or simplifying a multi-agent harness - guides systematic removal of components as models improve, evaluator calibration, and prompt engineering for grading criteria
npx claudepluginhub jinsong-zhou/cc-harness --plugin cc-harness
How this skill is triggered (by the user, by Claude, or both):
Slash command
/cc-harness:harness-tuning
The summary Claude sees in its skill listing, used to decide when to auto-load this skill:
Every component in a harness encodes an assumption about what the model can't do on its own. This skill guides systematic tuning to match current model capabilities.
"The space of interesting harness combinations doesn't shrink as models improve. Instead, it moves, and the interesting work for AI engineers is to keep finding the next novel combination."
The evaluator is the most critical and most fragile component.
"Out of the box, Claude is a poor QA agent. In early runs, I watched it identify legitimate issues, then talk itself into deciding they weren't a big deal and approve the work anyway."
| Symptom | Cause | Fix |
|---|---|---|
| Passes everything | Default LLM generosity | Add "never give 5 unless genuinely impressed" |
| Identifies bugs then approves | Conflict avoidance | Add "if ANY critical issue exists, assessment is FAIL" |
| Scores inflate over iterations | Anchoring to previous scores | Reset score context each evaluation |
| Misses functional bugs | Only reads code | REQUIRE browser/API interaction before scoring |
| Too harsh on style | Over-indexing on craft | Reweight: functionality > originality > craft |
| Generic praise before issues | Politeness pattern | Add "lead with issues, not praise" |
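Some of these fixes can be enforced in code as well as in the prompt. The sketch below is a minimal illustration, not the plugin's implementation: `Evaluation`, `build_evaluator_prompt`, and `enforce_verdict` are hypothetical names, and the rule strings are copied from the table above. It appends the calibration rules to a base prompt without carrying over earlier scores, and overrides any approval that contradicts the evaluator's own list of critical issues.

```python
from dataclasses import dataclass, field

# Rule text comes from the table above; wiring it into a prompt this way is an assumption.
CALIBRATION_RULES = [
    "Never give 5 unless genuinely impressed.",
    "If ANY critical issue exists, assessment is FAIL.",
    "Lead with issues, not praise.",
]


@dataclass
class Evaluation:
    """Hypothetical container for one evaluator result."""
    verdict: str  # "PASS" or "FAIL" as returned by the evaluator
    critical_issues: list[str] = field(default_factory=list)


def build_evaluator_prompt(base_prompt: str) -> str:
    """Append the calibration rules to the base prompt.

    No previous scores are included, so every evaluation starts
    from a clean score context.
    """
    rules = "\n".join(f"- {rule}" for rule in CALIBRATION_RULES)
    return f"{base_prompt}\n\nGrading rules:\n{rules}"


def enforce_verdict(evaluation: Evaluation) -> Evaluation:
    """Override an approval that contradicts the evaluator's own findings."""
    if evaluation.critical_issues and evaluation.verdict != "FAIL":
        return Evaluation("FAIL", evaluation.critical_issues)
    return evaluation
```

The post-check is deliberately redundant with the prompt rule: it catches the case where the evaluator lists a critical issue and then approves the work anyway.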
Provide worked examples matching your quality bar. See references/audit-template.md for the format.
Calibrated FAIL example:
Product Depth: 2 — Form renders but validation is display-only, invalid submissions go through
Functionality: 1 — Core feature is broken
Visual Design: 3 — Looks acceptable but generic
Code Quality: 2 — No input sanitization
Overall: FAIL — Cannot ship a login that doesn't validate
Calibrated PASS example:
Product Depth: 4 — Full CRUD workflow completes end-to-end including edge cases
Functionality: 4 — All happy paths work, error handling covers expected failures
Visual Design: 3 — Clean but unremarkable — consistent spacing and colors
Code Quality: 4 — Well-structured, appropriate error handling, no security issues
Overall: PASS — Solid implementation, ready for next feature
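One way to keep worked examples like these reusable is to store them as data and render them into the evaluator prompt as few-shot examples. The sketch below assumes a hypothetical `CalibratedExample` structure and its own rendering format. It is not the format defined in references/audit-template.md, which is not shown here.

```python
from dataclasses import dataclass


@dataclass
class CalibratedExample:
    """Hypothetical structure for one worked grading example."""
    scores: dict[str, tuple[int, str]]  # criterion -> (score, justification)
    overall: str                        # "PASS" or "FAIL"
    reason: str


# Values copied from the calibrated FAIL example above.
FAIL_EXAMPLE = CalibratedExample(
    scores={
        "Product Depth": (2, "Form renders but validation is display-only, invalid submissions go through"),
        "Functionality": (1, "Core feature is broken"),
        "Visual Design": (3, "Looks acceptable but generic"),
        "Code Quality": (2, "No input sanitization"),
    },
    overall="FAIL",
    reason="Cannot ship a login that doesn't validate",
)


def render_example(example: CalibratedExample) -> str:
    """Render one example as plain text, one criterion per line."""
    lines = [f"{name}: {score} ({note})" for name, (score, note) in example.scores.items()]
    lines.append(f"Overall: {example.overall} ({example.reason})")
    return "\n".join(lines)


def few_shot_block(examples: list[CalibratedExample]) -> str:
    """Join rendered examples into a block to paste into the evaluator prompt."""
    rendered = [f"Calibrated {ex.overall} example:\n{render_example(ex)}" for ex in examples]
    return "\n\n".join(rendered)
```

Keeping the examples as data makes it easy to swap in a stricter or looser quality bar without rewriting the prompt by hand.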
The wording of grading criteria directly shapes generator behavior, independent of evaluator feedback.
| Criterion Wording | Effect on Output |
|---|---|
| "Museum quality" | Pushes toward polished but visually convergent designs |
| "Penalize AI-generated patterns" | Encourages aesthetic risk-taking |
| "A human designer should recognize deliberate creative choices" | Drives away from template defaults |
| "Usability independent of aesthetics" | Keeps functionality grounded during experimental design |
| "Purple gradients over white cards fail here" | Specifically steers away from common AI visual clichés |
When a new model arrives, systematically test each component:
For each:
Use the audit template in references/audit-template.md to record findings.
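A component-by-component test can be approximated as an ablation loop: run the harness with everything enabled, then once more with each component switched off, and compare scores. The sketch below is only one possible reading of the audit, not the procedure from the audit template. `run_harness` is a hypothetical callable that returns a quality score, and the component names in the usage example are borrowed from the tables in this document.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AuditFinding:
    """Hypothetical record of one ablation result."""
    component: str
    score_with: float
    score_without: float

    @property
    def still_needed(self) -> bool:
        # Keep the component only if removing it lowers the measured score.
        return self.score_with > self.score_without


def audit_components(
    components: list[str],
    run_harness: Callable[[frozenset[str]], float],
) -> list[AuditFinding]:
    """Re-run the harness once per component with that component disabled."""
    enabled = frozenset(components)
    baseline = run_harness(enabled)
    findings = []
    for name in components:
        score = run_harness(enabled - {name})
        findings.append(AuditFinding(name, baseline, score))
    return findings


# Usage with a stand-in scorer; a real run would invoke the harness itself.
def fake_run(enabled: frozenset[str]) -> float:
    return 4.0 if "evaluator" in enabled else 2.5


for finding in audit_components(["planner", "sprints", "evaluator"], fake_run):
    print(finding.component, "keep" if finding.still_needed else "candidate for removal")
```

A single score comparison is noisy in practice, so a real audit would likely repeat each run and weigh cost and duration alongside quality.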
Real progression showing how the harness was simplified as models improved:
| Approach | Duration | Cost | Quality |
|---|---|---|---|
| Solo agent | ~20 min | ~$9 | Superficially impressive, core features broken |
| Full harness (sprints) | ~6 hr | ~$200 | Working features, polished UI, real functionality |
| Simplified harness (no sprints) | ~4 hr | ~$125 | Working features, comparable quality |
| Phase | Duration | Cost |
|---|---|---|
| Planner | 4.7 min | $0.46 |
| Build Round 1 | 2 hr 7 min | $71.08 |
| QA Round 1 | 8.8 min | $3.24 |
| Build Round 2 | 1 hr 2 min | $36.89 |
| QA Round 2 | 6.8 min | $3.09 |
| Build Round 3 | 10.9 min | $5.88 |
| QA Round 3 | 9.6 min | $4.06 |
| Total | 3 hr 50 min | $124.70 |
Most time goes to the builder. QA rounds are cheap (~7-10 min, ~$3-4 each) but catch real issues.
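The phase table can be tallied directly to check the total and to see how the time and cost split between building and QA. The figures below are copied from the table, with hours converted to minutes. The `PHASES` list and `totals` helper are illustrative names only.

```python
# (phase, minutes, cost in USD) copied from the phase table above
PHASES = [
    ("Planner", 4.7, 0.46),
    ("Build Round 1", 127.0, 71.08),  # 2 hr 7 min
    ("QA Round 1", 8.8, 3.24),
    ("Build Round 2", 62.0, 36.89),   # 1 hr 2 min
    ("QA Round 2", 6.8, 3.09),
    ("Build Round 3", 10.9, 5.88),
    ("QA Round 3", 9.6, 4.06),
]


def totals(prefix: str = "") -> tuple[float, float]:
    """Sum minutes and cost over phases whose name starts with prefix."""
    rows = [(minutes, cost) for name, minutes, cost in PHASES if name.startswith(prefix)]
    return round(sum(m for m, _ in rows), 1), round(sum(c for _, c in rows), 2)


total_min, total_cost = totals()
build_min, build_cost = totals("Build")
qa_min, qa_cost = totals("QA")

print(f"Total: {total_min} min, ${total_cost:.2f}")
print(f"Build share: {build_min / total_min:.0%} of time, {build_cost / total_cost:.0%} of cost")
print(f"QA share: {qa_min / total_min:.0%} of time, {qa_cost / total_cost:.0%} of cost")
```

The total comes to 229.8 minutes (about 3 hr 50 min) and $124.70, matching the table. The build rounds account for roughly 87% of the time and 91% of the cost, while the three QA rounds together take about 11% of the time and 8% of the cost.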
"Every component in a harness encodes an assumption about what the model can't do on its own, and those assumptions are worth stress testing."