Scores its own output 0-10 across 5 task-appropriate dimensions before emitting; fixes and rescores anything below the Functional (5-6) band for complex code, docs, or designs.
This is the pre-emit gate. Before you hand work back to the user — code, design, docs, a spec, a PR description — silently score it 0-10 across 5 task-appropriate dimensions. If the worst sustained band is Broken (0-4), the work is a regression: fix the lowest dimension, rescore, repeat.
This is distinct from reviewing-code:
| Skill | Reviews | When |
|---|---|---|
| reviewing-code | Others' code (PRs, commits) | After someone else writes code |
| self-critique | Your own output | Before you emit anything |
reviewing-code is the external gate. self-critique is the internal gate that runs first. Both can apply to the same artifact at different lifecycle points.
Skip when the output will be reviewed downstream anyway, e.g. by a `consensus_vote` gate or a human reviewer.

Every dimension uses the same 0-10 scale with these bands. Memorize them.
| Score | Band | Meaning |
|---|---|---|
| 0-4 | Broken | Doesn't satisfy the dimension. Visible problems a reader will notice. Fix before emit. |
| 5-6 | Functional | Satisfies the dimension at a baseline. No obvious failures, but unremarkable. |
| 7-8 | Strong | Above baseline. An expert reader would find 1-2 minor issues. |
| 9-10 | Exceptional | The work makes the case better than the spec required. Rare; don't grade-inflate to get here. |
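If it helps to pin the bands down, they reduce to a pure function of the score. A minimal sketch for illustration; the type and function names are mine, not part of the skill:

```ts
// Bands as a pure function of the 0-10 score (names hypothetical).
type Band = "Broken" | "Functional" | "Strong" | "Exceptional";

function band(score: number): Band {
  if (score <= 4) return "Broken";      // fix before emit
  if (score <= 6) return "Functional";  // baseline, no obvious failures
  if (score <= 8) return "Strong";      // an expert finds 1-2 minor issues
  return "Exceptional";                 // rare; be suspicious of yourself here
}
```

Note that half the scale (0-4) maps to Broken; the compressed top end is what keeps grade inflation visible.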
Pick the dimension table that matches your task. Each table has 5 dimensions. Score each independently — do not average. The score is the worst sustained band across recent work in that dimension, not an arithmetic mean.

For code:
| Dimension | Question | What to look for |
|---|---|---|
| Correctness | Does it actually do what the spec says, including edge cases? | Null/empty/boundary inputs, error paths, off-by-one, race conditions if `await` is between read and write. |
| Readability | Could another engineer maintain this without explanation? | Names match domain. Control flow obvious. No deeply nested logic. Comments explain why, not what. |
| Architecture | Does it follow existing canonical patterns or introduce a new one with justification? | Module boundaries respected. No circular deps. Abstraction level appropriate. Canonical paths from CLAUDE.md. |
| Security | Are inputs validated at boundaries? Are secrets out of code/logs? | Per .rules/untrusted-input.md Tier 1-4. Zod at MCP/HTTP boundaries (sketch after this table). No `eval`/`innerHTML`. |
| Performance | Are obvious anti-patterns absent? | N+1 queries, unbounded data, sync I/O in hot paths. Don't optimize without profiling. |
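To make the Security row concrete, here is the shape of boundary validation it looks for. A sketch assuming Zod; the schema and handler names are hypothetical, not from the skill:

```ts
import { z } from "zod";

// Hypothetical HTTP boundary: validate before anything else touches the input.
const CreateUser = z.object({
  email: z.string().email(),
  age: z.number().int().min(0),
});

function handleCreateUser(rawBody: unknown) {
  const parsed = CreateUser.safeParse(rawBody); // never trust the shape of rawBody
  if (!parsed.success) {
    return { status: 400, body: "invalid input" }; // reject at the boundary, leak no internals
  }
  // parsed.data is now typed and validated; safe to pass inward.
  return { status: 201, body: parsed.data };
}
```

A handler like this scores Strong on the Security row; the same handler reading `rawBody` fields directly would sit in the Broken band no matter how clean the rest of the diff is.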
For visual design, per Open Design's original 5 dimensions:
| Dimension | Question |
|---|---|
| Philosophy consistency | Does the artifact pick one direction and stick to it through every micro-decision? |
| Visual hierarchy | Can a stranger figure out what to read first, second, third without being told? |
| Detail execution | Alignment, leading, kerning, image framing, edge-case spacing — the 90/10 stuff. |
| Functionality | Does it work for its intended use? Click targets, nav, readability at presentation distance. |
| Innovation | One unexpected move that makes a viewer lean in — or generic AI-slop median? |
For documentation:

| Dimension | Question |
|---|---|
| Accuracy | Does the doc match current code? Run grep against the symbols it cites (sketch after this table). |
| Discoverability | Will a reader looking for this find it? Is it indexed in docs/README.md? Are search terms in the headings? |
| Density | Is each paragraph load-bearing, or is it padding? Could you delete 30% without losing meaning? |
| Examples | Is there at least one runnable example for the canonical use case? Does the example actually run? |
| Tone | Direct, technical, no marketing fluff (per CLAUDE.md "Documentation Style"). |
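The Accuracy row implies a mechanical spot-check. A sketch, assuming ripgrep is installed and you pull the cited symbols out by hand; both the tool choice and the symbol list are assumptions, not part of the skill:

```ts
import { execSync } from "node:child_process";

// Symbols the doc cites, collected manually (hypothetical examples).
const citedSymbols = ["UnifiedAdapterRegistry", "handleCreateUser"];

for (const symbol of citedSymbols) {
  try {
    execSync(`rg --quiet '\\b${symbol}\\b' src/`); // rg exits non-zero on no match
  } catch {
    console.error(`Doc cites ${symbol} but src/ has no match: Accuracy is Broken.`);
  }
}
```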
For proposals, specs, and ADRs:

| Dimension | Question |
|---|---|
| Completeness | Are all required sections present? (Context, decision, alternatives, consequences for ADR) |
| Testability | Can the acceptance criterion be expressed as a test that passes/fails? "Works" is unfalsifiable (example after this table). |
| Reversibility | Is the cost of unwinding this documented? Helps choose the right semver bump and review threshold. |
| Stakeholder fit | Does the proposal address the actual user problem, not a different problem the author preferred? |
| Scope | Is the "Not Doing" list explicit? Is the diff bounded by the stated scope? |
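The Testability row is easiest to see by counterexample. "The retry logic works" is unfalsifiable; the version below passes or fails. The retry behavior and every name here are hypothetical, purely to show the shape:

```ts
import assert from "node:assert";
import { test } from "node:test";

// Hypothetical function under test.
async function fetchWithRetry(fn: () => Promise<unknown>, max = 3): Promise<unknown> {
  let lastError: unknown;
  for (let i = 0; i < max; i++) {
    try { return await fn(); } catch (e) { lastError = e; }
  }
  throw lastError;
}

// Criterion, stated falsifiably: "after 3 failed attempts, give up and surface the last error."
test("retry gives up after 3 attempts", async () => {
  let attempts = 0;
  const alwaysFails = async () => { attempts++; throw new Error("boom"); };

  await assert.rejects(() => fetchWithRetry(alwaysFails));
  assert.strictEqual(attempts, 3);
});
```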
For everything else, the general-purpose dimensions:

| Dimension | Question |
|---|---|
| Soundness | Does the work do what the brief asked? No subtle scope shift? |
| Clarity | Could a fresh reader understand the purpose in 30 seconds? |
| Coverage | Are obvious edge cases handled or explicitly out-of-scope? |
| Specificity | Is the work concrete, with names/numbers/citations? Or generic and vague? |
| Restraint | Did you build what was asked, not what was tempting? |
Read these every time. They are the difference between a critique that catches problems and grade-inflation theater.
- **Always cite evidence.** Bad: "scored 5 because feels inconsistent." Good: "scored 5 because hero page mixes Playfair display with Inter sans on the same line." Numbers without evidence get rejected.
- **Don't average up.** If Hierarchy is 5 because page 3 is broken, don't bump to 7 because pages 1 and 2 are fine. The score is the worst sustained band.
- **Don't grade-inflate.** A 7 means strong, not acceptable. If every score is 7+, you're not critiquing — you're rubber-stamping. Aim for an honest distribution: most production work scores 5-7 across most dimensions. A 9 should make you suspicious of yourself.
- **Innovation/Restraint are allowed to be low.** 5/10 on Innovation is fine for production deliverables that don't need to be novel. Don't punish appropriate conservatism.
- **One dimension can fail without the others.** A doc can be 9/10 on Accuracy and 4/10 on Discoverability — say so plainly. Don't average away interesting failures.
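One way to internalize the evidence rule is to treat a score without evidence as invalid by construction. A sketch; the shape is mine, not the skill's:

```ts
// A rubric row is only valid with concrete evidence attached (hypothetical shape).
interface CritiqueEntry {
  dimension: string;
  score: number;    // 0-10; the worst sustained band, never a mean
  evidence: string; // concrete citation: file, line, page, element
}

function validated(entry: CritiqueEntry): CritiqueEntry {
  if (entry.evidence.trim() === "") {
    throw new Error(`"${entry.dimension}": numbers without evidence get rejected`);
  }
  return entry;
}
```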
The loop:

- Generate output → score the 5 dimensions.
- Worst dimension in the Broken (0-4) band? Fix the lowest dimension, rescore, repeat.
- No Broken scores? Emit.
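A minimal sketch of the gate as code, to make the control flow concrete. Everything here is hypothetical; the skill is prose-driven, and `scoreDimension` and `fixLowest` stand in for judgment you exercise per task:

```ts
type Score = { dimension: string; score: number; evidence: string };

function preEmitGate(
  artifact: string,
  dimensions: string[],
  scoreDimension: (artifact: string, dimension: string) => Score,
  fixLowest: (artifact: string, worst: Score) => string,
): string {
  for (let pass = 0; pass < 5; pass++) {             // bounded: never loop forever
    const scores = dimensions.map((d) => scoreDimension(artifact, d));
    const worst = scores.reduce((a, b) => (b.score < a.score ? b : a));
    if (worst.score >= 5) return artifact;           // Functional or better: emit
    artifact = fixLowest(artifact, worst);           // Broken: fix the lowest, rescore
  }
  return artifact;                                   // emit with caveats after bounded retries
}
```

The two properties that matter are worst-of aggregation (never a mean) and a bounded retry; the scoring itself stays judgment, not code.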
The skill has three invocation paths, but it does not auto-fire on every output — that would be context-budget waste on trivial tasks. It fires when the work warrants it. Concretely, an emitted critique looks like this:
## Self-Critique
| Dimension | Score | Band | Evidence |
| ------------ | ----- | ---------- | ------------------------------------------------------------------------------------------------------------------------- |
| Correctness | 7/10 | Strong | The off-by-one in line 42 was caught by the test added in line 87; remaining edge cases (empty array, null user) covered. |
| Readability | 6/10 | Functional | Function names are clear, but `process()` at line 23 should be `validateAndPersist()`. |
| Architecture | 5/10 | Functional | Direct adapter call bypasses `UnifiedAdapterRegistry` per CLAUDE.md canonical paths — flagged for follow-up. |
| Security | 8/10 | Strong | Zod at boundary, no secrets in logs, parameterized queries. |
| Performance | 7/10 | Strong | No N+1 patterns; one `readFileSync` in cold-start path is acceptable. |
**Worst band**: 5/10 (Architecture). Functional, clear of the Broken (0-4) threshold; safe to emit. Follow-up issue filed for the canonical-path violation.
| Excuse | Counter |
|---|---|
| "The work is fine, I don't need to score it" | The skill exists because LLMs grade-inflate. The score forces evidence; the evidence is what catches the problem. |
| "I'll score it 7s across the board" | If every score is 7+, you're rubber-stamping. Honest distributions are mostly 5-7 with one or two 6s and one or two 8s. |
| "The user will tell me if it's wrong" | The user reads what you emit. Costly fixes happen post-emit. The cycle is cheap (~30 seconds); the post-emit fix is expensive (a turn at minimum). |
| "I averaged the scores so it's fine" | The worst band is the score. Page 3 broken pulls the whole work down regardless of how good pages 1-2 are. |
| "Innovation should be 9 because I tried something new" | Innovation is the riskiest dimension to inflate. Did the new thing serve the brief, or was it grafted on? |
| "I'll skip the evidence — I know what I mean" | Evidence is what makes the critique falsifiable. Without it, the next reviewer has no way to verify or extend. |