Review code with confidence-scored evidence. Session, scoped, PR, or full audit; optional approved fix pass. Use when reviewing changes or quality. NOT for feature work or benchmarking.
Install: `npx claudepluginhub wyattowalsh/agents --plugin agents`. This skill uses the workspace's default tool permissions.
Research-driven code review. Every finding validated with evidence.
Evals: evals/agentic-verification.json, evals/approval-gate.json, evals/auto-fix-dry-run.json, evals/conventional-comments.json, evals/degraded-mode.json, evals/dependency-context.json, evals/diff-mode.json, evals/empty-no-diff-menu.json, evals/finding-quality-validation.json, evals/full-codebase-audit.json, evals/git-range-review.json, evals/healthy-codebase.json, evals/history-mode.json, evals/learnings-loop.json, evals/monorepo-review.json, evals/multi-pass-review.json, evals/post-review-fix-pass.json, evals/pr-review.json, evals/reasoning-chain.json, evals/reasoning-first-output.json

Creates isolated Git worktrees for feature branches with prioritized directory selection, gitignore safety checks, auto project setup for Node/Python/Rust/Go, and baseline verification.
Executes implementation plans in current session by dispatching fresh subagents per independent task, with two-stage reviews: spec compliance then code quality.
Dispatches parallel agents to independently tackle 2+ tasks like separate test failures or subsystems without shared state or dependencies.
Research-driven code review. Every finding validated with evidence. 4-wave pipeline: Triage → Analysis → Research → Judge.
Scope: Code review and audit first. NOT for feature work, general explanation, or benchmarking. Post-review fix planning or execution is allowed only after the approval gate for selected findings.
Use these terms exactly throughout both modes:
| Term | Definition |
|---|---|
| triage | Wave 0: risk-stratify files (HIGH/MEDIUM/LOW) and determine specialist triggers before analysis |
| wave | A pipeline stage: Wave 0 (Triage), Wave 1 (Analysis), Wave 2 (Research), Wave 3 (Judge) |
| finding | A discrete code issue with severity, confidence score, evidence, and citation |
| confidence | Score 0.0-1.0 per finding; >=0.7 report, 0.3-0.7 unconfirmed, <0.3 discard (except P0/S0) |
| severity | Priority (P0-P3) and scope (S0-S3) classification of a finding's impact |
| judge | Wave 3 reconciliation: normalize, cluster, deduplicate, filter, resolve conflicts, rank findings |
| lens | A creative review perspective: Inversion, Deletion, Newcomer, Incident, Evolution, Adversary, Compliance, Dependency, Cost, Sustainability |
| blast radius | How many files, users, or systems a finding's defect could affect |
| slopsquatting | AI-hallucinated package names in dependencies — security-critical, checked first in Wave 2 |
| research validation | Core differentiator: every non-trivial finding confirmed with external evidence (Context7, WebSearch, DeepWiki, gh). Two evidence tiers: fact evidence (Grep) for reuse/simplification findings; assumption evidence (external research) for correctness, security, and non-obvious design. |
| systemic finding | A pattern appearing in 3+ files, elevated from individual findings during Judge reconciliation |
| approval gate | Mandatory pause after presenting findings — never implement fixes without user consent |
| pass | Internal teammate stage (Pass A: scan, Pass B: deep dive, Pass C: research) — distinct from pipeline waves |
| self-verification | Wave 3.5: adversarial pass on top findings to reduce false positives (references/self-verification.md) |
| convention awareness | Check for AGENTS.md/CLAUDE.md/.cursorrules — review against project's own agent instructions |
| code reuse finding | Newly written code duplicating existing functionality; must cite the existing implementation at file:line. |
| fact evidence | Grep result confirming X exists/doesn't exist in the codebase. Sufficient for reuse/simplification findings; no external research required. |
| assumption evidence | External research (WebSearch, Context7) confirming a pattern is harmful. Required for correctness, security, and non-obvious design findings. |
| degraded mode | Operation when research tools are unavailable — confidence ceilings applied per tool |
| review depth | honest-review always operates at maximum depth — all lenses, full research validation, full team; team composition is content-adaptive (file types determine specialist selection). |
| reasoning chain | Mandatory explanation of WHY before the finding statement. Reduces false positives. |
| citation anchor | [file:start-end] reference linking a finding to specific source lines. Mechanically verified. |
| conventional comment | Structured PR output label: praise/nitpick/suggestion/issue/todo/question/thought with (blocking)/(non-blocking) decoration. |
| dependency graph | Import/export map built during Wave 0 triage. Informs blast radius and cross-file impact. |
| learning | A stored false-positive dismissal that suppresses similar future findings. Scoped per project. |
| $ARGUMENTS | Mode |
|---|---|
| Empty + changes in session (git diff) | Session review of changed files |
| Empty + no changes | Show mode menu; require explicit audit for full codebase |
| File or directory path | Scoped review of that path |
| "audit" | Force full codebase audit |
| PR number/URL | Review PR changes (gh pr diff) |
| Git range (HEAD~3..HEAD) | Review changes in that range |
| "history" [project] | Show review history for project |
| "diff" or "delta" [project] | Compare current vs. previous review |
| --format sarif (with any mode) | Output findings in SARIF v2.1 (references/sarif-output.md) |
| "learnings" [command] | Manage false-positive learnings (add/list/clear) |
| --format conventional (with any mode) | Output findings in Conventional Comments format |
| "fix" / "apply" approved findings | Post-review fix pass via references/auto-fix-protocol.md |
| Unrecognized input | Ask for clarification |
If no explicit mode keyword is provided:
- git diff --name-only HEAD shows changes -> Session review
- "audit" -> Full codebase audit
- Git range (HEAD~3..HEAD) -> Range review
- --format sarif or --format conventional modifies the selected review mode; it never chooses the scope by itself

Severity calibration by project type:
Confidence-calibrated reporting: Every finding carries a confidence score (0.0-1.0). Confidence ≥ 0.7: report. Confidence 0.3-0.7: report as "unconfirmed". Confidence < 0.3: discard (except P0/S0). Rubric: references/research-playbook.md § Confidence Scoring Rubric.
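A minimal sketch of this reporting gate, assuming findings carry confidence, priority, and scope fields (the field names are illustrative, not the skill's actual schema):

```python
def confidence_gate(finding: dict) -> str:
    """Apply the thresholds: >=0.7 report, 0.3-0.7 unconfirmed, <0.3 discard.
    P0/S0 findings are never discarded, even at low confidence."""
    confidence = finding["confidence"]
    critical = finding.get("priority") == "P0" or finding.get("scope") == "S0"
    if confidence >= 0.7:
        return "report"
    if confidence >= 0.3 or critical:
        return "report as unconfirmed"
    return "discard"
```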
Strengths acknowledgment: Call out well-engineered patterns, clean abstractions, and thoughtful design. Minimum one strength per review scope. Strength notes are report items, not defect findings, and do not require a fix.
Positive-to-constructive ratio: Target 3:1. Avoid purely negative reports. If the ratio skews negative, re-examine whether low-severity findings are worth reporting. Exception: when 3+ P0/P1 findings are present, report all critical findings without reducing them to meet the ratio — critical issues are never suppressed for balance.
Convention-respecting stance: Review against the codebase's own standards, not an ideal standard.
Healthy codebase acknowledgment: If no P0/P1 or S0 findings: state this explicitly. A short report is a good report.
Three abstraction levels, each examining defects and unnecessary complexity:
Correctness (does it work?): Error handling, boundary conditions, security, API misuse, concurrency, resource leaks. Simplify: phantom error handling, defensive checks for impossible states, dead error paths. TOCTOU anti-pattern: existence checks before operations create race conditions; operate directly and handle the error (see the sketch after this list).
Design (is it well-built?): Abstraction quality, coupling, cohesion, test quality, cognitive complexity. Simplify: dead code, 1:1 wrappers, single-use abstractions, over-engineering. Stringly-typed (raw strings where constants/enums already exist). Parameter sprawl (new params instead of restructuring). Redundant state (duplicates existing state or derivable). Copy-paste variation (near-identical blocks that should be unified).
Efficiency (is it economical?): Algorithmic complexity, N+1, data structure choice, resource usage, caching. Simplify: unnecessary serialization, redundant computation, premature optimization. Hot-path bloat (blocking work on startup or per-request paths). Missed concurrency (independent ops run sequentially). Overly broad operations (reading entire file/collection when only a subset is needed). Unbounded structures / event-listener leaks.
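As a concrete illustration of the TOCTOU point under Correctness, here is a minimal sketch (hypothetical read_config helpers; Python is used only as an example language): the check-then-act version races with concurrent writers, while the direct version lets the operation fail and handles the error.

```python
import os

# Anti-pattern: time-of-check-to-time-of-use. Another process can remove
# the file between the exists() check and the open().
def read_config_racy(path):
    if os.path.exists(path):
        with open(path) as f:
            return f.read()
    return None

# Preferred: operate directly and handle the failure.
def read_config(path):
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        return None
```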
Context-dependent triggers (apply when relevant):
Apply at least 2 lenses per review scope. For security-sensitive code, Adversary is mandatory.
Reference: read references/review-lenses.md
Every finding must follow this order:
[file:start-end]: exact source location. Never state a finding without first explaining the reasoning. Citation anchors are mechanically verified: the referenced lines must exist and contain the described code. If verification fails, discard the finding.
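A sketch of what the mechanical check on a [file:start-end] anchor can look like; the snippet-containment rule here is an assumption, and the skill's actual verifier may be stricter:

```python
from pathlib import Path

def verify_citation(file: str, start: int, end: int, snippet: str) -> bool:
    """True if lines start..end exist in `file` and contain `snippet`.
    A finding whose anchor fails this check is discarded."""
    path = Path(file)
    if not path.is_file():
        return False
    lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
    if not (1 <= start <= end <= len(lines)):
        return False
    return snippet.strip() in "\n".join(lines[start - 1:end])
```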
THIS IS THE CORE DIFFERENTIATOR. Do not report findings based solely on LLM knowledge. For every non-trivial finding, validate with research:
Three-phase review per scope:
Two evidence tiers: Fact evidence — Grep confirms X exists/doesn't in the codebase; sufficient for reuse and simplification findings. Assumption evidence — external research confirming a pattern is harmful; required for correctness, security, and non-obvious design. When in doubt: if the finding can be fully proven or disproven by reading the codebase alone, fact evidence suffices.
Research playbook: read references/research-playbook.md
Run git diff --name-only HEAD to capture changes. Collect git diff HEAD for context.
Identify task intent from session history.
Detect convention files (AGENTS.md, CLAUDE.md, .cursorrules) — see references/triage-protocol.md.
For 6+ files: run triage per references/triage-protocol.md:
- Run `uv run python skills/honest-review/scripts/project-scanner.py [path]` for the project profile

For 1-5 files: lightweight triage: classify risk levels and run content type detection; do not skip full team composition. (Note: always-maximum-depth applies to team effort; triage for small reviews can still skip git history analysis steps to save time.)
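A rough sketch of this size-based dispatch, assuming the changed-file list is already known; the flag names are illustrative, and the authoritative steps live in references/triage-protocol.md:

```python
def plan_triage(changed_files: list) -> dict:
    """Pick triage depth from the size of the change set (6+ files = full)."""
    full = len(changed_files) >= 6
    return {
        "run_project_scanner": full,   # scripts/project-scanner.py profile
        "analyze_git_history": full,   # skipped for 1-5 files to save time
        "classify_risk_levels": True,  # always: HIGH / MEDIUM / LOW
        "detect_content_types": True,  # always: drives specialist selection
        "compose_full_team": True,     # never skipped; depth stays maximal
    }
```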
| Scope | Strategy |
|---|---|
| Any scope | Content-adaptive team (see below). Always maximum depth. No inline-only mode. |
Content-adaptive team composition:
Always spawn:
Spawn when triggered by triage content detection:
[Lead: triage (Wave 0), Judge reconciliation (Wave 3), final report]
|-- Correctness Reviewer → Passes A/B/C
|-- Design Reviewer → Passes A/B/C
|-- Efficiency Reviewer → Passes A/B/C
|-- Code Reuse Reviewer → Passes A/B/C
|-- Test Quality Reviewer → Passes A/B/C
|-- [Security Specialist if triage triggers]
|-- [Observability Specialist if triage triggers]
|-- [Requirements Validator if spec auto-detected]
|-- [Data Migration Specialist if schema changes present]
|-- [Frontend Specialist if UI/component files present]
Each reviewer runs 3 internal passes (references/team-templates.md § Internal Pass Structure):
Prompt templates: read references/team-templates.md
For small-scope reviews: lead collects all findings and dispatches the validation wave. For team reviews: each teammate handles validation internally (Pass C).
Batch findings by validation type. Dispatch order:
Batch sizing: 5-8 findings per subagent (optimal). See references/research-playbook.md § Batch Optimization.
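A minimal sketch of that batching step: group findings by validation type, then chunk each group into batches of roughly 5-8 for one research subagent each (the validation_type field is an assumed finding attribute):

```python
from collections import defaultdict
from itertools import islice

def batch_for_validation(findings, batch_size=6):
    """Group by validation type, then split into ~5-8 finding batches."""
    groups = defaultdict(list)
    for finding in findings:
        groups[finding.get("validation_type", "general")].append(finding)
    batches = []
    for group in groups.values():
        it = iter(group)
        while chunk := list(islice(it, batch_size)):
            batches.append(chunk)
    return batches
```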
Run the 8-step Judge protocol (references/judge-protocol.md):
If 2+ findings survive, run self-verification (Wave 3.5): references/self-verification.md
Present all findings with evidence, confidence scores, and citations. After presenting findings, ask: "Which findings should I create a fix plan for? [all / select by ID / skip]" If approved: load references/auto-fix-protocol.md and start a separate post-review fix pass for selected findings only. Generate an orchestration implementation plan using Pattern E. Dispatch independent fixes in parallel; serialize same-file edits. Verify after all tasks complete (build, tests, behavior). Output format: read references/output-formats.md. For SARIF output: read references/sarif-output.md.
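One way the fix dispatch could be grouped so that same-file edits are serialized while unrelated fixes run in parallel; a sketch only, assuming each approved finding records the file it touches:

```python
from collections import defaultdict

def plan_fix_dispatch(approved_fixes):
    """Return dispatch groups: fixes within a group are applied sequentially
    (same file); separate groups can be dispatched in parallel."""
    by_file = defaultdict(list)
    for fix in approved_fixes:
        by_file[fix["file"]].append(fix)
    return list(by_file.values())
```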
Full triage per references/triage-protocol.md:
- Run `uv run python skills/honest-review/scripts/project-scanner.py [path]` for the project profile

For 500+ files: prioritize HIGH-risk, recently modified, entry points, public API. State scope limits in report.
Use triage results to select team composition. Apply the same content-adaptive team as Session Review. Assign file ownership based on risk stratification — HIGH-risk files get domain reviewer + specialist coverage.
| Scope | Strategy |
|---|---|
| Any scope | Content-adaptive team (see below). Always maximum depth. No inline-only mode. |
Content-adaptive team composition:
Always spawn:
Spawn when triggered by triage content detection:
[Lead: triage (Wave 0), cross-domain analysis, Judge reconciliation (Wave 3), report]
|-- Correctness Reviewer → Passes A/B/C
|-- Design Reviewer → Passes A/B/C
|-- Efficiency Reviewer → Passes A/B/C
|-- Code Reuse Reviewer → Passes A/B/C
|-- Test Quality Reviewer → Passes A/B/C
|-- [Security Specialist if triage triggers]
|-- [Observability Specialist if triage triggers]
|-- [Requirements Validator if spec auto-detected]
|-- [Data Migration Specialist if schema changes present]
|-- [Frontend Specialist if UI/component files present]
Audit-only scoping for large codebases (500+ files): the core reviewers (Correctness, Design, Efficiency) split file ownership by risk tier; Code Reuse Reviewer is scoped to changed/HIGH-risk files plus a sampled cross-section (top 50 files by fan-in and LOC) — not the entire codebase; Test Quality Reviewer focuses on test files that cover changed/HIGH-risk code. Lead continues to run cross-domain analysis in parallel (Audit Step 3 — unchanged). The 500+ file splitting rule does NOT apply in Session Review mode (which always operates on a bounded diff).
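A sketch of how the Code Reuse Reviewer's sampled cross-section might be assembled in a large audit: changed and HIGH-risk files always, plus the top 50 files ranked by fan-in and size (the exact ranking weights are an assumption):

```python
def reuse_review_scope(files, top_n=50):
    """files: list of {"path", "changed", "risk", "fan_in", "loc"} dicts.
    Returns the paths the Code Reuse Reviewer should cover."""
    always = {f["path"] for f in files if f["changed"] or f["risk"] == "HIGH"}
    ranked = sorted(files, key=lambda f: (f["fan_in"], f["loc"]), reverse=True)
    sampled = {f["path"] for f in ranked[:top_n]}
    return sorted(always | sampled)
```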
Each teammate runs 3 internal passes (references/team-templates.md § Internal Pass Structure). Scaling: references/team-templates.md § Scaling Matrix.
While teammates review, lead spawns parallel subagents for:
Each teammate handles research validation internally (Pass C). Lead validates cross-domain findings separately. Batch optimization: references/research-playbook.md § Batch Optimization.
Collect all findings from all teammates + cross-domain analysis. Run the 8-step Judge protocol (references/judge-protocol.md). Cross-domain deduplication: findings spanning multiple domains → elevate to systemic.
Output format: read references/output-formats.md. Required sections: Critical, Significant, Cross-Domain, Health Summary, Top 3 Recommendations, Statistics. All findings include evidence + citations.
After presenting findings, ask: "Which findings should I create a fix plan for? [all / select by ID / skip]" If approved: load references/auto-fix-protocol.md and start a separate post-review fix pass for selected findings only. Generate an orchestration implementation plan using Pattern E. Dispatch independent fixes in parallel; serialize same-file edits. Verify after all tasks complete (build, tests, behavior).
State is optional and scoped to review history or false-positive learnings. Persist it under ~/.{gemini|copilot|codex|claude}/honest-reviews/. Do not write state during ordinary read-only reviews unless the user asks to save history, compare runs, or manage learnings.
| Need | Command |
|---|---|
| Save review JSON | `cat findings.json \| …` |
| Show history | uv run python skills/honest-review/scripts/review-store.py list --project <slug> |
| Compare reviews | uv run python skills/honest-review/scripts/review-store.py diff --project <slug> --old previous --new latest |
| Check learnings | `cat findings.json \| …` |
| Manage learnings | `uv run python skills/honest-review/scripts/learnings-store.py <add\|list\|clear>` |
If no prior review exists for history or diff, report "no stored baseline" and continue read-only.
Load ONE reference at a time. Do not preload all references into context.
| File | When to Read | ~Tokens |
|---|---|---|
| references/triage-protocol.md | During Wave 0 triage (both modes) | 1500 |
| references/checklists.md | During analysis or building teammate prompts | 2800 |
| references/research-playbook.md | When setting up research validation (Wave 2) | 2200 |
| references/judge-protocol.md | During Judge reconciliation (Wave 3) | 1200 |
| references/self-verification.md | After Judge (Wave 3.5) — adversarial false-positive reduction | 900 |
| references/auto-fix-protocol.md | When implementing fixes after approval | 800 |
| references/output-formats.md | When producing final output | 1100 |
| references/sarif-output.md | When outputting SARIF format for CI tooling | 700 |
| references/supply-chain-security.md | When reviewing dependency security | 1000 |
| references/team-templates.md | When designing teams (Mode 2 or large Mode 1) | 2200 |
| references/review-lenses.md | When applying creative review lenses | 1600 |
| references/ci-integration.md | When running in CI pipelines | 700 |
| references/conventional-comments.md | When producing PR comments or CI annotations | 400 |
| references/dependency-context.md | During Wave 0 triage for cross-file dependency analysis | 500 |
| Script | When to Run |
|---|---|
| skills/honest-review/scripts/project-scanner.py | Wave 0 triage — deterministic project profiling |
| skills/honest-review/scripts/finding-formatter.py | Wave 3 Judge — normalize findings to structured JSON (supports --format sarif) |
| skills/honest-review/scripts/review-store.py | Save, load, list, diff review history |
| skills/honest-review/scripts/sarif-uploader.py | Upload SARIF results to GitHub Code Scanning |
| skills/honest-review/scripts/learnings-store.py | Manage false-positive learnings (add, check, list, clear) |
| Template | When to Render |
|---|---|
| templates/dashboard.html | After Judge reconciliation — inject findings JSON into data tag |
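A sketch of the injection step, assuming the template exposes a simple placeholder token (the __FINDINGS_JSON__ marker below is hypothetical; use whatever data tag templates/dashboard.html actually defines):

```python
import json
from pathlib import Path

def render_dashboard(template_path, output_path, findings):
    """Write a dashboard HTML file with the findings JSON injected into
    the template's data placeholder."""
    html = Path(template_path).read_text(encoding="utf-8")
    # Hypothetical placeholder; swap for the template's real data tag.
    html = html.replace("__FINDINGS_JSON__", json.dumps(findings))
    Path(output_path).write_text(html, encoding="utf-8")
```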
Every [file:start-end] citation anchor is mechanically verified against the source.