Code review methodology -- checklist, severity levels, categories, false positive rules
From the `george-popescu/bee-dev` plugin (`--plugin bee`). This skill uses the workspace's default tool permissions.
Follow these steps in order for every review:
1. **Read the spec and acceptance criteria.** Open `spec.md` and `TASKS.md` for the phase being reviewed. Understand what was supposed to be built, not just what exists.
2. **Read false positives.** Open `.bee/false-positives.md` (if it exists). Note every documented FP. Compare each potential finding against this list before including it.
3. **Identify files to review.** Read the `TASKS.md` task entries to find all files created or modified in the phase. These are the review targets.
4. **Scan incrementally.** For each file, use Grep and Read to examine the code. Do NOT load all files into context at once -- scan one file at a time, focusing on areas relevant to the checklist categories below.
5. **Check against the checklist.** Walk through the 7-section checklist for each file. Record findings only when you have HIGH confidence.
6. **Verify confidence before including.** For each potential finding, ask: "Am I certain this is wrong?" If unsure, skip it. The finding-validator handles ambiguity.
7. **Write REVIEW.md.** Write the review report to the phase directory using the template at `skills/core/templates/review-report.md`.
**Calibration guidance:** Only report findings you have HIGH confidence in. A review with 30+ findings overwhelms the pipeline and destroys usefulness. Target 5-15 findings per phase review. Quality over quantity -- a focused review with 8 specific, actionable findings is worth more than a broad review with 25 vague ones.
**Critical** -- Findings that represent immediate risk to security, data integrity, or application stability.

A Critical finding means: "This must be fixed before the code can ship."

Boundary with High: If an issue could lead to data loss or a security breach under normal usage, it is Critical. If it causes incorrect behavior but does not compromise security or data, it is High.

**High** -- Findings that affect the correctness of business logic or the completeness of the implementation.

A High finding means: "This is wrong or incomplete and needs attention before review passes."

Boundary with Medium: If an issue changes the correctness of output or violates a spec requirement, it is High. If it only affects readability, maintainability, or convention adherence, it is Medium.

**Medium** -- Findings that affect code quality, maintainability, or adherence to standards.

A Medium finding means: "This should be cleaned up but does not break functionality."

Boundary with "not a finding": If an issue is purely a style preference with no impact on readability or maintainability, it is not a finding. Only flag Medium issues that have a concrete negative effect on the codebase.
| Category | What It Covers | Look For |
|---|---|---|
| Bug | Logic errors, incorrect behavior, runtime errors | Wrong conditions, off-by-one, null dereference, type mismatches, race conditions |
| Spec Gap | Feature missing or not matching spec requirements | Acceptance criteria without implementation, behavior mismatch with spec |
| Standards | Naming, structure, pattern violations | Stack skill conventions not followed, global standards violations, wrong file location |
| Dead Code | Unused imports, unreachable code, orphaned functions | import without usage, code after return, functions never called, commented-out blocks |
| Security | XSS, injection, auth, data exposure | Unsanitized input, raw SQL, missing auth middleware, exposed env vars, hardcoded secrets |
| TDD | Missing tests, test-after-implementation, untested criteria | Production files without test files, acceptance criteria without test coverage |
| Pattern | Inconsistency with existing codebase patterns | New code deviating from established conventions in the same project |
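To make the Bug category concrete, here is a hypothetical TypeScript sketch (the function names are illustrative, not project code) showing two of the patterns the table names -- an off-by-one and a null dereference:

```typescript
// Off-by-one: `<=` walks one index past the end of the array.
function sumBuggy(xs: number[]): number {
  let total = 0;
  for (let i = 0; i <= xs.length; i++) {
    total += xs[i]; // xs[xs.length] is undefined, so total becomes NaN
  }
  return total;
}

// Null dereference risk: Array.find may return undefined, so guard it.
interface User { id: number; name: string }
function nameOf(users: User[], id: number): string {
  const u = users.find((x) => x.id === id);
  return u ? u.name : "unknown"; // guard instead of `u.name` directly
}
```

Findings in this class should cite the exact `file:line` of the bad index or the unguarded access.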
Recognize these patterns as potential false positives -- findings that look wrong but are actually correct:
- **Framework-specific patterns:** Laravel facades, Vue reactivity transforms, Inertia shared props, Next.js server components, convention-over-configuration patterns. These look like violations but are the framework's intended usage.
- **Intentional design choices:** Code documented in task notes or spec as deliberately chosen. If TASKS.md notes say "using X approach because Y," do not flag X as a problem.
- **Concern handled elsewhere:** Validation in middleware instead of the controller, error handling in a global handler instead of per-function, authorization in a policy instead of inline. Check the full request lifecycle before flagging missing logic.
- **Official boilerplate:** Configuration files, migration stubs, factory patterns, and scaffolding that follow the framework's official documentation pattern. These are not "dead code" or "standards violations."
- **Dynamic usage:** Dependency injection containers, event listeners, magic methods, reflection-based invocations. Static code reading misses runtime resolution -- check for DI bindings, event registrations, and service provider configurations before flagging "unused" code.
Matching against `.bee/false-positives.md`: Each entry in false-positives.md has a file path, a finding description, and a reason. When you encounter a potential finding, check whether the file path matches, whether the finding description matches, and whether the documented reason still applies. If all three match, exclude the finding. If the code has changed since the FP was documented, include the finding -- the FP may no longer be valid.
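The exact entry format is defined by the project; a hypothetical `.bee/false-positives.md` entry might look like:

```markdown
## app/Http/Middleware/Authenticate.php
- Finding: "redirectTo() returns null for API routes -- possible null dereference"
- Reason: Intentional; API routes expect a 401 JSON response, not a redirect
```

A potential finding is excluded only when its file path, description, and reason all line up with such an entry.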
When in doubt, include the finding. The finding-validator will classify it as FALSE POSITIVE or STYLISTIC if appropriate. Better to report a borderline case than miss a real bug.
Read spec.md and TASKS.md acceptance criteria first. For each criterion, verify the implementation delivers the specified behavior.
Read the stack skill from skills/stacks/{stack}/SKILL.md (where {stack} comes from config.json). Apply stack-specific conventions to every file.
Read the standards skills:

- Global standards (`skills/standards/global/SKILL.md`) for cross-stack rules
- Frontend standards (`skills/standards/frontend/SKILL.md`) for component architecture, a11y, responsive design, and CSS methodology
- Backend standards (`skills/standards/backend/SKILL.md`) for API design, database, migrations, and query optimization

Use Grep to search for usage of each export, function, and import. Do not rely on reading a single file -- verify cross-file references.
Read the testing standards skill (skills/standards/testing/SKILL.md) for project-specific test conventions. Verify tests exist and cover the acceptance criteria.
These three rules apply to ALL review agents. They address the most common classes of missed findings.
When you find a bug or deviation in one location, IMMEDIATELY scan ALL similar constructs in the codebase for the same bug class. If you find an off-by-one in one loop, check EVERY loop. If you find a missing null check in one handler, check EVERY handler. Report ALL instances, not just the first.
Example: You find that one checkbox parser misses the [FAILED] state. Scan every other checkbox/status parser in the codebase -- they likely have the same gap.
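To make the checkbox example concrete, here is a hypothetical parser sketch with that gap and its fix (the state names and markers are assumptions, not the project's actual format):

```typescript
type TaskState = "todo" | "done" | "failed" | "unknown";

// Buggy: handles [ ] and [x] but silently maps [FAILED] to "unknown".
function parseStateBuggy(line: string): TaskState {
  if (line.includes("[x]")) return "done";
  if (line.includes("[ ]")) return "todo";
  return "unknown"; // [FAILED] falls through here -- the missed state
}

// Fixed: every marker the format allows is matched explicitly.
function parseState(line: string): TaskState {
  if (line.includes("[FAILED]")) return "failed";
  if (line.includes("[x]")) return "done";
  if (line.includes("[ ]")) return "todo";
  return "unknown";
}
```

Having found this gap in one parser, grep for every other `[x]` / `[ ]` match in the codebase and apply the same check to each.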
For each construct you review, systematically verify boundary conditions:
- First and last iterations of every loop or range
- Every state a parser can encounter (`[ ]`, `[x]`, `[FAILED]`, empty)

Example: A loop uses `i < array.length` but starts at index 1 -- the last element is processed but the first is skipped.
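A minimal sketch of that off-by-one class (hypothetical helper, not project code):

```typescript
// Buggy: the loop starts at index 1, so the first element is never visited.
function firstNegativeBuggy(xs: number[]): number | null {
  for (let i = 1; i < xs.length; i++) {
    if (xs[i] < 0) return xs[i];
  }
  return null;
}

// Fixed: cover the full range [0, xs.length).
function firstNegative(xs: number[]): number | null {
  for (let i = 0; i < xs.length; i++) {
    if (xs[i] < 0) return xs[i];
  }
  return null;
}
```

The boundary check is mechanical: confirm the start index, the end condition, and the step cover exactly the intended range.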
For each state write (STATE.md, TASKS.md, REVIEW.md, config files), trace what is on disk between consecutive writes, and whether a crash at that point leaves a partial state that resume logic can detect and recover from.
Example: The command writes STATUS=EXECUTING to STATE.md, then writes TASKS.md checkboxes. If it crashes after the first write but before the second, resume sees EXECUTING but TASKS.md has no completed tasks -- does it re-execute everything or detect the gap?
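One way to make that crash window detectable, sketched with assumed file contents (the real commands may structure state differently):

```typescript
// Detect the partial-write gap: STATE.md says EXECUTING, but TASKS.md
// shows no completed checkboxes -- the second write never happened.
function hasResumeGap(stateMd: string, tasksMd: string): boolean {
  const executing = stateMd.includes("STATUS=EXECUTING");
  const anyDone = /\[x\]/.test(tasksMd);
  return executing && !anyDone;
}
```

A resume path that checks for this gap can re-verify the phase instead of blindly re-executing everything.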
Write REVIEW.md to the phase directory (same location as TASKS.md).
Use the template at `skills/core/templates/review-report.md` as the format reference. The template defines the report layout, the per-finding fields, and the summary tables.
Each finding gets a sequential ID: F-001, F-002, F-003, etc.
For every finding, include all 11 fields from the template. Among them: a trace of the problem path (`file:line → file:line → file:line (problem)`), a validation status initialized to `pending` (the review command handles validation), and a fix status initialized to `pending` (the review command handles fixing).

Summary counts: After writing all findings, fill in the summary tables at the top of REVIEW.md.
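As a hypothetical illustration only (the authoritative field list lives in the template), a finding entry might look like:

```markdown
### F-001 -- High / Bug
- File: app/Services/ReportBuilder.ts:42
- Trace: routes/api.ts:17 → app/Http/Controllers/ReportController.ts:30 → app/Services/ReportBuilder.ts:42 (problem)
- Validation: pending
- Fix: pending
```

The file paths and section layout here are invented for illustration; follow the template verbatim when writing real entries.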
Finding quality checklist (self-check before writing):

- Do NOT include findings you are not confident about.
- Do NOT include findings that match documented false positives.
- Do NOT include vague findings without specific file and line references.