From bsky
Systematic code review with sub-agent analysis. Works on PRs, branches, files, or staged changes.
How this skill is triggered — by the user, by Claude, or both
Slash command
/bsky:code-reviewThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Perform a structured, multi-phase code review. This skill produces actionable findings
Perform a structured, multi-phase code review. This skill produces actionable findings with severity, location, and concrete fixes — not style nits or praise.
Use memory-mcp's read tool to load the code-review-patterns memory (scope: global)
before starting. It contains learned patterns from previous reviews that should inform
what you look for. If the memory doesn't exist yet, proceed without it — findings from
this review will seed it.
$ARGUMENTS determines the review scope:
| Argument | Scope |
|---|---|
| (empty) | All uncommitted changes (staged + unstaged) |
pr or pr <N> | Current branch's PR diff, or PR #N |
branch | All commits on current branch vs base |
file <path> | Single file, full review |
files <glob> | Multiple files matching pattern |
commit <ref> | Single commit diff |
--since <ref> | Incremental: only commits since <ref> |
--since)When --since <commit-sha> is appended to any scope argument (e.g., branch --since abc123),
the review operates in incremental mode for fix-round efficiency:
<ref> — this is what sub-agents analyzeThis mode cuts review time on rounds 2+ significantly by preventing re-analysis of
already-reviewed, unchanged code. The /develop skill's Phase 4.5 should use this mode
automatically when re-invoking /code-review after fixes.
Before reviewing code, build understanding. This phase is silent — no output to user.
$ARGUMENTS to a concrete set of changed files and hunks.claude/CLAUDE.md, memory-mcp project memories
(use list filtered by project scope, look for project-overview, conventions), and
any linter/formatter configsget_symbols_overview
on affected files to understand the surrounding code structure. Read symbol bodies only
when needed to understand how changed code fits into the system.find_referencing_symbols to identify all call sites. This is critical for
catching breakage that looks fine in isolation.Context budget: aim for roughly 1:1 ratio of changed code to surrounding context. More context than code means you're over-reading. Less means you're likely missing impact.
Large diff warning: research shows review effectiveness drops sharply past 400 lines of changed code. If the diff exceeds ~500 lines, tell the user upfront and suggest reviewing in logical chunks (by file group or functional area) rather than all at once.
Launch all three sub-agents in a single message with three parallel Agent tool calls,
each with run_in_background: true. This ensures true concurrent execution — launching
them sequentially wastes time and defeats the purpose of independent analysis. Each agent
gets the same diff and context but a different analytical lens. The separation ensures
independent findings — a bug one agent normalizes, another catches. Use sonnet for
sub-agents A and B (mechanical analysis), opus for sub-agent C (judgment-heavy
architectural review).
You are reviewing code changes for correctness and safety issues. Precision matters
more than count. Every genuine finding at any priority level (P1, P2, or P3) is
valuable and will be addressed. False positives waste verification time and erode
trust in the review process — a finding that isn't real is worse than a finding
you didn't report.
Review the following changes for:
**Logic errors**
- Off-by-one, wrong operator, inverted condition, missing early return
- Race conditions, TOCTOU, shared mutable state
- Error handling: swallowed errors, wrong error type, missing propagation
**Completeness gaps** (this is the #1 thing LLM reviewers miss — be thorough)
- For each branch/match arm: what cases exist in the domain that aren't covered?
- For each identifier used in dedup/lookup/fallback: what entity types share that
format? (e.g., owner/repo#77 matches issues, PRs, AND discussions)
- For each pattern match on input: trace all callers — what inputs reach this code
that DON'T match any handled pattern?
- "Inputs always look like X" is a red flag — verify by tracing actual data flow
- For each resource created (sessions, connections, handles, caches, temp files):
what cleans it up? Timeout, eviction, explicit close, Drop impl? If nothing
cleans it up, that's a finding.
**Data integrity**
- Mutations that could corrupt or lose data (wrong UPDATE scope, missing WHERE clause)
- LIKE/GLOB wildcards in user-supplied values without escaping
- Identifier collisions across entity types sharing the same format
- Upsert/dedup logic that collapses things that should remain distinct
For each finding, output EXACTLY this format:
**[P1|P2|P3] <short title>**
- File: `<path>:<line>`
- Issue: <1-2 sentence description of what's wrong>
- Impact: <what breaks, and under what conditions>
- Fix: <concrete code change or approach>
You are reviewing code changes for design and maintainability issues. Precision matters
more than count. Every genuine finding at any priority level (P1, P2, or P3) is
valuable and will be addressed. False positives waste verification time and erode
trust in the review process — a finding that isn't real is worse than a finding
you didn't report.
Review the following changes for:
**API contract issues**
- Breaking changes to public interfaces without migration
- Inconsistent naming, parameter ordering, or return types vs existing patterns
- Missing or misleading error messages that will confuse callers
**Dead code & redundancy**
- Code that the change made unreachable or unnecessary
- Duplicated logic that should be consolidated
- Imports, variables, enum variants, or parameters that are no longer used
- For `Option`-guarded features: does disabling the feature leave allocated-but-unused
fields? If so, group the feature's state into a sub-struct and wrap in `Option`.
**Testing gaps**
- Changed behavior that has no corresponding test update
- Edge cases in new code that tests don't cover
- Test assertions that don't actually verify the behavior they claim to test
**Test quality**
- Do tests exercise the library's public API, or do they duplicate internal logic?
Tests that reimplement the production code path instead of calling it prove nothing
about the real code.
- Are assertions non-vacuous? A test should fail if its assertion is removed. Tests
that compare single-element collections, assert `true`, or check trivially-true
conditions waste CI time and give false confidence.
- For edge-case tests: is the edge case actually exercised? Trace the test input
through the code — does it actually hit the branch/condition the test name claims?
**Architectural fit**
- Does this change follow the project's established patterns?
- Are abstractions at the right level? (over-engineering is as bad as under-)
- Will this change make future work harder? (coupling, hidden dependencies)
For each finding, output EXACTLY this format:
**[P1|P2|P3] <short title>**
- File: `<path>:<line>`
- Issue: <1-2 sentence description of what's wrong>
- Impact: <what breaks, and under what conditions>
- Fix: <concrete code change or approach>
You are reviewing code changes for architectural fitness and security. You did NOT
write this code and have NOT seen the implementation process — only the diff and
the project structure. This separation is intentional: you catch things the
implementer normalized away. Precision matters more than count. Every genuine
finding at any priority level (P1, P2, or P3) is valuable and will be addressed.
False positives waste verification time and erode trust in the review process — a
finding that isn't real is worse than a finding you didn't report.
Review the following changes for:
**API contracts**
- Are changes backward-compatible? If breaking: are ALL callers updated?
Use `find_referencing_symbols` to verify.
- Are error types consistent with the project's conventions?
- Would a consumer of this API be surprised by the new behavior?
**Architectural fit**
- Does this follow established patterns in the codebase?
- If introducing a new pattern: is the old pattern being migrated, or will both coexist?
- Are dependencies flowing in the right direction? (no circular deps, no upward deps)
- Is the abstraction level appropriate? (not over-engineered, not under-abstracted)
**Completeness**
- For each match/branch: what cases exist in the domain that aren't handled?
- For each input path: trace what data can actually arrive — are all shapes covered?
- Are error paths tested? Is the happy path the only path tested?
**Security (STRIDE threat model)**
For each change that touches trust boundaries, data flows, or auth:
- Spoofing: can an attacker impersonate a user or system?
- Tampering: can data be modified in transit/at rest without detection?
- Repudiation: can actions occur without accountability/logging?
- Information disclosure: can sensitive data leak via logs, errors, side channels?
- Denial of service: can an attacker exhaust resources (CPU, memory, connections)?
Specifically: stateful services that accept external connections — is there a
session/connection limit and idle timeout? Unbounded accumulation from external
triggers is a resource exhaustion vector.
- Elevation of privilege: can a user gain access beyond their authorization?
Also check for concrete injection vectors:
- SQL, command, XSS, template, LDAP, path traversal
- Hardcoded secrets, weak crypto, insufficient randomness
**Security (beyond STRIDE)**
- Credential exposure: can secrets appear in process listings (ps), logs, stdout,
error messages, stack traces, or debug output? Check CLI args, Display/Debug impls,
and tracing instrumentation on structs that hold secrets.
- Secrets in git: are tokens, keys, or credentials hardcoded or at risk of being
committed? Check for missing .gitignore entries, secrets in config files, or test
fixtures containing real credentials.
- Trust boundaries: where does external input enter the system? Is it validated before
use? Check HTTP handlers, MCP tool parameters, file paths from user input (path
traversal), and deserialized data from untrusted sources.
- Auth bypass: can any code path skip authentication or authorization? Trace from the
network entry point to the protected operation — is there a path that doesn't check
credentials?
- Information leakage: do error responses reveal internal structure (stack traces, file
paths, SQL queries, dependency versions) to external callers?
- Dependency surface: do new dependencies introduce known vulnerabilities or excessive
privilege? Flag any dependency that pulls in native code, network access, or filesystem
access beyond what the feature requires.
**Simplicity**
- Could the same result be achieved with less code?
- Are there intermediate abstractions that exist only to serve this one use case?
- Is there dead code from a previous approach that should be cleaned up?
- Would a future reader understand this without the PR description?
For each finding, output EXACTLY this format:
**[P1|P2|P3] <short title>**
- File: `<path>:<line>`
- Issue: <1-2 sentence description of what's wrong>
- Impact: <what breaks, and under what conditions>
- Fix: <concrete code change or approach>
Each sub-agent receives:
code-review-patterns memory from memory-mcp (learned patterns)Use Serena tools within sub-agents for any additional code exploration needed.
When running a subsequent round on the same scope, include a "Previously dismissed" section in each sub-agent prompt. This prevents agents from re-discovering the same false positives and wasting verification cycles. Use this format:
Previously evaluated and dismissed (do not re-flag unless you have NEW evidence
that changes the analysis):
- "<finding title>" — <reason for dismissal>
Only include findings that were verified as false positives in a prior Phase 3 — not findings that were real and fixed (those belong in the "previously fixed" context).
| Level | Meaning |
|---|---|
| P1 | Data loss, security vulnerability, crash, or silent corruption |
| P2 | Incorrect behavior, broken edge case, or test gap |
| P3 | Design issue, dead code, or maintainability concern |
After all three sub-agents return:
Present findings grouped by severity, then by file. Use this format:
## Code Review: <scope description>
### P1 — Critical (N findings)
**<title>** — `path/to/file.py:42`
<description with concrete fix>
### P2 — Important (N findings)
...
### P3 — Suggestions (N findings)
...
### Summary
- N files reviewed, M findings (X P1, Y P2, Z P3)
- Key themes: <1-2 sentence synthesis of what the findings reveal>
If there are zero findings at a severity level, omit that section entirely. If there are zero findings total, say so clearly — don't invent issues to fill space.
Findings should be captured somewhere durable, not just displayed in-session. Try each option in order and use the first that works:
Check if the reviewed scope corresponds to a PR:
$ARGUMENTS is pr or pr <N>, the PR is already known$ARGUMENTS is branch, check for an open PR on the current branch:
gh pr list --head <branch> --state open --json number,urlgh pr comment <N> --body <review>If no PR exists, check if there's a tracked issue for the work:
gh issue comment <N> --repo <repo> --body <review>If the work is in a git repo with uncommitted or unpushed changes on a feature branch:
gh pr create --title "<branch context>" --body <review>Only do this when it's straightforward — the branch exists, changes are committed, and it's clear what the PR should be. Ask the user if anything is ambiguous.
If the work is in a repo but there's no PR or existing issue:
gh issue create --title "Code review: <scope>" --body <review>If none of the above apply (e.g., reviewing files outside version control, or the user is working interactively and will act on findings immediately), displaying the review in the conversation is sufficient. The findings are still valuable even without a durable destination.
Always tell the user where the review was posted (PR URL, issue URL, or "displayed in-session").
If the review produced P1 or P2 findings that reveal a pattern (not just a one-off bug),
update the code-review-patterns memory in memory-mcp (scope: global):
This is the feedback loop — each review makes future reviews better.
Only update patterns for findings that were verified in Phase 3. Do not add speculative patterns from unconfirmed findings.
The skill respects project-level overrides. If a project's memory-mcp memories contain a
code-review-config memory (check with list filtered by project scope), read it and apply:
["dead_code"])dead_code: P3→ignore)npx claudepluginhub butterflyskies/claude-marketplace --plugin bskyCreates bite-sized, testable implementation plans from specs or requirements, with file structure and task decomposition. Activates before coding multi-step tasks.