Evidence-based evaluation of AI-generated findings — review feedback, audit findings, spec challenges, or any AI-generated assessment that requires triage before action. Teaches agents to investigate before accepting, calibrate confidence based on evidence quality, classify findings into actionable categories, and detect assessment bias. Load this skill when evaluating findings from reviewers, auditors, challengers, or any AI-generated feedback loop.
From `shared`: install with `npx claudepluginhub inkeep/team-skills --plugin shared`. This skill uses the workspace's default tool permissions.
references/ai-false-positive-patterns.md
How to assess findings — whether from an AI reviewer, an auditor, a challenger, or any other source of AI-generated feedback. This protocol applies to every context where an agent must decide which findings are correct, which are noise, and what to do about each.
Findings are hypotheses about the work. Some will be correct and important. Some will be wrong, inapplicable, or based on incomplete understanding. Your job is to determine which is which, with evidence.
Do not default to acceptance. The path of least resistance (apply every suggestion) produces worse code than thoughtful evaluation. Equally, do not default to rejection — that wastes valid insights.
Assess each finding by merit. The question is not "is this a big issue?" or "is this convenient to fix?" The question is: is the finding correct? A correct improvement is worth implementing regardless of size. An incorrect suggestion should be declined regardless of who raised it.
For each finding, build enough context to make a high-confidence judgment. Do not assess from the finding description alone.
Your investigation stance is adversarial: attempt to disprove each finding before accepting it. If the finding survives your best effort to disprove it with evidence, it is worth acting on. If you can disprove it, discard it. This counteracts the dominant failure mode — sycophantic acceptance of AI-generated feedback that degrades correct work.
Mandatory for every finding: codebase inspection and web search (the mandatory minimums).
As needed: deeper, claim-specific investigation, routed by what the finding claims.
Route your investigation based on what the finding claims:
| Claim type | Investigation approach |
|---|---|
| Code behavior ("this function doesn't handle X") | Trace the relevant code paths — grep, read, explore. Verify against actual implementation, not description. |
| Library/API/framework ("this API was deprecated," "method X doesn't exist") | Check ~/.claude/oss-repos/ and node_modules/ for local dependency source first — actual types, interfaces, and implementations are higher-trust than docs and version-matched to the project. Then web search with version pinning (confirm version from repo's package.json/lockfile). |
| Ecosystem/best practice ("the recommended approach is X") | Check the project's own tooling first — linter configs, type-checker settings, formatter rules. The project already declared its practices by configuring its tools. Then check framework/language official docs (version-pinned). Then web search with recency for community consensus. |
| Internal coherence ("Section 3 contradicts Section 7") | Re-read the specific sections referenced. Trace downstream implications of the contradiction. |
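The routing above can be sketched as a dispatch table. The claim-type keys and handler behavior here are illustrative, not a prescribed API; a real agent would replace the string results with actual tool calls (grep, file reads, web search):

```python
from typing import Callable

# Illustrative sketch: route a finding to an investigation strategy
# based on the kind of claim it makes. Handler names are hypothetical.

def investigate_code_behavior(finding: dict) -> str:
    # Trace the relevant code paths (grep/read); verify against the
    # actual implementation, not the finding's description.
    return f"trace code paths for: {finding['claim']}"

def investigate_library_claim(finding: dict) -> str:
    # Check local dependency source first, then version-pinned web search.
    return f"check node_modules + pinned search for: {finding['claim']}"

def investigate_best_practice(finding: dict) -> str:
    # Project tooling first, then official docs, then community consensus.
    return f"check linter/formatter config for: {finding['claim']}"

def investigate_coherence(finding: dict) -> str:
    # Re-read the referenced sections and trace downstream implications.
    return f"re-read sections for: {finding['claim']}"

ROUTES: dict[str, Callable[[dict], str]] = {
    "code_behavior": investigate_code_behavior,
    "library_api": investigate_library_claim,
    "best_practice": investigate_best_practice,
    "coherence": investigate_coherence,
}

def route(finding: dict) -> str:
    return ROUTES[finding["claim_type"]](finding)
```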
When sources conflict, trust in this order:
| Priority | Source | Trust level |
|---|---|---|
| 1st | Source code (actual implementation, types, interfaces) | Highest for understanding current behavior. Not authority for whether that behavior is correct — use official docs and tooling for correctness judgments. |
| 2nd | Official documentation (maintainer docs, API reference) | High — may be aspirational or outdated |
| 3rd | High-trust third parties (company eng blogs, conference talks, reputable publications) | High — cross-referencing rarely needed |
| 4th | Community sources (Stack Overflow, Reddit, dev.to, Medium) | Medium — verify claims |
| 5th | General sources (tutorials, AI-generated content, undated posts) | Low — always cross-reference |
When searching, formulate specific queries: {tool} {version} {specific claim} {year}. Avoid vague queries ("is X bad") or opinion-seeking queries ("should I use X"). Pin versions from the actual repo, not from training data assumptions.
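As a sketch of version pinning, the query template can be filled from the repo's own manifest rather than from assumptions. The `package.json` layout below assumes a JS project; lockfile parsing is omitted for brevity:

```python
import json
from pathlib import Path

def pinned_query(tool: str, claim: str, year: int, repo: Path) -> str:
    """Build a version-pinned search query: {tool} {version} {claim} {year}.

    The version comes from the repo's package.json, never from
    training-data assumptions.
    """
    pkg = json.loads((repo / "package.json").read_text())
    deps = {**pkg.get("dependencies", {}), **pkg.get("devDependencies", {})}
    version = deps.get(tool, "").lstrip("^~")  # strip semver range markers
    # Collapse the double space left behind if no version was found.
    return f"{tool} {version} {claim} {year}".replace("  ", " ").strip()
```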
For load-bearing findings (findings that would change what gets built or how), the goal is not merely "is this claim true/false?" but "what is the full picture?" Understand the implications for the current approach, the pros and cons of the finding's implied recommendation, and whether the finding changes what should happen. A finding can be factually true but its implied recommendation wrong — surface that distinction.
Investigate until you reach HIGH confidence — no more, no less. Some findings reach HIGH confidence immediately (you recognize the pattern, the tooling already enforces it, the docs are clear). Others require deep investigation. The reviewer's severity label does not determine your investigation depth — your confidence does.
The mandatory minimums (codebase inspection + web search) apply to every finding regardless. Beyond those, scale depth to what your confidence requires:
For findings that are purely editorial (wording, formatting) where correctness is not in question, the mandatory checks may be sufficient. But "the reviewer called it Minor" is never a reason to reduce investigation — a finding labeled Minor may have correctness implications that only surface under investigation.
If no web search tool is available, cap all confidence at MEDIUM for any claim that depends on external knowledge (library behavior, framework conventions, deprecation status, best practices). Flag the limitation to the calling workflow.
Litmus test: Could you explain your reasoning to a senior engineer and defend it with evidence? "I think" or "it seems" has not met the standard. "I verified via web search that X" or "I traced the code path and confirmed Y" has.
Name your confidence before deciding. Do not proceed to a decision at LOW confidence.
| Level | Meaning | Action |
|---|---|---|
| HIGH | Specific evidence — code lines, docs, test results, web search results | Proceed to evaluate |
| MEDIUM | Plausible but depends on unverified assumptions, or a reasonable alternative interpretation exists | Investigate further if stakes warrant. If proceeding, flag the uncertainty. |
| LOW | Cannot cite specific evidence | Do not decide. Investigate more. |
Confidence ceilings: Claims about library behavior, API semantics, deprecation status, or best practices — whether from a finding, your own knowledge, or any other source — cannot reach HIGH without external verification (web search, official docs, or codebase inspection). Training-data knowledge is not verification.
| Claim category | Confidence ceiling without external verification |
|---|---|
| Library API claims | MEDIUM max |
| Framework directives | MEDIUM max |
| Deprecation claims | MEDIUM max |
| Version-specific behavior | LOW (unless version confirmed from repo) |
| Security advisories | LOW (may be patched) |
| Best practice assertions | MEDIUM max |
When uncertain between two confidence levels, prefer the lower level. Overconfident classifications (HIGH on wrong assessments) cause more downstream damage than underconfident ones (MEDIUM on correct assessments).
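One way to sketch the ceiling rule, with illustrative category names (the mapping mirrors the table above; nothing here is a prescribed schema):

```python
from enum import IntEnum

class Confidence(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

# Ceilings that apply without external verification (web search,
# official docs, or codebase inspection). Category names are illustrative.
CEILINGS = {
    "library_api": Confidence.MEDIUM,
    "framework_directive": Confidence.MEDIUM,
    "deprecation": Confidence.MEDIUM,
    "version_specific": Confidence.LOW,
    "security_advisory": Confidence.LOW,
    "best_practice": Confidence.MEDIUM,
}

def effective_confidence(category: str, assessed: Confidence,
                         externally_verified: bool) -> Confidence:
    """Clamp assessed confidence to the category ceiling unless verified."""
    if externally_verified:
        return assessed
    ceiling = CEILINGS.get(category, Confidence.HIGH)
    return min(assessed, ceiling)  # when in doubt, the lower level wins
```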
For each finding, assess against relevant dimensions:
| Dimension | Question |
|---|---|
| Validity | Is the premise correct? Does the issue the finding identifies actually exist in this work? |
| Correctness | Is the proposed fix actually correct, or does it introduce new problems? |
| Applicability | Does this apply in this specific context, or is it generic advice that doesn't fit here? |
| Relevancy | Is this relevant to the scope and intent of the current changes? |
| Tradeoffs | Does applying this improve one thing but degrade another? |
| Side effects | What else changes if you apply this? Other code paths, tests, consumers? |
Not every dimension applies to every finding. A simple naming suggestion doesn't need a tradeoff analysis. An architectural change does.
For non-trivial findings, also evaluate:
| Dimension | What to weigh |
|---|---|
| Necessity | Does this solve a validated problem, or a hypothetical one? |
| Proportionality | Does the complexity of the fix match the complexity of the problem? |
| Evidence | What concrete evidence supports the finding over the current approach? |
| Reversibility | Can this be changed later, or is it a one-way door? |
| Codebase conventions | What do established patterns in this codebase suggest? |
After investigation, classify and act:
| Classification | Action |
|---|---|
| Valid bug or correctness issue (confirmed by evidence) | Fix. High priority. |
| Valid improvement (correct, tradeoffs favorable) | Implement. A correct improvement is worth implementing regardless of size — "it's a nit" is not a reason to skip a correct fix. |
| Valid — escalate (correct, but implicates a judgment call beyond current scope: product vision, risk tolerance, 1-way door) | Surface with evidence showing the issue exists and why it requires human judgment. Do not resolve autonomously. |
| Pre-existing issue unrelated to current changes (genuinely out of scope) | Decline for this scope. Note for tracking. Explain why it's out of scope. |
| Incorrect or based on misunderstanding (you have evidence) | Decline. Explain why with evidence. |
| Technically valid but tradeoffs unfavorable | Decline. Explain the tradeoffs and why. |
| Style/preference with no correctness impact | Evaluate against the ground truth chain: (1) Project tooling — if the project's linter/formatter/type-checker enforces a practice, that's the answer; run the tool, don't debate it. (2) Framework/language official docs — version-pinned from the project's lockfile. (3) Codebase consistency — tiebreaker only when tooling and docs are silent. Codebase patterns are evidence of what exists, not evidence of what's correct — especially in codebases with AI-generated code. |
| Contradicts the spec or design intent | Decline. Reference the spec. |
| Uncertain after investigation | Do not guess. Investigate further or escalate. |
The escalation boundary: Resolve autonomously when evidence alone determines the answer. Escalate when the finding's resolution requires choosing between competing values — speed vs safety, scope vs timeline, convention vs correctness, short-term pragmatism vs long-term flexibility. These are judgment calls that require human priorities, not evidence questions. Also escalate for 1-way doors (public API shape, schema changes, security boundary decisions) where the cost of getting it wrong is high and reversal is expensive.
The bar for declining: Your investigation must show the finding is factually incorrect, technically harmful, or genuinely out of scope. Never decline because the fix is tedious, the effort is high, or the reviewer labeled it Minor; none of those speak to the finding's merit.
Consuming skills that need simple routing can use this grouping:
| Routing category | Maps to classifications above |
|---|---|
| Act | Valid bug, Valid improvement, Style/preference (when codebase supports) |
| Escalate | Valid — escalate, Uncertain |
| Decline | Pre-existing/out-of-scope, Incorrect, Tradeoffs unfavorable, Contradicts spec |
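The routing grouping can be sketched as a lookup. The classification keys are illustrative names for the rows above, not a required vocabulary:

```python
# Map fine-grained classifications to the simple routing buckets.
ROUTING = {
    "valid_bug": "act",
    "valid_improvement": "act",
    "style_preference_supported": "act",
    "valid_escalate": "escalate",
    "uncertain": "escalate",
    "pre_existing_out_of_scope": "decline",
    "incorrect": "decline",
    "tradeoffs_unfavorable": "decline",
    "contradicts_spec": "decline",
}

def route_finding(classification: str) -> str:
    try:
        return ROUTING[classification]
    except KeyError:
        # An unmapped classification means assessment is incomplete:
        # fail loudly rather than silently acting or declining.
        raise ValueError(f"unclassified finding: {classification}")
```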
If the classification is clear and the action is obvious, proceed to Phase 6: Communicate. If the right response isn't immediately clear or the classification raises follow-on questions, use Phase 5: Resolve first.
When a finding requires action beyond simple classification — the right response isn't immediately clear, or the classification raises follow-on questions — use graduated escalation.
Exhaust cheaper levels before escalating:
| Level | When | Action |
|---|---|---|
| Level 1: Re-examine existing evidence | Default first step | Re-read the source material for both sides. Check if synthesis misrepresented the evidence, or if the evidence itself is the issue. Many findings resolve here. |
| Level 2: Narrow verification | Existing evidence insufficient, specific factual question | Quick, pointed research. The question comes directly from the finding: "Is X deployed?" "Does the API require auth for this endpoint?" |
| Level 3: Deeper exploration | Finding stems from an angle the original work didn't cover | Substantive new research, still scoped to the finding area. If this reveals a much larger gap, flag it rather than expanding scope. |
Resolution types (how the resolution manifests):
| Type | When to use |
|---|---|
| Sharpen | Evidence supports a clearer claim than what's written |
| Add conditions | The truth is conditional; the finding flattened it |
| Recalibrate | Prose certainty doesn't match evidence strength |
| Acknowledge ambiguity | Evidence is genuinely unclear; do not force a resolution the evidence doesn't support |
| Re-research | Existing evidence is insufficient to resolve the finding |
Match your response's certainty to your actual evidence level:
When accepting: brief acknowledgment of what you're fixing and why the finding was correct.
When declining: specific evidence — code references, spec sections, tradeoff analysis, web search results, or documentation. Be direct. Do not apologize for disagreeing; explain why.
When partially accepting: explain what you're taking and what you're not, and why.
Do not commit to work outside your current scope. "We'll address this in a future iteration" or "this can be revisited later" is never an acceptable resolution — you have no authority to guarantee future work will happen.
If you cannot address a finding now, decline it with evidence-backed reasoning — explain why it's incorrect, out of scope, or harmful. A clear decline with rationale is honest. A deferral is a promise you can't keep.
(Within a bounded convergence loop, deferring to the next pass in the same loop is acceptable — the loop enforces it. Deferring to unspecified future work outside the loop is not.)
When a finding you previously assessed is re-raised after your response — whether by a human pushing back or an AI re-flagging — the other party has considered your reasoning and disagreed.
Do not repeat the same rationale. Either present new evidence that your original response did not contain, or re-open the investigation and genuinely reconsider in light of the counter-argument.
Repeating the same rationale a second time is not productive; it was already considered and rejected.
If your original evidence remains stronger than the new counter-argument, maintain your position. Persistence alone is not evidence — do not lower confidence because a finding is re-raised without new substantive reasoning.
For each finding you are declining, pressure-test your reasoning: is the decline backed by evidence, or by convenience?
For your overall assessment, check for the two bias patterns below:
When evaluating AI-generated findings specifically, over-indexing (accepting invalid feedback) is the more common and more damaging failure mode when you are the first evaluator of raw AI feedback. Weight your self-check toward rejection resistance in that context.
However, when you are a fixer responding to a structured, adjudicated review (e.g., a multi-reviewer dispatch that has already classified findings by severity), both failure modes are equally likely — the review has already filtered noise. Weight your self-check evenly. Under-indexing (skipping valid improvements because "it's just a Minor") is as damaging as over-indexing in this context.
(This asymmetry is also less pronounced when assessing human reviewer feedback, where under-indexing is equally risky.)
When assessing findings from cold readers (subagents, auditors, reviewers who didn't participate in the decision process), recalibrate severity — cold readers systematically over-severity because they lack context for why trade-offs were made. But do not discount validity: cold readers also have a fresh-perspective advantage — they catch patterns that familiarity blindness obscures. The severity may be overstated, but the issue may be real.
Over-indexing signals: Applying every suggestion without evaluating tradeoffs. Refactoring adjacent code that wasn't flagged. Adding defensive checks for impossible scenarios.
Under-indexing signals: Dismissing findings without investigating. Accepting "it works" as sufficient evidence. Declining because fixing requires effort rather than because the finding lacks merit.
When evaluating multiple findings, assess every finding before implementing any. Assessment rigor is flat — every finding gets the same investigation regardless of the reviewer's severity label. Group related findings (same code area, same theme) to catch cross-finding interactions.
After all findings are assessed and classified, implement in severity order (Critical first). Separating assessment from implementation ensures every finding gets properly evaluated — severity determines fix order, not investigation depth.
When evaluating multiple findings about the same subject, check whether earlier assessments change context for later ones — a decline on finding #2 may affect the validity of finding #5.
If two or more findings describe the same underlying issue in different terms, merge them into a single finding before evaluating — assess once, not repeatedly.
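The multi-finding workflow above can be sketched as a two-phase loop. The `assess` and `implement` callables and the severity labels are placeholders for the agent's actual behavior:

```python
SEVERITY_ORDER = {"critical": 0, "major": 1, "minor": 2}

def process_findings(findings: list[dict], assess, implement) -> list[dict]:
    """Assess every finding before implementing any.

    Assessment rigor is flat: severity never changes investigation depth.
    Severity determines fix order only.
    """
    # Phase 1: assess all findings with the same rigor.
    assessed = [dict(f, classification=assess(f)) for f in findings]
    # Phase 2: implement accepted findings in severity order.
    accepted = [f for f in assessed if f["classification"] == "act"]
    for f in sorted(accepted, key=lambda f: SEVERITY_ORDER[f["severity"]]):
        implement(f)
    return assessed
```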
When evaluating AI-generated code changes specifically, load references/ai-false-positive-patterns.md for 8 common patterns that AI coding agents systematically produce. These patterns help distinguish genuine improvements from AI over-engineering.