Evidence-based evaluation of AI-generated findings — review feedback, audit findings, spec challenges, or any AI-generated assessment that requires triage before action. Teaches agents to investigate before accepting, calibrate confidence based on evidence quality, classify findings into actionable categories, and detect assessment bias. Load this skill when evaluating findings from reviewers, auditors, challengers, or any AI-generated feedback loop.
From `shared`: install with `npx claudepluginhub inkeep/team-skills --plugin shared`. This skill uses the workspace's default tool permissions.
references/ai-false-positive-patterns.md
How to assess findings — whether from an AI reviewer, an auditor, a challenger, or any other source of AI-generated feedback. This protocol applies to every context where an agent must decide which findings are correct, which are noise, and what to do about each.
Findings are hypotheses about the work. Some will be correct and important. Some will be wrong, inapplicable, or based on incomplete understanding. Your job is to determine which is which, with evidence.
Do not default to acceptance. The path of least resistance (apply every suggestion) produces worse code than thoughtful evaluation. Equally, do not default to rejection — that wastes valid insights.
Assess each finding by merit. The question is not "is this a big issue?" or "is this convenient to fix?" The question is: is the finding correct? A correct improvement is worth implementing regardless of size. An incorrect suggestion should be declined regardless of who raised it.
For each finding, build enough context to make a high-confidence judgment. Do not assess from the finding description alone.
Your investigation stance is adversarial: attempt to disprove each finding before accepting it. If the finding survives your best effort to disprove it with evidence, it is worth acting on. If you can disprove it, discard it. This counteracts the dominant failure mode — sycophantic acceptance of AI-generated feedback that degrades correct work.
Mandatory for every finding: codebase inspection and web search (the mandatory minimums).
As needed: deeper, claim-specific investigation, routed by what the finding claims.
Route your investigation based on what the finding claims:
| Claim type | Investigation approach |
|---|---|
| Code behavior ("this function doesn't handle X") | Trace the relevant code paths — grep, read, explore. Verify against actual implementation, not description. |
| Library/API/framework ("this API was deprecated," "method X doesn't exist") | Check ~/.claude/oss-repos/ and node_modules/ for local dependency source first — actual types, interfaces, and implementations are higher-trust than docs and version-matched to the project. Then web search with version pinning (confirm version from repo's package.json/lockfile). |
| Ecosystem/best practice ("the recommended approach is X") | Check the project's own tooling first — linter configs, type-checker settings, formatter rules. The project already declared its practices by configuring its tools. Then check framework/language official docs (version-pinned). Then web search with recency for community consensus. |
| Internal coherence ("Section 3 contradicts Section 7") | Re-read the specific sections referenced. Trace downstream implications of the contradiction. |
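The routing above can be sketched as a dispatch table. The claim-type keys and handler behavior here are illustrative, not a prescribed API; a real agent would replace the string results with actual tool calls (grep, file reads, web search):

```python
from typing import Callable

# Illustrative sketch: route a finding to an investigation strategy
# based on the kind of claim it makes. Handler names are hypothetical.

def investigate_code_behavior(finding: dict) -> str:
    # Trace the relevant code paths (grep/read); verify against the
    # actual implementation, not the finding's description.
    return f"trace code paths for: {finding['claim']}"

def investigate_library_claim(finding: dict) -> str:
    # Check local dependency source first, then version-pinned web search.
    return f"check node_modules + pinned search for: {finding['claim']}"

def investigate_best_practice(finding: dict) -> str:
    # Project tooling first, then official docs, then community consensus.
    return f"check linter/formatter config for: {finding['claim']}"

def investigate_coherence(finding: dict) -> str:
    # Re-read the referenced sections and trace downstream implications.
    return f"re-read sections for: {finding['claim']}"

ROUTES: dict[str, Callable[[dict], str]] = {
    "code_behavior": investigate_code_behavior,
    "library_api": investigate_library_claim,
    "best_practice": investigate_best_practice,
    "coherence": investigate_coherence,
}

def route(finding: dict) -> str:
    return ROUTES[finding["claim_type"]](finding)
```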
When sources conflict, trust in this order:
| Priority | Source | Trust level |
|---|---|---|
| 1st | Source code (actual implementation, types, interfaces) | Highest for understanding current behavior. Not authority for whether that behavior is correct — use official docs and tooling for correctness judgments. |
| 2nd | Official documentation (maintainer docs, API reference) | High — may be aspirational or outdated |
| 3rd | High-trust third parties (company eng blogs, conference talks, reputable publications) | High — cross-referencing rarely needed |
| 4th | Community sources (Stack Overflow, Reddit, dev.to, Medium) | Medium — verify claims |
| 5th | General sources (tutorials, AI-generated content, undated posts) | Low — always cross-reference |
When searching, formulate specific queries: {tool} {version} {specific claim} {year}. Avoid vague queries ("is X bad") or opinion-seeking queries ("should I use X"). Pin versions from the actual repo, not from training data assumptions.
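As a sketch of version pinning, the query template can be filled from the repo's own manifest rather than from assumptions. The `package.json` layout below assumes a JS project; lockfile parsing is omitted for brevity:

```python
import json
from pathlib import Path

def pinned_query(tool: str, claim: str, year: int, repo: Path) -> str:
    """Build a version-pinned search query: {tool} {version} {claim} {year}.

    The version comes from the repo's package.json, never from
    training-data assumptions.
    """
    pkg = json.loads((repo / "package.json").read_text())
    deps = {**pkg.get("dependencies", {}), **pkg.get("devDependencies", {})}
    version = deps.get(tool, "").lstrip("^~")  # strip semver range markers
    # Collapse the double space left behind if no version was found.
    return f"{tool} {version} {claim} {year}".replace("  ", " ").strip()
```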
For load-bearing findings (findings that would change what gets built or how), the goal is not merely "is this claim true/false?" but "what is the full picture?" Understand the implications for the current approach, the pros and cons of the finding's implied recommendation, and whether the finding changes what should happen. A finding can be factually true but its implied recommendation wrong — surface that distinction.
Investigate until you reach HIGH confidence — no more, no less. Some findings reach HIGH confidence immediately (you recognize the pattern, the tooling already enforces it, the docs are clear). Others require deep investigation. The reviewer's severity label does not determine your investigation depth — your confidence does.
The mandatory minimums (codebase inspection + web search) apply to every finding regardless. Beyond those, scale depth to what your confidence requires:
For findings that are purely editorial (wording, formatting) where correctness is not in question, the mandatory checks may be sufficient. But "the reviewer called it Minor" is never a reason to reduce investigation — a finding labeled Minor may have correctness implications that only surface under investigation.
If no web search tool is available, cap all confidence at MEDIUM for any claim that depends on external knowledge (library behavior, framework conventions, deprecation status, best practices). Flag the limitation to the calling workflow.
Litmus test: Could you explain your reasoning to a senior engineer and defend it with evidence? "I think" or "it seems" has not met the standard. "I verified via web search that X" or "I traced the code path and confirmed Y" has.
Name your confidence before deciding. Do not proceed to a decision at LOW confidence.
| Level | Meaning | Action |
|---|---|---|
| HIGH | Specific evidence — code lines, docs, test results, web search results | Proceed to evaluate |
| MEDIUM | Plausible but depends on unverified assumptions, or a reasonable alternative interpretation exists | Investigate further if stakes warrant. If proceeding, flag the uncertainty. |
| LOW | Cannot cite specific evidence | Do not decide. Investigate more. |
Confidence ceilings: Claims about library behavior, API semantics, deprecation status, or best practices — whether from a finding, your own knowledge, or any other source — cannot reach HIGH without external verification (web search, official docs, or codebase inspection). Training-data knowledge is not verification.
| Claim category | Confidence ceiling without external verification |
|---|---|
| Library API claims | MEDIUM max |
| Framework directives | MEDIUM max |
| Deprecation claims | MEDIUM max |
| Version-specific behavior | LOW (unless version confirmed from repo) |
| Security advisories | LOW (may be patched) |
| Best practice assertions | MEDIUM max |
When uncertain between two confidence levels, prefer the lower level. Overconfident classifications (HIGH on wrong assessments) cause more downstream damage than underconfident ones (MEDIUM on correct assessments).
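One way to sketch the ceiling rule, with illustrative category names (the mapping mirrors the table above; nothing here is a prescribed schema):

```python
from enum import IntEnum

class Confidence(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

# Ceilings that apply without external verification (web search,
# official docs, or codebase inspection). Category names are illustrative.
CEILINGS = {
    "library_api": Confidence.MEDIUM,
    "framework_directive": Confidence.MEDIUM,
    "deprecation": Confidence.MEDIUM,
    "version_specific": Confidence.LOW,
    "security_advisory": Confidence.LOW,
    "best_practice": Confidence.MEDIUM,
}

def effective_confidence(category: str, assessed: Confidence,
                         externally_verified: bool) -> Confidence:
    """Clamp assessed confidence to the category ceiling unless verified."""
    if externally_verified:
        return assessed
    ceiling = CEILINGS.get(category, Confidence.HIGH)
    return min(assessed, ceiling)  # when in doubt, the lower level wins
```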
For each finding, assess against relevant dimensions:
| Dimension | Question |
|---|---|
| Validity | Is the premise correct? Does the issue the finding identifies actually exist in this work? |
| Correctness | Is the proposed fix actually correct, or does it introduce new problems? |
| Applicability | Does this apply in this specific context, or is it generic advice that doesn't fit here? |
| Relevancy | Is this relevant to the scope and intent of the current changes? |
| Tradeoffs | Does applying this improve one thing but degrade another? |
| Side effects | What else changes if you apply this? Other code paths, tests, consumers? |
Not every dimension applies to every finding. A simple naming suggestion doesn't need a tradeoff analysis. An architectural change does.
For non-trivial findings, also evaluate:
| Dimension | What to weigh |
|---|---|
| Necessity | Does this solve a validated problem, or a hypothetical one? |
| Proportionality | Does the complexity of the fix match the complexity of the problem? |
| Evidence | What concrete evidence supports the finding over the current approach? |
| Reversibility | Can this be changed later, or is it a one-way door? |
| Codebase conventions | What do established patterns in this codebase suggest? |
After investigation, classify and act:
| Classification | Action |
|---|---|
| Valid bug or correctness issue (confirmed by evidence) | Fix. High priority. |
| Valid improvement (correct, tradeoffs favorable) | Implement. A correct improvement is worth implementing regardless of size — "it's a nit" is not a reason to skip a correct fix. |
| Valid — escalate (correct, but implicates a judgment call beyond current scope: product vision, risk tolerance, 1-way door) | Surface with evidence showing the issue exists and why it requires human judgment. Do not resolve autonomously. |
| Pre-existing issue unrelated to current changes (genuinely out of scope) | Decline for this scope. Note for tracking. Explain why it's out of scope. |
| Incorrect or based on misunderstanding (you have evidence) | Decline. Explain why with evidence. |
| Technically valid but tradeoffs unfavorable | Decline. Explain the tradeoffs and why. |
| Style/preference with no correctness impact | Evaluate against the ground truth chain: (1) Project tooling — if the project's linter/formatter/type-checker enforces a practice, that's the answer; run the tool, don't debate it. (2) Framework/language official docs — version-pinned from the project's lockfile. (3) Codebase consistency — tiebreaker only when tooling and docs are silent. Codebase patterns are evidence of what exists, not evidence of what's correct — especially in codebases with AI-generated code. |
| Contradicts the spec or design intent | Decline. Reference the spec. |
| Uncertain after investigation | Do not guess. Investigate further or escalate. |
The escalation boundary: Resolve autonomously when evidence alone determines the answer. Escalate when the finding's resolution requires choosing between competing values — speed vs safety, scope vs timeline, convention vs correctness, short-term pragmatism vs long-term flexibility. These are judgment calls that require human priorities, not evidence questions. Also escalate for 1-way doors (public API shape, schema changes, security boundary decisions) where the cost of getting it wrong is high and reversal is expensive.
The bar for declining: Your investigation must show the finding is factually incorrect, technically harmful, or genuinely out of scope. Never decline because the fix is tedious, the effort is high, or the reviewer labeled it Minor; none of those speak to the finding's merit.
Consuming skills that need simple routing can use this grouping:
| Routing category | Maps to classifications above |
|---|---|
| Act | Valid bug, Valid improvement, Style/preference (when codebase supports) |
| Escalate | Valid — escalate, Uncertain |
| Decline | Pre-existing/out-of-scope, Incorrect, Tradeoffs unfavorable, Contradicts spec |
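The routing grouping can be sketched as a lookup. The classification keys are illustrative names for the rows above, not a required vocabulary:

```python
# Map fine-grained classifications to the simple routing buckets.
ROUTING = {
    "valid_bug": "act",
    "valid_improvement": "act",
    "style_preference_supported": "act",
    "valid_escalate": "escalate",
    "uncertain": "escalate",
    "pre_existing_out_of_scope": "decline",
    "incorrect": "decline",
    "tradeoffs_unfavorable": "decline",
    "contradicts_spec": "decline",
}

def route_finding(classification: str) -> str:
    try:
        return ROUTING[classification]
    except KeyError:
        # An unmapped classification means assessment is incomplete:
        # fail loudly rather than silently acting or declining.
        raise ValueError(f"unclassified finding: {classification}")
```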
If the classification is clear and the action is obvious, proceed to Phase 6: Communicate. If the right response isn't immediately clear or the classification raises follow-on questions, use Phase 5: Resolve first.
When a finding requires action beyond simple classification — the right response isn't immediately clear, or the classification raises follow-on questions — use graduated escalation.
Exhaust cheaper levels before escalating:
| Level | When | Action |
|---|---|---|
| Level 1: Re-examine existing evidence | Default first step | Re-read the source material for both sides. Check if synthesis misrepresented the evidence, or if the evidence itself is the issue. Many findings resolve here. |
| Level 2: Narrow verification | Existing evidence insufficient, specific factual question | Quick, pointed research. The question comes directly from the finding: "Is X deployed?" "Does the API require auth for this endpoint?" |
| Level 3: Deeper exploration | Finding stems from an angle the original work didn't cover | Substantive new research, still scoped to the finding area. If this reveals a much larger gap, flag it rather than expanding scope. |
Resolution types (how the resolution manifests):
| Type | When to use |
|---|---|
| Sharpen | Evidence supports a clearer claim than what's written |
| Add conditions | The truth is conditional; the finding flattened it |
| Recalibrate | Prose certainty doesn't match evidence strength |
| Acknowledge ambiguity | Evidence is genuinely unclear; do not force a resolution the evidence doesn't support |
| Re-research | Existing evidence is insufficient to resolve the finding |
Match your response's certainty to your actual evidence level:
When accepting: brief acknowledgment of what you're fixing and why the finding was correct.
When declining: specific evidence — code references, spec sections, tradeoff analysis, web search results, or documentation. Be direct. Do not apologize for disagreeing; explain why.
When partially accepting: explain what you're taking and what you're not, and why.
Do not commit to work outside your current scope. "We'll address this in a future iteration" or "this can be revisited later" is never an acceptable resolution — you have no authority to guarantee future work will happen.
If you cannot address a finding now, decline it with evidence-backed reasoning — explain why it's incorrect, out of scope, or harmful. A clear decline with rationale is honest. A deferral is a promise you can't keep.
(Within a bounded convergence loop, deferring to the next pass in the same loop is acceptable — the loop enforces it. Deferring to unspecified future work outside the loop is not.)
When a finding you previously assessed is re-raised after your response — whether by a human pushing back or an AI re-flagging — the other party has considered your reasoning and disagreed.
Do not repeat the same rationale. Either present new evidence that your original response did not contain, or re-open the investigation and genuinely reconsider in light of the counter-argument.
Repeating the same rationale a second time is not productive; it was already considered and rejected.
If your original evidence remains stronger than the new counter-argument, maintain your position. Persistence alone is not evidence — do not lower confidence because a finding is re-raised without new substantive reasoning.
For each finding you are declining, pressure-test your reasoning: is the decline backed by evidence, or by convenience?
For your overall assessment, check for the two bias patterns below:
When evaluating AI-generated findings specifically, over-indexing (accepting invalid feedback) is the more common and more damaging failure mode when you are the first evaluator of raw AI feedback. Weight your self-check toward rejection resistance in that context.
However, when you are a fixer responding to a structured, adjudicated review (e.g., a multi-reviewer dispatch that has already classified findings by severity), both failure modes are equally likely — the review has already filtered noise. Weight your self-check evenly. Under-indexing (skipping valid improvements because "it's just a Minor") is as damaging as over-indexing in this context.
(This asymmetry is also less pronounced when assessing human reviewer feedback, where under-indexing is equally risky.)
When assessing findings from cold readers (subagents, auditors, reviewers who didn't participate in the decision process), recalibrate severity — cold readers systematically over-severity because they lack context for why trade-offs were made. But do not discount validity: cold readers also have a fresh-perspective advantage — they catch patterns that familiarity blindness obscures. The severity may be overstated, but the issue may be real.
Over-indexing signals: Applying every suggestion without evaluating tradeoffs. Refactoring adjacent code that wasn't flagged. Adding defensive checks for impossible scenarios.
Under-indexing signals: Dismissing findings without investigating. Accepting "it works" as sufficient evidence. Declining because fixing requires effort rather than because the finding lacks merit.
When evaluating multiple findings, assess every finding before implementing any. Assessment rigor is flat — every finding gets the same investigation regardless of the reviewer's severity label. Group related findings (same code area, same theme) to catch cross-finding interactions.
After all findings are assessed and classified, implement in severity order (Critical first). Separating assessment from implementation ensures every finding gets properly evaluated — severity determines fix order, not investigation depth.
When evaluating multiple findings about the same subject, check whether earlier assessments change context for later ones — a decline on finding #2 may affect the validity of finding #5.
If two or more findings describe the same underlying issue in different terms, merge them into a single finding before evaluating — assess once, not repeatedly.
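The multi-finding workflow above can be sketched as a two-phase loop. The `assess` and `implement` callables and the severity labels are placeholders for the agent's actual behavior:

```python
SEVERITY_ORDER = {"critical": 0, "major": 1, "minor": 2}

def process_findings(findings: list[dict], assess, implement) -> list[dict]:
    """Assess every finding before implementing any.

    Assessment rigor is flat: severity never changes investigation depth.
    Severity determines fix order only.
    """
    # Phase 1: assess all findings with the same rigor.
    assessed = [dict(f, classification=assess(f)) for f in findings]
    # Phase 2: implement accepted findings in severity order.
    accepted = [f for f in assessed if f["classification"] == "act"]
    for f in sorted(accepted, key=lambda f: SEVERITY_ORDER[f["severity"]]):
        implement(f)
    return assessed
```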
When evaluating AI-generated code changes specifically, load references/ai-false-positive-patterns.md for 8 common patterns that AI coding agents systematically produce. These patterns help distinguish genuine improvements from AI over-engineering.