From great_cto
Reusable 3-round self-challenge + arbiter pattern for filtering false positives from findings/verdicts. Use when the cost of a false-positive gate block exceeds the cost of ~4 extra LLM turns.
```bash
npx claudepluginhub avelikiy/great_cto
```
Filter false positives from multi-angle review, security audit, QA regression flags, or any high-stakes judgment before it turns into a blocker.
Three rounds of skeptical self-review + an impartial arbiter, with a confidence score from the vote.
| Caller | Finding type | Apply triage? |
|---|---|---|
| /review | Angle 2/4/7/9 P0/P1 (security, SQL, privacy, concurrency) | Yes |
| /review --deep | Any angle P0/P1 | Yes |
| security-officer | CSO audit P0/P1 | Yes |
| security-officer | Secret in source/git, confirmed CVE | No — hard finding |
| qa-engineer | Flaky-test verdict (is this a regression or a flake?) | Yes |
| architect | ADR trade-off dispute (option A vs. B when both look reasonable) | Yes |
| Any | P2/advisory | No |
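As a rough pre-filter sketch (not part of the skill), the table's rule could be applied with jq before triage runs; the `findings.jsonl` path and the `severity` field are hypothetical stand-ins for whatever shape the caller actually emits:

```bash
# Hypothetical pre-filter: only P0/P1 findings go to triage; P2/advisory never does.
# (Hard findings like a secret in git or a confirmed CVE bypass triage entirely;
#  that exception is not modeled here.)
jq -c 'select(.severity == "P0" or .severity == "P1")' findings.jsonl
```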
Run the three rounds sequentially. Each round sees the prior rounds' reasoning; the arbiter sees all three.
**Round 1.** Question: is the premise true?
Output: `{round: 1, verdict: VALID|INVALID|UNCERTAIN, reasoning: "...", crux: "single key fact"}`
**Round 2.** Question: are the claimed defenses real and sufficient?
For every claimed defense, grep to find its actual implementation line. `MAX_BUF_SIZE` is not a verified bound — `#define MAX_BUF_SIZE 64` is. If you cannot point to the line that enforces the defense, it does not exist.
Output: same JSON shape, with `grep_used: true/false`.
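For example, a Round 2 check of the `MAX_BUF_SIZE` claim above might look like this (the search paths are illustrative):

```bash
# Find the line that actually declares the claimed bound; no hit means no defense.
grep -rn "MAX_BUF_SIZE" src/ include/
```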
**Round 3.** Question: what did Rounds 1-2 not consider?
Output: same JSON shape.
**Arbiter.** Input: all 3 rounds + original finding/question + source code.
Question: final call — which side has the stronger evidence?
The arbiter returns `verdict: VALID|INVALID` (no UNCERTAIN — make the call) and `crux`, the key fact the verdict turns on. Output:
```json
{
  "verdict": "VALID",
  "crux": "memcpy at auth.c:142 copies network-controlled len bytes into 64-byte stack buffer with no bound check",
  "reasoning": "Rounds 1 and 3 verified attacker reach; Round 2 found no size check in 50 LOC radius; arbiter confirms no caller clamps len."
}
```
Burn this into every round's prompt: a claimed bound only counts when it traces back to its `#define` / const declaration.

Confidence comes from the round vote: `confidence = valid_rounds_before_arbiter / 3`

- 100% (VVV) — 3/3 rounds VALID. Arbiter rubber-stamps unless it finds something brand-new.
- 67% (VVI, VIV, or IVV) — majority VALID. Arbiter breaks the tie with new evidence.
- 33% (IIV, IVI, or VII) — majority INVALID. Arbiter usually confirms INVALID.
- 0% (III) — 3/3 INVALID. Arbiter rarely overrides.

The arbiter can override the vote on the final verdict; confidence reflects the round vote for transparency. Record both in the output so humans can see where the arbiter diverged.
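A minimal way to compute that score, assuming the three round outputs are collected into a JSON array (the `rounds.json` filename is illustrative):

```bash
# valid_rounds_before_arbiter / 3; e.g. two VALID rounds out of three gives ~0.67
jq '([.[] | select(.verdict == "VALID")] | length) / 3' rounds.json
```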
Once the arbiter returns:
| Arbiter verdict | Confidence | Severity action |
|---|---|---|
| VALID | ≥ 50% | Keep original severity |
| VALID | < 50% | Demote: P0→P1, P1→P2 |
| INVALID | any | Remove from gate tally, record as [FILTERED] in report for audit |
| UNCERTAIN (only if arbiter could not decide) | n/a | Keep original severity, flag for manual CTO review |
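The same rule as a hedged jq sketch over triage-log entries (the log shape is defined just below); writing `[FILTERED]` into `final_severity` is one possible encoding of the INVALID row, not something the skill prescribes:

```bash
# Re-derive final_severity per the table above; UNCERTAIN falls through to "keep original".
jq 'if .arbiter.verdict == "INVALID" then .final_severity = "[FILTERED]"
    elif .confidence >= 0.5 then .final_severity = .original_severity
    elif .original_severity == "P0" then .final_severity = "P1"
    elif .original_severity == "P1" then .final_severity = "P2"
    else .final_severity = .original_severity
    end' .great_cto/triage-log.jsonl
```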
Every caller logs triage results to .great_cto/triage-log.jsonl (append-only, one JSON per line):
```json
{
  "timestamp": "2026-04-19T12:34:56Z",
  "caller": "review|security-officer|qa-engineer|architect",
  "finding_id": "SEC-042",
  "file": "src/auth.c:142",
  "original_severity": "P0",
  "rounds": [
    {"round": 1, "verdict": "VALID", "crux": "..."},
    {"round": 2, "verdict": "VALID", "crux": "...", "grep_used": true},
    {"round": 3, "verdict": "INVALID", "crux": "..."}
  ],
  "arbiter": {"verdict": "VALID", "crux": "..."},
  "confidence": 0.67,
  "final_severity": "P0"
}
```
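One way to append an entry while keeping the file valid JSONL (a sketch; `$ENTRY_JSON` is a hypothetical shell variable holding the object above):

```bash
# Compact the entry to a single line, then append to the log.
mkdir -p .great_cto
jq -c . <<< "$ENTRY_JSON" >> .great_cto/triage-log.jsonl
```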
This log is how we measure whether triage earns its keep. Review it weekly:
```bash
# False-positive rate: how many findings the arbiter flipped to INVALID
jq -c 'select(.arbiter.verdict=="INVALID")' .great_cto/triage-log.jsonl | wc -l

# Average rounds-to-consensus (did we need all 3 or did R1+R2 agree?); 1.0 means every finding's rounds agreed
jq -s '[.[] | [.rounds[].verdict] | unique | length] | add / length' .great_cto/triage-log.jsonl
```
If FP rate < 10% after 50 triages, triage is hunting for noise that was never there; lower the threshold or skip triage for that angle. If FP rate > 40%, the original review prompt is too trigger-happy; tighten the angle rules.
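To read the FP rate directly as a fraction rather than a raw count:

```bash
# Fraction of triaged findings the arbiter flipped to INVALID.
jq -s '([.[] | select(.arbiter.verdict == "INVALID")] | length) / length' .great_cto/triage-log.jsonl
```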
Per triaged finding: ~4 LLM turns (3 rounds + arbiter). At typical review sizes (~5-10 triaged findings per PR), total budget: 20-40 extra turns per /review. Batch when possible — one arbiter can handle multiple findings in a single call if their cruxes are independent.
For cost-sensitive runs (approval-level: auto on a huge PR), consider: triage only P0, leave P1 untriaged. Re-tune based on .great_cto/triage-log.jsonl data.
Always report both: confidence (the vote) and the arbiter's verdict (the final call). Humans deserve to see the disagreement.