From 3-surgeons
Metacognition — evaluate gate system effectiveness, calibrate thresholds, and fine-tune invariance
npx claudepluginhub supportersimulator/3-surgeons --plugin 3-surgeons
**The gates evaluate themselves.**
A gate system that cannot assess its own effectiveness will either over-gate (delivery freeze) or under-gate (quality erosion). Neither failure mode is acceptable. This skill provides the feedback loop that keeps the invariance suite calibrated.
| Trigger | When |
|---|---|
| Every 10 gate invocations | Automatic retrospective cycle |
| Aaron asks "how are the gates doing?" | On-demand health check |
| Gate override rate exceeds 30% | Signal that gates may be miscalibrated |
| Delivery velocity drops noticeably | Signal that gates may be over-blocking |
| Bug escapes to production | Signal that gates may be under-blocking |
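The two numeric triggers in the table (the 10-invocation cycle and the 30% override rate) lend themselves to a mechanical check. The following is a minimal sketch, not the plugin's actual code: the function name `retrospective_reasons`, the constant names, and the choice to compute the override rate over a caller-supplied window of recent runs are all assumptions made for illustration.

```python
from typing import List

RETRO_INTERVAL = 10          # "Every 10 gate invocations"
OVERRIDE_RATE_LIMIT = 0.30   # "Gate override rate exceeds 30%"


def retrospective_reasons(total_invocations: int,
                          recent_overrides: List[bool]) -> List[str]:
    """Return the reasons (possibly none) to start a retrospective now.

    total_invocations: number of gate runs logged so far.
    recent_overrides:  one flag per recent run, True if a finding was overridden.
    """
    reasons = []
    # Scheduled cycle: fires on the 10th, 20th, 30th, ... invocation.
    if total_invocations > 0 and total_invocations % RETRO_INTERVAL == 0:
        reasons.append("scheduled")
    # Miscalibration signal: override rate strictly above the limit.
    if recent_overrides:
        override_rate = sum(recent_overrides) / len(recent_overrides)
        if override_rate > OVERRIDE_RATE_LIMIT:
            reasons.append("override_rate")
    return reasons
```

The velocity and bug-escape triggers are left out because the page gives no numeric definition for them; they would need a human judgment or an external data source.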
Log these to the evidence store (~/.3surgeons/evidence.db) for every gate run:
| Metric | What It Measures | Target |
|---|---|---|
| time_to_gate | Total gate duration (seconds) | Light <30s, Standard <120s, Full <300s |
| mode_selected | Light/Standard/Full | Proportional to actual risk |
| sentinel_risk | Sentinel risk level output | Accurate reflection of true risk |
| surgeon_agreement | Did all 3 agree? (yes/partial/no) | Agreement on clear cases, disagreement on ambiguous |
| gate_outcome | pass/fail/bypass | Majority should pass (gates catch design issues, not block everything) |
| override | Did Aaron/Atlas override a finding? | Low rate (<15%) — high override = miscalibrated gate |
| cost_usd | LLM spend for this gate run | Light ~$0, Standard ~$0.003, Full ~$0.01 |
| false_positive | Gate blocked but change was fine (retrospective) | <20% — too many = delivery friction |
| false_negative | Gate passed but change caused issues (retrospective) | <5% — too many = gate not catching enough |
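The schema of `evidence.db` is not described on this page, so the following is only a sketch of how the metrics above might be logged with Python's standard `sqlite3` module. The table name `gate_runs`, the `recorded_at` column, and the helper names `open_store` and `log_gate_run` are hypothetical; only the metric column names come from the table.

```python
import sqlite3

# Hypothetical schema: one row per gate run, one column per metric above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS gate_runs (
    id                INTEGER PRIMARY KEY AUTOINCREMENT,
    recorded_at       TEXT DEFAULT CURRENT_TIMESTAMP,
    time_to_gate      REAL,     -- seconds
    mode_selected     TEXT,     -- light / standard / full
    sentinel_risk     TEXT,
    surgeon_agreement TEXT,     -- yes / partial / no
    gate_outcome      TEXT,     -- pass / fail / bypass
    override          INTEGER,  -- 0 or 1
    cost_usd          REAL,
    false_positive    INTEGER,  -- NULL until reviewed retrospectively
    false_negative    INTEGER   -- NULL until reviewed retrospectively
)
"""

ALLOWED_COLUMNS = {
    "time_to_gate", "mode_selected", "sentinel_risk", "surgeon_agreement",
    "gate_outcome", "override", "cost_usd", "false_positive", "false_negative",
}


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def log_gate_run(conn: sqlite3.Connection, **metrics) -> int:
    """Insert one gate run and return its row id."""
    unknown = set(metrics) - ALLOWED_COLUMNS
    if unknown or not metrics:
        raise ValueError(f"unexpected or missing metrics: {sorted(unknown)}")
    columns = ", ".join(metrics)
    placeholders = ", ".join(f":{name}" for name in metrics)
    cur = conn.execute(
        f"INSERT INTO gate_runs ({columns}) VALUES ({placeholders})", metrics
    )
    conn.commit()
    return cur.lastrowid
```

The two retrospective columns are left `NULL` at logging time because a false positive or false negative can only be judged after the change has shipped.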
Every 10 gate invocations, run this analysis:
You MUST create a TodoWrite task for each step:
1. **Collect metrics from the last 10 invocations.** Query the evidence store for the 10 most recent gate entries and calculate aggregates.
2. **Calculate the false positive rate.** Of the gates that blocked or flagged a change, how many were later shown to be unnecessary? Count cases where the flagged issue turned out not to be a problem.
3. **Calculate the false negative rate.** Of the gates that passed, how many changes later caused issues? Cross-reference with bug reports, rollbacks, or hotfixes that followed a passed gate.
4. **Cross-exam with the cardiologist.** Ask: "Given these 10 gate results, are we over-gating or under-gating? Is the risk stratification putting changes in the right modes?"

       You are part of a SOFTWARE DEVELOPMENT quality gate system.
       Here are the last 10 gate invocations with their outcomes:
       [data]
       Analyze: Are we over-gating (blocking good changes) or under-gating
       (passing bad changes)? Is the Light/Standard/Full mode selection
       accurate? What threshold adjustments would improve gate accuracy?

5. **Ask the neurologist for pattern classification.** Ask: "Are certain change types consistently over-blocked or under-blocked?"

       You are part of a SOFTWARE DEVELOPMENT protocol called '3-surgeons'.
       You are the local LLM (Qwen3-4B). We write CODE, not perform medical surgery.
       Here are 10 gate invocations with mode, risk, and outcome:
       [data]
       YOUR ROLE: Classify patterns. Are certain types of changes
       (e.g., config changes, test additions) consistently assigned
       higher risk modes than their outcomes justify? Which change
       types are correctly calibrated?

6. **Propose threshold adjustments.** Based on the data, recommend specific calibration changes.
7. **Present the retrospective to Aaron.** Give a summary with data and specific recommendations. Format:
## Invariance Health — 10-Invocation Retrospective
**Period**: [date range]
**Gate invocations**: 10
**Pass rate**: X/10
**Override rate**: X/10
**Avg time-to-gate**: Xs (Light), Xs (Standard), Xs (Full)
**Total cost**: $X.XX
### Calibration Assessment
- False positive rate: X% (target <20%)
- False negative rate: X% (target <5%)
- Mode accuracy: X/10 changes were in the right mode
### Surgeon Assessment
- Cardiologist: [over-gating / under-gating / well-calibrated]
- Neurologist: [pattern findings]
### Recommendations
1. [specific adjustment with data backing]
2. [specific adjustment with data backing]
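The metric-collection and false positive / false negative steps of the retrospective reduce to a few counts over the logged runs. Here is a minimal sketch under stated assumptions: each run is a plain dict keyed by the metric names above, a gate counts as "blocked" when `gate_outcome` is `"fail"`, and the function names `rate` and `summarize` are hypothetical.

```python
from typing import Dict, List, Optional


def rate(numerator: int, denominator: int) -> Optional[float]:
    """Fraction in [0, 1], or None when there is nothing to divide by."""
    return numerator / denominator if denominator else None


def summarize(runs: List[Dict]) -> Dict:
    """Aggregate the fields needed for the retrospective summary."""
    # Assumption: "blocked or flagged" maps to gate_outcome == "fail".
    blocked = [r for r in runs if r["gate_outcome"] == "fail"]
    passed = [r for r in runs if r["gate_outcome"] == "pass"]
    return {
        "invocations": len(runs),
        "pass_count": len(passed),
        "override_count": sum(1 for r in runs if r["override"]),
        "total_cost_usd": round(sum(r["cost_usd"] for r in runs), 4),
        # Share of blocked changes that turned out to be fine (target <20%).
        "false_positive_rate": rate(
            sum(1 for r in blocked if r["false_positive"]), len(blocked)
        ),
        # Share of passed changes that later caused issues (target <5%).
        "false_negative_rate": rate(
            sum(1 for r in passed if r["false_negative"]), len(passed)
        ),
    }
```

Note the denominators: the false positive rate is taken over blocked gates only and the false negative rate over passed gates only, matching how the steps phrase them. With only 10 invocations per cycle these rates are coarse, so a single misjudged run can move them by 10 points or more.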
Recommendations from invariance-health feed back into the system:
    gate invocations → evidence store → invariance-health analysis
                                                   ↓
                                            recommendations
                                                   ↓
                                 architectural-gate risk stratification
                                 sentinel threshold adjustments
                                 mode selection criteria updates
The system learns from its own gating decisions. This is not static configuration — it evolves based on empirical outcomes.
The cross-exam raised a valid point: a 10-invocation retrospective only looks backward, so it can lag behind fast-moving problems. To catch those between cycles, also monitor these real-time signals:
| Signal | Detection | Action |
|---|---|---|
| 3 consecutive overrides | Evidence store query | Trigger early retrospective |
| Gate time exceeds 2x target | Timer in gate execution | Flag for investigation |
| Surgeon disagreement rate >50% over 5 invocations | Evidence store query | Trigger cross-exam on the gate system itself |
| Zero gates in 48h despite active development | Absence detection | Check whether gates are being silently skipped |
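The four signals in the table can be evaluated against the most recent runs after each gate. The sketch below is illustrative only. It assumes runs are dicts ordered oldest to newest, that `recorded_at` is a `datetime`, that mode names are lowercase, and that a `"partial"` agreement counts as disagreement; the function name `realtime_signals` and the signal labels are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Per-mode duration targets from the metrics table, in seconds.
TARGET_SECONDS = {"light": 30, "standard": 120, "full": 300}


def realtime_signals(runs: List[Dict], now: datetime,
                     active_development: bool = True) -> List[str]:
    """Return the names of all signals that currently fire."""
    signals = []

    # 3 consecutive overrides -> trigger an early retrospective.
    last3 = runs[-3:]
    if len(last3) == 3 and all(r["override"] for r in last3):
        signals.append("three_consecutive_overrides")

    # Latest gate took more than twice its mode's target duration.
    if runs:
        latest = runs[-1]
        if latest["time_to_gate"] > 2 * TARGET_SECONDS[latest["mode_selected"]]:
            signals.append("gate_time_over_2x_target")

    # More than half of the last 5 runs lacked full surgeon agreement.
    last5 = runs[-5:]
    if len(last5) == 5:
        disagreements = sum(1 for r in last5 if r["surgeon_agreement"] != "yes")
        if disagreements / 5 > 0.5:
            signals.append("surgeon_disagreement")

    # No gate recorded in the last 48 hours while development is active.
    newest = max((r["recorded_at"] for r in runs), default=None)
    if active_development and (newest is None or now - newest > timedelta(hours=48)):
        signals.append("no_gates_in_48h")

    return signals
```

Whether development is "active" has to come from outside the evidence store (for example, recent commits), which is why it is passed in as a flag rather than inferred here.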