Hypothesis validation rules for CAPPY TAC investigations — applies falsification-first methodology, tests hypotheses against Phase 3 evidence, and eliminates hypotheses that cannot survive scrutiny.
Install: `npx claudepluginhub thelightarchitect/cappy-toolkit --plugin cappy-toolkit`

This skill uses the workspace's default tool permissions.
<!-- Copyright (C) 2025-2026 Kevin Francis Tan (github.com/theLightArchitect) | SPDX-License-Identifier: AGPL-3.0-or-later -->
Version: 1.1.0
Purpose: Validate Phase 4 hypothesis against Phase 3 evidence
Agent: CAPPY (singleton agent)
Created: 2026-02-05
Updated: 2026-03-02 — Added Falsification-First methodology (SF-03936747 lesson)
When to Trigger: After agent validates hypothesis alignment with evidence
Hook Specification:
```
HypothesisCoherenceHook::trigger(
    hypothesis: String,            // Root cause hypothesis from Phase 4
    alignment_score: f32,          // Evidence alignment score (0.0-1.0)
    threshold: f32,                // 0.85 (hard threshold for Phase 4)
    weak_assumptions: Vec<String>, // Unverified claims in hypothesis
    contradictions: Vec<String>,   // Evidence contradicting hypothesis
    confidence_delta: f32,         // Adjustment to overall confidence
    details: HashMap {
        "matches": usize,          // Claims supported by evidence
        "gaps": usize,             // Claims not supported by evidence
        "evidence_alignment_rate": f32
    }
)
```
Agent Implementation:
Hook Response Handling:
Hook returns: { passed: bool, confidence_adjustment: float, recommendations: Vec<String> }
If passed (alignment >= 0.85):
→ Return "PASSED: Hypothesis aligns with evidence" to agent
→ Confidence can increase by delta (typically +0.05 to +0.15)
If blocked (alignment < 0.85):
→ Return "BLOCKED: Weak hypothesis-evidence alignment" with weak_assumptions and contradictions
→ Agent provides recovery options to Claude:
Option 1: Accept weak assumptions, flag for Phase 5 verification
Option 2: Revise hypothesis to align better with evidence
Option 3: Request Kevin guidance on alternative hypotheses
Why This Hook: Hypothesis is the core of the investigation. If it doesn't align with evidence, Phase 5 validation will be wasted. This hook catches misalignment early.
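The pass/block flow above can be sketched in Python. The response shape follows the spec in this document; the function name and the clamp on confidence are illustrative assumptions, not part of the hook contract.

```python
def handle_hook_response(resp: dict, confidence: float):
    """Route a HypothesisCoherenceHook result.

    `resp` follows the documented shape:
    { passed: bool, confidence_adjustment: float, recommendations: [str] }
    """
    if resp["passed"]:  # alignment >= 0.85
        # Confidence may rise by the returned delta (typically +0.05 to +0.15)
        new_confidence = min(1.0, confidence + resp["confidence_adjustment"])
        return "PASSED: Hypothesis aligns with evidence", new_confidence, []

    # Blocked: surface recovery options instead of proceeding
    options = [
        "Accept weak assumptions, flag for Phase 5 verification",
        "Revise hypothesis to align better with evidence",
        "Request guidance on alternative hypotheses",
    ]
    return "BLOCKED: Weak hypothesis-evidence alignment", confidence, options

msg, conf, opts = handle_hook_response(
    {"passed": True, "confidence_adjustment": 0.10, "recommendations": []}, 0.85)
print(msg, round(conf, 2))  # PASSED: Hypothesis aligns with evidence 0.95
```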
The /sherlock skill defines hypothesis validation rules that test whether a proposed root cause survives scrutiny against Phase 3 evidence. Agent CAPPY validates the Phase 4 hypothesis and identifies weak assumptions for Phase 5 verification.
Origin: SF-03936747. An expired X.509 signing certificate was present in the HAR evidence from day one. Initial analysis verified URLs and StatusCode (all matched), hypothesized ACS URL mismatch based on precedent, and sent guidance to the customer. Four days and three customer responses later, the certificate was finally extracted and found to have expired 21.5 hours before the login attempt. One openssl x509 command on day one would have identified the root cause immediately.
Principle: Do not seek evidence that confirms a hypothesis — seek evidence that would disprove it. Hypotheses that survive disproof attempts are candidates for truth. Hypotheses that are disproven are eliminated.
FOR EACH hypothesis from triage:
1. STATE the hypothesis clearly
2. ASK: "What evidence would DISPROVE this?"
3. GO FIND that evidence:
- Decode ALL embedded artifacts (certs, tokens, encoded payloads)
- Check every field, not just the obvious ones
- Surface-level checks are INSUFFICIENT
4. EVALUATE:
- Evidence found that disproves → ELIMINATE hypothesis
- Evidence found that is ambiguous → FLAG for deeper verification
- No disproving evidence found despite thorough search → CANDIDATE for truth
5. REPEAT for next hypothesis
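The steps above can be sketched as a small loop. `find_disproof` is a hypothetical callable that deep-checks the evidence and returns "disproven", "ambiguous", or None.

```python
def falsification_pass(hypotheses, evidence, find_disproof):
    """Apply the falsification-first loop: eliminate, flag, or keep each hypothesis."""
    candidates, flagged = [], []
    for hyp in hypotheses:
        verdict = find_disproof(hyp, evidence)
        if verdict == "disproven":
            continue                    # ELIMINATE hypothesis
        if verdict == "ambiguous":
            flagged.append(hyp)         # FLAG for deeper verification
        else:
            candidates.append(hyp)      # CANDIDATE for truth
    return candidates, flagged

# Toy run mirroring SF-03936747: all URLs matched, so the ACS-mismatch
# hypothesis is disproven and only the certificate hypothesis survives
verdicts = {"ACS URL mismatch": "disproven", "X.509 certificate issue": None}
candidates, flagged = falsification_pass(
    list(verdicts), evidence=None, find_disproof=lambda h, e: verdicts[h])
print(candidates)  # ['X.509 certificate issue']
```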
Every investigation must decode and inspect ALL embedded artifacts, not just check field names and status codes:
| Artifact Type | Surface Check (INSUFFICIENT) | Deep Check (REQUIRED) |
|---|---|---|
| SAML Response | URLs match, StatusCode=Success | Extract X.509 cert → check expiry, issuer, subject, fingerprint |
| JWT Token | Token present, not expired | Decode payload → check claims, audience, issuer, signing algorithm |
| TLS/SSL | Connection succeeds | Check cert chain, expiry, CN/SAN match, protocol version |
| HAR entries | Status codes, URL patterns | Decode Base64 bodies, check response content, timing correlations |
| Log bundles | Grep for "error" | Parse structured fields, correlate timestamps, check config values |
| Config files | Key names present | Validate values match expected format, cross-reference with other sources |
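As one concrete deep check from the table, a JWT payload can be decoded with the standard library alone. This does not verify the signature (use a real JWT library for that), and the sample token and claim values below are fabricated for illustration.

```python
import base64
import json

def decode_jwt_unverified(token: str) -> dict:
    """Deep check: decode header and claims instead of only noting the token exists.
    Does NOT verify the signature."""
    header_b64, payload_b64, _signature = token.split(".")
    def b64url_json(segment: str):
        segment += "=" * (-len(segment) % 4)  # restore stripped base64url padding
        return json.loads(base64.urlsafe_b64decode(segment))
    return {"header": b64url_json(header_b64), "claims": b64url_json(payload_b64)}

def make_segment(obj) -> str:
    return base64.urlsafe_b64encode(json.dumps(obj).encode()).rstrip(b"=").decode()

# Fabricated token: the surface check ("token present, not expired") passes,
# but the deep check exposes alg=none and a wrong audience
token = ".".join([
    make_segment({"alg": "none", "typ": "JWT"}),
    make_segment({"iss": "idp.example.com", "aud": "wrong-app", "exp": 1899999999}),
    "sig",
])
decoded = decode_jwt_unverified(token)
print(decoded["header"]["alg"], decoded["claims"]["aud"])  # none wrong-app
```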
Hypothesis: "ACS URL mismatch between Okta and XSOAR"
Disproof question: "What evidence would show the ACS URL IS correct?"
Evidence found:
- SAMLRequest ACS URL = https://...dev.crtx.../idp/saml ✓ matches
- SAMLResponse Destination = https://...dev.crtx.../idp/saml ✓ matches
- SAMLResponse Audience = https://...dev.crtx... ✓ matches
- SAMLResponse Recipient = https://...dev.crtx.../idp/saml ✓ matches
Result: ALL URLs match → ACS mismatch hypothesis DISPROVEN → ELIMINATE
Next hypothesis: "X.509 certificate issue"
Deep decode: openssl x509 → Certificate expired 21.5 hours before login
Result: NOT disproven, CONFIRMED → ROOT CAUSE
A hypothesis is VALID when:
Alignment:
- All direct claims in hypothesis are supported by Phase 3 evidence
- Evidence patterns match hypothesis explanation
- Timeline matches hypothesis sequence
Contradictions:
- No evidence contradicts hypothesis
- No assumptions conflict with facts
- No timeline mismatches
Assumptions:
- All unverified claims clearly marked as assumptions
- Assumptions are testable (can be verified in Phase 5)
- Assumptions are documented with what would verify them
Completeness:
- Hypothesis accounts for all key evidence
- Hypothesis explains observed behavior
- No major facts left unexplained
Coherence:
- Logical flow from evidence to conclusion
- Root cause explains symptoms
- Solution would resolve root cause
Check: Does hypothesis align with Phase 3 evidence?
Evidence Alignment Checks:
- For each claim in hypothesis:
✓ Find supporting evidence in Phase 3
✓ Check claim type matches evidence type
✓ Check confidence level is reasonable
- Matching patterns:
evidence: "HTTP 429 at 20:51:45Z"
hypothesis_claim: "Rate limit exceeded"
match: YES (429 = rate limit error)
evidence: "Request rate 150 req/min"
hypothesis_claim: "Exceeded limit of 100 req/min"
match: YES (150 > 100)
- Non-matching patterns:
evidence: "Timeout after 30 seconds"
hypothesis_claim: "Webhook integration timeout"
match: WEAK if no webhook in architecture
→ Flag as: "Assumption: Uses webhook integration"
Agent Action:
```python
matches = []
gaps = []
for claim in hypothesis_claims:
    supporting = find_evidence(claim, phase_3_evidence)
    if supporting:
        matches.append({"claim": claim, "evidence": supporting})
    else:
        gaps.append(claim)

alignment_score = len(matches) / len(hypothesis_claims)
```
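With a stand-in implementation for the helper (the substring matcher below is a hypothetical stand-in for `find_evidence`, not the skill's real logic), the scoring behaves like:

```python
def find_evidence(claim, evidence_items):
    """Hypothetical stand-in: return the first evidence item mentioning the claim."""
    return next((e for e in evidence_items if claim.lower() in e.lower()), None)

hypothesis_claims = ["HTTP 429", "Rate limit exceeded", "API limit is 100 req/min"]
phase_3_evidence = [
    "HAR entry 145: HTTP 429 at 20:51:45Z",
    "Response header: rate limit exceeded",
]

matches = [c for c in hypothesis_claims if find_evidence(c, phase_3_evidence)]
gaps = [c for c in hypothesis_claims if not find_evidence(c, phase_3_evidence)]
alignment_score = len(matches) / len(hypothesis_claims)
print(gaps, round(alignment_score, 2))  # ['API limit is 100 req/min'] 0.67
```

Unsupported claims land in `gaps` and drag the alignment score below the 0.85 threshold, which is exactly what blocks the hook.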
Check: Does evidence contradict hypothesis?
Contradiction Types:
1. Temporal Contradiction
hypothesis: "Error occurs at 10:30:45Z"
evidence: "Error log shows 10:30:42Z"
conflict: POSSIBLE (is a 3-second difference significant here?)
2. Behavioral Contradiction
hypothesis: "Integration disabled"
evidence: "Integration ran at 10:30:45Z"
conflict: YES (can't run if disabled)
3. Data Type Contradiction
hypothesis: "String overflow in field"
evidence: "Field is integer type"
conflict: YES (can't overflow string in int field)
4. Architecture Contradiction
hypothesis: "Webhook integration timeout"
evidence: "Customer uses REST polling, no webhooks"
conflict: YES (wrong integration type)
Severity Levels:
CRITICAL: Hypothesis impossible given evidence
HIGH: Hypothesis contradicted by evidence
MEDIUM: Timing mismatch or unclear
LOW: Minor discrepancy, likely explanation exists
Agent Action:
```python
contradictions = []
for assumption in hypothesis_assumptions:
    contradicting = find_contradiction(assumption, phase_3_evidence)
    if contradicting:
        contradictions.append({
            "assumption": assumption,
            "contradiction": contradicting,
            "severity": assess_severity(assumption, contradicting),
        })

has_critical = any(c["severity"] == "CRITICAL" for c in contradictions)
```
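A toy run of the contradiction scan, with hypothetical stand-ins for `find_contradiction` and `assess_severity` (the real skill's implementations are not shown in this document):

```python
def find_contradiction(assumption, evidence_items):
    """Hypothetical stand-in: evidence contradicts an assumption it directly negates."""
    return next((e for e in evidence_items if e.get("negates") == assumption), None)

def assess_severity(assumption, contradicting):
    return contradicting.get("severity", "LOW")

hypothesis_assumptions = ["Integration disabled", "Polling interval is 60s"]
phase_3_evidence = [
    {"fact": "Integration ran at 10:30:45Z",
     "negates": "Integration disabled",
     "severity": "CRITICAL"},  # behavioral contradiction: can't run if disabled
]

contradictions = []
for assumption in hypothesis_assumptions:
    hit = find_contradiction(assumption, phase_3_evidence)
    if hit:
        contradictions.append({"assumption": assumption,
                               "contradiction": hit["fact"],
                               "severity": assess_severity(assumption, hit)})

has_critical = any(c["severity"] == "CRITICAL" for c in contradictions)
print(has_critical)  # True
```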
Check: Which assumptions are unverified?
Assumption Types:
1. Evidence-based Assumption
"API has 100 req/min rate limit"
evidence: "HTTP 429 errors when request rate > 100"
verification: WEAK (inferred, not directly stated)
action: Mark as assumption, verify in Phase 5
2. Configuration Assumption
"Polling interval is 60 seconds"
evidence: "Integration config not in bundle"
verification: UNVERIFIED (not in Phase 3)
action: Retrieve config in Phase 5
3. Behavioral Assumption
"System retry behavior causes cascade"
evidence: "Single failed request, then 5 retries"
verification: WEAK (inferred from observed pattern)
action: Verify in logs/docs in Phase 5
4. Capability Assumption
"Webhook timeout fires after 30s"
evidence: "Not tested in customer environment"
verification: UNVERIFIED (from documentation, not observation)
action: Verify against actual customer setup in Phase 5
Test Criteria:
✓ Can we test this assumption in Phase 5?
✓ What evidence would verify/refute it?
✓ How important is it to the root cause?
Agent Action:
```python
assumptions = extract_assumptions(hypothesis)
key_assumptions = []
for assumption in assumptions:
    verification_level = assess_verification(assumption, phase_3_evidence)
    if verification_level < 0.80:  # below 80% verified
        key_assumptions.append({
            "assumption": assumption,
            "verified": verification_level,
            "test_in_phase_5": generate_test_strategy(assumption),
        })
```
How Validation Changes Confidence
Confidence Adjustment Rules:
Base Confidence (from Phase 2 triage):
phase_2_confidence: 0.85 # e.g., 85%
Evidence Alignment Score:
matches: 10/12 claims = 0.83
delta: 0.83 - 1.0 = -0.17
Contradiction Check:
critical_contradictions: 0
high_contradictions: 0
delta: 0.0 (none found)
Assumption Risk:
unverified_assumptions: 2
importance: medium, low
delta: -0.05
Final Confidence:
0.85 + (-0.17) + 0.0 + (-0.05) = 0.63
Interpretation:
Phase 2 confidence: 85%
Phase 4 validation confidence: 63%
Delta: -22% (significant downward adjustment)
Meaning: Phase 3 evidence less aligned than Phase 2 suggested
Action: Revalidate triage assumptions, adjust hypothesis
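The arithmetic above can be captured in a small helper. The adjustment rules are the document's; the clamp to [0.0, 1.0] is an assumption not stated in the rules.

```python
def adjusted_confidence(base: float, alignment_score: float,
                        contradiction_penalty: float, assumption_penalty: float):
    """Apply the Phase 4 adjustment: alignment shortfall plus penalties."""
    delta = (alignment_score - 1.0) - contradiction_penalty - assumption_penalty
    return max(0.0, min(1.0, base + delta)), delta

# Worked example from this section: 10/12 claims matched, no contradictions,
# two unverified assumptions costing 0.05
confidence, delta = adjusted_confidence(0.85, 10 / 12, 0.0, 0.05)
print(round(confidence, 2), round(delta, 2))  # 0.63 -0.22
```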
```json
{
  "status": "VALID|WEAK|INVALID",
  "confidence_before": 0.85,
  "confidence_after": 0.63,
  "confidence_delta": -0.22,
  "evidence_alignment": {
    "matches": [
      {"claim": "HTTP 429 errors", "evidence": "HAR entry 145", "strength": 0.95},
      {"claim": "Rate limit", "evidence": "429 = rate limit", "strength": 0.90}
    ],
    "gaps": [
      {"claim": "API limit is 100 req/min", "evidence": "NONE", "strength": 0.0}
    ],
    "alignment_score": 0.83
  },
  "contradictions": {
    "critical": [],
    "high": [],
    "medium": [],
    "low": [],
    "total": 0
  },
  "key_assumptions": [
    {
      "assumption": "API rate limit is 100 req/min",
      "verified": 0.30,
      "importance": "CRITICAL",
      "test_strategy": "Search customer config, API docs, JIRA history"
    },
    {
      "assumption": "Webhook integration NOT in use",
      "verified": 0.70,
      "importance": "HIGH",
      "test_strategy": "Verify in JIRA, customer architecture docs"
    }
  ],
  "weak_assumptions_count": 2,
  "recommendation": "Hypothesis WEAK - proceed with Phase 5 focus on verifying key assumptions"
}
```