Hypothesis validation rules for CAPPY TAC investigations — applies falsification-first methodology, tests hypotheses against Phase 3 evidence, and eliminates hypotheses that cannot survive scrutiny.
Install: `npx claudepluginhub thelightarchitect/cappy-toolkit --plugin cappy-toolkit`

This skill uses the workspace's default tool permissions.
<!-- Copyright (C) 2025-2026 Kevin Francis Tan (github.com/theLightArchitect) | SPDX-License-Identifier: AGPL-3.0-or-later -->
Version: 1.1.0
Purpose: Validate Phase 4 hypothesis against Phase 3 evidence
Agent: CAPPY (singleton agent)
Created: 2026-02-05
Updated: 2026-03-02 — Added Falsification-First methodology (SF-03936747 lesson)
When to Trigger: After agent validates hypothesis alignment with evidence
Hook Specification:
```
HypothesisCoherenceHook::trigger(
    hypothesis: String,            // Root cause hypothesis from Phase 4
    alignment_score: f32,          // Evidence alignment score (0.0-1.0)
    threshold: f32,                // 0.85 (hard threshold for Phase 4)
    weak_assumptions: Vec<String>, // Unverified claims in hypothesis
    contradictions: Vec<String>,   // Evidence contradicting hypothesis
    confidence_delta: f32,         // Adjustment to overall confidence
    details: HashMap {
        "matches": usize,          // Claims supported by evidence
        "gaps": usize,             // Claims not supported by evidence
        "evidence_alignment_rate": f32
    }
)
```
Agent Implementation:
Hook Response Handling:
Hook returns: { passed: bool, confidence_adjustment: float, recommendations: Vec<String> }
If passed (alignment >= 0.85):
→ Return "PASSED: Hypothesis aligns with evidence" to agent
→ Confidence can increase by delta (typically +0.05 to +0.15)
If blocked (alignment < 0.85):
→ Return "BLOCKED: Weak hypothesis-evidence alignment" with weak_assumptions and contradictions
→ Agent provides recovery options to Claude:
Option 1: Accept weak assumptions, flag for Phase 5 verification
Option 2: Revise hypothesis to align better with evidence
Option 3: Request Kevin guidance on alternative hypotheses
Why This Hook: Hypothesis is the core of the investigation. If it doesn't align with evidence, Phase 5 validation will be wasted. This hook catches misalignment early.
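The pass/block flow above can be sketched in Python. The response shape follows the spec in this document; the function name and the clamp on confidence are illustrative assumptions, not part of the hook contract.

```python
def handle_hook_response(resp: dict, confidence: float):
    """Route a HypothesisCoherenceHook result.

    `resp` follows the documented shape:
    { passed: bool, confidence_adjustment: float, recommendations: [str] }
    """
    if resp["passed"]:  # alignment >= 0.85
        # Confidence may rise by the returned delta (typically +0.05 to +0.15)
        new_confidence = min(1.0, confidence + resp["confidence_adjustment"])
        return "PASSED: Hypothesis aligns with evidence", new_confidence, []

    # Blocked: surface recovery options instead of proceeding
    options = [
        "Accept weak assumptions, flag for Phase 5 verification",
        "Revise hypothesis to align better with evidence",
        "Request guidance on alternative hypotheses",
    ]
    return "BLOCKED: Weak hypothesis-evidence alignment", confidence, options

msg, conf, opts = handle_hook_response(
    {"passed": True, "confidence_adjustment": 0.10, "recommendations": []}, 0.85)
print(msg, round(conf, 2))  # PASSED: Hypothesis aligns with evidence 0.95
```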
The /sherlock skill defines hypothesis validation rules that test whether a proposed root cause survives scrutiny against Phase 3 evidence. Agent CAPPY validates the Phase 4 hypothesis and identifies weak assumptions for Phase 5 verification.
Origin: SF-03936747. An expired X.509 signing certificate was present in the HAR evidence from day one. Initial analysis verified URLs and StatusCode (all matched), hypothesized ACS URL mismatch based on precedent, and sent guidance to the customer. Four days and three customer responses later, the certificate was finally extracted and found to have expired 21.5 hours before the login attempt. One openssl x509 command on day one would have identified the root cause immediately.
Principle: Do not seek evidence that confirms a hypothesis — seek evidence that would disprove it. Hypotheses that survive disproof attempts are candidates for truth. Hypotheses that are disproven are eliminated.
FOR EACH hypothesis from triage:
1. STATE the hypothesis clearly
2. ASK: "What evidence would DISPROVE this?"
3. GO FIND that evidence:
- Decode ALL embedded artifacts (certs, tokens, encoded payloads)
- Check every field, not just the obvious ones
- Surface-level checks are INSUFFICIENT
4. EVALUATE:
- Evidence found that disproves → ELIMINATE hypothesis
- Evidence found that is ambiguous → FLAG for deeper verification
- No disproving evidence found despite thorough search → CANDIDATE for truth
5. REPEAT for next hypothesis
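The steps above can be sketched as a small loop. `find_disproof` is a hypothetical callable that deep-checks the evidence and returns "disproven", "ambiguous", or None.

```python
def falsification_pass(hypotheses, evidence, find_disproof):
    """Apply the falsification-first loop: eliminate, flag, or keep each hypothesis."""
    candidates, flagged = [], []
    for hyp in hypotheses:
        verdict = find_disproof(hyp, evidence)
        if verdict == "disproven":
            continue                    # ELIMINATE hypothesis
        if verdict == "ambiguous":
            flagged.append(hyp)         # FLAG for deeper verification
        else:
            candidates.append(hyp)      # CANDIDATE for truth
    return candidates, flagged

# Toy run mirroring SF-03936747: all URLs matched, so the ACS-mismatch
# hypothesis is disproven and only the certificate hypothesis survives
verdicts = {"ACS URL mismatch": "disproven", "X.509 certificate issue": None}
candidates, flagged = falsification_pass(
    list(verdicts), evidence=None, find_disproof=lambda h, e: verdicts[h])
print(candidates)  # ['X.509 certificate issue']
```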
Every investigation must decode and inspect ALL embedded artifacts, not just check field names and status codes:
| Artifact Type | Surface Check (INSUFFICIENT) | Deep Check (REQUIRED) |
|---|---|---|
| SAML Response | URLs match, StatusCode=Success | Extract X.509 cert → check expiry, issuer, subject, fingerprint |
| JWT Token | Token present, not expired | Decode payload → check claims, audience, issuer, signing algorithm |
| TLS/SSL | Connection succeeds | Check cert chain, expiry, CN/SAN match, protocol version |
| HAR entries | Status codes, URL patterns | Decode Base64 bodies, check response content, timing correlations |
| Log bundles | Grep for "error" | Parse structured fields, correlate timestamps, check config values |
| Config files | Key names present | Validate values match expected format, cross-reference with other sources |
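As one concrete deep check from the table, a JWT payload can be decoded with the standard library alone. This does not verify the signature (use a real JWT library for that), and the sample token and claim values below are fabricated for illustration.

```python
import base64
import json

def decode_jwt_unverified(token: str) -> dict:
    """Deep check: decode header and claims instead of only noting the token exists.
    Does NOT verify the signature."""
    header_b64, payload_b64, _signature = token.split(".")
    def b64url_json(segment: str):
        segment += "=" * (-len(segment) % 4)  # restore stripped base64url padding
        return json.loads(base64.urlsafe_b64decode(segment))
    return {"header": b64url_json(header_b64), "claims": b64url_json(payload_b64)}

def make_segment(obj) -> str:
    return base64.urlsafe_b64encode(json.dumps(obj).encode()).rstrip(b"=").decode()

# Fabricated token: the surface check ("token present, not expired") passes,
# but the deep check exposes alg=none and a wrong audience
token = ".".join([
    make_segment({"alg": "none", "typ": "JWT"}),
    make_segment({"iss": "idp.example.com", "aud": "wrong-app", "exp": 1899999999}),
    "sig",
])
decoded = decode_jwt_unverified(token)
print(decoded["header"]["alg"], decoded["claims"]["aud"])  # none wrong-app
```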
Hypothesis: "ACS URL mismatch between Okta and XSOAR"
Disproof question: "What evidence would show the ACS URL IS correct?"
Evidence found:
- SAMLRequest ACS URL = https://...dev.crtx.../idp/saml ✓ matches
- SAMLResponse Destination = https://...dev.crtx.../idp/saml ✓ matches
- SAMLResponse Audience = https://...dev.crtx... ✓ matches
- SAMLResponse Recipient = https://...dev.crtx.../idp/saml ✓ matches
Result: ALL URLs match → ACS mismatch hypothesis DISPROVEN → ELIMINATE
Next hypothesis: "X.509 certificate issue"
Deep decode: openssl x509 → Certificate expired 21.5 hours before login
Result: NOT disproven, CONFIRMED → ROOT CAUSE
A hypothesis is VALID when:
Alignment:
- All direct claims in hypothesis are supported by Phase 3 evidence
- Evidence patterns match hypothesis explanation
- Timeline matches hypothesis sequence
Contradictions:
- No evidence contradicts hypothesis
- No assumptions conflict with facts
- No timeline mismatches
Assumptions:
- All unverified claims clearly marked as assumptions
- Assumptions are testable (can be verified in Phase 5)
- Assumptions are documented with what would verify them
Completeness:
- Hypothesis accounts for all key evidence
- Hypothesis explains observed behavior
- No major facts left unexplained
Coherence:
- Logical flow from evidence to conclusion
- Root cause explains symptoms
- Solution would resolve root cause
Check: Does hypothesis align with Phase 3 evidence?
Evidence Alignment Checks:
- For each claim in hypothesis:
✓ Find supporting evidence in Phase 3
✓ Check claim type matches evidence type
✓ Check confidence level is reasonable
- Matching patterns:
evidence: "HTTP 429 at 20:51:45Z"
hypothesis_claim: "Rate limit exceeded"
match: YES (429 = rate limit error)
evidence: "Request rate 150 req/min"
hypothesis_claim: "Exceeded limit of 100 req/min"
match: YES (150 > 100)
- Non-matching patterns:
evidence: "Timeout after 30 seconds"
hypothesis_claim: "Webhook integration timeout"
match: WEAK if no webhook in architecture
→ Flag as: "Assumption: Uses webhook integration"
Agent Action:
```python
matches = []
gaps = []
for claim in hypothesis_claims:
    supporting = find_evidence(claim, phase_3_evidence)
    if supporting:
        matches.append({"claim": claim, "evidence": supporting})
    else:
        gaps.append(claim)

alignment_score = len(matches) / len(hypothesis_claims)
```
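With a stand-in implementation for the helper (the substring matcher below is a hypothetical stand-in for `find_evidence`, not the skill's real logic), the scoring behaves like:

```python
def find_evidence(claim, evidence_items):
    """Hypothetical stand-in: return the first evidence item mentioning the claim."""
    return next((e for e in evidence_items if claim.lower() in e.lower()), None)

hypothesis_claims = ["HTTP 429", "Rate limit exceeded", "API limit is 100 req/min"]
phase_3_evidence = [
    "HAR entry 145: HTTP 429 at 20:51:45Z",
    "Response header: rate limit exceeded",
]

matches = [c for c in hypothesis_claims if find_evidence(c, phase_3_evidence)]
gaps = [c for c in hypothesis_claims if not find_evidence(c, phase_3_evidence)]
alignment_score = len(matches) / len(hypothesis_claims)
print(gaps, round(alignment_score, 2))  # ['API limit is 100 req/min'] 0.67
```

Unsupported claims land in `gaps` and drag the alignment score below the 0.85 threshold, which is exactly what blocks the hook.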
Check: Does evidence contradict hypothesis?
Contradiction Types:
1. Temporal Contradiction
hypothesis: "Error occurs at 10:30:45Z"
evidence: "Error log shows 10:30:42Z"
conflict: POSSIBLE (is a 3-second difference significant here?)
2. Behavioral Contradiction
hypothesis: "Integration disabled"
evidence: "Integration ran at 10:30:45Z"
conflict: YES (can't run if disabled)
3. Data Type Contradiction
hypothesis: "String overflow in field"
evidence: "Field is integer type"
conflict: YES (can't overflow string in int field)
4. Architecture Contradiction
hypothesis: "Webhook integration timeout"
evidence: "Customer uses REST polling, no webhooks"
conflict: YES (wrong integration type)
Severity Levels:
CRITICAL: Hypothesis impossible given evidence
HIGH: Hypothesis contradicted by evidence
MEDIUM: Timing mismatch or unclear
LOW: Minor discrepancy, likely explanation exists
Agent Action:
```python
contradictions = []
for assumption in hypothesis_assumptions:
    contradicting = find_contradiction(assumption, phase_3_evidence)
    if contradicting:
        contradictions.append({
            "assumption": assumption,
            "contradiction": contradicting,
            "severity": assess_severity(assumption, contradicting),
        })

has_critical = any(c["severity"] == "CRITICAL" for c in contradictions)
```
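A toy run of the contradiction scan, with hypothetical stand-ins for `find_contradiction` and `assess_severity` (the real skill's implementations are not shown in this document):

```python
def find_contradiction(assumption, evidence_items):
    """Hypothetical stand-in: evidence contradicts an assumption it directly negates."""
    return next((e for e in evidence_items if e.get("negates") == assumption), None)

def assess_severity(assumption, contradicting):
    return contradicting.get("severity", "LOW")

hypothesis_assumptions = ["Integration disabled", "Polling interval is 60s"]
phase_3_evidence = [
    {"fact": "Integration ran at 10:30:45Z",
     "negates": "Integration disabled",
     "severity": "CRITICAL"},  # behavioral contradiction: can't run if disabled
]

contradictions = []
for assumption in hypothesis_assumptions:
    hit = find_contradiction(assumption, phase_3_evidence)
    if hit:
        contradictions.append({"assumption": assumption,
                               "contradiction": hit["fact"],
                               "severity": assess_severity(assumption, hit)})

has_critical = any(c["severity"] == "CRITICAL" for c in contradictions)
print(has_critical)  # True
```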
Check: Which assumptions are unverified?
Assumption Types:
1. Evidence-based Assumption
"API has 100 req/min rate limit"
evidence: "HTTP 429 errors when request rate > 100"
verification: WEAK (inferred, not directly stated)
action: Mark as assumption, verify in Phase 5
2. Configuration Assumption
"Polling interval is 60 seconds"
evidence: "Integration config not in bundle"
verification: UNVERIFIED (not in Phase 3)
action: Retrieve config in Phase 5
3. Behavioral Assumption
"System retry behavior causes cascade"
evidence: "Single failed request, then 5 retries"
verification: WEAK (inferred from observed pattern)
action: Verify in logs/docs in Phase 5
4. Capability Assumption
"Webhook timeout fires after 30s"
evidence: "Not tested in customer environment"
verification: UNVERIFIED (from documentation, not observation)
action: Verify against actual customer setup in Phase 5
Test Criteria:
✓ Can we test this assumption in Phase 5?
✓ What evidence would verify/refute it?
✓ How important is it to the root cause?
Agent Action:
```python
assumptions = extract_assumptions(hypothesis)
key_assumptions = []
for assumption in assumptions:
    verification_level = assess_verification(assumption, phase_3_evidence)
    if verification_level < 0.80:  # below 80% verified
        key_assumptions.append({
            "assumption": assumption,
            "verified": verification_level,
            "test_in_phase_5": generate_test_strategy(assumption),
        })
```
How Validation Changes Confidence
Confidence Adjustment Rules:
Base Confidence (from Phase 2 triage):
phase_2_confidence: 0.85 # e.g., 85%
Evidence Alignment Score:
matches: 10/12 claims = 0.83
delta: 0.83 - 1.0 = -0.17
Contradiction Check:
critical_contradictions: 0
high_contradictions: 0
delta: 0.0 (none found)
Assumption Risk:
unverified_assumptions: 2
importance: medium, low
delta: -0.05
Final Confidence:
0.85 + (-0.17) + 0.0 + (-0.05) = 0.63
Interpretation:
Phase 2 confidence: 85%
Phase 4 validation confidence: 63%
Delta: -22% (significant downward adjustment)
Meaning: Phase 3 evidence less aligned than Phase 2 suggested
Action: Revalidate triage assumptions, adjust hypothesis
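The arithmetic above can be captured in a small helper. The adjustment rules are the document's; the clamp to [0.0, 1.0] is an assumption not stated in the rules.

```python
def adjusted_confidence(base: float, alignment_score: float,
                        contradiction_penalty: float, assumption_penalty: float):
    """Apply the Phase 4 adjustment: alignment shortfall plus penalties."""
    delta = (alignment_score - 1.0) - contradiction_penalty - assumption_penalty
    return max(0.0, min(1.0, base + delta)), delta

# Worked example from this section: 10/12 claims matched, no contradictions,
# two unverified assumptions costing 0.05
confidence, delta = adjusted_confidence(0.85, 10 / 12, 0.0, 0.05)
print(round(confidence, 2), round(delta, 2))  # 0.63 -0.22
```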
```json
{
  "status": "VALID|WEAK|INVALID",
  "confidence_before": 0.85,
  "confidence_after": 0.63,
  "confidence_delta": -0.22,
  "evidence_alignment": {
    "matches": [
      {"claim": "HTTP 429 errors", "evidence": "HAR entry 145", "strength": 0.95},
      {"claim": "Rate limit", "evidence": "429 = rate limit", "strength": 0.90}
    ],
    "gaps": [
      {"claim": "API limit is 100 req/min", "evidence": "NONE", "strength": 0.0}
    ],
    "alignment_score": 0.83
  },
  "contradictions": {
    "critical": [],
    "high": [],
    "medium": [],
    "low": [],
    "total": 0
  },
  "key_assumptions": [
    {
      "assumption": "API rate limit is 100 req/min",
      "verified": 0.30,
      "importance": "CRITICAL",
      "test_strategy": "Search customer config, API docs, JIRA history"
    },
    {
      "assumption": "Webhook integration NOT in use",
      "verified": 0.70,
      "importance": "HIGH",
      "test_strategy": "Verify in JIRA, customer architecture docs"
    }
  ],
  "weak_assumptions_count": 2,
  "recommendation": "Hypothesis WEAK - proceed with Phase 5 focus on verifying key assumptions"
}
```