Constructs well-structured technical arguments using the hypothesis-argument-example triad. For PR descriptions, ADRs, code reviews, and proposals.
`npx claudepluginhub pjt222/agent-almanac`
Build rigorous arguments from hypothesis through reasoning to concrete evidence. Every persuasive technical claim follows the same triad: a clear hypothesis states *what* you believe, an argument explains *why* it holds, and examples prove *that* it holds. This skill teaches you to apply that structure to code reviews, design decisions, research writing, and any context where claims need justification.
State your claim as a clear, falsifiable hypothesis. A hypothesis is not an opinion or a preference -- it is a specific assertion that can be tested against evidence.
Falsifiable vs. unfalsifiable:
| Unfalsifiable (opinion) | Falsifiable (hypothesis) |
|---|---|
| "This code is bad" | "This function has O(n^2) complexity where O(n) is achievable" |
| "We should use TypeScript" | "TypeScript's type system will catch the class of null-reference bugs that caused 4 of our last 6 production incidents" |
| "The API design is cleaner" | "Replacing the 5 endpoint variants with a single parameterized endpoint reduces the public API surface by 60%" |
| "This research approach is better" | "Method A achieves higher precision than Method B on dataset X at the 95% confidence level" |
Expected: A one-sentence hypothesis that is specific, scoped, and falsifiable. Someone reading it can immediately imagine what evidence would confirm or refute it.
On failure: If the hypothesis feels vague, apply the "how would I disprove this?" test. If you cannot imagine counter-evidence, the claim is an opinion, not a hypothesis. Narrow the scope or add measurable criteria until it becomes testable.
Select the logical structure that best supports your hypothesis. Different claims call for different reasoning strategies.
| Type | Structure | Best for |
|---|---|---|
| Deductive | If A then B; A is true; therefore B | Formal proofs, type safety claims |
| Inductive | Observed pattern across N cases; therefore likely in general | Performance data, test results |
| Analogical | X is similar to Y in relevant ways; Y has property P; therefore X likely has P | Design decisions, technology choices |
| Evidential | Evidence E is more likely under hypothesis H1 than H2; therefore H1 is supported | Research findings, A/B test results |
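To make the deductive row concrete, here is a minimal TypeScript sketch (the `User` type and function are hypothetical) of the type-safety claim from the earlier table: if the compiler enforces a null check (A), then a null dereference at that site is impossible (B).

```typescript
// Hypothetical sketch: a deductive type-safety argument under strictNullChecks.
interface User {
  email: string;
}

function getEmail(user: User | null): string {
  // Premise A: the compiler rejects any path that reads `user.email`
  // before this null check.
  if (user === null) {
    return "unknown";
  }
  // Conclusion B: `user` is narrowed to User here, so a null
  // dereference of `email` cannot occur.
  return user.email;
}
```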
Match your hypothesis to the strongest argument type, and consider combining types for stronger arguments (e.g., analogical reasoning backed by inductive evidence).
Expected: A chosen argument type (or combination) with a clear rationale for why it fits the hypothesis.
On failure: If no single type fits cleanly, the hypothesis may need splitting into sub-claims. Break it into parts that each have a natural argument structure.
Build the logical chain that connects your hypothesis to its justification.
Worked example -- Code Review (deductive + inductive):
Hypothesis: "Extracting the validation logic into a shared module will reduce bug duplication across the three API handlers."
Premises:
- The three handlers (`createUser`, `updateUser`, `deleteUser`) each implement the same input validation with slight variations (observed in `src/handlers/`)
- In the last 6 months, 3 of 5 validation bugs were fixed in one handler but not propagated to the others (see issues #42, #57, #61)
- Shared modules enforce a single source of truth for logic (deductive: if one implementation, then one place to fix)
Logical chain: Because the three handlers duplicate the same validation (premise 1), bugs fixed in one are missed in others (premise 2, inductive from 3/5 cases). A shared module means fixes apply once to all callers (deductive from shared-module semantics). Therefore, extraction will reduce bug duplication.
Counterargument (steelmanned): "Shared modules introduce coupling -- a change to validation for one handler could break the others."
Rebuttal: The handlers already share identical validation intent; the coupling is implicit and harder to maintain. Making it explicit via a shared module with parameterized options (e.g., `validate(input, { requireEmail: true })`) makes the coupling visible and testable. The current implicit duplication is riskier because it hides the dependency.
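A minimal sketch of what the extracted module could look like (the input shape, option names, and regex are illustrative assumptions, not taken from the actual handlers):

```typescript
// Hypothetical shared validation module: one implementation, one place to fix.
interface UserInput {
  name?: string;
  email?: string;
}

interface ValidationOptions {
  requireEmail?: boolean; // per-handler differences become explicit parameters
}

function validate(input: UserInput, options: ValidationOptions = {}): string[] {
  const errors: string[] = [];
  if (!input.name || input.name.trim() === "") {
    errors.push("name is required");
  }
  if (options.requireEmail && !/^[^@\s]+@[^@\s]+$/.test(input.email ?? "")) {
    errors.push("a valid email is required");
  }
  return errors;
}

// Each handler calls the same function with its own options, so a fix
// applied here propagates to createUser, updateUser, and deleteUser alike.
const errors = validate({ name: "Ada" }, { requireEmail: true });
```

Because the per-handler variations are now parameters, the implicit coupling described in the rebuttal becomes visible at every call site.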
Worked example -- Research (evidential):
Hypothesis: "Pre-training on domain-specific corpora improves downstream task performance more than increasing general corpus size for biomedical NER."
Premises:
- BioBERT pre-trained on PubMed (4.5B words) outperforms BERT-Large pre-trained on general English (16B words) on 6/6 biomedical NER benchmarks (Lee et al., 2020)
- SciBERT pre-trained on Semantic Scholar (3.1B words) outperforms BERT-Base on SciERC and JNLPBA despite a smaller pre-training corpus
- General-domain scaling (BERT-Base to BERT-Large, 3x parameters) yields smaller gains on biomedical NER than domain adaptation (BERT-Base to BioBERT, same parameters)
Logical chain: The evidence consistently shows that domain corpus selection outweighs corpus scale for biomedical NER (evidential: these results are more likely if domain specificity matters more than scale). Three independent comparisons point the same direction, strengthening the inductive case.
Counterargument (steelmanned): "These results may not generalize beyond biomedical NER -- biomedicine has unusually specialized vocabulary that inflates the domain-adaptation advantage."
Rebuttal: Valid limitation. The hypothesis is scoped to biomedical NER specifically. However, similar domain-adaptation gains appear in legal NLP (Legal-BERT) and financial NLP (FinBERT), suggesting the pattern may generalize to other specialized domains, though that is a separate claim requiring its own evidence.
Expected: A complete argument chain with premises, logical connection, a steelmanned counterargument, and a rebuttal. The reader can follow the reasoning step by step.
On failure: If the argument feels weak, check the premises. Weak arguments usually stem from unsupported premises, not faulty logic. Find evidence for each premise or acknowledge it as an assumption. If the counterargument is stronger than the rebuttal, the hypothesis may need revision.
Support the argument with independently verifiable evidence. Examples are not illustrations -- they are the empirical foundation that makes the argument testable.
Example selection criteria:
| Criterion | Good example | Bad example |
|---|---|---|
| Independently verifiable | "Issue #42 shows the bug was fixed in handler A but not B" | "We've seen this kind of bug before" |
| Specific | "`createUser` at line 47 re-implements the same regex as `updateUser` at line 23" | "There's duplication in the codebase" |
| Representative | "3 of 5 validation bugs in the last 6 months followed this pattern" | "I once saw a bug like this" |
| Includes edge cases | "This pattern holds for string inputs but not for file upload validation, which has handler-specific constraints" | (no limitations mentioned) |
Expected: Concrete examples that a reader can verify independently. At least one positive and one edge case. Each references a specific artifact (file, line, issue, paper, dataset).
On failure: If examples are hard to find, the hypothesis may be too broad or not grounded in observable reality. Narrow the scope to what you can actually point to. Absence of examples is a signal, not a gap to paper over with vague references.
Combine hypothesis, argument, and examples into the appropriate format for the context.
For code reviews -- structure the comment as:
```
[S] <one-line summary of the suggestion>

**Hypothesis**: <what you believe should change and why>
**Argument**: <the logical case, with premises>
**Evidence**: <specific files, lines, issues, or metrics>
**Suggestion**: <concrete code change or approach>
```
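Filled in with the shared-validation argument from the worked example above, such a comment might read:

```
[S] Extract the duplicated input validation into a shared module

**Hypothesis**: Extracting the validation logic into a shared module will
reduce bug duplication across the three API handlers.
**Argument**: The handlers duplicate the same validation, so bugs fixed in
one are missed in the others; a shared module means fixes apply once.
**Evidence**: Issues #42, #57, #61; duplicated validation in src/handlers/.
**Suggestion**: Add validate(input, options) to a shared module and call it
from createUser, updateUser, and deleteUser.
```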
For PR descriptions -- structure the body as:
```
## Why
<Hypothesis: what problem this solves and the specific improvement claim>

## Approach
<Argument: why this approach was chosen over alternatives>

## Evidence
<Examples: benchmarks, bug references, before/after comparisons>
```
For ADRs (Architecture Decision Records) -- use the standard ADR format with the triad mapped to Context (hypothesis), Decision (argument), and Consequences (examples/evidence of expected outcomes)
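A skeleton of that mapping (section names follow the standard ADR format; the placeholders are illustrative):

```
# ADR-NNN: <decision title>

## Context
<Hypothesis: the specific, falsifiable claim motivating the decision>

## Decision
<Argument: the reasoning chain, alternatives considered, steelmanned counterarguments>

## Consequences
<Examples/evidence: expected outcomes, metrics to watch, known limitations>
```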
For research writing -- map to the standard structure: Introduction states the hypothesis, Methods/Results provide argument and examples, Discussion addresses counterarguments
Review the assembled argument as a whole: can the reader evaluate the hypothesis, follow the reasoning, check the evidence, and consider the counterarguments?
Expected: A complete, formatted argument appropriate for its context. The reader can evaluate the hypothesis, follow the reasoning, check the evidence, and consider counterarguments -- all in one coherent structure.
On failure: If the assembled argument feels disjointed, the hypothesis may be too broad. Split it into focused sub-arguments, each with its own hypothesis-argument-example triad. Two tight arguments are stronger than one sprawling one.
Related skills:

- review-pull-request -- applying argumentation to structured code review feedback
- review-research -- constructing evidence-based arguments in research contexts
- review-software-architecture -- justifying architectural decisions with the hypothesis-argument-example triad
- create-skill -- skills themselves are structured arguments for how to accomplish a task
- write-claude-md -- documenting conventions and decisions that benefit from clear justification

For high-stakes decisions, compose this skill with the advocatus-diaboli agent to form a pre-decision review loop: argumentation builds the case, then advocatus-diaboli attacks it before the decision ships.
When to compose vs. use alone: compose for high-stakes or contested decisions, where a flawed argument is costly; use this skill alone for routine reviews and descriptions. The examples below show composition catching issues that argumentation missed on its own.
Example -- PR response refinement: Argumentation structured a response (hypothesis: combining PRs is better, argument with evidence, collaboration offer). Advocatus-diaboli then caught two critical issues: a claim about proxy process identification was speculative rather than factual (would have been embarrassing on a security PR), and "I have tested this in practice" was unverifiable. Both were removed. The final response was 40-50% shorter -- overexplaining signals insecurity.
Example -- System design triage: Argumentation (via Plan agent) designed a full 500-line triage pipeline. Advocatus-diaboli killed it: at 9 items, the system was premature and would itself become a maintenance burden (recursive trap). Final solution: 25 lines added to an existing script.