From sciagent-skills
Evaluates scientific evidence and claims via study design hierarchy (RCTs to expert opinion), effect sizes (OR, RR, NNT, Cohen's d), biases (selection, information, confounding, reporting), p-value vs clinical significance, GRADE, reproducibility. Use when reading papers or assessing claims.
npx claudepluginhub jaechang-hits/sciagent-skills --plugin sciagent-skills

This skill uses the workspace's default tool permissions.
Scientific critical thinking is the disciplined application of logical and methodological standards to evaluate whether a study's design, analysis, and interpretation support its conclusions. It is the skill that separates a researcher who synthesizes evidence from one who accumulates it. This guide covers the hierarchy of evidence, the mechanics of common biases, effect size interpretation, the p-value controversy, GRADE evidence grading, and common logical fallacies in the interpretation of scientific literature.
Study designs vary in their ability to support causal inference. The hierarchy below applies to questions about the effect of an intervention or exposure on an outcome:
Systematic reviews and meta-analyses of RCTs (highest causal certainty)
↓
Randomized Controlled Trials (RCTs)
↓
Non-randomized controlled trials / cluster-randomized trials
↓
Prospective cohort studies (follow exposure → outcome forward in time)
↓
Retrospective cohort studies
↓
Case-control studies (compare prior exposure in those with vs. without the outcome)
↓
Cross-sectional studies (measure exposure and outcome simultaneously)
↓
Case series and case reports
↓
Expert opinion, mechanistic reasoning, animal models (lowest causal certainty)
Important exceptions: For questions about rare outcomes, case-control designs are often more efficient than cohort studies. For questions about diagnostic accuracy, randomized designs are usually inappropriate — cross-sectional or cohort designs with verified reference standards are preferred. For harm questions, RCTs are often infeasible (ethical constraints), making large cohort studies the best available evidence.
Effect measures quantify the relationship between an exposure/intervention and an outcome. Confusing them is a leading source of misinterpretation.
| Measure | Formula | Use case | Key interpretation |
|---|---|---|---|
| Risk Ratio (RR) | Risk in exposed / Risk in unexposed | Cohort studies, RCTs | RR = 2.0: exposed group has twice the risk |
| Odds Ratio (OR) | Odds in exposed / Odds in unexposed | Case-control studies, logistic regression | Approximates RR when the outcome is rare (<10%); drifts farther from 1 than the RR for common outcomes |
| Hazard Ratio (HR) | Instantaneous event rate ratio | Survival analysis (Cox regression) | HR = 0.7: 30% lower hazard of event per time unit in treated group |
| Number Needed to Treat (NNT) | 1 / Absolute Risk Reduction | Clinical decision-making | NNT = 20: treat 20 patients to prevent 1 event |
| Absolute Risk Reduction (ARR) | Risk_control − Risk_treated | Clinical impact | ARR = 2%: intervention reduces absolute event rate by 2 percentage points |
| Cohen's d | (μ₁ − μ₂) / σ_pooled | Continuous outcomes, psychology | d = 0.2 small; 0.5 medium; 0.8 large |
| Pearson r | Correlation coefficient | Association, not causal | r = 0.1 small; 0.3 medium; 0.5 large (Cohen 1988) |
Common error: Reporting only the relative risk reduction (e.g., "50% reduction in risk") without the absolute risk reduction. A treatment that reduces risk from 2% to 1% has a 50% relative reduction but only a 1 percentage-point absolute reduction (NNT = 100). The relative measure appears more impressive, but the absolute measure is what matters clinically.
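As a minimal sketch with illustrative numbers (a hypothetical 1,000-per-arm cohort, not data from any real trial), the binary measures above can be computed from a 2×2 table, and Cohen's d from group summaries:

```python
import math

def risk_ratio(a, b, c, d):
    """RR from a 2x2 table: a/b = events/non-events in exposed, c/d in unexposed."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """OR from the same 2x2 table (cross-product ratio)."""
    return (a * d) / (b * c)

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var)

# Rare outcome (2% vs 1%): the OR closely approximates the RR.
rr = risk_ratio(10, 990, 20, 980)   # treated: 10/1000 events; control: 20/1000
or_ = odds_ratio(10, 990, 20, 980)
print(rr, or_)                      # RR = 0.5, OR ~ 0.495

print(cohens_d(105, 10, 50, 100, 10, 50))  # d = 0.5 (a "medium" effect)
```

Note how the OR (0.495) tracks the RR (0.5) here only because the outcome is rare; with common outcomes the two diverge.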
Bias is systematic deviation of results or inferences from the truth. Unlike random error (reduced by larger samples), bias is directional and not correctable by increasing sample size.
Selection bias: Systematic difference in characteristics between those selected and not selected for study.
Information bias: Systematic error in measuring exposure or outcome.
Confounding: A variable associated with both the exposure and the outcome, creating a spurious or masked association.
Reporting bias: Selective reporting of outcomes or results based on their statistical significance or direction.
A p-value is the probability of observing results at least as extreme as those obtained, under the null hypothesis. It is NOT the probability that the null hypothesis is true, nor the probability that the finding is a false positive.
Correct interpretation: p < 0.05 means that, if the null hypothesis were true, fewer than 5% of identically designed studies would produce results at least this extreme. It does NOT indicate that the effect is large, clinically meaningful, or replicable.
Statistical significance ≠ clinical significance: With large enough samples, even trivially small effects become statistically significant. A blood pressure drug that reduces systolic BP by 0.8 mmHg (95% CI 0.3–1.3, p = 0.001) is statistically significant but clinically irrelevant.
Confidence intervals are more informative than p-values: A 95% CI of [0.5 kg, 35 kg weight loss] and a 95% CI of [0.5 kg, 1.5 kg weight loss] can both have p < 0.05, but the clinical implications are vastly different. Always focus on CI width and range, not just whether it excludes the null.
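To make the blood-pressure example concrete, here is a two-sample z-test sketch with illustrative numbers (0.8 mmHg mean difference, SD 10, 5,000 patients per arm; assumed values, not a real trial):

```python
import math

diff, sd, n = 0.8, 10.0, 5000           # mean difference, common SD, per-group n
se = sd * math.sqrt(2.0 / n)            # standard error of the difference in means
z = diff / se
p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p from the normal approximation
lo, hi = diff - 1.96 * se, diff + 1.96 * se

print(f"z = {z:.2f}, p = {p:.2g}, 95% CI = [{lo:.2f}, {hi:.2f}] mmHg")
# Highly significant (p ~ 6e-5), yet the entire CI sits in a clinically trivial range.
```

The CI, not the p-value, is what reveals that every plausible value of the effect is too small to matter.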
GRADE classifies the certainty of evidence across four levels:
| GRADE level | Meaning | Typical starting point |
|---|---|---|
| High | Further research very unlikely to change confidence | Consistent RCTs, large effect, no bias |
| Moderate | Further research likely to have important impact | RCTs with limitations, or strong consistent observational |
| Low | Further research very likely to have important impact | Observational studies, or RCTs with serious limitations |
| Very low | Any estimate is very uncertain | Case series, expert opinion, very inconsistent results |
GRADE certainty can be downgraded for: risk of bias, inconsistency (heterogeneity), indirectness (different population/outcome), imprecision (wide CI), and publication bias. It can be upgraded for: large magnitude of effect (e.g., RR > 2; RR > 5 warrants two levels), a dose-response relationship, or when all plausible residual confounding would reduce the observed effect.
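The arithmetic behind this can be sketched as a simple score (an illustrative simplification: real GRADE assessments are judgments, not a count):

```python
LEVELS = ["very low", "low", "moderate", "high"]

def grade_certainty(randomized: bool, downgrades: int, upgrades: int) -> str:
    """Start high for RCTs, low for observational; move one level per factor."""
    start = 3 if randomized else 1
    return LEVELS[max(0, min(3, start - downgrades + upgrades))]

print(grade_certainty(True, 0, 0))   # RCT evidence, no concerns -> "high"
print(grade_certainty(True, 2, 0))   # RCTs with serious bias + imprecision -> "low"
print(grade_certainty(False, 0, 1))  # observational with large effect -> "moderate"
```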
How should I evaluate this study?
│
├── Step 1: What question is being answered?
│ ├── Intervention effectiveness → Need RCT or high-quality cohort
│ ├── Diagnostic accuracy → Need cross-sectional vs reference standard
│ ├── Prognosis → Need prospective cohort
│ └── Harm / rare exposure → Case-control or large cohort acceptable
│
├── Step 2: Is the study design appropriate?
│ ├── Design matches question → Proceed
│ └── Mismatch → Major limitation (flag)
│
├── Step 3: What are the key threats to validity?
│ ├── Selection bias → Who was included/excluded? Loss to follow-up?
│ ├── Information bias → Blinding? Validated instruments?
│ └── Confounding → What was adjusted for? Residual confounders?
│
├── Step 4: Are the effect estimates clinically meaningful?
│ ├── Effect size large enough to matter clinically?
│ ├── CI narrow enough to be informative?
│ └── Absolute vs relative risk reported?
│
└── Step 5: How certain is the evidence overall? (GRADE)
├── High certainty → Confident conclusion
├── Moderate certainty → Likely true; note limitations
├── Low certainty → Uncertain; more research needed
└── Very low certainty → Cannot draw conclusions
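Steps 1-2 of the tree amount to a lookup. A sketch (the mapping restates the tree above; the category and design names are my own labels):

```python
APPROPRIATE_DESIGNS = {
    "intervention effectiveness": {"systematic review of RCTs", "RCT",
                                   "high-quality cohort"},
    "diagnostic accuracy": {"cross-sectional vs reference standard",
                            "cohort vs reference standard"},
    "prognosis": {"prospective cohort"},
    "harm / rare exposure": {"case-control", "large cohort"},
}

def design_flag(question: str, design: str) -> str:
    """Step 2: flag a mismatch between question type and study design."""
    ok = design in APPROPRIATE_DESIGNS.get(question, set())
    return "proceed" if ok else "major limitation (flag)"

print(design_flag("intervention effectiveness", "RCT"))  # proceed
print(design_flag("prognosis", "case-control"))          # major limitation (flag)
```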
| Claim type | Appropriate response | Red flags requiring skepticism |
|---|---|---|
| "Drug X reduces mortality by 50%" | Ask: 50% relative or absolute? What was baseline risk? | Only relative risk reported; no CI provided |
| "Observational study shows cause" | Downgrade to "association"; list plausible confounders | Authors use "causes" without adjustment |
| "Significant p-value proves effect" | Check effect size and CI; assess clinical relevance | p = 0.04 with N = 50,000 and tiny effect |
| "Single RCT is definitive" | Check for replication; assess risk of bias | Funded by manufacturer; no blinding |
| "Preprint shows breakthrough" | Await peer review; check for reproducibility | No data/code sharing; sensational press release |
| "N-of-1 case report demonstrates treatment" | Note limited generalizability; no control | Used to support policy without cohort evidence |
Read the Methods before the Results: The Discussion section is written by the authors to support their conclusions. The Methods section is where you independently assess whether the data can support those conclusions. Specifically: what were the pre-specified primary outcomes? Do the reported outcomes match those in the Methods or the registered protocol?
Always seek the pre-registration record: ClinicalTrials.gov, PROSPERO, and OSF registrations contain the original protocol. Comparing the pre-specified primary outcome to what was reported in the abstract is the single most efficient check for outcome reporting bias. A change in primary outcome without explanation is a major red flag.
Distinguish statistical and clinical significance explicitly: For every effect estimate, ask: if this effect is real, would it matter to a patient or a biological system? A genomic variant with OR = 1.05 (p = 1e-12) in a GWAS of 500,000 people is a genuine association but contributes negligibly to disease risk prediction.
Identify the funding source and conflicts of interest: Industry-funded trials are not automatically invalid, but industry funding is associated with more favorable outcomes for the sponsor's product. Assess whether conflicts are disclosed, whether the funder had access to data or participated in analysis, and whether an independent statistician reviewed the data.
Check for multiple testing without correction: When a paper tests 20 outcomes, 1 will be statistically significant at p < 0.05 by chance alone. Look for corrections (Bonferroni, Benjamini-Hochberg FDR) in genomics, proteomics, and other high-throughput studies. Absence of correction in a paper reporting 50 comparisons undermines its significance claims.
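A minimal Benjamini-Hochberg adjustment, for checking a paper's p-values yourself (this is the standard step-up FDR procedure; for real analyses a library implementation such as statsmodels' `multipletests` is preferable):

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up FDR procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from the largest p to the smallest
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

print(bh_adjust([0.01, 0.04, 0.03, 0.005]))  # [0.02, 0.04, 0.04, 0.02]
```

Bonferroni, by contrast, is simply `min(1, p * m)` per test: stricter, and appropriate when any single false positive is unacceptable.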
Require absolute risk data before accepting clinical conclusions: For any binary outcome, request (or calculate) the absolute risk reduction (ARR) and number needed to treat (NNT) in addition to the relative risk. This applies to both journal articles and news coverage of medical research. ARR = Control_rate − Treatment_rate; NNT = 1/ARR.
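The two formulas in one small helper (toy rates matching the 2% to 1% example earlier in this guide):

```python
import math

def arr_nnt(control_rate, treatment_rate):
    """Absolute risk reduction and number needed to treat."""
    arr = control_rate - treatment_rate
    nnt = math.inf if arr == 0 else 1.0 / arr  # undefined (infinite) if rates are equal
    return arr, nnt

arr, nnt = arr_nnt(0.02, 0.01)
print(f"ARR = {arr:.1%}, NNT = {nnt:.0f}")  # ARR = 1.0%, NNT = 100
```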
Apply structured critical appraisal checklists appropriate to study design: CONSORT for RCTs, STROBE for observational studies, TRIPOD for prediction models, STARD for diagnostic accuracy, GRADE for certainty assessment. These checklists are available free from EQUATOR Network and identify every element required for a complete report.
Confusing association with causation in observational studies: Observational studies identify associations, not causes. Coffee drinking is associated with reduced colorectal cancer risk — but coffee drinkers differ from non-drinkers in dozens of ways (diet, activity, socioeconomic status), any of which could explain the association.
Over-interpreting subgroup analyses: Subgroup analyses in RCTs are almost always exploratory. A trial powered to detect an overall treatment effect cannot reliably detect subgroup-specific effects. Subgroup results are hypothesis-generating, not confirmatory.
Accepting surrogate endpoints as equivalent to clinical endpoints: A drug that improves an imaging biomarker (e.g., amyloid PET) does not necessarily improve clinical outcomes (e.g., cognitive function). The history of medicine is filled with interventions that improved surrogate markers and worsened or did not affect hard endpoints.
Ignoring the healthy survivor / healthy adherer bias: Patients who adhere to a treatment regimen tend to be healthier, more health-conscious, and have better outcomes even in the absence of a real treatment effect. This creates a spurious association between adherence and outcomes in observational data.
Anchoring on statistical significance and ignoring effect precision: A single small study with p = 0.03 and a wide 95% CI (OR: 0.5 to 5.0) provides essentially no information: the true effect could be anything from large harm to large benefit. Such imprecision makes the result uninterpretable.
Treating p = 0.049 and p = 0.051 as fundamentally different: The binary significant/non-significant threshold creates a cliff where results just below 0.05 are published and celebrated, while results just above are buried. This contributes to the replication crisis.
Dismissing animal and in vitro studies entirely, or accepting them uncritically: Mechanistic studies in cells and animals are necessary for discovery but have high rates of failure in human translation. Approximately 85% of findings that replicate in animals fail in human clinical trials.
1. Initial orientation (5 minutes)
2. Methods evaluation
3. Results evaluation
4. Bias and quality assessment
5. Conclusion and certainty rating
literature-review — applying critical appraisal systematically across a body of evidence
biostatistics — quantitative tools for calculating and interpreting effect measures
peer-review-methodology — structured application of critical thinking to manuscript review