From sciagent-skills
Evaluates scientific evidence and claims via study design hierarchy (RCTs to expert opinion), effect sizes (OR, RR, NNT, Cohen's d), biases (selection, information, confounding, reporting), p-value vs clinical significance, GRADE, reproducibility. Use when reading papers or assessing claims.
npx claudepluginhub jaechang-hits/sciagent-skills --plugin sciagent-skills

This skill uses the workspace's default tool permissions.
Scientific critical thinking is the disciplined application of logical and methodological standards to evaluate whether a study's design, analysis, and interpretation support its conclusions. It is the skill that separates a researcher who synthesizes evidence from one who accumulates it. This guide covers the hierarchy of evidence, the mechanics of common biases, effect size interpretation, the p-value controversy, GRADE evidence grading, and common logical fallacies in the interpretation of scientific literature.
Study designs vary in their ability to support causal inference. The hierarchy below applies to questions about the effect of an intervention or exposure on an outcome:
Systematic reviews and meta-analyses of RCTs (highest causal certainty)
↓
Randomized Controlled Trials (RCTs)
↓
Non-randomized controlled trials / cluster-randomized trials
↓
Prospective cohort studies (follow exposure → outcome forward in time)
↓
Retrospective cohort studies
↓
Case-control studies (compare prior exposure in those with vs. without the outcome)
↓
Cross-sectional studies (measure exposure and outcome simultaneously)
↓
Case series and case reports
↓
Expert opinion, mechanistic reasoning, animal models (lowest causal certainty)
Important exceptions: For questions about rare outcomes, case-control designs are often more efficient than cohort studies. For questions about diagnostic accuracy, randomized designs are usually inappropriate — cross-sectional or cohort designs with verified reference standards are preferred. For harm questions, RCTs are often infeasible (ethical constraints), making large cohort studies the best available evidence.
Effect measures quantify the relationship between an exposure/intervention and an outcome. Confusing them is a leading source of misinterpretation.
| Measure | Formula | Use case | Key interpretation |
|---|---|---|---|
| Risk Ratio (RR) | Risk in exposed / Risk in unexposed | Cohort studies, RCTs | RR = 2.0: exposed group has twice the risk |
| Odds Ratio (OR) | Odds in exposed / Odds in unexposed | Case-control studies, logistic regression | Approximates RR when the outcome is rare (<10%); drifts farther from 1 than the RR for common outcomes |
| Hazard Ratio (HR) | Instantaneous event rate ratio | Survival analysis (Cox regression) | HR = 0.7: 30% lower hazard of event per time unit in treated group |
| Number Needed to Treat (NNT) | 1 / Absolute Risk Reduction | Clinical decision-making | NNT = 20: treat 20 patients to prevent 1 event |
| Absolute Risk Reduction (ARR) | Risk_control − Risk_treated | Clinical impact | ARR = 2%: intervention reduces absolute event rate by 2 percentage points |
| Cohen's d | (μ₁ − μ₂) / σ_pooled | Continuous outcomes, psychology | d = 0.2 small; 0.5 medium; 0.8 large |
| Pearson r | Correlation coefficient | Association, not causal | r = 0.1 small; 0.3 medium; 0.5 large (Cohen 1988) |
Common error: Reporting only the relative risk reduction (e.g., "50% reduction in risk") without the absolute risk reduction. A treatment that reduces risk from 2% to 1% has a 50% relative reduction but only a 1 percentage-point absolute reduction (NNT = 100). The relative measure appears more impressive, but the absolute measure is what matters clinically.
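As a minimal sketch with illustrative numbers (a hypothetical 1,000-per-arm cohort, not data from any real trial), the binary measures above can be computed from a 2×2 table, and Cohen's d from group summaries:

```python
import math

def risk_ratio(a, b, c, d):
    """RR from a 2x2 table: a/b = events/non-events in exposed, c/d in unexposed."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """OR from the same 2x2 table (cross-product ratio)."""
    return (a * d) / (b * c)

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var)

# Rare outcome (2% vs 1%): the OR closely approximates the RR.
rr = risk_ratio(10, 990, 20, 980)   # treated: 10/1000 events; control: 20/1000
or_ = odds_ratio(10, 990, 20, 980)
print(rr, or_)                      # RR = 0.5, OR ~ 0.495

print(cohens_d(105, 10, 50, 100, 10, 50))  # d = 0.5 (a "medium" effect)
```

Note how the OR (0.495) tracks the RR (0.5) here only because the outcome is rare; with common outcomes the two diverge.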
Bias is systematic deviation of results or inferences from the truth. Unlike random error (reduced by larger samples), bias is directional and not correctable by increasing sample size.
Selection bias: Systematic difference in characteristics between those selected and not selected for study.
Information bias: Systematic error in measuring exposure or outcome.
Confounding: A variable associated with both the exposure and the outcome, creating a spurious or masked association.
Reporting bias: Selective reporting of outcomes or results based on their statistical significance or direction.
A p-value is the probability of observing results at least as extreme as those obtained, under the null hypothesis. It is NOT the probability that the null hypothesis is true, nor the probability that the finding is a false positive.
Correct interpretation: p < 0.05 means that, if the null hypothesis were true, fewer than 5% of identically designed studies would produce results at least this extreme. It does NOT indicate that the effect is large, clinically meaningful, or replicable.
Statistical significance ≠ clinical significance: With large enough samples, even trivially small effects become statistically significant. A blood pressure drug that reduces systolic BP by 0.8 mmHg (95% CI 0.3–1.3, p = 0.001) is statistically significant but clinically irrelevant.
Confidence intervals are more informative than p-values: A 95% CI of [0.5 kg, 35 kg weight loss] and a 95% CI of [0.5 kg, 1.5 kg weight loss] can both have p < 0.05, but the clinical implications are vastly different. Always focus on CI width and range, not just whether it excludes the null.
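To make the blood-pressure example concrete, here is a two-sample z-test sketch with illustrative numbers (0.8 mmHg mean difference, SD 10, 5,000 patients per arm; assumed values, not a real trial):

```python
import math

diff, sd, n = 0.8, 10.0, 5000           # mean difference, common SD, per-group n
se = sd * math.sqrt(2.0 / n)            # standard error of the difference in means
z = diff / se
p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p from the normal approximation
lo, hi = diff - 1.96 * se, diff + 1.96 * se

print(f"z = {z:.2f}, p = {p:.2g}, 95% CI = [{lo:.2f}, {hi:.2f}] mmHg")
# Highly significant (p ~ 6e-5), yet the entire CI sits in a clinically trivial range.
```

The CI, not the p-value, is what reveals that every plausible value of the effect is too small to matter.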
GRADE classifies the certainty of evidence across four levels:
| GRADE level | Meaning | Typical starting point |
|---|---|---|
| High | Further research very unlikely to change confidence | Consistent RCTs, large effect, no bias |
| Moderate | Further research likely to have important impact | RCTs with limitations, or strong consistent observational |
| Low | Further research very likely to have important impact | Observational studies, or RCTs with serious limitations |
| Very low | Any estimate is very uncertain | Case series, expert opinion, very inconsistent results |
GRADE certainty can be downgraded for: risk of bias, inconsistency (heterogeneity), indirectness (different population/outcome), imprecision (wide CI), and publication bias. It can be upgraded for: large magnitude of effect (e.g., RR > 2; RR > 5 warrants two levels), a dose-response relationship, or when all plausible residual confounding would reduce the observed effect.
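The arithmetic behind this can be sketched as a simple score (an illustrative simplification: real GRADE assessments are judgments, not a count):

```python
LEVELS = ["very low", "low", "moderate", "high"]

def grade_certainty(randomized: bool, downgrades: int, upgrades: int) -> str:
    """Start high for RCTs, low for observational; move one level per factor."""
    start = 3 if randomized else 1
    return LEVELS[max(0, min(3, start - downgrades + upgrades))]

print(grade_certainty(True, 0, 0))   # RCT evidence, no concerns -> "high"
print(grade_certainty(True, 2, 0))   # RCTs with serious bias + imprecision -> "low"
print(grade_certainty(False, 0, 1))  # observational with large effect -> "moderate"
```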
How should I evaluate this study?
│
├── Step 1: What question is being answered?
│ ├── Intervention effectiveness → Need RCT or high-quality cohort
│ ├── Diagnostic accuracy → Need cross-sectional vs reference standard
│ ├── Prognosis → Need prospective cohort
│ └── Harm / rare exposure → Case-control or large cohort acceptable
│
├── Step 2: Is the study design appropriate?
│ ├── Design matches question → Proceed
│ └── Mismatch → Major limitation (flag)
│
├── Step 3: What are the key threats to validity?
│ ├── Selection bias → Who was included/excluded? Loss to follow-up?
│ ├── Information bias → Blinding? Validated instruments?
│ └── Confounding → What was adjusted for? Residual confounders?
│
├── Step 4: Are the effect estimates clinically meaningful?
│ ├── Effect size large enough to matter clinically?
│ ├── CI narrow enough to be informative?
│ └── Absolute vs relative risk reported?
│
└── Step 5: How certain is the evidence overall? (GRADE)
├── High certainty → Confident conclusion
├── Moderate certainty → Likely true; note limitations
├── Low certainty → Uncertain; more research needed
└── Very low certainty → Cannot draw conclusions
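Steps 1-2 of the tree amount to a lookup. A sketch (the mapping restates the tree above; the category and design names are my own labels):

```python
APPROPRIATE_DESIGNS = {
    "intervention effectiveness": {"systematic review of RCTs", "RCT",
                                   "high-quality cohort"},
    "diagnostic accuracy": {"cross-sectional vs reference standard",
                            "cohort vs reference standard"},
    "prognosis": {"prospective cohort"},
    "harm / rare exposure": {"case-control", "large cohort"},
}

def design_flag(question: str, design: str) -> str:
    """Step 2: flag a mismatch between question type and study design."""
    ok = design in APPROPRIATE_DESIGNS.get(question, set())
    return "proceed" if ok else "major limitation (flag)"

print(design_flag("intervention effectiveness", "RCT"))  # proceed
print(design_flag("prognosis", "case-control"))          # major limitation (flag)
```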
| Claim type | Appropriate response | Red flags requiring skepticism |
|---|---|---|
| "Drug X reduces mortality by 50%" | Ask: 50% relative or absolute? What was baseline risk? | Only relative risk reported; no CI provided |
| "Observational study shows cause" | Downgrade to "association"; list plausible confounders | Authors use "causes" without adjustment |
| "Significant p-value proves effect" | Check effect size and CI; assess clinical relevance | p = 0.04 with N = 50,000 and tiny effect |
| "Single RCT is definitive" | Check for replication; assess risk of bias | Funded by manufacturer; no blinding |
| "Preprint shows breakthrough" | Await peer review; check for reproducibility | No data/code sharing; sensational press release |
| "N-of-1 case report demonstrates treatment" | Note limited generalizability; no control | Used to support policy without cohort evidence |
Read the Methods before the Results: The Discussion section is written by the authors to support their conclusions. The Methods section is where you independently assess whether the data can support those conclusions. Specifically: what were the pre-specified primary outcomes? Do the reported outcomes match those in the Methods or the registered protocol?
Always seek the pre-registration record: ClinicalTrials.gov, PROSPERO, and OSF registrations contain the original protocol. Comparing the pre-specified primary outcome to what was reported in the abstract is the single most efficient check for outcome reporting bias. A change in primary outcome without explanation is a major red flag.
Distinguish statistical and clinical significance explicitly: For every effect estimate, ask: if this effect is real, would it matter to a patient or a biological system? A genomic variant with OR = 1.05 (p = 1e-12) in a GWAS of 500,000 people is a genuine association but contributes negligibly to disease risk prediction.
Identify the funding source and conflicts of interest: Industry-funded trials are not automatically invalid, but industry funding is associated with more favorable outcomes for the sponsor's product. Assess whether conflicts are disclosed, whether the funder had access to data or participated in analysis, and whether an independent statistician reviewed the data.
Check for multiple testing without correction: When a paper tests 20 outcomes, 1 will be statistically significant at p < 0.05 by chance alone. Look for corrections (Bonferroni, Benjamini-Hochberg FDR) in genomics, proteomics, and other high-throughput studies. Absence of correction in a paper reporting 50 comparisons undermines its significance claims.
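A minimal Benjamini-Hochberg adjustment, for checking a paper's p-values yourself (this is the standard step-up FDR procedure; for real analyses a library implementation such as statsmodels' `multipletests` is preferable):

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up FDR procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from the largest p to the smallest
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

print(bh_adjust([0.01, 0.04, 0.03, 0.005]))  # [0.02, 0.04, 0.04, 0.02]
```

Bonferroni, by contrast, is simply `min(1, p * m)` per test: stricter, and appropriate when any single false positive is unacceptable.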
Require absolute risk data before accepting clinical conclusions: For any binary outcome, request (or calculate) the absolute risk reduction (ARR) and number needed to treat (NNT) in addition to the relative risk. This applies to both journal articles and news coverage of medical research. ARR = Control_rate − Treatment_rate; NNT = 1/ARR.
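The two formulas in one small helper (toy rates matching the 2% to 1% example earlier in this guide):

```python
import math

def arr_nnt(control_rate, treatment_rate):
    """Absolute risk reduction and number needed to treat."""
    arr = control_rate - treatment_rate
    nnt = math.inf if arr == 0 else 1.0 / arr  # undefined (infinite) if rates are equal
    return arr, nnt

arr, nnt = arr_nnt(0.02, 0.01)
print(f"ARR = {arr:.1%}, NNT = {nnt:.0f}")  # ARR = 1.0%, NNT = 100
```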
Apply structured critical appraisal checklists appropriate to study design: CONSORT for RCTs, STROBE for observational studies, TRIPOD for prediction models, STARD for diagnostic accuracy, GRADE for certainty assessment. These checklists are available free from EQUATOR Network and identify every element required for a complete report.
Confusing association with causation in observational studies: Observational studies identify associations, not causes. Coffee drinking is associated with reduced colorectal cancer risk — but coffee drinkers differ from non-drinkers in dozens of ways (diet, activity, socioeconomic status), any of which could explain the association.
Over-interpreting subgroup analyses: Subgroup analyses in RCTs are almost always exploratory. A trial powered to detect an overall treatment effect cannot reliably detect subgroup-specific effects. Subgroup results are hypothesis-generating, not confirmatory.
Accepting surrogate endpoints as equivalent to clinical endpoints: A drug that improves an imaging biomarker (e.g., amyloid PET) does not necessarily improve clinical outcomes (e.g., cognitive function). The history of medicine is filled with interventions that improved surrogate markers and worsened or did not affect hard endpoints.
Ignoring the healthy survivor / healthy adherer bias: Patients who adhere to a treatment regimen tend to be healthier, more health-conscious, and have better outcomes even in the absence of a real treatment effect. This creates a spurious association between adherence and outcomes in observational data.
Anchoring on statistical significance and ignoring effect precision: A single small study with p = 0.03 and a wide 95% CI (OR: 0.5 to 5.0) provides essentially no information: the true effect could be anything from large harm to large benefit. Such imprecision makes the result uninterpretable.
Treating p = 0.049 and p = 0.051 as fundamentally different: The binary significant/non-significant threshold creates a cliff where results just below 0.05 are published and celebrated, while results just above are buried. This contributes to the replication crisis.
Dismissing animal and in vitro studies entirely, or accepting them uncritically: Mechanistic studies in cells and animals are necessary for discovery but have high rates of failure in human translation. Approximately 85% of findings that replicate in animals fail in human clinical trials.
1. Initial orientation (5 minutes)
2. Methods evaluation
3. Results evaluation
4. Bias and quality assessment
5. Conclusion and certainty rating
literature-review — applying critical appraisal systematically across a body of evidence
biostatistics — quantitative tools for calculating and interpreting effect measures
peer-review-methodology — structured application of critical thinking to manuscript review