Skill

analyze-stats

Generates reproducible Python/R statistical analyses for medical research papers, including diagnostic accuracy, survival analysis, regression, propensity score, and repeated measures with publication-ready tables and figures.

Python

Rust

data-engineering

Popularity

Stars

218

Forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/medsci-project:analyze-stats

User invocable

Model invocable

Inline context

Default effort

Configuration

Modelinherit

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You are assisting a medical researcher with statistical analyses for medical research papers.

Supporting Files

SKILL.md

724 lines · ~14.9k tokens(exceeds 5k compaction limit)

Stats

LanguagePython

Stars218

Forks54

MaintenanceExcellent

Last CommitJul 25, 2026

Actions

View Source View Plugin View on GitHub View README

Statistical Analysis Skill

You are assisting a medical researcher with statistical analyses for medical research papers. Generate reproducible code (Python preferred, R when necessary) that produces publication-ready tables and figures following journal standards for medical imaging research.

Data Privacy Check

Before reading any data file, check whether it might contain Protected Health Information (PHI):

If *_deidentified.* files exist in the working directory, use those preferentially.
If only raw CSV/Excel files exist (no *_deidentified.* counterpart), warn the user (ask in the user's preferred language):

"Does this data contain patient identifiers (names, national ID / RRN, contact details, etc.)? If so, please de-identify it first with the /deidentify skill."
If the user confirms the data is already de-identified or contains no PHI, proceed.
NEVER display raw PHI values (names, phone numbers, RRN) in your output. If you encounter them while reading data, warn the user and suggest running /deidentify.

Reference Files

Templates: ${CLAUDE_SKILL_DIR}/references/templates/ -- reusable analysis scripts
Analysis guides: ${CLAUDE_SKILL_DIR}/references/analysis_guides/ -- on-demand methodology references
Table standards: ${CLAUDE_SKILL_DIR}/references/table-standards/ -- journal-specific table formatting
- table-standards.md -- universal rules, AMA rules, footnote system, mistakes checklist
- journal-profiles/ -- YAML profiles per journal (radiology, jama, nejm, lancet, eur_rad, ajr)
- table-types/ -- templates per table type (Table 1, diagnostic accuracy, regression, survival/Cox, agreement/reliability, meta-analysis, model comparison, incremental value, reader study (MRMC))
- tool-comparison.md -- R/Python tool comparison and recommended pipelines
Figure style: ${CLAUDE_SKILL_DIR}/references/style/figure_style.mplstyle
Project data: See CLAUDE.md for data locations under 2_Data/

Read relevant templates before generating analysis code. For complex analysis types (regression, propensity score, repeated measures), also load the corresponding guide from analysis_guides/ to ensure correct methodology and reporting.

Workflow

Phase 1: Data Assessment

Read the data file (CSV, Excel, TSV, or other tabular format).
Report to the user:
- Shape (rows x columns)
- Column names and inferred types (continuous, categorical, ordinal, binary, datetime)
- Missing values per column (count and percentage)
- First 5 rows preview
- Unique value counts for categorical columns
Identify the analysis unit: patient, exam, lesion, image, rater, study, etc.

Phase 2: Analysis Plan

Precondition (observational studies). Before proposing an analysis plan for an observational design (cohort, case-control, cross-sectional, registry, or survey), confirm that a literature-grounded variable operationalization exists — a variable_operationalization.md from /define-variables, or an equivalent codebook-backed definition table. If none exists, warn the user and recommend running /define-variables first, so exposure / outcome / covariate definitions and cutoffs are citation-backed rather than invented ad hoc from the data dictionary (ad-hoc phenotype/cutoff definitions are a common reviewer-rejection trigger for observational work — see the dictionary-first discipline). This is a WARN, not a hard block: proceed on explicit user confirmation, recording that the operationalization artifact was not available. For stricter projects, treat the missing artifact as a hard stop until /define-variables has run. (This mirrors the same precondition already enforced in /write-protocol before drafting Methods.)

Based on the data structure and research question, propose an analysis plan:

Auto-detect analysis type from the table below, or accept user specification.
List specific tests to be performed.
Identify primary and secondary endpoints.
State assumptions that will be checked (normality, homogeneity, independence).
Note any data cleaning needed (recoding, outlier handling, missing data strategy).
Anchor the estimand to the research question. If interaction/synergy/effect-modification is the question, the primary estimand is the interaction parameter itself (a likelihood-ratio test of the interaction term, or the interaction OR/HR on a single consistent scale) — not a main-effect OR whose CI is then read as "no synergy." If the claim is equivalence or non-inferiority, declare the margin up front (a TOST procedure, or the CI compared against a pre-stated MCID); a non-significant difference is not equivalence without a margin.
Screen every categorical/binary predictor for separation — before fitting anything. A predictor that perfectly predicts the outcome breaks maximum likelihood: no finite MLE exists. The failure is silent — glm does not error, it returns an odds ratio near 0 (or enormous), p ≈ 0.99, and an AUC that then gets written into a table. This is routine in diagnostic imaging, because the good signs are the pathognomonic ones (T2-FLAIR mismatch, the string sign, a halo sign): 100% specificity means an empty cell by construction.
```
python3 "${CLAUDE_SKILL_DIR}/scripts/check_separation.py" \
  --data cohort.csv --outcome idh_mutant --auto --strict
```
COMPLETE_SEPARATION (an empty cell) and QUASI_SEPARATION (a cell below the sparsity floor) both halt the plan. The remedy is a design decision, not a numerical one: Firth's penalised likelihood keeps one model, while a two-stage rule — classify the sign-positive cases directly, model only the sign-negative remainder — is usually the clinically meaningful choice for a pathognomonic sign, because a sign-positive patient is already diagnosed and the real question is what to do with everyone else. Decide this in the plan; do not discover it in the output.

Present the plan and wait for user approval before executing.

Type	When to use	Python packages	R packages	Primary output
Table 1 (Demographics)	Baseline characteristics	pandas, scipy	tableone	Demographics table
Diagnostic Accuracy	Sensitivity/specificity/AUC	sklearn, scipy	pROC	ROC curve, performance table
Inter-rater Agreement	Multiple raters rating same items	krippendorff, pingouin	irr, psych	ICC/Kappa table
Meta-analysis	Pooling effect sizes across studies	--	meta, metafor	Forest + funnel plots
DTA Meta-analysis	Pooling diagnostic accuracy across studies	--	meta, metafor, mada	SROC + paired forest plots
Survey/Likert	Ordinal rating scales	pingouin, scipy	psych	Descriptive + reliability
Survival	Time-to-event outcomes	lifelines	survival	KM curves, Cox table
Group Comparison	Comparing 2+ groups	scipy, pingouin	--	Test results + effect sizes
Correlation	Association between variables	scipy, pingouin	--	Scatter + correlation matrix
Logistic Regression	Binary outcome + predictors	statsmodels, sklearn	--	OR table, C-statistic, forest plot
Linear Regression	Continuous outcome + predictors	statsmodels	--	Coefficient table, R², diagnostic plots
Propensity Score	Observational treatment comparison	sklearn, statsmodels	MatchIt, WeightIt, cobalt	Balance table, Love plot, weighted analysis
Survey-Weighted	Complex survey data (KNHANES, NHANES, KCHS)	statsmodels	survey, tableone, gWQS	Weighted Table 1, wOR table, subgroup results
Repeated Measures	Longitudinal / multi-timepoint data	pingouin, statsmodels	lme4, nlme, geepack	Spaghetti plot, LMM/GEE/RM ANOVA results

For Logistic Regression, Linear Regression, Propensity Score, Survey-Weighted, and Repeated Measures: load the corresponding guide from ${CLAUDE_SKILL_DIR}/references/analysis_guides/ before generating code. For Survey-Weighted analysis, also load survey_weighted.md. For NHIS claims-based studies, load nhis_icd10_mapping.md. For test selection guidance, load ${CLAUDE_SKILL_DIR}/references/analysis_guides/test_selection.md.

Phase 3: Execute

Generate and run a Python (preferred) or R script following these rules:

Script Structure

Every script MUST start with a reproducibility header:

"""
Analysis: {description}
Date: {YYYY-MM-DD}
Random seed: 42
Python: {version}
Key packages: {package==version, ...}
"""
import numpy as np
import pandas as pd
np.random.seed(42)

Execution Rules

Random seed: Always np.random.seed(42) or set.seed(42).

Figure style: Always load the matplotlib style file:

import matplotlib.pyplot as plt
style_path = os.path.join(os.environ.get('CLAUDE_SKILL_DIR', '.'), 'references/style/figure_style.mplstyle')
if os.path.exists(style_path):
    plt.style.use(style_path)

Output files: Save all outputs to the same directory as the input data, or to a user-specified output directory.
Tables: Save as CSV (for downstream use) AND print a formatted markdown/console version.
Figures: Save as both PDF (vector) and PNG (300 DPI).
Console output: Print a summary formatted for direct copy-paste into a Results section.

Assumption Checking

Before running parametric tests, always check and report:

Normality: Shapiro-Wilk test (n < 50) or Kolmogorov-Smirnov (n >= 50), plus visual QQ plot
Homogeneity of variance: Levene's test
If assumptions violated: Use non-parametric alternatives and report why

Multiple Comparisons

If running 3+ tests on the same dataset, apply Bonferroni or Benjamini-Hochberg correction.
Always report both uncorrected and corrected p-values.
State the correction method used.

Stratified & Ordinal-Trend Reporting

Strata disjointness gate (before any ordinal trend test). Before running a Cochran-Armitage trend test (or any analysis that treats tiers as an ordered partition), assert the strata are mutually exclusive and exhaustive: sum(n per stratum) == unique N and sum(events per stratum) == total events. A trend test on overlapping or non-exhaustive strata is invalid. Emit the per-stratum N/event table and the reconciliation in the output (this is the analysis-side mirror of /self-review check_cohort_arithmetic.py PARTITION_OVERLAP).
Secondary stratum-HR validation checklist. Every secondary stratum hazard/odds ratio must be reported with (a) its reference contrast (which category is the referent), (b) the event count in each stratum, and (c) a sparse-stratum caveat when any stratum has a low event count (a rule of thumb: < 10 events makes the estimate unstable). A bare "HR 1.55 in lean participants" without the referent and the events is uninterpretable.
Proportion CI lower-bound clamp. Clamp every proportion confidence-interval lower bound to max(0, lower); a zero-event Wilson/score interval can emit a negative or absurd tiny-exponent lower bound (e.g., 3.47e-16) that is a display artifact, not a real bound. Report 0 (or 0.0%) instead, and prefer an exact (Clopper-Pearson) interval for zero/near-zero cells.

Output Manifest

After all analyses complete, save a manifest file _analysis_outputs.md in the output directory:

# Analysis Outputs
Generated: {YYYY-MM-DD}
Study type: {detected or user-specified type}

## Tables
- `table1_demographics.csv` -- Baseline characteristics
- `diagnostic_accuracy_table.csv` -- Performance metrics with 95% CIs

## Figures  
- `roc_curve.pdf` / `roc_curve.png` -- ROC curves (vector / 300 DPI)

## Data
- `predictions.csv` -- Per-subject model predictions with ground truth

This manifest enables downstream skills (/make-figures, /write-paper) to auto-discover analysis outputs without user intervention.

Phase 3.5: Generated-Code Quality Gate

Before reporting any script as final, lint every emitted .py/.R file for the reproducibility-hygiene "slop" that AI-generated analysis code recurrently carries:

python3 ${CLAUDE_SKILL_DIR}/scripts/check_generated_code.py {script.py} --strict
# or scan a whole output directory:
python3 ${CLAUDE_SKILL_DIR}/scripts/check_generated_code.py --code-dir {analysis_dir} --strict

Major findings (fix before reporting the script):

MISSING_SEED — randomness used (sampling, bootstrap, train/test split, rng) with no np.random.seed / set.seed / random_state= / default_rng. Non-reproducible.
HARDCODED_DATA_LITERAL — a hand-typed, table-shaped numeric literal instead of read_csv()/read.csv() + subset. This is the data-integrity rule "never hand-type CSV data into scripts."
HARDCODED_ABS_PATH — an absolute path literal (/Users/, /home/, C:\, ~/Documents). Non-portable and a PII risk.
INPLACE_SOURCE_OVERWRITE — writing to the same path read as input; this overwrites raw data. Write derived outputs to a new path ("never modify raw data").

Flags (fix when tidying): DEBUG_LEFTOVER (a breakpoint() / browser() / debug print / TODO marker left in) and UNUSED_IMPORT (a dead Python dependency).

The gate is conservative on the Major checks — it fires HARDCODED_DATA_LITERAL only on genuinely table-shaped literals and MISSING_SEED only on a real randomness call — so it stays quiet on legitimate analysis code. It is the analysis-side mirror of the data-integrity and reproducibility checks /self-review is built to catch downstream.

Phase 4: Report

After execution, generate manuscript-ready text:

Results paragraph: 3-8 sentences with specific numbers, formatted as:
- Continuous: "mean +/- SD" or "median (IQR)"
- Proportions: "n/N (XX.X%)"
- Test results: "statistic = X.XX, p = 0.XXX"
- Effect sizes: "Cohen's d = X.XX (95% CI: X.XX-X.XX)"
- AUC: "AUC = 0.XXX (95% CI: 0.XXX-0.XXX)"
Table/figure captions: Draft captions referencing table/figure numbers.
Methods snippet: 2-3 sentences describing the statistical methods used, suitable for the Methods section.

Statistical Reporting Rules (Always Enforced)

These rules apply to ALL analyses without exception:

Exact p-values: Report exact values (e.g., p = 0.034), not inequalities. Exception: report as p < 0.001 when the value is below 0.001.
Confidence intervals: Always report 95% CIs for primary endpoints.
Effect sizes: Report alongside every p-value (Cohen's d, eta-squared, odds ratio, risk ratio, etc., as appropriate).
Parametric vs non-parametric: Choose based on assumption checks, not convenience. Report the assumption test results.
Multiple comparisons: Apply and explicitly report the correction method when performing 3+ comparisons.
Sample size reporting: Always state n for each group/analysis.
Missing data: Report how many cases were excluded and why.
Decimal places: p-values to 3 decimals, proportions to 1 decimal, means/SDs to appropriate precision for the measurement.
Design/power statistics are code outputs, never hand-computed. Any minimum detectable effect (MDE), a-priori or post-hoc power, or required sample size that will appear in the manuscript MUST be emitted by this committed script — printed with its method and inputs (n per arm, alpha, power, allocation ratio, one/two-sided) — not computed in a side tool (G*Power, an online calculator) and pasted in. Use one method family consistently (e.g. the exact noncentral-t via statsmodels TTestIndPower or scipy's nct); do not mix a normal approximation for some values with exact-t for others. A value that exists only in the manuscript with no script that reproduces it is the failure mode /self-review Phase 2.5a-2 is built to catch.
Estimand & CI output contract. Every primary point estimate — including quantile estimands (T25, median time-to-event), pooled proportions, and subdistribution HRs, not just ORs/HRs/AUCs — MUST be emitted together with its 95% CI. In the output CSV, carry the interval as explicit columns (estimate, ci_lower, ci_upper) or as a single text column in est (lo–hi) form; never emit a point estimate with no interval in an adjacent column. Round ORs/HRs/sHRs to 2 decimals and AUC/C-statistic to 3. This is the output side of the /self-review §C assertion that "all primary metrics have 95% CIs."

Effect-Size Real-World Translation

Whenever a primary result is a correlation, a standardized coefficient, a regression slope, an OR/HR/RR, or a Cohen's d, also report it as a plain-language unit shift a non-statistician can act on. The coefficient answers "is there an association"; the translation answers "how much, in units I use". This complements rule 3 above (report effect sizes) — it does not replace it.

When to apply

Any continuous-exposure to continuous-outcome association reported as Spearman's rho, Pearson's r, or a standardized slope.
Any OR/HR/RR where the audience needs an absolute-risk feel.
Reader / expert-elicitation studies, clinical-utility framing, abstracts, and figure captions.

Procedure

Pick an anchored contrast on the exposure, not a 1-unit step. Default: 25th to 75th percentile (IQR). State both endpoints in native units.
Translate to the outcome scale.
- For a rank/standardized association (Spearman's rho or a per-SD slope) under an approximately monotonic-linear assumption: delta_outcome ~= ((x_p75 - x_p25) / SD_x) * |rho| * SD_outcome. Report as: "going from {x_p25} to {x_p75} {units} is associated with about {delta_outcome} {outcome units} on average."
- For a regression slope b: delta_outcome = b * (x_p75 - x_p25) (cleaner; no monotonicity caveat).
- State the assumption explicitly; the IQR translation is a more defensible verbal guide than an SD-scaled one.
For OR/HR/RR, accompany the relative measure with an absolute one at a stated baseline risk: the absolute risk difference, and NNT = 1 / ARR (or NNH = 1 / ARI). Always state the baseline risk used.
Bound the claim: report the contrast, the assumption, and a CI on the coefficient; do not imply causation from a crude or unadjusted estimate.

Worked example (synthetic) rho = 0.39 between a fasting marker (IQR 0.6 to 3.5 units, SD 3.05) and an index (SD 2.13): ((3.5 - 0.6) / 3.05) * 0.39 * 2.13 ~= 0.8 -> "Going from the 25th to the 75th percentile of the marker is associated with about 0.8 index units higher on average (monotonic-linear approximation; crude, unadjusted)."

Output contract (clinical-utility is a default, not an optional add-on). Report every primary effect in units a clinician acts on, by default — do not leave these as prose to be added later:

OR/HR/RR primary outcomes → report the relative measure and the absolute risk at a stated baseline + absolute risk difference + NNT (or NNH = 1/ARI), baseline risk explicit. A relative-only headline is incomplete.
Continuous outcomes → add the IQR/clinically-anchored "Real-world translation" line beneath the effect size.
Prediction / classification (incl. medical-AI) models → a decision-curve / net-benefit pass at the relevant threshold is standard output, not just AUC + calibration. An incremental claim reports added net benefit / NRI / IDI over the established clinical model, not the new model's AUC alone. See references/table-standards/table-types/incremental_value.md and the make-figures decision_curve exemplar (and render_core_figures.py for the rendered curve).

Error Handling

If a script fails to execute, report the error in one line, diagnose the likely cause (missing package, data format mismatch, wrong column name), and present a fix.
Do NOT retry the same script more than once without modifying it or asking the user.
If an R package is unavailable, suggest install.packages() and wait for user confirmation.
For prediction models: always include calibration assessment (Brier score, calibration plot, or calibration slope/intercept) alongside discrimination metrics. AUC alone is insufficient.

Output Conventions

Tables

Before generating any publication table, load the journal profile and table type template:

Load ${CLAUDE_SKILL_DIR}/references/table-standards/journal-profiles/{journal}.yaml if a target journal is known
Load ${CLAUDE_SKILL_DIR}/references/table-standards/table-types/{type}.md for the relevant table type
If no journal specified, default to AMA style (Radiology profile)

Output formats (always generate all three):

CSV file (for downstream use and archival)
Console markdown rendering (for user review)
R gtsummary code (for publication-quality Word/LaTeX export)

Universal rules (enforced regardless of journal):

No vertical lines — horizontal rules only (top, below header, bottom)
Binary variables: show only one level (e.g., Male only, not Male + Female)
Units in column headers, not repeated in cells
Consistent decimal places within each column
All abbreviations defined in footnotes, self-contained per table
Exact P values always (never "NS" or "significant")
Name the statistical test in footnote or general note
Variability measure always stated: mean (SD) or median (IQR)

Journal-specific parameters (from loaded YAML profile):

Footnote markers: letters (AMA) vs symbols (NEJM/Lancet)
P value format: case, leading zero, italic
CI separator: comma (Radiology) vs "to" (JAMA/NEJM/Lancet)
Title format: period (AMA) vs colon (Lancet)
Abbreviation order: appearance (Radiology) vs alphabetical (JAMA)

Footnote placement order (universal):

General note (no marker) — e.g., "Data are mean (SD) unless noted"
Abbreviations — in order per journal convention
Specific notes (superscript markers) — per-cell explanations
Probability notes — significance thresholds (if applicable)

gtsummary pipeline (recommended for R table generation):

theme_gtsummary_journal("{journal}")  # "jama", "lancet", "nejm"
theme_gtsummary_compact()
# ... build table ...
tbl %>% as_flex_table() %>% flextable::save_as_docx(path = "table.docx")

Validation checklist (run before finalizing any table):

Binary variables show only one level
Units in headers, not cells
Consistent decimal places per column
Statistical test named (footnote or general note)
Effect sizes per clinically meaningful unit (per 10 years, not per 1 year)
Reference category stated for categorical predictors
No "NS" — exact P values only
Abbreviations defined in footnotes

Figures

Format: PDF (vector, for journal) + PNG (300 DPI, for review)
Style: Use figure_style.mplstyle for consistent appearance
Font: Arial, 8-10pt
Colors: Colorblind-safe palette
Size: 3.5 inches (single column) or 7.0 inches (double column) width
Always include axis labels with units

Console Output

Formatted for direct copy-paste into the Results section of a manuscript
Include all numbers that would appear in the text
Use the reporting format conventions above

Analysis-Specific Guidelines

Table 1 (Demographics)

Template: references/templates/table1_demographics.py
Table type guide: references/table-standards/table-types/table1_demographics.md
Continuous variables: mean +/- SD if normal, median (IQR) if skewed
Categorical variables: n (%)
Binary variables: show only one level (e.g., Male n (%), not both Male and Female)
Compare groups: t-test/Mann-Whitney for continuous, chi-square/Fisher for categorical
Report standardized mean differences (SMD) if requested (preferred over P for PS-matched studies)
RCTs: P values in Table 1 are usually unnecessary per CONSORT
gtsummary tbl_summary() with journal theme for R pipeline

Diagnostic Accuracy

Methodology guide: references/analysis_guides/diagnostic_accuracy.md (load before generating code — every metric with a CI on a stated analysis unit; the confidence-weighted trap [unweighted-baseline AUC + monotonic-encoding check, produce-side of probe D9]; paired DeLong vs MRMC for reader-generalising claims; per-stratum admissibility [D10]; one-scale-per-comparison [D11])
Template: references/templates/diagnostic_accuracy.py
Always report: sensitivity, specificity, PPV, NPV, accuracy, AUC
CIs: Wilson score for proportions, DeLong for AUC
ROC curve: include diagonal reference line, AUC in legend
If comparing models: DeLong test for AUC comparison
Youden's index for optimal threshold when applicable
Include calibration assessment (Brier score, calibration plot) for prediction models
NRI/IDI: When comparing two models (e.g., base model vs model + AI score), report:
- Category-based NRI (with clinically defined risk categories)
- Continuous NRI (note: tends to be inflated — report alongside category-based)
- IDI (Integrated Discrimination Improvement)
- Bootstrap 95% CIs (1000+ iterations)
- These supplement, not replace, DeLong AUC comparison
Table type guide (added value beyond a baseline): references/table-standards/table-types/incremental_value.md (paired ΔAUC + DeLong CI, continuous NRI with event/non-event split, IDI, net benefit at a prespecified threshold, same-patient/calibrated-first discipline). Pairs the decision-curve exemplar make-figures references/exemplar_plots/decision_curve.md.
Reader study (MRMC): references/table-standards/table-types/reader_study.md (per-reader + reader-averaged AUC with an Obuchowski–Rockette/DBM reader+case CI, per-patient vs per-lesion unit, superiority vs non-inferiority margin). Use an MRMC method (not a fixed-reader DeLong CI) for a claim that generalises to readers. Pairs make-figures references/exemplar_plots/mrmc_roc.md.

Inter-rater Agreement

Methodology guide: references/analysis_guides/agreement_reliability.md (load before generating code — the pseudoreplication trap for clustered/repeated measurements + the pseudoreplication-safe per-subject / mixed-effects code, ICC model/type selection, agreement-vs-reliability distinction; pairs with self-review probe O18)
Table type guide: references/table-standards/table-types/agreement.md (ICC with model/type + CI, weighted κ for ordinal, Bland–Altman bias + LoA, reliability-vs-agreement distinction, common errors)
Template: references/templates/agreement_analysis.py
2 raters + categorical: Cohen's kappa
2+ raters + categorical: Fleiss' kappa (or Krippendorff's alpha)
Continuous: ICC (specify model: one-way, two-way random/mixed; type: single/average)
Always report interpretation labels (Landis & Koch or Cicchetti)
Bland-Altman plot for continuous paired measurements
Bootstrap CIs (1000 iterations, seed=42)

Meta-analysis

Prefer R (meta/metafor packages) for meta-analysis
Comparative: metabin() for binary outcomes (OR/RR), metagen() for continuous
- Use method = "Inverse", method.tau = "DL", method.random.ci = "HK"
- Avoid deprecated args: comb.fixed → common, hakn → method.random.ci
Single-arm pooled proportion: metaprop() with sm = "PLOGIT", method.ci = "CP"
- Small-study test branch: do not use Egger's regression for a single-arm proportion meta-analysis — funnel-asymmetry tests assume an effect-size-vs-SE relationship that does not hold for raw proportions. If a small-study assessment is needed, use Peters' test or an arcsine-based variant, and only when k >= 10 (note underpowered otherwise)
- Standard output: report tau-squared on the logit scale and a 95% prediction interval (metaprop(..., prediction = TRUE)) in addition to the pooled estimate; the PI conveys where a future study's proportion is expected to fall under the random-effects model
Nested observation units: if the proportion's unit is nested within study (e.g., per-lesion within study, per-image within patient), do not report a naive Wilson/binomial CI that ignores clustering — use a cluster-bootstrap or a GLMM with a random intercept per study so the CI reflects the design
Heterogeneity: I-squared, Q test, tau-squared, and a 95% prediction interval for the random-effects pooled estimate
Forest plot: individual studies + pooled estimate
Funnel plot + Egger's test for publication bias (comparative effect sizes only; note: underpowered k<10)
Sensitivity analysis: leave-one-out (metainf())
Subgroup: update(res, subgroup = variable)

DTA Meta-Analysis

Template: references/templates/dta_meta_analysis.R
Prefer R (mada, meta, metafor packages) for DTA meta-analysis
Bivariate model (Reitsma): mada::reitsma() — recommended over separate pooling of Se/Sp
- Accounts for correlation between sensitivity and specificity
- Produces SROC curve with confidence + prediction regions
Key outputs: Pooled Se/Sp (95% CI), positive/negative LR, DOR, SROC AUC
Threshold effect: Spearman correlation between logit(Se) and logit(FPR)
- If significant: interpret single pooled Se/Sp with caution, emphasize SROC curve
Forest plots: Paired (sensitivity + specificity side by side)
Publication bias: Deeks' funnel plot asymmetry test (NOT standard funnel plot)
- Standard funnel plots are inappropriate for DTA studies
- Note: underpowered for k < 10
Dual approach (comparative + single-arm):
- Primary: metabin() for comparative studies (OR/RR)
- Secondary: metaprop() with sm = "PLOGIT" for single-arm pooled proportion
- Use method = "Inverse", method.tau = "DL", method.random.ci = "HK"
Small studies (k < 10): bivariate model may not converge; consider narrative synthesis
Alternative: If mada unavailable, use metafor::rma.mv() with bivariate structure

Network Meta-Analysis

Guide: Load analysis_guides/network_meta_analysis.md before generating code
For ≥3 interventions via combined direct + indirect evidence (incl. component NMA); pairwise machinery (search/screening/random-effects model) via the Meta-analysis section above
Assess transitivity before pooling: compare effect-modifier distributions across comparisons (box plots / table) and/or network meta-regression — it is a clinical judgment, not a test
Test consistency globally (design-by-treatment) AND locally (node-split / back-calculation); a star network (no closed loops) cannot be checked — state it; investigate the source of any inconsistency (often one trial)
R netmeta (frequentist: netsplit, decomp.design, netheat, netrank P-scores, comparison-adjusted funnel) or Bayesian gemtc / multinma / BUGSnet (node-split, SUCRA, DIC)
Present a network plot (node ∝ sample size, edge ∝ #trials); report global τ²; ranking (SUCRA/P-score) is not a superiority test — report it with the league table, intervals, and certainty
Certainty per estimate via CINeMA / GRADE-NMA (downgrade indirect-only); component NMA assumes additivity (state/check it). Report against PRISMA-NMA; risk of bias via RoB-NMA. Review-side probes: NM1–NM8 in network_meta_analysis.md

Health Economic Evaluation

Guide: Load analysis_guides/health_economic_evaluation.md before generating code
For cost-effectiveness (CEA), cost-utility (CUA, QALY), cost-benefit (CBA), cost-minimisation, or budget-impact analyses; trial-based or decision-model-based (decision tree, Markov/state-transition, discrete-event simulation)
Compute incremental cost ΔC, incremental effect ΔE, and the ICER = ΔC/ΔE; with ≥3 options remove dominated / extended-dominated strategies before sequential ICERs; prefer net benefit (INMB = λΔE − ΔC) for regression/probabilistic summaries
State and justify the perspective, time horizon (lifetime for chronic disease), discount rate (both costs and outcomes), currency + price year; QALYs from a named preference-based instrument + value set
Uncertainty is the analytic core: one-way / tornado for drivers, probabilistic sensitivity analysis (PSA) with justified parameter distributions (beta for probabilities/utilities, gamma/log-normal for costs) → cost-effectiveness plane + CEAC; scenario analyses for structural choices
R heemod / dampack / hesim / BCEA (state-transition + PSA + CEAC + EVPI), flexsurv for survival extrapolation. Report against CHEERS 2022; make the "cost-effective" conclusion conditional on a stated willingness-to-pay threshold. Review-side probes: HE1–HE8 in health_economic_evaluation.md

Survey/Likert

Descriptive: median, IQR, frequency distribution per item
Internal consistency: Cronbach's alpha with item-total correlations
Reverse-coding guard (run before reliability): a negatively-worded scale item must be recoded (min+max) - x before computing the scale total or Cronbach's alpha. An un-recoded reverse item produces a negative item-rest correlation and a negative alpha — which is a coding bug, not evidence of a multidimensional construct (do not defend it as such; you lose a review round). likert_summary.py prints the per-item item-rest correlations, flags negative ones as reverse-code suspects, warns loudly on a negative alpha, and accepts --reverse-items E3 ... to apply the recode before scoring. To screen at cleaning time, run /clean-data scripts/check_reverse_coding.py. See the global rule survey-scale-reliability.md.
If comparing groups: Mann-Whitney or Kruskal-Wallis (ordinal data)
Visualization: diverging stacked bar chart

Survival Analysis

Methodology guide: references/analysis_guides/survival.md (load before generating code — competing risks first [naive 1−KM overestimates → produce the Aalen–Johansen/Fine–Gray CIF; cause-specific vs subdistribution for which question, produce-side of probe S3]; PH check → RMST when violated; reverse-KM follow-up + C-index variant [S6]; estimand provenance [S8])
Table type guide: references/table-standards/table-types/survival_results.md (Cox results table: events/person-time, reverse-KM median follow-up, univariable + adjusted HR with CI, PH-assumption footnote, EPV/sparse-stratum and RMST-when-PH-violated rules)
Kaplan-Meier curves with number-at-risk table
Log-rank test for group comparison
Cox proportional hazards: report HR (95% CI)
Events-per-variable (EPV) gate: check events / n_covariates >= 10 before fitting Cox (mirror of the logistic EPV rule). Warn if violated and fall back to a Firth/penalized Cox or profile-likelihood CIs; do not report Wald CIs from a sparse-event model as if stable
Nested observation units (cluster-robust CI): when a subject contributes more than one analysed unit (multiple lesions, both eyes, repeated episodes), pass a subject id so the HR CIs use a robust cluster-sandwich variance (coxph(..., cluster = id) / robust = TRUE in R, cluster_col= in lifelines, e.g. survival_analysis.py --cluster <id>). Treating correlated rows as independent understates the standard errors and narrows the CI artificially
Check proportional hazards assumption (Schoenfeld residuals)
PH violation → do not report a single time-averaged HR. If the Schoenfeld global test is significant (or a covariate's residual trends with time), a single Cox HR averages a changing effect and is misleading. Report a piecewise / time-stratified HR (split follow-up at a clinically sensible cut, or tt() time-transform), or switch to RMST difference at a fixed horizon, and state the violation explicitly
Horizon vs follow-up. Do not read a KM/CIF estimate at a horizon beyond the data: if a reported time point (e.g., a 15-year cumulative incidence) exceeds the reverse-KM median follow-up, either restrict the horizon to where the risk set is non-trivial or report the number-at-risk at that horizon so the reader can judge the extrapolation
Report median survival with 95% CI
Warranty period / quantile estimands (T25 etc.): Time to a fixed cumulative incidence. Use quantile() from the KM/survfit object and always emit the 95% CI (the lower/upper from quantile(km, conf.int=TRUE), or a log-transformed / bootstrap CI) alongside the events/n that define it. A quantile point estimate reported without its CI is incomplete. If the event rate is below the target quantile, report "not reached" and consider Weibull parametric extrapolation (also with an interval)

Interval-Censored Survival

When exact event times are unknown (e.g., health screening cohorts where status changes are detected at periodic visits), standard KM underestimates time-to-event. Use interval-censored methods:

R packages: icenReg (parametric/semi-parametric IC regression), interval (NPMLE/Turnbull), survival (Surv type "interval2")
Turnbull estimator: Non-parametric MLE for interval-censored data — analogous to KM but accounts for the interval between last negative and first positive observation
Parametric IC models: Weibull or log-logistic via icenReg::ic_par(). Report shape/scale parameters and compare AIC across distributions
Mid-point imputation: Simple approximation — event time = midpoint of (last negative, first positive). Acceptable as sensitivity analysis but NOT as primary method
When to use: Serial measurement cohorts (e.g., health screening databases), cancer screening intervals, repeated biomarker assessments
Auto-trigger: if the event date is defined by a periodic visit / scheduled re-examination (the event is detected at a visit, not observed exactly), interval-censoring is not optional — make an IC model the primary analysis, or at minimum a mandatory pre-specified sensitivity analysis, and do not present a right-censored Cox coxph() on visit-dated events as if the times were exact
Multistate / transition models: for repeated transitions (e.g., msm), account for subject-level clustering with a subject random effect or a sandwich (robust) variance, and check the time-homogeneity assumption (constant transition intensities) before trusting a single rate
Reporting: State the interval-censored nature of the data explicitly in Methods. Report both standard KM (for comparability with prior literature) and IC estimates (as primary or sensitivity)

Competing Risks

When death or other events preclude the outcome of interest, standard KM overestimates cumulative incidence (treats competing events as censored). Use competing risk methods:

R packages: cmprsk (Fine-Gray), tidycmprsk (tidy interface), survival (cause-specific Cox)
Cumulative incidence function (CIF): cmprsk::cuminc() — replaces 1-KM for each event type. Gray's test for group comparison
Fine-Gray subdistribution hazard: cmprsk::crr() or tidycmprsk::crr() — reports subdistribution HR (sHR) with 95% CI. Interpretable as effect on CIF directly. Check the subdistribution-PH assumption the same way you check it for Cox (a time-interaction term on the subdistribution scale, or inspection of scaled-residual analogues); a constant sHR is an assumption, not a given. Report the cause-specific HR alongside it so the etiologic and prognostic readings are both visible
Cause-specific Cox: Standard Cox censoring competing events — reports cause-specific HR. Better for etiology; Fine-Gray better for prognosis/prediction
When to use: Mortality studies with multiple causes of death, cardiovascular events when non-CV death is frequent, any outcome where competing events are common (>5% of total events)
Reporting: Present CIF plots (NOT 1-KM) when competing risks exist. Report both cause-specific HR and subdistribution HR when the research question is etiologic. State which competing events were defined. When a CIF is quoted at a horizon beyond the median follow-up, report the number-at-risk at that horizon (or restrict the horizon) — a CIF extrapolated past the data is not a stable estimate

Group Comparison

2 independent groups: t-test or Mann-Whitney U
2 paired groups: paired t-test or Wilcoxon signed-rank
3+ independent groups: ANOVA or Kruskal-Wallis, with post-hoc
3+ paired groups: repeated measures ANOVA or Friedman, with post-hoc
Always report: test statistic, degrees of freedom, p-value, effect size

Correlation

Pearson r (if bivariate normal) or Spearman rho (if not)
Report: coefficient, 95% CI, p-value
Scatter plot with regression line and CI band
For multiple variables: correlation matrix heatmap

Logistic Regression

Guide: Load analysis_guides/regression.md before generating code
Template: references/templates/regression.py (set regression_type = "logistic")
Run univariable analysis first, then multivariable with clinically selected variables
Required outputs: OR table (univariable + multivariable), C-statistic (95% CI), and calibration (intercept + slope + flexible plot — not Hosmer–Lemeshow, which is deprecated; see the calibration guide)
Prediction-model calibration guide: references/analysis_guides/calibration.md (load before generating code for any model that outputs a risk used for a decision — the apparent slope of exactly 1.00 is the in-sample tell, so produce the bootstrap optimism-corrected slope/intercept; Van Calster's calibration levels; scaled Brier; why Hosmer–Lemeshow is dropped; produce-side of probe S7)
Check VIF < 5, EPV >= 10 (warn if violated)
Nested observation units: when rows are clustered within subjects (multiple lesions/visits per patient), use cluster-robust standard errors (cov_type="cluster", cov_kwds={"groups": id} in statsmodels) or a mixed-effects logistic model — a naive logit CI assumes independent rows and is too narrow
Box-Tidwell test for continuous predictor linearity
Forest plot of adjusted ORs
NRI/IDI if comparing models (incremental value assessment)

Linear Regression

Guide: Load analysis_guides/regression.md before generating code
Template: references/templates/regression.py (set regression_type = "linear")
Required outputs: coefficient table (β, 95% CI, P), R²/adjusted R², VIF
Always generate 4-panel diagnostic plot (residuals vs fitted, Q-Q, scale-location, leverage)
Check assumptions: normality of residuals, homoscedasticity, multicollinearity
Report both unstandardized β (primary) and standardized β (for effect size comparison)

Propensity Score

Guide: Load analysis_guides/propensity_score.md before generating code
Template: references/templates/propensity_score.py
Step 1: PS estimation (logistic regression)
Step 2: Apply method (matching with caliper = 0.2 × SD logit PS, IPTW/SIPTW with stabilized weights, or overlap weighting)
Step 3: Balance assessment — SMD < 0.10 for all covariates, Love plot mandatory
Step 4: Weighted/matched outcome analysis with robust SE
Step 5: Sensitivity analysis (E-value for unmeasured confounding)
Always state the estimand (ATE/ATT/ATO) explicitly
Recommend overlap weighting as default (no extreme weight issues)
SIPTW: Stabilized IPTW variant used in emulated target trial frameworks; report effective sample size

Survey-Weighted Analysis

Guide: Load analysis_guides/survey_weighted.md before generating code
Template: references/templates/survey_weighted_analysis.py
For KNHANES/NHANES/KCHS and similar complex survey designs
Always declare survey design (strata, cluster/PSU, weight) before analysis
Use correct weight variable (interview vs exam vs nutrition)
R survey package strongly recommended over Python for publication
Sequential model building: Model 1 (age+sex) → Model 2 (full adjustment)
Report weighted odds ratios (wOR) with 95% CI
Cross-national: analyze each country separately, never pool
Subgroup analysis: exclude the stratification variable from covariates

Mediation Analysis

Guide: Load analysis_guides/mediation.md before generating code
Bootstrapped product-of-coefficients (a×b) indirect effect (R mediation / CMAverse / PROCESS); ≥2000 resamples, bias-corrected percentile CI — not the Sobel test
Binary outcome: counterfactual / natural-effects decomposition (CMAverse, regmedint), not the naive OR product
Report total, direct, indirect effects each with a bootstrap CI; proportion mediated only with uncertainty and only when the total effect is well-estimated (unstable / can exceed 100% when total is near-null)
Identification, not the bootstrap, is the issue: mediation needs no unmeasured mediator–outcome confounding (sequential ignorability) → report an E-value for the indirect effect (or ρ-based sensitivity). A cross-sectional design cannot order X→M→Y — frame as association-level (review probe O13)
Report against AGReMA

Interaction & Effect Modification

Choose and state the scale. A public-health / biological synergy claim is an additive-scale statement → report RERI, AP (attributable proportion), or S (synergy index), each with a CI — not only a multiplicative OR/HR product term. A non-significant multiplicative interaction is compatible with a large additive one (and vice versa)
"Joint association" via a combined multi-level exposure (high/high vs low/low) shows joint categories, not interaction — add the product term (multiplicative) and/or RERI (additive) to claim interaction
Stratified-only "stronger in A than B" is the difference-in-significance fallacy — report the formal interaction term, not two separate stratum estimates
R interactionR / epiR for RERI/AP/S with CIs; follow Knol & VanderWeele interaction-reporting recommendations. Review-side probe: O14 in observational_confounding.md

Multiple Testing & High-Dimensional Screening

Guide: Load analysis_guides/multiplicity.md before generating code
For agnostic many-exposure scans (ExWAS / EWAS / MWAS / proteome-/nutrient-wide) and any "screen N predictors, report the significant ones" pass
Match the correction to the claim: FWER (Bonferroni / Holm / permutation-based study-wide threshold) for a confirmatory single hit; FDR (Benjamini–Hochberg q-value) for discovery — then frame as hypothesis-generating
Report the correction method and the number of tests m (the denominator), applied to the whole tested set — never shrink m to the winners
Replication is the real safeguard: split-half / second cohort / cross-cycle with directional concordance and a reported replication rate; a single-cohort FDR-significant scan is exploratory
Correlated exposures → raw Bonferroni is over-conservative (permutation or effective-number-of-tests via poolr::meff()); a univariate hit may be a marker for a correlated cause (consider WQS / quantile g-computation / BKMR before causal reading)
Report full results (all effect sizes + p/q), not only winners; complex surveys combine design-based SEs (survey_weighted.md) WITH the correction. Review-side probe: O17 in observational_confounding.md

Mendelian Randomization

Guide: Load analysis_guides/mendelian_randomization.md before generating code
For genetic-instrument causal inference: two-sample summary-data MR, one-sample MR, MVMR, drug-target / cis-MR, non-linear MR
State and evidence the 3 IV assumptions: relevance (F-statistic / R²), independence (ancestry + confounder scan), exclusion restriction (no horizontal pleiotropy — the untestable one)
Pre-specify the full sensitivity suite, not IVW alone: IVW + MR-Egger (intercept = directional pleiotropy) + weighted median + weighted mode + MR-PRESSO; Cochran's Q + leave-one-out; concordance across methods is the robustness claim (R TwoSampleMR / MendelianRandomization)
Address reverse causation (Steiger / bidirectional), sample overlap (report fraction; overlap + weak instruments inflate type-1 error), winner's curse (select instruments in an independent GWAS), and ancestry matching
Drug-target / cis-MR: GLS-IVW for correlated cis variants + colocalization (coloc) + positive control + adverse-effect phenome scan. Non-linear MR: residual/doubly-ranked shapes can be artefactual → require negative/positive controls + extreme-stratum sensitivity
Interpret as a lifelong genetic-proxy effect direction, not a clinical-intervention magnitude; report against STROBE-MR. Review-side probes: MR1–MR8 in mendelian_randomization.md

Polygenic Risk Score (PRS / PGS)

Guide: Load analysis_guides/polygenic_risk_score.md before generating code
For developing/validating/applying a genome-wide polygenic score as a predictor or risk-stratifier (distinct from MR: PRS is prediction, MR is causal inference)
Base (discovery GWAS) and target/validation samples must be independent; tune (P+T / LDpred2 / PRS-CS shrinkage / quantile cut) on a separate tuning set and evaluate out-of-sample (avoid overfitting / winner's curse). Tools: PRSice-2, LDpred2 (bigsnpr), PRS-CS / PRS-CSx, BridgePRS
Ancestry portability is the central issue: report performance separately per target ancestry (within-group differences can rival between-group); prefer ancestry-matched / multi-ancestry discovery + PCs; do not extend a European-derived score to other ancestries without per-ancestry validation
Report OR/HR per SD (CI) + quantile absolute risk; discrimination (C/AUC) and calibration (plot + slope/intercept) in the target population — discrimination ≠ calibration
Incremental value is the clinical crux: report PRS on top of the guideline clinical model (SCORE2/QRISK3/PCE/Tyrer-Cuzick) — ΔC-statistic (CI), NRI/IDI, net benefit — not PRS-alone AUC. A screening claim needs detection-rate-at-fixed-FPR / likelihood ratio, not AUC
Prefer prospective/incident validation (prevalent case–control overstates utility); report against PGS-RS / TRIPOD+AI. Review-side probes: PG1–PG8 in polygenic_risk_score.md

NHIS Claims-Based Studies

Guide: Load analysis_guides/nhis_icd10_mapping.md for disease definition patterns
Claims-based algorithms: N-claim rule, claim+medication, look-back period
Always specify ICD-10 code ranges, claim count requirement, and time windows
Charlson comorbidity index: cite Quan 2005 adaptation
Anchor covariates to most recent data prior to index date
Sensitivity analysis: test stricter/looser disease definitions

Burden of Disease, Decomposition & Forecasting

Guide: Load analysis_guides/burden_decomposition_forecasting.md before generating code
For a burden-of-disease estimate, attributable-risk (PAF / comparative risk assessment), temporal-trend (joinpoint / AAPC), decomposition (Das Gupta; Arriaga life-expectancy), or forecast (BAPC / age-period-cohort)
The value-add-layer playbook — a descriptive rate is rarely publishable alone; bolt on ONE layer: decomposition (why the rate changed — aging vs population growth vs epidemiological change), PAF (how much is modifiable), joinpoint/AAPC pre-vs-post (did a datable policy/shock bend the trend), forecast (where it is going), Arriaga (which ages/causes drove ΔLE)
Uncertainty intervals, not CIs: report a draw-based 95% UI (2.5th–97.5th percentile of 250–500 draws propagated end-to-end); a UI crossing the null means insufficient evidence for direction, not a non-significant test
Single-center cohort adaptation: three layers port onto existing follow-up without new data — trend-break (reslice around a datable guideline/scanner-era change), forecast (project a serial imaging trajectory), and global framing (place the individual-level effect next to the published GBD burden from GHDx — contextualization, not re-estimation, so no GATHER trigger). The ecological "UI-replaces-confounding-control" shortcut does not port: individual-level data still needs the DAG / E-value / negative-control toolkit
Report the estimate against GATHER (/check-reporting); keep burden/attribution/decomposition/forecast descriptive or associational unless a causal design (natural experiment, MR — analysis_guides/mendelian_randomization.md) is in place

Repeated Measures

Guide: Load analysis_guides/repeated_measures.md before generating code
Template: references/templates/repeated_measures.py
Default method: LMM (handles missing data, no sphericity assumption)
RM ANOVA only if: no missing data AND few time points AND sphericity met
GEE for: population-averaged effects or non-normal outcomes
Always convert wide → long format first
Time × Group interaction is the key result — always report and interpret
Generate spaghetti plot (individual trajectories) + group mean trajectory plot
For LMM: report random effects structure, covariance structure (CS/AR1/UN), AIC/BIC
For RM ANOVA: report Mauchly's test, epsilon, correction method (Greenhouse-Geisser)
If missing > 5%: load analysis_guides/missing_data.md and apply MICE before analysis

Covariate Pitfalls: Structural Zeros & Dose/Duration Variables

Applies to any multivariable adjustment (logistic / linear / Cox / propensity-score / survey-weighted). Two coupled failure modes around a dose/duration variable anchored to a categorical exposure (pack-years under smoking status, grams/week under alcohol use, cessation-duration under former-smoker):

Structural-zero guard (do not impute): a never-smoker's pack_years is a structural zero, not missing-at-random — the value is known to be 0 by definition of the category. Feeding it to MICE/MNAR imputation as if it were missing fabricates a non-zero dose for unexposed subjects and corrupts the exposure contrast. Before imputing any dose/duration column, set the implied zero explicitly (IF status == 'never' THEN dose = 0) and impute only the genuinely-missing residual among the exposed. /clean-data flags categorical-implied-zero contradictions (a never row with a NULL dose) and ships scripts/check_structural_zero.py.
Complete-case collapse warning (use status, not dose, for adjustment): when a dose/duration variable enters a complete-case multivariable model, the unexposed stratum — which carries structural zeros often stored as NULL — is dropped wholesale, collapsing n (commonly 40–60%) and distorting subgroup estimates (a small stratum can shrink to a handful of subjects). For confounder adjustment use the categorical status variable (never/former/current); reserve the continuous dose for an exposed-only (e.g., ever-smoker-restricted) secondary analysis. Always report n before and after model fitting and confirm the denominator did not silently collapse.

Covariate Selection: Over-adjustment in a Cross-Sectional Outcome Model

Applies to any cross-sectional / single-visit outcome regression (the exposure and outcome are measured at one time point, so temporal order is not observed). The selection rule is causal, not statistical:

Do not adjust for a consequence or mediator of the outcome. A covariate that the outcome physiologically drives sits on or after the causal path; adjusting for it is over-adjustment / collider bias and removes part of the effect under study. The signature case is a renal-function outcome: with eGFR as the outcome, serum uric acid is renally excreted (a lower eGFR mechanically raises urate), so uric acid is an outcome-consequence, not a confounder; blood pressure and HbA1c are often similarly downstream. Classify each candidate covariate against a DAG as confounder / mediator / outcome-consequence / collider, and keep only confounders in the primary model.
"It differs in Table 1" is not a confounder-selection criterion. Baseline imbalance by exposure justifies considering a variable, but a mediator or outcome-consequence stays out regardless of how imbalanced it is. A kitchen-sink "adjust for everything that differs" model is over-adjusted by construction.
Report the suspect-covariate sensitivity + VIF. Make a parsimonious, history-/design-based model the primary one; report the fuller model as a sensitivity analysis that drops the suspect covariate (or adds it, if you start parsimonious), and show whether the headline estimate moves. Always print VIF (collinearity between an outcome-consequence and the outcome's other correlates is common) and the n actually fitted. If dropping the covariate materially changes the estimate, propagate to the abstract and conclusion.
Compare adjusted-vs-unadjusted on the SAME frame (extended-adjustment missingness trap). When an extended-adjustment model adds covariates that carry missingness, the analytic n shrinks (e.g. 84 → 49 events). Comparing that adjusted estimate to the full-frame unadjusted/base estimate confounds adjustment with case-concentrated missingness — it can look as if "adjustment inflated the estimate" when the drift is who-was-dropped. The fair anchor is the unadjusted estimate refit on the reduced complete-case frame (the same rows the adjusted model used): report unadjusted-and-adjusted on the reduced frame alongside the full-frame estimate, and never describe "adjustment changed the estimate" from a comparison across different frames. (Equivalently, use multiple imputation so all models share one frame.)

Language

Code and output: English
Communication with user: Match user's preferred language
Medical terms: English only

What This Skill Does NOT Do

Does not fabricate or simulate data to fill gaps
Does not choose analysis endpoints -- the user decides the research question
Does not interpret clinical significance -- only statistical results
Does not replace biostatistician review for complex designs (e.g., adaptive trials)

Anti-Hallucination

Never fabricate variable names, dataset column names, or variable codings. If a variable mapping is uncertain, output [VERIFY: variable_name] and ask the user to confirm against the data dictionary.
Never fabricate statistical results — no invented p-values, effect sizes, confidence intervals, or sample sizes. All numbers must come from executed code output.
Never generate references from memory. Use /search-lit for all citations.
If a function, package, or API does not exist or you are unsure, say so explicitly rather than guessing.

analyze-stats

Popularity

Invocation

Configuration

Context Preview

Supporting Files

SKILL.md

analyze-stats

Popularity

Invocation

Configuration

Context Preview

Supporting Files

SKILL.md

Statistical Analysis Skill

Data Privacy Check

Reference Files

Workflow

Phase 1: Data Assessment

Phase 2: Analysis Plan

Phase 3: Execute

Script Structure

Execution Rules

Assumption Checking

Multiple Comparisons

Stratified & Ordinal-Trend Reporting

Output Manifest

Phase 3.5: Generated-Code Quality Gate

Phase 4: Report

Statistical Reporting Rules (Always Enforced)

Effect-Size Real-World Translation

Error Handling

Output Conventions

Tables

Figures

Console Output

Analysis-Specific Guidelines

Table 1 (Demographics)

Diagnostic Accuracy

Inter-rater Agreement

Meta-analysis

DTA Meta-Analysis

Network Meta-Analysis

Health Economic Evaluation

Survey/Likert

Survival Analysis

Interval-Censored Survival

Competing Risks

Group Comparison

Correlation

Logistic Regression

Linear Regression

Propensity Score

Survey-Weighted Analysis

Mediation Analysis

Interaction & Effect Modification

Multiple Testing & High-Dimensional Screening

Mendelian Randomization

Polygenic Risk Score (PRS / PGS)

NHIS Claims-Based Studies

Burden of Disease, Decomposition & Forecasting

Repeated Measures

Covariate Pitfalls: Structural Zeros & Dose/Duration Variables

Covariate Selection: Over-adjustment in a Cross-Sectional Outcome Model

Language

What This Skill Does NOT Do

Anti-Hallucination

Similar Skills

Statistical Analysis Skill

Data Privacy Check

Reference Files

Workflow

Phase 1: Data Assessment

Phase 2: Analysis Plan

Phase 3: Execute

Script Structure

Execution Rules

Assumption Checking

Multiple Comparisons

Stratified & Ordinal-Trend Reporting