Skill

design-study

Reviews study design and validity for radiology and medical AI research, checking analysis unit, cohort logic, leakage, comparator design, and reporting guideline fit.

ai-ml

npx claudepluginhub aperivue/medsci-skills --plugin medsci-presentation

Popularity

Stars

145

Forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/medsci-project:design-study

User invocable

Model invocable

Inline context

Default effort

Configuration

Modelinherit

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

This skill pressure-tests whether a study is answerable, interpretable, and defensible before large amounts of drafting or analysis work accumulate.

Supporting Files

skill.yml

SKILL.md

299 lines · ~3.9k tokens

Similar Skills

design-ai-benchmarking

145

Designs and reviews validity of AI-vs-human-expert benchmarks, covering rubrics, calibration probes, reviewer panels, and reliability targets. Use before data collection.

3 files

medsci-project

domain-research-health-science

119

Guides clinical and health science research through PICOT question formulation, evidence hierarchy assessment, bias evaluation (Cochrane RoB 2, ROBINS-I), outcome prioritization, and GRADE certainty rating.

3 files

thinking-frameworks-skills

scientific-critical-thinking

Evaluates scientific claims and evidence quality using GRADE and Cochrane Risk of Bias frameworks. Assesses experimental design, biases, confounders, and statistical validity.

6 files4 tools

superpowers

Stats

LanguagePython

Stars145

Forks37

MaintenanceExcellent

Last CommitJun 11, 2026

Actions

View Source View Plugin View on GitHub View README

Help us improve

Share bugs, ideas, or general feedback.

Stats

Actions

Help us improve

Share bugs, ideas, or general feedback.

Design-Study Skill

Purpose

This skill pressure-tests whether a study is answerable, interpretable, and defensible before large amounts of drafting or analysis work accumulate.

Use it when:

a study question is known but the analysis plan is still fluid
the user wants a methods sanity check
a manuscript feels vulnerable to reviewer criticism
a peer review requires explicit methodological diagnosis

Communication Rules

Communicate with the user in their preferred language.
Use English for statistical, radiologic, and reporting-guideline terminology.
Be direct about validity risks, but always propose the smallest feasible fix first.

Core Review Questions

Always inspect these dimensions:

What is the exact research question?
What is the analysis unit: patient, lesion, exam, study, phase, report?
What is the index date or decision point?
How are inclusion and exclusion criteria applied?
Is there any information leakage?
What is the reference standard or endpoint definition?
What comparator is clinically meaningful?
What validation strategy is used?
What uncertainty reporting is required?
Which reporting guideline best fits?
Are exposure/outcome/covariate definitions literature-grounded, or invented ad-hoc from the data dictionary? If ad-hoc, defer to /define-variables before drafting Methods.

Standard Output

## Study Design Review
Question: ...
Study type: ...
Analysis unit: ...
Index date / prediction timepoint: ...

### Strengths
- ...

### Major validity risks
1. ...
2. ...

### Minimal fixes
- ...

### Reporting fit
- Recommended guideline: ...

### Decision
- Ready for analysis / Needs redesign / Drafting can proceed with limitations

Workflow

Phase 1: Reconstruct the study

Extract from protocol, draft, slides, tables, or notes:

clinical problem
intended use case
population
inputs
outputs
outcome definition
timing of variable availability

Gate: Present the reconstructed study summary (question, analysis unit, intended use) to the user. Confirm before proceeding — if the reconstruction is wrong, the entire validity review will be misdirected.

Phase 2: Check structural validity

A. Analysis unit

Look for mismatches such as:

patient-level claim from lesion-level analysis
exam-level split with patient overlap
phase-level samples treated as independent

B. Leakage

Look for:

postoperative features used for preoperative prediction
normalization or thresholding performed before data split
repeated exams across train/test
reader annotations derived from outcome information
input-text contamination for NLP/LLM extraction tasks: if the model input includes report sections such as clinical history, indication, impression, prior diagnosis, or referral text, confirm that those fields do not literally name or strongly imply the target label. If the target is already present in the supplied text, the task is information retrieval under label leakage, not phenotype inference; redesign the input mask, report a sensitivity analysis excluding leaky fields, or reframe the claim.
construct dependence (a predictor that is a definitional component of the outcome). Two cases: (i) mathematical definition — an input that computes the outcome (when the outcome is HOMA-IR = f(fasting insulin, fasting glucose), those two inputs are not independent predictors); (ii) near-tautological composite — a ratio or score built from the outcome's defining components, which shows an inflated, near-circular association. Test: "could this predictor be derived, in whole or part, from the outcome's definition or the same measurement?" If yes, exclude it, or retain it only as a labeled calibration probe rather than a reported discovery.

F. Time origin & survivorship (incident / transition models)

For any time-to-event or incident/transition design, check before drafting:

Time origin per model. Each incident model starts its at-risk clock at the correct origin. Watch for immortal-time bias (a span in which the event cannot occur, misattributed to one group) and left-truncation / delayed entry (subjects entering the risk set after the origin).
Mediator-ascertainment-window survivorship. A "progressor" / transition label that is conditional on surviving to a later ascertainment (a second scan, a follow-up visit) is survivorship-biased; plan a landmark time or an explicit intermediate-state (multistate / illness-death) model.
Primary-analysis-set selection. If the primary will not be the full cohort (e.g., complete-case while a large fraction is missing), pre-specify the selection justification and a MAR rationale; do not let the complete-case model become primary because it is the significant one (an outcome-dependent choice).
A design that cannot yet answer these should say so honestly — but note that at review time a Methods/Limitations admission that the issue was "not formally assessed" is escalated to a MAJOR by the survival probe (S1), not waved through as a limitation.

C. Reference standard

Check:

who established ground truth
when it was established
whether blinding was possible
whether only a subset had gold standard verification
Construct ↔ nominal-definition match. Does the exposure/finding construct stay inside its stated definition, or does it quietly exceed it? An "incidentaloma" defined as an indeterminate finding must not include frank malignancy reads; a label that overshoots its definition inflates the apparent cohort and breaks the κ. For each construct, restate the nominal definition and confirm every included case satisfies it.
Per-flag reference-standard concordance. When the index finding is flagged against a reference standard, report the concordance per flag category (not just overall). A construct where a large fraction of flags do not match the reference standard (e.g., ~86% non-match) is measuring something other than the named construct.
Manuscript definition ↔ variable_operationalization.md. The variable definitions written in Methods must match the operationalization table verbatim (dictionary-first). A blinded re-classification form must quote the analytic protocol's definition verbatim — paraphrase / "common-sense extension" in the form (but not the Methods) is the documented cause of a low κ that is a definition mismatch, not real disagreement. Cross-check with /define-variables output before drafting.

D. Validation

Classify:

apparent only
internal split
cross-validation
temporal validation
external validation
multi-center external validation

E. Reader / Expert-Elicitation Study Design

When the study elicits expert ratings (reader study, annotation panel, AI-output evaluation), check the following before data collection.

Rubric design

Decouple the axes. Each rated dimension should measure one construct. Keep "is the finding valid/correct" separate from "is it novel", "is it feasible to measure", "does it add value over current tools", and "would it change action". A candidate can be high-validity yet low-added-value ("real but redundant"); a single blended score hides this.
Anchor every Likert point with a short verbal descriptor; pilot the anchors with at least one reviewer before locking.
Pre-specify discriminant validity: hypothesize which dimensions should correlate vs be orthogonal, then report the full inter-dimension correlation matrix to confirm the rubric measures distinct constructs.

Calibration probes (planted control items) Insert a small number of deliberate control items, blinded and randomized across raters (record who received which, e.g. a probe_arm flag), to (i) anchor the scale, (ii) measure rater drift and fatigue, and (iii) audit the rubric and pipeline itself. Four useful flavors:

Positive control / "too-good" item — a known-strong or near-tautological item; tests whether raters equate "largest effect" with "best", and whether an upstream construct-independence gate works.
Known-bad negative control — an engineered defect (fabricated reference, missing key statistic); expected to score low.
Instability item — an estimate that reverses or fails to replicate on holdout; tests caveat handling.
Mechanism-contradiction item — an empirical direction that opposes the proposed mechanism.

Report inter-rater reliability on the control items separately as primary evidence of rubric and scale validity; a low overall ICC is interpretable only if raters at least converge on the controls.

Operational rigor

Randomize item order per reviewer (not one global seed); analyze order and fatigue effects.
Collect reviewer metadata (years of experience, prior AI-evaluation experience, subspecialty) for descriptive reporting.
Define a structured export schema (per-item ratings, free-text justifications, follow-ups, timing) up front.
Require each item to be judged standalone; discourage cross-item references in free-text, which signal non-independent rating.

For an AI-system-versus-human-expert benchmark specifically, route to /design-ai-benchmarking, which extends this subsection with arm definition, LLM-as-judge versus human-as-judge adjudication, and a structured export schema.

Phase 3: Clinical framing

Ask whether the comparator and endpoint support the stated claim:

is the model better than current practice or just another model?
is the endpoint clinically meaningful?
does performance translate to action?
incremental value: if the study frames the model/marker as adding value beyond / on top of / incremental to an existing tool (a clinical score, a routine test, a baseline model), the design must pre-specify the baseline comparator built from the in-routine-use predictors and an incremental-value metric — ΔC-index / ΔAUC (with a paired CI, e.g. DeLong), categorical or continuous NRI, IDI, or decision-curve net benefit. A standalone discrimination number ("our model's AUC was 0.84") does not support a "beyond X" claim; without the nested-model comparison the finding may be real but redundant. Plan this at design time — it cannot be added post hoc without the baseline model.
fine-tuning contribution baseline: if an NLP/LLM study claims that fine-tuning, LoRA, prompt engineering, or a multi-agent wrapper improves extraction/classification, pre-specify a same-backbone zero-shot or few-shot comparator on the identical input, output schema, and test split. A comparison only against a weaker or unrelated baseline cannot establish that the proposed adaptation adds value.
endpoint↔conclusion scope: decide up front what kind of conclusion the design can support, so the manuscript does not overreach. A cross-sectional / single-visit / prevalence design cannot support a prognostic or surveillance claim (rescreen interval, disease progression) — that needs longitudinal follow-up. A binary surrogate endpoint (present/absent, >0, dichotomized) is risk stratification, not a patient-care directive (defer/withhold/initiate therapy). At review time /self-review §D + check_scope_coherence.py flag CROSS_SECTIONAL_PROGNOSTIC / SURROGATE_CARE_DIRECTIVE against the conclusion.

Phase 4: Reporting fit

Recommend one primary guideline:

TRIPOD-AI
CLAIM
STARD
STROBE
PRISMA
CARE
ARRIVE
journal-specific additions if needed

Frequent Failure Modes

Diagnostic AI

no clinically relevant comparator
exam-level split instead of patient-level split
unclear reference standard
AUROC-only reporting without threshold metrics

Prognostic modeling

unclear time zero
immortal time bias
feature timing mismatch
no calibration

Retrospective cohort / screening database

time zero misalignment: cohort entry ≠ follow-up start → immortal time bias
interval-censored outcomes treated as exact → underestimation of event times
healthy volunteer bias unacknowledged → inflated external validity claims
surveillance bias from unequal follow-up frequency between groups
3 bias classification (Hernan/Robins): selection bias (who enters), information bias (how measured), confounding (what else differs) — explicitly map each threat
confounding completeness: pre-specify the adjustment set from a DAG (not a Table-1 p < 0.05 rule), and plan to report whether any measured covariate that turns out imbalanced by exposure but outside the adjustment set leaves the primary estimate robust (an extended-adjustment sensitivity model). At review time /self-review Phase 2.5e + the O1–O6 probes in observational_confounding.md check this against Table 1.

Multimodal LLM / report generation

no clear rubric for clinical correctness
benchmark labels derived from noisy reports without adjudication
unsupported claims about safety or workflow benefit
input text contains the target label or diagnosis being predicted
no same-backbone zero-shot/few-shot baseline for a fine-tuning or prompt-engineering claim

Imaging meta-analysis

overlapping cohorts
paired modalities analyzed as independent
heterogeneity metrics missing
zero-cell handling unspecified

Minimal-Fix Principle

Whenever possible, recommend the smallest feasible repair first:

clarify the claim
narrow the target population
add a limitation statement
add a clinically relevant baseline
re-run one key sensitivity analysis
redefine the endpoint more explicitly

Escalate to redesign only when the central claim is not defensible otherwise.

Handoff Rules

route to analyze-stats when the design is basically sound but analysis details need refinement
route to check-reporting after the design is locked
route to self-review when the user wants a pre-submission quality check on their own manuscript
route back to write-paper only after the main validity risks are documented

What This Skill Does NOT Do

It does not compute statistics directly
It does not draft full manuscript prose
It does not resolve raw data engineering issues
It does not replace a full peer review when journal-facing tone is required

Anti-Hallucination

Never fabricate references. All citations must be verified via /search-lit with confirmed DOI or PMID. Mark unverified references as [UNVERIFIED - NEEDS MANUAL CHECK].
Never invent clinical definitions, diagnostic criteria, or guideline recommendations. If uncertain, flag with [VERIFY] and ask the user.
Never fabricate numerical results — compliance percentages, scores, effect sizes, or sample sizes must come from actual data or analysis output.
If a reporting guideline item, journal policy, or clinical standard is uncertain, state the uncertainty rather than guessing.

design-study

Popularity

Invocation

Configuration

Context Preview

Supporting Files

SKILL.md

Similar Skills

Help us improve

Help us improve

Find plugins for your project

design-study

Popularity

Invocation

Configuration

Context Preview

Supporting Files

SKILL.md

Design-Study Skill

Purpose

Communication Rules

Core Review Questions

Standard Output

Workflow

Phase 1: Reconstruct the study

Phase 2: Check structural validity

A. Analysis unit

B. Leakage

F. Time origin & survivorship (incident / transition models)

C. Reference standard

D. Validation

E. Reader / Expert-Elicitation Study Design

Phase 3: Clinical framing

Phase 4: Reporting fit

Frequent Failure Modes

Diagnostic AI

Prognostic modeling

Retrospective cohort / screening database

Multimodal LLM / report generation

Imaging meta-analysis

Minimal-Fix Principle

Handoff Rules

What This Skill Does NOT Do

Anti-Hallucination

Similar Skills

Help us improve

Design-Study Skill

Purpose

Communication Rules

Core Review Questions

Standard Output

Workflow

Phase 1: Reconstruct the study

Phase 2: Check structural validity

A. Analysis unit

B. Leakage

F. Time origin & survivorship (incident / transition models)

C. Reference standard

D. Validation

E. Reader / Expert-Elicitation Study Design

Phase 3: Clinical framing

Phase 4: Reporting fit

Frequent Failure Modes

Diagnostic AI

Prognostic modeling

Retrospective cohort / screening database

Multimodal LLM / report generation

Imaging meta-analysis

Minimal-Fix Principle

Handoff Rules

What This Skill Does NOT Do

Anti-Hallucination