Guides sample-size planning for fMRI/EEG/MEG studies using effect-size benchmarks, simulations, and multiple-comparison adjustments. For new study design, grants, and power evaluation.
```
npx claudepluginhub neuroaihub/awesome_cognitive_and_neuroscience_skills --plugin awesome-cognitive-and-neuroscience-skills
```
Statistical power in neuroimaging is fundamentally different from power in behavioral research. The massive multiple comparisons problem (testing ~100,000 voxels simultaneously), spatial correlation structure, and non-standard test statistics mean that standard power formulas underestimate required sample sizes. Meanwhile, the field has historically been severely underpowered: the median fMRI study has only ~20% power to detect a typical effect (Button et al., 2013).
A competent programmer without neuroimaging training would apply standard power calculations (e.g., G*Power for a t-test) without accounting for multiple comparison correction, would not know typical effect sizes in neuroimaging, and would dramatically underestimate the sample sizes needed. This skill encodes the domain-specific knowledge for neuroimaging power analysis.
Before executing the domain-specific steps below, you must first review the general methodology guidance in the research-literacy skill.
This skill was generated by AI from academic literature. All parameters, thresholds, and citations require independent verification before use in research. If you find errors, please open an issue.
Standard power analysis assumes a single statistical test. Neuroimaging involves:
| Challenge | Impact on Power | Source |
|---|---|---|
| Massive multiple comparisons | ~100,000 voxels tested; correction reduces sensitivity by orders of magnitude | Nichols & Hayasaka, 2003 |
| Spatial smoothness | Adjacent voxels are correlated, reducing effective number of independent tests but complicating power calculation | Worsley et al., 1996 |
| Multi-level inference | Subject-level estimation + group-level test; both levels contribute noise | Mumford & Nichols, 2008 |
| Effect size variability | Effects vary across voxels, regions, and subjects; no single "effect size" characterizes a study | Poldrack et al., 2017 |
| Threshold-dependent power | Power depends heavily on the statistical threshold (corrected vs. uncorrected) and correction method | Hayasaka et al., 2007 |
Key implication: A standard G*Power calculation for a two-sample t-test will dramatically overestimate the power of a whole-brain fMRI analysis because it ignores multiple comparison correction (Mumford & Nichols, 2008).
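To see the size of this gap, the sketch below compares single-test power with power after a Bonferroni-style correction over ~100,000 voxels for a two-sample t-test. This is a minimal sketch using statsmodels; Bonferroni is a crude stand-in for voxelwise FWE, and the voxel count is illustrative.

```python
# Minimal sketch: how whole-brain correction shrinks t-test power.
# Bonferroni over n_voxels stands in for voxelwise FWE correction.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
d, n_per_group = 0.5, 30        # medium effect, 30 subjects per group
n_voxels = 100_000              # illustrative whole-brain voxel count

# Power at the conventional single-test alpha (what G*Power reports)
power_single = analysis.power(effect_size=d, nobs1=n_per_group, alpha=0.05)

# Power after spreading alpha across all voxels
power_wholebrain = analysis.power(effect_size=d, nobs1=n_per_group,
                                  alpha=0.05 / n_voxels)

print(f"single test : {power_single:.2f}")      # ~0.48
print(f"whole-brain : {power_wholebrain:.4f}")  # orders of magnitude lower
```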
Typical effect sizes for fMRI analyses:

| Analysis Type | Typical Effect Size | Unit | Source |
|---|---|---|---|
| Task activation (voxel-level) | Cohen's d = 0.5-1.0 | Standardized mean difference | Poldrack et al., 2017 |
| Task activation (ROI-level) | Cohen's d = 0.5-1.5 | Standardized mean difference | Poldrack et al., 2017 |
| Between-group difference (voxel) | Cohen's d = 0.3-0.8 | Standardized mean difference | Poldrack et al., 2017 |
| Functional connectivity (correlation) | r = 0.2-0.5 | Pearson correlation | Marek et al., 2022 |
| Brain-behavior association | r = 0.1-0.3 | Pearson correlation | Marek et al., 2022 |
| Brain-wide association (replicable) | r < 0.05 at N < 1000 | Pearson correlation | Marek et al., 2022 |
Critical finding: Marek et al. (2022) demonstrated that brain-behavior correlations in typical neuroimaging samples (N < 100) are severely inflated. Replicable brain-behavior associations require N > 2,000 for whole-brain analyses.
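A small simulation illustrates the inflation: with a true brain-behavior correlation of r = 0.1, small samples produce wildly variable estimates, while only large samples pin the effect down. This is a sketch with illustrative parameters, using numpy.

```python
# Sketch: sampling variability of a weak brain-behavior correlation
# (true r = 0.1) at small vs. large N. Parameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
R_TRUE = 0.1
COV = [[1.0, R_TRUE], [R_TRUE, 1.0]]

def sample_r_distribution(n, n_sims=2000):
    """Distribution of the sample correlation across simulated studies."""
    rs = np.empty(n_sims)
    for i in range(n_sims):
        x, y = rng.multivariate_normal([0.0, 0.0], COV, size=n).T
        rs[i] = np.corrcoef(x, y)[0, 1]
    return rs

for n in (50, 2000):
    rs = sample_r_distribution(n)
    lo, hi = np.percentile(rs, [2.5, 97.5])
    print(f"N = {n:>4}: 95% of sample correlations fall in [{lo:.2f}, {hi:.2f}]")
```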
Typical effect sizes for EEG/MEG analyses:

| Analysis Type | Typical Effect Size | Source |
|---|---|---|
| ERP component amplitude (e.g., N400, P300) | Cohen's d = 0.3-0.8 | Boudewyn et al., 2018 |
| ERP latency differences | Cohen's d = 0.2-0.5 | Luck, 2014 |
| EEG oscillatory power | Cohen's d = 0.3-0.6 | Cohen, 2014 |
| EEG connectivity (coherence/PLV) | Cohen's d = 0.2-0.5 | Cohen, 2014 |
Sample-size guidelines for fMRI designs:

| Design | Minimum N | Recommended N | Assumptions | Source |
|---|---|---|---|---|
| Within-subject task activation | 20 | 25-30 | Large effect (d > 0.8), lenient correction | Desmond & Glover, 2002 |
| Between-group comparison (large effect, d = 0.8) | 20 per group | 25-30 per group | Whole-brain, cluster-corrected | Thirion et al., 2007 |
| Between-group comparison (medium effect, d = 0.5) | 40 per group | 50+ per group | Whole-brain, cluster-corrected | Thirion et al., 2007; Poldrack et al., 2017 |
| Resting-state individual differences | 25+ | 50+ (much more for replicability) | Depends on reliability of measure | Marek et al., 2022 |
| Brain-behavior correlations | 100+ | N > 2,000 for replicable whole-brain | Large-scale only | Marek et al., 2022 |
| ROI-based analysis (a priori) | 15-20 | 25+ | Single ROI, no whole-brain correction | Desmond & Glover, 2002 |
Sample-size and trial-count guidelines for EEG/ERP designs (units vary by row):

| Design | Minimum | Recommended | Source |
|---|---|---|---|
| ERP trials per condition per subject | 30 | 40-60 | Boudewyn et al., 2018 |
| ERP between-group (medium d = 0.5) | 34 per group | 50+ per group | Boudewyn et al., 2018 |
| ERP within-subject (medium d = 0.5) | 25 subjects | 30+ subjects | Luck, 2014 |
| Time-frequency analysis | 40 trials | 60+ trials | Cohen, 2014 |
Approximate statistical power as a function of per-group sample size:

| N (per group) | Power for d = 0.5 (uncorrected) | Power for d = 0.5 (corrected, whole-brain) | Power for d = 0.8 (corrected) |
|---|---|---|---|
| 10 | ~26% | < 10% | ~25% |
| 20 | ~50% | ~20% | ~50% |
| 30 | ~70% | ~35% | ~70% |
| 40 | ~82% | ~50% | ~85% |
| 60 | ~94% | ~70% | ~95% |
Values are approximate, based on simulations from Mumford & Nichols (2008) and Desmond & Glover (2002). Exact power depends on design, smoothness, effect spatial extent, and correction method.
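Values like these can be generated by Monte Carlo simulation. The sketch below estimates power for a between-group t-test at a single true-effect voxel, uncorrected vs. Bonferroni-corrected; it is illustrative and will not match the table exactly, since the published simulations model spatial structure and different designs.

```python
# Sketch: Monte Carlo power for a between-group t-test at a single
# true-effect voxel, uncorrected vs. Bonferroni over 100,000 voxels.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def mc_power(n_per_group, d, alpha, n_sims=2000):
    """Fraction of simulated studies in which the test rejects at alpha."""
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(d, 1.0, n_per_group)
        _, p = stats.ttest_ind(a, b)
        hits += p < alpha
    return hits / n_sims

for n in (20, 40, 60):
    uncorrected = mc_power(n, 0.5, 0.05)
    corrected = mc_power(n, 0.5, 0.05 / 100_000)
    print(f"N = {n}/group: uncorrected {uncorrected:.2f}, corrected {corrected:.2f}")
```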
```
What type of analysis are you planning?
|
+-- Whole-brain voxelwise analysis
|   |
|   +-- Within-subject (one-sample t-test)
|   |   --> Minimum N = 20; aim for N = 25-30
|   |       (Desmond & Glover, 2002)
|   |
|   +-- Between-group comparison
|   |   |
|   |   +-- Large expected effect (d > 0.8)
|   |   |   --> N = 20-25 per group (Thirion et al., 2007)
|   |   |
|   |   +-- Medium expected effect (d = 0.5)
|   |   |   --> N = 40-50 per group (Poldrack et al., 2017)
|   |   |
|   |   +-- Small expected effect (d = 0.3)
|   |       --> N = 80+ per group; consider ROI approach
|   |
|   +-- Brain-behavior correlation
|       --> N = 100+ minimum; N > 2,000 for replicability
|           (Marek et al., 2022)
|
+-- ROI-based analysis (a priori regions)
|   --> Use standard power formulas (G*Power) with expected
|       effect size from literature or pilot data.
|       No multiple comparison correction needed for single ROI.
|       N = 15-30 typical for medium-large effects.
|
+-- ERP analysis
    |
    +-- Between-group
    |   --> 30-50 per group for medium effects
    |       (Boudewyn et al., 2018)
    |
    +-- Within-subject
        --> 25-30 subjects, 30+ trials per condition
            (Boudewyn et al., 2018; Luck, 2014)
```
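For the ROI branch of the tree above, standard closed-form power analysis applies. A minimal sketch with statsmodels (the effect size and power target are illustrative):

```python
# Sketch: required N for a single a priori ROI (one-sample contrast,
# no whole-brain correction), per the ROI branch of the tree above.
from statsmodels.stats.power import TTestPower

n_required = TTestPower().solve_power(effect_size=0.8,  # large effect (d)
                                      alpha=0.05, power=0.80)
print(f"required N ≈ {n_required:.1f} subjects")  # ~14.3 -> recruit 15+
```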
Several approaches and tools support neuroimaging power estimation:

- Pilot-map-based estimation: estimates power using pilot group-level activation maps. Requires pilot data from at least 10-15 subjects for stable variance estimates (Mumford & Nichols, 2008).
- NeuroPower (https://neuropowertools.org): web-based tool for peak-based power estimation. Advantage: does not require individual subject data; can use published group maps.
- Permutation-based simulation: fully nonparametric; accounts for the exact multiple comparison correction used. Disadvantage: computationally expensive (requires running thousands of permutation tests per power estimate).
- Parametric simulation: simulation-based power estimation using parametric assumptions.
The choice of correction method dramatically affects required sample size:
| Correction Method | Effective Alpha per Voxel | Relative Power | Source |
|---|---|---|---|
| None (p < 0.001 uncorrected) | 0.001 | Highest (but invalid inference) | -- |
| FDR q < 0.05 | ~0.0001-0.001 (data-dependent) | Moderate-High | Genovese et al., 2002 |
| Cluster-based (CDT p < 0.001) | Depends on cluster size | Moderate-High for large effects | Eklund et al., 2016 |
| Voxelwise FWE (RFT, p < 0.05) | ~0.0000005 (≈ 0.05 / 100,000 voxels) | Low | Worsley et al., 1996 |
| TFCE + permutation | Varies | Moderate | Smith & Nichols, 2009 |
Domain insight: Switching from voxelwise FWE to cluster-based or FDR correction can increase power by 50-200% for the same sample size, because these methods exploit the spatial extent of true activations (Nichols & Hayasaka, 2003).
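The sketch below illustrates the mechanism on toy data: applied to the same 10,000 simulated "voxels", FDR retains far more true detections than Bonferroni. All parameters (voxel count, number of active voxels, effect size) are illustrative.

```python
# Sketch: FDR vs. Bonferroni on toy data with 10,000 "voxels",
# 500 of which carry a true effect of d = 0.5 (values illustrative).
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(2)
n_subjects, n_voxels, n_active = 25, 10_000, 500

data = rng.normal(0.0, 1.0, (n_subjects, n_voxels))
data[:, :n_active] += 0.5            # inject the true effect
_, p = stats.ttest_1samp(data, 0.0)  # one p-value per voxel

for method in ("bonferroni", "fdr_bh"):
    reject, *_ = multipletests(p, alpha=0.05, method=method)
    print(f"{method}: {reject[:n_active].sum()} of {n_active} true voxels detected")
```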
For individual differences designs (correlating brain measures with behavior), reliability of the brain measure is critical (Elliott et al., 2020):
| Measure | Typical ICC | Implication | Source |
|---|---|---|---|
| Task fMRI activation (ROI) | 0.3-0.6 | Poor to moderate reliability | Elliott et al., 2020 |
| Resting-state connectivity | 0.3-0.7 | Moderate reliability; depends on scan duration | Elliott et al., 2020 |
| ERP amplitude | 0.5-0.8 | Moderate to good | Cassidy et al., 2012 |
| EEG oscillatory power | 0.6-0.9 | Good to excellent | Cohen, 2014 |
Critical formula: measurement unreliability attenuates the observed brain-behavior correlation, capping the expected value a study can detect:

r_observed_max = r_true * sqrt(reliability_brain * reliability_behavior)
With brain ICC = 0.5 and behavior reliability = 0.8, even a true correlation of r = 0.5 would appear as r = 0.5 * sqrt(0.5 * 0.8) = 0.32 on average (Elliott et al., 2020). This attenuation means far larger samples are needed.
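A short calculation shows how this attenuation inflates the required sample size. This is a sketch using the Fisher-z approximation for correlation power; the reliabilities and target power are illustrative.

```python
# Sketch: attenuation of a brain-behavior correlation and the
# resulting inflation of required N (Fisher-z approximation).
import math
from scipy.stats import norm

def attenuated(r_true, rel_brain, rel_behavior):
    return r_true * math.sqrt(rel_brain * rel_behavior)

def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate N needed to detect correlation r (two-sided test)."""
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    return math.ceil(((z_a + z_b) / math.atanh(r)) ** 2 + 3)

r_obs = attenuated(0.5, 0.5, 0.8)   # ~0.32, as in the text
print(n_for_correlation(0.5))       # ~30 with perfect measures
print(n_for_correlation(r_obs))     # ~77 once attenuated
```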
Recommendation: For individual differences designs, collect longer scan sessions (at least 20-30 minutes of resting-state data; Birn et al., 2013) or use multi-session data to improve reliability.
See references/ for detailed simulation examples and effect size lookup tables.