Skill

stats-methods

Applies statistical techniques to data analysis: descriptive stats, distributions, hypothesis testing, A/B evaluation, outliers, trends, forecasting, p-values, confidence intervals, and pitfalls like Simpson's paradox.

Python

data-engineering

npx claudepluginhub vm0-ai/vm0-skills --plugin user-research

Tool Access

This skill uses the workspace's default tool permissions.

Stats

Stars: 52

Forks: 11

Last Commit: Mar 13, 2026

Statistical Methods

A practitioner's guide to applying statistics in data analysis, from summarizing distributions through testing hypotheses and spotting analytical traps.

Summarizing Numeric Data

Choosing a Center Metric

Data Characteristic	Recommended Measure	Rationale
Symmetric, outlier-free	Mean	Most efficient estimator when the data are roughly normal
Asymmetric or outlier-heavy	Median	Unaffected by extreme values
Non-numeric or ranked	Mode	Only option for unordered categories; ranked data can also use the median
Business KPIs like revenue per user	Both mean and median	The gap between them reveals skewness

Guideline: For any business metric, present the mean alongside the median. When they differ substantially, the distribution is skewed and the mean by itself will mislead.
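
The guideline above can be illustrated with a minimal sketch. The revenue figures are made up for illustration, and `gap_ratio` is a hypothetical helper name rather than a standard statistic:

```python
import statistics

# Hypothetical revenue-per-user values: most users spend little, one spends a lot.
revenue = [12, 15, 9, 14, 11, 13, 10, 480]

mean_rev = statistics.mean(revenue)      # pulled upward by the single large value
median_rev = statistics.median(revenue)  # stays close to the typical user

# Relative gap between mean and median, used here as a rough skew signal.
gap_ratio = (mean_rev - median_rev) / median_rev

print(f"mean={mean_rev}, median={median_rev}, gap_ratio={gap_ratio:.2f}")
```

Here the mean is several times the median, so reporting the mean alone would overstate what a typical user spends.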

Quantifying Variability

Standard deviation: Typical distance from the mean; best suited to bell-shaped data.
IQR (interquartile range): Gap between the 25th and 75th percentiles; resistant to extreme values.
Coefficient of variation: Standard deviation divided by the mean; enables apples-to-apples variability comparison across different scales. Only meaningful for ratio-scale data whose mean is well above zero.
Range: Maximum minus minimum; gives a quick but outlier-sensitive view of data spread.
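
As a rough sketch of why the coefficient of variation helps across scales, the snippet below compares two hypothetical metrics measured in different units. The function name and the sample values are assumptions made for illustration:

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unitless)."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical metrics on very different scales.
order_value_usd = [80, 100, 120]
page_load_seconds = [0.9, 1.0, 1.1]

cv_orders = coefficient_of_variation(order_value_usd)
cv_latency = coefficient_of_variation(page_load_seconds)

# The raw standard deviations (20 vs. about 0.1) are not comparable,
# but the unitless ratios are.
print(f"orders CV={cv_orders:.2f}, latency CV={cv_latency:.2f}")
```

In this made-up data, order values vary about twice as much relative to their mean as page load times do.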

Telling the Story with Percentiles

Go beyond averages by reporting a percentile ladder:

p1:   Floor of the distribution (bottom 1%)
p5:   Lower boundary of typical values
p25:  First quartile
p50:  Median — the representative observation
p75:  Third quartile
p90:  Top 10% threshold (heavy users, premium tier)
p95:  Upper boundary of typical values
p99:  Extreme top 1%

Sample insight: "Half of all sessions last under 4.2 minutes, yet the top decile exceeds 22 minutes, which pushes the average to 7.8 minutes."
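
One way to build such a ladder, assuming the data sits in a pandas Series, is sketched below. `percentile_ladder` is a hypothetical helper and the session values are synthetic:

```python
import pandas as pd

def percentile_ladder(values: pd.Series) -> pd.Series:
    """Return the p1..p99 ladder described above as a labeled Series."""
    levels = [0.01, 0.05, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99]
    ladder = values.quantile(levels)
    ladder.index = [f"p{round(q * 100)}" for q in levels]
    return ladder

# Synthetic example: the integers 1 through 101.
sessions = pd.Series(range(1, 102))
ladder = percentile_ladder(sessions)
print(ladder)
```

Note that `quantile` interpolates linearly between observations by default, so ladder values need not be actual data points.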

Characterizing Distributions

For every numeric column, document:

Shape: Gaussian, right-tailed, left-tailed, bimodal, uniform, heavy-tailed
Center: Mean vs. median and the magnitude of their difference
Spread: Standard deviation or IQR as appropriate
Extremes: Count and severity of outliers
Boundaries: Natural limits such as zero floors or 100% ceilings
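
The checklist above could be captured in a small profiling helper. This is a sketch under assumptions: `profile_column` is a hypothetical name, outliers are counted with the 1.5 x IQR fence rule, and shape is summarized only by the skewness sign, so a histogram is still needed to spot bimodality:

```python
import pandas as pd

def profile_column(values: pd.Series) -> dict:
    """Summarize center, spread, extremes, and boundaries for one numeric column."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    # Points outside the 1.5 * IQR fences are counted as outliers.
    outside = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return {
        "mean": values.mean(),
        "median": values.median(),
        "skew": values.skew(),   # > 0 suggests a right tail, < 0 a left tail
        "std": values.std(),
        "iqr": iqr,
        "n_outliers": int(outside.sum()),
        "min": values.min(),     # compare against natural floors such as zero
        "max": values.max(),     # compare against natural ceilings such as 100%
    }

profile = profile_column(pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 40]))
print(profile)
```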

Trend Analysis and Projection

Smoothing Noisy Time Series

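# Assumes df is a pandas DataFrame with one row per day and a 'metric' column.
# Trailing 7-day mean: averages out weekday/weekend cycles in daily data.
# With min_periods=1, the first rows average over fewer than 7 points.
df['smooth_7'] = df['metric'].rolling(window=7, min_periods=1).mean()

# Trailing 28-day mean: spans four full weeks, so weekly cycles cancel and
# shorter-term noise is damped (calendar months are not exactly 28 days).
df['smooth_28'] = df['metric'].rolling(window=28, min_periods=1).mean()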

Period Comparisons

Week-over-week: Same weekday, one week apart
Month-over-month: Calendar month versus prior calendar month
Year-over-year: The gold standard for businesses with seasonal patterns
Same-calendar-day: Matches the exact date from the prior year (note that the weekday will differ)
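
A week-over-week comparison can be sketched with a simple shift, assuming a daily series with exactly one row per day and no gaps. The column names and values below are illustrative:

```python
import pandas as pd

dates = pd.date_range("2024-01-01", periods=21, freq="D")
daily = pd.DataFrame({"metric": [float(v) for v in range(100, 121)]}, index=dates)

# Same weekday one week earlier; valid only if no days are missing.
prior_week = daily["metric"].shift(7)
daily["wow_change"] = (daily["metric"] - prior_week) / prior_week

print(daily.tail(3))
```

The first seven rows have no prior-week value, so their change is left as NaN rather than guessed.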

Measuring Growth

Simple rate:   (current - prior) / prior
CAGR:          (final / initial) ^ (1 / n_years) - 1
Log rate:      ln(current / prior)   # additive across periods; treats gains and losses symmetrically
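
The three formulas translate directly into small functions. The function names are chosen here for illustration, and positive inputs are assumed throughout:

```python
import math

def simple_rate(current: float, prior: float) -> float:
    """Period-over-period growth as a fraction of the prior value."""
    return (current - prior) / prior

def cagr(final: float, initial: float, n_years: float) -> float:
    """Compound annual growth rate over n_years."""
    return (final / initial) ** (1 / n_years) - 1

def log_rate(current: float, prior: float) -> float:
    """Log growth rate; log rates over consecutive periods sum to the total."""
    return math.log(current / prior)

# Two consecutive 10% increases: 100 -> 110 -> 121.
two_step = log_rate(110, 100) + log_rate(121, 110)
one_step = log_rate(121, 100)
print(simple_rate(120, 100), cagr(200, 100, 2), two_step, one_step)
```

The last two printed values agree up to floating-point rounding, which is the additivity property that makes log rates convenient for chained periods.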

Spotting Seasonal Cycles

Visually inspect the raw series first
Aggregate by day-of-week to surface weekly rhythms
Aggregate by calendar month to surface annual rhythms
Always use year-over-year or matched-period comparisons to separate trend from seasonality
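
The aggregation steps above might look like the following in pandas. The data is synthetic (weekends deliberately set lower than weekdays) and the column names are assumptions:

```python
import pandas as pd

dates = pd.date_range("2024-01-01", periods=28, freq="D")  # starts on a Monday
# Synthetic metric: 50 on weekends, 100 on weekdays.
values = [50 if d.dayofweek >= 5 else 100 for d in dates]
df = pd.DataFrame({"date": dates, "metric": values})

# Weekly rhythm: average by day of week.
by_weekday = df.groupby(df["date"].dt.day_name())["metric"].mean()

# Annual rhythm: average by calendar month (needs a year or more of data to be useful).
by_month = df.groupby(df["date"].dt.month)["metric"].mean()

print(by_weekday)
```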

Lightweight Forecasting Approaches

For analysts who need quick projections rather than full modeling:

Naive: Forecast equals the most recent observation. Serves as the minimum-viable baseline.
Seasonal naive: Forecast equals the value from the same period in the prior cycle.
Linear extrapolation: Fit a straight line to recent history. Only appropriate when the trend is clearly linear.
Trailing average: Use a rolling mean as the projected value.
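
Each of the four approaches fits in a line or two. The sketch below uses a synthetic, perfectly linear history and assumes a cycle length of four periods (`season_length` is a hypothetical parameter):

```python
import numpy as np
import pandas as pd

history = pd.Series([100, 104, 108, 112, 116, 120, 124, 128], dtype=float)
season_length = 4  # assumed cycle length, in periods

# Naive: repeat the most recent observation.
naive = history.iloc[-1]

# Seasonal naive: value from the same position in the prior cycle.
seasonal_naive = history.iloc[-season_length]

# Trailing average: mean of the last four observations.
trailing_avg = history.tail(4).mean()

# Linear extrapolation: fit a straight line and evaluate it one step ahead.
x = np.arange(len(history))
slope, intercept = np.polyfit(x, history.to_numpy(), deg=1)
linear_next = slope * len(history) + intercept

print(naive, seasonal_naive, trailing_avg, linear_next)
```

On a steadily rising series like this one, the naive and trailing-average forecasts lag the trend, which is one reason to compare several baselines before trusting any of them.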

Always express forecasts as ranges, not point estimates:

Good: "Next month should bring 10,000 to 12,000 registrations based on the trailing quarter"
Misleading: "Next month will yield exactly 11,234 registrations"

Hand off to a specialist when the pattern is non-linear, multiple seasonal cycles overlap, external drivers (ad spend, holidays) matter, or forecast precision drives resource decisions.

Detecting and Handling Outliers

Identification Techniques

Z-score approach (assumes approximate normality; the mean and standard deviation are themselves distorted by extreme values):

z = (df['val'] - df['val'].mean()) / df['val'].std()
outliers = df[abs(z) > 3]  # beyond 3 standard deviations

IQR fence approach (works regardless of distribution shape):

q1 = df['val'].quantile(0.25)
q3 = df['val'].quantile(0.75)
iqr = q3 - q1
lo = q1 - 1.5 * iqr
hi = q3 + 1.5 * iqr
outliers = df[(df['val'] < lo) | (df['val'] > hi)]

Percentile cutoff approach (most straightforward, but it always flags about 2% of rows whether or not they are anomalous):

outliers = df[(df['val'] < df['val'].quantile(0.01)) |
              (df['val'] > df['val'].quantile(0.99))]

What to Do with Outliers

Never strip outliers automatically. Follow this decision process:

Diagnose: Is this a recording error, a legitimately extreme observation, or a sign of a separate population?
Errors: Correct or exclude (e.g., negative ages, epoch-zero timestamps)
Legitimate extremes: Retain but switch to robust summaries (median, IQR)
Distinct populations: Analyze separately (e.g., enterprise accounts vs. self-serve)

Document every exclusion: "We set aside 47 records (0.3% of the dataset) with order values above $50K; these bulk enterprise transactions are covered in a separate section."

Detecting Anomalies in Time Series

Establish an expected baseline (rolling average or year-ago value)
Compute the residual: actual minus expected
Flag residuals exceeding 2-3 standard deviations of historical residuals
Differentiate one-off spikes (point anomalies) from lasting shifts (change points)
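
The first three steps can be sketched as follows. This is one possible arrangement, not a definitive method: `flag_anomalies`, the 7-point window, and the 3-sigma threshold are all assumptions, and both the baseline and the residual spread are computed from prior points only so that a spike does not mask itself:

```python
import pandas as pd

def flag_anomalies(series: pd.Series, window: int = 7, threshold: float = 3.0) -> pd.DataFrame:
    """Flag points whose residual from a trailing baseline is unusually large."""
    # Expected value: mean of the previous `window` points (shift excludes the current one).
    expected = series.shift(1).rolling(window=window, min_periods=window).mean()
    residual = series - expected
    # Typical residual size, measured on the previous `window` residuals only.
    resid_std = residual.shift(1).rolling(window=window, min_periods=window).std()
    # Comparisons against NaN are False, so the warm-up period is never flagged.
    is_anomaly = residual.abs() > threshold * resid_std
    return pd.DataFrame(
        {"actual": series, "expected": expected, "residual": residual, "is_anomaly": is_anomaly}
    )

# Synthetic series alternating between 100 and 102, with one injected spike.
values = [100.0, 102.0] * 15
values[25] = 150.0
result = flag_anomalies(pd.Series(values))
flagged = result.index[result["is_anomaly"]].tolist()
print(flagged)
```

Distinguishing a point anomaly from a change point still requires looking at what happens afterwards: a spike reverts to the baseline, while a shift keeps producing residuals of the same sign.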

Hypothesis Testing Essentials

When It Applies

Use formal testing whenever you need to distinguish a real signal from random noise:

Evaluating A/B experiment results
Measuring the impact of a product change (before vs. after)
Comparing metrics across customer segments

Step-by-Step Process

State the null (H0): No difference exists (default position)
State the alternative (H1): A difference exists
Set the significance threshold (alpha): 0.05 is standard (5% false-positive tolerance)
Calculate the test statistic and p-value
Decide: p < alpha means sufficient evidence to reject H0; otherwise you fail to reject H0, which is not proof that no difference exists
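
To make the steps concrete, here is a minimal two-sided z-test for two conversion rates using only the standard library. The function name and the A/B counts are hypothetical, and the normal approximation assumes reasonably large samples:

```python
import math
from statistics import NormalDist

def two_proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test of H0: both groups share the same conversion rate."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Under H0 the rates are equal, so the standard error uses the pooled rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

alpha = 0.05
# Hypothetical experiment: 200/2000 conversions in A, 250/2000 in B.
z, p_value = two_proportion_ztest(200, 2000, 250, 2000)
reject_h0 = p_value < alpha
print(f"z={z:.2f}, p={p_value:.4f}, reject H0: {reject_h0}")
```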

Selecting the Right Test

Question	Appropriate Test	Conditions
Two group means differ?	Independent samples t-test	Roughly normal, two groups
Two conversion rates differ?	Proportions z-test	Binary outcomes
Same entities measured twice?	Paired t-test	Pre/post on identical subjects
Three or more group means?	ANOVA	Multiple variants or segments
Non-normal data, two groups?	Mann-Whitney U	Skewed or ordinal metrics
Two categorical variables related?	Chi-squared test	Frequency table data
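
Several rows of this table map onto functions in `scipy.stats`. The sketch below assumes SciPy is installed and uses made-up data; the t-test is run with `equal_var=False` (Welch's variant), which is a choice made here rather than something the table prescribes:

```python
from scipy import stats

# Made-up measurements for two clearly separated groups.
group_a = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
group_b = [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]

# Two group means differ? Independent samples t-test (Welch's variant).
t_p = stats.ttest_ind(group_a, group_b, equal_var=False).pvalue

# Non-normal data, two groups? Mann-Whitney U.
u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided").pvalue

# Two categorical variables related? Chi-squared test on a frequency table.
# Rows: variant A / variant B; columns: converted / not converted (made-up counts).
table = [[30, 70], [50, 50]]
chi2, chi_p, dof, expected = stats.chi2_contingency(table)

print(t_p, u_p, chi_p)
```

The remaining rows correspond to `stats.ttest_rel` (paired t-test) and `stats.f_oneway` (one-way ANOVA).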

Beyond p-values: Practical Impact

A statistically significant result only means that data this extreme would be unlikely if there were no true effect. It does not guarantee the effect matters in practice. Always accompany test results with:

Effect magnitude: "Variant B lifted conversion by 0.3 percentage points"
Confidence interval: The plausible range of the true effect
Business translation: Revenue, user, or efficiency implications
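
A confidence interval for the difference between two conversion rates can be sketched with the normal approximation. The helper name and the counts are hypothetical, and the unpooled standard error used here is one common choice among several:

```python
import math
from statistics import NormalDist

def diff_in_proportions_ci(conv_a: int, n_a: int, conv_b: int, n_b: int, confidence: float = 0.95):
    """Normal-approximation interval for (rate_b - rate_a)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Unpooled standard error: each group contributes its own variance.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z_crit * se, diff + z_crit * se

# Hypothetical experiment: 200/2000 conversions in A, 250/2000 in B.
low, high = diff_in_proportions_ci(200, 2000, 250, 2000)
print(f"Lift: 2.5 points, 95% CI from {low * 100:.1f} to {high * 100:.1f} points")
```

An interval this wide is worth stating in the write-up: the observed 2.5-point lift is compatible with a true effect anywhere from roughly half a point to more than four points.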

Sample Size Awareness

Small samples yield unreliable conclusions even when p-values look good
As a rule of thumb, proportions need roughly 30 or more events per group before estimates are even minimally reliable
Detecting subtle effects (e.g., a 1-point conversion shift) can demand thousands of observations per arm
When data is limited, say so: "With 200 observations per group, effects smaller than X% would likely go undetected"
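
The claim about subtle effects can be checked with the standard normal-approximation formula for comparing two proportions. The function below is a sketch, with 5% two-sided alpha and 80% power as assumed defaults:

```python
import math
from statistics import NormalDist

def sample_size_per_arm(p_base: float, p_variant: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate observations needed per arm to detect p_base vs. p_variant."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_variant * (1 - p_variant)
    n = (z_alpha + z_power) ** 2 * variance / (p_variant - p_base) ** 2
    return math.ceil(n)

# Detecting a 1-point shift (10% -> 11%) vs. a 5-point shift (10% -> 15%).
n_small_effect = sample_size_per_arm(0.10, 0.11)
n_large_effect = sample_size_per_arm(0.10, 0.15)
print(n_small_effect, n_large_effect)
```

Under these assumptions the 1-point shift needs on the order of 15,000 observations per arm, while the 5-point shift needs only several hundred.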

Guarding Against Statistical Pitfalls

Correlation vs. Causation

Whenever a correlation surfaces, explicitly evaluate:

Reverse direction: Perhaps B drives A rather than A driving B
Hidden third factor: Some unmeasured variable C could be behind both
Coincidence: When enough variable pairs are examined, some will correlate purely by chance

Safe phrasing: "Users who adopt feature X exhibit 30% higher retention"
Unsafe phrasing: "Feature X causes 30% higher retention" (requires experimental evidence)

The Multiple Testing Trap

Running many tests inflates false positives:

At alpha = 0.05, testing 20 metrics with no real effects yields about one spurious hit on average, and (if the tests are independent) roughly a 64% chance of at least one
If you explored numerous segments before finding the "interesting" one, acknowledge that
Apply Bonferroni correction (alpha / number of tests) or transparently report total tests conducted
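
The arithmetic behind this trap is short enough to compute directly. The function names are illustrative, and the family-wise calculation assumes the tests are independent and every null hypothesis is true:

```python
def expected_false_positives(alpha: float, n_tests: int) -> float:
    """Average number of spurious hits when every null hypothesis is true."""
    return alpha * n_tests

def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one spurious hit across independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests

uncorrected = familywise_error_rate(0.05, 20)
per_test = bonferroni_alpha(0.05, 20)
corrected = familywise_error_rate(per_test, 20)
print(expected_false_positives(0.05, 20), uncorrected, per_test, corrected)
```

Bonferroni is conservative: it controls false positives at the cost of power, so with many correlated metrics it may hide real effects.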

Simpson's Paradox

An overall trend can invert when you break the data into subgroups:

Verify that aggregate conclusions hold within each key segment
Classic scenario: total conversion rises while every segment's conversion falls, because traffic shifted toward a naturally higher-converting segment
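
The classic scenario can be reproduced with a few made-up numbers. Segment names, periods, and counts below are all invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "period": ["before", "before", "after", "after"],
    "segment": ["organic", "paid", "organic", "paid"],
    "visits": [1000, 9000, 5000, 5000],
    "conversions": [200, 450, 900, 200],
})
df["rate"] = df["conversions"] / df["visits"]

# Aggregate view: sum counts first, then divide (never average the rates).
totals = df.groupby("period")[["visits", "conversions"]].sum()
totals["rate"] = totals["conversions"] / totals["visits"]

# Segment view: one row per segment, one column per period.
by_segment = df.pivot(index="segment", columns="period", values="rate")

print(totals["rate"])
print(by_segment)
```

Overall conversion rises from 6.5% to 11% even though both segments decline (organic from 20% to 18%, paid from 5% to 4%), because the traffic mix shifted toward the higher-converting organic segment.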

Survivorship Bias

Your dataset only contains entities that persisted long enough to be recorded:

Studying current users ignores everyone who already left
Profiling winning products overlooks the failures
Routinely ask: "Who is absent from this data, and would including them change the conclusion?"

Ecological Fallacy

Group-level patterns may not describe individuals:

"Nations with higher X tend to have higher Y" does not mean the same holds per person
Resist applying aggregate statistics to individual-level predictions

Illusory Precision

Overly specific numbers suggest unjustified confidence:

"Churn will be 4.73% next quarter" implies an accuracy that rarely exists
Prefer honest ranges: "Churn is likely between 4% and 6%"
Round to the level of certainty you actually possess