Computes NaN-safe Spearman/Pearson correlations per feature on datasets with missing values, using pairwise deletion, degenerate-input filtering, and large-dataset optimization.
Install: `npx claudepluginhub jaechang-hits/sciagent-skills --plugin sciagent-skills`

This skill uses the workspace's default tool permissions.
Computing correlations across many features (genes, proteins, variants) when missing values are present is error-prone. The most common mistake is using bulk matrix shortcuts that silently mishandle NaN, producing incorrect correlation values. This guide covers correct per-feature pairwise computation, degenerate input filtering, and performance optimization.
Different features have different missing value patterns across samples. Bulk methods handle this inconsistently:
| Method | Problem |
|---|---|
| `DataFrame.rank()` then `corrwith()` | `rank()` assigns NaN ranks; `corrwith()` may drop rows globally or per-column inconsistently |
| `DataFrame.corrwith(method='spearman')` | Implementation varies by pandas version; may use listwise deletion |
| `np.corrcoef` on ranked data | Propagates NaN to the entire result if any value is missing |
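A toy illustration of the failure mode (hypothetical five-sample data): a single NaN poisons the entire `np.corrcoef` result on ranked values, while pairwise deletion with `spearmanr` still yields a valid coefficient.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

x = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])  # one missing value
y = pd.Series([2.0, 1.0, 4.0, 3.0, 5.0])

# bulk: the NaN rank propagates, so every entry involving x is nan
print(np.corrcoef(x.rank(), y.rank()))

# pairwise deletion: drop only the pairs where either value is missing
mask = ~(x.isna() | y.isna())
rho, pval = spearmanr(x[mask], y[mask])
print(rho, pval)  # finite, valid result
```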
Features that produce undefined or unstable correlations:
| Type | Description | Effect |
|---|---|---|
| Constant features | All values identical (variance = 0) | Correlation undefined (division by zero) |
| Near-constant features | Very low variance | Correlation numerically unstable |
| Too few valid values | After NaN removal, fewer than `min_valid` pairs | Statistically unreliable |
| Single-value after filtering | Only one unique value remains post-NaN removal | Correlation undefined |
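For example, in recent SciPy versions a constant feature triggers a `ConstantInputWarning` and returns NaN (a small sketch with hypothetical data):

```python
import numpy as np
from scipy.stats import spearmanr

const = np.array([3.0, 3.0, 3.0, 3.0])  # zero variance
other = np.array([1.0, 2.0, 3.0, 4.0])

rho, pval = spearmanr(const, other)  # warns: ConstantInputWarning
print(rho, pval)                     # nan nan
```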
Do you have missing values (NaN) in your feature matrix?
├── No NaN at all → Bulk methods are safe (corrwith, np.corrcoef)
└── Yes, NaN present
    ├── Same NaN pattern across all features? → Listwise deletion is acceptable
    └── Different NaN patterns per feature (typical)
        ├── ≤ 10,000 features → Per-feature loop with scipy.stats.spearmanr
        └── > 10,000 features → Parallelized per-feature loop (joblib)
| Scenario | Recommended Approach | Rationale |
|---|---|---|
| No missing data | `DataFrame.corrwith()` | Fast, correct when no NaN |
| Sparse NaN, ≤ 10K features | Per-feature `spearmanr` loop | Correct pairwise deletion, acceptable speed |
| Sparse NaN, > 10K features | Parallelized per-feature loop | Same correctness, scales with cores |
| Dense NaN (> 50% missing) | Per-feature loop + strict `min_valid` | Many features will be skipped; report skip count |
| Uniform NaN pattern | Listwise deletion + bulk method | If all features share the same NaN rows, pairwise = listwise (see the check below) |
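A quick way to test for a uniform NaN pattern (a minimal sketch; `df` is a hypothetical samples x features DataFrame): if every sample is either fully observed or fully missing, listwise and pairwise deletion coincide.

```python
import pandas as pd

def has_uniform_nan_pattern(df: pd.DataFrame) -> bool:
    """True if every sample (row) is either fully observed or fully NaN,
    i.e. all features share the same set of missing rows."""
    nan_mask = df.isna()
    # a row is homogeneous if its True/False values are all identical
    return bool((nan_mask.nunique(axis=1) <= 1).all())
```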
- **Always print a NaN summary before analysis:** Report the total NaN count, features with any NaN, and the per-feature NaN distribution. This documents data quality and alerts you to severe missingness patterns.
- **Use `scipy.stats.spearmanr` per feature in a loop:** This is the only method here that guarantees correct pairwise NaN removal for each feature independently.
- **Set a minimum valid-pair threshold (`min_valid`):** Default to 10. Features with fewer valid pairs after NaN removal produce unreliable correlations and should be skipped with NaN.
- **Filter degenerate inputs before computing correlations:** Remove constant features, near-constant features, and features with excessive NaN before the correlation loop. This avoids undefined results and speeds up computation.
- **Track `n_valid` per feature in the output:** The number of valid pairs varies per feature. Report it alongside rho and p-value so downstream analysis can assess reliability.
- **Report how many features were skipped or filtered:** Silent feature loss is a common source of confusion. Always print the count of filtered degenerate features and skipped low-data features.
- **Use parallelization for large datasets:** For > 10,000 features, use joblib to distribute the per-feature loop across cores. The per-feature computation is embarrassingly parallel.
- **Using bulk rank-then-correlate with NaN present:** `df.rank()` followed by `corrwith()` silently mishandles NaN, producing incorrect correlations. Fix: use `scipy.stats.spearmanr` per feature when NaN is present.
- **Assuming uniform sample count across features:** Different features have different NaN patterns, so each correlation is computed on a different number of samples. Fix: track `n_valid` for every feature.
- **Not filtering degenerate inputs:** Constant or near-constant features produce undefined correlations or divide-by-zero warnings that can silently corrupt results. Fix: run `filter_degenerate()` before the correlation loop.
- **Using listwise deletion when NaN patterns differ:** Listwise deletion removes any row with NaN in any feature, potentially discarding most of your data (see the sketch below).
- **Ignoring the NaN summary step:** Skipping the data quality report means you cannot verify whether the NaN pattern is severe enough to affect results.
- **Setting `min_valid` too low:** With fewer than ~10 valid pairs, Spearman correlation is unreliable and p-values are meaningless.
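A toy sketch of the listwise-deletion pitfall (hypothetical five-sample frame): each feature is missing a different row, so listwise deletion keeps only 2 of 5 samples while pairwise deletion keeps 4 valid values per feature.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'f1': [1.0, np.nan, 3.0, 4.0, 5.0],
    'f2': [2.0, 3.0, np.nan, 5.0, 6.0],
    'f3': [np.nan, 4.0, 5.0, 6.0, 7.0],
})

print(len(df.dropna()))   # listwise: only 2 rows survive
print(df.notna().sum())   # pairwise: 4 valid values per feature
```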
Step 1: Print NaN Summary
Step 2: Filter Degenerate Features
Step 3: Compute Per-Feature Correlations with `scipy.stats.spearmanr`
Step 4: Assemble and Report Results
from scipy.stats import spearmanr
import numpy as np
import pandas as pd
def nan_summary(df):
"""Print NaN summary before correlation analysis."""
print(f"Dataset shape: {df.shape}")
print(f"Total NaN: {df.isna().sum().sum()}")
print(f"Features with any NaN: {(df.isna().any()).sum()}")
print(f"NaN per feature (mean): {df.isna().sum().mean():.1f}")
print(f"NaN per feature (max): {df.isna().sum().max()}")
def filter_degenerate(df, min_unique=3, min_nonnan_frac=0.5):
"""Remove degenerate features before correlation analysis.
Args:
df: DataFrame (samples x features)
min_unique: Minimum number of unique non-NaN values required
min_nonnan_frac: Minimum fraction of non-NaN values required
Returns:
Filtered DataFrame, count of removed features
"""
n_samples = len(df)
keep = []
for col in df.columns:
values = df[col].dropna()
if len(values) < n_samples * min_nonnan_frac:
continue
if values.nunique() < min_unique:
continue
keep.append(col)
removed = len(df.columns) - len(keep)
print(f"Filtered {removed} degenerate features out of {len(df.columns)}")
return df[keep], removed
def pairwise_spearman(df_x, df_y, min_valid=10):
"""Compute per-feature Spearman correlation with pairwise NaN removal.
Args:
df_x: DataFrame (samples x features), aligned with df_y
df_y: DataFrame (samples x features), same shape as df_x
min_valid: Minimum number of valid (non-NaN) pairs required
Returns:
DataFrame with columns: rho, pvalue, n_valid
"""
nan_summary(df_x)
nan_summary(df_y)
results = []
for feature in df_x.columns:
x = df_x[feature].values
y = df_y[feature].values
mask = ~(np.isnan(x) | np.isnan(y))
n_valid = mask.sum()
if n_valid < min_valid:
results.append({'feature': feature, 'rho': np.nan,
'pvalue': np.nan, 'n_valid': n_valid})
continue
rho, pval = spearmanr(x[mask], y[mask])
results.append({'feature': feature, 'rho': rho,
'pvalue': pval, 'n_valid': n_valid})
result_df = pd.DataFrame(results).set_index('feature')
skipped = result_df['rho'].isna().sum()
if skipped > 0:
print(f"Skipped {skipped} features with < {min_valid} valid pairs")
return result_df
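A minimal end-to-end sketch of the four-step workflow, assuming two aligned DataFrames `expr` and `pheno` (hypothetical names) with samples as rows and the same feature columns:

```python
# hypothetical input: expr and pheno are aligned (samples x features) DataFrames
expr_f, n_removed = filter_degenerate(expr)             # Step 2
pheno_f = pheno[expr_f.columns]                         # keep matching features
res = pairwise_spearman(expr_f, pheno_f, min_valid=10)  # Steps 1, 3, 4
print(res.dropna(subset=['rho']).sort_values('pvalue').head())
```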
# WRONG: Bulk rank-then-correlate
ranked_x = df_x.rank()
ranked_y = df_y.rank()
corrs = ranked_x.corrwith(ranked_y)
# WRONG: Bulk corrwith with method parameter
corrs = df_x.corrwith(df_y, method='spearman')
# WRONG: numpy corrcoef on ranked arrays (propagates NaN)
corrs = np.corrcoef(df_x.rank().values.T, df_y.rank().values.T)
from joblib import Parallel, delayed
def parallel_spearman(df_x, df_y, min_valid=10, n_jobs=4):
"""Parallelized per-feature Spearman correlation."""
def compute_one(feature):
x = df_x[feature].values
y = df_y[feature].values
mask = ~(np.isnan(x) | np.isnan(y))
n = mask.sum()
if n < min_valid:
return feature, np.nan, np.nan, n
rho, pval = spearmanr(x[mask], y[mask])
return feature, rho, pval, n
results = Parallel(n_jobs=n_jobs)(
delayed(compute_one)(f) for f in df_x.columns
)
return pd.DataFrame(
results, columns=['feature', 'rho', 'pvalue', 'n_valid']
).set_index('feature')
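A hypothetical invocation, using the same aligned `df_x`/`df_y` as above. Note that joblib's process startup and data-transfer overhead can outweigh the gains on small matrices, so reserve this path for genuinely large feature counts:

```python
# n_jobs=-1 uses all available cores
res = parallel_spearman(df_x, df_y, min_valid=10, n_jobs=-1)
print(f"{res['rho'].notna().sum()} features with valid correlations")
```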
Related skills:
- statistical-analysis -- General statistical test selection and assumption checking
- degenerate-input-filtering -- Broader guide on filtering uninformative data before any statistical tests
- scikit-learn-machine-learning -- Feature selection and preprocessing pipelines