Best practices for data aggregation, recalculation, and category management in scientific analyses. Covers when to recalculate vs reuse aggregated data, handling category changes, and ensuring analytical accuracy.
Expert guidance for making critical decisions in data analysis workflows, particularly around aggregation, recalculation, and maintaining analytical integrity.
When you have pre-aggregated data but need different categories or groupings:
For detailed patterns and code examples, see data-manipulation-recipes.md
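A minimal sketch of the recalculation approach, using hypothetical raw per-observation data (the column names and category mapping are illustrative, not from the skill's reference files): derive the new categories from the raw rows, then re-aggregate, rather than trying to split an existing aggregate.

```python
import pandas as pd

# Hypothetical raw data: one row per observation, carrying the fields
# needed to derive the new categories from scratch.
raw = pd.DataFrame({
    "scaffold": ["s1", "s2", "s3", "s4"],
    "position": ["terminal", "interstitial", "terminal", "none"],
})

# Recalculate categories from raw rows instead of reusing an aggregate
# that conflates them.
raw["category"] = raw["position"].map(
    {"terminal": "telomeric", "interstitial": "interstitial", "none": "absent"}
)

# Re-aggregate under the new scheme.
counts = raw["category"].value_counts()
```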
When merging datasets from multiple sources, a single identifier often isn't unique enough:
Build a composite key by joining the fields that are only unique together, using a separator that cannot appear in the data (such as `|` or `::`). For implementation details, see data-manipulation-recipes.md
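A short sketch of the composite-key merge, with made-up column names; `validate="one_to_one"` makes pandas raise if the composite key is still not unique.

```python
import pandas as pd

left = pd.DataFrame({"species": ["A", "A"], "scaffold": ["s1", "s2"], "x": [1, 2]})
right = pd.DataFrame({"species": ["A", "A"], "scaffold": ["s1", "s2"], "y": [10, 20]})

# Join the fields that are only unique together, with a separator ("::")
# that cannot occur inside either field.
for df in (left, right):
    df["key"] = df["species"] + "::" + df["scaffold"]

# validate="one_to_one" fails loudly if the key is still ambiguous.
merged = left.merge(right[["key", "y"]], on="key", how="left", validate="one_to_one")
```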
When one metric combines multiple independent features, separate into independent analyses:
For examples, see data-manipulation-recipes.md
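A sketch of the separation idea on invented data: a single "signal" count that mixes terminal and interstitial hits is ambiguous, so each feature gets its own column and its own summary.

```python
import pandas as pd

# Hypothetical conflated metric: "signal" sums two independent features,
# so a high value is ambiguous about which feature drives it.
df = pd.DataFrame({
    "scaffold": ["s1", "s2", "s3"],
    "terminal_hits": [2, 0, 1],
    "interstitial_hits": [0, 3, 1],
})
df["signal"] = df["terminal_hits"] + df["interstitial_hits"]  # conflated

# Split into independent features and analyze each on its own.
df["has_terminal"] = df["terminal_hits"] > 0
df["has_interstitial"] = df["interstitial_hits"] > 0
summary = df[["has_terminal", "has_interstitial"]].mean()
```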
Type mismatches are common when enriching DataFrames from external sources:
Guard assignments with `pd.notna()` checks. For the type-safe assignment pattern and examples, see data-manipulation-recipes.md
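A sketch of one type-safe pattern (the lookup dict and column names are hypothetical): assign through pandas' nullable `Int64` dtype so missing external values do not silently coerce integers to floats, and filter with `notna()` rather than comparing to NaN.

```python
import pandas as pd

df = pd.DataFrame({"species": ["A", "B", "C"]})

# External source may be missing entries ("C") or hold explicit nulls ("B").
external = {"A": 2021, "B": None}

# Nullable Int64 keeps 2021 an integer instead of coercing to 2021.0.
df["year"] = df["species"].map(external).astype("Int64")

# Filter on notna() rather than comparing to NaN / pd.NA directly.
known = df[df["year"].notna()]
```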
When enriching tabular data from AWS S3 or external repositories:
For implementation patterns, see enrichment-patterns.md
Column names don't always match their content. Always verify against source code before using categorical columns:
For the full verification workflow and prevention checklist, see validation-patterns.md
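A minimal verification sketch: compare the observed categories against the expected set taken from the producing source code (the column and expected values here are invented) before relying on the column.

```python
import pandas as pd

df = pd.DataFrame({"assembly_type": ["chromosome", "chromosome", "scaffold"]})

# Expected values come from reading the code that produced the column,
# not from the column name alone.
expected = {"chromosome", "scaffold", "contig"}
observed = set(df["assembly_type"].dropna().unique())
unexpected = observed - expected
assert not unexpected, f"Unexpected categories: {unexpected}"
```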
Derived columns may use inferior sources, causing silent data loss:
For diagnostic patterns, see validation-patterns.md
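One quick diagnostic, on hypothetical date columns: if a derived column is much sparser than a candidate source column, it may have been built from the wrong (inferior) field.

```python
import pandas as pd

df = pd.DataFrame({
    "date_best": ["2020-01-01", None, None],  # derived column, suspiciously sparse
    "date_submitted": ["2020-01-01", "2021-02-02", "2022-03-03"],
})

# Compare fill rates: a derived column far sparser than its candidate
# source suggests silent data loss in the derivation.
fill = df.notna().mean()
suspect = fill["date_best"] < fill["date_submitted"]
```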
Separate computation (notebooks) from interpretation (markdown files):
Keep an `analysis_files/` directory with per-figure markdown files. For directory structure and writing guidelines, see analysis-organization.md
When experimental design has multiple factors:
For interpretation framework and examples, see analysis-interpretation.md
When one category performs better on metric X but worse on related metrics Y and Z:
For documentation patterns, see analysis-interpretation.md
When external services use different species names than your metadata:
For reconciliation workflow and code, see species-reconciliation.md
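A toy normalizer illustrating the idea (the exact convention is an assumption; the real workflow is in species-reconciliation.md): map spacing/underscore variants to one form and drop subspecies qualifiers.

```python
def normalize_species(name: str) -> str:
    """Map 'Genus species' / 'Genus_species' variants to one convention,
    dropping subspecies or strain qualifiers beyond the binomial."""
    parts = name.strip().replace("_", " ").split()
    return "_".join(parts[:2])
```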
Track what percentage of your phylogenetic tree has data available:
For coverage analysis workflow, see species-reconciliation.md
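A minimal coverage computation on invented tip names (names must already be reconciled to a common convention for the set intersection to be meaningful):

```python
# Hypothetical tree tips and metadata species, already name-reconciled.
tree_tips = {"Homo_sapiens", "Mus_musculus", "Danio_rerio", "Gallus_gallus"}
species_with_data = {"Homo_sapiens", "Mus_musculus", "Xenopus_laevis"}

covered = tree_tips & species_with_data
coverage_pct = 100 * len(covered) / len(tree_tips)
missing = sorted(tree_tips - species_with_data)  # worth reporting explicitly
```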
When analyzing multiple groups, determine if lack of effect is real or insufficient power:
For reporting recommendations and examples, see analysis-interpretation.md
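A back-of-envelope power check (a normal approximation for a two-sample t-test at two-sided alpha = 0.05, power = 0.80; not the skill's prescribed method): compute the smallest effect size a group could have detected before concluding "no effect".

```python
import math

# Standard normal quantiles for alpha/2 = 0.025 and power = 0.80.
z_alpha, z_beta = 1.96, 0.84

def min_detectable_d(n_per_group: int) -> float:
    """Smallest Cohen's d detectable with n per group (normal approximation)."""
    return (z_alpha + z_beta) * math.sqrt(2 / n_per_group)

# A group of 10 can only detect very large effects, so a null result
# there may reflect insufficient power rather than a true absence.
small_group = min_detectable_d(10)   # ≈ 1.25
large_group = min_detectable_d(500)  # ≈ 0.18
```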
Temporal trends may reflect technology adoption rather than methodology improvements:
For the systematic testing approach, see analysis-interpretation.md
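One simple guard, on invented data: stratify the temporal trend by technology before interpreting it, since the pooled trend mixes technology adoption with within-technology change.

```python
import pandas as pd

# Hypothetical quality scores across years for two sequencing technologies.
df = pd.DataFrame({
    "year": [2018, 2019, 2020, 2021] * 2,
    "technology": ["short-read"] * 4 + ["long-read"] * 4,
    "quality": [50, 52, 51, 53, 80, 81, 83, 84],
})

# The pooled trend conflates adoption shifts with real improvement;
# compare the per-technology trends before drawing conclusions.
overall = df.groupby("year")["quality"].mean()
by_tech = df.groupby(["technology", "year"])["quality"].mean()
```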
When working with multiple intermediate dataset versions:
For workflow details, see enrichment-patterns.md
For large data files, compress instead of delete:
For compression benchmarks and workflows, see compression-strategies.md
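A small gzip sketch of "compress instead of delete" (the filename is hypothetical; benchmarks and the decision tree live in compression-strategies.md). Note that pandas reads `.csv.gz` files transparently, so downstream scripts often need only a filename change.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

# Stand-in for a large intermediate file you no longer need uncompressed.
src = Path(tempfile.mkdtemp()) / "intermediate_dataset.csv"
src.write_text("species,count\nA,1\nB,2\n")

# Stream-compress to intermediate_dataset.csv.gz without loading into memory.
dst = src.with_suffix(src.suffix + ".gz")
with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)

# Remove the original only after the compressed copy exists.
src.unlink()
```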
Before deciding to reuse aggregated data, check: Can you perfectly reconstruct raw data from aggregates? If NO, recalculate.
"""
Data source: scaffold_telomere_data.csv (n=6,356 scaffolds)
Recalculated: 2026-01-29
Reason: Previous aggregation conflated terminal and interstitial presence
Method: [describe categorization logic]
"""
original_total = df['cat1'] + df['cat2'] + df['cat3'] + df['cat4']
new_total = df['new_cat1'] + df['new_cat2'] + df['new_cat3']
assert (original_total == new_total).all(), "Category totals don't match!"
Recalculation is often faster than you think:
```python
# Modern pandas on 10,000+ rows
df['new_cat'] = df.apply(categorize_func, axis=1)
result = df.groupby('species')['new_cat'].value_counts()
# Often < 1 second
```
Optimize: use vectorized operations, filter to relevant columns, cache intermediate results.
| File | Content |
|---|---|
| data-manipulation-recipes.md | Recalculation patterns, composite keys, conflated features, type conversion |
| enrichment-patterns.md | AWS enrichment, data consolidation, date extraction, filtered dataset rebuilding |
| validation-patterns.md | Column name verification, data quality checks, data provenance |
| analysis-interpretation.md | Multi-factor design, paradoxical results, power limitations, technology confounding |
| species-reconciliation.md | Species name reconciliation, phylogenetic tree coverage |
| analysis-organization.md | Token-efficient analysis text organization, statistical results population |
| compression-strategies.md | File compression decision tree, benchmarks, script updates |