Best practices for creating comprehensive Jupyter notebook data analyses with statistical rigor, outlier handling, and publication-quality visualizations. Includes Claude API image size helpers.
Install:

```sh
npx claudepluginhub joshuarweaver/cascade-ai-ml-engineering --plugin delphine-l-claude-global
```

This skill is limited to a specific set of tools.
Expert knowledge for creating comprehensive, statistically rigorous Jupyter notebook analyses.
When generating images to share with Claude, images must not exceed 8000 pixels in either dimension. Add this helper to your notebook imports:
```python
# Standard imports with Claude size checking
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image

MAX_CLAUDE_DIM = 7999  # Claude API limit (8000 px) with a safety margin

def save_figure(filename, dpi=300, **kwargs):
    """Save figure with automatic Claude size constraint check."""
    plt.savefig(filename, dpi=dpi, bbox_inches='tight', **kwargs)
    # Verify and auto-resize if needed
    img = Image.open(filename)
    if img.width > MAX_CLAUDE_DIM or img.height > MAX_CLAUDE_DIM:
        print(f"Auto-resizing {filename} for Claude compatibility")
        print(f"  Original: {img.width}x{img.height}")
        # thumbnail() resizes in place, preserving aspect ratio
        img.thumbnail((MAX_CLAUDE_DIM, MAX_CLAUDE_DIM), Image.Resampling.LANCZOS)
        img.save(filename)
        print(f"  Resized: {img.width}x{img.height}")
    else:
        print(f"OK {filename}: {img.width}x{img.height}")

# Safe figure sizes for Claude (300 DPI)
FIG_SIZES = {
    'small':  (7, 5),    # 2100x1500 px
    'medium': (12, 9),   # 3600x2700 px
    'large':  (20, 15),  # 6000x4500 px
    'max':    (26, 26),  # 7800x7800 px - maximum safe
}

# Use in notebook
fig, ax = plt.subplots(figsize=FIG_SIZES['medium'])
# ... plotting code ...
save_figure('figure.png')
```
For complete image size guidance, see the data-visualization skill.
Use structured notebook patterns for multi-source data merging and enrichment. Key principles:
- Gate expensive or external steps behind configuration flags (e.g., `ENABLE_AWS_FETCH = False`, `TEST_MODE = True`), as sketched below.

For detailed patterns including data update, enrichment, and AWS GenomeArk workflows, see notebook-patterns.md.
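A minimal sketch of the flag-gated configuration cell, assuming a local JSON cache; the cache path and `fetch_from_genomeark` helper are hypothetical:

```python
# Top-of-notebook configuration cell: gate slow or external steps behind flags.
ENABLE_AWS_FETCH = False  # set True to pull fresh data from AWS GenomeArk
TEST_MODE = True          # set True to run on a small subset while iterating

import json
from pathlib import Path

CACHE_PATH = Path("data/assemblies.json")  # hypothetical local cache

if ENABLE_AWS_FETCH:
    # Expensive network step -- only runs when explicitly enabled.
    records = fetch_from_genomeark()  # hypothetical fetch helper
    CACHE_PATH.write_text(json.dumps(records))
else:
    records = json.loads(CACHE_PATH.read_text())

if TEST_MODE:
    records = records[:50]  # small slice keeps reruns fast
print(f"Loaded {len(records)} records (TEST_MODE={TEST_MODE})")
```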
Always use the NotebookEdit tool for .ipynb file modifications -- never the Edit tool, which corrupts the JSON structure.
Three modes: replace (update cell content), insert (add new cell after target), delete (remove cell).
Key rules:
- Specify the `cell_type` when inserting cells.
- Use `jq` or Python JSON parsing for bulk operations (see the sketch below).

For NotebookEdit usage, programmatic JSON manipulation, bulk operations, and cell newline handling, see notebook-editing.md.
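Since a .ipynb file is a JSON document with a top-level `cells` list, bulk operations can be done with plain JSON parsing. A minimal sketch (the filename and appended cell are illustrative):

```python
import json

with open("analysis.ipynb") as f:
    nb = json.load(f)

# Example bulk operation: clear outputs of every code cell.
for cell in nb["cells"]:
    if cell["cell_type"] == "code":
        cell["outputs"] = []
        cell["execution_count"] = None

# A cell's "source" is a list of lines, each ending in "\n" except the last.
nb["cells"].append({
    "cell_type": "markdown",
    "metadata": {},
    "source": ["## Verification\n", "Claims checked below."],
})

with open("analysis.ipynb", "w") as f:
    json.dump(nb, f, indent=1)  # indent=1 keeps diffs close to Jupyter's output
```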
Use `scipy.stats.pearsonr` (and comparable tests) to back correlation claims. BEFORE finalizing any analysis notebook, verify ALL statistical claims against actual computed values: text claims can become stale after data or code updates. Extract the claims, rerun the tests, and create a verification table.
For detailed statistical methods, outlier removal code, claim verification workflow, and confounding analysis, see statistical-methods.md.
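A minimal sketch of the verification step, assuming arrays `genome_sizes` and `heterozygosities` already exist in the notebook; the claimed values are placeholders:

```python
import numpy as np
from scipy.stats import pearsonr

# Values currently claimed in the notebook's markdown text (placeholders).
claimed_r, claimed_p = 0.42, 0.003

# Recompute from the data actually in memory.
r, p = pearsonr(np.asarray(genome_sizes, dtype=float),
                np.asarray(heterozygosities, dtype=float))

# Verification table: stale text claims show up as mismatches.
match = np.isclose(r, claimed_r, atol=0.01)
print(f"{'claim':<30}{'claimed':>10}{'computed':>10}{'match':>7}")
print(f"{'genome size vs het. (r)':<30}{claimed_r:>10.2f}{r:>10.2f}{str(match):>7}")
print(f"{'genome size vs het. (p)':<30}{claimed_p:>10.3f}{p:>10.3f}")
```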
For publication-quality figures:
- Use Blue `#0173B2` + Orange `#DE8F05` (colorblind-safe) for two-group comparisons.
- Use `<img>` tags in markdown cells for responsive SVG/PNG scaling.
- Edit SVG `viewBox` attributes directly (no ImageMagick needed).

For detailed font size tables, color palette code, imbalance handling, SVG manipulation, and DPI management, see visualization-guide.md.
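A minimal sketch applying the two-group palette with the `save_figure` helper from above; the groups and data are toy examples:

```python
import numpy as np
import matplotlib.pyplot as plt

GROUP_COLORS = {"control": "#0173B2", "treatment": "#DE8F05"}  # blue / orange

rng = np.random.default_rng(0)  # toy data for illustration
groups = {
    "control": rng.normal(10.0, 2.0, 100),
    "treatment": rng.normal(12.0, 2.0, 100),
}

fig, ax = plt.subplots(figsize=FIG_SIZES['small'])
for name, values in groups.items():
    ax.hist(values, bins=20, alpha=0.6, color=GROUP_COLORS[name], label=name)
ax.set_xlabel("Measurement")
ax.set_ylabel("Count")
ax.legend()
save_figure('two_group_comparison.png')  # helper defined earlier
```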
For analyses with 5+ figures that are being prepared for publication, split the notebook. When splitting, recreate all calculated columns and variable definitions in each split; when deprecating, create dated directories with documentation.
For figure usage analysis, splitting strategies, dual-notebook workflow, publication notebook structure, TOC generation, deprecation workflow, and migration guides, see notebook-organization.md.
For path management, HTML/PDF/LaTeX export, sharing package structure, and output preservation guidelines, see sharing-and-export.md.
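One route for HTML export from within Python is nbconvert's exporter API; a minimal sketch (file names are placeholders):

```python
from nbconvert import HTMLExporter

# Convert an executed notebook into a standalone HTML report.
body, resources = HTMLExporter().from_filename("analysis.ipynb")

with open("analysis.html", "w", encoding="utf-8") as f:
    f.write(body)
```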
For creating multiple similar analysis cells:
```python
# Template for generating similar analysis cells; fill-in fields use str.format.
template = '''
if len(data_with_species) > 0:
    print('Analyzing {display} ({unit})...\\n')
    species_data = {{}}
    for inv in data_with_species:
        {name} = safe_float_convert(inv.get('{name}'))
        if {name} is None:
            continue
        # ... analysis code
'''

characteristics = [
    {'name': 'genome_size', 'display': 'Genome Size', 'unit': 'Gb'},
    {'name': 'heterozygosity', 'display': 'Heterozygosity', 'unit': '%'},
]

for char in characteristics:
    code = template.format(**char)
    # Insert `code` as a new code cell (e.g., via NotebookEdit)
```
Define once, reuse throughout:
```python
def safe_float_convert(value):
    """Convert a string to float, handling comma separators."""
    # Check `is None` rather than truthiness so numeric 0 is not rejected.
    if value is None or not str(value).strip():
        return None
    try:
        return float(str(value).replace(',', ''))
    except (ValueError, TypeError):
        return None
```
Key pitfalls to watch for:
- Avoid `data` as a loop variable (it shadows the global).
- Check `df.columns.tolist()` before processing (see the sketch below).
- Use `jq` for notebooks larger than 256 KB.

For detailed troubleshooting, variable validation, debugging techniques, and environment setup, see troubleshooting.md.
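A minimal sketch of the column check, assuming a DataFrame `df` is already loaded; the expected column names are illustrative:

```python
# Validate expected columns up front so downstream cells fail fast and clearly.
expected = {"species", "genome_size", "heterozygosity"}
actual = set(df.columns.tolist())

missing = expected - actual
if missing:
    raise KeyError(f"Missing expected columns: {sorted(missing)}; "
                   f"available: {sorted(actual)}")
```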
Reference files:

| File | Contents |
|---|---|
| notebook-patterns.md | Data update, enrichment, AWS GenomeArk patterns |
| notebook-editing.md | NotebookEdit tool, programmatic manipulation, metrics updates |
| visualization-guide.md | Publication figures, colors, image display, SVG, DPI |
| statistical-methods.md | Outlier handling, statistical rigor, claim verification |
| notebook-organization.md | Splitting, dual-notebook, deprecation, figure analysis |
| sharing-and-export.md | Paths, HTML/PDF export, sharing packages |
| troubleshooting.md | Common pitfalls, debugging, validation, environment |