Guides post-quantification analysis of omics data (bulk RNA-seq transcriptomics, proteomics) via a three-tiered approach: validated pipelines (DESeq2, MaxQuant), standard workflows, and custom methods.
---
Short Description: Comprehensive guide for analyzing omics data (transcriptomics, proteomics) using validated pipelines, standard workflows, or custom analysis methods.
Authors: HITS
Version: 1.0
Last Updated: December 2025
License: CC BY 4.0
Commercial Use: Allowed
This guide provides a three-tiered approach to omics data analysis, prioritizing validated pipelines and standard workflows before moving to custom analysis. Always start with Option 1 and proceed to subsequent options only if needed.
The guide covers: choosing an analysis tier, quality control, missing-value imputation, normalization, statistical test selection, multiple-testing correction, and visualization.
Note: This guide focuses on analysis of already-quantified data. For raw data processing (alignment, quantification), refer to specialized tools and pipelines.
A validated pipeline is a specific tool with peer-reviewed benchmarking data demonstrating performance on data like yours (e.g., DESeq2 for RNA-seq counts, MaxQuant for label-free proteomics). A standard workflow is the canonical sequence of QC → normalization → statistical test → multiple-testing correction assembled from accepted community practice but tuned to your specific dataset. Custom analysis is bespoke statistical or computational modeling required when neither prior tier covers the data type or research question. The progression Option 1 → Option 2 → Option 3 trades reproducibility for flexibility — always exhaust earlier tiers first.
Missing data in omics arises from three distinct mechanisms with different correct treatments. MCAR (Missing Completely At Random) means missingness is independent of any value — safe to impute with mean, median, or KNN. MAR (Missing At Random) means missingness depends on observed variables but not the unobserved value — KNN or model-based imputation is appropriate. MNAR (Missing Not At Random) means missingness depends on the missing value itself, typical in proteomics where low-abundance proteins drop below detection — requires left-censored imputation (minprob/QRILC) below the detection limit. Choosing the wrong mechanism systematically biases downstream statistics.
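A quick diagnostic for the mechanism is to correlate each feature's observed mean intensity with its missingness rate. The sketch below (synthetic data; all names illustrative) shows the strongly negative correlation characteristic of MNAR censoring below a detection limit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic proteomics matrix: 200 proteins x 6 samples, log2 intensities,
# with per-protein abundances spread around 25.
true_means = rng.normal(loc=25, scale=4, size=(200, 1))
X = true_means + rng.normal(scale=3, size=(200, 6))

# MNAR censoring: anything below the global detection limit goes missing.
limit = np.quantile(X, 0.2)
X_obs = np.where(X < limit, np.nan, X)

# Diagnostic: per-protein mean observed intensity vs. fraction missing
# (proteins with no observed values are dropped).
keep = ~np.isnan(X_obs).all(axis=1)
mean_obs = np.nanmean(X_obs[keep], axis=1)
frac_missing = np.isnan(X_obs).mean(axis=1)[keep]
r = np.corrcoef(mean_obs, frac_missing)[0, 1]

# A strongly negative correlation (low-abundance proteins missing more)
# points to MNAR; a correlation near zero is consistent with MCAR.
print(f"intensity-vs-missingness correlation: {r:.2f}")
```

Under MCAR the same diagnostic gives a correlation near zero, which is the signal that mean/median/KNN imputation is safe.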
Parametric tests (Student's t-test, Welch's t-test) assume approximate normality and (for Student's) equal variances; they have higher power than non-parametric tests when assumptions hold. Non-parametric tests (Mann-Whitney U, permutation) make weaker assumptions and are correct under skewed distributions or small n, at the cost of statistical power. The choice depends on sample size (n < 10 favors non-parametric), normality (Shapiro-Wilk / Anderson-Darling at the feature level), variance homogeneity (Levene's test), and outlier prevalence.
Omics analyses test thousands of features simultaneously. Without correction, expected false positives at α=0.05 across 20,000 genes is 1,000. Family-wise error rate (FWER) corrections like Bonferroni control the probability of any false positive but are conservative. False discovery rate (FDR) corrections like Benjamini-Hochberg control the expected proportion of false positives among reported significant features and are the standard for omics. Always report adjusted p-values, never raw p-values, when calling significance.
Use this tree to choose the right analysis tier for your data:
Have you searched for a validated
pipeline matching your data type?
│
┌─────────────┴─────────────┐
│ │
NO YES
│ │
▼ ▼
Run Method 1 Did you find a validated
(literature) AND pipeline with benchmarks
Method 2 (consortia matching your data type
workflows) FIRST and biological question?
│
┌───────┴───────┐
│ │
YES NO
│ │
▼ ▼
OPTION 1: Is your data a
Use validated common type
pipeline (RNA-seq counts,
(e.g., DESeq2, pre-quantified
edgeR, MaxQuant) proteomics)?
│
┌───────┴───────┐
│ │
YES NO
│ │
▼ ▼
OPTION 2: OPTION 3:
Standard Custom analysis
workflow (consult
(QC → norm → statistician;
test → FDR) document
thoroughly)
| Data type | Sample size | Has validated pipeline? | Recommended tier | Specific tool / approach |
|---|---|---|---|---|
| Bulk RNA-seq counts | n ≥ 3/group | Yes (DESeq2, edgeR) | Option 1 | DESeq2 (negative binomial, default FDR < 0.05) |
| Pre-quantified proteomics, normal-distributed | n ≥ 5/group | Sometimes | Option 1 if pipeline matches; else Option 2 | limma or t-test + BH-FDR |
| Pre-quantified proteomics, MNAR-heavy | n ≥ 5/group | No (mechanism-specific) | Option 2 | minprob imputation → t-test or Mann-Whitney → BH-FDR |
| Small-cohort omics (n < 5) | n < 5 | Rarely | Option 2 with caution | Permutation test, report effect sizes; flag results as preliminary |
| Multi-omics integration | Variable | Limited | Option 3 | MOFA, DIABLO, or custom Bayesian model |
| Novel data type (e.g., spatial multi-omics) | Variable | No | Option 3 | Build from first principles; cross-validate |
| Time-series omics | n per timepoint | Sometimes (maSigPro, ImpulseDE2) | Option 1 if available; else Option 3 | maSigPro for transcriptomics; custom for proteomics |
IMPORTANT: You MUST complete BOTH Method 1 AND Method 2 before proceeding to Option 2. Do not skip Method 2 even if Method 1 finds no results.
Search for validated analysis methods using web search tools or literature databases (PubMed, Google Scholar).
Search queries to try (use multiple):
"[DATA_TYPE]" "[ANALYSIS_TYPE]" validated pipeline best practices
"[DATA_TYPE]" analysis workflow "[ORGANISM]" published
"[DATA_TYPE]" "[TOOL_NAME]" validation benchmark comparison
Example for bulk RNA-seq:
"RNA-seq" "differential expression" validated pipeline human
"DESeq2" "edgeR" comparison validation RNA-seq
Example for proteomics:
"proteomics" "differential abundance" analysis validated methods
"proteomics" normalization imputation best practices
What to search for in results:
IMPORTANT: Spend adequate time searching literature. Look through at least the first 10-15 search results and check supplementary materials of relevant papers.
Review established workflows from major consortia and publications:
If you find validated pipelines or methods:
Example result format:
Data Type: Bulk RNA-seq
Analysis Goal: Differential expression
Pipeline: DESeq2 (v1.40.0)
Reference: Love MI, et al. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. PMID: 25516281
Validation: Validated in multiple benchmark studies, recommended for count data
Parameters: Default parameters, FDR < 0.05, log2FC > 1
If no validated pipelines found in BOTH Method 1 AND Method 2: Only then proceed to Option 2: Use Standard Workflows
RNA-seq (Bulk):
Proteomics (Pre-quantified):
CRITICAL: Quality control must be performed before any statistical analysis. Poor data quality will lead to unreliable results regardless of statistical methods used.
Check for outlier samples:
Check sample correlation:
Check for batch effects:
Assess missing value patterns:
Check feature detection consistency:
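The sample-level checks above can be sketched with a correlation matrix and SVD-based PCA (synthetic data with one deliberately noisy sample; a minimal sketch, not a full QC pipeline):

```python
import numpy as np

rng = np.random.default_rng(2)

n_samples, n_features = 8, 500
base = rng.normal(size=n_features)                      # shared profile
X = base + rng.normal(scale=1.0, size=(n_samples, n_features))
X[7] = base + rng.normal(scale=5.0, size=n_features)    # noisy outlier

# 1) Sample-sample Pearson correlation: outliers have a low median
#    correlation with all other samples.
C = np.corrcoef(X)
np.fill_diagonal(C, np.nan)
med_corr = np.nanmedian(C, axis=1)

# 2) PCA via SVD on centred data: outliers separate on the leading PCs
#    (plot pc_scores[:, :2] to inspect visually).
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pc_scores = U * S

outlier = int(np.argmin(med_corr))
print("flagged outlier sample:", outlier)
```

An Isolation Forest on the PC scores (e.g. scikit-learn's `IsolationForest`) is a stronger automated flag when more samples are available.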
CRITICAL: Choose imputation method based on missing value mechanism:
MNAR (Missing Not At Random): Use minimum probability imputation (minprob)
MCAR/MAR (Missing Completely/At Random): Use KNN imputation
Simple methods (if few missing values):
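A minimal sketch of mechanism-aware imputation, assuming a features x samples matrix of log intensities (helper names are illustrative; for MCAR/MAR, scikit-learn's `KNNImputer` is the usual stronger choice than the mean fallback shown here):

```python
import numpy as np

rng = np.random.default_rng(3)

X = rng.normal(loc=25, scale=2, size=(100, 6))
X[rng.uniform(size=X.shape) < 0.1] = np.nan        # 10% missing

def impute_minprob(X, q=0.01, scale=0.3, seed=0):
    """MNAR / left-censored imputation: draw from a narrow Gaussian
    centred at a low quantile of the observed distribution."""
    r = np.random.default_rng(seed)
    out = X.copy()
    lo = np.nanquantile(X, q)
    miss = np.isnan(X)
    out[miss] = r.normal(loc=lo, scale=scale, size=miss.sum())
    return out

def impute_feature_mean(X):
    """MCAR/MAR fallback: replace by each feature's observed mean."""
    out = X.copy()
    means = np.nanmean(X, axis=1, keepdims=True)
    miss = np.isnan(out)
    out[miss] = np.broadcast_to(means, X.shape)[miss]
    return out

X_mnar = impute_minprob(X)
X_mcar = impute_feature_mean(X)
print(np.isnan(X_mnar).sum(), np.isnan(X_mcar).sum())  # 0 0
```

Note the minprob draws land deliberately below the observed distribution, mimicking values lost under the detection limit.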
For RNA-seq count data: Normalization is typically handled by DESeq2/edgeR (size factors).
For proteomics/continuous data:
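For continuous intensity data, a common minimal scheme is log2 transformation followed by per-sample median centring, sketched here on synthetic data with deliberate per-sample loading bias:

```python
import numpy as np

rng = np.random.default_rng(4)

# Raw intensities with a sample-specific global scale shift (loading bias).
intensities = rng.lognormal(mean=10, sigma=1, size=(300, 4))
intensities *= np.array([1.0, 1.5, 0.7, 2.0])

X = np.log2(intensities)                        # variance stabilisation
X = X - np.median(X, axis=0, keepdims=True)     # per-sample median centring

print(np.round(np.median(X, axis=0), 6))        # all columns centred at 0
```

Quantile normalization or vsn are alternatives when sample distributions differ in shape, not just location.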
CRITICAL: Always check statistical test assumptions before performing analysis. Using the wrong test can lead to incorrect conclusions.
Key checks to perform:
Normality test:
Variance homogeneity test:
Sample size check:
Outlier check:
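The checks above map directly onto scipy.stats calls; the sketch below runs them on one well-behaved and one skewed synthetic feature:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Two groups of n=30 for one well-behaved and one heavily skewed feature.
normal_feature = rng.normal(loc=5, scale=1, size=(2, 30))
skewed_feature = rng.lognormal(mean=0, sigma=1.5, size=(2, 30))

# Shapiro-Wilk normality per group; p < 0.05 rejects normality.
p_norm = min(stats.shapiro(g).pvalue for g in normal_feature)
p_skew = min(stats.shapiro(g).pvalue for g in skewed_feature)

# Levene's test for variance homogeneity between the two groups.
p_var = stats.levene(normal_feature[0], normal_feature[1]).pvalue

print(f"shapiro (normal): {p_norm:.3f}  "
      f"shapiro (skewed): {p_skew:.2e}  levene: {p_var:.3f}")
```

In practice run these on a representative feature subset rather than every feature, and let the rejection rate drive the test choice.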
Test selection logic:
Implementation steps:
For each feature:
Apply FDR correction:
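The per-feature loop plus FDR correction can be sketched as follows (synthetic data with 50 truly shifted features; Welch's t-test is used as the default, with Mann-Whitney as the noted swap-in for non-normal features):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(6)

n_feat, n = 500, 8
ctrl = rng.normal(loc=10, scale=1, size=(n_feat, n))
treat = rng.normal(loc=10, scale=1, size=(n_feat, n))
treat[:50] += 3.0                       # first 50 features truly shifted

pvals = np.empty(n_feat)
for i in range(n_feat):
    # Welch's t-test (no equal-variance assumption); swap in
    # stats.mannwhitneyu for clearly non-normal features.
    pvals[i] = stats.ttest_ind(treat[i], ctrl[i], equal_var=False).pvalue

reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
log2fc = treat.mean(axis=1) - ctrl.mean(axis=1)  # data already log-scale

# Call significance on both adjusted p-value and effect size.
sig = reject & (np.abs(log2fc) > 1)
print("significant features:", int(sig.sum()))
```

The recovered set is essentially the 50 planted features, illustrating why the joint p_adj-and-fold-change filter is the standard call.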
Key libraries:
- scipy.stats: statistical tests (ttest_ind, mannwhitneyu, shapiro, levene)
- statsmodels.stats.multitest: FDR correction (multipletests with method='fdr_bh')
Volcano Plot:
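A volcano plot is just log2 fold change against -log10 adjusted p-value with the two thresholds overlaid. A minimal sketch (toy result table; the matplotlib part is optional and guarded so the coordinate logic stands alone):

```python
import numpy as np

# Toy DE results: x = log2 fold change, y = -log10 adjusted p-value.
log2fc = np.array([2.3, -1.8, 0.4, 3.1, -0.2])
p_adj = np.array([1e-6, 3e-4, 0.2, 1e-9, 0.8])

neglog_p = -np.log10(p_adj)
significant = (np.abs(log2fc) > 1) & (p_adj < 0.05)

try:
    import matplotlib
    matplotlib.use("Agg")               # headless backend
    import matplotlib.pyplot as plt
    plt.scatter(log2fc, neglog_p,
                c=np.where(significant, "red", "grey"))
    plt.axhline(-np.log10(0.05), ls="--")
    plt.axvline(1, ls="--"); plt.axvline(-1, ls="--")
    plt.xlabel("log2 fold change"); plt.ylabel("-log10 adjusted p")
    plt.savefig("volcano.png")
except ImportError:
    pass  # plotting is optional; the coordinates above are the point

print(significant.tolist())  # [True, True, False, True, False]
```

Colouring by the joint threshold makes the fold-change-vs-significance distinction (pitfall below) visually obvious.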
PCA Plot (for quality control):
Once you have completed the standard workflow:
If standard workflows don't meet your needs: Proceed to Option 3: Custom Analysis
Data Quality: Ensure high-quality data before custom analysis
Statistical Rigor:
Reproducibility:
Validation:
Step 1: Quality Control
Step 2: For RNA-seq count data, use DESeq2 (typically in R)
```r
library(DESeq2)

# Build the dataset from a raw count matrix and sample metadata,
# modelling the condition of interest.
dds <- DESeqDataSetFromMatrix(countData = count_matrix,
                              colData = sample_metadata,
                              design = ~ condition)

# Size-factor normalization, dispersion estimation, and Wald tests.
dds <- DESeq(dds)

# Results for treatment vs. control, with BH-adjusted p-values.
res <- results(dds, contrast = c("condition", "treatment", "control"))
```
Step 3: Functional Enrichment (optional)
Step 1: Quality Control
Step 2: Impute Missing Values
Step 3: Normalization
Step 4: Check for Batch Effects
Step 5: Differential Abundance Analysis
Step 6: Visualization
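Step 4's batch check (PCA + silhouette, as recommended in the best practices below) can be sketched like this — synthetic data with a planted batch shift, and a hand-rolled two-group silhouette to stay dependency-free (`sklearn.metrics.silhouette_score` is the usual tool):

```python
import numpy as np

rng = np.random.default_rng(7)

# 10 samples x 400 features, two batches with a systematic offset.
batch = np.array([0] * 5 + [1] * 5)
X = rng.normal(size=(10, 400))
X[batch == 1] += 2.0                    # batch shift on every feature

# PCA scores via SVD on centred data.
Xc = X - X.mean(axis=0)
U, S, _ = np.linalg.svd(Xc, full_matrices=False)
scores = (U * S)[:, :2]

def silhouette(scores, labels):
    """Mean silhouette for two groups (illustrative helper)."""
    D = np.linalg.norm(scores[:, None] - scores[None, :], axis=2)
    vals = []
    for i, lab in enumerate(labels):
        same = labels == lab
        same[i] = False                 # exclude self-distance
        a = D[i, same].mean()           # mean within-group distance
        b = D[i, labels != lab].mean()  # mean to the other group
        vals.append((b - a) / max(a, b))
    return float(np.mean(vals))

sil = silhouette(scores, batch)
# Silhouette well above ~0.3 -> batches separate; correct with ComBat,
# limma's removeBatchEffect, or a batch covariate in the model.
print(f"batch silhouette on PC1/PC2: {sil:.2f}")
```

If the silhouette is low, the batch variable is not structuring the data and aggressive correction risks removing real signal.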
Exhaust validated pipelines before building anything custom. Run both literature search and consortium-workflow review before falling back to bespoke analysis. Rationale: validated pipelines have peer-reviewed benchmarking; novel methods require their own validation effort and reduce reproducibility.
Perform sample-level QC before any statistical analysis. Use PCA + Isolation Forest for outlier detection, sample correlation matrices, and PCA + silhouette score for batch effects. Rationale: a single outlier sample or unrecognized batch effect can dominate test statistics and produce uninterpretable results regardless of the test chosen.
Diagnose the missing-value mechanism (MCAR / MAR / MNAR) before imputing. Check the correlation between mean intensity and missingness rate per feature. Rationale: imputing MNAR data with KNN biases low-abundance features upward; imputing MCAR data with minprob biases everything downward. Mechanism-aware imputation prevents systematic distortion.
Always check test assumptions, then choose the test — never the reverse. Run Shapiro-Wilk / Anderson-Darling for normality and Levene's for variance homogeneity on a representative feature subset. Rationale: applying a t-test to non-normal small-n data inflates type I error; defaulting to Mann-Whitney on well-behaved data wastes power.
Always apply FDR correction (Benjamini-Hochberg) for genome-wide tests. Report p_adj (or q-value), not raw p. Rationale: with 20,000 genes tested at α=0.05, ~1,000 false positives are expected without correction — the result set is meaningless.
Document every parameter and version, save intermediate outputs, and pin random seeds. Record tool version, parameter values, normalization method, imputation method, test choice, FDR threshold, and the seed for any stochastic step. Rationale: omics pipelines have many tunable knobs; without exact provenance the analysis cannot be reproduced or audited.
Validate findings on an independent dataset or with an orthogonal method whenever possible. Examples: confirm DE genes via qPCR, replicate in a public dataset (GEO, ArrayExpress), or compare across batches. Rationale: even FDR-controlled hits can be false positives driven by batch artifacts, contamination, or normalization choices.
Skipping QC and going directly to statistics. Problem: Outlier samples and batch effects produce false signals that pass statistical tests, polluting the result list with artifacts. How to avoid: Always run sample-level PCA, correlation matrices, and outlier detection before any differential test. Treat QC as mandatory, not optional.
Imputing missing values with a one-size-fits-all method. Problem: Using mean imputation on MNAR proteomics data biases low-abundance proteins; using minprob on MCAR data biases everything below the detection limit downward. How to avoid: Diagnose the mechanism (correlation between intensity and missingness), then pick an appropriate imputer: minprob for MNAR, KNN for MCAR/MAR.
Using t-tests on non-normal or small-n data. Problem: Student's t-test assumes normality and (with pooled variance) equal variances; with n < 10 and skewed data, type I error inflates well above the nominal α. How to avoid: Run normality and variance tests first; use Welch's t-test for unequal variance, Mann-Whitney for non-normal, and permutation tests for n < 5.
Reporting raw p-values without multiple testing correction.
Problem: Across thousands of features, raw p-values produce massive false discovery rates; the resulting "significant" gene lists are dominated by noise.
How to avoid: Always apply Benjamini-Hochberg FDR (or BY for dependent tests) and report adjusted p-values. Set p_adj < 0.05 (or q < 0.05) as the significance threshold.
Confusing fold change with statistical significance.
Problem: A high log2 fold change at high p_adj is unreliable noise; a low log2 fold change at very low p_adj may be real but biologically negligible.
How to avoid: Filter on both — typical thresholds are |log2FC| > 1 AND p_adj < 0.05. Report effect sizes alongside p-values.
Failing to correct for batch effects when present.
Problem: Batch effects masquerade as biological signal, especially in proteomics and multi-cohort studies; PC1 ends up reflecting batch rather than condition.
How to avoid: Check batch separation with PCA + silhouette score; if silhouette > ~0.3, apply ComBat, limma's removeBatchEffect, or include batch as a covariate in the model.
Treating Option 3 (custom analysis) as a shortcut. Problem: Jumping straight to custom methods without first running standard workflows skips peer-reviewed validation and makes results harder to publish and reproduce. How to avoid: Document a clear justification for why Options 1 and 2 are inadequate before moving to Option 3, and validate any custom method on simulated or held-out data.
Remember: Always start with validated pipelines (Option 1), then move to standard workflows (Option 2), and only use custom analysis (Option 3) when necessary. Document all steps and parameters for reproducibility. Quality control is essential at every stage of analysis. Always check statistical test assumptions before performing analysis.