Guides post-quantification analysis of omics data (bulk RNA-seq transcriptomics, proteomics) via a three-tiered approach: validated pipelines (DESeq2, MaxQuant), standard workflows, and custom methods.
---
Short Description: Comprehensive guide for analyzing omics data (transcriptomics, proteomics) using validated pipelines, standard workflows, or custom analysis methods.
Authors: HITS
Version: 1.0
Last Updated: December 2025
License: CC BY 4.0
Commercial Use: Allowed
This guide provides a three-tiered approach to omics data analysis, prioritizing validated pipelines and standard workflows before moving to custom analysis. Always start with Option 1 and proceed to subsequent options only if needed.
The guide covers: choosing an analysis tier, quality control, missing-value imputation, normalization, statistical test selection, multiple-testing correction, and visualization.
Note: This guide focuses on analysis of already-quantified data. For raw data processing (alignment, quantification), refer to specialized tools and pipelines.
A validated pipeline is a specific tool with peer-reviewed benchmarking data demonstrating performance on data like yours (e.g., DESeq2 for RNA-seq counts, MaxQuant for label-free proteomics). A standard workflow is the canonical sequence of QC → normalization → statistical test → multiple-testing correction assembled from accepted community practice but tuned to your specific dataset. Custom analysis is bespoke statistical or computational modeling required when neither prior tier covers the data type or research question. The progression Option 1 → Option 2 → Option 3 trades reproducibility for flexibility — always exhaust earlier tiers first.
Missing data in omics arises from three distinct mechanisms with different correct treatments. MCAR (Missing Completely At Random) means missingness is independent of any value — safe to impute with mean, median, or KNN. MAR (Missing At Random) means missingness depends on observed variables but not the unobserved value — KNN or model-based imputation is appropriate. MNAR (Missing Not At Random) means missingness depends on the missing value itself, typical in proteomics where low-abundance proteins drop below detection — requires left-censored imputation (minprob/QRILC) below the detection limit. Choosing the wrong mechanism systematically biases downstream statistics.
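A quick diagnostic for the mechanism is to correlate each feature's observed mean intensity with its missingness rate. The sketch below (synthetic data; all names illustrative) shows the strongly negative correlation characteristic of MNAR censoring below a detection limit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic proteomics matrix: 200 proteins x 6 samples, log2 intensities,
# with per-protein abundances spread around 25.
true_means = rng.normal(loc=25, scale=4, size=(200, 1))
X = true_means + rng.normal(scale=3, size=(200, 6))

# MNAR censoring: anything below the global detection limit goes missing.
limit = np.quantile(X, 0.2)
X_obs = np.where(X < limit, np.nan, X)

# Diagnostic: per-protein mean observed intensity vs. fraction missing
# (proteins with no observed values are dropped).
keep = ~np.isnan(X_obs).all(axis=1)
mean_obs = np.nanmean(X_obs[keep], axis=1)
frac_missing = np.isnan(X_obs).mean(axis=1)[keep]
r = np.corrcoef(mean_obs, frac_missing)[0, 1]

# A strongly negative correlation (low-abundance proteins missing more)
# points to MNAR; a correlation near zero is consistent with MCAR.
print(f"intensity-vs-missingness correlation: {r:.2f}")
```

Under MCAR the same diagnostic gives a correlation near zero, which is the signal that mean/median/KNN imputation is safe.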
Parametric tests (Student's t-test, Welch's t-test) assume approximate normality and (for Student's) equal variances; they have higher power than non-parametric tests when assumptions hold. Non-parametric tests (Mann-Whitney U, permutation) make weaker assumptions and are correct under skewed distributions or small n, at the cost of statistical power. The choice depends on sample size (n < 10 favors non-parametric), normality (Shapiro-Wilk / Anderson-Darling at the feature level), variance homogeneity (Levene's test), and outlier prevalence.
Omics analyses test thousands of features simultaneously. Without correction, expected false positives at α=0.05 across 20,000 genes is 1,000. Family-wise error rate (FWER) corrections like Bonferroni control the probability of any false positive but are conservative. False discovery rate (FDR) corrections like Benjamini-Hochberg control the expected proportion of false positives among reported significant features and are the standard for omics. Always report adjusted p-values, never raw p-values, when calling significance.
Use this tree to choose the right analysis tier for your data:
Have you searched for a validated
pipeline matching your data type?
│
┌─────────────┴─────────────┐
│ │
NO YES
│ │
▼ ▼
Run Method 1 Did you find a validated
(literature) AND pipeline with benchmarks
Method 2 (consortia matching your data type
workflows) FIRST and biological question?
│
┌───────┴───────┐
│ │
YES NO
│ │
▼ ▼
OPTION 1: Is your data a
Use validated common type
pipeline (RNA-seq counts,
(e.g., DESeq2, pre-quantified
edgeR, MaxQuant) proteomics)?
│
┌───────┴───────┐
│ │
YES NO
│ │
▼ ▼
OPTION 2: OPTION 3:
Standard Custom analysis
workflow (consult
(QC → norm → statistician;
test → FDR) document
thoroughly)
| Data type | Sample size | Has validated pipeline? | Recommended tier | Specific tool / approach |
|---|---|---|---|---|
| Bulk RNA-seq counts | n ≥ 3/group | Yes (DESeq2, edgeR) | Option 1 | DESeq2 (negative binomial, default FDR < 0.05) |
| Pre-quantified proteomics, normal-distributed | n ≥ 5/group | Sometimes | Option 1 if pipeline matches; else Option 2 | limma or t-test + BH-FDR |
| Pre-quantified proteomics, MNAR-heavy | n ≥ 5/group | No (mechanism-specific) | Option 2 | minprob imputation → t-test or Mann-Whitney → BH-FDR |
| Small-cohort omics (n < 5) | n < 5 | Rarely | Option 2 with caution | Permutation test, report effect sizes; flag results as preliminary |
| Multi-omics integration | Variable | Limited | Option 3 | MOFA, DIABLO, or custom Bayesian model |
| Novel data type (e.g., spatial multi-omics) | Variable | No | Option 3 | Build from first principles; cross-validate |
| Time-series omics | n per timepoint | Sometimes (maSigPro, ImpulseDE2) | Option 1 if available; else Option 3 | maSigPro for transcriptomics; custom for proteomics |
IMPORTANT: You MUST complete BOTH Method 1 AND Method 2 before proceeding to Option 2. Do not skip Method 2 even if Method 1 finds no results.
Search for validated analysis methods using web search tools or literature databases (PubMed, Google Scholar).
Search queries to try (use multiple):
"[DATA_TYPE]" "[ANALYSIS_TYPE]" validated pipeline best practices
"[DATA_TYPE]" analysis workflow "[ORGANISM]" published
"[DATA_TYPE]" "[TOOL_NAME]" validation benchmark comparison
Example for bulk RNA-seq:
"RNA-seq" "differential expression" validated pipeline human
"DESeq2" "edgeR" comparison validation RNA-seq
Example for proteomics:
"proteomics" "differential abundance" analysis validated methods
"proteomics" normalization imputation best practices
What to search for in results:
IMPORTANT: Spend adequate time searching literature. Look through at least the first 10-15 search results and check supplementary materials of relevant papers.
Review established workflows from major consortia and publications:
If you find validated pipelines or methods:
Example result format:
Data Type: Bulk RNA-seq
Analysis Goal: Differential expression
Pipeline: DESeq2 (v1.40.0)
Reference: Love MI, et al. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. PMID: 25516281
Validation: Validated in multiple benchmark studies, recommended for count data
Parameters: Default parameters, FDR < 0.05, log2FC > 1
If no validated pipelines found in BOTH Method 1 AND Method 2: Only then proceed to Option 2: Use Standard Workflows
RNA-seq (Bulk):
Proteomics (Pre-quantified):
CRITICAL: Quality control must be performed before any statistical analysis. Poor data quality will lead to unreliable results regardless of statistical methods used.
Check for outlier samples:
Check sample correlation:
Check for batch effects:
Assess missing value patterns:
Check feature detection consistency:
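The sample-level checks above can be sketched with a correlation matrix and SVD-based PCA (synthetic data with one deliberately noisy sample; a minimal sketch, not a full QC pipeline):

```python
import numpy as np

rng = np.random.default_rng(2)

n_samples, n_features = 8, 500
base = rng.normal(size=n_features)                      # shared profile
X = base + rng.normal(scale=1.0, size=(n_samples, n_features))
X[7] = base + rng.normal(scale=5.0, size=n_features)    # noisy outlier

# 1) Sample-sample Pearson correlation: outliers have a low median
#    correlation with all other samples.
C = np.corrcoef(X)
np.fill_diagonal(C, np.nan)
med_corr = np.nanmedian(C, axis=1)

# 2) PCA via SVD on centred data: outliers separate on the leading PCs
#    (plot pc_scores[:, :2] to inspect visually).
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pc_scores = U * S

outlier = int(np.argmin(med_corr))
print("flagged outlier sample:", outlier)
```

An Isolation Forest on the PC scores (e.g. scikit-learn's `IsolationForest`) is a stronger automated flag when more samples are available.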
CRITICAL: Choose imputation method based on missing value mechanism:
MNAR (Missing Not At Random): Use minimum probability imputation (minprob)
MCAR/MAR (Missing Completely/At Random): Use KNN imputation
Simple methods (if few missing values):
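A minimal sketch of mechanism-aware imputation, assuming a features x samples matrix of log intensities (helper names are illustrative; for MCAR/MAR, scikit-learn's `KNNImputer` is the usual stronger choice than the mean fallback shown here):

```python
import numpy as np

rng = np.random.default_rng(3)

X = rng.normal(loc=25, scale=2, size=(100, 6))
X[rng.uniform(size=X.shape) < 0.1] = np.nan        # 10% missing

def impute_minprob(X, q=0.01, scale=0.3, seed=0):
    """MNAR / left-censored imputation: draw from a narrow Gaussian
    centred at a low quantile of the observed distribution."""
    r = np.random.default_rng(seed)
    out = X.copy()
    lo = np.nanquantile(X, q)
    miss = np.isnan(X)
    out[miss] = r.normal(loc=lo, scale=scale, size=miss.sum())
    return out

def impute_feature_mean(X):
    """MCAR/MAR fallback: replace by each feature's observed mean."""
    out = X.copy()
    means = np.nanmean(X, axis=1, keepdims=True)
    miss = np.isnan(out)
    out[miss] = np.broadcast_to(means, X.shape)[miss]
    return out

X_mnar = impute_minprob(X)
X_mcar = impute_feature_mean(X)
print(np.isnan(X_mnar).sum(), np.isnan(X_mcar).sum())  # 0 0
```

Note the minprob draws land deliberately below the observed distribution, mimicking values lost under the detection limit.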
For RNA-seq count data: Normalization is typically handled by DESeq2/edgeR (size factors).
For proteomics/continuous data:
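For continuous intensity data, a common minimal scheme is log2 transformation followed by per-sample median centring, sketched here on synthetic data with deliberate per-sample loading bias:

```python
import numpy as np

rng = np.random.default_rng(4)

# Raw intensities with a sample-specific global scale shift (loading bias).
intensities = rng.lognormal(mean=10, sigma=1, size=(300, 4))
intensities *= np.array([1.0, 1.5, 0.7, 2.0])

X = np.log2(intensities)                        # variance stabilisation
X = X - np.median(X, axis=0, keepdims=True)     # per-sample median centring

print(np.round(np.median(X, axis=0), 6))        # all columns centred at 0
```

Quantile normalization or vsn are alternatives when sample distributions differ in shape, not just location.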
CRITICAL: Always check statistical test assumptions before performing analysis. Using the wrong test can lead to incorrect conclusions.
Key checks to perform:
Normality test:
Variance homogeneity test:
Sample size check:
Outlier check:
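The checks above map directly onto scipy.stats calls; the sketch below runs them on one well-behaved and one skewed synthetic feature:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Two groups of n=30 for one well-behaved and one heavily skewed feature.
normal_feature = rng.normal(loc=5, scale=1, size=(2, 30))
skewed_feature = rng.lognormal(mean=0, sigma=1.5, size=(2, 30))

# Shapiro-Wilk normality per group; p < 0.05 rejects normality.
p_norm = min(stats.shapiro(g).pvalue for g in normal_feature)
p_skew = min(stats.shapiro(g).pvalue for g in skewed_feature)

# Levene's test for variance homogeneity between the two groups.
p_var = stats.levene(normal_feature[0], normal_feature[1]).pvalue

print(f"shapiro (normal): {p_norm:.3f}  "
      f"shapiro (skewed): {p_skew:.2e}  levene: {p_var:.3f}")
```

In practice run these on a representative feature subset rather than every feature, and let the rejection rate drive the test choice.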
Test selection logic:
Implementation steps:
For each feature:
Apply FDR correction:
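The per-feature loop plus FDR correction can be sketched as follows (synthetic data with 50 truly shifted features; Welch's t-test is used as the default, with Mann-Whitney as the noted swap-in for non-normal features):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(6)

n_feat, n = 500, 8
ctrl = rng.normal(loc=10, scale=1, size=(n_feat, n))
treat = rng.normal(loc=10, scale=1, size=(n_feat, n))
treat[:50] += 3.0                       # first 50 features truly shifted

pvals = np.empty(n_feat)
for i in range(n_feat):
    # Welch's t-test (no equal-variance assumption); swap in
    # stats.mannwhitneyu for clearly non-normal features.
    pvals[i] = stats.ttest_ind(treat[i], ctrl[i], equal_var=False).pvalue

reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
log2fc = treat.mean(axis=1) - ctrl.mean(axis=1)  # data already log-scale

# Call significance on both adjusted p-value and effect size.
sig = reject & (np.abs(log2fc) > 1)
print("significant features:", int(sig.sum()))
```

The recovered set is essentially the 50 planted features, illustrating why the joint p_adj-and-fold-change filter is the standard call.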
Key libraries:
- scipy.stats: statistical tests (ttest_ind, mannwhitneyu, shapiro, levene)
- statsmodels.stats.multitest: FDR correction (multipletests with method='fdr_bh')
Volcano Plot:
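A volcano plot is just log2 fold change against -log10 adjusted p-value with the two thresholds overlaid. A minimal sketch (toy result table; the matplotlib part is optional and guarded so the coordinate logic stands alone):

```python
import numpy as np

# Toy DE results: x = log2 fold change, y = -log10 adjusted p-value.
log2fc = np.array([2.3, -1.8, 0.4, 3.1, -0.2])
p_adj = np.array([1e-6, 3e-4, 0.2, 1e-9, 0.8])

neglog_p = -np.log10(p_adj)
significant = (np.abs(log2fc) > 1) & (p_adj < 0.05)

try:
    import matplotlib
    matplotlib.use("Agg")               # headless backend
    import matplotlib.pyplot as plt
    plt.scatter(log2fc, neglog_p,
                c=np.where(significant, "red", "grey"))
    plt.axhline(-np.log10(0.05), ls="--")
    plt.axvline(1, ls="--"); plt.axvline(-1, ls="--")
    plt.xlabel("log2 fold change"); plt.ylabel("-log10 adjusted p")
    plt.savefig("volcano.png")
except ImportError:
    pass  # plotting is optional; the coordinates above are the point

print(significant.tolist())  # [True, True, False, True, False]
```

Colouring by the joint threshold makes the fold-change-vs-significance distinction (pitfall below) visually obvious.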
PCA Plot (for quality control):
Once you have completed the standard workflow:
If standard workflows don't meet your needs: Proceed to Option 3: Custom Analysis
Data Quality: Ensure high-quality data before custom analysis
Statistical Rigor:
Reproducibility:
Validation:
Step 1: Quality Control
Step 2: For RNA-seq count data, use DESeq2 (typically in R)
```r
library(DESeq2)

# Build the dataset from a raw count matrix and sample metadata,
# modelling the condition of interest.
dds <- DESeqDataSetFromMatrix(countData = count_matrix,
                              colData = sample_metadata,
                              design = ~ condition)

# Size-factor normalization, dispersion estimation, and Wald tests.
dds <- DESeq(dds)

# Results for treatment vs. control, with BH-adjusted p-values.
res <- results(dds, contrast = c("condition", "treatment", "control"))
```
Step 3: Functional Enrichment (optional)
Step 1: Quality Control
Step 2: Impute Missing Values
Step 3: Normalization
Step 4: Check for Batch Effects
Step 5: Differential Abundance Analysis
Step 6: Visualization
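Step 4's batch check (PCA + silhouette, as recommended in the best practices below) can be sketched like this — synthetic data with a planted batch shift, and a hand-rolled two-group silhouette to stay dependency-free (`sklearn.metrics.silhouette_score` is the usual tool):

```python
import numpy as np

rng = np.random.default_rng(7)

# 10 samples x 400 features, two batches with a systematic offset.
batch = np.array([0] * 5 + [1] * 5)
X = rng.normal(size=(10, 400))
X[batch == 1] += 2.0                    # batch shift on every feature

# PCA scores via SVD on centred data.
Xc = X - X.mean(axis=0)
U, S, _ = np.linalg.svd(Xc, full_matrices=False)
scores = (U * S)[:, :2]

def silhouette(scores, labels):
    """Mean silhouette for two groups (illustrative helper)."""
    D = np.linalg.norm(scores[:, None] - scores[None, :], axis=2)
    vals = []
    for i, lab in enumerate(labels):
        same = labels == lab
        same[i] = False                 # exclude self-distance
        a = D[i, same].mean()           # mean within-group distance
        b = D[i, labels != lab].mean()  # mean to the other group
        vals.append((b - a) / max(a, b))
    return float(np.mean(vals))

sil = silhouette(scores, batch)
# Silhouette well above ~0.3 -> batches separate; correct with ComBat,
# limma's removeBatchEffect, or a batch covariate in the model.
print(f"batch silhouette on PC1/PC2: {sil:.2f}")
```

If the silhouette is low, the batch variable is not structuring the data and aggressive correction risks removing real signal.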
Exhaust validated pipelines before building anything custom. Run both literature search and consortium-workflow review before falling back to bespoke analysis. Rationale: validated pipelines have peer-reviewed benchmarking; novel methods require their own validation effort and reduce reproducibility.
Perform sample-level QC before any statistical analysis. Use PCA + Isolation Forest for outlier detection, sample correlation matrices, and PCA + silhouette score for batch effects. Rationale: a single outlier sample or unrecognized batch effect can dominate test statistics and produce uninterpretable results regardless of the test chosen.
Diagnose the missing-value mechanism (MCAR / MAR / MNAR) before imputing. Check the correlation between mean intensity and missingness rate per feature. Rationale: imputing MNAR data with KNN biases low-abundance features upward; imputing MCAR data with minprob biases everything downward. Mechanism-aware imputation prevents systematic distortion.
Always check test assumptions, then choose the test — never the reverse. Run Shapiro-Wilk / Anderson-Darling for normality and Levene's for variance homogeneity on a representative feature subset. Rationale: applying a t-test to non-normal small-n data inflates type I error; defaulting to Mann-Whitney on well-behaved data wastes power.
Always apply FDR correction (Benjamini-Hochberg) for genome-wide tests. Report p_adj (or q-value), not raw p. Rationale: with 20,000 genes tested at α=0.05, ~1,000 false positives are expected without correction — the result set is meaningless.
Document every parameter and version, save intermediate outputs, and pin random seeds. Record tool version, parameter values, normalization method, imputation method, test choice, FDR threshold, and the seed for any stochastic step. Rationale: omics pipelines have many tunable knobs; without exact provenance the analysis cannot be reproduced or audited.
Validate findings on an independent dataset or with an orthogonal method whenever possible. Examples: confirm DE genes via qPCR, replicate in a public dataset (GEO, ArrayExpress), or compare across batches. Rationale: even FDR-controlled hits can be false positives driven by batch artifacts, contamination, or normalization choices.
Skipping QC and going directly to statistics. Problem: Outlier samples and batch effects produce false signals that pass statistical tests, polluting the result list with artifacts. How to avoid: Always run sample-level PCA, correlation matrices, and outlier detection before any differential test. Treat QC as mandatory, not optional.
Imputing missing values with a one-size-fits-all method. Problem: Using mean imputation on MNAR proteomics data biases low-abundance proteins; using minprob on MCAR data biases everything below the detection limit downward. How to avoid: Diagnose the mechanism (correlation between intensity and missingness), then pick an appropriate imputer: minprob for MNAR, KNN for MCAR/MAR.
Using t-tests on non-normal or small-n data. Problem: Student's t-test assumes normality and (with pooled variance) equal variances; with n < 10 and skewed data, type I error inflates well above the nominal α. How to avoid: Run normality and variance tests first; use Welch's t-test for unequal variance, Mann-Whitney for non-normal, and permutation tests for n < 5.
Reporting raw p-values without multiple testing correction.
Problem: Across thousands of features, raw p-values produce massive false discovery rates; the resulting "significant" gene lists are dominated by noise.
How to avoid: Always apply Benjamini-Hochberg FDR (or BY for dependent tests) and report adjusted p-values. Set p_adj < 0.05 (or q < 0.05) as the significance threshold.
Confusing fold change with statistical significance.
Problem: A high log2 fold change at high p_adj is unreliable noise; a low log2 fold change at very low p_adj may be real but biologically negligible.
How to avoid: Filter on both — typical thresholds are |log2FC| > 1 AND p_adj < 0.05. Report effect sizes alongside p-values.
Failing to correct for batch effects when present.
Problem: Batch effects masquerade as biological signal, especially in proteomics and multi-cohort studies; PC1 ends up reflecting batch rather than condition.
How to avoid: Check batch separation with PCA + silhouette score; if silhouette > ~0.3, apply ComBat, limma's removeBatchEffect, or include batch as a covariate in the model.
Treating Option 3 (custom analysis) as a shortcut. Problem: Jumping straight to custom methods without first running standard workflows skips peer-reviewed validation and makes results harder to publish and reproduce. How to avoid: Document a clear justification for why Options 1 and 2 are inadequate before moving to Option 3, and validate any custom method on simulated or held-out data.
Remember: Always start with validated pipelines (Option 1), then move to standard workflows (Option 2), and only use custom analysis (Option 3) when necessary. Document all steps and parameters for reproducibility. Quality control is essential at every stage of analysis. Always check statistical test assumptions before performing analysis.