Help us improve
Share bugs, ideas, or general feedback.
From encode-toolkit
Executes ENCODE RNA-seq pipeline from FASTQ to gene quantification and signal tracks using Nextflow, STAR alignment, RSEM/Kallisto, with Docker and cloud deployment.
npx claudepluginhub ammawla/encode-toolkitHow this skill is triggered — by the user, by Claude, or both
Slash command
/encode-toolkit:pipeline-rnaseqThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
- User wants to run an RNA-seq processing pipeline from FASTQ to gene quantification
Executes ENCODE RNA-seq pipeline from FASTQ to gene quantification and signal tracks using Nextflow, STAR alignment, RSEM/Kallisto, with Docker and cloud deployment.
Runs nf-core/rnaseq bulk RNA-seq preprocessing from FASTQ or BAM inputs with preflight checks, reproducible outputs, and downstream handoff to DE skills.
Processes deep-sequencing coverage with deepTools: converts BAM to bigWig, runs QC (correlation, PCA, fingerprint), and generates TSS/peak heatmaps and profiles for ChIP-seq, ATAC-seq, or RNA-seq data.
Share bugs, ideas, or general feedback.
Execute the ENCODE RNA-seq processing pipeline from raw FASTQ files through splice-aware alignment, gene/transcript quantification, and strand-specific signal track generation. This skill provides a complete Nextflow DSL2 implementation following ENCODE uniform analysis standards.
RNA-seq measures transcriptome-wide gene expression by sequencing cDNA derived from cellular RNA. The ENCODE pipeline processes RNA-seq data through quality control, splice-aware alignment with STAR (2-pass mode), gene and transcript quantification with RSEM, optional fast pseudoalignment with Kallisto, and generation of strand-specific signal tracks as bigWig files.
Key design decisions: STAR 2-pass mode for maximum splice junction sensitivity, RSEM for accurate gene/transcript/isoform quantification including multi-mapped reads, stranded library protocol (dUTP/rf-stranded) as the ENCODE standard, and paired-end sequencing with a minimum of 30 million uniquely mapped reads per replicate.
| Reference | Journal | Year | DOI | Relevance |
|---|---|---|---|---|
| Dobin et al. "STAR: ultrafast universal RNA-seq aligner" | Bioinformatics | 2013 | 10.1093/bioinformatics/bts635 | Splice-aware aligner (~12,000 citations) |
| Li & Dewey "RSEM: accurate transcript quantification from RNA-Seq data" | BMC Bioinformatics | 2011 | 10.1186/1471-2105-12-323 | Gene/transcript quantification (~6,000 citations) |
| Bray et al. "Near-optimal probabilistic RNA-seq quantification" | Nature Biotechnology | 2016 | 10.1038/nbt.3519 | Fast pseudoalignment (~4,000 citations) |
| Wang et al. "RSeQC: quality control of RNA-seq experiments" | Bioinformatics | 2012 | 10.1093/bioinformatics/bts356 | RNA-seq QC suite (~3,500 citations) |
| ENCODE Project Consortium "Expanded encyclopaedias" | Nature | 2020 | 10.1038/s41586-020-2493-4 | ENCODE Phase 3 standards |
| Frankish et al. "GENCODE 2021" | Nucleic Acids Research | 2021 | 10.1093/nar/gkaa1087 | Gene annotation reference |
FASTQ ──> FastQC / Trim Galore ──> STAR (2-pass) ──> Genome BAM + Transcriptome BAM
| | |
| ┌─────────────────────────────────────┘ |
| v v
| Signal Track Generation RSEM Quantification
| (strand-specific bigWig) (gene + transcript + isoform)
| | |
| v v
| Plus strand bigWig genes.results (TPM/FPKM)
| Minus strand bigWig isoforms.results
| |
| ┌───────────────────────────────────────────────────────────┘
| v
| Kallisto (optional fast pseudoalignment)
| |
| v
└──> RSeQC + MultiQC ──> Aggregated QC Report
| Stage | Tool | Input | Output | Reference |
|---|---|---|---|---|
| 1. QC & Trimming | FastQC, Trim Galore | Raw FASTQ | Trimmed FASTQ | references/01-qc-trimming.md |
| 2. Alignment | STAR (2-pass) | Trimmed FASTQ | Genome BAM + Transcriptome BAM | references/02-star-alignment.md |
| 3. Quantification | RSEM, Kallisto | Transcriptome BAM / FASTQ | Gene/transcript counts, TPM, FPKM | references/03-quantification.md |
| 4. Signal Tracks | bedGraphToBigWig | STAR bedGraph | Strand-specific bigWig | references/04-signal-tracks.md |
| 5. QC Metrics | RSeQC, MultiQC | BAM, counts | Strandedness, coverage, saturation | references/05-qc-metrics.md |
sample_id,read1,read2,replicate,strandedness
SAMPLE1_rep1,rna_R1.fq.gz,rna_R2.fq.gz,1,reverse
SAMPLE1_rep2,rna_R1.fq.gz,rna_R2.fq.gz,2,reverse
Strandedness: ENCODE uses dUTP-based stranded libraries. The resulting reads are
reverse stranded (read 2 matches the sense strand). If unknown, the pipeline will
auto-detect strandedness using RSeQC infer_experiment.py.
| Protocol | Strandedness | RSEM flag | Kallisto flag | Common Usage |
|---|---|---|---|---|
| dUTP (ENCODE standard) | Reverse | --strandedness reverse | --rf-stranded | Most ENCODE RNA-seq |
| SMARTer / SMART-Seq2 | Unstranded | --strandedness none | (default) | Single-cell, low-input |
| Illumina TruSeq Stranded | Reverse | --strandedness reverse | --rf-stranded | Standard bulk RNA-seq |
| Directional ligation | Forward | --strandedness forward | --fr-stranded | Some legacy protocols |
| Metric | Threshold | Category | Source |
|---|---|---|---|
| Total sequenced reads | >=30M PE reads | Read depth | ENCODE |
| Uniquely mapped reads | >=70% of total | Alignment | ENCODE |
| Multi-mapped reads | <10% | Alignment | ENCODE |
| rRNA rate | <10% | Sample quality | ENCODE |
| Strandedness agreement | >90% | Library prep | RSeQC |
| Exonic rate | >60% | Mapping quality | RSeQC |
| Gene body coverage | Relatively uniform (5'/3' bias <1.5) | RNA integrity | RSeQC |
| Duplication rate | <60% | Library complexity | Picard |
| Detected genes (TPM>1) | >12,000 (human) | Sensitivity | ENCODE |
| Saturation | Approaching plateau at sequencing depth | Depth sufficiency | RSeQC |
| Application | Minimum Reads (PE) | Recommended | Notes |
|---|---|---|---|
| Gene-level expression | 20M | 30M | ENCODE minimum |
| Transcript-level expression | 40M | 60M | Isoform resolution requires more depth |
| Differential expression | 20M per sample | 30M per sample | 3+ biological replicates per condition |
| Novel junction discovery | 60M | 100M+ | STAR 2-pass mode benefits from depth |
| Fusion detection | 50M | 80M+ | Chimeric reads are rare |
nextflow run scripts/main.nf \
-profile local \
--reads 'fastq/*_R{1,2}.fq.gz' \
--genome GRCh38 \
--outdir results/
nextflow run scripts/main.nf \
-profile slurm \
--reads 'fastq/*_R{1,2}.fq.gz' \
--genome GRCh38 \
--outdir results/
nextflow run scripts/main.nf \
-profile gcp \
--reads 'gs://bucket/fastq/*_R{1,2}.fq.gz' \
--genome GRCh38 \
--outdir 'gs://bucket/results/'
nextflow run scripts/main.nf \
-profile aws \
--reads 's3://bucket/fastq/*_R{1,2}.fq.gz' \
--genome GRCh38 \
--outdir 's3://bucket/results/'
| Platform | Instance | Cost/Sample | Time/Sample | Notes |
|---|---|---|---|---|
| GCP | n1-highmem-8 | ~$3-6 | 2-4 hours | STAR index loading dominates; preemptible recommended |
| AWS | r5.2xlarge | ~$3-6 | 2-4 hours | r-series for STAR memory; spot recommended |
| Local | 8 cores, 32GB | $0 | 3-6 hours | Docker required; STAR needs 32GB+ RAM |
| SLURM | 8 cores, 32GB | Varies | 2-4 hours | Singularity recommended |
Memory note: STAR genome index loading requires ~32GB RAM for human. Machines with
<32GB will fail at the alignment step. Use --limitGenomeGenerateRAM to cap index size
for constrained environments, but this reduces mapping sensitivity.
results/
fastqc/ # Raw and trimmed QC reports
trimmed/ # Trimmed FASTQ files
star/
genome_bam/ # Genome-aligned BAM files
transcriptome_bam/ # Transcriptome BAM files (for RSEM)
logs/ # STAR alignment logs (Log.final.out)
junctions/ # Splice junction files (SJ.out.tab)
rsem/
genes/ # genes.results (gene_id, TPM, FPKM, expected_count)
isoforms/ # isoforms.results (transcript_id, TPM, FPKM)
kallisto/ # Kallisto quantification output (optional)
abundance.tsv # transcript TPMs and counts
signal/
plus/ # Plus-strand bigWig signal tracks
minus/ # Minus-strand bigWig signal tracks
qc/
rseqc/ # RSeQC output (strandedness, coverage, distribution)
multiqc/ # Aggregated QC report
logs/ # Nextflow execution logs
STAR loads the entire genome index into shared memory (~32GB for human). This is the most common failure mode. Always allocate at least 32GB RAM for the alignment step. On shared HPC systems, check memory limits per job.
Using incorrect strandedness results in near-zero gene counts. If you see uniformly low
counts, check strandedness with infer_experiment.py from RSeQC. ENCODE dUTP libraries
are reverse stranded.
FPKM values are not comparable across samples because they depend on total library composition. Use TPM (comparable across samples) or raw counts with DESeq2/edgeR normalization for differential expression.
RSEM uses an expectation-maximization algorithm to probabilistically assign multi-mapped reads. This is critical for gene families and repetitive elements. Do not pre-filter multi-mappers before RSEM quantification.
High rRNA contamination (>10%) indicates failed rRNA depletion. This reduces effective sequencing depth and inflates apparent duplication rates. Always check rRNA rate before downstream analysis.
STAR 1-pass mode only uses annotated splice junctions. 2-pass mode first discovers novel junctions then re-maps, critical for non-model organisms or samples with extensive alternative splicing.
| File | Description | Lines |
|---|---|---|
scripts/main.nf | Nextflow DSL2 pipeline | ~130 |
scripts/nextflow.config | Execution profiles (local/slurm/gcp/aws) | ~60 |
scripts/Dockerfile | Docker build with STAR, RSEM, Kallisto, RSeQC | ~35 |
After running on your own data, compare with ENCODE reference:
# Find matching ENCODE RNA-seq experiments
encode_search_experiments(
assay_title="total RNA-seq",
organ="pancreas",
biosample_type="tissue"
)
# Download ENCODE gene quantifications for comparison
encode_batch_download(
download_dir="/data/encode_reference/",
output_type="gene quantifications",
assay_title="total RNA-seq",
organ="pancreas",
assembly="GRCh38"
)
# Download ENCODE signal tracks for browser visualization
encode_search_files(
file_format="bigWig",
assay_title="total RNA-seq",
organ="pancreas",
output_type="signal of unique reads"
)
--outSAMstrandField and RSEM needs --strandedness. Using wrong strand settings can halve gene counts or assign reads to antisense genes.Goal: Process raw RNA-seq FASTQ files through the ENCODE pipeline to generate gene expression quantifications (TPM/FPKM) and signal tracks. Context: The ENCODE RNA-seq pipeline uses STAR 2-pass alignment and RSEM quantification, producing both gene-level and transcript-level expression estimates.
encode_get_experiment(accession="ENCSR000CPR")
Expected output:
{
"accession": "ENCSR000CPR",
"assay_title": "RNA-seq",
"biosample_summary": "K562",
"replicates": 2,
"status": "released"
}
encode_list_files(accession="ENCSR000CPR", file_format="fastq")
Expected output:
{
"files": [
{"accession": "ENCFF200RN1", "output_type": "reads", "paired_end": "1", "biological_replicates": [1], "file_size_mb": 3200},
{"accession": "ENCFF201RN2", "output_type": "reads", "paired_end": "2", "biological_replicates": [1], "file_size_mb": 3300}
]
}
nextflow run pipeline-rnaseq/main.nf \
--fastq_r1 ENCFF200RN1.fastq.gz \
--fastq_r2 ENCFF201RN2.fastq.gz \
--genome GRCh38 \
--annotation gencode.v36.annotation.gtf \
--strandedness reverse \
-profile docker
Key pipeline steps:
| Metric | Threshold | Purpose |
|---|---|---|
| Mapping rate | >= 70% | Alignment success |
| rRNA rate | < 10% | rRNA depletion efficiency |
| Replicate correlation | >= 0.9 | Biological consistency |
| Exonic rate | > 60% | Expected for mRNA-seq |
Compare gene expression with enhancer marks:
encode_search_experiments(assay_title="Histone ChIP-seq", biosample_term_name="K562", target="H3K27ac", organism="Homo sapiens")
Interpretation: Genes with high TPM AND nearby H3K27ac peaks have validated enhancer-gene connections. Low expression despite nearby enhancer marks suggests poised or tissue-specific regulation.
encode_search_experiments(assay_title="total RNA-seq", organ="liver", organism="Homo sapiens")
Expected output:
{
"total": 35,
"results": [
{"accession": "ENCSR300RNA", "assay_title": "RNA-seq", "biosample_summary": "liver", "status": "released"}
]
}
encode_list_files(accession="ENCSR300RNA", file_format="tsv", output_type="gene quantifications", assembly="GRCh38")
Expected output:
{
"files": [
{"accession": "ENCFF400GEQ", "output_type": "gene quantifications", "file_format": "tsv", "file_size_mb": 5.2}
]
}
encode_download_files(accessions=["ENCFF400GEQ"], download_dir="/data/rnaseq/quantification")
Expected output:
{
"downloaded": 1,
"md5_verified": true,
"files": ["/data/rnaseq/quantification/ENCFF400GEQ.tsv"]
}
| This skill produces... | Feed into... | Purpose |
|---|---|---|
| Gene expression (TPM/FPKM) | peak-annotation | Validate enhancer targets with expression data |
| Expression matrix | gtex-expression | Compare cell-line vs. tissue expression |
| Differential expression results | compare-biosamples | Identify tissue-specific gene regulation |
| Signal tracks (bigWig) | visualization-workflow | Display expression signal in genome browser |
| Expression quantifications | disease-research | Connect gene expression to disease phenotypes |
| Pipeline run parameters | data-provenance | Record STAR/RSEM versions and settings |
| QC metrics | quality-assessment | Validate against ENCODE RNA-seq standards |
When reporting RNA-seq pipeline results:
integrative-analysis to combine RNA-seq with ChIP-seq/ATAC-seq for regulatory inference, or compare-biosamples for cross-tissue expression comparison