From encode-toolkit
Integrates ENCODE RNA-seq, ATAC-seq, Histone ChIP-seq, and TF ChIP-seq data for tissues/cell types to build regulatory landscapes with ChromHMM annotation, enhancer linkage, and TF motif enrichment.
npx claudepluginhub ammawla/encode-toolkit --plugin encode-toolkit
Slash command: /encode-toolkit:multi-omics-integration
- User wants to integrate multiple ENCODE data types (RNA-seq + ATAC-seq + ChIP-seq) for a tissue
Layer RNA-seq, ATAC-seq, Histone ChIP-seq, and TF ChIP-seq data from ENCODE to build a comprehensive regulatory landscape for a tissue or cell type.
The question: "What regulatory elements are active in my tissue, and how do expression, chromatin accessibility, histone marks, and TF binding converge to define cell identity?"
No single assay captures the full picture of gene regulation. RNA-seq tells you what is expressed. ATAC-seq tells you where chromatin is open. Histone ChIP-seq tells you how chromatin is modified. TF ChIP-seq tells you who is binding. Each assay provides one dimension; integrating them reveals the regulatory logic.
Mawla et al. (2023, BMC Genomics) demonstrated this integrative approach by comparing ATAC-seq chromatin accessibility between alpha, beta, and delta cells in mouse pancreatic islets. Key findings:
- Cell type-specific chromatin accessibility defines cell identity: Differentially accessible regions between alpha, beta, and delta cells map to cell type-specific enhancers. Both alpha and delta cells appear poised to become beta cells but are repressed from doing so.
- Distal-intergenic enrichment in beta cells: Differentially accessible regions are preferentially enriched in distal-intergenic regions in beta cells relative to alpha or delta cells, indicating a larger enhancer repertoire.
- TF motif enrichment reveals regulatory logic: Differentially accessible regions are enriched for binding motifs of known lineage-defining TFs, connecting chromatin structure to transcriptional regulation.
- Cross-validation with expression: Common endocrine enhancers (accessible in all three cell types) map near genes expressed in all cell types, while cell type-specific enhancers map near differentially expressed genes.
- Enhancer databases as validation: Chromatin accessibility analysis confirmed enhancer regions previously reported in the literature and identified novel ones.
Multi-omics integration is not a single workflow — the approach depends on the question:
| Question | Required Data Layers | Approach |
|---|---|---|
| "What enhancers are active in my tissue?" | ATAC-seq + H3K27ac + RNA-seq | Intersection of accessible + H3K27ac+ regions near expressed genes |
| "What chromatin states exist?" | H3K4me1 + H3K4me3 + H3K27ac + H3K27me3 + H3K36me3 | ChromHMM segmentation |
| "Which TFs drive cell identity?" | ATAC-seq + TF ChIP-seq + RNA-seq | TF footprinting + motif enrichment in accessible regions |
| "What distinguishes cell type A from B?" | Cell type-resolved ATAC-seq + RNA-seq | Differential accessibility + expression correlation |
| "Where are super-enhancers?" | H3K27ac + H3K4me1 + ATAC-seq | ROSE algorithm on H3K27ac + accessibility confirmation |
| "What are poised vs. active elements?" | H3K4me1 + H3K27ac + H3K27me3 | Poised = H3K4me1+ H3K27ac- (± H3K27me3+) |
| "What is the full regulatory network?" | All layers + multiome if available | GRaNIE (bulk) or SCENIC+ (single-cell) |
| "Which variants affect regulation?" | Enhancer catalog + variant list | Enformer variant effect prediction |
Clarify with the user which question they are asking before proceeding.
For each data layer, search ENCODE systematically:
encode_search_experiments(
assay_title="total RNA-seq",
organ="pancreas",
biosample_type="tissue",
limit=50
)
For cell type-resolved expression, also check:
encode_search_experiments(
assay_title="total RNA-seq",
biosample_term_name="GM12878", # specific cell line if applicable
limit=50
)
encode_search_experiments(
assay_title="ATAC-seq",
organ="pancreas",
limit=50
)
Search for each core mark separately:
core_marks = ["H3K27ac", "H3K4me1", "H3K4me3", "H3K27me3", "H3K36me3"]
# Optionally: "H3K9me3" (heterochromatin), "H3K9ac" (active)
for mark in core_marks:
    encode_search_experiments(
        assay_title="Histone ChIP-seq",
        target=mark,
        organ="pancreas",
        limit=50
    )
encode_search_experiments(
assay_title="TF ChIP-seq",
organ="pancreas",
limit=100
)
Present to the user a data availability matrix:
Data Layer | Experiments | Biosamples | Labs | Files Available
----------------|-------------|------------|------|----------------
RNA-seq | N | N | N | gene quant TSV
ATAC-seq | N | N | N | narrowPeak, bigWig
H3K27ac ChIP | N | N | N | narrowPeak
H3K4me1 ChIP | N | N | N | narrowPeak
H3K4me3 ChIP | N | N | N | narrowPeak
H3K27me3 ChIP | N | N | N | broadPeak
H3K36me3 ChIP | N | N | N | broadPeak
TF ChIP-seq | N | N (TFs)| N | narrowPeak
Critical: Flag if any data layer is missing entirely — multi-omics integration is only as strong as its weakest layer. Missing H3K27ac makes enhancer calling unreliable. Missing ATAC-seq prevents accessibility-based enhancer validation.
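The availability check can be sketched in a few lines of Python. This is a minimal illustration, assuming the search results have already been collected into a dict that maps each layer name to a list of experiment records. The function name `summarize_availability` and the record keys `biosample` and `lab` are hypothetical and not part of encode-toolkit.

```python
# Layers taken from the availability matrix above.
REQUIRED_LAYERS = [
    "RNA-seq", "ATAC-seq", "H3K27ac", "H3K4me1",
    "H3K4me3", "H3K27me3", "H3K36me3", "TF ChIP-seq",
]


def summarize_availability(results):
    """Count experiments, biosamples, and labs per layer; list missing layers.

    `results` maps a layer name to a list of dicts with the hypothetical
    keys "biosample" and "lab".
    """
    rows = {}
    missing = []
    for layer in REQUIRED_LAYERS:
        experiments = results.get(layer, [])
        rows[layer] = {
            "experiments": len(experiments),
            "biosamples": len({e["biosample"] for e in experiments}),
            "labs": len({e["lab"] for e in experiments}),
        }
        if not experiments:
            missing.append(layer)
    return rows, missing


# Toy input with placeholder values, for illustration only.
example_results = {
    "RNA-seq": [
        {"biosample": "pancreas", "lab": "lab-1"},
        {"biosample": "pancreas", "lab": "lab-2"},
    ],
    "ATAC-seq": [{"biosample": "pancreas", "lab": "lab-1"}],
    "H3K4me3": [{"biosample": "islet", "lab": "lab-3"}],
}
rows, missing = summarize_availability(example_results)
print("Missing layers:", missing)
```

Any layer that lands in `missing` should be reported to the user before integration starts.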
| Assay | Key Quality Metric | Threshold |
|---|---|---|
| RNA-seq | Mapping rate, library complexity | >80% mapping, >15,000 genes detected |
| ATAC-seq | FRiP, TSS enrichment, fragment size | FRiP >1%, TSS enrichment >5, nucleosomal ladder |
| Histone ChIP | FRiP, NSC, RSC, NRF | FRiP >1%, NSC >1.05, RSC >0.8, NRF >0.8 |
| TF ChIP | FRiP, IDR consistency | FRiP >1%, IDR peaks available |
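The numeric thresholds in the table can be encoded as a small lookup. The sketch below assumes QC values are available as a plain dict per experiment. The metric key names (`frip`, `nsc`, and so on) and the function `failed_metrics` are hypothetical. Non-numeric criteria from the table (nucleosomal ladder, availability of IDR peaks) are not covered.

```python
# Minimum values from the quality table; a metric must be strictly above its minimum.
QC_THRESHOLDS = {
    "RNA-seq": {"mapping_rate": 0.80, "genes_detected": 15000},
    "ATAC-seq": {"frip": 0.01, "tss_enrichment": 5.0},
    "Histone ChIP-seq": {"frip": 0.01, "nsc": 1.05, "rsc": 0.8, "nrf": 0.8},
    "TF ChIP-seq": {"frip": 0.01},
}


def failed_metrics(assay, metrics):
    """Return the names of metrics that are missing or not above threshold."""
    failures = []
    for name, minimum in QC_THRESHOLDS[assay].items():
        value = metrics.get(name)
        if value is None or value <= minimum:
            failures.append(name)
    return failures


# Toy values for illustration only.
print(failed_metrics("Histone ChIP-seq",
                     {"frip": 0.05, "nsc": 1.02, "rsc": 1.1, "nrf": 0.9}))
```

Treating a missing metric as a failure is a conservative design choice. It forces an explicit decision rather than silently passing an experiment with incomplete QC.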
For each assay, download the appropriate file type:
RNA-seq: Gene quantifications (TSV) or signal tracks (bigWig)
encode_list_files(
experiment_accession="ENCSR...",
output_type="gene quantifications",
assembly="GRCh38",
preferred_default=True
)
ATAC-seq / Histone ChIP-seq: IDR thresholded peaks + signal tracks
encode_list_files(
experiment_accession="ENCSR...",
output_type="IDR thresholded peaks",
assembly="GRCh38"
)
For broad histone marks (H3K27me3, H3K36me3, H3K9me3), use broadPeak instead of narrowPeak.
TF ChIP-seq: IDR thresholded peaks
encode_list_files(
experiment_accession="ENCSR...",
output_type="IDR thresholded peaks",
file_format="bed",
assembly="GRCh38"
)
Always filter against ENCODE Blacklist regions before any analysis:
- hg38-blacklist.v2.bed.gz from Boyle-Lab/Blacklist
- mm10-blacklist.v2.bed.gz

If you have 5+ histone marks, ChromHMM provides the most comprehensive regulatory annotation:
Required marks (minimum for useful segmentation):
ChromHMM output — 15 or 18-state model:
| State | Marks Present | Interpretation |
|---|---|---|
| Active TSS | H3K4me3, H3K27ac | Active promoter |
| Flanking TSS | H3K4me1, H3K4me3 | Promoter-proximal |
| Strong Enhancer | H3K4me1, H3K27ac | Active enhancer |
| Weak Enhancer | H3K4me1 only | Poised/weak enhancer |
| Bivalent TSS | H3K4me3, H3K27me3 | Bivalent/poised promoter |
| Bivalent Enhancer | H3K4me1, H3K27me3 | Poised enhancer (repressed) |
| Repressed Polycomb | H3K27me3 only | Polycomb-silenced |
| Transcription | H3K36me3 | Actively transcribed gene body |
| Quiescent | None | No marks (heterochromatin or desert) |
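The interpretation column can be read as a rule lookup over mark combinations, sketched below. This is not how ChromHMM assigns states: ChromHMM learns states with a hidden Markov model over binned signal. The function only mirrors the table for a quick sanity check of a region with known marks. The precedence order of the rules and the "Unassigned" fallback are assumptions of this example.

```python
def interpret_marks(marks):
    """Map the histone marks present at a region to a label from the table.

    Rules are checked from most to least specific; the ordering is an
    assumption made for this sketch.
    """
    marks = set(marks)
    k4me3 = "H3K4me3" in marks
    k4me1 = "H3K4me1" in marks
    k27ac = "H3K27ac" in marks
    k27me3 = "H3K27me3" in marks
    k36me3 = "H3K36me3" in marks

    if k4me3 and k27me3:
        return "Bivalent TSS"
    if k4me3 and k27ac:
        return "Active TSS"
    if k4me3 and k4me1:
        return "Flanking TSS"
    if k4me1 and k27me3:
        return "Bivalent Enhancer"
    if k4me1 and k27ac:
        return "Strong Enhancer"
    if k4me1:
        return "Weak Enhancer"
    if k27me3:
        return "Repressed Polycomb"
    if k36me3:
        return "Transcription"
    if not marks:
        return "Quiescent"
    return "Unassigned"  # combination not listed in the table


print(interpret_marks({"H3K4me1", "H3K27ac"}))
```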
Validation: Cross-reference ChromHMM states with ATAC-seq peaks:
Active enhancers are defined by the convergence of multiple signals:
ENHANCER = H3K27ac+ AND H3K4me1+ AND ATAC-seq accessible AND NOT H3K4me3+ (not promoter)
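The rule above can be expressed as an overlap filter over peak lists. The following is a minimal pure-Python sketch, assuming peaks are `(chrom, start, end)` tuples with half-open coordinates and that ATAC-seq peaks serve as the candidate elements. The function names and the toy coordinates are illustrative only. At genome scale a dedicated interval tool such as bedtools would normally replace the quadratic overlap check used here.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def any_overlap(region, peaks):
    """True if `region` overlaps at least one interval in `peaks`."""
    return any(overlaps(region, p) for p in peaks)


def call_active_enhancers(atac, h3k27ac, h3k4me1, h3k4me3, blacklist=()):
    """Keep ATAC peaks overlapping H3K27ac and H3K4me1 but not H3K4me3 or blacklist."""
    return [
        peak for peak in atac
        if any_overlap(peak, h3k27ac)
        and any_overlap(peak, h3k4me1)
        and not any_overlap(peak, h3k4me3)
        and not any_overlap(peak, blacklist)
    ]


# Toy intervals for illustration only.
atac = [("chr1", 100, 200), ("chr1", 500, 600), ("chr1", 900, 1000), ("chr2", 100, 200)]
h3k27ac = [("chr1", 150, 300), ("chr1", 450, 650), ("chr2", 50, 250)]
h3k4me1 = [("chr1", 90, 210), ("chr1", 480, 620), ("chr2", 150, 400)]
h3k4me3 = [("chr1", 490, 510)]
blacklist = [("chr2", 0, 120)]

enhancers = call_active_enhancers(atac, h3k27ac, h3k4me1, h3k4me3, blacklist)
print(enhancers)
```

In the toy data, the second peak is dropped as promoter-like (H3K4me3), the third lacks H3K27ac, and the fourth falls in a blacklisted region. TSS-proximal filtering is not included in this sketch.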
Step-by-step:
Poised enhancers:
POISED = H3K4me1+ AND H3K27ac- AND (optionally H3K27me3+)
Super-enhancers: Use the ROSE algorithm on H3K27ac signal — regions above the inflection point in a ranked H3K27ac signal plot.
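The inflection-point idea can be illustrated with a simplified cutoff calculation. This sketch is not the ROSE tool. It omits the stitching of nearby peaks and the background subtraction that ROSE performs, and it only reproduces the geometric step: rank regions by signal, scale both axes to [0, 1], and find where a line of slope 1 is tangent to the curve. The function name `rose_style_cutoff` and the toy signal values are hypothetical.

```python
def rose_style_cutoff(signals):
    """Return the signal value at the slope-1 tangent point of the ranked curve.

    With both axes scaled to [0, 1], the tangent point of a slope-1 line on
    an increasing convex curve is the point that minimizes (y - x).
    """
    ranked = sorted(signals)
    n = len(ranked)
    if n < 2:
        raise ValueError("need at least two regions")
    top = ranked[-1]
    if top <= 0:
        raise ValueError("signal values must include a positive maximum")
    best = min(range(n), key=lambda i: ranked[i] / top - i / (n - 1))
    return ranked[best]


# Toy H3K27ac signal per region, for illustration only.
signals = [1, 1, 1, 1, 1, 1, 1, 1, 50, 100]
cutoff = rose_style_cutoff(signals)
super_enhancers = [s for s in signals if s > cutoff]
print(cutoff, super_enhancers)
```

Regions with signal above the returned cutoff are the super-enhancer candidates. These should still be confirmed against ATAC-seq accessibility.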
Linking enhancers to their target genes is one of the hardest problems in genomics. Use multiple complementary approaches:
Caveats:
Integrate TF ChIP-seq binding with enhancer locations and expression:
From Mawla et al. 2023: Differentially accessible chromatin between cell types shows enriched TF motifs that correspond to known lineage-defining factors. This validates the approach of using motif enrichment in differentially accessible regions to identify regulatory TFs.
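Motif enrichment in differentially accessible regions is commonly scored with an over-representation test. The sketch below uses a one-sided hypergeometric test built from the standard library. It assumes motif scanning has already been done and that each region is simply flagged as containing the motif or not. The function name and the counts are illustrative, and dedicated motif tools apply their own statistics and background models.

```python
from math import comb


def hypergeom_enrichment_p(hits_fg, n_fg, hits_bg, n_bg):
    """One-sided P(X >= hits_fg) for motif-containing regions in the foreground.

    The population is foreground plus background regions; `hits_*` count
    regions with at least one motif match.
    """
    population = n_fg + n_bg
    successes = hits_fg + hits_bg
    upper = min(n_fg, successes)
    total = comb(population, n_fg)
    tail = sum(
        comb(successes, k) * comb(population - successes, n_fg - k)
        for k in range(hits_fg, upper + 1)
    )
    return tail / total


# Toy counts: all 3 foreground regions carry the motif, none of 3 background regions do.
p_value = hypergeom_enrichment_p(hits_fg=3, n_fg=3, hits_bg=0, n_bg=3)
print(p_value)
```

When many motifs are tested, the resulting p-values need multiple-testing correction before any TF is reported as enriched.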
When you have matched chromatin accessibility + RNA-seq across multiple samples (e.g., individuals), GRaNIE (Kamal et al. 2023) provides a principled framework for building enhancer-mediated gene regulatory networks:
When to use GRaNIE vs. SCENIC+:
When single-cell multiome data is available (joint scATAC-seq + scRNA-seq from the same cells), this is the gold standard for cell type-resolved multi-omics:
Weighted Nearest Neighbor (WNN) (Hao et al. 2021, Seurat v4):
Integration workflow:
SCENIC+ (Gonzalez-Blas et al. 2023) can then infer cell-type-specific enhancer GRNs directly from the multiome data, identifying TF regulons that couple chromatin accessibility to gene expression.
Caveats for multiome data:
Enformer (Avsec et al. 2021) can serve as an independent validation layer for identified regulatory elements:
When to use: As a validation/prioritization layer, NOT as a primary discovery tool. Enformer predicts from sequence alone (no cell type specificity without additional cell-type-specific inputs). Use to prioritize enhancers or variants for experimental follow-up.
Limitation: Enformer was trained on bulk epigenomic data. Cell-type-specific predictions require additional frameworks (e.g., Enformer Celltyping, Murphy et al. 2024).
When cell type-resolved data is available:
Compare identified enhancers/promoters against ENCODE's candidate cis-regulatory elements:
# Download cCRE file for the relevant assembly
encode_search_files(
search_term="cCRE",
assembly="GRCh38",
output_type="candidate Cis-Regulatory Elements"
)
Report:
Use GREAT (McLean et al. 2010) to assign biological meaning to enhancer/regulatory region sets:
Cross-reference with:
encode_link_reference(
experiment_accession="ENCSR...",
reference_type="pmid",
reference_id="37069576",
description="Multi-omics integration using this experiment"
)
Log all derived files:
encode_log_derived_file(
file_path="/path/to/enhancer_catalog.bed",
source_accessions=["ENCSR...", "ENCSR...", ...],
description="Active enhancers in [tissue]: ATAC+H3K27ac+H3K4me1, TSS-subtracted",
file_type="enhancer_catalog",
tool_used="bedtools intersect + ENCODE Blacklist filter",
parameters="H3K27ac AND H3K4me1 AND ATAC, NOT H3K4me3, NOT TSS±2kb"
)
encode_log_derived_file(
file_path="/path/to/chromhmm_states.bed",
source_accessions=["ENCSR...", "ENCSR...", ...],
description="ChromHMM 18-state annotation for [tissue]",
file_type="chromatin_states",
tool_used="ChromHMM LearnModel + MakeSegmentation",
parameters="18 states, 200bp bins, 5 marks"
)
For the integrated regulatory landscape:
For detailed biological meaning of each histone mark, ChromHMM combinatorial states, functional categories (active promoters, active/poised enhancers, super-enhancers, silencers), contradictions, and cancer-specific states, consult the comprehensive reference at skills/histone-aggregation/references/histone-marks-reference.md (1,442 lines, 21 marks, 37 key papers).
Goal: Combine ENCODE ChIP-seq, ATAC-seq, RNA-seq, and Hi-C data at a single gene locus to build a complete regulatory model.
Context: Multi-omics integration at a specific locus connects enhancer marks, accessibility, gene expression, and 3D contacts into a mechanistic regulatory model.
encode_get_facets(facet_field="assay_title", organ="heart", organism="Homo sapiens")
Expected output:
{
"facets": {
"assay_title": {"Histone ChIP-seq": 45, "ATAC-seq": 18, "RNA-seq": 15, "Hi-C": 8, "WGBS": 6}
}
}
encode_search_experiments(assay_title="Histone ChIP-seq", organ="heart", target="H3K27ac", organism="Homo sapiens")
encode_search_experiments(assay_title="ATAC-seq", organ="heart", organism="Homo sapiens")
encode_search_experiments(assay_title="total RNA-seq", organ="heart", organism="Homo sapiens")
encode_search_experiments(assay_title="Hi-C", organ="heart", organism="Homo sapiens")
encode_download_files(accessions=["ENCFF100H3K", "ENCFF200ATK", "ENCFF300RNA", "ENCFF400HIC"], download_dir="/data/multiomics")
Focus on MYH7 locus (chr14:23,380,000-23,500,000) — cardiac myosin gene:
Interpretation: Enhancers with all 4 evidence layers (H3K27ac + ATAC + expression + 3D contact) are high-confidence regulatory elements for MYH7.
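The four-layer rule can be applied as a simple evidence count per candidate element. This is a minimal sketch, assuming each candidate has already been annotated with one boolean per layer. The key names, the function `score_enhancers`, and the candidate identifiers are hypothetical placeholders, not real loci.

```python
# One flag per evidence layer named in the interpretation above.
EVIDENCE_LAYERS = ("h3k27ac", "atac", "target_expressed", "hic_contact")


def score_enhancers(candidates):
    """Attach an evidence count to each candidate and sort strongest first."""
    scored = []
    for candidate in candidates:
        n_layers = sum(bool(candidate.get(layer)) for layer in EVIDENCE_LAYERS)
        scored.append({
            **candidate,
            "n_layers": n_layers,
            "high_confidence": n_layers == len(EVIDENCE_LAYERS),
        })
    return sorted(scored, key=lambda c: c["n_layers"], reverse=True)


# Placeholder candidates for illustration only.
candidates = [
    {"id": "enh_A", "h3k27ac": True, "atac": True,
     "target_expressed": True, "hic_contact": False},
    {"id": "enh_B", "h3k27ac": True, "atac": True,
     "target_expressed": True, "hic_contact": True},
    {"id": "enh_C", "h3k27ac": False, "atac": True},
]
for row in score_enhancers(candidates):
    print(row["id"], row["n_layers"], row["high_confidence"])
```

Keeping the count alongside the boolean flag lets partially supported elements be reported as lower-confidence candidates instead of being dropped.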
encode_get_facets(facet_field="assay_title", organ="heart", organism="Homo sapiens")
Expected output:
{
"facets": {"assay_title": {"Histone ChIP-seq": 45, "ATAC-seq": 18, "RNA-seq": 15, "Hi-C": 8}}
}
encode_compare_experiments(accession_1="ENCSR100CHI", accession_2="ENCSR200ATK")
Expected output:
{
"comparison": {
"shared": {"organ": "heart", "organism": "Homo sapiens"},
"differences": {"assay": ["Histone ChIP-seq", "ATAC-seq"]}
}
}
encode_summarize_collection()
Expected output:
{
"total_tracked": 4,
"by_assay": {"Histone ChIP-seq": 1, "ATAC-seq": 1, "RNA-seq": 1, "Hi-C": 1}
}
When reporting multi-omics integration results:
visualization-workflow for publication-quality multi-track figures, or epigenome-profiling for broader epigenomic characterization of the tissue
- histone-aggregation: Aggregate histone ChIP-seq across multiple experiments (input layer)
- accessibility-aggregation: Aggregate ATAC-seq/DNase-seq across experiments (input layer)
- regulatory-elements: Focused enhancer/promoter analysis
- epigenome-profiling: Broader epigenomic characterization
- scrna-meta-analysis: Single-cell RNA-seq integration (expression layer)
- hic-aggregation: Chromatin contact data for enhancer-gene linkage
- methylation-aggregation: DNA methylation data (additional regulatory layer)
- publication-trust: Verify literature claims backing analytical decisions