From encode-toolkit
Analyzes ENCODE functional genomics screens including CRISPR, MPRA, and STARR-seq to find data, process results, identify functional regulatory elements, and integrate with epigenomic annotations.
npx claudepluginhub ammawla/encode-toolkit --plugin encode-toolkit
How this skill is triggered (by the user, by Claude, or both)
Slash command
/encode-toolkit:functional-screen-analysis
The summary Claude sees in its skill listing, used to decide when to auto-load this skill
- User wants to find or analyze CRISPR screen, MPRA, or STARR-seq data from ENCODE
Discover and interpret functional validation data from CRISPR screens, MPRA (Massively Parallel Reporter Assays), and STARR-seq experiments in the ENCODE catalog. These assays directly test whether candidate regulatory elements have functional activity, complementing the correlative evidence from ChIP-seq, ATAC-seq, and Hi-C.
The question: "Which of the candidate regulatory elements identified by ENCODE actually have functional activity, and what genes do they regulate?"
The central challenge in regulatory genomics is that biochemical signatures (histone marks, chromatin accessibility, TF binding) are correlative — they identify candidate regulatory elements but cannot prove function. ENCODE Phase 4 addressed this gap by investing heavily in functional characterization: large-scale CRISPR perturbation screens, MPRA experiments testing thousands of candidate elements in parallel, and STARR-seq for genome-wide enhancer activity mapping.
ENCODE catalogs 926,535 human candidate cis-regulatory elements (cCREs). But how many of these are truly functional?
These functional assays provide the strongest evidence (short of genetic studies in humans) that a regulatory element has biological activity. ENCODE4 scaled these approaches: the Functional Characterization Centers (Yao et al. 2024) conducted 108 noncoding CRISPR screens comprising >540,000 perturbations, and released predesigned sgRNAs for targeting 3.27 million ENCODE SCREEN cCREs with CRISPRi.
| Assay | Tests | Context | Scale | Confidence | Key Limitation |
|---|---|---|---|---|---|
| CRISPR screen (CRISPRi/CRISPRa) | Endogenous perturbation | Native chromatin | 5,000–500,000 elements | Highest | Limited to cell lines; delivery constraints |
| MPRA | Reporter activity | Episomal (plasmid) | 10,000–100,000 variants | High for activity | Removed from chromatin context |
| STARR-seq | Self-transcription | Episomal (plasmid) | Genome-wide library | High for activity | Episomal; position effects |
encode_search_experiments(assay_title="CRISPR screen")
ENCODE contains CRISPRi (inhibition) and CRISPRa (activation) screens targeting regulatory elements:
| Screen Type | Mechanism | Effect on Target | Use Case |
|---|---|---|---|
| CRISPRi (dCas9-KRAB) | Transcriptional repression | Silences enhancer/promoter | Loss-of-function; identifies required elements |
| CRISPRa (dCas9-VP64/p65) | Transcriptional activation | Activates latent elements | Gain-of-function; identifies sufficient elements |
| CRISPR knockout | Cas9 nuclease | Deletes element | Irreversible loss-of-function |
Typical ENCODE CRISPR screen outputs:
# List available files for a CRISPR screen experiment
encode_list_files(
experiment_accession="ENCSR...",
file_format="tsv",
assembly="GRCh38"
)
encode_search_experiments(assay_title="MPRA")
ENCODE MPRA experiments test candidate cis-regulatory elements for enhancer/promoter activity:
Typical MPRA outputs:
encode_search_experiments(assay_title="STARR-seq")
STARR-seq tests enhancer activity genome-wide using self-transcribing reporter constructs:
Typical STARR-seq outputs:
To find ALL functional characterization data for a tissue or cell type:
# All perturbation experiments
encode_search_experiments(perturbed=True, organ="...")
# Functional characterization in a specific cell line
encode_search_experiments(assay_title="CRISPR screen", biosample_term_name="K562")
encode_search_experiments(assay_title="MPRA", biosample_term_name="K562")
encode_search_experiments(assay_title="STARR-seq", biosample_term_name="K562")
| Data Type | Format | Description |
|---|---|---|
| Guide RNA counts | TSV | Raw or normalized sgRNA counts per sample |
| Element quantifications | TSV | Aggregated effect sizes per target element |
| Differential expression | TSV | Genes with significant expression changes |
Step 1: Guide counts → Quality filter low-representation guides
Step 2: Normalization → Median ratio or total count normalization
Step 3: Statistical test → MAGeCK, BAGEL2, or custom model
Step 4: Hit calling → FDR correction, effect size thresholds
Step 5: Integration → Overlay on ENCODE cCREs and epigenomic marks
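Steps 1 and 2 of the pipeline above can be sketched in plain Python before handing counts to a dedicated tool. This is an illustrative sketch, not the MAGeCK implementation: the function names, the `min_control_reads=30` filter, and the pseudocount are assumptions chosen for the example.

```python
import math
from statistics import median


def median_ratio_size_factors(samples):
    """Median-of-ratios size factors.

    samples: dict mapping sample name -> list of guide counts (same guide order).
    Guides with a zero count in any sample are skipped when estimating factors.
    """
    names = list(samples)
    n_guides = len(samples[names[0]])
    log_geo_means = {}
    for g in range(n_guides):
        values = [samples[name][g] for name in names]
        if all(v > 0 for v in values):
            log_geo_means[g] = sum(math.log(v) for v in values) / len(values)
    return {
        name: math.exp(median(math.log(samples[name][g]) - lgm
                              for g, lgm in log_geo_means.items()))
        for name in names
    }


def guide_log2_fold_changes(control, treatment, min_control_reads=30, pseudocount=1.0):
    """Return {guide_index: log2 fold change} for guides passing the count filter."""
    # Step 1: drop low-representation guides (threshold is an illustrative assumption)
    keep = [i for i, c in enumerate(control) if c >= min_control_reads]
    # Step 2: median-ratio normalization on the retained guides
    factors = median_ratio_size_factors({
        "control": [control[i] for i in keep],
        "treatment": [treatment[i] for i in keep],
    })
    lfc = {}
    for i in keep:
        c_norm = control[i] / factors["control"]
        t_norm = treatment[i] / factors["treatment"]
        lfc[i] = math.log2((t_norm + pseudocount) / (c_norm + pseudocount))
    return lfc
```

A treatment sample sequenced twice as deeply as the control yields fold changes near zero after normalization, which is the behavior the normalization step is meant to guarantee.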
The standard tool for CRISPR screen analysis:
# Count sgRNAs from FASTQ
mageck count -l library.tsv -n experiment \
--sample-label "control,treatment" \
--fastq control_R1.fastq.gz treatment_R1.fastq.gz
# Test for enrichment/depletion
mageck test -k experiment.count.txt \
-t treatment -c control \
-n results --remove-zero both
MAGeCK outputs:
- results.gene_summary.txt: Gene/element-level results (RRA and MLE)
- results.sgrna_summary.txt: Individual guide-level results
- Key columns: neg|score, neg|fdr, pos|score, pos|fdr
BAGEL2 is a Bayesian framework for gene essentiality from CRISPR screens:
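# Calculate fold changes, then Bayes Factors
BAGEL.py fc -i counts.txt -o foldchange.txt -c control_columns
# bf also takes -c to select the fold-change columns to test (test_columns is a placeholder)
BAGEL.py bf -i foldchange.txt -o bayes_factors.txt \
-e essential_genes.txt -n nonessential_genes.txt -c test_columns
# pr uses the same reference gene sets to compute precision-recall
BAGEL.py pr -i bayes_factors.txt -o precision_recall.txt \
-e essential_genes.txt -n nonessential_genes.txt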
| QC Metric | Threshold | Description |
|---|---|---|
| Guide representation | >200 reads/guide | Minimum coverage for statistical power |
| Replicate correlation | Pearson r > 0.7 | Between biological replicates |
| Positive control enrichment | p < 0.01 | Known essential genes/elements should score |
| Negative control depletion | ~50% at FDR 0.1 | Random non-targeting guides show no effect |
| Gini index | <0.3 | Measures guide count distribution evenness |
| Mapping rate | >70% | Reads mapping to library sequences |
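The Gini index row in the table can be checked directly from a vector of guide counts. The sketch below uses the textbook formula on raw counts; some tools compute it on log-transformed counts, so values may not match a tool's report exactly. The function name is an assumption for illustration.

```python
def gini_index(counts):
    """Gini index of guide read counts: 0 means perfectly even, values near 1 mean skewed."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one non-zero count")
    # Formula on ascending-sorted data:
    # G = 2 * sum(rank_i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Example QC check against the <0.3 threshold from the table
passes_qc = gini_index([480, 510, 495, 520, 505]) < 0.3
```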
After identifying screen hits:
# Find H3K27ac ChIP-seq in the screen's cell type (active-enhancer mark to overlay on CRISPRi hits)
encode_search_experiments(assay_title="Histone ChIP-seq", target="H3K27ac", biosample_term_name="K562")
# Check chromatin accessibility at hit locations
encode_search_experiments(assay_title="ATAC-seq", biosample_term_name="K562")
# Download peak files for intersection
encode_list_files(
experiment_accession="ENCSR...",
file_format="bed",
output_type="IDR thresholded peaks",
assembly="GRCh38",
preferred_default=True
)
Intersection analysis:
# Intersect CRISPR hits with H3K27ac peaks
bedtools intersect -a crispr_hits.bed -b h3k27ac_peaks.bed -wa -wb > hits_in_enhancers.bed
# Calculate enrichment of hits in specific chromatin states
# Compare: (hits in enhancers / total hits) vs (all tested elements in enhancers / total tested)
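The comparison described in the comments above can be sketched as a fold enrichment plus a one-sided hypergeometric test. This is a minimal stdlib-only sketch; the function name and the choice of the hypergeometric test are assumptions, and the four counts would come from `bedtools intersect` output.

```python
from math import comb


def state_enrichment(hits_in_state, total_hits, tested_in_state, total_tested):
    """Return (fold_enrichment, p_value) for hits falling in a chromatin state.

    fold = (hits in state / total hits) / (tested in state / total tested)
    p    = P(X >= hits_in_state) when drawing total_hits elements at random
           from total_tested, of which tested_in_state are in the state.
    """
    fold = (hits_in_state / total_hits) / (tested_in_state / total_tested)
    upper = min(total_hits, tested_in_state)
    numerator = sum(
        comb(tested_in_state, k) * comb(total_tested - tested_in_state, total_hits - k)
        for k in range(hits_in_state, upper + 1)
    )
    return fold, numerator / comb(total_tested, total_hits)
```

For example, 8 of 10 hits in enhancers when only 20 of 100 tested elements are enhancers gives a 4-fold enrichment.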
| Data Type | Format | Description |
|---|---|---|
| Barcode counts (DNA) | TSV | Input library representation |
| Barcode counts (RNA) | TSV | Transcriptional output |
| Activity scores | TSV | RNA/DNA ratio per element |
Step 1: Barcode counting → Count barcodes in DNA and RNA libraries
Step 2: DNA normalization → Normalize RNA counts by DNA representation
Step 3: Activity scoring → Calculate RNA/DNA ratio per element
Step 4: Statistical test → Compare to negative controls
Step 5: Allelic comparison → Test for allele-specific activity (if applicable)
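Step 4 can be approximated by comparing each element's activity to the negative-control distribution and then correcting for multiple testing. The sketch below uses empirical one-sided p-values with add-one smoothing followed by Benjamini-Hochberg adjustment; both choices and the function names are assumptions for illustration, not the method of any specific MPRA pipeline.

```python
def empirical_pvalues(activities, negative_controls):
    """One-sided p-value: share of negative controls at least as active as the element."""
    m = len(negative_controls)
    return [
        (1 + sum(1 for c in negative_controls if c >= a)) / (m + 1)
        for a in activities
    ]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted
```

Elements whose adjusted p-value falls below the chosen FDR threshold would be called active.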
Standardized computational pipeline:
# Run MPRAflow
nextflow run MPRAflow/MPRAflow.nf \
--design design_file.txt \
--fastq_insert insert_reads/ \
--fastq_bc barcode_reads/ \
--outdir results/
MPRAflow handles:
The core MPRA measurement is the activity ratio:
Activity = log2(RNA_counts / DNA_counts)
| Activity Score | Interpretation |
|---|---|
| Activity >> 0 (e.g., >1.5) | Strong enhancer/promoter activity |
| Activity ~ 0 | No regulatory activity (or activity equal to minimal promoter) |
| Activity << 0 (e.g., <-1.0) | Potential silencer activity (reduces transcription) |
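The activity ratio and the interpretation table can be combined in a short sketch. It assumes counts are already depth-normalized, aggregates barcodes by the median, and treats the example cutoffs from the table (1.5 and -1.0) as defaults; the function names, the pseudocount, and the `min_dna` filter are illustrative assumptions.

```python
import math
from statistics import median


def element_activity(barcodes, pseudocount=1.0, min_dna=10):
    """Median log2(RNA/DNA) across an element's barcodes.

    barcodes: list of (rna_count, dna_count) tuples.
    Barcodes with fewer than min_dna DNA reads are ignored.
    Returns None if no barcode passes the filter.
    """
    ratios = [
        math.log2((rna + pseudocount) / (dna + pseudocount))
        for rna, dna in barcodes
        if dna >= min_dna
    ]
    return median(ratios) if ratios else None


def interpret(activity, strong=1.5, silencer=-1.0):
    """Map an activity score onto the categories in the table above."""
    if activity is None:
        return "insufficient data"
    if activity > strong:
        return "strong enhancer/promoter activity"
    if activity < silencer:
        return "potential silencer activity"
    return "weak or no activity"
```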
# Correlate MPRA activity with ENCODE histone mark signals
# Elements with strong MPRA activity should show:
# - High H3K27ac signal (active enhancer mark)
# - Chromatin accessibility (ATAC-seq/DNase-seq signal)
# - TF binding (ChIP-seq signal at element)
# Elements with NO MPRA activity despite H3K27ac may be:
# - Context-dependent (active only in specific conditions)
# - False-positive biochemical marks
# - Silencer elements (negative activity in MPRA)
| Data Type | Format | Description |
|---|---|---|
| Input library | BAM/FASTQ | Cloned genomic fragments |
| STARR-seq output | BAM/FASTQ | Self-transcribed fragments (enriched) |
| Signal tracks | bigWig | Enrichment over input |
| Peaks | BED | Identified enhancer elements |
Step 1: Alignment → Map input and STARR-seq reads to genome
Step 2: Enrichment → Calculate STARR-seq / input ratio
Step 3: Peak calling → Identify enriched regions (enhancers)
Step 4: Quantification → Measure enhancer strength per peak
Step 5: Integration → Compare with ENCODE cCRE predictions
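Step 2 amounts to a depth-normalized ratio of STARR-seq output to input per genomic bin. The following sketch assumes reads have already been counted in fixed bins; the function name and the pseudocount are assumptions, and dedicated peak callers model this more carefully.

```python
import math


def log2_enrichment(starr_counts, input_counts, pseudocount=1.0):
    """Per-bin log2(STARR-seq CPM / input CPM).

    Both lists hold read counts for the same bins in the same order.
    Library totals are used for counts-per-million scaling.
    """
    starr_total = sum(starr_counts)
    input_total = sum(input_counts)
    enrichment = []
    for s, i in zip(starr_counts, input_counts):
        s_cpm = (s + pseudocount) / starr_total * 1e6
        i_cpm = (i + pseudocount) / input_total * 1e6
        enrichment.append(math.log2(s_cpm / i_cpm))
    return enrichment
```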
Purpose-built peak caller for STARR-seq data:
# Call peaks from STARR-seq
starrpeaker \
--prefix output_prefix \
--chromsize hg38.chrom.sizes \
--bam input.bam starrseq.bam \
--threshold 0.05
Alternatively, standard peak callers can be applied to the enrichment:
# Using MACS2 on STARR-seq enrichment
macs2 callpeak -t starrseq.bam -c input.bam \
-f BAM -g hs --nomodel \
-n starr_enhancers -q 0.05
| STARR-seq Result | ENCODE cCRE Status | Interpretation |
|---|---|---|
| Active in STARR-seq | dELS or pELS | Validated enhancer |
| Active in STARR-seq | PLS | Promoter with enhancer activity |
| Active in STARR-seq | Not in cCRE catalog | Novel enhancer (or context-dependent) |
| Inactive in STARR-seq | dELS or pELS | Possible false positive cCRE, or context-dependent |
| Inactive in STARR-seq | Not in cCRE catalog | Likely not an enhancer in the tested cell type |
For each functionally validated element, determine its chromatin state:
# Retrieve ChromHMM annotations for the cell type
# (Pre-computed by Roadmap Epigenomics for 111 reference epigenomes)
# Enrichment analysis: are screen hits preferentially in specific chromatin states?
# Expected: hits enriched in active enhancer (Enh, EnhG) and active promoter (TssA) states
# Unexpected enrichment in quiescent or heterochromatin states suggests novel regulatory mechanisms
| cCRE Class | Expected Screen Hit Rate | Rationale |
|---|---|---|
| PLS (Promoter-like) | 30–50% | Promoters are consistently active |
| pELS (Proximal enhancer-like) | 15–25% | Proximity to promoters increases detection |
| dELS (Distal enhancer-like) | 5–15% | Distal enhancers are often cell-type-specific |
| CTCF-only | <5% | Insulators rarely show enhancer activity |
| No cCRE overlap | 1–3% | Novel elements or context-dependent activity |
The Activity-By-Contact model predicts enhancer-gene links using ENCODE data. Cross-reference screen hits:
# For each CRISPR hit:
# 1. Check if the hit is predicted by ABC to regulate the observed target gene
# 2. ABC-predicted enhancers that are also CRISPR-validated are highest confidence
# 3. CRISPR hits NOT predicted by ABC may act through mechanisms ABC does not model
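The three cases above reduce to set operations on (element, gene) pairs. A minimal sketch, assuming both inputs use the same element identifiers and gene symbols; the function name and the pair representation are hypothetical.

```python
def cross_reference(crispr_pairs, abc_pairs):
    """Partition enhancer-gene pairs by CRISPR validation and ABC prediction.

    Each input is an iterable of (element_id, gene) tuples.
    """
    crispr = set(crispr_pairs)
    abc = set(abc_pairs)
    return {
        "validated_and_predicted": crispr & abc,  # highest confidence
        "validated_only": crispr - abc,           # mechanisms ABC may not model
        "predicted_only": abc - crispr,           # predicted but not a screen hit
    }
```

The `predicted_only` set is only meaningful for pairs that were actually tested in the screen, so ABC predictions should be restricted to tested elements before calling this.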
Screen-validated elements provide the strongest evidence for variant interpretation:
# Workflow:
# 1. Identify GWAS variants in LD (r2 > 0.8) from gwas-catalog skill
# 2. Intersect with functionally validated enhancers
# 3. Variants in CRISPR-validated enhancers are highest-priority causal candidates
# 4. Test for enrichment: are GWAS variants over-represented in screen hits?
bedtools intersect \
-a gwas_variants_ld.bed \
-b crispr_validated_enhancers.bed \
-wa -wb > gwas_in_validated_enhancers.bed
| Screen Type | Minimum Library Size | Recommended Coverage |
|---|---|---|
| CRISPR (gene-level) | 4–6 guides per gene | 500x per guide |
| CRISPR (element-tiling) | 1 guide per ~100bp | 200x per guide |
| MPRA | 10–20 barcodes per element | 100x per barcode |
| STARR-seq | Genome-wide fragmentation | 10x genome coverage |
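The table translates into simple arithmetic for planning sequencing depth: number of guides or barcodes multiplied by the recommended coverage. The helper names below are assumptions for illustration, and the result is a lower bound that ignores unmapped reads.

```python
import math


def tiling_guides(region_bp, spacing_bp=100):
    """Guides needed to tile a region at roughly one guide per spacing_bp."""
    return math.ceil(region_bp / spacing_bp)


def reads_needed(n_units, coverage):
    """Minimum mapped reads per sample for n_units guides or barcodes."""
    return n_units * coverage


# Gene-level screen: 5,000 genes x 4 guides at 500x
gene_level = reads_needed(5_000 * 4, 500)            # 10,000,000 reads
# Element tiling: a 1 Mb locus at ~1 guide per 100 bp, 200x
tiling = reads_needed(tiling_guides(1_000_000), 200)  # 2,000,000 reads
# MPRA: 10,000 elements x 10 barcodes at 100x
mpra = reads_needed(10_000 * 10, 100)                 # 10,000,000 reads
```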
Positive controls (expected to score):
Negative controls (expected to show no effect):
Critical: The functional screen must be performed in a cell type that is biologically relevant to the regulatory elements being tested. An enhancer active in hepatocytes may show no MPRA activity in HEK293 cells.
| ENCODE Screen Cell Types | Tissue Relevance |
|---|---|
| K562 (CML) | Hematopoietic lineage, Tier 1 |
| HepG2 | Liver / hepatocyte |
| GM12878 | B-lymphocyte, Tier 1 |
| WTC-11 (iPSC) | Pluripotent stem cells; can be differentiated into multiple cell types |
| A549 | Lung epithelial |
When integrating screen data with ENCODE epigenomic data:
# Ensure cell type match
encode_search_experiments(assay_title="Histone ChIP-seq", target="H3K27ac", biosample_term_name="K562")
# Match the screen cell type exactly for meaningful correlation
| Elements Tested | Guides/Barcodes per Element | Replicates | Expected Power (FDR < 0.05, \|FC\| > 1.5) |
|---|---|---|---|
| 5,000 | 4 guides | 3 bio | ~80% for strong effects |
| 5,000 | 6 guides | 3 bio | ~90% for moderate effects |
| 50,000 | 2 guides | 2 bio | ~60% for strong effects only |
Log all screen analysis operations:
encode_track_experiment(
accession="ENCSR...",
notes="CRISPR screen analysis for [cell type] regulatory elements"
)
encode_log_derived_file(
file_path="/path/to/screen_results.tsv",
source_accessions=["ENCSR...", "ENCFF..."],
description="CRISPR screen hit list: [N] significant elements at FDR<0.05 from [total] tested in [cell type]",
file_type="screen_results",
tool_used="MAGeCK v0.5.9.5",
parameters="mageck test -t treatment -c control --remove-zero both; FDR<0.05, |LFC|>0.5"
)
encode_log_derived_file(
file_path="/path/to/screen_encode_integration.tsv",
source_accessions=["ENCSR...(screen)", "ENCSR...(H3K27ac)", "ENCSR...(ATAC)"],
description="Integration of [N] CRISPR hits with ENCODE cCREs and H3K27ac peaks in [cell type]. [X] hits overlap cCRE-ELS, [Y] overlap H3K27ac peaks",
file_type="integrated_screen_annotation",
tool_used="bedtools intersect + custom enrichment",
parameters="GRCh38, bedtools v2.31.0, IDR thresholded peaks"
)
Link to relevant publications:
encode_link_reference(
experiment_accession="ENCSR...",
reference_type="pmid",
reference_id="30612741",
description="Gasperini et al. 2019 — CRISPRi screen methodology reference"
)
Goal: Analyze a CRISPR interference (CRISPRi) screen targeting candidate enhancers to identify those required for pluripotency gene expression in H1-hESC cells.
Context: Functional genomics screens like CRISPRi directly test enhancer necessity, complementing observational epigenomic data.
encode_search_experiments(assay_title="CRISPR screen", biosample_term_name="H1", organism="Homo sapiens")
Expected output:
{
"total": 12,
"results": [
{"accession": "ENCSR000CRI", "assay_title": "CRISPR screen", "biosample_summary": "H1-hESC", "target": "enhancer screen"},
{"accession": "ENCSR001SGR", "assay_title": "CRISPR screen", "biosample_summary": "H1-hESC", "target": "gene-level growth screen"}
]
}
Interpretation: 12 CRISPR screens in H1-hESC. Filter for enhancer-targeting screens (vs. gene-level knockouts).
encode_get_experiment(accession="ENCSR000CRI")
Expected output:
{
"accession": "ENCSR000CRI",
"assay_title": "CRISPR screen",
"biosample_summary": "H1-hESC",
"description": "CRISPRi screen targeting 10,000 candidate enhancers with NANOG-GFP readout",
"replicates": 3,
"status": "released",
"lab": "/labs/jesse-engreitz/"
}
encode_list_files(experiment_accession="ENCSR000CRI", file_format="tsv", assembly="GRCh38")
Expected output:
{
"files": [
{"accession": "ENCFF500SCR", "output_type": "element quantifications", "file_format": "tsv", "file_size_mb": 15.2},
{"accession": "ENCFF501GDE", "output_type": "guide quantifications", "file_format": "tsv", "file_size_mb": 8.7}
]
}
Interpretation: "Element quantifications" contains per-enhancer effect sizes. "Guide quantifications" has per-guide data for QC.
encode_download_files(accessions=["ENCFF500SCR"], download_dir="/data/crispr_screen")
Analysis steps:
Overlay significant enhancer hits with:
Interpretation: Enhancers that are CRISPRi-sensitive AND marked by H3K27ac AND connected by chromatin loops represent high-confidence regulatory elements for pluripotency.
encode_get_facets(assay_title="CRISPR screen", facet_field="biosample_ontology.term_name", organism="Homo sapiens")
Expected output:
{
"facets": {
"biosample_ontology.term_name": {"K562": 45, "H1": 12, "GM12878": 8, "HepG2": 6}
}
}
encode_search_experiments(assay_title="MPRA", organism="Homo sapiens")
Expected output:
{
"total": 28,
"results": [
{"accession": "ENCSR100MPR", "assay_title": "MPRA", "biosample_summary": "K562", "status": "released"}
]
}
encode_search_experiments(assay_title="STARR-seq", organism="Homo sapiens")
Expected output:
{
"total": 15,
"results": [
{"accession": "ENCSR200STR", "assay_title": "STARR-seq", "biosample_summary": "HepG2", "status": "released"}
]
}
| This skill produces... | Feed into... | Purpose |
|---|---|---|
| Significant enhancer hits (BED) | peak-annotation | Assign target genes to validated enhancers |
| Screen effect sizes per element | regulatory-elements | Classify functional elements by regulatory category |
| Validated enhancer coordinates | histone-aggregation | Confirm H3K27ac/H3K4me1 marks at functional enhancers |
| CRISPRi-sensitive regions | accessibility-aggregation | Verify open chromatin at functional elements |
| Enhancer-gene pairs | hic-aggregation | Validate via chromatin loop support |
| Functional variant coordinates | variant-annotation | Annotate GWAS variants at validated enhancers |
| Screen results table | visualization-workflow | Generate volcano plots and Manhattan plots of screen hits |
| Validated regulatory elements | disease-research | Connect functional enhancers to disease mechanisms |
- regulatory-elements: Characterizing the candidate regulatory elements that screens validate. Screens test cCRE predictions.
- search-encode: Finding CRISPR screen, MPRA, and STARR-seq experiments in the ENCODE catalog
- variant-annotation: Screen-validated elements provide the strongest evidence for GWAS variant interpretation
- disease-research: Functional screens identify disease-relevant regulatory elements for translational research
- integrative-analysis: Combining screen results with multi-omic ENCODE data layers
- epigenome-profiling: Building comprehensive epigenomic profiles to contextualize screen hits
- quality-assessment: Evaluating screen quality metrics (guide representation, replicate correlation, control enrichment)
- gwas-catalog: GWAS variants in screen-validated enhancers are highest-priority causal candidates
- histone-aggregation: Aggregated histone peaks provide the annotation layer for screen hit classification
- accessibility-aggregation: Chromatin accessibility at screen targets indicates guide delivery efficiency
- data-provenance: Document screen analysis parameters, tool versions, and thresholds for reproducibility
- publication-trust: Verify literature claims backing analytical decisions
When reporting functional screen analysis results, present:
Example summary:
CRISPR Screen Analysis: K562 CRISPRi (ENCSR...)
Elements tested: 5,920 candidate enhancers
Significant hits: 664 (11.2%) at FDR < 0.05
- 423 overlap ENCODE cCRE-ELS (63.7% of hits)
- 141 overlap cCRE-PLS (21.2% of hits)
- 100 no cCRE overlap (15.1% — novel functional elements)
Median effect size: 22% reduction in target gene expression
Top hit: ENCSR... element → MYC (62% reduction, FDR = 1.2e-15)
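The percentages in a summary like the one above can be generated from raw tallies rather than typed by hand, which avoids arithmetic slips. A minimal sketch; the function name and the output layout are assumptions modeled on the example.

```python
def summarize_hits(elements_tested, hits_by_category, fdr=0.05):
    """Format hit counts and percentages for a screen summary.

    hits_by_category: dict mapping a category label to its hit count;
    the categories are assumed to be mutually exclusive.
    """
    total_hits = sum(hits_by_category.values())
    lines = [
        f"Elements tested: {elements_tested:,}",
        f"Significant hits: {total_hits:,} "
        f"({100 * total_hits / elements_tested:.1f}%) at FDR < {fdr}",
    ]
    for label, count in hits_by_category.items():
        lines.append(f"  - {count} {label} ({100 * count / total_hits:.1f}% of hits)")
    return "\n".join(lines)


report = summarize_hits(5920, {
    "overlap ENCODE cCRE-ELS": 423,
    "overlap cCRE-PLS": 141,
    "no cCRE overlap": 100,
})
print(report)
```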