From encode-toolkit
Annotates genetic variants (GWAS hits, eQTLs, rare variants) with ENCODE data for regulatory impact, causal identification, enrichment testing, and gene linking.
npx claudepluginhub ammawla/encode-toolkit --plugin encode-toolkit
How this skill is triggered: by the user, by Claude, or both
Slash command
/encode-toolkit:variant-annotation
The summary Claude sees in its skill listing, used to decide when to auto-load this skill
- User wants to annotate genetic variants with ENCODE regulatory element overlap and functional evidence
Interpret non-coding genetic variation by layering ENCODE functional genomics annotations to identify causal regulatory variants and link them to target genes.
The question: "Which of my GWAS/eQTL variants actually disrupt regulatory elements, and what genes do they affect?"
Over 90% of disease-associated variants from GWAS fall in non-coding regions of the genome. Without functional annotation, a GWAS locus is just a genomic coordinate — it does not tell you which variant is causal, what regulatory element it disrupts, or which gene it affects. ENCODE provides the richest catalog of functional elements for interpreting these variants.
A typical GWAS locus contains dozens to hundreds of variants in linkage disequilibrium (LD) with the lead SNP. The causal variant(s) may not be the one with the strongest association. Functional annotation helps distinguish causal from tag variants by asking: does this variant overlap a regulatory element that is active in disease-relevant tissue?
The ENCODE Phase 3 project (ENCODE Project Consortium 2020, Nature, ~1,656 citations) established a registry of 926,535 human candidate cis-regulatory elements (cCREs) covering 7.9% of the genome. These are classified into:
| cCRE Class | Abbreviation | Definition | Count (human) |
|---|---|---|---|
| Promoter-like | PLS | DNase + H3K4me3 ± H3K27ac near TSS | ~34,000 |
| Proximal enhancer-like | pELS | DNase + H3K27ac within 2kb of TSS | ~142,000 |
| Distal enhancer-like | dELS | DNase + H3K27ac >2kb from TSS | ~668,000 |
| CTCF-only | CTCF-only | DNase + CTCF, no H3K4me3/H3K27ac | ~57,000 |
| DNase-H3K4me3 | DNase-H3K4me3 | DNase + H3K4me3, not near TSS | ~26,000 |
These cCREs are accessible via the SCREEN web interface and provide the foundation for variant annotation.
Before any annotation, clarify the input:
| Input Type | Description | LD Consideration |
|---|---|---|
| GWAS lead SNPs | Top associations per locus | Must expand to LD proxies (r² > 0.8) |
| Fine-mapped credible sets | Post-fine-mapping variants with posterior probabilities | Already LD-aware — annotate directly |
| eQTL variants | Expression-associated SNPs | May need LD expansion depending on source |
| Rare variants (ClinVar) | Pathogenic/likely pathogenic | No LD concern — annotate directly |
| Candidate variants | User-curated list | Clarify whether LD expansion is needed |
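The LD expansion called for in the table can be sketched as a simple filter. This is a minimal illustration, assuming LD has already been computed elsewhere against a reference panel matching the GWAS ancestry; the `(lead, proxy, r2)` tuple layout, the function name, and the placeholder variant IDs are all hypothetical and not part of the toolkit.

```python
def expand_to_ld_proxies(lead_snps, ld_pairs, r2_threshold=0.8):
    """Return {lead: sorted variant IDs} keeping proxies with r2 above the threshold.

    ld_pairs is a hypothetical iterable of (lead, proxy, r2) tuples, for
    example parsed from the output of an LD tool run on a reference panel.
    """
    leads = set(lead_snps)
    # Each lead SNP stays in its own set so it is annotated too.
    expanded = {lead: {lead} for lead in leads}
    for lead, proxy, r2 in ld_pairs:
        if lead in leads and r2 > r2_threshold:
            expanded[lead].add(proxy)
    return {lead: sorted(proxies) for lead, proxies in expanded.items()}


# Placeholder IDs and r2 values, invented for illustration only.
ld_pairs = [
    ("rsA", "rsB", 0.95),
    ("rsA", "rsC", 0.42),
    ("rsD", "rsE", 0.81),
]
expanded = expand_to_ld_proxies(["rsA", "rsD"], ld_pairs)
print(expanded)
```

Keeping the lead SNP inside its own proxy set means every downstream annotation step treats lead and proxy variants the same way.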
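A GWAS lead SNP is NOT necessarily causal. It is the variant with the strongest statistical signal, but the true causal variant may be any of the dozens to hundreds of SNPs in LD. Before annotation, expand each lead SNP to its LD proxies (r² > 0.8) unless the input is already a fine-mapped credible set.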
ENCODE regulatory elements are highly tissue-specific. An enhancer active in liver may be completely silent in brain. The disease context determines which ENCODE data to query.
| Disease Category | Primary Tissues | Key ENCODE Biosamples |
|---|---|---|
| Type 2 diabetes | Pancreas, liver, adipose, muscle | pancreas tissue, HepG2, adipose tissue |
| Alzheimer's disease | Brain (hippocampus, cortex) | brain tissue, astrocytes, neurons |
| Cardiovascular | Heart, blood vessels | heart tissue, HUVEC, aorta |
| Blood disorders | Blood, bone marrow | K562, GM12878, CD34+ cells |
| Autoimmune | Immune cells, thymus | GM12878, Treg, Th17, monocytes |
| Cancer | Tissue of origin | K562 (CML), HepG2 (liver), MCF-7 (breast), A549 (lung) |
| Inflammatory bowel | Intestine, colon | intestine tissue, sigmoid colon |
| Respiratory | Lung | lung tissue, A549 |
Check what ENCODE data exists for the relevant tissue:
encode_get_facets(organ="pancreas")
encode_get_facets(organ="liver")
If the relevant tissue has limited ENCODE data: check whether Tier 1 cell lines (K562, GM12878, H1-hESC) or Roadmap Epigenomics data can serve as proxies. Document the tissue mismatch explicitly.
For each disease-relevant tissue, systematically gather ENCODE data layers:
# Chromatin accessibility
encode_search_experiments(assay_title="ATAC-seq", organ="...", biosample_type="tissue")
encode_search_experiments(assay_title="DNase-seq", organ="...", biosample_type="tissue")
# Active enhancers + promoters
encode_search_experiments(assay_title="Histone ChIP-seq", target="H3K27ac", organ="...")
# Active promoters specifically
encode_search_experiments(assay_title="Histone ChIP-seq", target="H3K4me3", organ="...")
# Enhancer mark (active + poised)
encode_search_experiments(assay_title="Histone ChIP-seq", target="H3K4me1", organ="...")
# Repressive marks (for context)
encode_search_experiments(assay_title="Histone ChIP-seq", target="H3K27me3", organ="...")
# TF binding
encode_search_experiments(assay_title="TF ChIP-seq", organ="...")
# CTCF specifically for insulator/boundary annotation
encode_search_experiments(assay_title="TF ChIP-seq", target="CTCF", organ="...")
# 3D chromatin contacts
encode_search_experiments(assay_title="Hi-C", organ="...")
encode_search_experiments(assay_title="ChIA-PET", organ="...")
# Gene expression
encode_search_experiments(assay_title="total RNA-seq", organ="...", biosample_type="tissue")
For each experiment, get peak files:
encode_list_files(
    experiment_accession="ENCSR...",
    file_format="bed",
    output_type="IDR thresholded peaks",
    assembly="GRCh38",
    preferred_default=True
)
Track all experiments used:
encode_track_experiment(accession="ENCSR...", notes="variant annotation - [disease]")
Pre-annotation filter: Before annotating, filter out variants overlapping ENCODE Blacklist regions (Amemiya et al. 2019, Scientific Reports, 1,372 citations). Blacklisted regions produce artifactual signal in functional genomics assays. Variants in these regions cannot be reliably annotated.
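The blacklist filter described above can be sketched in plain Python. This is an illustration only, assuming the blacklist BED file has already been parsed into `(chrom, start, end)` tuples with 0-based half-open coordinates and that variants carry 1-based positions; the function names and the example coordinates are made up. In practice `bedtools intersect -v` does the same job.

```python
import bisect
from collections import defaultdict


def build_index(intervals):
    """Merge overlapping BED intervals per chromosome for fast lookup."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, ivs in by_chrom.items():
        ivs.sort()
        merged = []
        for start, end in ivs:
            if merged and start <= merged[-1][1]:
                # Overlapping or touching: extend the previous interval.
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def in_intervals(index, chrom, pos_1based):
    """True if a 1-based position lies inside any 0-based half-open interval."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    pos0 = pos_1based - 1
    i = bisect.bisect_right(starts, pos0) - 1
    return i >= 0 and pos0 < ends[i]


# Invented coordinates, not real blacklist regions.
blacklist = [("chr1", 100, 200), ("chr1", 150, 300)]
variants = [("var1", "chr1", 250), ("var2", "chr1", 301), ("var3", "chr2", 250)]

index = build_index(blacklist)
kept = [v for v in variants if not in_intervals(index, v[1], v[2])]
print(kept)
```

Merging intervals first keeps the binary search correct even when the input regions overlap.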
- Human: hg38-blacklist.v2.bed.gz from Boyle-Lab/Blacklist
- Mouse: mm10-blacklist.v2.bed.gz
For each variant, determine overlap with functional elements using a tiered approach:
Does the variant fall within any ENCODE cCRE in the relevant tissue?
Using downloaded peak files, classify each variant's chromatin context:
| Variant Context | Interpretation | Confidence |
|---|---|---|
| DNase/ATAC peak + H3K27ac + H3K4me3 | Active promoter | High |
| DNase/ATAC peak + H3K27ac + H3K4me1 (no H3K4me3) | Active enhancer | High |
| DNase/ATAC peak + H3K4me1 + H3K27me3 | Poised/bivalent enhancer | Medium |
| DNase/ATAC peak only | Open chromatin, possible regulatory | Medium |
| H3K27me3 broad domain | Polycomb-repressed | Medium |
| TF ChIP-seq peak | Direct TF binding site | High (if peak is narrow) |
| No overlap with any mark | Not active in tested tissue | Low (may be active elsewhere) |
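The table above can be read as a small rule set. The sketch below is one possible encoding, assuming each variant has already been reduced to the set of marks whose peaks it overlaps. The mark names (`"open"` for a DNase/ATAC peak, `"TF"` for a TF ChIP-seq peak), the rule precedence, and the "Unassigned" fallback for combinations the table does not list are assumptions of this example. It also cannot tell whether an H3K27me3 signal is a broad domain; that has to be decided upstream.

```python
CHROMATIN_MARKS = {"open", "H3K27ac", "H3K4me3", "H3K4me1", "H3K27me3"}


def classify_context(marks):
    """Map a set of overlapped marks to (interpretation, confidence) calls."""
    marks = set(marks)
    chromatin = marks & CHROMATIN_MARKS
    calls = []
    if {"open", "H3K27ac", "H3K4me3"} <= chromatin:
        calls.append(("Active promoter", "High"))
    elif {"open", "H3K27ac", "H3K4me1"} <= chromatin:
        # H3K4me3 is necessarily absent here, matching the table row.
        calls.append(("Active enhancer", "High"))
    elif {"open", "H3K4me1", "H3K27me3"} <= chromatin:
        calls.append(("Poised/bivalent enhancer", "Medium"))
    elif chromatin == {"open"}:
        calls.append(("Open chromatin, possible regulatory", "Medium"))
    elif "H3K27me3" in chromatin and "open" not in chromatin:
        calls.append(("Polycomb-repressed", "Medium"))
    elif chromatin:
        calls.append(("Combination not listed in the table", "Unassigned"))
    # TF binding is reported alongside the chromatin call, not instead of it.
    if "TF" in marks:
        calls.append(("Direct TF binding site", "High (if peak is narrow)"))
    if not calls:
        calls.append(("Not active in tested tissue", "Low (may be active elsewhere)"))
    return calls


print(classify_context({"open", "H3K27ac", "H3K4me1"}))
print(classify_context({"open", "TF"}))
```

Returning a list rather than a single label lets a variant carry both a chromatin-state call and a TF-binding call.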
If the variant overlaps a TF ChIP-seq peak:
Compare the variant's annotation across multiple tissues:
If the user has GWAS summary statistics (not just lead SNPs), recommend fine-mapping to identify credible causal variants:
Fine-mapping BEFORE functional annotation is more powerful than annotation alone. A variant with posterior inclusion probability (PIP) > 0.5 that also overlaps a tissue-specific enhancer is a much stronger candidate than either line of evidence alone.
If fine-mapping is not feasible (no summary statistics available), document this limitation explicitly.
For variant sets (not single variants), test whether variants are enriched in specific regulatory annotations:
Partitioned heritability using GWAS summary statistics:
Enrichment testing with proper LD correction:
| Scenario | Recommended Method |
|---|---|
| Full GWAS summary statistics available | S-LDSC (most powerful) |
| Lead SNPs + p-values only | GARFIELD |
| Small variant set (<50 variants) | Direct overlap (enrichment testing underpowered) |
Identifying the target gene is often harder than identifying the regulatory element. The nearest gene is frequently NOT the target — enhancers can regulate genes over distances >1 Mb.
Activity-By-Contact model uses ENCODE ATAC-seq + H3K27ac + Hi-C to predict enhancer-gene links:
20-fold enrichment of causal variants in cell-type-specific enhancers
encode_search_experiments(assay_title="Hi-C", organ="...")
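To make the Activity-By-Contact idea concrete, the sketch below uses the commonly described form of the score: an element's activity (geometric mean of accessibility and H3K27ac signal) multiplied by its contact frequency with the gene's promoter, divided by the same product summed over all candidate elements near that gene. The dictionary keys, function name, and input numbers are hypothetical, and the real ABC pipeline applies normalizations that are omitted here.

```python
import math


def abc_scores(elements):
    """Compute ABC-style scores for all candidate elements of ONE gene.

    elements: list of dicts with hypothetical keys 'id', 'accessibility',
    'h3k27ac' (signal values) and 'contact' (contact frequency with the
    gene's promoter).
    """
    weighted = {}
    for el in elements:
        # Activity: geometric mean of accessibility and H3K27ac signal.
        activity = math.sqrt(el["accessibility"] * el["h3k27ac"])
        weighted[el["id"]] = activity * el["contact"]
    total = sum(weighted.values())
    if total == 0:
        return {el_id: 0.0 for el_id in weighted}
    # Each element's share of the total activity-by-contact for this gene.
    return {el_id: value / total for el_id, value in weighted.items()}


# Invented signal values for two candidate elements of a single gene.
scores = abc_scores([
    {"id": "e1", "accessibility": 4.0, "h3k27ac": 9.0, "contact": 0.5},
    {"id": "e2", "accessibility": 1.0, "h3k27ac": 1.0, "contact": 1.0},
])
print(scores)
```

Because the scores for one gene sum to 1, a variant in a high-scoring element can be read as sitting in an element that accounts for a large share of that gene's predicted regulatory input.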
After completing all annotation layers, prioritize variants using a scoring approach:
| Evidence Layer | Points | Rationale |
|---|---|---|
| In fine-mapped credible set (PIP > 0.1) | +3 | Statistical evidence of causality |
| Overlaps tissue-specific cCRE | +2 | Active regulatory element in relevant tissue |
| Overlaps tissue-specific ATAC/DNase peak | +2 | Open chromatin in relevant tissue |
| Overlaps H3K27ac peak (tissue-specific) | +2 | Active enhancer/promoter |
| Overlaps TF ChIP-seq peak | +1 | Direct protein-DNA interaction |
| Disrupts known TF motif | +2 | Mechanistic evidence |
| ABC model links to known disease gene | +3 | Enhancer-gene connection |
| eQTL for nearby gene in relevant tissue | +2 | Expression association |
| Hi-C contact with gene promoter | +1 | Physical proximity |
| RegulomeDB score 1a–1f | +2 | Integrated regulatory evidence |
| CADD score > 15 (top 3% deleterious) | +1 | Evolutionary constraint |
This scoring is a guide, not a definitive ranking. A variant with a single strong piece of evidence (e.g., disrupts a motif for a known disease TF) may be more compelling than one with many weak overlaps.
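The points table can be applied mechanically once each evidence layer has been evaluated per variant. The following is a minimal sketch: the point values come from the table, while the evidence key names, function names, and placeholder variant IDs are invented for this example.

```python
# Point values from the evidence table; key names are hypothetical.
EVIDENCE_POINTS = {
    "fine_mapped_pip_gt_0_1": 3,
    "tissue_ccre": 2,
    "tissue_open_chromatin": 2,
    "tissue_h3k27ac": 2,
    "tf_chip_peak": 1,
    "disrupts_tf_motif": 2,
    "abc_links_disease_gene": 3,
    "eqtl_relevant_tissue": 2,
    "hic_promoter_contact": 1,
    "regulomedb_1a_1f": 2,
    "cadd_gt_15": 1,
}


def priority_score(evidence):
    """Sum the points for the evidence layers a variant satisfies."""
    evidence = set(evidence)
    unknown = evidence - set(EVIDENCE_POINTS)
    if unknown:
        raise ValueError(f"Unknown evidence keys: {sorted(unknown)}")
    return sum(EVIDENCE_POINTS[key] for key in evidence)


def rank_variants(variant_evidence):
    """Return (score, variant_id) pairs, highest score first, ties by ID."""
    scored = [(priority_score(ev), vid) for vid, ev in variant_evidence.items()]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


ranking = rank_variants({
    "varA": {"fine_mapped_pip_gt_0_1", "tissue_h3k27ac", "disrupts_tf_motif"},
    "varB": {"tf_chip_peak", "cadd_gt_15"},
})
print(ranking)
```

As the text notes, the total is only a guide, so it is worth keeping the per-layer evidence next to the score rather than reporting the number alone.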
Document the full annotation:
encode_log_derived_file(
    file_path="/path/to/variant_annotations.tsv",
    source_accessions=["ENCSR...", "ENCSR...", ...],
    description="Functional annotation of [N] GWAS variants for [disease] using ENCODE H3K27ac, ATAC-seq, TF ChIP-seq, and Hi-C in [tissue]",
    file_type="variant_annotation",
    tool_used="bedtools intersect + custom scoring",
    parameters="LD expansion r2>0.8 EUR, GRCh38, IDR thresholded peaks, priority scoring v1"
)
Link to external references:
encode_link_reference(
    experiment_accession="ENCSR...",
    reference_type="doi",
    reference_id="10.xxxx/original_gwas_doi",
    description="Original GWAS study providing variant set"
)
For the final variant annotation, report:
Goal: Comprehensively annotate variants in ENCODE-defined regulatory elements by combining Ensembl VEP, ClinVar, gnomAD, and GWAS Catalog data into a unified variant annotation pipeline.
Context: Individual annotation databases provide partial information. Combining them creates a complete picture of variant function, clinical significance, and population context.
Start with ENCODE ATAC-seq peaks to define accessible regulatory regions:
encode_search_experiments(assay_title="ATAC-seq", organ="pancreas", organism="Homo sapiens")
Expected output:
{
  "total": 6,
  "results": [
    {"accession": "ENCSR400PAN", "assay_title": "ATAC-seq", "biosample_summary": "islet of Langerhans", "status": "released"}
  ]
}
encode_download_files(accessions=["ENCFF500ISL"], download_dir="/data/variant_annotation")
Using Ensembl REST API:
POST https://rest.ensembl.org/vep/human/region
{"variants": ["10 112998590 . C T . . ."]}
Result: regulatory_region_variant in active promoter region.
Using NCBI E-utilities (ClinVar):
GET https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=clinvar&term=rs7903146
Result: Pathogenic for type 2 diabetes mellitus.
Using the gnomAD GraphQL API:
{
  variant(dataset: gnomad_r4, variantId: "10-112998590-C-T") {
    genome { af populations { id af } }
  }
}
Result: AF=0.30 globally, AF=0.08 in East Asian populations.
Using the GWAS Catalog REST API:
GET https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/rs7903146
Result: Associated with type 2 diabetes (p=2.4×10⁻¹²), fasting glucose, insulin resistance.
Combine all layers:
| Field | Value |
|---|---|
| Variant | rs7903146 (chr10:112998590 C>T, GRCh38) |
| VEP consequence | regulatory_region_variant |
| ClinVar significance | Pathogenic |
| gnomAD AF | 0.30 (global) |
| GWAS trait | Type 2 diabetes (p=2.4e-12) |
| ENCODE context | In islet ATAC-seq peak (open chromatin) |
Interpretation: This is a well-characterized T2D risk variant in an islet-specific regulatory element. The 30% global AF confirms it's a common risk variant, consistent with its role in polygenic T2D risk.
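The "combine all layers" step amounts to merging per-source results into one record per variant. The sketch below shows one way to do that, assuming each source has already been queried and reduced to a single value; the field names, function names, and the `NA` fallback are choices made for this illustration, and the example values are copied from the worked example above. The ClinVar layer is left out of the call only to show how a missing layer is handled.

```python
FIELDS = [
    "variant",
    "vep_consequence",
    "clinvar_significance",
    "gnomad_af",
    "gwas_trait",
    "encode_context",
]


def merge_layers(variant, **layers):
    """Build one flat record per variant; missing layers become 'NA'."""
    unknown = set(layers) - set(FIELDS[1:])
    if unknown:
        raise ValueError(f"Unknown annotation layers: {sorted(unknown)}")
    record = {"variant": variant}
    for field in FIELDS[1:]:
        record[field] = layers.get(field, "NA")
    return record


def to_tsv(records):
    """Serialize records with a header row, one variant per line."""
    lines = ["\t".join(FIELDS)]
    for rec in records:
        lines.append("\t".join(str(rec[field]) for field in FIELDS))
    return "\n".join(lines)


record = merge_layers(
    "rs7903146",
    vep_consequence="regulatory_region_variant",
    gnomad_af=0.30,
    gwas_trait="Type 2 diabetes (p=2.4e-12)",
    encode_context="In islet ATAC-seq peak (open chromatin)",
)
print(to_tsv([record]))
```

Writing an explicit `NA` instead of dropping the column makes it obvious downstream which sources returned nothing for a given variant.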
encode_search_experiments(assay_title="ATAC-seq", organ="pancreas", organism="Homo sapiens")
Expected output:
{
  "total": 6,
  "results": [
    {"accession": "ENCSR400PAN", "assay_title": "ATAC-seq", "biosample_summary": "islet of Langerhans"}
  ]
}
encode_list_files(experiment_accession="ENCSR400PAN", file_format="bed", output_type="IDR thresholded peaks", assembly="GRCh38")
Expected output:
{
  "files": [
    {"accession": "ENCFF500ISL", "output_type": "IDR thresholded peaks", "file_format": "bed narrowPeak", "file_size_mb": 0.8}
  ]
}
encode_track_experiment(accession="ENCSR400PAN", notes="Islet ATAC-seq for multi-source variant annotation pipeline")
Expected output:
{
  "status": "tracked",
  "accession": "ENCSR400PAN",
  "notes": "Islet ATAC-seq for multi-source variant annotation pipeline"
}
| This skill produces... | Feed into... | Purpose |
|---|---|---|
| Unified variant annotation table | disease-research | Connect annotated variants to disease mechanisms |
| VEP-annotated variants | ensembl-annotation | Detailed consequence prediction for regulatory variants |
| ClinVar-annotated variants | clinvar-annotation | Clinical significance assessment |
| Frequency-filtered variants | gnomad-variants | Population frequency context |
| Trait-associated variants | gwas-catalog | GWAS evidence for variant-disease connections |
| Regulatory variant BED | peak-annotation | Assign annotated variants to target genes |
| Annotated variant reports | visualization-workflow | Generate variant annotation summary visualizations |
| Variant-gene links | gtex-expression | Validate expression effects of annotated variants |
- regulatory-elements: Characterizing the regulatory elements themselves (not variant-specific)
- multi-omics-integration: Combining ENCODE data types for deeper regulatory analysis
- disease-research: Broader disease-focused workflows using ENCODE
- quality-assessment: Evaluating quality of ENCODE experiments used in annotation
- histone-aggregation: Aggregating histone ChIP-seq peaks across samples for annotation
- accessibility-aggregation: Aggregating ATAC-seq/DNase-seq peaks across samples
- hic-aggregation: Aggregated Hi-C data improves enhancer-gene linkage for variant interpretation
- single-cell-encode: Cell type-resolved scATAC-seq peaks improve variant annotation in heterogeneous tissues
- epigenome-profiling: Build a complete epigenomic profile of the disease tissue to contextualize variants
- data-provenance: Document the full variant annotation pipeline for reproducibility
- pipeline-guide: Guidance for running fine-mapping, S-LDSC, and other computational pipelines
- gnomad-variants: Population frequency and gene constraint data for variant prioritization
- ensembl-annotation: VEP consequence prediction, CADD/REVEL scores, Regulatory Build overlap
- ucsc-browser: Retrieve ENCODE tracks and sequence context from UCSC for variant regions
- clinvar-annotation: Annotate variants with clinical significance, pathogenicity, and review status
- gwas-catalog: Check GWAS associations for variants and retrieve trait-associated loci
- gtex-expression: Check tissue expression of variant-associated genes across 54 GTEx tissues
- jaspar-motifs: Check if variant disrupts transcription factor binding motif
- publication-trust: Verify literature claims backing analytical decisions