Help us improve
Share bugs, ideas, or general feedback.
From encode-toolkit
Aggregates WGBS data across ENCODE experiments, donors, and labs to build tissue-level DNA methylation maps. Identifies HMRs, PMDs; filters coverage; handles cross-lab variation.
npx claudepluginhub ammawla/encode-toolkit --plugin encode-toolkitHow this skill is triggered — by the user, by Claude, or both
Slash command
/encode-toolkit:methylation-aggregationThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
- User wants to build a tissue-level DNA methylation landscape from multiple WGBS experiments
Aggregates WGBS data across ENCODE experiments, donors, and labs to build tissue-level DNA methylation maps. Identifies HMRs, PMDs; filters coverage; handles cross-lab variation.
Analyzes genomics and epigenomics data: DNA methylation (CpG, bisulfite, RRBS), m6A RNA modification (MeRIP-seq), ChIP-seq peaks, ATAC-seq, histone modifications, chromatin state, and multi-omics integration using pandas/scipy/pysam computation.
Queries the ENCODE Portal REST API to retrieve regulatory genomics data: TF ChIP-seq, ATAC-seq, histone marks, RNA-seq metadata, BED/bigWig files, and SCREEN cCREs. Use for variant annotation, open chromatin analysis, and peak file download.
Share bugs, ideas, or general feedback.
Build a comprehensive methylation landscape for a tissue/cell type by merging WGBS bedMethyl files from multiple ENCODE experiments.
The question: "What is the DNA methylation state across the genome in my tissue?"
DNA methylation is fundamentally different from histone marks and accessibility:
| Property | Histone/Accessibility | DNA Methylation |
|---|---|---|
| Signal type | Binary (bound/open or not) | Continuous (0-100% methylated) |
| Default state | Unmarked | ~70-80% methylated (CpG context) |
| Biology of interest | Where marks ARE present | Where methylation is ABSENT or REDUCED |
| Aggregation approach | Union of peak calls | Average/median of methylation levels per CpG |
The key insight: Unlike histone ChIP-seq where we want the union of all peaks, for methylation we want the average methylation level per CpG site across individuals. Methylation is a quantitative, continuous signal measured at every CpG dinucleotide.
However, for identifying regulatory regions, we focus on hypomethylated regions (HMRs) — stretches of low methylation that mark active regulatory elements. HMRs can be treated more like peaks for union-style aggregation.
encode_search_experiments(
assay_title="WGBS",
organ="pancreas", # user's tissue of interest
biosample_type="tissue",
limit=100
)
Present a summary to the user:
Use encode_get_facets to check availability:
encode_get_facets(assay_title="WGBS", organ="pancreas")
Note: WGBS is expensive to generate. Typical tissues have 2-5 experiments. Even 2 biological replicates are valuable for identifying consistent methylation patterns.
encode_get_experiment(accession="ENCSR...")
methylation state at CpG output files (bedMethyl format)Track all included experiments:
encode_track_experiment(accession="ENCSR...")
For each experiment:
encode_list_files(
experiment_accession="ENCSR...",
output_type="methylation state at CpG",
assembly="GRCh38"
)
bedMethyl format (ENCODE standard):
chr start end name score strand thickStart thickEnd color coverage percentMethylated
Prefer preferred_default=True files:
encode_download_files(
file_accessions=["ENCFF...", ...],
download_dir="/path/to/data/methylation",
organize_by="flat"
)
Low-coverage CpGs have unreliable methylation estimates. Filter per-sample:
# Keep only CpGs with >= 5x coverage (column 10)
# More stringent: >= 10x for quantitative analysis
awk '$10 >= 5' sample.bedMethyl > sample.covfiltered.bedMethyl
Coverage thresholds by use case:
| Threshold | Use Case | Typical CpGs Retained |
|---|---|---|
| >=3x | Exploratory / maximum retention | ~90% of CpGs |
| >=5x | Standard analysis | ~80% of CpGs |
| >=10x | High-confidence quantitative | ~60% of CpGs |
# Download from: https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg38-blacklist.v2.bed.gz
bedtools intersect -a sample.covfiltered.bedMethyl -b hg38-blacklist.v2.bed -v > sample.filtered.bedMethyl
CpG methylation is typically symmetric (same on both strands). Merge strand-specific calls to increase per-CpG coverage. Caveat: Some ENCODE bedMethyl files may already be strand-merged — check if both strands are present before applying this step:
# Group CpGs by position (forward and reverse strand of same CpG)
# Sum coverage, calculate weighted average methylation
awk 'BEGIN{OFS="\t"} {
# CpG position (use the C position as canonical)
if ($6 == "+") pos = $2
else pos = $2 - 1
key = $1"\t"pos
cov[key] += $10
meth[key] += ($11/100) * $10
} END {
for (k in cov) {
split(k, a, "\t")
avg_meth = (meth[k] / cov[k]) * 100
print a[1], a[2], a[2]+2, "CpG", 0, ".", a[2], a[2]+2, "0,0,0", cov[k], avg_meth
}
}' sample.filtered.bedMethyl | sort -k1,1 -k2,2n > sample.merged_strands.bedMethyl
# Step 1: Find CpGs covered in at least M of N samples
# Extract positions from each sample
for f in sample*.merged_strands.bedMethyl; do
awk 'BEGIN{OFS="\t"} {print $1, $2, $3}' "$f"
done | sort -k1,1 -k2,2n | uniq -c | \
awk -v M=2 '$1 >= M {print $2, $3, $4}' OFS="\t" > shared_cpgs.bed
# Step 2: For each sample, extract methylation at shared CpGs
for f in sample*.merged_strands.bedMethyl; do
bedtools intersect -a shared_cpgs.bed -b "$f" -wa -wb | \
awk 'BEGIN{OFS="\t"} {print $1, $2, $3, $NF, $(NF-1)}' > "${f%.bedMethyl}.shared.txt"
# Columns: chr, start, end, percentMeth, coverage
done
Weighted average (recommended — accounts for coverage differences):
# Combine all samples, calculate coverage-weighted mean methylation per CpG
cat sample*.shared.txt | \
sort -k1,1 -k2,2n | \
awk 'BEGIN{OFS="\t"} {
key = $1"\t"$2"\t"$3
if (key != prev_key && NR > 1) {
avg = total_weighted_meth / total_cov
print prev_key, n_samples, total_cov, avg
n_samples = 0; total_cov = 0; total_weighted_meth = 0
}
prev_key = key
n_samples++
total_cov += $5
total_weighted_meth += ($4/100) * $5
} END {
avg = total_weighted_meth / total_cov
print prev_key, n_samples, total_cov, avg
}' > tissue_methylation_profile.bed
# Columns: chr, start, end, n_samples, total_coverage, mean_methylation_fraction
Simple average (alternative — equal weight per sample):
# Unweighted mean across samples
awk 'BEGIN{OFS="\t"} {
key = $1"\t"$2"\t"$3
meth[key] += $4
n[key]++
} END {
for (k in meth) {
print k, n[k], meth[k]/n[k]
}
}' <(cat sample*.shared.txt) | sort -k1,1 -k2,2n > tissue_methylation_simple.bed
Track inter-individual variation to identify tissue-variable CpGs:
# Add standard deviation column
# (compute in R or Python for large datasets)
HMRs mark active regulatory elements. Identify runs of low methylation. The 30% threshold is a commonly used cutoff (Schultz et al. 2015 used similar ranges), but the optimal threshold depends on your tissue and question — some studies use 20%, others 40%. Consider visualizing the methylation distribution first to identify a natural breakpoint:
# Find CpGs with average methylation < 30% (adjust threshold as needed)
awk '$6 < 0.30' tissue_methylation_profile.bed > hypo_cpgs.bed
# Merge adjacent hypomethylated CpGs into regions
# Require minimum 3 CpGs within 1kb of each other
bedtools merge -i hypo_cpgs.bed -d 1000 -c 1 -o count | \
awk '$4 >= 3' > tissue_HMRs.bed
# Columns: chr, start, end, n_hypomethylated_CpGs
Very low methylation (<10%) at CpG-dense regions:
awk '$6 < 0.10' tissue_methylation_profile.bed > unmeth_cpgs.bed
bedtools merge -i unmeth_cpgs.bed -d 500 -c 1 -o count | \
awk '$4 >= 5' > tissue_UMRs.bed
Large (>10kb) regions of intermediate methylation, often marking repressed regions:
# Find CpGs with methylation 30-70% (partially methylated)
awk '$6 >= 0.30 && $6 <= 0.70' tissue_methylation_profile.bed > partial_cpgs.bed
# Merge with large gap tolerance to find domains
bedtools merge -i partial_cpgs.bed -d 5000 -c 1 -o count | \
awk '$4 >= 20 && ($3-$2) >= 10000' > tissue_PMDs.bed
If comparing to another tissue, use DMRcate or similar:
# In R with DMRcate
library(DMRcate)
# Requires a methylation matrix (CpGs x samples with tissue labels)
# Identifies regions where methylation differs between tissues
For HMRs/UMRs (region-level features), annotate by sample support:
| Confidence | Criteria | Interpretation |
|---|---|---|
| High | Low methylation in >=50% of samples | Constitutive regulatory region |
| Supported | Low methylation in 2+ samples | Likely regulatory, some variation |
| Variable | High variance across samples | Cell-type heterogeneity or individual variation |
# Annotate HMRs with sample support
# Intersect each HMR with per-sample hypomethylated CpGs to count support
awk -v N=4 '{
# Using n_samples from the aggregation
if ($4 >= N*0.5) conf="HIGH";
else if ($4 >= 2) conf="SUPPORTED";
else conf="VARIABLE";
print $0"\t"conf"\t"$4"/"N
}' tissue_HMRs.bed > tissue_HMRs.annotated.bed
For the per-CpG profile, annotate by coverage confidence:
awk -v N=4 '{
if ($4 >= N) conf="ALL_SAMPLES";
else if ($4 >= N*0.5) conf="MAJORITY";
else conf="PARTIAL";
print $0"\t"conf
}' tissue_methylation_profile.bed > tissue_methylation.annotated.bed
Report to the user:
Methylation data is most powerful when integrated:
# Example: Find HMRs that overlap H3K27ac peaks (active enhancers)
bedtools intersect -a tissue_HMRs.bed -b union_H3K27ac_peaks.bed -wa -u > active_enhancer_HMRs.bed
encode_log_derived_file(
file_path="/path/to/tissue_methylation.annotated.bed",
source_accessions=["ENCSR...", "ENCSR...", ...],
description="Aggregated per-CpG methylation profile across N pancreas WGBS experiments",
file_type="aggregated_methylation",
tool_used="bedtools + custom aggregation",
parameters="coverage >= 5x per sample, shared CpGs in >= 2 samples, coverage-weighted mean, strand-merged"
)
encode_log_derived_file(
file_path="/path/to/tissue_HMRs.annotated.bed",
source_accessions=["ENCSR...", "ENCSR...", ...],
description="Hypomethylated regions from aggregated pancreas methylation profile",
file_type="aggregated_HMRs",
tool_used="bedtools merge",
parameters="mean methylation < 30%, >= 3 CpGs within 1kb, confidence annotated"
)
Bisulfite conversion rate is critical: Even 1% incomplete conversion creates false methylation at unmethylated CpGs. Always verify >=98% conversion. ENCODE reports this in QC metrics.
Coverage drives accuracy: A CpG with 3x coverage has wide confidence intervals (0-100% could easily be 0% or 30%). At 10x, estimates stabilize. At 30x, they are reliable. Always filter by coverage.
Non-CpG methylation: Present in some cell types (especially embryonic). ENCODE bedMethyl files typically report CpG context only. If non-CpG methylation is relevant, check experiment metadata.
Strand asymmetry: While CpG methylation is typically symmetric, it can be asymmetric at some sites. Strand merging loses this information. For most analyses, merging is appropriate.
Cell-type heterogeneity: Bulk WGBS from tissue captures methylation across ALL cell types. A CpG at 50% methylation could mean: (a) all cells are 50% methylated, or (b) half the cells are 0% and half are 100%. These are biologically different. Single-cell methylation data (if available) resolves this.
CpG islands vs. open sea: CpG-dense regions (islands) have very different methylation dynamics than CpG-sparse regions. Consider analyzing separately.
Do NOT mix assemblies: All files must be GRCh38 or all hg19. CpG positions are exact — even a 1bp offset from liftOver misaligns CpGs.
X chromosome: Males have one X (hemimethylation), females have two (one inactivated with different methylation). Handle sex chromosomes separately or filter them.
Imprinted regions: Some regions show ~50% methylation in all individuals due to genomic imprinting (one allele methylated, one not). These are normal, not noise.
Do NOT use union logic for per-CpG methylation: Unlike histone peaks where union is correct, methylation levels should be AVERAGED. Union logic only applies to the derived HMR/UMR/PMD regions.
RRBS is NOT the same as WGBS: Reduced Representation Bisulfite Sequencing (RRBS) covers only CpG-rich regions (~10% of CpGs). Do NOT mix RRBS and WGBS in per-CpG averaging — the CpG universe is different.
Sequencing platform matters: Liu et al. 2024 showed that NovaSeq and DNBSEQ-T7 give slightly different methylation estimates, especially in GC-rich regions. Note the platform in provenance if mixing experiments from different sequencers.
Goal: Aggregate whole-genome bisulfite sequencing (WGBS) data across tissues to identify tissue-invariant vs. tissue-specific methylation patterns at imprinted gene loci. Context: Imprinted genes show parent-of-origin-specific methylation. Comparing across tissues reveals which imprinting control regions (ICRs) maintain methylation universally.
encode_search_experiments(assay_title="WGBS", organism="Homo sapiens", limit=50)
Expected output:
{
"total": 147,
"results": [
{"accession": "ENCSR765JPC", "assay_title": "WGBS", "biosample_summary": "liver", "status": "released"},
{"accession": "ENCSR832HMR", "assay_title": "WGBS", "biosample_summary": "brain", "status": "released"},
{"accession": "ENCSR091ENJ", "assay_title": "WGBS", "biosample_summary": "lung", "status": "released"}
]
}
Interpretation: 147 WGBS experiments available. Select tissues with ≥2 replicates for reliable per-CpG averaging.
encode_list_files(accession="ENCSR765JPC", file_format="bed", output_type="methylation state at CpG", assembly="GRCh38")
Expected output:
{
"files": [
{"accession": "ENCFF123BED", "output_type": "methylation state at CpG", "file_format": "bed bedMethyl", "file_size_mb": 245.0}
]
}
encode_download_files(accessions=["ENCFF123BED", "ENCFF456MET", "ENCFF789CPG"], download_dir="/data/wgbs")
For each tissue:
# H19/IGF2 ICR: chr11:2,016,000-2,022,000
bedtools intersect -a merged_methylation.bed -b imprinted_icrs.bed -wa -wb | \
awk '{sum+=$4; n++} END {print sum/n}'
Interpretation: ICRs showing ~50% methylation across ALL tissues confirm maintained imprinting. Tissue-variable ICRs (range >20%) suggest tissue-specific imprinting loss.
encode_get_facets(assay_title="WGBS", facet_field="organ", organism="Homo sapiens")
Expected output:
{
"facets": {
"organ": {"brain": 32, "liver": 18, "heart": 12, "lung": 10, "blood": 8, "kidney": 6}
}
}
encode_get_experiment(accession="ENCSR765JPC")
Expected output:
{
"accession": "ENCSR765JPC",
"assay_title": "WGBS",
"biosample_summary": "liver",
"replicates": 2,
"status": "released",
"audit": {"WARNING": 0, "ERROR": 0},
"pipeline": "WGBS (GRCh38)"
}
encode_track_experiment(accession="ENCSR765JPC", notes="Liver WGBS for cross-tissue methylation atlas")
Expected output:
{
"status": "tracked",
"accession": "ENCSR765JPC",
"notes": "Liver WGBS for cross-tissue methylation atlas",
"tracked_at": "2025-03-08T12:00:00Z"
}
| This skill produces... | Feed into... | Purpose |
|---|---|---|
| Per-CpG β-value matrix | regulatory-elements | Identify methylation at cCREs and enhancers |
| Differentially methylated regions (DMRs) | peak-annotation | Assign DMRs to nearest genes |
| Tissue-specific hypomethylated regions | histone-aggregation | Correlate with H3K4me3 active promoter marks |
| Methylation at CpG islands | variant-annotation | Find variants disrupting CpG sites |
| HMR/UMR/PMD boundaries | accessibility-aggregation | Overlay open chromatin at unmethylated regions |
| Cross-tissue methylation atlas | visualization-workflow | Generate methylation heatmaps across tissues |
| CpG methylation at GWAS loci | gwas-catalog | Annotate trait-associated variants with methylation context |