From encode-toolkit
Convert genomic coordinates between assemblies (GRCh37/hg19 to GRCh38/hg38, mm9 to mm10) using UCSC liftOver for BED files, CrossMap for VCF/bigWig. Handles unmapped regions with provenance logging.
npx claudepluginhub ammawla/encode-toolkit --plugin encode-toolkit
Slash command: /encode-toolkit:liftover-coordinates
- User needs to convert genomic coordinates between assemblies (hg19↔hg38, mm9↔mm10)
Guide coordinate liftover between genome assemblies using UCSC liftOver, CrossMap, Ensembl REST API, and rtracklayer. Assembly conversion is one of the most common pitfalls in genomics — this skill provides the definitive workflow for safe, reproducible liftover with full provenance tracking.
The question: "How do I safely convert my genomic coordinates from one assembly to another without losing data or introducing errors?"
Assembly conversion is referenced as a critical step in 10+ other ENCODE Toolkit skills because ENCODE spans multiple data releases: some experiments were processed against hg19/GRCh37, while most current data uses GRCh38/hg38. Combining data across assemblies without proper liftover is one of the most common and most dangerous errors in computational genomics — coordinates that look valid in both assemblies may refer to completely different genomic locations.
Genome assemblies are updated to fix errors, fill gaps, add alternative haplotypes, and improve centromeric/telomeric sequence. Between hg19 and hg38, approximately 1,000 sequence gaps were closed, 8% of the genome was modified, and several regions were rearranged. A coordinate like chr17:41,197,694 in hg19 (BRCA1) lands near chr17:43.05 Mb in GRCh38, a shift of about 1.85 Mb. Using the wrong assembly silently produces incorrect results.
Common scenarios requiring coordinate conversion:
| Common Name | UCSC Name | NCBI/GRC Name | Species | Release Year |
|---|---|---|---|---|
| hg19 | hg19 | GRCh37 | Human | 2009 |
| hg38 | hg38 | GRCh38 | Human | 2013 |
| mm9 | mm9 | MGSCv37 | Mouse | 2007 |
| mm10 | mm10 | GRCm38 | Mouse | 2012 |
| mm39 | mm39 | GRCm39 | Mouse | 2020 |
The same assembly has different names depending on the source:
- hg19, hg38, mm10: used in filenames, chromosome prefixes (chr1)
- GRCh37, GRCh38, GRCm38: used in publications, Ensembl
- Ensembl: no chr prefix (1 instead of chr1)

Always verify which naming convention your data uses. Mixing chr1 (UCSC) with 1 (Ensembl) causes silent failures in bedtools intersection and peak overlap analysis.
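The naming check described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the toolkit: the function names `to_ucsc`, `to_ensembl`, and `naming_style` are hypothetical, and the sketch only handles primary chromosomes plus the mitochondrial name (chrM in UCSC, MT in Ensembl). Unplaced scaffolds and alt contigs use different names in each convention and need a lookup table instead.

```python
def to_ucsc(chrom: str) -> str:
    """Convert an Ensembl-style name (1, X, MT) to UCSC style (chr1, chrX, chrM)."""
    if chrom.startswith("chr"):
        return chrom  # already UCSC style
    return "chrM" if chrom == "MT" else f"chr{chrom}"


def to_ensembl(chrom: str) -> str:
    """Convert a UCSC-style name (chr1, chrX, chrM) to Ensembl style (1, X, MT)."""
    if not chrom.startswith("chr"):
        return chrom  # already Ensembl style
    return "MT" if chrom == "chrM" else chrom[3:]


def naming_style(chroms) -> str:
    """Return 'ucsc', 'ensembl', 'mixed', or 'empty' for a collection of names."""
    styles = {"ucsc" if c.startswith("chr") else "ensembl" for c in chroms}
    if not styles:
        return "empty"
    return "mixed" if len(styles) > 1 else styles.pop()
```

Running `naming_style` over the first column of each input file before any intersection is a cheap way to catch a chr1-versus-1 mismatch early.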
Chain files encode the alignment between assemblies and are the essential input for liftover.
https://hgdownload.soe.ucsc.edu/goldenPath/{from}/liftOver/{from}To{To}.over.chain.gz
Common chain files:
| Conversion | Chain File | URL |
|---|---|---|
| hg19 to hg38 | hg19ToHg38.over.chain.gz | https://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz |
| hg38 to hg19 | hg38ToHg19.over.chain.gz | https://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz |
| mm9 to mm10 | mm9ToMm10.over.chain.gz | https://hgdownload.soe.ucsc.edu/goldenPath/mm9/liftOver/mm9ToMm10.over.chain.gz |
| mm10 to mm39 | mm10ToMm39.over.chain.gz | https://hgdownload.soe.ucsc.edu/goldenPath/mm10/liftOver/mm10ToMm39.over.chain.gz |
| mm10 to hg38 | mm10ToHg38.over.chain.gz | https://hgdownload.soe.ucsc.edu/goldenPath/mm10/liftOver/mm10ToHg38.over.chain.gz |
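The URL pattern above is regular enough to build programmatically. The sketch below assumes UCSC-style assembly names (hg19, mm10, and so on) and follows the pattern shown in the table, where the first letter of the target name is capitalized; the helper names `chain_file_name` and `chain_url` are hypothetical. It only builds the URL and does not check that the file exists on the server.

```python
UCSC_GOLDENPATH = "https://hgdownload.soe.ucsc.edu/goldenPath"


def chain_file_name(source: str, target: str) -> str:
    """Build the chain file name, e.g. ('hg19', 'hg38') -> 'hg19ToHg38.over.chain.gz'."""
    # UCSC capitalizes the first letter of the target assembly in the file name.
    return f"{source}To{target[0].upper()}{target[1:]}.over.chain.gz"


def chain_url(source: str, target: str) -> str:
    """Build the download URL following the goldenPath/{from}/liftOver/ pattern."""
    return f"{UCSC_GOLDENPATH}/{source}/liftOver/{chain_file_name(source, target)}"


if __name__ == "__main__":
    print(chain_url("hg19", "hg38"))
```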
ftp://ftp.ensembl.org/pub/assembly_mapping/
Ensembl provides chain files for their coordinate system (without chr prefix). Useful when working with Ensembl VEP output or Ensembl gene annotations.
NCBI Genome Remapping Service: https://www.ncbi.nlm.nih.gov/genome/tools/remap
Always verify chain file integrity after download:
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz
md5sum hg19ToHg38.over.chain.gz
# Verify against UCSC md5sum.txt in the same directory
gunzip -t hg19ToHg38.over.chain.gz # Test archive integrity
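The same MD5 check can be done from Python when the download happens inside a script. This is a sketch under two assumptions: the md5sum.txt listing uses the usual `<hash>  <filename>` layout, and its text has already been fetched into a string. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_md5(md5sum_text: str, file_name: str):
    """Look up file_name in md5sum-style text; return its hash or None."""
    for line in md5sum_text.splitlines():
        parts = line.split()
        # md5sum marks binary-mode entries with a leading '*' on the name.
        if len(parts) == 2 and parts[1].lstrip("*") == file_name:
            return parts[0].lower()
    return None


def chain_is_intact(path, md5sum_text: str) -> bool:
    """True only if the file is listed and its computed hash matches."""
    expected = expected_md5(md5sum_text, Path(path).name)
    return expected is not None and expected == md5_of_file(path)
```

The computed digest is also the value to record in the provenance log alongside the chain file name.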
The standard tool for BED-format coordinate conversion.
liftOver input.bed hg19ToHg38.over.chain.gz output.bed unmapped.bed
| Parameter | Default | Description |
|---|---|---|
| -minMatch | 0.95 | Minimum ratio of bases that must remap (0.0–1.0) |
| -minBlocks | 1.0 | Minimum ratio of alignment blocks or exons that must map (BED12 input) |
| -fudgeThick | off | If thickStart/thickEnd is not mapped, use the closest mapped base (BED12 input) |
| -multiple | off | Allow mapping to multiple output regions |
| -minChainT | 0 | Minimum chain size in the target, used with -multiple |
| -minChainQ | 0 | Minimum chain size in the query, used with -multiple |
| Data Type | -minMatch | Notes |
|---|---|---|
| SNP positions (1bp) | 0.95 (default) | Point coordinates almost always map cleanly |
| Narrow peaks (100–500bp) | 0.95 (default) | Short regions map well |
| Broad peaks (1–50kb) | 0.50–0.80 | Large regions may partially overlap rearrangements |
| Regulatory elements | 0.90 | Balance between completeness and accuracy |
| TAD boundaries (5–50kb) | 0.50 | Large-scale organization is approximate anyway |
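To make the -minMatch ratio concrete: it compares the number of bases that remap against the length of the input region. The sketch below only illustrates that ratio test and is not liftOver's internal logic; the function name `passes_min_match` and the base counts are made up for the example.

```python
def passes_min_match(mapped_bases: int, region_length: int, min_match: float = 0.95) -> bool:
    """Return True if the remapped fraction of a region meets the threshold."""
    if region_length <= 0:
        raise ValueError("region_length must be positive")
    return mapped_bases / region_length >= min_match


if __name__ == "__main__":
    # A hypothetical 10 kb broad peak in which 9,000 bases remap (ratio 0.90):
    print(passes_min_match(9_000, 10_000))        # default threshold of 0.95
    print(passes_min_match(9_000, 10_000, 0.80))  # relaxed threshold for broad peaks
```

With the default threshold this region would be reported as unmapped, while a relaxed threshold from the table keeps it, which is why broad domains are the ones that benefit from lowering the value.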
UCSC liftOver expects standard BED (3–12 columns). For narrowPeak files (BED6+4):
# Step 1: Keep the ten narrowPeak columns, tab-delimited
awk 'BEGIN{OFS="\t"} {print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10}' input.narrowPeak > input_full.bed
# Step 2: Liftover; -bedPlus=6 tells liftOver that only the first 6 columns are standard BED
# and that columns 7-10 should be carried through unchanged
liftOver -bedPlus=6 input_full.bed hg19ToHg38.over.chain.gz output.bed unmapped.bed
# Step 3: Verify column count is preserved
awk '{print NF}' output.bed | sort -u
Peak summit recalculation: After liftover, the summit position (column 10 in narrowPeak = offset from start) may no longer accurately represent the signal maximum. For critical analyses, re-calculate summits from signal data in the new assembly rather than relying on lifted summit positions.
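A cheap sanity check on lifted summits is to confirm that each offset still falls inside its lifted interval. The sketch below assumes tab-delimited narrowPeak lines with the summit offset in column 10 and -1 meaning "no summit called"; the function name `summit_in_bounds` is hypothetical. It only catches out-of-range offsets and does not replace recalculating summits from signal.

```python
def summit_in_bounds(narrowpeak_line: str) -> bool:
    """Check that the summit offset (column 10) lies within [0, end - start)."""
    fields = narrowpeak_line.rstrip("\n").split("\t")
    if len(fields) < 10:
        raise ValueError("expected at least 10 tab-delimited narrowPeak columns")
    start, end, offset = int(fields[1]), int(fields[2]), int(fields[9])
    if offset == -1:
        return True  # no summit was called for this peak
    return 0 <= offset < end - start


def count_bad_summits(lines) -> int:
    """Count peaks whose lifted summit offset falls outside the lifted interval."""
    return sum(1 for line in lines if line.strip() and not summit_in_bounds(line))
```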
# Count unmapped regions
wc -l unmapped.bed # Note: comment lines start with #
# Calculate loss rate
total=$(wc -l < input.bed)
unmapped=$(grep -v '^#' unmapped.bed | wc -l)
loss_pct=$(echo "scale=2; $unmapped * 100 / $total" | bc)
echo "Lost $unmapped of $total regions ($loss_pct%)"
# Investigate reasons for unmapping
grep '^#' unmapped.bed | sort | uniq -c | sort -rn
# Common reasons:
# "Partially deleted in new" — region spans a deletion
# "Deleted in new" — region fully removed
# "Split in new" — region maps to multiple locations
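The shell steps above can also be done in one pass in Python, which is convenient when the counts feed directly into a provenance log. This sketch assumes the unmapped file alternates a `#`-prefixed reason line with the record it refers to, as in the commands above; the function name `summarize_unmapped` is hypothetical.

```python
from collections import Counter


def summarize_unmapped(unmapped_lines, total_input: int) -> dict:
    """Count unmapped records and the reasons reported in '#' comment lines."""
    reasons = Counter()
    n_unmapped = 0
    for line in unmapped_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            reasons[line.lstrip("#").strip()] += 1  # e.g. "Deleted in new"
        else:
            n_unmapped += 1  # a BED record that failed to lift
    loss_pct = 100.0 * n_unmapped / total_input if total_input else 0.0
    return {
        "total": total_input,
        "unmapped": n_unmapped,
        "loss_pct": round(loss_pct, 2),
        "reasons": dict(reasons),
    }
```

In practice the lines would come from `open("unmapped.bed")`, and `total_input` from the line count of the input BED file.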
CrossMap (Zhao et al. 2014) handles file formats that UCSC liftOver cannot process natively.
CrossMap vcf hg19ToHg38.over.chain.gz input.vcf hg38.fa output.vcf
Critical VCF considerations: validate the lifted VCF against the target reference with bcftools norm.
# Post-liftover VCF validation
bcftools norm -f hg38.fa -c ws output.vcf -o output.normalized.vcf 2> norm_warnings.log
# -c ws: warn about and set incorrect REF alleles
CrossMap bigwig hg19ToHg38.over.chain.gz input.bw output  # last argument is an output prefix; CrossMap appends .bw
Signal track caveats:
CrossMap bam hg19ToHg38.over.chain.gz input.bam output.bam
BAM liftover is generally NOT recommended:
CrossMap gff hg19ToHg38.over.chain.gz input.gff output.gff
Useful for lifting gene annotations, but prefer downloading the native annotation for the target assembly from GENCODE or Ensembl.
For programmatic conversion of individual coordinates without installing local tools.
GET https://rest.ensembl.org/map/human/GRCh37/{region}/GRCh38?content-type=application/json
import requests

def liftover_ensembl(chrom, start, end, source="GRCh37", target="GRCh38", species="human"):
    """Convert coordinates using the Ensembl REST API; return the mappings, or None on failure."""
    region = f"{chrom}:{start}..{end}:1"
    url = f"https://rest.ensembl.org/map/{species}/{source}/{region}/{target}"
    headers = {"Content-Type": "application/json"}
    response = requests.get(url, headers=headers, timeout=30)
    if response.status_code == 200:
        return response.json()["mappings"]
    return None

# Example: BRCA1 region (Ensembl naming, no chr prefix)
mappings = liftover_ensembl("17", 41197694, 41276113)
for m in mappings or []:  # loop is skipped if the request failed
    mapped = m["mapped"]
    print(f"  {mapped['seq_region_name']}:{mapped['start']}-{mapped['end']}")
Ensembl uses chromosomes WITHOUT chr prefix:
- Ensembl: 17:41197694-41276113
- UCSC: chr17:41197694-41276113

Convert between conventions:
# Add 'chr' prefix (Ensembl to UCSC)
sed 's/^/chr/' input.bed > input_ucsc.bed
# Remove 'chr' prefix (UCSC to Ensembl)
sed 's/^chr//' input.bed > input_ensembl.bed
For R-based workflows, rtracklayer provides native liftover support.
library(rtracklayer)
library(GenomicRanges)
# Import chain file
chain <- import.chain("hg19ToHg38.over.chain")
# Create GRanges object from your coordinates
gr <- GRanges(
  seqnames = c("chr17", "chr7", "chr1"),
  ranges = IRanges(
    start = c(41197694, 55086725, 11873),
    end = c(41276113, 55275031, 14409)
  ),
  name = c("BRCA1", "EGFR", "DDX11L1")
)
# Perform liftover
lifted <- liftOver(gr, chain)
# liftOver returns a GRangesList (1:many mapping possible)
# Convert to GRanges (keeping only 1:1 mappings)
lifted_1to1 <- unlist(lifted[elementNROWS(lifted) == 1])
# Check for unmapped
n_unmapped <- sum(elementNROWS(lifted) == 0)
n_multimapped <- sum(elementNROWS(lifted) > 1)
cat(sprintf("Mapped: %d, Unmapped: %d, Multi-mapped: %d\n",
length(lifted_1to1), n_unmapped, n_multimapped))
| Package | Purpose |
|---|---|
| rtracklayer | Core liftover functionality |
| liftOver (AnnotationHub) | Pre-packaged chain files |
| GenomicRanges | GRanges manipulation pre/post liftover |
| VariantAnnotation | VCF-aware liftover |
| Conversion | Typical Loss | High-Loss Regions | Notes |
|---|---|---|---|
| hg19 to hg38 | 1–3% | Centromeric, telomeric, segmental duplications | Most reliable conversion |
| hg38 to hg19 | 2–5% | New alt haplotypes, gap-filled regions in hg38 | Higher loss due to new hg38 sequences |
| mm9 to mm10 | 3–5% | Significant rearrangements on multiple chromosomes | Document chromosome-level changes |
| mm10 to mm39 | 1–2% | Minor scaffold updates | Relatively clean conversion |
| mm10 to hg38 | N/A | Cross-species: use synteny, not liftover | Requires different approach (e.g., UCSC synteny maps) |
- … -multiple flag to detect split mappings.
- … --refgenome but UCSC liftOver does not; always validate.

Log every liftover operation with encode_log_derived_file for full reproducibility:
encode_log_derived_file(
file_path="/path/to/lifted_peaks_hg38.bed",
source_accessions=["ENCSR...", "ENCFF..."],
description="Lifted [N] narrowPeak regions from hg19 to GRCh38. [X] unmapped ([Y]% loss). Original: [source description]",
file_type="lifted_coordinates",
tool_used="UCSC liftOver v377",
parameters="minMatch=0.95, chain=hg19ToHg38.over.chain.gz (MD5: abc123...), unmapped=[X]/[N] ([Y]%)"
)
Every liftover log entry should include:
- -minMatch or equivalent parameter

1. Identify assemblies: Check input assembly → check target assembly
2. Get chain file: Download from UCSC → verify MD5
3. Select tool: BED → liftOver | VCF → CrossMap | single → Ensembl API | R → rtracklayer
4. Convert: Run liftover with appropriate parameters
5. Check loss: Count unmapped, flag if >5%
6. Validate: Verify output assembly, check chromosome names
7. Post-process: Re-center peaks, normalize VCF, re-annotate genes
8. Log provenance: Record all parameters, tools, loss rates
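Steps 3 and 5 of the workflow above are simple enough to encode directly. The sketch below is an illustration only: the function names and the input-kind labels are hypothetical, the tool mapping mirrors step 3, and the 5% threshold comes from step 5.

```python
TOOL_BY_INPUT = {
    "bed": "UCSC liftOver",
    "vcf": "CrossMap",
    "bigwig": "CrossMap",
    "single": "Ensembl REST API",  # one-off coordinates
    "r": "rtracklayer",            # R-based workflows
}


def select_tool(input_kind: str) -> str:
    """Pick the liftover tool for an input kind, following step 3."""
    try:
        return TOOL_BY_INPUT[input_kind.lower()]
    except KeyError:
        raise ValueError(f"no liftover tool listed for input kind: {input_kind!r}")


def loss_needs_review(n_unmapped: int, n_total: int, threshold_pct: float = 5.0) -> bool:
    """Flag a conversion whose unmapped fraction exceeds the threshold (step 5)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 100.0 * n_unmapped / n_total > threshold_pct
```

A flagged conversion is a prompt to inspect the unmapped reasons before continuing and does not by itself mean the run failed.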
Goal: Convert ENCODE BED peak files from hg19 to GRCh38 (or vice versa) for cross-study integration when experiments use different genome builds.
Context: Older ENCODE experiments may be aligned to hg19, while newer ones use GRCh38. LiftOver enables coordinate conversion for combined analysis.
encode_search_experiments(assay_title="Histone ChIP-seq", organ="liver", target="H3K27ac", organism="Homo sapiens")
Expected output:
{
"total": 8,
"results": [
{"accession": "ENCSR100OLD", "assay_title": "Histone ChIP-seq", "assembly": "hg19"},
{"accession": "ENCSR200NEW", "assay_title": "Histone ChIP-seq", "assembly": "GRCh38"}
]
}
Interpretation: ENCSR100OLD uses hg19 — needs liftover before merging with ENCSR200NEW (GRCh38).
encode_list_files(accession="ENCSR100OLD", file_format="bed", assembly="hg19")
liftOver ENCSR100OLD_peaks.bed hg19ToHg38.over.chain.gz peaks_GRCh38.bed unmapped.bed
Count converted vs. unmapped:
encode_log_derived_file(
source_accessions=["ENCFF100OLD"],
derived_file="/data/peaks_GRCh38.bed",
description="Lifted from hg19 to GRCh38 using UCSC liftOver",
tool="liftOver (UCSC, chain: hg19ToHg38.over.chain.gz)"
)
encode_get_file_info(accession="ENCFF100OLD")
Expected output:
{
"accession": "ENCFF100OLD",
"assembly": "hg19",
"file_format": "bed narrowPeak"
}
encode_list_files(accession="ENCSR100OLD", file_format="bed", assembly="GRCh38")
Expected output:
{
"files": []
}
Interpretation: No GRCh38 files available — liftover is required.
encode_log_derived_file(
source_accessions=["ENCFF100OLD"],
derived_file="/data/peaks_GRCh38.bed",
description="hg19→GRCh38 liftOver",
tool="UCSC liftOver"
)
Expected output:
{"status": "logged", "derived_file": "/data/peaks_GRCh38.bed", "source_count": 1}
- variant-annotation: Variants often need liftover before annotation with ENCODE data (hg19 GWAS variants to GRCh38)
- gwas-catalog: GWAS Catalog coordinates may be in GRCh37; liftover needed for ENCODE GRCh38 integration
- ensembl-annotation: Ensembl REST API provides coordinate mapping; Ensembl uses non-chr chromosome naming
- ucsc-browser: UCSC provides chain files and the liftOver tool; retrieve assembly-specific tracks
- gnomad-variants: gnomAD v4 uses GRCh38; v2 uses GRCh37; liftover needed for cross-version analysis
- histone-aggregation: Aggregating peaks across samples requires all peaks in the same assembly
- accessibility-aggregation: ATAC-seq/DNase-seq peak union requires assembly-consistent coordinates
- data-provenance: Every liftover operation must be logged with chain file, tool version, and loss rate
- publication-trust: Verify literature claims backing analytical decisions

When reporting liftover results, always present:
Example output summary:
Liftover: hg19 -> GRCh38
Input: 45,231 narrowPeak regions
Mapped: 44,012 (97.3%)
Unmapped: 1,219 (2.7%) — 847 partially deleted, 312 split, 60 fully deleted
Chain: hg19ToHg38.over.chain.gz (UCSC, MD5: 7a42e...)
Tool: UCSC liftOver v377, minMatch=0.95
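A summary in this shape can be generated from the raw counts so that the percentages never drift from the numbers. The sketch below is a hypothetical helper, not a toolkit function; it covers only the count lines and leaves the chain and tool lines to be appended from the provenance record.

```python
def format_liftover_summary(source: str, target: str, n_input: int,
                            n_unmapped: int, region_type: str = "regions") -> str:
    """Format the count lines of a liftover summary from raw counts."""
    if n_input <= 0 or not 0 <= n_unmapped <= n_input:
        raise ValueError("counts must satisfy 0 <= n_unmapped <= n_input and n_input > 0")
    n_mapped = n_input - n_unmapped
    mapped_pct = 100.0 * n_mapped / n_input
    unmapped_pct = 100.0 * n_unmapped / n_input
    return "\n".join([
        f"Liftover: {source} -> {target}",
        f"Input: {n_input:,} {region_type}",
        f"Mapped: {n_mapped:,} ({mapped_pct:.1f}%)",
        f"Unmapped: {n_unmapped:,} ({unmapped_pct:.1f}%)",
    ])


if __name__ == "__main__":
    # Counts taken from the example summary above.
    print(format_liftover_summary("hg19", "GRCh38", 45231, 1219, "narrowPeak regions"))
```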