From sciagent-skills
Performs de novo and known TF motif enrichment plus peak annotation in ChIP-seq/ATAC-seq peaks using HOMER's findMotifsGenome.pl and annotatePeaks.pl. Identifies enriched TFs and nearest genes after MACS3.
npx claudepluginhub jaechang-hits/sciagent-skills --plugin sciagent-skillsThis skill uses the workspace's default tool permissions.
HOMER (Hypergeometric Optimization of Motif EnRichment) is a suite of Perl/C++ tools for analyzing genomic regulatory elements. Its two primary commands are `findMotifsGenome.pl`, which performs de novo motif discovery and known motif enrichment against JASPAR/HOMER databases, and `annotatePeaks.pl`, which maps each peak to the nearest gene, distance to TSS, and genomic feature class (promoter,...
Calls peaks in ChIP-seq/ATAC-seq BAM files using MACS3 callpeak. Outputs BED narrowPeak/broadPeak for TF binding, open chromatin, or broad histone marks vs input control.
Processes NGS sequencing data with deepTools: BAM to bigWig conversion, QC (correlation, PCA, fingerprints), heatmaps/profiles for ChIP-seq, RNA-seq, ATAC-seq visualization.
Processes NGS data using deepTools: BAM to bigWig conversion, QC (correlation, PCA, fingerprints), heatmaps/profiles around TSS/peaks for ChIP-seq, RNA-seq, ATAC-seq.
Share bugs, ideas, or general feedback.
HOMER (Hypergeometric Optimization of Motif EnRichment) is a suite of Perl/C++ tools for analyzing genomic regulatory elements. Its two primary commands are findMotifsGenome.pl, which performs de novo motif discovery and known motif enrichment against JASPAR/HOMER databases, and annotatePeaks.pl, which maps each peak to the nearest gene, distance to TSS, and genomic feature class (promoter, intron, intergenic, repeat). HOMER takes BED-format peak files from MACS3 or similar peak callers and a reference genome assembly as input, and outputs HTML/text reports ranking enriched motifs by p-value and fold enrichment over a matched background.
macs3-peak-calling first to generate the peak BED files that serve as input to HOMERjaspar-database to cross-reference HOMER-discovered motifs with JASPAR IDs and additional TF metadataMEME-CHIP (web or local) when you need a more probabilistic ZOOPS/TCM model or the MEME Suite ecosystemAME (part of MEME Suite) as a faster alternative for known motif scanning without de novo discoveryinstallGenome.pl after HOMER installpandas, matplotlib, seabornCheck before installing: The tool may already be available in the current environment (e.g., inside a
pixi/condaenv). Runcommand -v findMotifsGenome.plfirst and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool viapixi run findMotifsGenome.plrather than barefindMotifsGenome.pl.
# Install HOMER via conda (recommended — handles Perl dependencies)
conda install -c bioconda homer
# Verify installation
findMotifsGenome.pl 2>&1 | head -3
# Usage: findMotifsGenome.pl <peak/BED file> <genome> <output directory> [options]
annotatePeaks.pl 2>&1 | head -3
# Usage: annotatePeaks.pl <peak/BED file> <genome> [options]
# Install reference genomes (downloads 2-way masker + sequence; ~3–10 GB each)
installGenome.pl hg38
installGenome.pl mm10
# Install Python parsing dependencies
pip install pandas matplotlib seaborn
# Run de novo + known motif enrichment on TF ChIP-seq peaks (hg38, 200 bp window)
findMotifsGenome.pl peaks/tf_chip_summits.bed hg38 motif_output/ \
-size 200 -mask -p 4
# Annotate peaks with nearest genes and genomic features
annotatePeaks.pl peaks/tf_chip_peaks.narrowPeak hg38 > annotated_peaks.txt
echo "Top known motif:"
head -2 motif_output/knownResults.txt | tail -1 | cut -f1-4
echo "Annotated peaks: $(wc -l < annotated_peaks.txt) lines"
Install HOMER and download the reference genome sequence required for motif analysis.
# Activate conda environment (or use existing env)
conda create -n homer_env -c bioconda homer python=3.10 -y
conda activate homer_env
# List available genomes
installGenome.pl list
# Install human (hg38) and mouse (mm10) genomes
# Downloads masked genome sequence and annotation files
installGenome.pl hg38
# Output: Installing hg38... Done. (3-5 min, ~3 GB)
installGenome.pl mm10
# Output: Installing mm10... Done. (3-5 min, ~2.5 GB)
# Verify genome is installed
ls ~/.homer/data/genomes/hg38/
# genome.fa chrom.sizes ...
# Check HOMER motif database
ls ~/.homer/data/knownTFs/
# vertebrates.motifs jaspar.motifs ...
Prepare a summit-centered BED file from MACS3 output for optimal motif resolution.
# Option A: Use MACS3 summit file directly (already 1 bp summit positions)
# Expand summits to ±100 bp (200 bp total) centered on summit
awk 'BEGIN{OFS="\t"} {
start = ($2 - 100 < 0) ? 0 : $2 - 100;
print $1, start, $2 + 100, $4, $5
}' peaks/tf_chip_summits.bed > peaks/tf_chip_200bp.bed
echo "Summit-centered peaks: $(wc -l < peaks/tf_chip_200bp.bed)"
# Summit-centered peaks: 12453
# Option B: Use narrowPeak file directly (HOMER accepts multi-column BED)
# HOMER uses columns 1-3 (chr, start, end) and centers internally with -size
cp peaks/tf_chip_peaks.narrowPeak peaks/input_peaks.bed
# Option C: Prepare a custom background region file (matched GC content)
# HOMER auto-generates background if not provided, but explicit background
# is recommended when comparing two peak sets
# Use the control peak set or random genomic regions as background:
bedtools shuffle -i peaks/tf_chip_peaks.narrowPeak \
-g ~/.homer/data/genomes/hg38/chrom.sizes \
-excl peaks/tf_chip_peaks.narrowPeak > peaks/background_regions.bed
echo "Background regions: $(wc -l < peaks/background_regions.bed)"
# Background regions: 12453
Run findMotifsGenome.pl for de novo motif discovery and known motif enrichment simultaneously.
mkdir -p motif_output/
# Full run: de novo + known motif enrichment
# -size 200: use 200 bp window centered on peak midpoint
# -mask: mask repetitive elements (recommended for clean motifs)
# -p 4: use 4 CPU threads
# -S 25: find top 25 de novo motifs (default)
findMotifsGenome.pl peaks/tf_chip_200bp.bed hg38 motif_output/ \
-size 200 \
-mask \
-p 4 \
-S 25
# Check progress output:
# Reading genome sizes for hg38 ...
# Scanning for motifs...
# Optimizing 25 motifs...
# Done! Output in motif_output/
echo "Known results: $(wc -l < motif_output/knownResults.txt) motifs"
echo "De novo motifs: $(ls motif_output/homerResults/*.motif 2>/dev/null | wc -l) motifs"
# Known results: 392 motifs
# De novo motifs: 25 motifs
# For mouse peaks (mm10)
# findMotifsGenome.pl peaks/atac_peaks.bed mm10 motif_output_mm10/ \
# -size 200 -mask -p 4
Scan peaks for occurrences of a specific known motif or skip de novo discovery for speed.
# Skip de novo discovery (faster when you only need known motifs)
findMotifsGenome.pl peaks/tf_chip_200bp.bed hg38 motif_output_known/ \
-size 200 \
-mask \
-p 4 \
-nomotif
echo "Known motif results: $(wc -l < motif_output_known/knownResults.txt)"
# Known motif results: 392
# Find occurrences of a specific motif across peaks (outputs peak-level annotation)
# Extract the motif matrix file for the TF of interest from homerResults/
findMotifsGenome.pl peaks/tf_chip_200bp.bed hg38 motif_scan_out/ \
-size 200 \
-mask \
-find motif_output/homerResults/motif1.motif \
> peaks_with_motif1.txt
echo "Peaks containing motif1: $(wc -l < peaks_with_motif1.txt)"
# Peaks containing motif1: 8941
# Custom background: compare treated vs. control peak sets
findMotifsGenome.pl peaks/treated_peaks.bed hg38 motif_treated_vs_ctrl/ \
-size 200 \
-mask \
-p 4 \
-bg peaks/control_peaks.bed
Use annotatePeaks.pl to assign each peak to a genomic feature and nearest gene.
# Annotate peaks with nearest gene and TSS distance
# Outputs a tab-delimited file with genomic context for each peak
annotatePeaks.pl peaks/tf_chip_peaks.narrowPeak hg38 \
> annotated_peaks.txt
echo "Annotated peaks: $(($(wc -l < annotated_peaks.txt) - 1)) peaks"
# Annotated peaks: 12453 peaks
# Preview column headers and first peak
head -2 annotated_peaks.txt | cut -f1-10
# Annotate ATAC-seq peaks (same command, different input)
annotatePeaks.pl peaks/atac_sample_peaks.narrowPeak hg38 \
> annotated_atac.txt
# Generate TSS-distance histogram (for tag density plots)
# annotatePeaks.pl can compute read density around peaks with -d flag
# annotatePeaks.pl tss hg38 -size 4000 -hist 10 \
# -d chip_tagdir/ > tss_histogram.txt
Read knownResults.txt and de novo motif files into pandas for downstream analysis.
import pandas as pd
import subprocess
import io
# --- Parse known motif enrichment results ---
# knownResults.txt columns:
# Motif Name | Consensus | P-value | Log P-value | q-value | # Target Seqs w/ motif | % Target | # Bg Seqs w/ motif | % Bg
known_cols = [
"motif_name", "consensus", "pvalue", "log_pvalue",
"qvalue", "n_target_seqs", "pct_target",
"n_bg_seqs", "pct_bg"
]
known = pd.read_csv(
"motif_output/knownResults.txt",
sep="\t", header=0, names=known_cols
)
# Convert string percentages to floats
known["pct_target"] = known["pct_target"].str.rstrip("%").astype(float)
known["pct_bg"] = known["pct_bg"].str.rstrip("%").astype(float)
known["fold_enrichment"] = known["pct_target"] / known["pct_bg"].replace(0, 0.001)
known["-log10_pvalue"] = -known["log_pvalue"] / 2.303 # log10 from natural log
print(f"Total known motifs tested: {len(known)}")
print(f"Significant (p < 1e-5): {(known['pvalue'].astype(float) < 1e-5).sum()}")
print("\nTop 5 enriched motifs:")
print(
known[["motif_name", "pct_target", "pct_bg", "fold_enrichment", "-log10_pvalue"]]
.head(5)
.to_string(index=False)
)
# Total known motifs tested: 391
# Significant (p < 1e-5): 47
# Top 5 enriched motifs:
# motif_name pct_target pct_bg fold_enrichment -log10_pvalue
# CTCF(Zf)/GM12878-CTCF-ChIP-Seq... 78.21 8.32 9.40 312.4
# ZNF143(Zf)/... 42.11 5.10 8.26 188.7
# ...
Read annotated peak output, summarize genomic feature distribution, and plot motif enrichments.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
# --- Parse annotated peaks ---
# annotatePeaks.pl output header: PeakID chr start end strand score ...
# Key columns: "Annotation", "Distance to TSS", "Nearest RefSeq", "Gene Name"
annot = pd.read_csv("annotated_peaks.txt", sep="\t", header=0, low_memory=False)
annot.columns = annot.columns.str.strip()
# Simplify annotation categories
def simplify_annotation(ann):
if pd.isna(ann):
return "Other"
ann = str(ann)
if "promoter" in ann.lower():
return "Promoter (<2 kb)"
elif "exon" in ann.lower():
return "Exon"
elif "intron" in ann.lower():
return "Intron"
elif "tts" in ann.lower() or "downstream" in ann.lower():
return "Downstream"
elif "intergenic" in ann.lower():
return "Intergenic"
else:
return "Other"
annot["simple_ann"] = annot["Annotation"].apply(simplify_annotation)
ann_counts = annot["simple_ann"].value_counts()
print("Peak annotation summary:")
print(ann_counts.to_string())
# Peak annotation summary:
# Intron 5832
# Intergenic 3102
# Promoter (<2 kb) 2218
# Exon 891
# Downstream 287
# Other 123
# --- Plot 1: Annotation pie chart ---
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
colors = ["#E41A1C", "#377EB8", "#4DAF4A", "#984EA3", "#FF7F00", "#A65628"]
axes[0].pie(
ann_counts.values,
labels=ann_counts.index,
colors=colors[:len(ann_counts)],
autopct="%1.1f%%",
startangle=140,
textprops={"fontsize": 9}
)
axes[0].set_title("Peak Genomic Annotation\n(n=12,453 peaks)", fontsize=11)
# --- Plot 2: Top known motif enrichments ---
known = pd.read_csv("motif_output/knownResults.txt", sep="\t", header=0)
known.columns = [
"motif_name", "consensus", "pvalue", "log_pvalue",
"qvalue", "n_target_seqs", "pct_target", "n_bg_seqs", "pct_bg"
]
known["pct_target"] = known["pct_target"].str.rstrip("%").astype(float)
known["pct_bg"] = known["pct_bg"].str.rstrip("%").astype(float)
known["-log10_p"] = (-known["log_pvalue"].astype(float)) / 2.303
known["short_name"] = known["motif_name"].str.split("/").str[0]
top_motifs = known.head(15).copy()
sns.barplot(
data=top_motifs,
x="-log10_p",
y="short_name",
palette="Blues_r",
ax=axes[1]
)
axes[1].set_xlabel("-log10(p-value)", fontsize=10)
axes[1].set_ylabel("Motif", fontsize=10)
axes[1].set_title("Top 15 Enriched Known Motifs\n(HOMER/JASPAR database)", fontsize=11)
axes[1].axvline(x=5, color="red", linestyle="--", linewidth=0.8, label="p=1e-5")
axes[1].legend(fontsize=8)
plt.tight_layout()
plt.savefig("homer_summary.png", dpi=150, bbox_inches="tight")
plt.close()
print("Saved: homer_summary.png")
| Parameter | Default | Range / Options | Effect |
|---|---|---|---|
-size | 200 | 50–1000 | Window size (bp) centered on peak midpoint for motif search; 200 for TF ChIP-seq, 150 for ATAC-seq |
-mask | off | flag | Mask repetitive elements (N-mask); strongly recommended to reduce false positives |
-p | 1 | 1–64 | Number of parallel CPU threads; use 4–8 for typical datasets |
-S | 25 | 1–100 | Number of de novo motifs to find; 25 is usually sufficient |
-nomotif | off | flag | Skip de novo motif discovery; only perform known motif enrichment (10× faster) |
-bg | auto-generated | BED file path | Custom background region file; auto-background is GC-matched if omitted |
-find | — | .motif file path | Scan peaks for occurrences of a specific motif matrix; outputs per-peak annotation |
-mis | 2 | 0–4 | Maximum mismatches allowed when matching known motifs |
-len | 8,10,12 | comma-separated integers | Motif lengths to try for de novo search; adding 6 finds shorter motifs |
genome | required | hg38, mm10, hg19, dm6, ce11, etc. | Reference genome assembly; must be installed with installGenome.pl |
Run HOMER from a Python script and capture completion status.
import subprocess
import sys
from pathlib import Path
def run_homer_motifs(
peaks_bed: str,
genome: str,
output_dir: str,
size: int = 200,
mask: bool = True,
cpus: int = 4,
n_motifs: int = 25,
denovo: bool = True,
bg_bed: str = None
) -> int:
"""Run findMotifsGenome.pl and return exit code."""
Path(output_dir).mkdir(parents=True, exist_ok=True)
cmd = [
"findMotifsGenome.pl", peaks_bed, genome, output_dir,
"-size", str(size),
"-p", str(cpus),
"-S", str(n_motifs),
]
if mask:
cmd.append("-mask")
if not denovo:
cmd.append("-nomotif")
if bg_bed:
cmd.extend(["-bg", bg_bed])
print(f"Running: {' '.join(cmd)}")
result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode != 0:
print(f"HOMER stderr:\n{result.stderr[-2000:]}", file=sys.stderr)
return result.returncode
print(f"Done. Output in {output_dir}/")
return 0
# Usage
exit_code = run_homer_motifs(
peaks_bed="peaks/tf_chip_200bp.bed",
genome="hg38",
output_dir="motif_output/",
size=200, mask=True, cpus=4, n_motifs=25
)
print(f"Exit code: {exit_code}")
# Running: findMotifsGenome.pl peaks/tf_chip_200bp.bed hg38 motif_output/ -size 200 -p 4 -S 25 -mask
# Done. Output in motif_output/
# Exit code: 0
Run HOMER on multiple ChIP-seq samples in a loop.
#!/bin/bash
# Batch motif analysis for several ChIP-seq experiments (same genome)
GENOME="hg38"
PEAK_DIR="peaks"
MOTIF_DIR="motif_results"
mkdir -p "$MOTIF_DIR"
SAMPLES=(CTCF H3K4me3 FOXA2 RUNX1)
for sample in "${SAMPLES[@]}"; do
echo "=== Processing $sample ==="
peak_file="${PEAK_DIR}/${sample}_summits.bed"
if [ ! -f "$peak_file" ]; then
echo " Skipping $sample: $peak_file not found"
continue
fi
# Center on summit ±100 bp
awk 'BEGIN{OFS="\t"} {s=$2-100<0?0:$2-100; print $1,s,$2+100,$4,$5}' \
"$peak_file" > "${PEAK_DIR}/${sample}_200bp.bed"
findMotifsGenome.pl "${PEAK_DIR}/${sample}_200bp.bed" "$GENOME" \
"${MOTIF_DIR}/${sample}/" \
-size 200 -mask -p 4 -nomotif \
2> "${MOTIF_DIR}/${sample}.log"
n=$(wc -l < "${MOTIF_DIR}/${sample}/knownResults.txt")
echo " $sample: $n motifs tested. Top hit: $(sed -n '2p' "${MOTIF_DIR}/${sample}/knownResults.txt" | cut -f1)"
done
# === Processing CTCF ===
# CTCF: 392 motifs tested. Top hit: CTCF(Zf)/GM12878-CTCF-ChIP-Seq(GSE32465)/Homer
# === Processing FOXA2 ===
# FOXA2: 392 motifs tested. Top hit: Foxa2(Forkhead)/Liver-Foxa2-ChIP-Seq(GSE25694)/Homer
Load de novo motif PWM matrices for downstream comparison or plotting.
import os
import re
import pandas as pd
def parse_homer_motif(motif_file: str) -> dict:
"""Parse a HOMER .motif file into name, log_odds_threshold, and PWM."""
with open(motif_file) as f:
header = f.readline().strip() # >motif_name\tlog_odds\tlog_p-value\t0\tnucs
rows = []
for line in f:
line = line.strip()
if line:
rows.append([float(x) for x in line.split("\t")])
parts = header.lstrip(">").split("\t")
name = parts[0]
log_odds = float(parts[1]) if len(parts) > 1 else 0.0
log_p = float(parts[2]) if len(parts) > 2 else 0.0
pwm = pd.DataFrame(rows, columns=["A", "C", "G", "T"])
return {"name": name, "log_odds": log_odds, "log_p": log_p, "pwm": pwm}
# Load all de novo motifs
motif_dir = "motif_output/homerResults/"
motifs = []
for fn in sorted(os.listdir(motif_dir)):
if fn.endswith(".motif"):
motif = parse_homer_motif(os.path.join(motif_dir, fn))
motifs.append(motif)
print(f"{fn}: {motif['name']} ({len(motif['pwm'])} positions, log_p={motif['log_p']:.1f})")
print(f"\nLoaded {len(motifs)} de novo motifs")
# motif1.motif: CTCF-motif (19 positions, log_p=-8234.1)
# motif2.motif: CTCFL-motif (17 positions, log_p=-3421.7)
# ...
# Loaded 25 de novo motifs
Combine peak annotations with RNA-seq DE results to find regulated genes near peaks.
import pandas as pd
# Load annotated peaks
annot = pd.read_csv("annotated_peaks.txt", sep="\t", header=0, low_memory=False)
annot.columns = annot.columns.str.strip()
# Key columns from HOMER annotation
peak_genes = annot[["PeakID (cmd=annotatePeaks.pl peaks.bed hg38)",
"Chr", "Start", "End",
"Annotation", "Distance to TSS",
"Nearest RefSeq", "Gene Name"]].copy()
peak_genes.columns = ["peak_id", "chr", "start", "end",
"annotation", "tss_dist", "refseq", "gene_name"]
# Filter promoter-proximal peaks (within 2 kb of TSS)
promoter_peaks = peak_genes[peak_genes["tss_dist"].abs() < 2000].copy()
print(f"Promoter-proximal peaks (|TSS| < 2kb): {len(promoter_peaks)}")
# Promoter-proximal peaks (|TSS| < 2kb): 2218
# Load DESeq2 results (gene_name, log2FC, padj)
de_results = pd.read_csv("deseq2_results.csv")
de_sig = de_results[de_results["padj"] < 0.05].copy()
# Merge: find DE genes with a nearby peak
merged = promoter_peaks.merge(de_sig, on="gene_name", how="inner")
print(f"DE genes with promoter-proximal peak: {len(merged)}")
print(merged[["gene_name", "tss_dist", "annotation", "log2FC", "padj"]].head())
# DE genes with promoter-proximal peak: 347
| Output | Format | Description |
|---|---|---|
motif_output/knownResults.txt | TSV | All known motifs tested: name, p-value, q-value, % target, % background; primary result file |
motif_output/knownResults.html | HTML | Interactive HTML report with motif logos, statistics, and links |
motif_output/homerResults/ | Directory | De novo motifs: motifN.motif (PWM matrix), motifN.logo.png, similar.motifs.txt |
motif_output/homerResults.html | HTML | Interactive HTML report for de novo motifs |
annotated_peaks.txt | TSV | One row per peak: chr, start, end, annotation, TSS distance, nearest gene, RefSeq ID |
peaks_with_motif1.txt | TSV | Per-peak motif occurrence scores (from -find mode) |
| Problem | Cause | Solution |
|---|---|---|
ERROR: Genome not found | Genome not installed via installGenome.pl | Run installGenome.pl hg38; confirm ~/.homer/data/genomes/hg38/ exists |
command not found: findMotifsGenome.pl | HOMER binaries not in $PATH | Add HOMER bin directory to PATH: export PATH=~/.homer/bin:$PATH (or activate conda env) |
| Poor de novo motif quality (GC-rich artifacts) | Unmasked repetitive elements | Always use -mask; also try increasing -size to 300 or reducing to 150 |
| No significant known motifs (all p > 0.01) | Too few peaks or wrong peak size | Ensure ≥500 peaks; check -size matches expected binding footprint (150–200 for TFs) |
| Conda HOMER conflicts (Perl module errors) | HOMER's Perl scripts conflict with conda environment Perl | Install HOMER in a dedicated conda env: conda create -n homer_only -c bioconda homer |
annotatePeaks.pl gives all "Intergenic" | Genome annotation not installed | HOMER annotation requires installGenome.pl which downloads GTF; check ~/.homer/data/genomes/hg38/ contains *.ann files |
| Very slow run time (>2 hr) | Single-threaded or large peak set | Add -p 8; reduce -S to 10 for speed; use -nomotif for known-only runs |
-bg custom background gives poor motifs | Background regions have very different GC content | Use a GC-matched background; run bedtools shuffle with -noOverlapping to generate random same-size regions |