From sciagent-skills
Executes bedtools commands for genomic interval operations on BED/BAM/GFF/VCF: find overlaps, merge intervals, compute coverage, extract FASTA, nearest features. For ChIP-seq peak annotation, ATAC-seq merging, region filtering.
npx claudepluginhub jaechang-hits/sciagent-skills --plugin sciagent-skillsThis skill uses the workspace's default tool permissions.
bedtools is the standard toolkit for operating on genomic intervals in BED, BAM, GFF, and VCF formats. It solves the core problem of genome arithmetic: finding overlaps between feature sets, computing coverage, extracting sequences, merging adjacent regions, and annotating features with nearest neighbors. bedtools operates on sorted coordinate lists and runs at C speed, making it practical for ...
Processes genomic intervals for overlap detection, coverage tracks, ML tokenization, and fragment analysis using Rust toolkit with Python bindings. For BED files, ChIP-seq, ATAC-seq in computational genomics.
Analyzes genomic intervals: detects overlaps with IGD, generates BigWig coverage tracks, tokenizes regions for ML models. Rust library/CLI with Python bindings.
Processes large BED files with Rust-backed high-performance I/O and interval operations (intersect, merge, complement, subtract). Tokenizes genomic regions against universes for ML preprocessing.
Share bugs, ideas, or general feedback.
bedtools is the standard toolkit for operating on genomic intervals in BED, BAM, GFF, and VCF formats. It solves the core problem of genome arithmetic: finding overlaps between feature sets, computing coverage, extracting sequences, merging adjacent regions, and annotating features with nearest neighbors. bedtools operates on sorted coordinate lists and runs at C speed, making it practical for whole-genome analyses.
tabix instead for fast indexed queries of a single genomic regiondeeptools bamCoverage insteadmosdepth instead for whole-genome per-base depth (10× faster)getfasta; genome file (chromosome sizes) for slop/flank/genomecovCheck before installing: The tool may already be available in the current environment (e.g., inside a
pixi/condaenv). Runcommand -v bedtoolsfirst and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool viapixi run bedtoolsrather than barebedtools.
# Bioconda (recommended)
conda install -c bioconda bedtools
# Homebrew (macOS)
brew install bedtools
# Verify
bedtools --version
# bedtools v2.31.0
# Create genome file from FASTA index
samtools faidx reference.fa
cut -f1,2 reference.fa.fai > genome.txt # chr → size table
# Find peaks overlapping genes, then merge overlapping peaks
bedtools intersect -a peaks.bed -b genes.bed -wa -wb > peaks_with_genes.bed
bedtools merge -i peaks.bed > merged_peaks.bed
bedtools coverage -a genes.bed -b reads.bam > gene_coverage.bed
Find regions that overlap between two feature sets.
# Basic intersection: output overlapping regions
bedtools intersect -a peaks.bed -b genes.bed
# Report original A and B features for each overlap
bedtools intersect -a peaks.bed -b genes.bed -wa -wb
# Count B overlaps per A feature (adds column)
bedtools intersect -a peaks.bed -b genes.bed -c
# Output: chr1 1000 2000 peak1 gene_count
# Peaks with ANY overlap (report each peak once)
bedtools intersect -a peaks.bed -b genes.bed -u
# Peaks with NO overlap in B (invert filter)
bedtools intersect -a peaks.bed -b blacklist.bed -v
# Require reciprocal 50% overlap both ways
bedtools intersect -a exp1.bed -b exp2.bed -f 0.5 -F 0.5 -r
# Same-strand intersections only
bedtools intersect -a peaks.bed -b genes.bed -s
# Multiple database files with overlap counts per file
bedtools intersect -a query.bed -b enhancers.bed promoters.bed \
-names enh prom -C
# Memory-efficient mode for pre-sorted large files
bedtools intersect -a sorted_peaks.bed -b sorted_genes.bed -sorted
Combine overlapping intervals and perform set operations.
# Merge overlapping and adjacent intervals
sort -k1,1 -k2,2n peaks.bed | bedtools merge -i stdin
# Merge intervals within 500 bp of each other
bedtools merge -i peaks.bed -d 500
# Merge and count original features
bedtools merge -i peaks.bed -c 1 -o count
# Output: chr1 1000 5000 3 (3 original peaks merged)
# Merge and collapse feature names
bedtools merge -i peaks.bed -c 4 -o collapse -delim ";"
# Output: chr1 1000 5000 peak1;peak2;peak3
# Subtract B from A (remove covered bases)
bedtools subtract -a peaks.bed -b blacklist.bed
# Remove entire A feature if ANY B overlap
bedtools subtract -a peaks.bed -b exclusion.bed -A
# Find genomic gaps (complement of covered regions)
bedtools complement -i merged.bed -g genome.txt
Calculate depth and breadth of read coverage over features.
# Coverage stats per feature (count, bases covered, % covered)
bedtools coverage -a target_genes.bed -b aligned.bam
# Output: chr start end gene n_overlapping_reads bases_covered feature_len fraction_covered
# Per-base depth within each feature
bedtools coverage -a targets.bed -b aligned.bam -d
# Output: chr start end name position depth
# Coverage histogram per feature
bedtools coverage -a features.bed -b aligned.bam -hist
# Genome-wide BEDGRAPH (coverage per bin)
bedtools genomecov -ibam aligned.bam -bg -o coverage.bedgraph
# Include zero-coverage regions (for whole-genome coverage)
bedtools genomecov -ibam aligned.bam -bga > full_coverage.bedgraph
# Per-base depth for whole genome
bedtools genomecov -ibam aligned.bam -d > depth.txt
# Scaled BEDGRAPH (RPM normalization: total=50M reads → scale=1/50)
bedtools genomecov -ibam aligned.bam -bg -scale 0.00000002 > rpm.bedgraph
# Strand-specific coverage tracks
bedtools genomecov -ibam rnaseq.bam -bg -strand + > forward.bedgraph
bedtools genomecov -ibam rnaseq.bam -bg -strand - > reverse.bedgraph
Extract genomic sequences and annotate features with neighbors.
# Extract FASTA sequences for each BED region
bedtools getfasta -fi genome.fa -bed regions.bed -fo sequences.fasta
# Strand-aware extraction (reverse complement - strand)
bedtools getfasta -fi genome.fa -bed regions.bed -s -fo stranded.fasta
# Custom FASTA headers (name + coords)
bedtools getfasta -fi genome.fa -bed peaks.bed -name -fo named.fasta
# Extract and concatenate exons (BED12 spliced transcripts)
bedtools getfasta -fi genome.fa -bed transcripts.bed12 -split -fo exons.fasta
# Find nearest gene to each peak (with distance)
bedtools closest -a peaks.bed -b genes.bed -d
# Output: peak fields... | gene fields... | distance_bp
# Nearest feature on same strand only
bedtools closest -a peaks.bed -b genes.bed -s -d
# Ignore overlapping features (find nearest non-overlapping)
bedtools closest -a peaks.bed -b genes.bed -io -d
# Multiple annotation databases
bedtools closest -a query.bed -b genes.bed enhancers.bed \
-names genes enhancers -d
Expand, contract, and shift genomic intervals.
# Expand regions by 500 bp on each side
bedtools slop -i peaks.bed -g genome.txt -b 500
# Asymmetric: 2000 bp upstream, 500 bp downstream of TSS
bedtools slop -i tss.bed -g genome.txt -l 2000 -r 500
# Strand-aware expansion (upstream = 5' side)
bedtools slop -i genes.bed -g genome.txt -l 1000 -r 200 -s
# Create flanking regions (not overlapping the feature)
bedtools flank -i genes.bed -g genome.txt -b 1000
bedtools flank -i genes.bed -g genome.txt -l 2000 -r 0 -s # upstream only
BED files use 0-based half-open intervals: start is 0-indexed (like Python), end is exclusive. A region chr1:1000-2000 in BED covers bases 1000–1999 (1000 bases).
chr1 1000 2000 peak1 ← covers positions 1000,1001,...,1999
# BED: 0-based start, exclusive end
# VCF: 1-based position (POS)
# GFF: 1-based start and end (both inclusive)
bedtools converts internally — input format is auto-detected. Problems arise when mixing tools with different conventions.
Most bedtools operations require coordinate-sorted input. Pre-sort with:
sort -k1,1 -k2,2n input.bed > sorted.bed
# For large files, use -S 4G for 4 GB sort buffer
sort -k1,1 -k2,2n -S 4G --parallel=8 input.bed > sorted.bed
The -sorted flag in bedtools intersect uses a sweep algorithm that requires sorted input but uses O(1) memory instead of O(N).
Goal: Annotate peaks with overlapping genes, distances to TSS, and filter blacklisted regions.
#!/bin/bash
PEAKS="peaks.bed"
GENES="refseq_genes.bed"
TSS="refseq_tss.bed" # BED with TSS positions
BLACKLIST="encode_blacklist_hg38.bed"
GENOME="hg38.genome"
# 1. Remove blacklisted regions
bedtools subtract -a $PEAKS -b $BLACKLIST -A > peaks_clean.bed
echo "After blacklist filter: $(wc -l < peaks_clean.bed) peaks"
# 2. Annotate with overlapping gene (allow 2 kb from gene body)
bedtools slop -i $GENES -g $GENOME -b 2000 > genes_padded.bed
bedtools intersect -a peaks_clean.bed -b genes_padded.bed -wa -wb \
> peaks_gene_overlap.bed
# 3. For non-overlapping peaks: find nearest gene
bedtools intersect -a peaks_clean.bed -b genes_padded.bed -v > peaks_distal.bed
bedtools closest -a peaks_distal.bed -b $TSS -d > peaks_distal_nearest.bed
echo "Promoter peaks: $(wc -l < peaks_gene_overlap.bed)"
echo "Distal peaks: $(wc -l < peaks_distal.bed)"
Goal: Calculate on-target read depth and coverage breadth for exome sequencing QC.
#!/bin/bash
BAM="sample.deduped.bam"
TARGETS="capture_targets.bed"
# Per-target coverage statistics
bedtools coverage -a $TARGETS -b $BAM > per_target_coverage.bed
# Summary: total targets, mean depth, % at ≥20×
awk 'BEGIN{n=0; depth=0; covered=0}
{n++; depth+=$7; if($8>=20) covered++}
END{printf "Targets: %d\nMean depth: %.1f×\n%% at 20×: %.1f%%\n",
n, depth/n, covered/n*100}' per_target_coverage.bed
# Per-base depth for IGV visualization
bedtools coverage -a $TARGETS -b $BAM -d > per_base_depth.txt
echo "Per-base depth written to per_base_depth.txt"
| Parameter | Command | Default | Range/Options | Effect |
|---|---|---|---|---|
-f | intersect, coverage | 1e-9 | 0.0–1.0 | Min fraction of A that must overlap |
-F | intersect, coverage | 1e-9 | 0.0–1.0 | Min fraction of B that must overlap |
-r | intersect | — | flag | Require reciprocal overlap (-f AND -F) |
-s | Most | — | flag | Strand-aware (same strand only) |
-v | intersect | — | flag | Report features with NO overlap (invert) |
-c | intersect | — | flag | Append overlap count per A feature |
-d | merge | 0 | integer | Max gap to merge (bp) |
-bg | genomecov | — | flag | BEDGRAPH output format |
-scale | genomecov | 1.0 | float | Multiply coverage by constant (for RPM) |
-sorted | intersect, closest | — | flag | Use sweep algorithm (sorted input required) |
-b | slop | — | integer | Expand interval by N bp on both sides |
-D | closest | — | ref/a/b | Report signed distance (upstream negative) |
Always sort before bedtools: Most bedtools commands fail silently on unsorted input. Sort with sort -k1,1 -k2,2n input.bed before any bedtools operation.
Use -sorted for large files: For pre-sorted files, -sorted reduces memory from O(N) to O(1). Required when intersecting multi-gigabyte BED files.
Check chromosome naming consistency: The single most common failure — some tools use chr1, others use 1. Verify with cut -f1 file.bed | sort -u before running intersections.
Apply blacklist early: Run bedtools subtract -b blacklist.bed -A before any peak analysis. ENCODE blacklists remove artifactual signal in repetitive/high-copy regions.
Use -f 0.5 -r for peak reproducibility: When intersecting peaks across replicates, require 50% reciprocal overlap to avoid spurious short overlaps at interval boundaries.
Validate BED format: Malformed BED (wrong column count, text in numeric columns) causes silent failures. Test with bedtools merge -i file.bed 2>&1 | head -5.
# Report how many peaks overlap each category (genes, promoters, enhancers)
for category in genes.bed promoters.bed enhancers.bed repeats.bed; do
label=$(basename $category .bed)
count=$(bedtools intersect -a peaks.bed -b $category -u | wc -l)
total=$(wc -l < peaks.bed)
echo "$label: $count/$total ($(echo "scale=1; $count*100/$total" | bc)%)"
done
# Extract 2kb upstream of TSS for ChIP annotation
# For genes on + strand: TSS = start; on - strand: TSS = end
awk 'BEGIN{OFS="\t"} $6=="+" {print $1,$2,$2+1,$4,$5,$6}
$6=="-" {print $1,$3-1,$3,$4,$5,$6}' genes.bed > tss.bed
bedtools slop -i tss.bed -g genome.txt -l 2000 -r 500 -s > promoters.bed
echo "Created $(wc -l < promoters.bed) promoter regions"
# Jaccard similarity between two peak sets (0=no overlap, 1=identical)
bedtools sort -i set1.bed > s1.bed
bedtools sort -i set2.bed > s2.bed
bedtools jaccard -a s1.bed -b s2.bed
# Output: intersection union jaccard n_intersections
# 423456 2345678 0.1804 892
| Problem | Cause | Solution |
|---|---|---|
| Empty intersect output | Chromosome name mismatch (chr1 vs 1) | Check: cut -f1 a.bed | sort -u vs cut -f1 b.bed | sort -u |
| Memory error on large files | Not using -sorted flag | Pre-sort inputs and add -sorted to intersect/closest |
getfasta: sequence not found | FASTA headers differ from BED chr names | Index FASTA: samtools faidx genome.fa; match names exactly |
| Zero coverage everywhere | BAM not indexed or BED/BAM chr mismatch | Run samtools index file.bam; verify chr naming |
| Merge doesn't merge expected features | Input not sorted by coordinate | Sort: sort -k1,1 -k2,2n file.bed | bedtools merge -i stdin |
getfasta produces wrong-strand sequence | Using -s without strand column in BED | Ensure BED col 6 has +/-; add strand: awk '{$6="+"; print}' OFS="\t" |
| Off-by-one in coordinates | Mixing 0-based BED and 1-based VCF/GFF | Convert GFF to BED: subtract 1 from start |
| Slow on large genomes | Processing unsorted files | Sort both files; use -sorted; pipe through sort without writing temp files |