From sciagent-skills
Processes SAM/BAM/CRAM alignments: sort, index, convert, filter, QC (flagstat, stats, depth), markdup, merge. Essential post-alignment step in NGS pipelines before variant calling.
npx claudepluginhub jaechang-hits/sciagent-skills --plugin sciagent-skillsThis skill uses the workspace's default tool permissions.
samtools is the standard command-line toolkit for processing sequence alignment files in SAM, BAM, and CRAM formats. It handles the complete alignment file lifecycle: format conversion, coordinate sorting, index creation, quality control statistics, read filtering, duplicate marking, and multi-file merging. samtools is a near-universal component of NGS pipelines between alignment (STAR, BWA) an...
Reads/writes SAM/BAM/CRAM, VCF/BCF, FASTA/FASTQ genomic files using pysam. Supports region queries, pileups, variant filtering, read groups for NGS pipelines.
NGS read QC, alignment, and BAM processing pipeline. Wraps FastQC, BWA/Bowtie2/Minimap2, SAMtools, and MultiQC for automated read-to-BAM workflows.
Reads/writes SAM/BAM/CRAM alignments, VCF/BCF variants, FASTA/FASTQ sequences in Python. Extracts regions, calculates coverage, queries tabix files for NGS pipelines.
Share bugs, ideas, or general feedback.
samtools is the standard command-line toolkit for processing sequence alignment files in SAM, BAM, and CRAM formats. It handles the complete alignment file lifecycle: format conversion, coordinate sorting, index creation, quality control statistics, read filtering, duplicate marking, and multi-file merging. samtools is a near-universal component of NGS pipelines between alignment (STAR, BWA) and downstream analysis (variant calling, peak calling, coverage).
pysam instead for Python-native BAM manipulation in custom scriptsdeeptools bamCoverage instead when you need normalized bigWig coverage tracksmosdepth instead for whole-genome per-base depth (faster, parallelized)samtools faidx for FASTA indexing; samtools sort before samtools indexCheck before installing: The tool may already be available in the current environment (e.g., inside a
pixi/condaenv). Runcommand -v samtoolsfirst and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool viapixi run samtoolsrather than baresamtools.
# Bioconda (recommended)
conda install -c bioconda samtools
# Homebrew (macOS)
brew install samtools
# Verify
samtools --version | head -1
# Typical post-alignment workflow: sort → index → QC
samtools sort -@ 8 -o sorted.bam input.bam
samtools index sorted.bam
samtools flagstat sorted.bam
Convert between SAM/BAM/CRAM formats and extract subsets.
# SAM → BAM (saves ~75% disk space)
samtools view -b -h input.sam -o output.bam
# BAM → CRAM (saves additional 40-50%)
samtools view -C -T reference.fa input.bam -o output.cram
# Filter: mapping quality ≥20, exclude unmapped (-F 4)
samtools view -q 20 -F 4 input.bam -o filtered.bam
# Extract specific region (requires index)
samtools view -h sorted.bam "chr1:1000000-2000000" -o region.bam
# Count reads matching filter
samtools view -c -F 4 input.bam
# Output: 45231923 (number of mapped reads)
# Extract reads as FASTQ (for realignment or de novo assembly)
samtools fastq -@ 4 -1 R1.fastq.gz -2 R2.fastq.gz -0 unpaired.fastq.gz input.bam
# Extract reads as FASTA
samtools fasta input.bam > reads.fasta
# Filter by read group
samtools view -r SAMPLE_001 multi_rg.bam -o sample001.bam
Organize BAM files for efficient random access.
# Sort by coordinate (required before indexing)
samtools sort -@ 8 -m 2G input.bam -o sorted.bam
# Sort by read name (required for fixmate/markdup)
samtools sort -n -@ 8 input.bam -o namesorted.bam
# Index sorted BAM (creates sorted.bam.bai)
samtools index sorted.bam
# For chromosomes > 512 Mbp: use CSI index instead
samtools index -c sorted.bam
# Group reads by name (fast, for fixmate — no full sort needed)
samtools collate -o collated.bam input.bam
Generate alignment QC metrics and coverage reports.
# Quick summary: total, mapped, paired, properly paired
samtools flagstat sorted.bam
# Example output:
# 50000000 + 0 in total (QC-passed reads + QC-failed reads)
# 48523111 + 0 mapped (97.05% : N/A)
# 50000000 + 0 paired in sequencing
# 48490234 + 0 properly paired (96.98% : N/A)
# Per-chromosome mapped/unmapped read counts
samtools idxstats sorted.bam
# chr1 248956422 12345678 0
# chr2 242193529 11234567 0
# Comprehensive stats (insert sizes, GC content, base quality)
samtools stats -r reference.fa sorted.bam > full_stats.txt
grep "^SN" full_stats.txt | cut -f2,3 # Summary Numbers only
# Coverage report (min/max/mean per region/chromosome)
samtools coverage sorted.bam
# Per-base read depth for specific regions
samtools depth -b target_regions.bed sorted.bam > depth.txt
# Output: chr pos depth (e.g., chr1 1000 45)
# Statistics split by read group
samtools stats -S RG sorted.bam > per_rg_stats.txt
Filter reads using SAM FLAG bits for specific subsets.
# FLAG reference — common masks:
# 1 = paired 4 = unmapped
# 2 = proper pair 8 = mate unmapped
# 16 = reverse strand 64 = R1 (first in pair)
# 128 = R2 256= secondary alignment
# 1024 = PCR duplicate 2048= supplementary
# Extract properly paired, mapped reads (FLAG 2 set, 4 unset)
samtools view -f 2 -F 4 sorted.bam -o proper_pairs.bam
# Extract R1 reads only
samtools view -f 64 sorted.bam -o R1.bam
# Remove secondary and supplementary alignments
samtools view -F 2304 sorted.bam -o primary.bam
# Extract reads from BED file regions
samtools view -L regions.bed -b sorted.bam -o regions.bam
Mark or remove PCR duplicates before variant calling.
# Full duplicate marking workflow (collate → fixmate → sort → markdup)
samtools collate -@ 8 -o collated.bam input.bam
samtools fixmate -m -@ 8 collated.bam fixmated.bam
samtools sort -@ 8 -o sorted.bam fixmated.bam
samtools markdup -@ 8 sorted.bam marked.bam
samtools index marked.bam
# Check duplication rate
samtools flagstat marked.bam | grep "duplicates"
# Output: 2345678 + 0 duplicates (4.83%)
# NovaSeq optical duplicate detection (2500 pixel distance)
samtools markdup -d 2500 sorted.bam marked_novaseq.bam
# Remove duplicates instead of marking
samtools markdup -r sorted.bam deduped.bam
# Get duplication stats without writing output
samtools markdup -s sorted.bam /dev/null
Merge BAM files and perform region-level analysis.
# Merge multiple BAM files (all must be sorted)
samtools merge -@ 8 merged.bam lane1.bam lane2.bam lane3.bam
# Merge files listed in a text file (one per line)
samtools merge -b bam_list.txt -@ 8 merged.bam
# Merge with read group tags from filenames
samtools merge -r merged.bam sample1.bam sample2.bam
# Extract specific chromosome region from merged output
samtools view -h merged.bam chr1 -b -o chr1.bam
FLAGS encode read properties as a sum of bit values. Common filtering patterns:
| Common Filter | -f (require) | -F (exclude) | Selects |
|---|---|---|---|
| Mapped reads | — | 4 | All aligned reads |
| Proper pairs | 2 | — | Properly paired, both mapped |
| Unique primary | — | 2308 | No secondary/supplementary/duplicate |
| R1 only | 64 | — | First-in-pair reads |
| Unmapped | 4 | — | Failed to align |
| Format | Size | Speed | Requires |
|---|---|---|---|
| SAM | ~10× BAM | Slow I/O | Nothing |
| BAM | 1× | Fast | .bai index for random access |
| CRAM | ~0.6× BAM | Slightly slower | Reference FASTA + index |
Use CRAM for long-term storage; BAM for active analysis.
Goal: Convert aligner output to analysis-ready BAM with QC metrics.
#!/bin/bash
SAMPLE="sample_001"
REF="reference.fa"
THREADS=8
# 1. Sort and index (aligner often outputs unsorted SAM/BAM)
samtools sort -@ $THREADS -o ${SAMPLE}.sorted.bam ${SAMPLE}.bam
samtools index ${SAMPLE}.sorted.bam
# 2. QC metrics
samtools flagstat ${SAMPLE}.sorted.bam > ${SAMPLE}.flagstat.txt
samtools stats -r $REF ${SAMPLE}.sorted.bam > ${SAMPLE}.stats.txt
samtools coverage ${SAMPLE}.sorted.bam > ${SAMPLE}.coverage.txt
# 3. Per-chromosome stats
samtools idxstats ${SAMPLE}.sorted.bam > ${SAMPLE}.idxstats.txt
echo "QC complete: $(grep 'mapped (' ${SAMPLE}.flagstat.txt | head -1)"
Goal: Prepare BAM for GATK or other variant callers requiring deduplicated input.
#!/bin/bash
INPUT="aligned.bam"
FINAL="deduped.bam"
THREADS=8
# Collate → fixmate → sort → markdup
samtools collate -@ $THREADS -o collated.bam $INPUT
samtools fixmate -m -@ $THREADS collated.bam fixmated.bam
samtools sort -@ $THREADS -o sorted.bam fixmated.bam
samtools markdup -@ $THREADS -s sorted.bam $FINAL
# Clean up intermediates
rm collated.bam fixmated.bam sorted.bam
# Index and verify
samtools index $FINAL
samtools flagstat $FINAL | grep "duplic"
# Expected: 3-15% duplicates (WGS); 10-30% for amplicon
| Parameter | Command | Default | Range/Options | Effect |
|---|---|---|---|---|
-@ | Most | 0 | 1–N cores | Additional compression/I/O threads |
-m | sort | 768M | e.g., 2G, 4G | Memory per thread for sorting |
-q | view | 0 | 0–60 | Minimum mapping quality filter |
-f | view | 0 | FLAG bits | Include reads with ALL bits set |
-F | view | 0 | FLAG bits | Exclude reads with ANY bit set |
-b | view | — | flag | Output BAM format |
-C | view | — | flag | Output CRAM (requires -T) |
-T | view | — | FASTA path | Reference for CRAM output |
-d | markdup | 0 | 0–2500 | Optical duplicate pixel distance |
-r | markdup | — | flag | Remove duplicates (vs just mark) |
-n | sort | — | flag | Sort by read name instead of position |
-c | index | — | flag | Create CSI index (needed for chr > 512 Mb) |
Always sort before indexing: samtools index requires coordinate-sorted input. Attempting to index an unsorted BAM will fail or produce incorrect results.
Use -@ for all production runs: Most samtools commands are I/O-bound. Adding -@ 8 provides near-linear speedup for compression/decompression with minimal overhead.
Run flagstat before any analysis: samtools flagstat runs in seconds and catches alignment failures (low mapping rate, unexpected paired-end rates) before wasting time on downstream steps.
Use the collate → fixmate → sort → markdup pipeline: Running samtools markdup directly on coordinate-sorted BAM without fixmate produces incorrect duplicate detection. The mate information added by fixmate -m is essential.
Prefer CRAM for archiving: CRAM reduces storage 40-50% vs BAM with no loss. Always store the reference FASTA alongside CRAM files.
Use -L bed_file for targeted analyses: Restricting samtools view to BED-defined target regions (WES capture, amplicons) dramatically reduces I/O for downstream steps.
# Process all BAM files in directory
for bam in *.sorted.bam; do
echo "=== $bam ==="
samtools flagstat $bam | grep -E "mapped|properly paired|duplicates"
done
# Pull both unmapped reads (useful for pathogen detection)
samtools view -f 4 -b input.bam -o unmapped.bam
samtools fastq -@ 4 -1 unmapped_R1.fastq -2 unmapped_R2.fastq unmapped.bam
echo "Unmapped pairs ready for de novo assembly"
# Estimate current depth, then subsample to ~30×
TOTAL=$(samtools flagstat input.bam | grep "mapped (" | head -1 | awk '{print $1}')
GENOME_SIZE=3100000000 # hg38
READ_LEN=150
CURRENT_COV=$(echo "scale=1; $TOTAL * $READ_LEN / $GENOME_SIZE" | bc)
TARGET_FRAC=$(echo "scale=3; 30 / $CURRENT_COV" | bc)
echo "Current: ${CURRENT_COV}×; subsample fraction: $TARGET_FRAC"
samtools view -b -s $TARGET_FRAC input.bam -o downsampled.bam
samtools index downsampled.bam
| Problem | Cause | Solution |
|---|---|---|
[bam_index_build2] fail to index | BAM not sorted by coordinate | Sort first: samtools sort -o sorted.bam input.bam |
BAI index too large for chromosome | Chromosome > 512 Mbp | Use CSI index: samtools index -c input.bam |
CRAM: reference not found | Missing or wrong reference FASTA | Set REF_PATH env var or use -T ref.fa |
| Duplicate marking incorrect | fixmate step skipped | Run full pipeline: collate → fixmate → sort → markdup |
flagstat shows 0% properly paired | Paired-end BAM missing mate info | Run samtools fixmate to populate mate coordinates |
| Very slow sorting | Low memory per thread | Increase -m 4G; reduce -@ if memory-limited |
| Region query returns nothing | BAM not indexed or wrong coords | Run samtools index; use 1-based coords: chr1:1000-2000 |
[E::hts_open_format] fail to open | File path wrong or BAM corrupt | Verify path; test with samtools quickcheck file.bam |