Core bioinformatics concepts including SAM/BAM format, AGP genome assembly format, sequencing technologies (Hi-C, HiFi, Illumina), quality metrics, and common data processing patterns. Essential for debugging alignment, filtering, pairing issues, and AGP coordinate validation.
npx claudepluginhub joshuarweaver/cascade-ai-ml-engineering --plugin delphine-l-claude-global
Foundation knowledge for genomics and bioinformatics workflows. Provides essential understanding of file formats, sequencing technologies, and common data processing patterns.
Flags are additive - a read can have multiple flags set simultaneously.
Common Flags:
* 0x0001 (1): Read is paired in sequencing
* 0x0002 (2): Each segment properly aligned (proper pair)
* 0x0004 (4): Read unmapped
* 0x0008 (8): Mate unmapped
* 0x0010 (16): Read mapped to reverse strand
* 0x0020 (32): Mate mapped to reverse strand
* 0x0040 (64): First in pair (R1/forward)
* 0x0080 (128): Second in pair (R2/reverse)
* 0x0100 (256): Secondary alignment
* 0x0400 (1024): PCR or optical duplicate
* 0x0800 (2048): Supplementary alignment

Flag Combinations:
* 99 (0x63 = 1 + 2 + 32 + 64)
* 147 (0x93 = 1 + 2 + 16 + 128)
* 48

See reference.md for complete flag tables, CIGAR operations, optional tags, and SAM mandatory fields.
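Because flags are additive, a combined value can be decoded by testing each bit. The following is a minimal sketch in plain Python; the `SAM_FLAGS` table and the `decode_flag` helper are illustrative names (not part of any library), and the table only covers the bits listed above.

```python
# Bit values and meanings taken from the flag list above.
# The short names are illustrative labels, not official identifiers.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> list[str]:
    """Return the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


if __name__ == "__main__":
    for value in (99, 147):
        print(value, decode_flag(value))
```

Decoding 99 and 147 this way shows that they describe the two mates of one properly paired fragment: one is first in pair with its mate on the reverse strand, the other is second in pair and itself on the reverse strand.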
What "proper pair" means:
Important: Different aligners have different criteria for proper pairs!
Formula: MAPQ = -10 * log10(P(mapping is wrong))
Common Thresholds:
* MAPQ >= 60: High confidence (error probability <= 0.0001%)
* MAPQ >= 30: Good quality (error probability <= 0.1%)
* MAPQ >= 20: Acceptable (error probability <= 1%)
* MAPQ >= 10: Low confidence (error probability <= 10%)
* MAPQ = 0: Multi-mapper or unmapped

Note: MAPQ=0 can mean either unmapped OR equally good multiple mappings.
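The thresholds above follow directly from the MAPQ formula. As a rough sketch, the conversion in both directions can be written as below; the function names are made up for illustration.

```python
import math


def mapq_to_error_prob(mapq: float) -> float:
    """Invert MAPQ = -10 * log10(P) to get P(mapping is wrong)."""
    return 10 ** (-mapq / 10)


def error_prob_to_mapq(p_wrong: float) -> float:
    """Apply MAPQ = -10 * log10(P(mapping is wrong))."""
    return -10 * math.log10(p_wrong)


if __name__ == "__main__":
    for q in (10, 20, 30, 60):
        # Print the error probability as a percentage.
        print(f"MAPQ {q}: {mapq_to_error_prob(q) * 100:g}%")
```

Note that aligners cap and round MAPQ differently, so the formula describes the scale rather than any particular aligner's scoring.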
Represents alignment between read and reference:
* M: Match or mismatch (alignment match)
* I: Insertion in read vs reference
* D: Deletion in read vs reference
* S: Soft clipping (bases in read not aligned)
* H: Hard clipping (bases not in read sequence)
* N: Skipped region (for RNA-seq splicing)

Example: 100M = 100bp aligned with no indels or clipping (M does not distinguish matches from mismatches)
Example: 50M5I45M = 50bp aligned, 5bp insertion, 45bp aligned
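A CIGAR string determines how many read bases and how many reference bases an alignment spans. The sketch below parses a CIGAR with a regular expression and sums both lengths; `parse_cigar` and `cigar_lengths` are illustrative helpers, and the query/reference-consuming sets follow the SAM specification (they also include the `=`, `X`, and `P` operators, which are not listed above).

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")  # operations that advance along the read
CONSUMES_REF = set("MDN=X")    # operations that advance along the reference


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR string into (length, operation) tuples."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    # Reject strings with characters the pattern did not consume (e.g. "*").
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return ops


def cigar_lengths(cigar: str) -> tuple[int, int]:
    """Return (read bases consumed, reference bases consumed)."""
    ops = parse_cigar(cigar)
    query = sum(n for n, op in ops if op in CONSUMES_QUERY)
    ref = sum(n for n, op in ops if op in CONSUMES_REF)
    return query, ref


if __name__ == "__main__":
    for cigar in ("100M", "50M5I45M"):
        print(cigar, cigar_lengths(cigar))
```

For `50M5I45M` the read contributes 100 bases but the alignment covers only 95 reference bases, because the inserted bases have no counterpart in the reference.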
Characteristics:
Best Mappers:
map-pb, map-hifi
Typical Use Cases:
Characteristics:
Best Mappers:
Critical Concept: Hi-C read pairs intentionally map to distant loci. Region filtering can easily break pairs!
Typical Use Cases:
Characteristics:
Best Mappers:
See reference.md for detailed technical specs, error profiles, and quality metrics per technology.
Purpose: Filter, convert, and view SAM/BAM files
Key Flags:
* -b: Output BAM format
* -h: Include header
* -f INT: Require flags (keep reads WITH these flags)
* -F INT: Filter flags (remove reads WITH these flags)
* -q INT: Minimum MAPQ threshold
* -L FILE: Keep reads overlapping regions in BED file

Important Behavior:
* -L (region filtering) checks each read individually, not pairs
* Flag filters (-f, -F) are applied before region filters (-L)

Example - Proper pairs in regions (correct order):
samtools view -b -f 2 -L regions.bed input.bam > proper_pairs_in_regions.bam
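The meaning of `-f` and `-F` can be expressed as two bitwise tests: `-f` keeps a read only if all requested bits are set, and `-F` drops a read if any listed bit is set. The sketch below mirrors that logic in plain Python; `passes_flag_filters` is a hypothetical helper for reasoning about filters and does not call samtools.

```python
def passes_flag_filters(flag: int, require: int = 0, exclude: int = 0) -> bool:
    """Mimic `samtools view -f require -F exclude` for a single FLAG value.

    require: every bit here must be set in `flag` (like -f).
    exclude: no bit here may be set in `flag` (like -F).
    """
    return (flag & require) == require and (flag & exclude) == 0


if __name__ == "__main__":
    # 99 = paired + proper pair + mate reverse + first in pair
    print(passes_flag_filters(99, require=2))                # proper pair kept
    print(passes_flag_filters(99 + 256, exclude=256))        # secondary dropped
    print(passes_flag_filters(65, require=3))                # paired but not proper
```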
Purpose: Advanced filtering with complex criteria
Common Filters:
* isPaired: true - Read is from paired-end sequencing
* isProperPair: true - Read is part of proper pair
* isMapped: true - Read is mapped
* mapQuality: >=30 - Mapping quality threshold

Important Difference from samtools:
* isProperPair is more strict than samtools -f 2

Purpose: Convert SAM/BAM to FASTQ/FASTA
Critical: Use appropriate filters to ensure R1/R2 files match!
See reference.md for complete tool command reference with all options and examples.
WRONG WAY (breaks pairs):
# Region filter first -> breaks pairs when mates are in different regions
samtools view -b -L regions.bed input.bam | bamtools filter -isPaired true -isProperPair true
# Result: Empty output (all pairs broken)
RIGHT WAY (preserves pairs):
# Proper pair filter FIRST, then region filter
samtools view -b -f 2 -L regions.bed input.bam > output.bam
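To check whether a region filter has left reads without their mates, it helps to model the per-read behaviour on toy data. The sketch below is plain Python with made-up read records; `region_filter` and `orphaned_names` are hypothetical helpers that only imitate the "each read is tested individually" rule and do not run samtools.

```python
from collections import Counter


def overlaps(pos: int, length: int, region: tuple[int, int]) -> bool:
    """True if a read at [pos, pos + length) overlaps a 0-based half-open region."""
    start, end = region
    return pos < end and pos + length > start


def region_filter(reads: list[dict], regions: dict) -> list[dict]:
    """Keep reads overlapping any region on their chromosome; mates are ignored."""
    return [
        r for r in reads
        if any(overlaps(r["pos"], r["len"], reg) for reg in regions.get(r["chrom"], []))
    ]


def orphaned_names(reads: list[dict]) -> list[str]:
    """Read names that appear only once, i.e. pairs that lost a mate."""
    counts = Counter(r["name"] for r in reads)
    return sorted(name for name, n in counts.items() if n == 1)


if __name__ == "__main__":
    reads = [  # toy data: pairB spans a long distance, as Hi-C pairs often do
        {"name": "pairA", "chrom": "chr1", "pos": 100, "len": 150},
        {"name": "pairA", "chrom": "chr1", "pos": 400, "len": 150},
        {"name": "pairB", "chrom": "chr1", "pos": 800, "len": 150},
        {"name": "pairB", "chrom": "chr1", "pos": 50000, "len": 150},
    ]
    kept = region_filter(reads, {"chr1": [(0, 1000)]})
    print(orphaned_names(kept))
```

On real data, a similar check can be done by counting read names in the filtered file; any name that occurs once among primary alignments indicates a pair split by the region boundary.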
For Paired-End:
# -f 2 requires the proper-pair flag (on the command line the subcommand is samtools fastq)
samtools fastq -1 R1.fq.gz -2 R2.fq.gz \
    -f 2 \
    input.bam
For Single-End:
samtools fastq -0 output.fq.gz input.bam
Conservative (high quality):
samtools view -b -q 30 -f 2 -F 2304 input.bam
# MAPQ >= 30, proper pairs, no secondary (256) or supplementary (2048) alignments; 2304 = 256 + 2048
Permissive (for low-coverage data):
samtools view -b -q 10 -F 4 input.bam
# MAPQ >= 10, mapped reads
Region filtering (samtools view -L) evaluates each read individually, so a pair is broken when one mate falls inside a region and the other falls outside, and downstream proper-pair checks then fail. Apply the proper-pair filter BEFORE region filtering:
samtools view -b -f 2 -L regions.bed input.bam > output.bam
Improper filtering broke some pairs. Require proper pairs during extraction:
samtools fastq -1 R1.fq -2 R2.fq -f 2 input.bam
This is normal for Hi-C due to chimeric reads. Use Hi-C-specific pipelines (HiC-Pro, Juicer). Don't filter too aggressively on MAPQ.
Check insert size distribution, reference mismatch, or incorrect orientation flags.
samtools stats input.bam | grep "insert size"
samtools flagstat input.bam
See common-issues.md for comprehensive troubleshooting with detailed solutions, including AGP processing issues, HiFi-specific problems, and diagnostic commands.
N50: Length of the shortest contig at which 50% of total assembly is contained in contigs of that length or longer
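The N50 definition translates into a short computation: sort contigs from longest to shortest and walk down the list until the running total reaches half of the assembly size. A minimal sketch, with `n50` as an illustrative function name:

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a list of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig where the cumulative length reaches 50%.
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    print(n50([10, 8, 6, 4, 2]))
```

For the toy lengths 10, 8, 6, 4, 2 (total 30), the cumulative sum passes 15 at the second contig, so the N50 is 8.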
Related Metrics:
Coverage: Percentage of reference bases covered by at least one read
Depth: Average number of reads covering each base
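Both metrics can be derived from a per-base depth vector, for example the third column of `samtools depth -a` output. The sketch below assumes the depths have already been loaded into a list of integers; `breadth_and_depth` is an illustrative helper.

```python
def breadth_and_depth(per_base_depth: list[int]) -> tuple[float, float]:
    """Return (coverage in percent, mean depth) from per-base read depths."""
    if not per_base_depth:
        raise ValueError("need at least one position")
    n = len(per_base_depth)
    covered = sum(1 for d in per_base_depth if d > 0)
    coverage_pct = 100 * covered / n
    mean_depth = sum(per_base_depth) / n
    return coverage_pct, mean_depth


if __name__ == "__main__":
    # Toy 4 bp reference: one position has no reads at all.
    print(breadth_and_depth([0, 2, 4, 2]))
```

The two numbers answer different questions: a region can have high mean depth while still leaving part of the reference uncovered.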
Recommended Depths:
See reference.md for complete coverage calculations, BUSCO interpretation, QV scores, and assembly quality metrics.
>sequence_id description
ATCGATCGATCG
@read_id
ATCGATCGATCG
+
IIIIIIIIIIII
chr1 1000 2000 feature_name score +
chr1 1 5000 1 W contig_1 1 5000 +
chr1 5001 5100 2 U 100 scaffold yes proximity_ligation
obj_end - obj_beg + 1 == comp_end - comp_beg + 1

See reference.md for complete AGP specification, coordinate systems, validation rules, and processing patterns. See common-issues.md for AGP coordinate debugging and unloc processing issues.
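The length invariant above can be checked line by line. The sketch below assumes tab-delimited AGP lines with the standard column order (object, object begin, object end, part number, component type, then component or gap columns); `check_agp_line` is a hypothetical helper, and for gap lines (type N or U) it compares the object span with the gap length in column 6 instead.

```python
def check_agp_line(line: str) -> bool:
    """Return True if the object span matches the component span or gap length."""
    fields = line.rstrip("\n").split("\t")
    obj_beg, obj_end = int(fields[1]), int(fields[2])
    obj_span = obj_end - obj_beg + 1
    if fields[4] in ("N", "U"):
        # Gap line: column 6 holds the gap length.
        return obj_span == int(fields[5])
    # Component line: columns 7 and 8 hold comp_beg and comp_end.
    comp_beg, comp_end = int(fields[6]), int(fields[7])
    return obj_span == comp_end - comp_beg + 1


if __name__ == "__main__":
    lines = [
        "chr1\t1\t5000\t1\tW\tcontig_1\t1\t5000\t+",
        "chr1\t5001\t5100\t2\tU\t100\tscaffold\tyes\tproximity_ligation",
    ]
    for agp_line in lines:
        print(check_agp_line(agp_line), agp_line.split("\t")[:5])
```

A full validator would also check that part numbers are consecutive and that object coordinates are contiguous; this sketch covers only the span equality.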
1-based (SAM, VCF, GFF, AGP): First base is position 1. Interval [2,5] includes positions 2,3,4,5.
0-based (BED, BAM binary): First base is position 0. Interval [2,5) includes positions 2,3,4 (excludes 5).
Conversion: BED_start = SAM_start - 1; BED_end = SAM_end.
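The conversion rule is small enough to write as two helper functions. This is a minimal sketch with illustrative names; it converts between 1-based closed intervals (SAM-style) and 0-based half-open intervals (BED-style) and checks that the interval length is preserved.

```python
def sam_to_bed(start: int, end: int) -> tuple[int, int]:
    """1-based closed [start, end] -> 0-based half-open [start - 1, end)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def bed_to_sam(start: int, end: int) -> tuple[int, int]:
    """0-based half-open [start, end) -> 1-based closed [start + 1, end]."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


if __name__ == "__main__":
    bed = sam_to_bed(2, 5)
    print(bed, "length", bed[1] - bed[0])
    print(bed_to_sam(*bed))
```

In the 1-based system the length is `end - start + 1`, while in the 0-based half-open system it is `end - start`; both give 4 for the interval used here.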