Manages reproducible, scalable Python workflows with file-based rules, automatic parallelism, and per-rule conda/Singularity environments. Scales from local to SLURM/AWS/GCP for NGS/ML pipelines.
npx claudepluginhub jaechang-hits/sciagent-skills --plugin sciagent-skills
Snakemake is a Python-based workflow management system that scales analyses from laptop to HPC and cloud. Workflows are defined as rules with explicit input/output file dependencies; Snakemake resolves the execution order automatically and runs independent steps in parallel. Rules can call shell commands, Python/R/Julia scripts, or inline Python. Per-rule conda or Singularity environments make workflows fully reproducible. Widely used in bioinformatics for NGS, genome assembly, and variant-calling pipelines.
Use Nextflow instead when you need the Groovy DSL plus dataflow channels, or when leveraging the nf-core community pipeline library. Use Prefect or Airflow instead for data engineering workflows with dynamic task graphs or time-based scheduling.
Dependencies: snakemake, graphviz (for DAG visualization).
Check before installing: the tool may already be available in the current environment (e.g., inside a pixi/conda env). Run `command -v snakemake` first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via `pixi run snakemake` rather than bare `snakemake`.
# Install via conda (includes optional dependencies)
conda install -c conda-forge -c bioconda snakemake
# Minimal pip install
pip install snakemake
# Verify
snakemake --version
# 8.x.x
# Snakefile — minimal 2-rule pipeline
SAMPLES = ["sampleA", "sampleB"]
rule all:  # Target rule: request final outputs
    input:
        expand("results/{sample}.sorted.bam", sample=SAMPLES)

rule align:
    input:
        fastq="data/{sample}.fastq",
        ref="refs/genome.fa"
    output:
        bam="results/{sample}.sorted.bam"
    threads: 4
    shell:
        "bwa mem -t {threads} {input.ref} {input.fastq} "
        "| samtools sort -@ {threads} -o {output.bam}"
# Run: dry-run first, then execute
snakemake -n # dry-run: show what would run
snakemake --cores 8 # execute with 8 cores
Each rule defines one analysis step with inputs, outputs, and an execution method.
# Shell rule: run a command with {input} and {output} placeholders
rule fastqc:
    input:
        fastq="data/{sample}.fastq"
    output:
        html="qc/{sample}_fastqc.html",
        zip="qc/{sample}_fastqc.zip"
    log:
        "logs/fastqc/{sample}.log"
    shell:
        "fastqc {input.fastq} -o qc/ 2> {log}"
# Run rule: inline Python for logic-heavy steps
rule parse_stats:
    input:
        txt="results/{sample}.flagstat.txt"
    output:
        csv="results/{sample}.stats.csv"
    run:
        import re, csv
        text = open(input.txt).read()
        # flagstat line looks like: "98000 + 0 mapped (98.00% : N/A)"
        mapped = re.search(r"(\d+) \+ \d+ mapped", text).group(1)
        with open(output.csv, "w") as f:
            csv.writer(f).writerow([wildcards.sample, mapped])
# Script rule: delegate to external R/Python/Julia script
rule plot_coverage:
    input:
        depth="results/{sample}.depth.txt"
    output:
        pdf="results/{sample}.coverage.pdf"
    script:
        "scripts/plot_coverage.R"
# In the R script, access via snakemake object:
# depth_file <- snakemake@input[["depth"]]
# pdf_path <- snakemake@output[["pdf"]]
Wildcards let one rule process any number of samples; expand() generates all required file paths.
# Define sample list (from config or glob)
SAMPLES = ["ctrl_rep1", "ctrl_rep2", "treat_rep1", "treat_rep2"]
rule all:
    input:
        # expand() generates: qc/ctrl_rep1_fastqc.html, qc/ctrl_rep2_fastqc.html, ...
        expand("qc/{sample}_fastqc.html", sample=SAMPLES),
        expand("results/{sample}.bam", sample=SAMPLES)
# Access wildcard values inside shell/run
rule align:
    input:
        "data/{sample}.fastq"
    output:
        "results/{sample}.bam"
    shell:
        "echo Processing {wildcards.sample}; "
        "bwa mem refs/genome.fa {input} | samtools view -b > {output}"
# Wildcard constraints prevent ambiguous matches
rule process:
    input:
        "data/{sample}_{rep}.fastq"
    output:
        "results/{sample}_{rep}.txt"
    wildcard_constraints:
        sample="[A-Za-z]+",  # letters only
        rep=r"\d+"           # digits only
# multiext: multiple outputs sharing a common path base
rule bwa_index:
    input:
        "refs/genome.fa"
    output:
        multiext("refs/genome.fa", ".amb", ".ann", ".bwt", ".pac", ".sa")
    shell:
        "bwa index {input}"
Config files externalize settings; params passes rule-level values without file dependencies.
# Snakefile: declare config file
configfile: "config/config.yaml"
# config/config.yaml:
# samples: [ctrl, treat]
# threads:
#   align: 8
#   sort: 4
# min_mapq: 20
SAMPLES = config["samples"]
rule filter_reads:
    input:
        "results/{sample}.bam"
    output:
        "results/{sample}.filtered.bam"
    params:
        mapq=config["min_mapq"]  # from config, not a file
    threads:
        config["threads"]["sort"]
    shell:
        "samtools view -q {params.mapq} -b {input} > {output}"
# Dynamic params via lambda functions
rule trim:
    input:
        fastq="data/{sample}.fastq"
    output:
        trimmed="trimmed/{sample}.fastq"
    params:
        # Adapt quality threshold based on sample name
        quality=lambda wildcards: 25 if "ctrl" in wildcards.sample else 20
    shell:
        "fastp -q {params.quality} -i {input.fastq} -o {output.trimmed}"
Declare computational resources for scheduler integration; use conda/Singularity for tool isolation.
# Resource declaration (used by SLURM/LSF profiles)
rule variant_calling:
    input:
        bam="results/{sample}.deduped.bam",
        ref="refs/genome.fa"
    output:
        vcf="variants/{sample}.vcf.gz"
    resources:
        mem_mb=16000,   # memory in MB
        runtime=240,    # max walltime in minutes
        disk_mb=20000   # scratch disk space
    threads: 8
    shell:
        "bcftools mpileup -f {input.ref} {input.bam} "
        "| bcftools call -m -Oz -o {output.vcf}"
# Conda environment per rule (for reproducibility)
rule star_align:
    input:
        reads="data/{sample}.fastq",
        genome_dir="refs/star_index/"
    output:
        bam="star_out/{sample}/Aligned.sortedByCoord.out.bam"
    conda:
        "envs/star.yaml"
    # envs/star.yaml:
    #   channels:
    #     - bioconda
    #   dependencies:
    #     - star=2.7.10b
    #     - samtools=1.17
    threads: 8
    shell:
        "STAR --runThreadN {threads} --genomeDir {input.genome_dir} "
        "--readFilesIn {input.reads} --outSAMtype BAM SortedByCoordinate "
        "--outFileNamePrefix star_out/{wildcards.sample}/"  # write into the declared output directory
# Singularity/Apptainer container
rule gatk_haplotypecaller:
    input:
        bam="results/{sample}.bam",
        ref="refs/genome.fa"
    output:
        gvcf="gvcfs/{sample}.g.vcf.gz"
    container:
        "docker://broadinstitute/gatk:4.4.0.0"
    shell:
        "gatk HaplotypeCaller -I {input.bam} -R {input.ref} "
        "-O {output.gvcf} -ERC GVCF"
Execute locally, on clusters, or in cloud; profiles configure executors without changing the Snakefile.
# Local execution
snakemake --cores 8 # use 8 CPU cores
snakemake --cores all # use all available cores
# Dry run: show tasks without executing
snakemake -n --cores 8
# Output: 12 of 24 steps are complete. 12 jobs to run.
# Force rerun (ignore existing outputs)
snakemake --forceall --cores 8
# Visualize DAG as PDF
snakemake --dag | dot -Tpdf > workflow_dag.pdf
# SLURM cluster profile (profiles/slurm/config.yaml)
# executor: slurm
# jobs: 50
# default-resources:
#   mem_mb: 2000
#   runtime: 60
# use-conda: true
# Run with profile (cluster submit + monitor)
snakemake --profile profiles/slurm --cores 128
# Override resources at runtime
snakemake --profile profiles/slurm \
--set-resources variant_calling:mem_mb=32000 --cores 128
# Override threads
snakemake --set-threads align=16 --cores 64
Handle temporary files, protected outputs, checkpoints, and output validation.
# temp: auto-delete after downstream rules consume it
rule sort_bam:
    input:
        "results/{sample}.raw.bam"
    output:
        temp("results/{sample}.sorted_temp.bam")  # deleted after downstream rules consume it
    shell:
        "samtools sort {input} -o {output}"
# protected: write-protect final outputs (prevent overwrite)
rule final_report:
    input:
        "results/{sample}.vcf.gz"
    output:
        protected("reports/{sample}.final.vcf.gz")
    shell:
        "cp {input} {output}"
# directory: rule that outputs a directory
rule denovo_assembly:
    input:
        fastq="data/{sample}.fastq"
    output:
        directory("assemblies/{sample}/")
    shell:
        "spades.py -s {input.fastq} -o {output}"
# touch: create empty flag file (for ordering-only dependencies)
rule validate_bam:
    input:
        "results/{sample}.bam"
    output:
        touch("checkpoints/{sample}.validated")
    shell:
        "samtools quickcheck {input} && echo OK"
# ensure: validate output properties before considering rule complete
rule download_reference:
    output:
        ensure("refs/genome.fa", min_size=1_000_000)
    shell:
        "wget -O {output} https://example.com/genome.fa"
Snakemake works backward from targets: given a list of desired output files, it builds a DAG of rules needed to produce them. Rules not needed for the current targets are ignored.
# rule all: declare all final outputs here
# Without this, snakemake runs only the first rule
rule all:
    input:
        expand("results/{sample}.vcf.gz", sample=SAMPLES),
        expand("qc/{sample}_fastqc.html", sample=SAMPLES)
{sample} in rule input/output = wildcard: filled in by Snakemake at execution time.
expand("results/{sample}.bam", sample=SAMPLES) = Python: generates a list of strings now (used in rule all).
Goal: FastQC → trim → align → sort → dedup → flagstat for multiple samples.
configfile: "config/config.yaml"
SAMPLES = config["samples"]
rule all:
    input:
        expand("qc/{sample}_fastqc.html", sample=SAMPLES),
        expand("results/{sample}.flagstat.txt", sample=SAMPLES)

rule fastqc:
    input: "data/{sample}.fastq"
    output: "qc/{sample}_fastqc.html", "qc/{sample}_fastqc.zip"
    shell: "fastqc {input} -o qc/"

rule trim:
    input: "data/{sample}.fastq"
    output: "trimmed/{sample}.fastq"
    shell: "fastp -q 20 -i {input} -o {output}"

rule align:
    input:
        fastq="trimmed/{sample}.fastq",
        ref="refs/genome.fa"
    output: temp("results/{sample}.raw.bam")
    threads: 8
    shell:
        "bwa mem -t {threads} {input.ref} {input.fastq} | samtools view -b > {output}"

rule sort_dedup:
    input: "results/{sample}.raw.bam"
    output:
        bam="results/{sample}.bam",
        bai="results/{sample}.bam.bai"
    threads: 4
    shell:
        "samtools sort -@ {threads} {input} | samtools markdup -r - {output.bam} "
        "&& samtools index {output.bam}"

rule flagstat:
    input: "results/{sample}.bam"
    output: "results/{sample}.flagstat.txt"
    shell: "samtools flagstat {input} > {output}"
Goal: Deploy the same Snakefile to HPC with per-job resource allocation.
# 1. Create profiles/slurm/config.yaml
mkdir -p profiles/slurm
cat > profiles/slurm/config.yaml << 'EOF'
executor: slurm
jobs: 100
default-resources:
  mem_mb: 4000
  runtime: 60
use-conda: true
latency-wait: 30
rerun-incomplete: true
EOF
# 2. Add resources to compute-heavy rules in Snakefile
# resources:
# mem_mb=16000, runtime=120
# 3. Submit
snakemake --profile profiles/slurm --cores 256 -n # dry-run
snakemake --profile profiles/slurm --cores 256 # submit
# 4. Generate an HTML report after completion
snakemake --profile profiles/slurm --report report.html
| Parameter | Context | Default | Range/Options | Effect |
|---|---|---|---|---|
| --cores | CLI | 1 | 1–N or all | Max concurrent jobs/threads |
| threads: | Rule | 1 | 1–N | Threads per rule (scales to --cores) |
| mem_mb: | resources: | None | integer | Memory in MB (used by SLURM profile) |
| runtime: | resources: | None | integer (min) | Max walltime per job |
| --profile | CLI | None | path | YAML profile for executor config |
| --use-conda | CLI | False | flag | Activate per-rule conda environments |
| --use-apptainer | CLI | False | flag | Enable Singularity/Apptainer containers |
| -n | CLI | False | flag | Dry-run (show tasks, don't execute) |
| --forceall | CLI | False | flag | Rerun all rules regardless of status |
| --rerun-incomplete | CLI | False | flag | Rerun rules with partial outputs |
| configfile: | Snakefile | None | YAML path | Load config dictionary from YAML |
Always define rule all: Without it, only the first rule in the Snakefile runs. rule all collects all final outputs; Snakemake runs everything needed to produce them.
Use temp() for large intermediates: BAM files before deduplication, unsorted BAMs, and intermediate assemblies can be marked temp() to auto-delete after consumption — saves significant disk.
Separate config from code: Put sample lists, thread counts, file paths, and thresholds in config.yaml. Hard-coded values in Snakefiles make pipelines brittle and non-reusable.
Test with snakemake -n first: The dry-run shows exactly which rules will run and in what order. Run it before every production execution to confirm the DAG is correct.
Use log: for every shell rule: Redirect tool stderr/stdout to per-rule log files (2> {log}). Without logs, debugging cluster job failures is nearly impossible.
Benchmark rules in production: Add benchmark: "benchmarks/{rule}/{sample}.txt" to measure actual runtime and memory — essential data for tuning SLURM resource requests.
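A minimal sketch of the benchmark: directive from the tip above; the rule and paths are illustrative. Note that the rule name in the benchmark path is written literally, and the benchmark file's wildcards must match the rule's output wildcards:
# benchmark: writes a tab-separated file with wall-clock time and memory usage per run
rule align:
    input:
        "trimmed/{sample}.fastq"
    output:
        "results/{sample}.bam"
    benchmark:
        "benchmarks/align/{sample}.txt"   # wildcards must match the output's wildcards
    threads: 8
    shell:
        "bwa mem -t {threads} refs/genome.fa {input} | samtools view -b > {output}"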
# Auto-discover samples from input directory (no hardcoded list)
from pathlib import Path
SAMPLES = [p.stem for p in Path("data/").glob("*.fastq")]  # "sampleA.fastq" -> "sampleA"
print(f"Found {len(SAMPLES)} samples: {SAMPLES[:3]}...")
rule all:
    input:
        expand("results/{sample}.bam", sample=SAMPLES)
configfile: "config/config.yaml"
# Only run deduplication for WGS (not amplicon) data
rule dedup:
    input:
        "results/{sample}.sorted.bam"
    output:
        "results/{sample}.deduped.bam"
    run:
        if config.get("assay_type") == "WGS":
            shell("samtools markdup -r {input} {output}")
        else:
            shell("cp {input} {output}")
# Collect all per-sample stats into one summary table
rule multiqc:
    input:
        expand("qc/{sample}_fastqc.zip", sample=SAMPLES),
        expand("results/{sample}.flagstat.txt", sample=SAMPLES)
    output:
        "multiqc/multiqc_report.html"
    shell:
        "multiqc qc/ results/ -o multiqc/"
| Problem | Cause | Solution |
|---|---|---|
| AmbiguousRuleException | Multiple rules match same output | Add wildcard_constraints:, use ruleorder: rule_a > rule_b, or rename outputs |
| MissingOutputException | Rule completed but output file absent | Check working directory in shell; verify output path; check disk space |
| TargetFileException | rule all requests a file no rule can produce | Verify expand() args match wildcard names; use -n to trace resolution |
| Cluster jobs all fail | Resources too low for tool | Increase mem_mb or runtime; check cluster queue with squeue |
| Conda env build fails | Package conflict or wrong channel | Add conda-forge before bioconda; pin package versions |
| Rule reruns unexpectedly | Output file timestamp older than input | Touch output files with snakemake --touch; or delete and rerun |
| PermissionError on protected output | protected() wrapper applied | Manually remove write protection (chmod +w) and delete the file, or regenerate without protected() |
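For the AmbiguousRuleException case, a minimal sketch of ruleorder with two hypothetical rules (align_pe, align_se) that could both produce the same BAM:
# ruleorder: resolve ambiguity when several rules can produce the same output
ruleorder: align_pe > align_se   # prefer the paired-end rule when both match

rule align_pe:
    input:
        r1="data/{sample}_R1.fastq",
        r2="data/{sample}_R2.fastq"
    output:
        "results/{sample}.bam"
    shell:
        "bwa mem refs/genome.fa {input.r1} {input.r2} | samtools view -b > {output}"

rule align_se:
    input:
        "data/{sample}.fastq"
    output:
        "results/{sample}.bam"
    shell:
        "bwa mem refs/genome.fa {input} | samtools view -b > {output}"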