From sciagent-skills
Builds reproducible workflows with Snakemake: define rules for file dependencies, automatic parallelism, and per-rule conda/Singularity envs. Scales NGS/ML pipelines from local to SLURM/AWS/GCP.
How this skill is triggered — by the user, by Claude, or both
Slash command
/sciagent-skills:snakemake-workflow-engineThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Snakemake is a Python-based workflow management system that scales analyses from laptop to HPC and cloud. Workflows are defined as rules with explicit input/output file dependencies; Snakemake resolves the execution order automatically and runs independent steps in parallel. Rules can call shell commands, Python/R/Julia scripts, or inline Python. Per-rule conda or Singularity environments make ...
Snakemake is a Python-based workflow management system that scales analyses from laptop to HPC and cloud. Workflows are defined as rules with explicit input/output file dependencies; Snakemake resolves the execution order automatically and runs independent steps in parallel. Rules can call shell commands, Python/R/Julia scripts, or inline Python. Per-rule conda or Singularity environments make workflows fully reproducible. Widely used in bioinformatics for NGS, genome assembly, and variant-calling pipelines.
Nextflow instead when you need Groovy DSL + dataflow channels, or when leveraging the nf-core community pipeline libraryPrefect or Airflow instead for data engineering workflows with dynamic task graphs or time-based schedulingsnakemake, graphviz (for DAG visualization)Check before installing: The tool may already be available in the current environment (e.g., inside a
pixi/condaenv). Runcommand -v snakemakefirst and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool viapixi run snakemakerather than baresnakemake.
# Install via conda (includes optional dependencies)
conda install -c conda-forge -c bioconda snakemake
# Minimal pip install
pip install snakemake
# Verify
snakemake --version
# 8.x.x
# Snakefile — minimal 2-rule pipeline
SAMPLES = ["sampleA", "sampleB"]
rule all: # Target rule: request final outputs
input:
expand("results/{sample}.sorted.bam", sample=SAMPLES)
rule align:
input:
fastq="data/{sample}.fastq",
ref="refs/genome.fa"
output:
bam="results/{sample}.sorted.bam"
threads: 4
shell:
"bwa mem -t {threads} {input.ref} {input.fastq} "
"| samtools sort -@ {threads} -o {output.bam}"
# Run: dry-run first, then execute
snakemake -n # dry-run: show what would run
snakemake --cores 8 # execute with 8 cores
Each rule defines one analysis step with inputs, outputs, and an execution method.
# Shell rule: run a command with {input} and {output} placeholders
rule fastqc:
input:
fastq="data/{sample}.fastq"
output:
html="qc/{sample}_fastqc.html",
zip="qc/{sample}_fastqc.zip"
log:
"logs/fastqc/{sample}.log"
shell:
"fastqc {input.fastq} -o qc/ 2> {log}"
# Run rule: inline Python for logic-heavy steps
rule parse_stats:
input:
txt="results/{sample}.flagstat.txt"
output:
csv="results/{sample}.stats.csv"
run:
import re, csv
lines = open(input.txt).readlines()
mapped = re.search(r"(\d+) mapped", "".join(lines)).group(1)
with open(output.csv, "w") as f:
csv.writer(f).writerow([wildcards.sample, mapped])
# Script rule: delegate to external R/Python/Julia script
rule plot_coverage:
input:
depth="results/{sample}.depth.txt"
output:
pdf="results/{sample}.coverage.pdf"
script:
"scripts/plot_coverage.R"
# In the R script, access via snakemake object:
# depth_file <- snakemake@input[["depth"]]
# pdf_path <- snakemake@output[["pdf"]]
Wildcards let one rule process any number of samples; expand() generates all required file paths.
# Define sample list (from config or glob)
SAMPLES = ["ctrl_rep1", "ctrl_rep2", "treat_rep1", "treat_rep2"]
rule all:
input:
# expand() generates: qc/ctrl_rep1_fastqc.html, qc/ctrl_rep2_fastqc.html, ...
expand("qc/{sample}_fastqc.html", sample=SAMPLES),
expand("results/{sample}.bam", sample=SAMPLES)
# Access wildcard values inside shell/run
rule align:
input:
"data/{sample}.fastq"
output:
"results/{sample}.bam"
shell:
"echo Processing {wildcards.sample}; "
"bwa mem refs/genome.fa {input} | samtools view -b > {output}"
# Wildcard constraints prevent ambiguous matches
rule process:
input:
"data/{sample}_{rep}.fastq"
output:
"results/{sample}_{rep}.txt"
wildcard_constraints:
sample="[A-Za-z]+", # letters only
rep="\d+" # digits only
# multiext: multiple outputs sharing a common path base
rule bwa_index:
input:
"refs/genome.fa"
output:
multiext("refs/genome.fa", ".amb", ".ann", ".bwt", ".pac", ".sa")
shell:
"bwa index {input}"
Config files externalize settings; params passes rule-level values without file dependencies.
# Snakefile: declare config file
configfile: "config/config.yaml"
# config/config.yaml:
# samples: [ctrl, treat]
# threads:
# align: 8
# sort: 4
# min_mapq: 20
SAMPLES = config["samples"]
rule filter_reads:
input:
"results/{sample}.bam"
output:
"results/{sample}.filtered.bam"
params:
mapq=config["min_mapq"] # from config, not a file
threads:
config["threads"]["sort"]
shell:
"samtools view -q {params.mapq} -b {input} > {output}"
# Dynamic params via lambda functions
rule trim:
input:
fastq="data/{sample}.fastq"
output:
trimmed="trimmed/{sample}.fastq"
params:
# Adapt quality threshold based on sample name
quality=lambda wildcards: 25 if "ctrl" in wildcards.sample else 20
shell:
"fastp -q {params.quality} -i {input.fastq} -o {output.trimmed}"
Declare computational resources for scheduler integration; use conda/Singularity for tool isolation.
# Resource declaration (used by SLURM/LSF profiles)
rule variant_calling:
input:
bam="results/{sample}.deduped.bam",
ref="refs/genome.fa"
output:
vcf="variants/{sample}.vcf.gz"
resources:
mem_mb=16000, # memory in MB
runtime=240, # max walltime in minutes
disk_mb=20000 # scratch disk space
threads: 8
shell:
"bcftools mpileup -f {input.ref} {input.bam} "
"| bcftools call -m -Oz -o {output.vcf}"
# Conda environment per rule (for reproducibility)
rule star_align:
input:
reads="data/{sample}.fastq",
genome_dir="refs/star_index/"
output:
bam="star_out/{sample}/Aligned.sortedByCoord.out.bam"
conda:
"envs/star.yaml"
# envs/star.yaml:
# channels:
# - bioconda
# dependencies:
# - star=2.7.10b
# - samtools=1.17
threads: 8
shell:
"STAR --runThreadN {threads} --genomeDir {input.genome_dir} "
"--readFilesIn {input.reads} --outSAMtype BAM SortedByCoordinate"
# Singularity/Apptainer container
rule gatk_haplotypecaller:
input:
bam="results/{sample}.bam",
ref="refs/genome.fa"
output:
gvcf="gvcfs/{sample}.g.vcf.gz"
container:
"docker://broadinstitute/gatk:4.4.0.0"
shell:
"gatk HaplotypeCaller -I {input.bam} -R {input.ref} "
"-O {output.gvcf} -ERC GVCF"
Execute locally, on clusters, or in cloud; profiles configure executors without changing the Snakefile.
# Local execution
snakemake --cores 8 # use 8 CPU cores
snakemake --cores all # use all available cores
# Dry run: show tasks without executing
snakemake -n --cores 8
# Output: 12 of 24 steps are complete. 12 jobs to run.
# Force rerun (ignore existing outputs)
snakemake --forceall --cores 8
# Visualize DAG as PDF
snakemake --dag | dot -Tpdf > workflow_dag.pdf
# SLURM cluster profile (profiles/slurm/config.yaml)
# executor: slurm
# jobs: 50
# default-resources:
# mem_mb: 2000
# runtime: 60
# use-conda: true
# Run with profile (cluster submit + monitor)
snakemake --profile profiles/slurm --cores 128
# Override resources at runtime
snakemake --profile profiles/slurm \
--set-resources variant_calling:mem_mb=32000 --cores 128
# Override threads
snakemake --set-threads align=16 --cores 64
Handle temporary files, protected outputs, checkpoints, and output validation.
# temp: auto-delete after downstream rules consume it
rule sort_bam:
input:
"results/{sample}.raw.bam"
output:
temp("results/{sample}.sorted_temp.bam") # deleted after indexing
shell:
"samtools sort {input} -o {output}"
# protected: write-protect final outputs (prevent overwrite)
rule final_report:
input:
"results/{sample}.vcf.gz"
output:
protected("reports/{sample}.final.vcf.gz")
shell:
"cp {input} {output}"
# directory: rule that outputs a directory
rule denovo_assembly:
input:
fastq="data/{sample}.fastq"
output:
directory("assemblies/{sample}/")
shell:
"spades.py -s {input.fastq} -o {output}"
# touch: create empty flag file (for ordering-only dependencies)
rule validate_bam:
input:
"results/{sample}.bam"
output:
touch("checkpoints/{sample}.validated")
shell:
"samtools quickcheck {input} && echo OK"
# ensure: validate output properties before considering rule complete
rule download_reference:
output:
ensure("refs/genome.fa", min_size=1_000_000)
shell:
"wget -O {output} https://example.com/genome.fa"
Snakemake works backward from targets: given a list of desired output files, it builds a DAG of rules needed to produce them. Rules not needed for the current targets are ignored.
# rule all: declare all final outputs here
# Without this, snakemake runs only the first rule
rule all:
input:
expand("results/{sample}.vcf.gz", sample=SAMPLES),
expand("qc/{sample}_fastqc.html", sample=SAMPLES)
{sample} in rule input/output = wildcard: filled by Snakemake at execution timeexpand("results/{sample}.bam", sample=SAMPLES) = Python: generates a list of strings NOW (used in rule all)Goal: FastQC → trim → align → sort → dedup → flagstat for multiple samples.
configfile: "config/config.yaml"
SAMPLES = config["samples"]
rule all:
input:
expand("qc/{sample}_fastqc.html", sample=SAMPLES),
expand("results/{sample}.flagstat.txt", sample=SAMPLES)
rule fastqc:
input: "data/{sample}.fastq"
output: "qc/{sample}_fastqc.html", "qc/{sample}_fastqc.zip"
shell: "fastqc {input} -o qc/"
rule trim:
input: "data/{sample}.fastq"
output: "trimmed/{sample}.fastq"
shell: "fastp -q 20 -i {input} -o {output}"
rule align:
input:
fastq="trimmed/{sample}.fastq",
ref="refs/genome.fa"
output: temp("results/{sample}.raw.bam")
threads: 8
shell:
"bwa mem -t {threads} {input.ref} {input.fastq} | samtools view -b > {output}"
rule sort_dedup:
input: "results/{sample}.raw.bam"
output:
bam="results/{sample}.bam",
bai="results/{sample}.bam.bai"
threads: 4
shell:
"samtools sort -@ {threads} {input} | samtools markdup -r - {output.bam} "
"&& samtools index {output.bam}"
rule flagstat:
input: "results/{sample}.bam"
output: "results/{sample}.flagstat.txt"
shell: "samtools flagstat {input} > {output}"
Goal: Deploy the same Snakefile to HPC with per-job resource allocation.
# 1. Create profiles/slurm/config.yaml
mkdir -p profiles/slurm
cat > profiles/slurm/config.yaml << 'EOF'
executor: slurm
jobs: 100
default-resources:
mem_mb: 4000
runtime: 60
use-conda: true
latency-wait: 30
rerun-incomplete: true
EOF
# 2. Add resources to compute-heavy rules in Snakefile
# resources:
# mem_mb=16000, runtime=120
# 3. Submit
snakemake --profile profiles/slurm --cores 256 -n # dry-run
snakemake --profile profiles/slurm --cores 256 # submit
# 4. Monitor
snakemake --profile profiles/slurm --report report.html # after completion
| Parameter | Context | Default | Range/Options | Effect |
|---|---|---|---|---|
--cores | CLI | 1 | 1–N or all | Max concurrent jobs/threads |
threads: | Rule | 1 | 1–N | Threads per rule (scales to --cores) |
mem_mb: | resources: | None | integer | Memory in MB (used by SLURM profile) |
runtime: | resources: | None | integer (min) | Max walltime per job |
--profile | CLI | None | path | YAML profile for executor config |
--use-conda | CLI | False | flag | Activate per-rule conda environments |
--use-apptainer | CLI | False | flag | Enable Singularity/Apptainer containers |
-n | CLI | False | flag | Dry-run (show tasks, don't execute) |
--forceall | CLI | False | flag | Rerun all rules regardless of status |
--rerun-incomplete | CLI | False | flag | Rerun rules with partial outputs |
configfile: | Snakefile | None | YAML path | Load config dictionary from YAML |
Always define rule all: Without it, only the first rule in the Snakefile runs. rule all collects all final outputs; Snakemake runs everything needed to produce them.
Use temp() for large intermediates: BAM files before deduplication, unsorted BAMs, and intermediate assemblies can be marked temp() to auto-delete after consumption — saves significant disk.
Separate config from code: Put sample lists, thread counts, file paths, and thresholds in config.yaml. Hard-coded values in Snakefiles make pipelines brittle and non-reusable.
Test with snakemake -n first: The dry-run shows exactly which rules will run and in what order. Run it before every production execution to confirm the DAG is correct.
Use log: for every shell rule: Redirect tool stderr/stdout to per-rule log files (2> {log}). Without logs, debugging cluster job failures is nearly impossible.
Benchmark rules in production: Add benchmark: "benchmarks/{rule}/{sample}.txt" to measure actual runtime and memory — essential data for tuning SLURM resource requests.
# Auto-discover samples from input directory (no hardcoded list)
from pathlib import Path
SAMPLES = [p.stem.replace(".fastq", "") for p in Path("data/").glob("*.fastq")]
print(f"Found {len(SAMPLES)} samples: {SAMPLES[:3]}...")
rule all:
input:
expand("results/{sample}.bam", sample=SAMPLES)
configfile: "config/config.yaml"
# Only run deduplication for WGS (not amplicon) data
rule dedup:
input:
"results/{sample}.sorted.bam"
output:
"results/{sample}.deduped.bam"
run:
if config.get("assay_type") == "WGS":
shell("samtools markdup -r {input} {output}")
else:
shell("cp {input} {output}")
# Collect all per-sample stats into one summary table
rule multiqc:
input:
expand("qc/{sample}_fastqc.zip", sample=SAMPLES),
expand("results/{sample}.flagstat.txt", sample=SAMPLES)
output:
"multiqc/multiqc_report.html"
shell:
"multiqc qc/ results/ -o multiqc/"
| Problem | Cause | Solution |
|---|---|---|
AmbiguousRuleException | Multiple rules match same output | Add wildcard_constraints:, use ruleorder rule_a > rule_b, or rename outputs |
MissingOutputException | Rule completed but output file absent | Check working directory in shell; verify output path; check disk space |
TargetFileException | rule all requests a file no rule can produce | Verify expand() args match wildcard names; use -n to trace resolution |
| Cluster jobs all fail | Resources too low for tool | Increase mem_mb or runtime; check cluster queue with squeue |
| Conda env build fails | Package conflict or wrong channel | Add conda-forge before bioconda; pin package versions |
| Rule reruns unexpectedly | Output file timestamp older than input | Touch output files with snakemake --touch; or delete and rerun |
PermissionError on protected output | protected() wrapper applied | Remove protection with --force; or delete and regenerate without protected() |
npx claudepluginhub jaechang-hits/sciagent-skills --plugin sciagent-skillsBuilds scalable containerized bioinformatics pipelines with Nextflow dataflow model: processes connected by channels, runnable on local, HPC (SLURM/SGE), cloud (AWS/GCP/Azure), or Kubernetes via config profiles. Powers nf-core pipelines.
Develops and deploys serverless bioinformatics workflows on LatchBio using the Latch SDK. Author workflows with decorators, manage cloud I/O, and wrap Nextflow/Snakemake pipelines.
Latch platform for bioinformatics workflows. Build pipelines with Latch SDK, @workflow/@task decorators, deploy serverless workflows, LatchFile/LatchDir, Nextflow/Snakemake integration.