VGP assembly pipeline - Galaxy workflow selection, execution patterns, QC checkpoints, and batch orchestration
npx claudepluginhub joshuarweaver/cascade-ai-ml-engineering --plugin delphine-l-claude-global
This skill uses the workspace's default tool permissions.
The Vertebrate Genome Project (VGP) assembly pipeline consists of Galaxy workflows for producing high-quality, phased, chromosome-level genome assemblies. This skill covers workflow selection, execution patterns, and quality control checkpoints.
Supporting files (detailed reference material):
When documenting workflow selection in publications:
"For species with available parental data (trio datasets), we employed
VGP2 -> VGP5 workflows. For species without parental data (non-trio datasets),
we performed VGP1 -> VGP3 workflows."
| Workflow | Name | Description |
|---|---|---|
| WF0 | Mitochondrial Assembly | MitoHiFi assembly (runs in parallel, may fail if no mito reads) |
| WF1 | K-mer Profiling | Genome size, heterozygosity estimation (HiFi) |
| WF2 | Trio K-mer Profiling | K-mer profiling with parental data |
| WF3 | Hifiasm | HiFi-only assembly |
| WF4 | Hifiasm + HiC | HiC-phased assembly |
| WF5 | Hifiasm Trio | Trio-phased assembly |
| WF6 | Purge Duplicates | Remove haplotypic duplications |
| WF7 | Deprecated | No longer used |
| WF8 | Hi-C Scaffolding | YAHS chromosome scaffolding |
| WF9 | Decontamination | Remove contaminants |
| PreCuration | Pretext Snapshot | Prepare files for manual curation |
| Workflow | IWC Repo | Latest Version | Dockstore ID |
|---|---|---|---|
| WF1 | kmer-profiling-hifi-VGP1 | v0.6 | github.com/iwc-workflows/kmer-profiling-hifi-VGP1/main |
| WF4 | Assembly-Hifi-HiC-phasing-VGP4 | v0.5 | github.com/iwc-workflows/Assembly-Hifi-HiC-phasing-VGP4/main |
| WF8 | Scaffolding-HiC-VGP8 | v3.3 | github.com/iwc-workflows/Scaffolding-HiC-VGP8/main |
BUSCO -> Compleasm (WF4 v0.5, WF8 v3.3):
Compleasm (0.2.5+galaxy0) replaced BUSCO for gene completeness assessment.
Hi-C reads format change (WF4 v0.5, WF8 v3.3):
New required inputs across all workflows:
WF4 additional new inputs: Trim Hi-C reads? (boolean), Name for Haplotype 1/2 (defaults: Hap1/Hap2), Bits for bloom filter (default: 37)
WF8 additional new inputs: Haplotype (restricted: Haplotype 1/2, Maternal/Paternal, Primary/Alternate), Trim Hi-C Data? (boolean), Minimum Mapping Quality (default: 10)
Check latest versions via Dockstore API:
https://dockstore.org/api/ga4gh/trs/v2/tools/%23workflow%2Fgithub.com%2Fiwc-workflows%2F{REPO}%2Fmain/versions
Check workflow inputs by fetching the .ga file from GitHub:
https://raw.githubusercontent.com/iwc-workflows/{REPO}/main/{WORKFLOW_NAME}.ga
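The two URL templates above can be filled in programmatically. The following is a minimal sketch using only the Python standard library. The helper names `dockstore_versions_url`, `workflow_ga_url`, and `fetch_json` are hypothetical, and nothing is assumed about the fields of the JSON that Dockstore returns. The `.ga` file name has to be supplied by the caller, since it is not necessarily identical to the repo name.

```python
import json
import urllib.request
from urllib.parse import quote

TRS_BASE = "https://dockstore.org/api/ga4gh/trs/v2/tools"
RAW_BASE = "https://raw.githubusercontent.com/iwc-workflows"


def dockstore_versions_url(repo: str) -> str:
    """Build the TRS 'versions' URL for an IWC workflow repo (hypothetical helper)."""
    tool_id = f"#workflow/github.com/iwc-workflows/{repo}/main"
    # safe="" percent-encodes '#' and '/' so the ID stays a single path segment
    return f"{TRS_BASE}/{quote(tool_id, safe='')}/versions"


def workflow_ga_url(repo: str, workflow_name: str) -> str:
    """Build the raw GitHub URL of a workflow's .ga file (hypothetical helper)."""
    return f"{RAW_BASE}/{repo}/main/{workflow_name}.ga"


def fetch_json(url: str, timeout: float = 30.0):
    """Download a URL and parse the body as JSON (requires network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Prints the URL only; call fetch_json(...) on it to query Dockstore
    print(dockstore_versions_url("kmer-profiling-hifi-VGP1"))
```

Both a TRS versions response and a `.ga` file are JSON, so `fetch_json` can be used for either URL.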
if trajectory == "C":  # HiFi only
    wf6_required = True   # WF6 is REQUIRED
    wf6_border = "solid"
else:  # Trajectory A or B
    wf6_required = False  # WF6 is OPTIONAL
    wf6_border = "dashed"
    # Can skip directly to WF8
When to skip WF6 (Trajectories A/B):
When to run WF6 (Trajectories A/B):
| Data Type | Minimum Coverage | Notes |
|---|---|---|
| HiFi | 30x | Diploid genome |
| Hi-C | 60x | Diploid genome |
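The thresholds above can be checked before launching a run. This sketch assumes coverage is computed as total sequenced bases divided by the estimated genome size (for example the WF1 estimate). The source does not state which genome size the thresholds refer to, so treat that as an assumption. The function names are hypothetical.

```python
# Minimum coverage per data type, taken from the table above
MIN_COVERAGE = {"HiFi": 30.0, "Hi-C": 60.0}


def coverage(total_bases: float, genome_size_bp: float) -> float:
    """Fold coverage = total sequenced bases / estimated genome size (assumed definition)."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return total_bases / genome_size_bp


def meets_minimum(data_type: str, total_bases: float, genome_size_bp: float) -> bool:
    """Return True when the data set reaches the minimum coverage for its type."""
    return coverage(total_bases, genome_size_bp) >= MIN_COVERAGE[data_type]


if __name__ == "__main__":
    # 90 Gb of reads on a 3 Gb genome is 30x
    print(meets_minimum("HiFi", 90e9, 3e9))
    print(meets_minimum("Hi-C", 90e9, 3e9))
```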
Problem: VGP assemblies often place both sex chromosomes (X+Y or Z+W) in the main haplotype, requiring adjustment to expected chromosome counts.
Solution: When both sex chromosomes are present in the main haplotype, expected = n + 1 (not n)
Implementation:
# Adjust haploid expected count when BOTH sex chromosomes are in the main haplotype
df['num_chromosomes_haploid_adjusted'] = df['num_chromosomes_haploid'].copy()

both_sex_chr_patterns = [
    'Has X and Y',
    'Has Z and W',
    'has Z and W',
    'Has X1, X2, and Y',
    'Has Z1, Z2, and W',
    'Has 5X and 5Y',
]

# The column name spelling ('haploptype') is kept exactly as given
if 'Sex chromosomes main haploptype' in df.columns:
    has_both_sex = df['Sex chromosomes main haploptype'].isin(both_sex_chr_patterns)
    # Only adjust rows where a haploid count is available
    adjust = has_both_sex & df['num_chromosomes_haploid'].notna()
    df.loc[adjust, 'num_chromosomes_haploid_adjusted'] = (
        df.loc[adjust, 'num_chromosomes_haploid'] + 1
    )
Biological Reasoning:
Impact: Improved perfect match rate from 0% to ~90% in validation analyses
Validation Metrics:
# Use adjusted counts for validation
achieved = df['total_number_of_chromosomes']
expected = df['num_chromosomes_haploid_adjusted']
perfect_matches = (achieved == expected).sum()
within_1 = ((achieved - expected).abs() <= 1).sum()
ratio = achieved / expected
Wrong: Compare diploid expected (2n) to haploid assembly
Wrong: Use haploid (n) when both sex chromosomes present
Correct: Use adjusted haploid (n or n+1 depending on sex chromosome configuration)
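The n versus n + 1 rule can be illustrated on single values without a DataFrame. This is a minimal sketch, not the pipeline's own code. The function name `adjusted_haploid_expected` is hypothetical, and matching annotations case-insensitively is a simplification introduced here (the pattern list above spells out the case variants instead).

```python
from typing import Optional

# Annotations meaning both sex chromosomes sit in the main haplotype
# (lower-cased versions of the patterns listed above)
BOTH_SEX_CHR_PATTERNS = {
    "has x and y",
    "has z and w",
    "has x1, x2, and y",
    "has z1, z2, and w",
    "has 5x and 5y",
}


def adjusted_haploid_expected(
    n_haploid: Optional[int], sex_chr_annotation: Optional[str]
) -> Optional[int]:
    """Return n + 1 when both sex chromosomes are in the main haplotype, else n."""
    if n_haploid is None:
        return None  # no expectation available, nothing to adjust
    annotation = (sex_chr_annotation or "").strip().lower()
    if annotation in BOTH_SEX_CHR_PATTERNS:
        return n_haploid + 1
    return n_haploid


if __name__ == "__main__":
    # Toy numbers: haploid n = 19, assembly with X and Y in the main haplotype
    expected = adjusted_haploid_expected(19, "Has X and Y")
    achieved = 20
    print(expected, achieved == expected)
```

With the unadjusted n = 19 the same assembly would count as a mismatch, which is the situation the adjustment is meant to avoid.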
WF0 runs in parallel with the main pipeline and may fail if:
def check_mitohifi_failure(wf0_result):
    """Distinguish biological vs technical failure"""
    if "no_mito_reads" in wf0_result.log:
        return "biological"  # Expected for some samples
    else:
        return "technical"  # Investigate further
When creating workflow diagrams:
Colors: #fff3e0, #e8f5e9, #f3e5f5, #e3f2fd, #e8f5e9, #fce4ec
#e3f2fd: "x2 per haplotype" - runs separately
#e8f5e9: "both haplotypes" - runs together
Colors: #4285f4, #34a853, #ea4335
| Trajectory | Inputs | K-mer | Assembly | Purge | Scaffold | Finish | Output |
|---|---|---|---|---|---|---|---|
| A | HiFi+HiC | WF1 | WF4 | [WF6] | WF8 | WF9->Pre | hap1/hap2 |
| B | HiFi+Trio | WF2 | WF5 | [WF6] | WF8 | WF9->Pre | mat/pat |
| C | HiFi only | WF1 | WF3 | WF6 | - | WF9->Pre | pri/alt |
[WF6] = optional, WF6 = required, - = skipped
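The trajectory table can be encoded as a small lookup for batch orchestration. This is a sketch under stated assumptions: the dictionary layout and the function `planned_workflows` are hypothetical, WF0 is left out because it runs in parallel with the main pipeline, and the decision to run the optional WF6 is passed in by the caller rather than derived from data.

```python
from typing import Dict, List

# One entry per row of the trajectory table above
TRAJECTORIES: Dict[str, dict] = {
    "A": {
        "inputs": "HiFi+HiC",
        "steps": ["WF1", "WF4", "WF6", "WF8", "WF9", "PreCuration"],
        "wf6_optional": True,
        "output": "hap1/hap2",
    },
    "B": {
        "inputs": "HiFi+Trio",
        "steps": ["WF2", "WF5", "WF6", "WF8", "WF9", "PreCuration"],
        "wf6_optional": True,
        "output": "mat/pat",
    },
    "C": {
        "inputs": "HiFi only",
        "steps": ["WF1", "WF3", "WF6", "WF9", "PreCuration"],  # no WF8
        "wf6_optional": False,
        "output": "pri/alt",
    },
}


def planned_workflows(trajectory: str, run_optional_purge: bool = False) -> List[str]:
    """Return the ordered workflow list for a trajectory (hypothetical helper)."""
    spec = TRAJECTORIES[trajectory]
    steps = list(spec["steps"])  # copy so the table itself is never mutated
    if spec["wf6_optional"] and not run_optional_purge:
        steps.remove("WF6")
    return steps


if __name__ == "__main__":
    for name in TRAJECTORIES:
        print(name, " -> ".join(planned_workflows(name)))
```

Passing `run_optional_purge=True` for trajectories A or B keeps WF6 in the plan. For trajectory C the flag has no effect because WF6 is always required.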
GCA_011100685.1 - Frequently used reference genome for RagTag scaffolding in canid genome assemblies.
When documenting scaffolding in methods sections:
For reproducibility: