Access and navigate GenomeArk AWS S3 bucket - VGP assemblies, QC data, and species directory structure
npx claudepluginhub joshuarweaver/cascade-ai-ml-engineering --plugin delphine-l-claude-global
This skill is limited to using the following tools:
Comprehensive guide for accessing and navigating the GenomeArk AWS S3 public bucket containing Vertebrate Genomes Project (VGP) assemblies and quality control data.
Supporting files (read as needed for detailed code and strategies):
Use this skill when:
GenomeArk is a public AWS S3 bucket (s3://genomeark/) hosting:
Access Method: Public bucket requiring no AWS credentials when using --no-sign-request
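As a minimal sketch of anonymous access, the helper below builds and runs an `aws s3 ls` command with `--no-sign-request`. The function names `build_ls_command` and `list_prefix` are hypothetical, and the sketch assumes the AWS CLI is installed and on the PATH.

```python
import subprocess

BUCKET = "s3://genomeark/"


def build_ls_command(prefix=""):
    """Return an anonymous `aws s3 ls` command for a prefix inside the bucket."""
    return ["aws", "s3", "ls", BUCKET + prefix.lstrip("/"), "--no-sign-request"]


def list_prefix(prefix="", timeout=30):
    """Run the listing and return its raw output lines (requires the AWS CLI)."""
    result = subprocess.run(
        build_ls_command(prefix), capture_output=True, text=True, timeout=timeout
    )
    result.check_returncode()  # raise CalledProcessError if the CLI reported a failure
    return result.stdout.splitlines()
```

Calling `list_prefix("species/")` would list the species directories without any AWS credentials configured.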
Critical Discovery: GenomeArk structure has evolved over time (2022 -> 2024+). Always implement fallback path patterns for reliability.
s3://genomeark/
└── species/
    └── {Genus_species}/                      # e.g., Rhinolophus_ferrumequinum
        └── {ToLID}/                          # e.g., mRhiFer1 (VGP specimen ID)
            ├── assembly_vgp_{type}_{version}/
            │   ├── evaluation/               # QC metrics (MAIN ACCESS POINT)
            │   │   ├── genomescope/
            │   │   ├── busco/
            │   │   ├── merqury/
            │   │   └── ...
            │   └── intermediates/            # K-mer databases, temp files
            │       └── meryl/
            └── genomic_data/                 # Raw sequencing data folders
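The layout above can be turned into a small path parser. This is a sketch only: `GenomeArkPath` and `parse_genomeark_path` are hypothetical names, and the code assumes every path of interest sits under `s3://genomeark/species/`.

```python
from typing import NamedTuple, Optional


class GenomeArkPath(NamedTuple):
    species: str    # e.g. Rhinolophus_ferrumequinum
    tolid: str      # e.g. mRhiFer1
    top_dir: str    # e.g. assembly_vgp_HiC_2.0 or genomic_data
    remainder: str  # everything below top_dir; empty if nothing follows


def parse_genomeark_path(s3_path: str) -> Optional[GenomeArkPath]:
    """Split a GenomeArk S3 path into species, ToLID, top-level directory and the rest."""
    prefix = "s3://genomeark/species/"
    if not s3_path.startswith(prefix):
        return None
    parts = s3_path[len(prefix):].strip("/").split("/", 3)
    if len(parts) < 3:
        return None  # path stops above the assembly / genomic_data level
    remainder = parts[3] if len(parts) > 3 else ""
    return GenomeArkPath(parts[0], parts[1], parts[2], remainder)
```

Returning `None` for paths that stop above the ToLID level keeps callers from treating a species directory as an assembly.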
assembly_vgp_{type}_{version} - Standard VGP Patterns:
- assembly_vgp_HiC_2.0 - Hi-C phased assembly (case-sensitive!)
- assembly_vgp_standard_2.0 - Standard assembly without Hi-C
- assembly_vgp_hic_2.0 - Alternative Hi-C naming
- assembly_vgp_trio_2.0 - Trio-binned assembly

Legacy Versions (2019-2021 assemblies):
- assembly_vgp_standard_1.6 - Version 1.6 (common in fish, birds)
- assembly_vgp_standard_1.0 - Version 1.0 (early assemblies)
- assembly_vgp_HiC_1.6 - Hi-C version 1.6
- assembly_vgp_HiC_1.0 - Hi-C version 1.0
- assembly_vgp_HiC_1.4 - Hi-C version 1.4

Verkko Assemblies (diploid assemblies):
- assembly_verkko_1.4/ - Verkko version 1.4
- assembly_verkko_1.1-0.1/ - Verkko version 1.1-0.1
- assembly_verkko_1.1-0.1-freeze/ - Frozen version
- assembly_verkko_1.1-0.2/ - Version 1.1-0.2
- assembly_verkko_1.4.1r/ - Revised version 1.4.1

Clade-Specific Directories (2023+ specialized assemblies):
- assembly_primate_v1.4.2/ - Primate-specific pipeline
- assembly_fish_* - Fish-specific (potential)
- assembly_bird_* - Bird-specific (potential)

Institution-Specific Directories:
- assembly_rockefeller/ - Rockefeller University assemblies
- assembly_cambridge/ - Cambridge assemblies
- assembly_MT_rockefeller/ - Case variation
- assembly_mt_rockefeller/ - Lowercase variation
- assembly_mt_milan/ - Milan institute

Directories Without "assembly_" Prefix (rare):
- vgp_standard_1.6/ - Standard v1.6 without prefix
- vgp_standard_1.0/ - Standard v1.0 without prefix
- vgp_HiC_1.6/ - Hi-C v1.6 without prefix

Curated Assemblies (post-manual curation):
- assembly_curated/ - Exclude for date extraction (post-curation dates)

CRITICAL CASE SENSITIVITY:
- assembly_vgp_hic_2.0 (lowercase)
- assembly_vgp_HiC_2.0 (mixed case!)

COMPREHENSIVE PATTERN MATCHING:
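One way to approach this kind of matching is sketched below: a regex classifier for directory names plus an ordered list of fallback candidates. The labels, the function names, and the fallback order are assumptions made for illustration; only the directory names themselves come from the lists above.

```python
import re

# Ordered (label, pattern) pairs; the first match wins.
ASSEMBLY_DIR_PATTERNS = [
    ("curated", re.compile(r"^assembly_curated$")),
    # VGP directories, with or without the "assembly_" prefix; type case varies (HiC / hic).
    ("vgp", re.compile(r"^(?:assembly_)?vgp_[A-Za-z]+_\d+\.\d+$")),
    ("verkko", re.compile(r"^assembly_verkko_.+$")),
    # Clade- and institution-specific directories fall through to this catch-all.
    ("other", re.compile(r"^assembly_.+$")),
]


def classify_assembly_dir(name):
    """Return a coarse label for an assembly directory name, or None if it is not one."""
    name = name.rstrip("/")
    for label, pattern in ASSEMBLY_DIR_PATTERNS:
        if pattern.match(name):
            return label
    return None


def vgp_candidates(version="2.0"):
    """Return fallback directory names to try, prefixed forms first (assumed order)."""
    types = ["HiC", "hic", "standard", "trio"]
    prefixed = [f"assembly_vgp_{t}_{version}" for t in types]
    bare = [f"vgp_{t}_{version}" for t in types]
    return prefixed + bare
```

A caller would try each name from `vgp_candidates` in turn and stop at the first one that exists, skipping anything classified as `curated` when extracting assembly dates.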
For detailed fetching code and parsing logic, see qc-data-fetching.md.
| Data Type | Location | Key Notes |
|---|---|---|
| GenomeScope | evaluation/genomescope/ | 3 filename patterns (double/single/no underscore); validate heterozygosity ranges |
| BUSCO | evaluation/busco/{subdir}/ | Dynamic subdir search (c/, p/, c1/, p1/); parse C:XX.X% |
| Merqury | evaluation/merqury/ | Two path layouts (direct vs nested); QV in column 4 |
| Meryl hist | intermediates/meryl/ | Use .hist file only (~700KB), not full database (~10GB) |
| Assembly dates | FASTA filenames | YYYYMMDD stamps; see assembly-date-extraction.md |
| Technology | genomic_data/ subfolders | pacbio_hifi/ -> HiFi, ont/ -> ONT, etc. |
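The parsing notes in the table can be illustrated with a few small helpers. This is a sketch under stated assumptions: the function names are hypothetical, the BUSCO parser only looks for the `C:XX.X%` token, the Merqury parser treats "column 4" as the fourth whitespace-separated field, and the sample strings used with them are made up for illustration.

```python
import re
from datetime import date, datetime
from typing import Optional


def parse_busco_completeness(summary_text: str) -> Optional[float]:
    """Return the percentage from the first 'C:XX.X%' token, or None if absent."""
    match = re.search(r"C:(\d+(?:\.\d+)?)%", summary_text)
    return float(match.group(1)) if match else None


def parse_merqury_qv(line: str) -> Optional[float]:
    """Return the QV from the fourth whitespace-separated field, or None if too short."""
    fields = line.split()
    return float(fields[3]) if len(fields) >= 4 else None


def extract_date_stamp(filename: str) -> Optional[date]:
    """Return the first valid YYYYMMDD stamp found in a filename, or None."""
    # Only standalone 8-digit runs are considered, so longer numbers are ignored.
    for candidate in re.findall(r"(?<!\d)(\d{8})(?!\d)", filename):
        try:
            return datetime.strptime(candidate, "%Y%m%d").date()
        except ValueError:
            continue  # eight digits that do not form a real calendar date
    return None
```

Each helper returns `None` rather than raising on missing data, which makes it easy to record a gap and move on to the next species.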
def normalize_s3_path(s3_path):
    """Normalize path for GenomeArk (case sensitivity!)"""
    if not s3_path:
        return None
    s3_path = s3_path.replace('/assembly_vgp_hic_2.0/', '/assembly_vgp_HiC_2.0/')
    if not s3_path.endswith('/'):
        s3_path += '/'
    return s3_path
- Pattern A: {ToLID}_genomescope__Summary.txt (double underscore, most common)
- Pattern B: {ToLID}_genomescope_Summary.txt (single underscore, easily missed)
- Pattern C: {ToLID}_Summary.txt (no prefix, older assemblies)

Checking only A and B causes ~30-40% of data to be missed.
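A minimal sketch of generating all three filename patterns above, so that none is skipped. The function names are hypothetical, and the HTTPS base URL assumes the same `genomeark.s3.amazonaws.com` form used elsewhere in this guide.

```python
def genomescope_summary_candidates(tolid):
    """Return the three known GenomeScope summary filenames, most common first."""
    return [
        f"{tolid}_genomescope__Summary.txt",  # double underscore
        f"{tolid}_genomescope_Summary.txt",   # single underscore
        f"{tolid}_Summary.txt",               # no prefix
    ]


def genomescope_summary_urls(species, tolid, assembly_dir):
    """Return candidate HTTPS URLs under evaluation/genomescope/ to try in order."""
    base = (
        "https://genomeark.s3.amazonaws.com/species/"
        f"{species}/{tolid}/{assembly_dir}/evaluation/genomescope/"
    )
    return [base + name for name in genomescope_summary_candidates(tolid)]
```

A fetch loop would request each URL in turn and stop at the first one that returns content.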
Reject failed runs where heterozygosity range > 50% or max > 95%. A range of 0%-100% indicates complete model failure.
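The rejection rule can be written as a small predicate. This is a sketch: `is_valid_heterozygosity` is a hypothetical name, and it assumes the minimum and maximum heterozygosity have already been parsed from the summary as percentages.

```python
def is_valid_heterozygosity(het_min, het_max):
    """Return True if a GenomeScope heterozygosity range looks like a successful fit.

    Both arguments are percentages (0-100). A run is rejected when the range
    is wider than 50 points or the maximum exceeds 95.
    """
    if het_min is None or het_max is None:
        return False  # missing values are treated as a failed run
    if het_min > het_max:
        return False  # inverted range, most likely a parsing problem
    return (het_max - het_min) <= 50.0 and het_max <= 95.0
```

With this rule, a range of 0%-100% fails both checks, while a narrow range such as 0.5%-0.8% passes.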
https://genomeark.s3.amazonaws.com/species/{species}/{tolid}/assembly_vgp_standard_1.0/intermediates/meryl/{tolid}.cut.meryl.hist
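The URL above follows a regular shape, so it can be built from its parts. The sketch below assumes that shape holds for other assembly directories as well; the function names are hypothetical, and nothing here checks that the file actually exists.

```python
HTTPS_BASE = "https://genomeark.s3.amazonaws.com/"
S3_PREFIX = "s3://genomeark/"


def meryl_hist_url(species, tolid, assembly_dir):
    """Build the HTTPS URL of the small .hist file (not the full meryl database)."""
    return (
        f"{HTTPS_BASE}species/{species}/{tolid}/{assembly_dir}"
        f"/intermediates/meryl/{tolid}.cut.meryl.hist"
    )


def s3_to_https(s3_path):
    """Convert an s3://genomeark/ path to its public HTTPS equivalent."""
    if not s3_path.startswith(S3_PREFIX):
        raise ValueError(f"not a GenomeArk S3 path: {s3_path}")
    return HTTPS_BASE + s3_path[len(S3_PREFIX):]
```

Because the assembly directory varies between species, `assembly_dir` is passed in rather than hard-coded.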
AWS CLI pattern (prefer over boto3 for public buckets):
import subprocess
# Stream the object at s3_path to stdout ('-') without signing the request
cmd = ['aws', 's3', 'cp', s3_path, '-', '--no-sign-request']
result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
Rate limiting: 0.2s delay between requests.
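A simple way to apply this delay is to sleep between consecutive fetches. In the sketch below, `fetch_all` is a hypothetical helper and `fetch` stands for whatever function retrieves a single path; the `sleep` parameter exists only so the delay can be replaced in tests.

```python
import time


def fetch_all(paths, fetch, delay=0.2, sleep=time.sleep):
    """Fetch each path in order, pausing `delay` seconds between requests."""
    results = {}
    for index, path in enumerate(paths):
        if index > 0:
            sleep(delay)  # no pause is needed before the first request
        results[path] = fetch(path)
    return results
```

The pause sits between requests rather than after each one, so a single fetch incurs no delay at all.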
Common pitfalls: Case sensitivity (hic vs HiC), directory evolution (2022 vs 2024 layouts), downloading full meryl databases instead of .hist files. See best-practices.md for full list.