Process bulk RNA-seq datasets for VEuPathDB resources
Guides bulk RNA-seq dataset curation for VEuPathDB by fetching ENA/BioSample metadata, analyzing samples for experimental factors and strandedness, curating contacts, generating presenter XML, and creating pipeline delivery outputs. Use when processing new bulk RNA-seq datasets for VEuPathDB resources.
/plugin marketplace add VEuPathDB/dataset-curator
/plugin install curation-skills@dataset-curator
This skill inherits all available tools. When active, it can use any tool Claude has access to.
TODO.md
resources/editing-large-xml.md
resources/pdf-extraction.md
resources/step-1-fetch-metadata.md
resources/step-2-analyze-samples.md
resources/step-3-curate-contacts.md
resources/step-4-generate-presenter.md
resources/step-5-generate-outputs.md
resources/valid-projects.json
scripts/check-delivery-dirs.sh
scripts/check-repos.sh
scripts/fetch-miniml.js
scripts/fetch-sra-metadata.js
scripts/generate-analysis-config.js
scripts/generate-presenter-xml.js
scripts/generate-samplesheet.js
This skill guides processing of bulk RNA-seq datasets for VEuPathDB resources.
This workflow requires the following repositories in veupathdb-repos/:
ApiCommonPresenters
EbrcModelCommon
First, run the repository status check to verify repositories are present:
Note: this script is located in the skill directory
bash scripts/check-repos.sh ApiCommonPresenters EbrcModelCommon
If repositories are missing, the script will provide clone instructions.
Branch Confirmation: After verifying repositories exist, check their current branches and status using git -C <path>, then confirm with the user before proceeding.
Example:
git -C veupathdb-repos/ApiCommonPresenters branch --show-current
git -C veupathdb-repos/ApiCommonPresenters status -sb
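The branch check above can be repeated for every required repository. The following is a minimal Node sketch, assuming git is on the PATH and the script runs from the curation workspace directory. The helper names (gitCommands, describeRepo, describeAll) are hypothetical and are not part of the skill's scripts.

```javascript
// Sketch: collect branch and status for each required repository so the
// result can be shown to the user for confirmation.
const { execFileSync } = require('node:child_process');

// Repository names taken from the check-repos.sh invocation above.
const REPOS = ['ApiCommonPresenters', 'EbrcModelCommon'];

// Build the two git invocations from the example. Using -C means the
// working directory never changes.
function gitCommands(repoPath) {
  return [
    ['-C', repoPath, 'branch', '--show-current'],
    ['-C', repoPath, 'status', '-sb'],
  ];
}

// Run both commands for one repository and return their trimmed output.
// Throws if git fails, for example when the repository is missing.
function describeRepo(repoName, reposDir = 'veupathdb-repos') {
  const repoPath = `${reposDir}/${repoName}`;
  const [branch, status] = gitCommands(repoPath).map((args) =>
    execFileSync('git', args, { encoding: 'utf8' }).trim()
  );
  return { repoName, branch, status };
}

// Describe every required repository in one pass.
function describeAll(reposDir = 'veupathdb-repos') {
  return REPOS.map((repoName) => describeRepo(repoName, reposDir));
}
```

Calling describeAll() from the workspace directory returns one record per repository, which can be printed and confirmed with the user before proceeding.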
IMPORTANT: All commands in this workflow must be run from your curation workspace directory (the directory that contains veupathdb-repos/ as a subdirectory).
For Claude Code:
Do not use cd commands to change into subdirectories
Use git -C <path> for git operations in subdirectories
The workflow creates:
tmp/ - Intermediate files (gitignored)
delivery/bulk-rnaseq/<BIOPROJECT>/ - Pipeline outputs (gitignored)
Gather the following before starting:
BioProject accession (e.g., PRJNA1018599)
If a journal article is available for this dataset, providing it enhances the curation workflow.
To include a PDF:
Save it as tmp/<BIOPROJECT>_article.pdf (e.g., tmp/PRJNA1018599_article.pdf)
The PDF will be processed by a subagent once in Step 1, and the extracted data is saved to tmp/<BIOPROJECT>_pdf_extracted.json for use throughout the workflow.
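The file naming convention for the article PDF can be sketched as a small helper that checks the accession and derives both paths. This assumes the accession uses one of the common INSDC prefixes (PRJNA for NCBI, PRJEB for ENA, PRJDB for DDBJ). The skill itself may accept other forms, and the function names here are hypothetical.

```javascript
// Common BioProject accession shape: "PRJ", a two-letter source code
// (NA = NCBI, EB = ENA, DB = DDBJ), then digits.
// Assumption: other prefixes are not needed for this workflow.
const BIOPROJECT_PATTERN = /^PRJ(NA|EB|DB)\d+$/;

// True when the value looks like a BioProject accession.
function isBioProjectAccession(value) {
  return typeof value === 'string' && BIOPROJECT_PATTERN.test(value);
}

// Derive the article PDF path and the extraction output path described above.
function articlePaths(bioproject) {
  if (!isBioProjectAccession(bioproject)) {
    throw new Error(`Not a recognised BioProject accession: ${bioproject}`);
  }
  return {
    articlePdf: `tmp/${bioproject}_article.pdf`,
    pdfExtracted: `tmp/${bioproject}_pdf_extracted.json`,
  };
}
```

Validating the accession before building paths keeps a mistyped identifier from silently producing file names that no later step will find.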
Fetch run-level metadata from ENA and sample attributes from NCBI BioSample. If a journal article PDF is available, extract key information for use in later steps.
Commands:
node scripts/fetch-sra-metadata.js <BIOPROJECT>
Output: tmp/<BIOPROJECT>_sra_metadata.json
Optional - Fetch MINiML for GEO-linked datasets:
node scripts/fetch-miniml.js <BIOPROJECT>
Output: tmp/<GSE>_family.xml (if GEO-linked)
Optional - Extract PDF data:
If tmp/<BIOPROJECT>_article.pdf is present, a subagent will extract it (do not read it yourself).
Output (on success): tmp/<BIOPROJECT>_pdf_extracted.json
Detailed instructions: Step 1 - Fetch Metadata
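Before moving on to Step 2 it can help to confirm that Step 1 produced its files. The sketch below lists the outputs named above and reports any required file that is missing. The MINiML file is left out because its name depends on the GSE accession, which is not known from the BioProject alone. The function names are hypothetical.

```javascript
const fs = require('node:fs');

// Files Step 1 is described as producing.
function step1Outputs(bioproject) {
  return {
    // Always written by fetch-sra-metadata.js.
    required: [`tmp/${bioproject}_sra_metadata.json`],
    // Only present when an article PDF was supplied and extraction succeeded.
    optional: [`tmp/${bioproject}_pdf_extracted.json`],
  };
}

// Return the required outputs that are missing. The `exists` parameter is
// injectable so the check can be exercised without touching the disk.
function missingStep1Outputs(bioproject, exists = fs.existsSync) {
  return step1Outputs(bioproject).required.filter((file) => !exists(file));
}
```

An empty array from missingStep1Outputs means the run-level metadata is in place. The optional list is informational only.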
Claude analyzes the fetched metadata to identify experimental factors and strandedness.
Output: tmp/<BIOPROJECT>_sample_annotations.json
Detailed instructions: Step 2 - Analyze Samples
Identify and curate contact entries from GEO contributors or BioProject submitters.
Actions:
Contacts file: veupathdb-repos/EbrcModelCommon/Model/lib/xml/datasetPresenters/contacts/allContacts.xml
Detailed instructions: Step 3 - Curate Contacts
Generate the datasetPresenter XML, review/edit it, then insert into the presenter file.
Command:
node scripts/generate-presenter-xml.js <BIOPROJECT> <PROJECT> <PRIMARY_CONTACT_ID> [ADDITIONAL_CONTACT_IDS...]
Output: tmp/<BIOPROJECT>_presenter.xml
Workflow:
Target file: veupathdb-repos/ApiCommonPresenters/Model/lib/xml/datasetPresenters/<PROJECT>.xml
Detailed instructions: Step 4 - Generate Presenter
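The argument order for the presenter script and the two file locations involved can be captured in a short sketch. The helper names are hypothetical, and the project name and contact IDs in the usage note are placeholder values, not entries verified against valid-projects.json or allContacts.xml.

```javascript
// Build the argument list for: node scripts/generate-presenter-xml.js ...
// The first three arguments are required; additional contact IDs are optional.
function presenterArgs(bioproject, project, primaryContactId, additionalContactIds = []) {
  const required = { bioproject, project, primaryContactId };
  for (const [name, value] of Object.entries(required)) {
    if (!value) throw new Error(`Missing required argument: ${name}`);
  }
  return [
    'scripts/generate-presenter-xml.js',
    bioproject,
    project,
    primaryContactId,
    ...additionalContactIds,
  ];
}

// Locations named in this step: the generated fragment and the presenter
// file it is inserted into after review.
function presenterPaths(bioproject, project) {
  return {
    generated: `tmp/${bioproject}_presenter.xml`,
    target: `veupathdb-repos/ApiCommonPresenters/Model/lib/xml/datasetPresenters/${project}.xml`,
  };
}
```

For example, presenterArgs('PRJNA1018599', 'PlasmoDB', 'jane.doe', ['john.roe']) yields an array that can be passed to child_process.execFileSync('node', args) from the workspace directory.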
Generate pipeline configuration files for the data processing team.
Commands:
bash scripts/check-delivery-dirs.sh bulk-rnaseq <BIOPROJECT>
node scripts/generate-analysis-config.js <BIOPROJECT> [--strand-specific]
node scripts/generate-samplesheet.js <BIOPROJECT> [strandedness]
The strandedness argument accepts: stranded, unstranded, or auto. If omitted, the script checks _pdf_extracted.json and _sample_annotations.json before falling back to auto.
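The fallback order described above can be sketched as follows. This is not the code in generate-samplesheet.js. It assumes the two JSON files are consulted in the order listed and that each stores its value under a top-level strandedness field, which is a hypothetical field name.

```javascript
const VALID_STRANDEDNESS = ['stranded', 'unstranded', 'auto'];

// Resolve strandedness with this precedence:
//   1. the explicit command-line argument, if given
//   2. the parsed _pdf_extracted.json object
//   3. the parsed _sample_annotations.json object
//   4. 'auto'
function resolveStrandedness(cliValue, pdfExtracted, sampleAnnotations) {
  if (cliValue !== undefined) {
    if (!VALID_STRANDEDNESS.includes(cliValue)) {
      throw new Error(`strandedness must be one of: ${VALID_STRANDEDNESS.join(', ')}`);
    }
    return cliValue;
  }
  for (const source of [pdfExtracted, sampleAnnotations]) {
    // Hypothetical field name; the real files may be structured differently.
    const candidate = source && source.strandedness;
    if (candidate === 'stranded' || candidate === 'unstranded') {
      return candidate;
    }
  }
  return 'auto';
}
```

Only a definite stranded or unstranded value from a file short-circuits the search, so an unknown or absent value in the PDF extraction still lets the sample annotations decide.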
Outputs in delivery/bulk-rnaseq/<BIOPROJECT>/:
analysisConfig.xml - Pipeline configuration
samplesheet.csv - Also for the processing pipeline
Detailed instructions: Step 5 - Generate Outputs
After completing this workflow:
Deliver delivery/bulk-rnaseq/<BIOPROJECT>/ to the data processing team
Scripts:
scripts/fetch-sra-metadata.js - Fetches SRA run metadata from ENA + BioSample attributes from NCBI
scripts/fetch-miniml.js - Fetches MINiML XML for GEO-linked datasets
scripts/generate-presenter-xml.js - Generates RNA-seq datasetPresenter XML
scripts/generate-analysis-config.js - Generates analysisConfig.xml for pipeline
scripts/generate-samplesheet.js - Generates/delivers samplesheet.csv and sampleAnnotations.json
scripts/check-repos.sh - Validates veupathdb-repos/ repository setup (synced from shared/)
scripts/check-delivery-dirs.sh - Creates delivery directory structure (synced from shared/)