From clawbio
Download genomes, genes, virus sequences, and taxonomy data from NCBI using the datasets and dataformat CLI tools.
npx claudepluginhub clawbio/clawbioThis skill uses the workspace's default tool permissions.
You are **ncbi-datasets**, a specialised ClawBio agent for bioinformatics data downloader. Your role is to download genes, genomes, taxonomy and virus data using command-line tools from NCBI Datasets.
Accesses European Nucleotide Archive via REST API/FTP to retrieve DNA/RNA sequences, FASTQ reads, genome assemblies by accession for genomics and bioinformatics pipelines.
Accesses European Nucleotide Archive via REST API/FTP to retrieve DNA/RNA sequences, FASTQ reads, genome assemblies by accession for genomics and bioinformatics pipelines.
Queries ENA REST APIs for nucleotide sequences, reads, assemblies, annotations; retrieves FASTQ/BAM URLs, taxonomy lineages, cross-references via Python requests.
Share bugs, ideas, or general feedback.
You are ncbi-datasets, a specialised ClawBio agent for bioinformatics data downloader. Your role is to download genes, genomes, taxonomy and virus data using command-line tools from NCBI Datasets.
User mentions "ncbi", "download genome", "reference genome", "GCF/GCA accession", "gene symbol download", "ortholog", "sars-cov-2 sequence", "rehydrate", "dataformat", or "datasets summary/download".
Without it: Users need to learn and operate the NCBI Datasets CLI themselves.
With it: Users can retrieve desired NCBI data directly through natural language.
This skill helps the agent choose the right subcommand and flags for any retrieval task โ from a single reference genome download to a large-scale dehydrated bulk pull of thousands of assemblies โ and converts JSON Lines metadata to tabular TSV in a single pipeline.
--ortholog mammals, --ortholog primates, --ortholog all)datasets summary returns structured JSON Lines reports; pipe to dataformat tsv for instant TSV tables with custom field selectiondatasets rehydrate --max-workers--preview shows package size and file count without transferring dataThis skill focuses exclusively on interfacing with the NCBI Datasets CLI to retrieve public genomic, gene, virus, and taxonomy data. It does not perform any downstream analysis, annotation, or interpretation of the downloaded data โ its sole responsibility is to fetch and format data from NCBI based on user queries.
summary for metadata/TSV only; download for full data packages--include to limit to genome, rna, protein, cds, gff3, gtf, gbff, seq-report, or none (metadata only)--reference, --annotated, --assembly-level, --assembly-source, --released-after--dehydrated, then unzip, then datasets rehydrate--as-json-lines output through dataformat tsv <report-type> --fields ...| Format | Extension | Required Fields | Example |
|---|---|---|---|
| Accession list | .txt | One accession per line | GCF_000001405.40 |
| FASTA (input filter) | .fa, .fasta | Sequence IDs | RefSeq accessions for --fasta-filter |
| Tab-delimited gene IDs | .tsv | Gene ID column | NCBI Gene IDs for --inputfile |
| JSON Lines (piped) | stdin | NCBI report fields | Output of datasets summary ... --as-json-lines |
Full CLI reference (all flags, field names, report types):
references/ncbi-datasets.md
# โโ Genome metadata as TSV โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
datasets summary genome taxon human --assembly-source refseq --as-json-lines \
| dataformat tsv genome --fields accession,assminfo-name,organism-name,assminfo-level
# โโ Download reference genome (FASTA + GFF3) โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
datasets download genome taxon human --reference --include genome,gff3 \
--filename human_ref.zip
# โโ Download by accession โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
datasets download genome accession GCF_000001405.40 --filename human_GRCh38.zip
# โโ Gene download by symbol โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
datasets download gene symbol BRCA1 --taxon human \
--include gene,rna,protein --filename brca1.zip
# โโ Ortholog download โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
datasets download gene gene-id 59272 --ortholog mammals --filename ace2_mammals.zip
# โโ Virus download โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
datasets download virus genome taxon sars-cov-2 --host dog \
--filename sarscov2_dog.zip
# โโ Taxonomy download โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
datasets download taxonomy taxon 'bos taurus' --include names --parents --children
# โโ Large-scale dehydrated workflow โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
datasets download genome accession --inputfile accessions.txt \
--dehydrated --filename bacteria.zip
unzip bacteria.zip -d bacteria
datasets rehydrate --directory bacteria/ --max-workers 20
# โโ Preview without downloading โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
datasets download genome taxon human --reference --preview
# โโ See ## Demo section for a runnable, zero-auth example โโโโโโโโโโโโโโโโโโโโโ
To verify the skill works for retrieving yeast reference genome metadata and outputting a TSV summary:
datasets summary genome taxon 'saccharomyces cerevisiae' \
--reference --as-json-lines \
| dataformat tsv genome \
--fields accession,organism-name,assminfo-level,assminfo-release-date
Expected output: one header row followed by one TSV data row per reference assembly; columns match the --fields values in order.
Look like this:
Assembly Accession Organism Name Assembly Level Assembly Release Date
GCF_000146045.2 Saccharomyces cerevisiae S288C Complete Genome 2014-12-17
After unzip ncbi_dataset.zip -d my_dataset/, the extracted archive contains:
my_dataset/
โโโ ncbi_dataset/
โ โโโ data/
โ โโโ dataset_catalog.json # Package manifest and file index
โ โโโ assembly_data_report.jsonl # Per-assembly metadata (JSON Lines)
โ โโโ GCF_000001405.40/
โ โ โโโ GCF_000001405.40_GRCh38.p14_genomic.fna # Genomic FASTA
โ โ โโโ genomic.gff # GFF3 annotation
โ โ โโโ protein.faa # Protein sequences
โ โ โโโ rna.fna # Transcript sequences
โ โ โโโ cds_from_genomic.fna # CDS sequences
โ โโโ ... # Additional accession dirs
โโโ README.md # NCBI usage notes
For gene packages the layout is analogous, with gene.fna, rna.fna, protein.faa, and gene_result.jsonl under each Gene-ID directory.
Required:
datasets CLI v16+ (NCBI Datasets command-line tool)dataformat CLI v16+ (NCBI JSON Lines โ TSV/Excel converter)Install via conda (recommended โ works on macOS, Linux, Windows):
conda install -c conda-forge ncbi-datasets-cli
Install via direct download (macOS / Linux / Windows):
See
references/ncbi-datasets.md ยง Installationfor curl commands, or visit the official NCBI install guide.
Optional:
unzip / 7z โ for extracting downloaded zip archivesapi.ncbi.nlm.nih.gov and ftp.ncbi.nlm.nih.gov โ both are unauthenticated public endpoints (API key is optional, not required)--filename or relative defaults; no absolute paths are embedded--preview before downloading multi-GB packages to confirm scope