From sciagent-skills
Accesses NCBI GEO via GEOparse and E-utilities. Searches datasets by keyword/organism/platform, downloads GSE matrices, parses GPL annotations/GSM metadata, loads expression data into pandas.
npx claudepluginhub jaechang-hits/sciagent-skills --plugin sciagent-skills
This skill uses the workspace's default tool permissions.
GEO (Gene Expression Omnibus) is NCBI's public repository for high-throughput functional genomics data, containing 200,000+ datasets (series) from microarrays, RNA-seq, ChIP-seq, methylation, and proteomics experiments. GEOparse provides a Python interface for downloading and parsing GEO records (GSE series, GPL platforms, GSM samples) while NCBI E-utilities enables programmatic search across GEO's metadata.
cellxgene-census; for aligned reads, download FASTQ from ENA/SRA instead
GEOparse, requests, pandas
pip install GEOparse requests pandas
import GEOparse
# Download a GEO series (cached in destdir, here ./geo_data/)
gse = GEOparse.get_GEO("GSE2553", destdir="./geo_data/")
print(f"Title: {gse.metadata['title'][0]}")
print(f"Samples: {len(gse.gsms)}")
print(f"Platform: {list(gse.gpls.keys())}")
# Sample metadata
meta = gse.phenotype_data
print(meta.head())
Find GEO series (GSE) by keyword, organism, or dataset type.
import requests
EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
def geo_search(query, retmax=20):
    r = requests.get(f"{BASE}/esearch.fcgi",
                     params={"db": "gds", "term": query,
                             "retmax": retmax, "retmode": "json", "email": EMAIL})
    r.raise_for_status()
    return r.json()["esearchresult"]
# Search for human breast cancer series (GSE records)
result = geo_search(
    "breast cancer[title] AND Homo sapiens[organism] AND gse[entry type]",
    retmax=10
)
print(f"Found {result['count']} matching GEO datasets")
# ESearch returns internal UIDs, not GSE accessions; resolve them with ESummary
print(f"First UIDs: {result['idlist']}")
# Search for a specific platform (e.g., Illumina HumanHT-12)
result = geo_search(
    "Illumina HumanHT-12[platform] AND Homo sapiens[organism] AND gse[entry type]",
    retmax=5
)
print(f"Illumina HumanHT-12 human datasets: {result['count']}")
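The query strings above combine field tags with AND. As a minimal sketch, a small helper can assemble such a string from optional filters. The name `build_geo_query` and its parameters are hypothetical (not part of GEOparse or E-utilities); only the field tags `[title]`, `[organism]`, `[platform]`, and `[entry type]` come from the examples above.

```python
def build_geo_query(title=None, organism=None, platform=None, entry_type="gse"):
    """Join optional filters into an Entrez query string for db=gds.

    Hypothetical helper: it only formats text and does not contact NCBI.
    """
    clauses = []
    if title:
        clauses.append(f"{title}[title]")
    if organism:
        clauses.append(f"{organism}[organism]")
    if platform:
        clauses.append(f"{platform}[platform]")
    if entry_type:
        clauses.append(f"{entry_type}[entry type]")
    return " AND ".join(clauses)


query = build_geo_query(title="breast cancer", organism="Homo sapiens")
print(query)
```

The resulting string can be passed as the `term` parameter of an ESearch request.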
Retrieve title, accession, and organism for search results.
import requests
EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
def geo_summary(uids):
    r = requests.post(f"{BASE}/esummary.fcgi",
                      data={"db": "gds", "id": ",".join(uids),
                            "retmode": "json", "email": EMAIL})
    r.raise_for_status()
    return r.json()["result"]
# Search for UIDs, then fetch their summaries
uids = requests.get(
f"{BASE}/esearch.fcgi",
params={"db": "gds", "term": "lung cancer[title] AND gse[entry type]",
"retmax": 3, "retmode": "json", "email": EMAIL}
).json()["esearchresult"]["idlist"]
summaries = geo_summary(uids)
for uid in summaries.get("uids", []):
    s = summaries[uid]
    print(f"\nAccession: {s.get('accession')} | {s.get('title')}")
    print(f" Organism: {s.get('taxon')}")
    print(f" Samples: {s.get('n_samples')}")
    print(f" Type: {s.get('gdstype')}")
Use GEOparse to download a full GSE record with expression matrix and sample metadata.
import GEOparse
# Download GSE (auto-caches; skip download if already present)
gse = GEOparse.get_GEO("GSE2553", destdir="./geo_data/", silent=True)
# Series metadata
print(f"Title : {gse.metadata['title'][0]}")
print(f"Summary : {gse.metadata['summary'][0][:200]}...")
print(f"Samples : {len(gse.gsms)} GSMs")
print(f"Platforms: {list(gse.gpls.keys())}")
# Sample metadata table (phenotype data)
meta = gse.phenotype_data
print(f"Metadata columns: {list(meta.columns)}")
print(meta.head())
Parse probe-level expression data and optionally merge with platform gene annotations.
import GEOparse, pandas as pd
gse = GEOparse.get_GEO("GSE2553", destdir="./geo_data/", silent=True)
# Pivot to an expression matrix (probes × samples); the index defaults to "ID_REF"
gpl_id = list(gse.gpls.keys())[0]
pivot = gse.pivot_samples("VALUE")
print(f"Expression matrix shape: {pivot.shape}") # (probes, samples)
print(pivot.iloc[:5, :3])
# Annotate probes with gene symbols from the GPL platform
gpl = gse.gpls[gpl_id]
annot = gpl.table[["ID", "Gene Symbol", "Gene Title"]].copy()
annot.columns = ["ID", "gene_symbol", "gene_title"]
annot = annot.dropna(subset=["gene_symbol"])
annot = annot[annot["gene_symbol"] != ""]
expr_annotated = pivot.join(annot.set_index("ID"), how="inner")
print(f"Annotated expression matrix: {expr_annotated.shape}")
print(expr_annotated[["gene_symbol", "gene_title"]].head())
Retrieve expression values and metadata for a single sample.
import GEOparse
gsm = GEOparse.get_GEO("GSM45553", destdir="./geo_data/", silent=True)
print(f"Title : {gsm.metadata['title'][0]}")
print(f"Source : {gsm.metadata.get('source_name_ch1', ['n/a'])[0]}")
print(f"Organism: {gsm.metadata.get('organism_ch1', ['n/a'])[0]}")
print(f"Data rows: {len(gsm.table)}")
print(gsm.table.head())
For large datasets, download the series matrix file directly from GEO FTP.
import urllib.request, gzip, io, pandas as pd
# GEO series matrix file URL pattern
accession = "GSE2553"
series_num = accession[3:] # strip "GSE"
# GEO groups series into folders by thousands: GSE2553 -> GSE2nnn, GSE12 -> GSEnnn
folder = f"GSE{series_num[:-3]}nnn"
url = f"https://ftp.ncbi.nlm.nih.gov/geo/series/{folder}/{accession}/matrix/{accession}_series_matrix.txt.gz"
with urllib.request.urlopen(url) as resp:
    with gzip.open(resp, "rt", encoding="utf-8") as f:
        lines = f.readlines()
# Metadata lines start with "!"; the data table begins at the "ID_REF" header row
meta_lines = [l for l in lines if l.startswith("!")]
data_start = next(i for i, l in enumerate(lines) if l.startswith('"ID_REF"'))
# Drop the trailing "!series_matrix_table_end" marker so it is not parsed as a data row
data_lines = [l for l in lines[data_start:] if not l.startswith("!")]
df = pd.read_csv(
    io.StringIO("".join(data_lines)),
    sep="\t", index_col=0
)
print(f"Matrix shape: {df.shape}")
print(df.iloc[:3, :3])
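The folder rule used in the URL pattern can be isolated and checked without any network access. The sketch below assumes the layout shown above, where the last three digits of the series number are replaced by "nnn". The function name `series_matrix_url` is hypothetical. Series that span several platforms may use per-platform file names, which this sketch does not handle.

```python
import re

FTP_ROOT = "https://ftp.ncbi.nlm.nih.gov/geo/series"


def series_matrix_url(accession):
    """Build the series matrix URL for a GSE accession (hypothetical helper)."""
    match = re.fullmatch(r"GSE(\d+)", accession)
    if match is None:
        raise ValueError(f"Not a GSE accession: {accession!r}")
    digits = match.group(1)
    # Replace the last three digits with "nnn"; short numbers collapse to "GSEnnn"
    folder = f"GSE{digits[:-3]}nnn"
    return f"{FTP_ROOT}/{folder}/{accession}/matrix/{accession}_series_matrix.txt.gz"


print(series_matrix_url("GSE2553"))
```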
Multi-assay or multi-batch submissions (e.g., RNA-seq + ATAC-seq) are organized as a SuperSeries GSE that references one or more SubSeries GSEs. Each SubSeries holds its own samples, platform, and matrix; the SuperSeries itself has no samples of its own. Both are tagged in gse.metadata:
- On a SuperSeries record, gse.metadata["relation"] contains entries like "SuperSeries of: GSExxxx"
- On a SubSeries record, gse.metadata["relation"] contains "SubSeries of: GSEyyyy"

Always resolve SubSeries before pulling an expression matrix: downloading the SuperSeries alone yields metadata but no data.
import GEOparse
gse = GEOparse.get_GEO("GSE47966", destdir="./geo_data/", silent=True) # a SuperSeries
relations = gse.metadata.get("relation", [])
subseries = [r.split(": ")[1] for r in relations if r.startswith("SuperSeries of")]
print(f"SubSeries to download: {subseries}")
for acc in subseries:
    sub = GEOparse.get_GEO(acc, destdir="./geo_data/", silent=True)
    print(f" {acc}: {len(sub.gsms)} samples, platforms={list(sub.gpls.keys())}")
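The relation parsing above can also be written as a small standalone function that classifies both directions and ignores any other relation entries. This is a sketch only: `split_relations` is a hypothetical name, and the accessions in the example are placeholders rather than real GEO records.

```python
def split_relations(relations):
    """Sort relation strings into SuperSeries and SubSeries targets.

    Expects strings shaped like "SuperSeries of: GSExxxx"; entries with
    any other label are ignored.
    """
    parsed = {"superseries_of": [], "subseries_of": []}
    for entry in relations:
        label, _, target = entry.partition(":")
        label = label.strip().lower()
        target = target.strip()
        if label == "superseries of":
            parsed["superseries_of"].append(target)
        elif label == "subseries of":
            parsed["subseries_of"].append(target)
    return parsed


# Placeholder relation entries for illustration only
example = ["SuperSeries of: GSE00001", "SuperSeries of: GSE00002", "Other: ignored"]
print(split_relations(example))
```

A non-empty `superseries_of` list indicates that the record is a SuperSeries and that each listed accession should be downloaded separately.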
GEOparse downloads SOFT-format files (plain text). For XML-based access, GEO also distributes records in MINiML format. Series Matrix files (tab-delimited) are the most compact format for expression data.
Goal: Download a GEO dataset, extract the expression matrix and group labels, and save for downstream differential expression analysis.
import GEOparse, pandas as pd
# Download series
gse = GEOparse.get_GEO("GSE2553", destdir="./geo_data/", silent=True)
# 1. Extract expression matrix
gpl_id = list(gse.gpls.keys())[0]
expr = gse.pivot_samples("VALUE")  # index defaults to "ID_REF" (probe IDs)
# 2. Extract sample groups from characteristics
meta = gse.phenotype_data
print("Available metadata columns:", list(meta.columns))
# 3. Annotate probes with gene symbols
gpl = gse.gpls[gpl_id]
gene_col = "Gene Symbol" if "Gene Symbol" in gpl.table.columns else gpl.table.columns[1]
annot = gpl.table[["ID", gene_col]].dropna()
annot.columns = ["probe_id", "gene_symbol"]
annot = annot[annot["gene_symbol"].str.strip() != ""]
expr_genes = expr.join(annot.set_index("probe_id")[["gene_symbol"]], how="inner")
expr_genes = expr_genes.groupby("gene_symbol").mean() # average duplicate probes
print(f"Genes × Samples: {expr_genes.shape}")
expr_genes.to_csv("expression_matrix.csv")
meta.to_csv("sample_metadata.csv")
print("Saved: expression_matrix.csv, sample_metadata.csv")
Goal: Search GEO for studies matching a topic and build a curated inventory CSV.
import requests, time, pandas as pd
EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
topic = "Alzheimer disease"
r = requests.get(f"{BASE}/esearch.fcgi",
params={"db": "gds", "email": EMAIL, "retmode": "json", "retmax": 50,
"term": f"{topic}[title] AND Homo sapiens[organism] AND gse[entry type]"})
uids = r.json()["esearchresult"]["idlist"]
print(f"Found {len(uids)} GSE datasets for '{topic}'")
rows = []
for i in range(0, len(uids), 20):
    batch = uids[i:i+20]
    r2 = requests.post(f"{BASE}/esummary.fcgi",
                       data={"db": "gds", "id": ",".join(batch),
                             "retmode": "json", "email": EMAIL})
    result = r2.json()["result"]
    for uid in result.get("uids", []):
        s = result[uid]
        rows.append({
            "accession": s.get("accession"),
            "title": s.get("title"),
            "n_samples": s.get("n_samples"),
            "organism": s.get("taxon"),
            "gds_type": s.get("gdstype"),
            "pub_date": s.get("pdat"),
        })
    time.sleep(0.4)  # pause between batches to respect NCBI rate limits
df = pd.DataFrame(rows).sort_values("n_samples", ascending=False)
df.to_csv(f"{topic.replace(' ', '_')}_geo_datasets.csv", index=False)
print(df[["accession", "title", "n_samples"]].head(10).to_string(index=False))
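The workflow above requests summaries in groups of 20 UIDs. As a minimal sketch, that slicing can be factored into a generator so the batch size is set in one place. The name `batches` is hypothetical, and the UIDs below are placeholders rather than real GEO identifiers.

```python
def batches(items, size=20):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder UIDs for illustration only
uids = [str(n) for n in range(1, 46)]
sizes = [len(batch) for batch in batches(uids)]
print(sizes)  # [20, 20, 5]
```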
| Parameter | Module | Default | Range / Options | Effect |
|---|---|---|---|---|
| destdir | GEOparse.get_GEO | "./" | any directory path | Where to save downloaded files |
| silent | GEOparse.get_GEO | False | True/False | Suppress download progress output |
| retmax | ESearch | 20 | 1–10000 | Max dataset records returned |
| entry type query | ESearch | n/a | "gse", "gds", "gpl", "gsm" | Filter by GEO record type |
| VALUE column | pivot_samples | n/a | column name in GSM table | Expression value column to pivot |
| email | E-utilities | required | valid email | NCBI rate-limit attribution |
- Use silent=True in GEOparse: Suppresses verbose download progress; add your own print statement to confirm download.
- Cache downloads: GEOparse skips re-downloading if the .soft.gz file already exists in destdir. Set a shared destdir across sessions to avoid redundant downloads.
- Prefer Series Matrix for large datasets: For series with 100+ samples, download the _series_matrix.txt.gz directly from FTP rather than parsing individual GSM SOFT files; it is usually much faster.
- Handle probe-to-gene mapping carefully: Many probes map to multiple genes or no gene. Decide how to handle multi-gene probes (drop, split, or keep) before analysis. Use gene_symbol.str.split(" /// ") for Affymetrix arrays.
- Check platform column names: GPL annotation table column names vary by platform (e.g., "Gene Symbol" vs "GENE_SYMBOL" vs "gene_id"). Always inspect gpl.table.columns before assuming field names.
- Always resolve SubSeries before analysis: After loading any GSE, inspect gse.metadata.get("relation", []) for "SuperSeries of: ..." entries. If present, iterate every referenced SubSeries accession and download each one, because the SuperSeries record itself carries no samples or expression matrices. Skipping this step silently drops the actual data.
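The three options for multi-gene probes (drop, split, or keep) can be made explicit in code. The sketch below works on a single annotation string and assumes the " /// " separator mentioned above for Affymetrix arrays. The function name `resolve_gene_symbol` and the `policy` values are hypothetical, and the gene names are placeholders.

```python
def resolve_gene_symbol(raw, policy="drop", sep=" /// "):
    """Return the gene symbols to keep for one probe annotation.

    policy="drop"  -> discard probes that map to several genes
    policy="split" -> return every gene separately
    policy="keep"  -> keep the combined label as a single entry
    """
    # Treat None and NaN (which is not equal to itself) as "no annotation"
    if raw is None or raw != raw:
        return []
    parts = [p.strip() for p in str(raw).split(sep) if p.strip()]
    if len(parts) <= 1:
        return parts
    if policy == "drop":
        return []
    if policy == "split":
        return parts
    if policy == "keep":
        return [sep.join(parts)]
    raise ValueError(f"Unknown policy: {policy!r}")


print(resolve_gene_symbol("GENE1 /// GENE2", policy="split"))
```

Applied to an annotation column, for example with `Series.map`, this gives one list per probe that can then be expanded into rows or filtered.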
When to use: Get series title, sample count, and platform for any GSE accession.
import GEOparse
gse = GEOparse.get_GEO("GSE2553", destdir="./geo_data/", silent=True)
print(f"Title : {gse.metadata['title'][0]}")
print(f"Samples: {len(gse.gsms)}")
print(f"Platform: {list(gse.gpls.keys())}")
print(f"Summary: {gse.metadata['summary'][0][:300]}")
When to use: Parse GEO sample characteristics into a tidy DataFrame for grouping.
import GEOparse, pandas as pd, re
gse = GEOparse.get_GEO("GSE2553", destdir="./geo_data/", silent=True)
meta = gse.phenotype_data
# Parse "characteristics_ch1" columns
ch_cols = [c for c in meta.columns if "characteristics" in c.lower()]
print(f"Characteristic columns: {ch_cols}")
print(meta[ch_cols].head())
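Sample characteristics are commonly submitted as "key: value" strings. Assuming that convention holds, a raw list of such strings (for example from a sample's `characteristics_ch1` metadata) can be turned into a dictionary as sketched below. The function name `parse_characteristics` is hypothetical, and the example values are illustrative rather than taken from GSE2553.

```python
def parse_characteristics(entries):
    """Split "key: value" strings into a dict.

    Entries without a colon are kept under a positional key so that
    no information is silently lost.
    """
    parsed = {}
    for position, entry in enumerate(entries):
        key, sep, value = entry.partition(":")
        if sep:
            parsed[key.strip()] = value.strip()
        else:
            parsed[f"characteristic_{position}"] = entry.strip()
    return parsed


# Illustrative entries only
example = ["tissue: liver", "age: 45 years", "treated"]
print(parse_characteristics(example))
```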
When to use: Enumerate sample accessions for download or metadata collection.
import GEOparse
gse = GEOparse.get_GEO("GSE2553", destdir="./geo_data/", silent=True)
gsm_ids = list(gse.gsms.keys())
print(f"Total samples: {len(gsm_ids)}")
print("First 5:", gsm_ids[:5])
| Problem | Cause | Solution |
|---|---|---|
| FileNotFoundError during download | Incorrect destdir | Create directory first: os.makedirs("geo_data/", exist_ok=True) |
| pivot_samples returns empty DataFrame | GPL annotation table missing ID | Check gpl.table.columns; use correct probe ID column name |
| KeyError for "Gene Symbol" | Platform uses different column name | Inspect gpl.table.columns and use the correct annotation column |
| Download hangs for large series | Large SOFT file (GB range) | Use FTP Series Matrix download instead of GEOparse for large series |
| ESearch returns 0 results | Wrong entry type or field tag | Switch gse[entry type] to gds[entry type]; verify query syntax |
| Numeric sample columns contain null | Missing/absent expression values | Fill with df.fillna(0) or drop columns with high missingness |
| GSE has no samples / empty gse.gsms | Accession is a SuperSeries | Parse gse.metadata["relation"] for SuperSeries of: entries and download each SubSeries |
- cellxgene-census: Single-cell RNA-seq data at scale (61M+ cells) as an alternative to GEO for scRNA-seq
- gene-database: NCBI Gene records with curated annotations for genes found in GEO studies
- pubmed-database: Retrieve publications linked to GEO datasets via NCBI ELink
- pydeseq2-differential-expression: Downstream differential expression analysis after loading GEO count data

gds database