From tooluniverse
Find and evaluate research datasets for any scientific question. Maps questions to study designs and searches 30+ repositories including GEO, UK Biobank, and NHANES.
npx claudepluginhub mims-harvard/tooluniverse --plugin tooluniverse
How this skill is triggered (by the user, by Claude, or both):
Slash command
/tooluniverse:tooluniverse-dataset-discovery
The summary Claude sees in its skill listing, used to decide when to auto-load this skill:
- User asks "find me data about X" or "where can I get data on Y"
Before searching, determine the minimum data requirements:
Study design needed:
Variables needed:
Population needed:
Search from broadest to most specific. Use find_tools to discover available dataset search tools — don't rely on memorized tool names.
Layer 1 — Cross-repository search (cast wide net): Search tools that index datasets across thousands of repositories. These find datasets you didn't know existed.
Layer 2 — Domain-specific repositories: Search repositories specialized for your data type.
Layer 3 — Literature-based discovery: Many datasets aren't in any repository — they're described in paper methods sections.
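The three layers above can be sketched as an ordered loop that records which layer first surfaced each dataset. This is only an illustration: the searcher functions and the record fields (`id`, `title`) are hypothetical stand-ins for whatever tools find_tools returns, not real ToolUniverse tool names.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical signature: a searcher takes a query and yields dataset records.
Searcher = Callable[[str], Iterable[Dict[str, str]]]


def layered_search(query: str, layers: List[Tuple[str, List[Searcher]]]) -> List[Dict[str, str]]:
    """Run layers broadest-first and keep the first hit for each dataset ID."""
    seen = set()
    results = []
    for layer_name, searchers in layers:
        for search in searchers:
            for record in search(query):
                key = record["id"].strip().lower()
                if key in seen:
                    continue  # already found by a broader layer
                seen.add(key)
                results.append({**record, "layer": layer_name})
    return results


# Stub searchers standing in for real tools (assumed, for illustration only).
def cross_repo(query: str):
    return [{"id": "DS-1", "title": f"{query} cohort"}]


def domain_repo(query: str):
    return [{"id": "ds-1", "title": "duplicate"}, {"id": "DS-2", "title": f"{query} survey"}]


def literature(query: str):
    return [{"id": "DS-3", "title": f"{query} data described in a methods section"}]


hits = layered_search(
    "sleep duration",
    [
        ("cross-repository", [cross_repo]),
        ("domain-specific", [domain_repo]),
        ("literature", [literature]),
    ],
)
for hit in hits:
    print(hit["layer"], hit["id"], hit["title"])
```

Tagging each hit with its layer makes it easy to see later whether a dataset came from an index, a specialized repository, or a paper.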
For each candidate dataset, assess these dimensions:
Variables:
Design match:
Sample:
Access:
Quality:
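Part of this assessment can be made mechanical. The sketch below screens one candidate against the minimum requirements on four of the five dimensions and leaves quality to manual review. The class names, field names, and example values are all hypothetical; the variable list should come from the dataset's actual codebook, not from memory.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class Requirements:
    design: str          # e.g. "cohort", "cross-sectional"
    variables: Set[str]  # variables the question cannot do without
    min_n: int           # smallest usable sample size


@dataclass
class Candidate:
    name: str
    design: str
    codebook_variables: Set[str]  # copied from the dataset's codebook
    n: int
    access: str  # "public", "application", or "unknown"


def screen(candidate: Candidate, req: Requirements) -> Dict[str, object]:
    """Return a per-dimension screening result for one candidate dataset."""
    missing = sorted(req.variables - candidate.codebook_variables)
    return {
        "dataset": candidate.name,
        "missing_variables": missing,
        "design_match": candidate.design == req.design,
        "sample_ok": candidate.n >= req.min_n,
        "access_verified": candidate.access != "unknown",
    }


# Illustrative values only; not a real dataset.
req = Requirements(design="cohort", variables={"age", "sex", "exposure", "outcome"}, min_n=500)
cand = Candidate(
    name="example-cohort",
    design="cohort",
    codebook_variables={"age", "sex", "outcome", "bmi"},
    n=1200,
    access="unknown",
)
result = screen(cand, req)
print(result)
```

A non-empty `missing_variables` list or an unverified access status is a reason to go back to the codebook or the data-use terms before investing in a download.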
Don't stop at finding datasets — download and analyze them. Write and run Python code via Bash. Never describe what you "would do" — execute it.
Choose the loader that matches your data source. When unsure of the format, download a small sample first and inspect.
import io

import pandas as pd
import requests

# --- Tabular files (most common): use the one line that matches your file ---
df = pd.read_csv("data.csv")                          # CSV / TSV (use sep="\t" for TSV)
df = pd.read_excel("data.xlsx")                       # Excel
df = pd.read_stata("data.dta")                        # Stata
df = pd.read_sas("data.xpt", format="xport")          # SAS transport (XPT)
df = pd.read_sas("data.sas7bdat", format="sas7bdat")  # SAS native
df = pd.read_parquet("data.parquet")                  # Parquet
df = pd.read_json("data.json")                        # JSON (records or columnar)
df = pd.read_fwf("data.dat")                          # Fixed-width (some legacy surveys)

# --- Download from URL first, then parse ---
# `url` is the direct download link for the file you selected.
resp = requests.get(url, timeout=120)
resp.raise_for_status()  # stop on 4xx/5xx instead of parsing an error page
content = resp.content

# Detect format from the URL extension (case-insensitive, query string ignored)
name = url.lower().split("?")[0]
# A BytesIO buffer has no filename, so pandas cannot infer gzip; request it explicitly
compression = "gzip" if name.endswith(".gz") else None
if name.endswith(".xpt"):
    df = pd.read_sas(io.BytesIO(content), format="xport")
elif name.endswith((".csv", ".csv.gz")):
    df = pd.read_csv(io.BytesIO(content), compression=compression)
elif name.endswith((".tsv", ".tsv.gz")):
    df = pd.read_csv(io.BytesIO(content), sep="\t", compression=compression)
elif name.endswith(".json"):
    df = pd.read_json(io.BytesIO(content))
else:
    # Unknown extension: try CSV first, then inspect df.head()
    df = pd.read_csv(io.BytesIO(content))

# --- REST API pagination (common for GDC, ClinicalTrials.gov, etc.) ---
# `api_url` is the endpoint; parameter names and the "data" key vary by API.
all_records = []
offset = 0
while True:
    resp = requests.get(api_url, params={"offset": offset, "limit": 100}, timeout=30)
    resp.raise_for_status()
    batch = resp.json().get("data", [])
    if not batch:
        break  # an empty page means every record has been fetched
    all_records.extend(batch)
    offset += len(batch)
df = pd.DataFrame(all_records)
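When the URL gives no usable extension, the first bytes of a small sample usually identify the format. The helper below is a minimal sketch using only well-known file signatures; `sniff_format` is a hypothetical name, and the CSV/TSV guess is a heuristic based on the first line, not a guarantee.

```python
def sniff_format(head: bytes) -> str:
    """Guess a file format from its first bytes (a heuristic, not a parser)."""
    if head.startswith(b"\x1f\x8b"):
        return "gzip"  # decompress, then sniff the inner content again
    if head.startswith(b"PAR1"):
        return "parquet"
    if head.startswith(b"PK\x03\x04"):
        return "zip"  # .xlsx files are zip archives
    if head.startswith(b"HEADER RECORD*******LIBRARY HEADER RECORD"):
        return "xport"  # SAS transport (XPT)
    if head.lstrip()[:1] in (b"{", b"["):
        return "json"
    first_line = head.split(b"\n", 1)[0]
    if b"\t" in first_line:
        return "tsv"
    if b"," in first_line:
        return "csv"
    return "unknown"


# Small in-memory samples; in practice `head` would be the first few hundred
# bytes of the downloaded file.
print(sniff_format(b"id,age,sex\n1,64,F\n"))
print(sniff_format(b"id\tage\tsex\n1\t64\tF\n"))
print(sniff_format(b'  [{"id": 1}]'))
```

Some servers honor an HTTP `Range` header (for example `bytes=0-255`), which allows fetching only the head of a large file; others ignore it and return the full body, so treat that as an optimization rather than something to rely on.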
# Merge multiple files on participant/sample ID
merged = df1.merge(df2, on="id_col", how="inner")
# Filter population
subset = merged[(merged["age"] >= 60) & (merged["age"] <= 80)].copy()
# Handle missing values
missing_pct = subset.isnull().mean() * 100
print("Missing % per variable:\n", missing_pct[missing_pct > 0].sort_values(ascending=False))
# Column names are placeholders; they must match the names used in the model formula
subset = subset.dropna(subset=["exposure", "outcome"])
# Quick regression
import statsmodels.formula.api as smf
model = smf.ols("outcome ~ exposure + age + sex", data=subset).fit()
print(model.summary())
# Visualization
import matplotlib.pyplot as plt
plt.scatter(subset["exposure"], subset["outcome"], alpha=0.3)
plt.xlabel("Exposure"); plt.ylabel("Outcome")
plt.savefig("/tmp/scatter.png", dpi=150, bbox_inches="tight")
Always run the code and report actual numbers (β, p-value, CI, N).
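One way to pull those four numbers out of a fitted statsmodels model is sketched below. `report_term` is a hypothetical helper, and the data are synthetic, generated only so the example runs end to end; they say nothing about any real dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def report_term(model, term: str) -> dict:
    """Collect beta, 95% CI, p-value, and N for one term of a fitted model."""
    ci_low, ci_high = model.conf_int().loc[term]  # default alpha=0.05
    return {
        "term": term,
        "beta": float(model.params[term]),
        "ci95": (float(ci_low), float(ci_high)),
        "p_value": float(model.pvalues[term]),
        "n": int(model.nobs),  # rows actually used in the fit
    }


# Synthetic data for demonstration only.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "exposure": rng.normal(size=200),
    "age": rng.uniform(60, 80, size=200),
})
demo["outcome"] = 2.0 * demo["exposure"] + 0.1 * demo["age"] + rng.normal(size=200)

fit = smf.ols("outcome ~ exposure + age", data=demo).fit()
stats = report_term(fit, "exposure")
print(
    f"beta={stats['beta']:.3f}, "
    f"95% CI=({stats['ci95'][0]:.3f}, {stats['ci95'][1]:.3f}), "
    f"p={stats['p_value']:.3g}, N={stats['n']}"
)
```

Reading N from `model.nobs` rather than `len(df)` matters because the formula interface drops rows with missing values before fitting.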
Structure the report as:
Critical honesty rules:
Never assume a dataset exists — search for it. Never assume access is public — check. Never assume variables are measured the way you need — verify the codebook.