From mims-harvard-tooluniverse
Finds and evaluates research datasets for scientific questions. Reasons about data needs like study design and variables, searches repositories, assesses fitness, and identifies access requirements.
npx claudepluginhub joshuarweaver/cascade-data-analytics --plugin mims-harvard-tooluniverse

This skill uses the workspace's default tool permissions.
- User asks "find me data about X" or "where can I get data on Y"
Before searching, determine the minimum data requirements:
- Study design needed:
- Variables needed:
- Population needed:
Search from broadest to most specific. Use find_tools to discover available dataset search tools — don't rely on memorized tool names.
Layer 1 — Cross-repository search (cast wide net): Search tools that index datasets across thousands of repositories. These find datasets you didn't know existed.
Layer 2 — Domain-specific repositories: Search repositories specialized for your data type.
Layer 3 — Literature-based discovery: Many datasets aren't in any repository — they're described in paper methods sections.
For each candidate dataset, assess these dimensions:
- Variables:
- Design match:
- Sample:
- Access:
- Quality:
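These checks can be recorded as structured booleans rather than impressions. A minimal sketch, assuming hypothetical metadata fields (`variables`, `n`, `access`) and an illustrative threshold:

```python
def assess(meta, needed_vars, min_n):
    """Score a candidate dataset's fitness; `meta` fields are hypothetical."""
    have = set(meta.get("variables", []))
    report = {
        "missing_vars": [v for v in needed_vars if v not in have],
        "sample_ok": meta.get("n", 0) >= min_n,
        "access": meta.get("access", "unknown"),  # public / application / restricted
    }
    report["fit"] = not report["missing_vars"] and report["sample_ok"]
    return report

candidate = {"variables": ["age", "sex", "sbp"], "n": 4800, "access": "application"}
print(assess(candidate, ["age", "sbp", "ldl"], 1000))
```

A report like this also makes the access dimension explicit, so an "application required" dataset is never silently treated as public.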
Don't stop at finding datasets — download and analyze them. Write and run Python code via Bash. Never describe what you "would do" — execute it.
Choose the loader that matches your data source. When unsure of the format, download a small sample first and inspect.
```python
import io
import requests
import pandas as pd

# --- Tabular files (most common) ---
df = pd.read_csv("data.csv")                          # CSV / TSV (use sep="\t" for TSV)
df = pd.read_excel("data.xlsx")                       # Excel
df = pd.read_stata("data.dta")                        # Stata
df = pd.read_sas("data.xpt", format="xport")          # SAS transport (XPT)
df = pd.read_sas("data.sas7bdat", format="sas7bdat")  # SAS native
df = pd.read_parquet("data.parquet")                  # Parquet
df = pd.read_json("data.json")                        # JSON (records or columnar)
df = pd.read_fwf("data.dat")                          # Fixed-width (some legacy surveys)

# --- Download from URL first, then parse ---
# `url` is the dataset file's download link
resp = requests.get(url, timeout=120)
resp.raise_for_status()
content = resp.content

# Detect format from the URL extension; pandas cannot infer compression
# from a BytesIO buffer, so pass compression= explicitly for .gz files
if url.lower().endswith(".xpt"):
    df = pd.read_sas(io.BytesIO(content), format="xport")
elif url.endswith((".csv", ".csv.gz")):
    df = pd.read_csv(io.BytesIO(content),
                     compression="gzip" if url.endswith(".gz") else None)
elif url.endswith((".tsv", ".tsv.gz")):
    df = pd.read_csv(io.BytesIO(content), sep="\t",
                     compression="gzip" if url.endswith(".gz") else None)
elif url.endswith(".json"):
    df = pd.read_json(io.BytesIO(content))
else:
    df = pd.read_csv(io.BytesIO(content))  # Try CSV first, then inspect
```
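When the extension is missing or misleading, the "download a small sample first" advice can be sketched as a byte-signature check (heuristics only; the signatures cover gzip, zip/xlsx, JSON, and delimited text):

```python
def sniff(head: bytes) -> str:
    """Guess a tabular file's format from its first bytes (heuristic)."""
    first_line = head.splitlines()[0] if head.splitlines() else b""
    if head[:2] == b"\x1f\x8b":
        return "gzip"          # compressed; decompress before parsing
    if head[:4] == b"PK\x03\x04":
        return "zip"           # xlsx files share the zip signature
    if head.lstrip()[:1] in (b"{", b"["):
        return "json"
    if b"\t" in first_line:
        return "tsv"
    return "csv-or-text"       # fall back to pd.read_csv and inspect the head

# Grab a sample cheaply, e.g.:
# head = requests.get(url, headers={"Range": "bytes=0-1023"}, timeout=30).content
print(sniff(b"a,b\n1,2\n"))   # prints csv-or-text
```

Servers that ignore `Range` headers simply return the whole body, so the sample fetch degrades gracefully.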
```python
# --- REST API pagination (common for GDC, ClinicalTrials.gov, etc.) ---
all_records = []
offset = 0
while True:
    resp = requests.get(f"{api_url}?offset={offset}&limit=100", timeout=30)
    resp.raise_for_status()
    batch = resp.json().get("data", [])
    if not batch:
        break
    all_records.extend(batch)
    offset += len(batch)
df = pd.DataFrame(all_records)
```
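Repository APIs rate-limit and intermittently fail, which can kill a pagination loop mid-harvest. A retry wrapper with exponential backoff is one defense (a sketch; the retryable status codes and delays are choices, not any specific API's contract):

```python
import time
import requests

def get_with_retry(url, params=None, tries=4, timeout=30):
    """GET with exponential backoff on rate limits and transient server errors."""
    for attempt in range(tries):
        resp = requests.get(url, params=params, timeout=timeout)
        if resp.status_code not in (429, 500, 502, 503):
            resp.raise_for_status()   # surface non-retryable errors (404, 403, ...)
            return resp
        time.sleep(2 ** attempt)      # 1s, 2s, 4s, ...
    resp.raise_for_status()           # retries exhausted: raise the last error
    return resp
```

Swapping this in for the bare `requests.get` in the pagination loop keeps partial harvests from silently truncating the dataset.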
```python
# Merge multiple files on participant/sample ID
merged = df1.merge(df2, on="id_col", how="inner")

# Filter population
subset = merged[(merged["age"] >= 60) & (merged["age"] <= 80)].copy()

# Handle missing values
missing_pct = subset.isnull().mean() * 100
print("Missing % per variable:\n", missing_pct[missing_pct > 0].sort_values(ascending=False))
subset = subset.dropna(subset=["exposure_var", "outcome_var"])

# Quick regression
import statsmodels.formula.api as smf
model = smf.ols("outcome ~ exposure + age + sex", data=subset).fit()
print(model.summary())

# Visualization
import matplotlib.pyplot as plt
plt.scatter(subset["exposure"], subset["outcome"], alpha=0.3)
plt.xlabel("Exposure"); plt.ylabel("Outcome")
plt.savefig("/tmp/scatter.png", dpi=150, bbox_inches="tight")
```
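Before trusting an inner merge, check how many IDs actually matched on each side; pandas' `indicator` flag makes this explicit (toy frames for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({"id_col": [1, 2, 3], "exposure": [0.1, 0.5, 0.9]})
df2 = pd.DataFrame({"id_col": [2, 3, 4], "outcome": [10, 20, 30]})

# how="outer" + indicator labels each row with the side(s) it came from
check = df1.merge(df2, on="id_col", how="outer", indicator=True)
print(check["_merge"].value_counts())
```

An inner merge keeps only the `both` rows; large `left_only` or `right_only` counts usually mean the ID variable is coded differently across files (leading zeros, prefixes, type mismatches).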
Always run the code and report actual numbers (β, p-value, CI, N).
Structure the report as:
Critical honesty rules:
Never assume a dataset exists — search for it. Never assume access is public — check. Never assume variables are measured the way you need — verify the codebook.
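For Stata (and similarly SAS) files, part of the codebook travels with the file itself. A small sketch, using a synthetic file to show the mechanism: `iterator=True` opens the file and exposes its metadata without loading rows.

```python
import pandas as pd

# Write a tiny Stata file carrying a variable label, then read the codebook back.
df = pd.DataFrame({"age": [61, 70, 78]})
df.to_stata("demo.dta", variable_labels={"age": "Age at screening (years)"},
            write_index=False)

with pd.read_stata("demo.dta", iterator=True) as reader:
    labels = reader.variable_labels()
print(labels)  # {'age': 'Age at screening (years)'}
```

`reader.value_labels()` does the same for coded categories (e.g. `{1: "male", 2: "female"}`), which is exactly what needs verifying before assuming a variable is measured the way you need.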