Identify what the user is trying to analyse, diagnose gaps in the current dataset, propose external data sources that could fill them, then plan and implement the enrichment. Use when the dataset alone can't answer the user's question and extra context (reference data, lookups, joinable public datasets) is needed.
```
npx claudepluginhub danielrosehill/claude-code-plugins --plugin claude-data-analyst
```

This skill uses the workspace's default tool permissions.
Turn an under-powered dataset into one that can actually answer the user's question, by identifying gaps and fusing in external data.
Restate the user's question in one sentence. Identify the analytical unit (row = customer? transaction? country-year?) and the target (what are we trying to explain, predict, compare, or rank?).
If the goal is vague ("analyse this data"), push back: ask what decision or insight they want. Enrichment without a target is busywork.
Run a quick schema + sample on the dataset:
duckdb -c "DESCRIBE SELECT * FROM '<file>'"
duckdb -c "SELECT * FROM '<file>' LIMIT 5"
Note the columns, grouped by role (e.g. identifiers and join keys, measures, dates, categories).
Compare the data to the goal and list concrete gaps. Each gap should name a missing variable or missing context (e.g. "no population figures to normalise sales across countries"), not just "more data".
For each gap, note: what variable is missing, why it matters for the goal, and what join key would connect it (country code, date, customer ID, postcode, ...).
For each gap, propose 1–3 candidate external sources. Evaluate each on:
| Criterion | What to check |
|---|---|
| Joinability | Does it share a key with the primary dataset? (ISO codes, dates, lat/lon, ids) |
| Coverage | Does it span the time range / geography / population of the primary data? |
| Freshness | How current is it? |
| Licence | Is it redistributable? (ODbL, CC-BY, public domain, commercial...) |
| Access | Bulk download, API, scrape? Cost? Rate limits? |
| Granularity | Does the resolution match? (country-year vs. city-month vs. postcode-day) |
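Joinability and coverage are cheap to verify empirically once a sample of the candidate source is downloaded. A minimal sketch, assuming the primary data is `data/raw/sales.csv` with an `iso3` key and the candidate sample sits in a local Parquet file (both names are assumptions):

```python
import duckdb

# Assumed names: data/raw/sales.csv is the primary dataset with an iso3
# key; candidate_sample.parquet is a downloaded sample of the source.
con = duckdb.connect()

# Joinability: what share of the primary data's keys does the source match?
match_rate = con.execute("""
    SELECT count(c.iso3) * 1.0 / count(*)
    FROM (SELECT DISTINCT iso3 FROM 'data/raw/sales.csv') p
    LEFT JOIN (SELECT DISTINCT iso3 FROM 'candidate_sample.parquet') c
           ON p.iso3 = c.iso3
""").fetchone()[0]
print(f"keys matched: {match_rate:.1%}")

# Coverage: does the candidate span the primary data's time range?
print(con.execute(
    "SELECT min(year), max(year) FROM 'candidate_sample.parquet'"
).fetchone())
```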
Common reliable sources include the World Bank API (development indicators), the Open-Meteo archive (historical weather), and the Frankfurter API (exchange rates). Use WebSearch / WebFetch to verify current endpoints before implementing.
Present the source shortlist to the user. Let them pick — don't silently fan out to 5 APIs.
Before writing code, write the plan as a short table:
| Target column(s) | Source | Join key | Method |
|---|---|---|---|
population, gdp_usd | World Bank API | iso3 + year | HTTP fetch + join |
avg_temp_c | Open-Meteo archive API | lat,lon + date | HTTP fetch per location + join |
currency_to_usd | Frankfurter API | currency + date | HTTP fetch + join |
Flag: cardinality of API calls (one per row? one per unique key?), caching strategy, expected runtime.
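To put numbers on the cardinality flag before any fetching, count unique keys versus rows. A sketch, with file and column names as assumptions:

```python
import duckdb

# Estimate request cardinality before implementing; the file and key
# columns here are hypothetical and should be adapted.
con = duckdb.connect()
n_rows = con.execute("SELECT count(*) FROM 'data/raw/sales.csv'").fetchone()[0]
n_keys = con.execute("""
    SELECT count(*) FROM (SELECT DISTINCT iso3, year FROM 'data/raw/sales.csv')
""").fetchone()[0]
print(f"{n_rows} rows, but only {n_keys} unique (iso3, year) keys")
print("plan: one cached request per unique key, never one per row")
```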
Get user sign-off on the plan before implementing — especially if it involves paid APIs or thousands of requests.
Write the enrichment as a reproducible script under scripts/enrich_<topic>.py (or .sql if it's pure DuckDB + HTTP extension). Conventions:
- Cache raw API responses under `data/enrichment/cache/`, keyed by query params. Re-runs should be free.
- Write enriched output to `data/enrichment/<source>.parquet` — don't mutate the primary dataset in place.
- Materialise a `*_enriched` table if joins are expensive.
- Record provenance in `data/enrichment/PROVENANCE.md` with source URL, fetch date, licence, and a SHA256 of the cached payload.

Example structure:
```
data/
  raw/
    sales.csv
  enrichment/
    cache/
      worldbank_population_IND_2015-2024.json
      openmeteo_40.71_-74.01_2024.json
    worldbank_indicators.parquet
    weather_daily.parquet
    PROVENANCE.md
scripts/
  enrich_worldbank.py
  enrich_weather.py
```
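A minimal skeleton for one such script, hypothetically `scripts/enrich_worldbank.py`. The endpoint shape and indicator code (`SP.POP.TOTL`) follow the public World Bank v2 API, but verify them, along with the licence, before relying on this sketch:

```python
"""Sketch: enrich sales data with World Bank population figures.

Indicator code, paths, and column names are assumptions to adapt.
"""
import datetime
import hashlib
import json
import pathlib

import duckdb
import requests

ENRICH = pathlib.Path("data/enrichment")
CACHE = ENRICH / "cache"


def fetch(iso3: str, years: str = "2015:2024") -> pathlib.Path:
    """Fetch one country's population series; cached so re-runs are free."""
    out = CACHE / f"worldbank_population_{iso3}_{years.replace(':', '-')}.json"
    if not out.exists():
        url = f"https://api.worldbank.org/v2/country/{iso3}/indicator/SP.POP.TOTL"
        resp = requests.get(
            url, params={"format": "json", "date": years, "per_page": 100},
            timeout=30,
        )
        resp.raise_for_status()
        out.write_text(resp.text)
    return out


def main() -> None:
    CACHE.mkdir(parents=True, exist_ok=True)
    # One request per unique key in the primary data, never one per row.
    countries = [r[0] for r in duckdb.sql(
        "SELECT DISTINCT iso3 FROM 'data/raw/sales.csv'").fetchall()]

    rows, digests = [], []
    for iso3 in countries:
        path = fetch(iso3)
        digests.append(f"{hashlib.sha256(path.read_bytes()).hexdigest()}  {path.name}")
        meta, data = json.loads(path.read_text())  # v2 responses are [metadata, data]
        rows.extend((iso3, int(obs["date"]), obs["value"]) for obs in data or [])

    # Write a new artefact; never mutate the primary dataset in place.
    con = duckdb.connect()
    con.execute("CREATE TABLE wb (iso3 VARCHAR, year INTEGER, population BIGINT)")
    con.executemany("INSERT INTO wb VALUES (?, ?, ?)", rows)
    dest = ENRICH / "worldbank_indicators.parquet"
    con.execute(f"COPY wb TO '{dest}' (FORMAT PARQUET)")

    (ENRICH / "PROVENANCE.md").write_text(
        "# worldbank_indicators.parquet\n"
        "- Source: https://api.worldbank.org/v2 (indicator SP.POP.TOTL)\n"
        f"- Fetched: {datetime.date.today().isoformat()}\n"
        "- Licence: CC BY-4.0 per World Bank terms (verify on the source page)\n"
        "- SHA256 of cached payloads:\n"
        + "".join(f"    - {d}\n" for d in digests)
    )


if __name__ == "__main__":
    main()
```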
After enrichment:

- Update the repo's CLAUDE.md (or create a `data/enrichment/README.md`) with what was added and how to re-run each script (e.g. `python scripts/enrich_weather.py`).
- If the workspace was set up with `setup-data-workspace`, attach the enrichments as new tables in the same `.duckdb` and update the Data section of CLAUDE.md (see the sketch below).
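The attach step itself can be a few lines, assuming a workspace database named `workspace.duckdb` (a hypothetical name) and the Parquet outputs from the example structure above:

```python
import duckdb

# Hypothetical workspace database created by setup-data-workspace.
con = duckdb.connect("workspace.duckdb")
for name in ("worldbank_indicators", "weather_daily"):
    # Materialise each enrichment as a queryable table next to the raw data.
    con.execute(
        f"CREATE OR REPLACE TABLE {name} AS "
        f"SELECT * FROM 'data/enrichment/{name}.parquet'"
    )
con.close()
```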