From claude-data-analyst
Set up a "talk to your data" workspace in the current repo — discover local data files, load them into a DuckDB database, and append a CLAUDE.md block telling future Claude sessions how to query it. Use when the user wants to make a repo's data conversationally queryable without wiring up a full BI stack.
```shell
npx claudepluginhub danielrosehill/claude-code-plugins --plugin claude-data-analyst
```

This skill uses the workspace's default tool permissions.
Turn a folder of loose data files into a single queryable DuckDB database, and teach future Claude sessions (via CLAUDE.md) how to use it.
DuckDB is the right default here: zero-server, single-file .duckdb database, reads CSV/Parquet/JSON/Excel natively, and every skill in this plugin already assumes duckdb is on PATH.
Choose where the database will live (the default is `data.duckdb` at the repo root). Then scan the repo for data files, ignoring `.git`, `node_modules`, `.venv`, `__pycache__`, `dist`, and `build`:
```shell
find . -type f \( -iname '*.csv' -o -iname '*.tsv' -o -iname '*.parquet' \
    -o -iname '*.json' -o -iname '*.jsonl' -o -iname '*.ndjson' \
    -o -iname '*.xlsx' -o -iname '*.xls' \) \
  -not -path '*/\.*' -not -path '*/node_modules/*' \
  -not -path '*/__pycache__/*' -not -path '*/dist/*' -not -path '*/build/*'
```
Report what was found: file count, total size, formats, any obvious groupings (e.g. all CSVs in data/raw/). If nothing is found, ask the user where the data lives.
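The count-and-size part of the report can be sketched with plain coreutils (an illustrative sketch with a trimmed extension list; it breaks on filenames containing spaces):

```shell
# Sketch: count discovered data files and report their total size.
files=$(find . -type f \( -iname '*.csv' -o -iname '*.parquet' -o -iname '*.json' \) \
  -not -path '*/\.*' -not -path '*/node_modules/*')
count=$(printf '%s\n' "$files" | grep -c .)
echo "data files found: $count"
# Per-file sizes in bytes; the last line is the grand total when several match.
printf '%s\n' "$files" | xargs -r wc -c | tail -n 1
```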
Before loading, show the user the file list and the table names derived from filenames (e.g. `sales_2024.csv` → `sales_2024`), and confirm; flag any name collisions. Then generate a loader SQL script at `scripts/load_data.sql` (create `scripts/` if needed). Example shape:
```sql
-- Auto-generated by setup-data-workspace
-- Re-run with: duckdb data.duckdb < scripts/load_data.sql
INSTALL excel; LOAD excel; -- only if xlsx present
CREATE OR REPLACE VIEW sales_2024 AS SELECT * FROM read_csv_auto('data/raw/sales_2024.csv');
CREATE OR REPLACE VIEW customers AS SELECT * FROM read_csv_auto('data/raw/customers.csv');
CREATE OR REPLACE TABLE budget AS SELECT * FROM read_xlsx('data/raw/budget.xlsx');
-- ...
```
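For the all-CSV case, the loader script can be generated with a small loop (a sketch, assuming the CSVs live under `data/raw/`):

```shell
# Sketch: emit one CREATE VIEW statement per CSV under data/raw/.
mkdir -p scripts
{
  echo "-- Auto-generated by setup-data-workspace"
  echo "-- Re-run with: duckdb data.duckdb < scripts/load_data.sql"
  for f in data/raw/*.csv; do
    t=$(basename "$f" .csv)   # derive the view name from the filename
    echo "CREATE OR REPLACE VIEW $t AS SELECT * FROM read_csv_auto('$f');"
  done
} > scripts/load_data.sql
```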
Run it:

```shell
duckdb data.duckdb < scripts/load_data.sql
```

Then verify with `duckdb data.duckdb -c "SHOW TABLES;"` and a `SELECT COUNT(*)` on each table.
If `data.duckdb` can be regenerated from source (i.e. all loads are views, or the raw files are committed), add `data.duckdb` to `.gitignore`. If the database is the primary artifact (raw files not in the repo), leave it tracked but warn the user about its size.
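The `.gitignore` edit can be made idempotent, so re-running the skill never duplicates the entry:

```shell
# Append data.duckdb to .gitignore only if it isn't already listed (exact-line match).
grep -qxF 'data.duckdb' .gitignore 2>/dev/null || echo 'data.duckdb' >> .gitignore
```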
Append (or create) a ## Data section in the repo's CLAUDE.md. Keep it terse — this is operational context, not a tutorial:
````markdown
## Data

This repo has a DuckDB database at `data.duckdb` with the following tables/views:

| Name | Source | Rows | Description |
|---|---|---|---|
| sales_2024 | data/raw/sales_2024.csv | 12,450 | <one-line description> |
| customers | data/raw/customers.csv | 3,201 | <one-line description> |
| budget | data/raw/budget.xlsx | 48 | <one-line description> |

To query:

```
duckdb data.duckdb -c "SELECT ... FROM sales_2024 ..."
```

To rebuild from source after data files change:

```
duckdb data.duckdb < scripts/load_data.sql
```

For analysis tasks, prefer the `claude-data-analyst` plugin skills (`/claude-data-analyst:trend-analysis`, `:correlation-analysis`, etc.) — they assume `duckdb` on PATH and will operate on `data.duckdb` by default.
````
Fill in row counts from Step 3. For descriptions, ask the user or infer from column names — mark inferred ones with (inferred).
Summarise for the user: what was created, how to query it, and suggested next steps (e.g. `/claude-data-analyst:data-dictionary-creator` to document columns).

Edge cases:
- If the repo already has a `data.duckdb`, don't overwrite it — offer to add the new tables to it, or pick a different filename.
- Encoding problems: `read_csv_auto` usually handles them; fall back to an explicit `read_csv(..., encoding='latin-1')` if it errors.
- Preview large tables with `LIMIT` rather than a bare `SELECT *`.
- If the repo has no `CLAUDE.md`, create one with just the `## Data` section — don't invent project-wide instructions.