From claude-data-analyst
Set up a "talk to your data" workspace in the current repo — discover local data files, load them into a DuckDB database, and append a CLAUDE.md block telling future Claude sessions how to query it. Use when the user wants to make a repo's data conversationally queryable without wiring up a full BI stack.
```shell
npx claudepluginhub danielrosehill/claude-code-plugins --plugin claude-data-analyst
```

This skill uses the workspace's default tool permissions.
Turn a folder of loose data files into a single queryable DuckDB database, and teach future Claude sessions (via CLAUDE.md) how to use it.
DuckDB is the right default here: zero-server, single-file .duckdb database, reads CSV/Parquet/JSON/Excel natively, and every skill in this plugin already assumes duckdb is on PATH.
Choose where the database will live (the default is `data.duckdb` at the repo root). Then scan the repo for data files, ignoring `.git`, `node_modules`, `.venv`, `__pycache__`, `dist`, and `build`:
```shell
find . -type f \( -iname '*.csv' -o -iname '*.tsv' -o -iname '*.parquet' \
    -o -iname '*.json' -o -iname '*.jsonl' -o -iname '*.ndjson' \
    -o -iname '*.xlsx' -o -iname '*.xls' \) \
  -not -path '*/\.*' -not -path '*/node_modules/*' \
  -not -path '*/__pycache__/*' -not -path '*/dist/*' -not -path '*/build/*'
```
Report what was found: file count, total size, formats, any obvious groupings (e.g. all CSVs in data/raw/). If nothing is found, ask the user where the data lives.
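The count-and-size part of the report can be sketched with plain coreutils (an illustrative sketch with a trimmed extension list; it breaks on filenames containing spaces):

```shell
# Sketch: count discovered data files and report their total size.
files=$(find . -type f \( -iname '*.csv' -o -iname '*.parquet' -o -iname '*.json' \) \
  -not -path '*/\.*' -not -path '*/node_modules/*')
count=$(printf '%s\n' "$files" | grep -c .)
echo "data files found: $count"
# Per-file sizes in bytes; the last line is the grand total when several match.
printf '%s\n' "$files" | xargs -r wc -c | tail -n 1
```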
Before loading, show the user the file list and the table names derived from filenames (e.g. `sales_2024.csv` → `sales_2024`), and confirm; flag any name collisions. Then generate a loader SQL script at `scripts/load_data.sql` (create `scripts/` if needed). Example shape:
```sql
-- Auto-generated by setup-data-workspace
-- Re-run with: duckdb data.duckdb < scripts/load_data.sql
INSTALL excel; LOAD excel; -- only if xlsx present
CREATE OR REPLACE VIEW sales_2024 AS SELECT * FROM read_csv_auto('data/raw/sales_2024.csv');
CREATE OR REPLACE VIEW customers AS SELECT * FROM read_csv_auto('data/raw/customers.csv');
CREATE OR REPLACE TABLE budget AS SELECT * FROM read_xlsx('data/raw/budget.xlsx');
-- ...
```
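For the all-CSV case, the loader script can be generated with a small loop (a sketch, assuming the CSVs live under `data/raw/`):

```shell
# Sketch: emit one CREATE VIEW statement per CSV under data/raw/.
mkdir -p scripts
{
  echo "-- Auto-generated by setup-data-workspace"
  echo "-- Re-run with: duckdb data.duckdb < scripts/load_data.sql"
  for f in data/raw/*.csv; do
    t=$(basename "$f" .csv)   # derive the view name from the filename
    echo "CREATE OR REPLACE VIEW $t AS SELECT * FROM read_csv_auto('$f');"
  done
} > scripts/load_data.sql
```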
Run it:

```shell
duckdb data.duckdb < scripts/load_data.sql
```

Then verify with `duckdb data.duckdb -c "SHOW TABLES;"` and a `SELECT COUNT(*)` on each table.
If `data.duckdb` can be regenerated from source (i.e. all loads are views, or the raw files are committed), add `data.duckdb` to `.gitignore`. If the database is the primary artifact (raw files not in the repo), leave it tracked but warn the user about its size.
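The `.gitignore` edit can be made idempotent, so re-running the skill never duplicates the entry:

```shell
# Append data.duckdb to .gitignore only if it isn't already listed (exact-line match).
grep -qxF 'data.duckdb' .gitignore 2>/dev/null || echo 'data.duckdb' >> .gitignore
```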
Append (or create) a ## Data section in the repo's CLAUDE.md. Keep it terse — this is operational context, not a tutorial:
````markdown
## Data

This repo has a DuckDB database at `data.duckdb` with the following tables/views:

| Name | Source | Rows | Description |
|---|---|---|---|
| sales_2024 | data/raw/sales_2024.csv | 12,450 | <one-line description> |
| customers | data/raw/customers.csv | 3,201 | <one-line description> |
| budget | data/raw/budget.xlsx | 48 | <one-line description> |

To query:

```
duckdb data.duckdb -c "SELECT ... FROM sales_2024 ..."
```

To rebuild from source after data files change:

```
duckdb data.duckdb < scripts/load_data.sql
```

For analysis tasks, prefer the `claude-data-analyst` plugin skills (`/claude-data-analyst:trend-analysis`, `:correlation-analysis`, etc.) — they assume `duckdb` on PATH and will operate on `data.duckdb` by default.
````
Fill in row counts from Step 3. For descriptions, ask the user or infer from column names — mark inferred ones with (inferred).
Summarise for the user: what was created, how to query it, and suggested next steps (e.g. `/claude-data-analyst:data-dictionary-creator` to document columns).

Edge cases:
- If the repo already has a `data.duckdb`, don't overwrite it — offer to add the new tables to it, or pick a different filename.
- Encoding problems: `read_csv_auto` usually handles them; fall back to an explicit `read_csv(..., encoding='latin-1')` if it errors.
- Preview large tables with `LIMIT` rather than a bare `SELECT *`.
- If the repo has no `CLAUDE.md`, create one with just the `## Data` section — don't invent project-wide instructions.