Registers Parquet, CSV, JSON, Arrow IPC, or Avro files as persistent external tables in DataFusion sessions. Auto-detects format, explores schema, and persists state for reuse across skills.
Install: `npx claudepluginhub datafusion-contrib/datafusion-skills --plugin datafusion-skills`
You are helping the user register a data file as a persistent table in their DataFusion session.
File path given: $0
Additional arguments: ${1:-}
Follow these steps in order.
If $0 is a relative path, resolve it:
RESOLVED_PATH="$(cd "$(dirname "$0")" 2>/dev/null && pwd)/$(basename "$0")"
Check the file exists (for local files):
test -f "$RESOLVED_PATH" || test -d "$RESOLVED_PATH"
For directories (partitioned data), use the directory path as-is.
command -v datafusion-cli
If not found, delegate to /datafusion-skills:install-datafusion.
If --format was specified, use that. Otherwise detect from extension:
| Extension | Format |
|---|---|
| `.parquet`, `.pq` | PARQUET |
| `.csv`, `.tsv`, `.txt` | CSV |
| `.json`, `.jsonl`, `.ndjson` | JSON |
| `.arrow`, `.ipc`, `.feather` | ARROW |
| `.avro` | AVRO |
| directory | PARQUET (default for partitioned data) |
If the extension is unknown, try Parquet first, then CSV.
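The mapping above, including the directory default and the unknown-extension fallback, can be sketched as a small helper (`detect_format` is a name invented here for illustration, not part of datafusion-cli):

```shell
#!/bin/sh
# Map a path to a DataFusion format keyword based on its extension.
detect_format() {
  path="$1"
  if [ -d "$path" ]; then
    echo "PARQUET"   # directories are treated as partitioned Parquet by default
    return 0
  fi
  case "$path" in
    *.parquet|*.pq)          echo "PARQUET" ;;
    *.csv|*.tsv|*.txt)       echo "CSV" ;;
    *.json|*.jsonl|*.ndjson) echo "JSON" ;;
    *.arrow|*.ipc|*.feather) echo "ARROW" ;;
    *.avro)                  echo "AVRO" ;;
    *)                       echo "UNKNOWN" ;;   # caller tries Parquet, then CSV
  esac
}

detect_format "events.jsonl"   # prints: JSON
```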
If --name was specified, use that. Otherwise derive from the filename:
Example: My-Data File.parquet → my_data_file
Confirm the name with the user.
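The derivation can be sketched as: lowercase the basename, strip the extension, and squash runs of non-alphanumerics into single underscores (`sanitize_name` is a hypothetical helper, not part of the skill's tooling):

```shell
#!/bin/sh
# Derive a SQL-friendly table name from a file path.
sanitize_name() {
  base="$(basename "$1")"
  base="${base%.*}"   # strip the last extension
  # lowercase, then squash runs of non-alphanumerics into single underscores
  echo "$base" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[^a-z0-9]\{1,\}/_/g' -e 's/^_//' -e 's/_$//'
}

sanitize_name "My-Data File.parquet"   # prints: my_data_file
```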
```shell
STATE_DIR=""
test -f .datafusion-skills/state.sql && STATE_DIR=".datafusion-skills"
PROJECT_ROOT="$(git rev-parse --show-toplevel 2>/dev/null || echo "$PWD")"
PROJECT_ID="$(echo "$PROJECT_ROOT" | tr '/' '-')"
test -f "$HOME/.datafusion-skills/$PROJECT_ID/state.sql" && STATE_DIR="$HOME/.datafusion-skills/$PROJECT_ID"
```
If no state directory exists, ask the user where to store state (same as other skills):
- In the project directory (`.datafusion-skills/`)
- In your home directory (`~/.datafusion-skills/<project-id>/`)
```shell
mkdir -p "$STATE_DIR"
touch "$STATE_DIR/state.sql"
```
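The state-directory lookup above can be sketched as one helper (`find_state_dir` is a name invented here; it prints the directory to use, or nothing when the user should be asked where to create one):

```shell
#!/bin/sh
# Locate an existing datafusion-skills state directory.
find_state_dir() {
  # 1. Project-local state wins.
  if [ -f .datafusion-skills/state.sql ]; then
    echo ".datafusion-skills"
    return 0
  fi
  # 2. Fall back to per-project state under $HOME.
  project_root="$(git rev-parse --show-toplevel 2>/dev/null || pwd)"
  project_id="$(echo "$project_root" | tr '/' '-')"
  if [ -f "$HOME/.datafusion-skills/$project_id/state.sql" ]; then
    echo "$HOME/.datafusion-skills/$project_id"
  fi
}
```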
Build the CREATE EXTERNAL TABLE statement:
```sql
-- For Parquet:
CREATE EXTERNAL TABLE IF NOT EXISTS <table_name> STORED AS PARQUET LOCATION '<RESOLVED_PATH>';

-- For CSV:
CREATE EXTERNAL TABLE IF NOT EXISTS <table_name> STORED AS CSV LOCATION '<RESOLVED_PATH>' OPTIONS ('has_header' 'true');

-- For JSON:
CREATE EXTERNAL TABLE IF NOT EXISTS <table_name> STORED AS JSON LOCATION '<RESOLVED_PATH>';

-- For Arrow IPC:
CREATE EXTERNAL TABLE IF NOT EXISTS <table_name> STORED AS ARROW LOCATION '<RESOLVED_PATH>';

-- For Avro:
CREATE EXTERNAL TABLE IF NOT EXISTS <table_name> STORED AS AVRO LOCATION '<RESOLVED_PATH>';
```
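The per-format statements differ only in the `STORED AS` keyword and the CSV header option, so assembling one can be sketched as (`build_create_sql` is a hypothetical helper):

```shell
#!/bin/sh
# Assemble a CREATE EXTERNAL TABLE statement from the detected format.
build_create_sql() {
  table="$1"; format="$2"; location="$3"
  opts=""
  # Only CSV carries the extra header option.
  [ "$format" = "CSV" ] && opts=" OPTIONS ('has_header' 'true')"
  echo "CREATE EXTERNAL TABLE IF NOT EXISTS $table STORED AS $format LOCATION '$location'$opts;"
}

build_create_sql taxi PARQUET /data/taxi.parquet
# prints: CREATE EXTERNAL TABLE IF NOT EXISTS taxi STORED AS PARQUET LOCATION '/data/taxi.parquet';
```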
Test it:
```shell
datafusion-cli --file "$STATE_DIR/state.sql" -c "
<CREATE_STATEMENT>
DESCRIBE <table_name>;
SELECT COUNT(*) AS row_count FROM <table_name>;
SELECT * FROM <table_name> LIMIT 5;
"
```
Check if this table is already in the state file. Match the comment marker rather than a bare substring, so that e.g. `trips` does not falsely match `trips_2024`:

```shell
grep -q -e "-- Table: <table_name> (" "$STATE_DIR/state.sql" 2>/dev/null
```
If not present, append:
```shell
cat >> "$STATE_DIR/state.sql" <<'SQL'
-- Table: <table_name> (<FORMAT> from <RESOLVED_PATH>)
<CREATE_STATEMENT>
SQL
```
Summarize:
`<table_name>` is now available in all `/datafusion-skills:query` sessions. Try: `/datafusion-skills:query SELECT * FROM <table_name> LIMIT 10`