From datafusion-skills
Reads and explores Parquet, CSV, JSON, Arrow IPC, and Avro files, locally or from S3/GCS, using datafusion-cli for schema inspection, row counts, and data previews.
npx claudepluginhub datafusion-contrib/datafusion-skills --plugin datafusion-skills
You are helping the user read and analyze a data file using Apache DataFusion.
Filename given: $0
Question: ${1:-describe the data}
Follow these steps in order, stopping and reporting clearly if any step fails.
Determine whether the input is local or remote:
- Starts with `s3://...` → remote
- Starts with `gs://...` → remote
- Otherwise → local; locate the file with `find "$PWD" -name "$0" -not -path '*/.git/*' 2>/dev/null` and resolve the first match to an absolute path (RESOLVED_PATH).

For remote inputs, use the URI/URL as-is for RESOLVED_PATH.
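The local-vs-remote decision above can be sketched as a small helper; `classify_input` is an illustrative name, not part of the skill or of datafusion-cli:

```shell
# Sketch: classify an input string as remote or local by its scheme prefix.
# classify_input is an assumed helper name for illustration only.
classify_input() {
  case "$1" in
    s3://*|gs://*) echo remote ;;
    *)             echo local ;;
  esac
}

classify_input "s3://bucket/data.parquet"   # prints: remote
classify_input "sales.csv"                  # prints: local
```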
For S3 access, DataFusion uses environment variables:
- `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_DEFAULT_REGION`
- `AWS_PROFILE` for profile-based credentials

Check if credentials are available:
test -n "$AWS_ACCESS_KEY_ID" || test -n "$AWS_PROFILE" || test -f "$HOME/.aws/credentials"
If not available, inform the user they need to configure AWS credentials.
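The credential check above can be wrapped in a small function so it reads cleanly in either branch; `aws_creds_available` is an illustrative name, not part of the skill:

```shell
# Sketch: return success if any common AWS credential source is present.
# aws_creds_available is an assumed helper name for illustration only.
aws_creds_available() {
  [ -n "$AWS_ACCESS_KEY_ID" ] || [ -n "$AWS_PROFILE" ] || [ -f "$HOME/.aws/credentials" ]
}

if aws_creds_available; then
  echo "AWS credentials detected"
else
  echo "Configure AWS credentials before querying s3:// paths" >&2
fi
```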
command -v datafusion-cli
If not found, delegate to /datafusion-skills:install-datafusion and then continue.
Detect format from extension:
| Extension | Format | DataFusion support |
|---|---|---|
| `.parquet`, `.pq` | Parquet | Direct query: `SELECT * FROM 'file.parquet'` |
| `.csv`, `.tsv`, `.txt` | CSV | Direct query: `SELECT * FROM 'file.csv'` |
| `.json`, `.jsonl`, `.ndjson` | JSON | Direct query: `SELECT * FROM 'file.json'` |
| `.arrow`, `.ipc`, `.feather` | Arrow IPC | `CREATE EXTERNAL TABLE` with `STORED AS ARROW` |
| `.avro` | Avro | `CREATE EXTERNAL TABLE` with `STORED AS AVRO` |
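The table above amounts to an extension-to-format mapping, which can be sketched as a helper; `format_for` is an illustrative name, not part of datafusion-cli:

```shell
# Sketch: map a file extension to the format name used with STORED AS.
# format_for is an assumed helper name for illustration only.
format_for() {
  case "${1##*.}" in
    parquet|pq)         echo PARQUET ;;
    csv|tsv|txt)        echo CSV ;;
    json|jsonl|ndjson)  echo JSON ;;
    arrow|ipc|feather)  echo ARROW ;;
    avro)               echo AVRO ;;
    *)                  echo UNKNOWN ;;
  esac
}

format_for events.feather   # prints: ARROW
format_for notes.md         # prints: UNKNOWN
```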
Important: datafusion-cli -c only accepts one SQL statement per flag. Use multiple
-c flags for multiple statements, or write a .sql file and use --file.
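For example, multiple statements can be batched into one file and run with --file; the path and queries below are illustrative:

```shell
# Sketch: bundle multiple statements into a single file for --file,
# since -c accepts only one statement per flag. Path and queries are
# illustrative, not prescribed by the skill.
cat > /tmp/_df_batch.sql << 'SQL'
DESCRIBE 'data.parquet';
SELECT COUNT(*) AS row_count FROM 'data.parquet';
SQL

# Run with: datafusion-cli --file /tmp/_df_batch.sql
cat /tmp/_df_batch.sql
```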
DataFusion v44+ supports direct queries on Parquet, CSV, and JSON files by path:
datafusion-cli -c "DESCRIBE 'RESOLVED_PATH';"
datafusion-cli -c "SELECT COUNT(*) AS row_count FROM 'RESOLVED_PATH';"
datafusion-cli -c "SELECT * FROM 'RESOLVED_PATH' LIMIT 10;"
For CSV files with non-standard delimiters or no header, fall back to CREATE EXTERNAL TABLE
using a .sql file:
cat > /tmp/_df_preview.sql << 'SQL'
CREATE EXTERNAL TABLE _preview STORED AS CSV LOCATION 'RESOLVED_PATH' OPTIONS ('has_header' 'false', 'delimiter' '\t');
DESCRIBE _preview;
SELECT COUNT(*) AS row_count FROM _preview;
SELECT * FROM _preview LIMIT 10;
SQL
datafusion-cli --file /tmp/_df_preview.sql
For Arrow IPC files:

cat > /tmp/_df_preview.sql << 'SQL'
CREATE EXTERNAL TABLE _preview STORED AS ARROW LOCATION 'RESOLVED_PATH';
DESCRIBE _preview;
SELECT COUNT(*) AS row_count FROM _preview;
SELECT * FROM _preview LIMIT 10;
SQL
datafusion-cli --file /tmp/_df_preview.sql
For Avro files:

cat > /tmp/_df_preview.sql << 'SQL'
CREATE EXTERNAL TABLE _preview STORED AS AVRO LOCATION 'RESOLVED_PATH';
DESCRIBE _preview;
SELECT COUNT(*) AS row_count FROM _preview;
SELECT * FROM _preview LIMIT 10;
SQL
datafusion-cli --file /tmp/_df_preview.sql
If the extension doesn't match any known format, report this to the user and ask which format to use.

Troubleshooting common errors:

- `datafusion-cli: command not found` → invoke /datafusion-skills:install-datafusion and retry
- CSV header or column mismatches → retry with OPTIONS ('has_header' 'false'), or OPTIONS ('delimiter' '\t') for TSV
- Unclear errors → run /datafusion-skills:datafusion-docs <error keywords> for help

Using the schema, row count, and sample rows gathered above, answer:
${1:-describe the data: summarize column types, row count, and any notable patterns.}
Be concise but thorough: mention the column types, row count, and any notable patterns in the sample rows.
After answering, suggest relevant follow-ups:
To query this data further (filter, aggregate, join), use /datafusion-skills:query.
If the file is useful for repeated access:
To register this as a persistent table, run
/datafusion-skills:create-table RESOLVED_PATH.
If the data is large and the user might want to materialize a summary:
To persist a summary as a Parquet file, try
/datafusion-skills:materialized-view.
Keep suggestions brief and show them only once.
- /datafusion-skills:query for further exploration
- /datafusion-skills:create-table for persistent access
- /datafusion-skills:datafusion-docs for unclear errors