Runs SQL queries or natural-language questions against registered tables, or ad hoc against Parquet, CSV, JSON, Arrow IPC, and Avro files, using datafusion-cli.
You are helping the user query data using Apache DataFusion.
Input: $@
Follow these steps in order.
Look for an existing state file:
```shell
STATE_DIR=""
test -f .datafusion-skills/state.sql && STATE_DIR=".datafusion-skills"
PROJECT_ROOT="$(git rev-parse --show-toplevel 2>/dev/null || echo "$PWD")"
PROJECT_ID="$(echo "$PROJECT_ROOT" | tr '/' '-')"
test -f "$HOME/.datafusion-skills/$PROJECT_ID/state.sql" && STATE_DIR="$HOME/.datafusion-skills/$PROJECT_ID"
```
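As an illustration, the PROJECT_ID derivation above can be isolated into a small helper (the function name `project_id_for` is hypothetical, not part of the skill); it simply maps path separators to dashes so each project root gets a distinct state directory:

```shell
# Hypothetical helper mirroring the PROJECT_ID line above:
# every '/' in the project root path becomes '-'.
project_id_for() {
  printf '%s' "$1" | tr '/' '-'
}

project_id_for "/home/user/myproj"   # prints -home-user-myproj
```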
If found, verify the state file is valid:
```shell
datafusion-cli --file "$STATE_DIR/state.sql" -c "SELECT 1;"
```
Now determine the mode:
- Ad-hoc mode: the --file flag is present, the SQL references file paths (e.g. `FROM 'data.parquet'`), or STATE_DIR is empty.
- Session mode: STATE_DIR is set and the input references table names, is natural language, or is SQL without file references.

If no state file exists and no file is referenced, fall back to ad-hoc mode.
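The decision above can be sketched as a small shell function (`choose_mode` and its file-path regex are illustrative assumptions, not part of the skill):

```shell
# Illustrative sketch: pick session vs ad-hoc mode.
# Ad-hoc wins when no state dir is set or the SQL references a data file path.
choose_mode() {
  local input="$1" state_dir="$2"
  if [ -z "$state_dir" ] || \
     printf '%s' "$input" | grep -qiE "FROM +'[^']+\.(parquet|csv|json|arrow|avro)'"; then
    echo "ad-hoc"
  else
    echo "session"
  fi
}

choose_mode "SELECT * FROM 'data.parquet'" ".datafusion-skills"   # ad-hoc
choose_mode "SELECT name FROM users"       ".datafusion-skills"   # session
```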
```shell
command -v datafusion-cli
```
If not found, delegate to /datafusion-skills:install-datafusion and then continue.
If the input is natural language (not valid SQL), generate SQL.
In session mode, first retrieve the schema to inform query generation:
```shell
datafusion-cli --file "$STATE_DIR/state.sql" -c "SHOW TABLES;"
```
Then for relevant tables:
```shell
datafusion-cli --file "$STATE_DIR/state.sql" -c "DESCRIBE <table_name>;"
```
When generating SQL, keep these DataFusion-specific features in mind:
- `SELECT * FROM 'path/to/file.parquet'` works for Parquet files directly.
- External tables and COPY:

```sql
CREATE EXTERNAL TABLE t STORED AS PARQUET LOCATION 'data.parquet';
CREATE EXTERNAL TABLE t STORED AS CSV LOCATION 'data.csv' OPTIONS ('has_header' 'true');
CREATE EXTERNAL TABLE t STORED AS JSON LOCATION 'data.json';
CREATE EXTERNAL TABLE t STORED AS ARROW LOCATION 'data.arrow';
CREATE EXTERNAL TABLE t STORED AS AVRO LOCATION 'data.avro';
COPY (SELECT * FROM t WHERE x > 10) TO 'output.parquet' STORED AS PARQUET;
```
- Window functions: ROW_NUMBER, RANK, LAG, LEAD, etc.
- WITH clauses are fully supported
- Aggregates: COUNT, SUM, AVG, MIN, MAX, MEDIAN, APPROX_PERCENTILE_CONT, etc.
- String functions: LENGTH, LOWER, UPPER, TRIM, SUBSTR, REGEXP_MATCH, etc.
- Date/time functions: NOW(), DATE_TRUNC, DATE_PART, INTERVAL, TO_TIMESTAMP, etc.
- EXPLAIN or EXPLAIN ANALYZE to see query plans

Before executing, check if the query might produce too many rows.
If the query has no LIMIT and might return a large result, wrap it:
```sql
SELECT COUNT(*) AS total_rows FROM (<original_query>);
```
If total_rows > 1000, add LIMIT 100 and inform the user.
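One way to sketch this guard in shell (the helper names `wrap_for_count` and `ensure_limit` are hypothetical; only the SQL shapes come from the steps above):

```shell
# Illustrative helpers for the row-count guard described above.

# Wrap a query so it returns only its row count.
wrap_for_count() {
  printf 'SELECT COUNT(*) AS total_rows FROM (%s);' "${1%;}"
}

# Append LIMIT 100 unless the query already has a LIMIT clause.
ensure_limit() {
  local q="${1%;}"
  if printf '%s' "$q" | grep -qiE 'LIMIT +[0-9]+'; then
    printf '%s;' "$q"
  else
    printf '%s LIMIT 100;' "$q"
  fi
}
```

The count query would be run first via datafusion-cli; if total_rows exceeds 1000, execute the limited form instead and inform the user.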
Important: datafusion-cli -c only accepts one SQL statement per flag. Use multiple
-c flags for multiple statements, or write a .sql file and use --file.
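For example (the temp-file path is arbitrary), several statements can be combined into one file for --file instead of chaining -c flags:

```shell
# Write two statements, one per line, into a throwaway .sql file.
cat > /tmp/_df_multi.sql <<'SQL'
CREATE EXTERNAL TABLE t STORED AS PARQUET LOCATION 'data.parquet';
SELECT COUNT(*) FROM t;
SQL

# Then run both in one invocation (requires datafusion-cli on PATH):
# datafusion-cli --file /tmp/_df_multi.sql
grep -c ';' /tmp/_df_multi.sql   # 2
```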
Session mode:
```shell
datafusion-cli --file "$STATE_DIR/state.sql" -c "<SQL_QUERY>"
```
Ad-hoc mode with file:
DataFusion v44+ can query Parquet, CSV, and JSON files directly by path:
```shell
datafusion-cli -c "SELECT * FROM '<file_path>' LIMIT 10;"
```
For Arrow IPC or Avro, create a temporary external table using a .sql file:
```shell
cat > /tmp/_df_query.sql << SQL
CREATE EXTERNAL TABLE _tmp STORED AS ARROW LOCATION '<file_path>';
<SQL_QUERY using _tmp>;
SQL
datafusion-cli --file /tmp/_df_query.sql
```
Ad-hoc mode without file:
```shell
datafusion-cli -c "<SQL_QUERY>"
```
If the query fails:
- `datafusion-cli: command not found` → invoke /datafusion-skills:install-datafusion and retry.
- Unknown column or table → run DESCRIBE on the table/file and show the user the available columns.
- Other errors → run /datafusion-skills:datafusion-docs <error keywords> to search the documentation.

Present the query results clearly. If the result is a table, format it nicely.
Suggest next steps if appropriate:
- /datafusion-skills:create-table to register the file for repeated use
- /datafusion-skills:materialized-view to persist results
- When state.sql exists, tables registered by /datafusion-skills:create-table are automatically available
- /datafusion-skills:datafusion-docs for persistent errors
- /datafusion-skills:explain-plan for slow queries