Create and manage materialized views using DataFusion. Persist SQL query results as Parquet files for fast repeated access. Track source dependencies and refresh when data changes.
npx claudepluginhub datafusion-contrib/datafusion-skills --plugin datafusion-skills
You are helping the user create and manage materialized views using Apache DataFusion.
Input: $@
A materialized view in this context is a SQL query whose results are persisted as a Parquet file, registered as an external table, and tracked for refresh. This is a lightweight, file-based approach — no database server needed.
command -v datafusion-cli
If not found, delegate to /datafusion-skills:install-datafusion and then continue.
Look for an existing state file:
STATE_DIR=""
test -f .datafusion-skills/state.sql && STATE_DIR=".datafusion-skills"
PROJECT_ROOT="$(git rev-parse --show-toplevel 2>/dev/null || echo "$PWD")"
PROJECT_ID="$(echo "$PROJECT_ROOT" | tr '/' '-')"
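The PROJECT_ID derivation above is just a path-to-slug mapping; a minimal Python equivalent (the helper name project_id is illustrative):

```python
def project_id(project_root: str) -> str:
    # Mirrors `tr '/' '-'`: flatten an absolute path into a single
    # directory name usable under ~/.datafusion-skills/.
    return project_root.replace("/", "-")
```

A repo rooted at /home/alice/trading therefore stores its state under ~/.datafusion-skills/-home-alice-trading/ (note the leading dash from the root slash).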
test -f "$HOME/.datafusion-skills/$PROJECT_ID/state.sql" && STATE_DIR="$HOME/.datafusion-skills/$PROJECT_ID"
If no state directory exists, create one. Ask the user:
Where would you like to store DataFusion session state?
- In the project directory (.datafusion-skills/) — colocated with the project
- In your home directory (~/.datafusion-skills/<project-id>/) — keeps the repo clean
Create the chosen directory and initialize state.sql if it doesn't exist:
mkdir -p "$STATE_DIR"
touch "$STATE_DIR/state.sql"
Also create the materialized views directory and manifest:
mkdir -p "$STATE_DIR/views"
test -f "$STATE_DIR/views/manifest.json" || echo '{"views":{}}' > "$STATE_DIR/views/manifest.json"
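A defensive Python version of the same bootstrap; unlike the shell one-liner, it also recovers from a truncated or corrupt manifest.json (the function name init_manifest is illustrative):

```python
import json
import os


def init_manifest(state_dir: str) -> dict:
    """Create views/manifest.json if missing or unreadable; return its contents."""
    views_dir = os.path.join(state_dir, "views")
    os.makedirs(views_dir, exist_ok=True)
    path = os.path.join(views_dir, "manifest.json")
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        # Missing or corrupt manifest: reset to the empty shape the skill expects.
        manifest = {"views": {}}
        with open(path, "w") as f:
            json.dump(manifest, f, indent=2)
        return manifest
```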
Determine what the user wants:
- create <SQL or natural language> → create a new materialized view
- refresh <name> → refresh an existing materialized view
- status → show status of all materialized views
- list → list all materialized views
- drop <name> → remove a materialized view

If the input is just SQL or a natural language description, treat it as create.
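The dispatch above, including the bare-input fallback to create, can be sketched as:

```python
def parse_command(raw: str) -> tuple[str, str]:
    """Split user input into (action, argument); bare SQL or natural language falls through to create."""
    parts = raw.strip().split(None, 1)
    action = parts[0].lower() if parts else ""
    rest = parts[1] if len(parts) > 1 else ""
    if action in ("create", "refresh", "status", "list", "drop"):
        return action, rest
    # Not a recognized verb: treat the whole input as a create request.
    return "create", raw.strip()
```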
If the input is natural language, generate SQL. Use the session state to understand available tables:
datafusion-cli --file "$STATE_DIR/state.sql" -c "SHOW TABLES;" 2>/dev/null
Generate a descriptive snake_case name from the SQL (e.g., daily_trades_by_symbol). Ask the user to confirm or rename.
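One possible naming heuristic, purely illustrative (the actual suggestion should come from reading the full SQL, and the user confirms it either way):

```python
import re


def suggest_view_name(sql: str) -> str:
    """Derive a snake_case name from the main table and first GROUP BY column, if present."""
    table = re.search(r"\bFROM\s+([a-zA-Z_]\w*)", sql, re.IGNORECASE)
    group = re.search(r"\bGROUP\s+BY\s+([a-zA-Z_]\w*)", sql, re.IGNORECASE)
    parts = [m.group(1).lower() for m in (table, group) if m]
    if len(parts) == 2:
        return "_by_".join(parts)
    return parts[0] if parts else "view"
```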
Test that the SQL runs successfully. Wrap it as a derived table so that a trailing LIMIT or ORDER BY inside $SQL does not conflict with the probe:
datafusion-cli --file "$STATE_DIR/state.sql" -c "
SELECT * FROM ($SQL) AS smoke_test LIMIT 1;
"
If it fails, diagnose the error and fix the SQL.
datafusion-cli --file "$STATE_DIR/state.sql" -c "
COPY ($SQL) TO '$STATE_DIR/views/<view_name>.parquet' STORED AS PARQUET;
"
Append to the state file so the view is available in future sessions:
cat >> "$STATE_DIR/state.sql" <<SQL
-- Materialized view: <view_name>
-- Source SQL: $SQL
CREATE EXTERNAL TABLE IF NOT EXISTS <view_name> STORED AS PARQUET LOCATION '$STATE_DIR/views/<view_name>.parquet';
SQL
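Note that the heredoc delimiter above is unquoted, so any $ inside the source SQL would be expanded by the shell; rendering the snippet in Python sidesteps that. A sketch (the function name registration_sql is an assumption):

```python
def registration_sql(view_name: str, parquet_path: str, source_sql: str) -> str:
    """Render the state.sql snippet that re-registers the view in future sessions."""
    return (
        f"-- Materialized view: {view_name}\n"
        f"-- Source SQL: {source_sql}\n"
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {view_name} "
        f"STORED AS PARQUET LOCATION '{parquet_path}';\n"
    )
```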
Update $STATE_DIR/views/manifest.json to track the view:
python3 -c "
import json, datetime
manifest_path = '$STATE_DIR/views/manifest.json'
with open(manifest_path) as f:
    manifest = json.load(f)
manifest['views']['<view_name>'] = {
    'sql': '''$SQL''',
    'parquet_path': '$STATE_DIR/views/<view_name>.parquet',
    'created_at': datetime.datetime.now().isoformat(),
    'last_refreshed': datetime.datetime.now().isoformat(),
    'source_tables': []  # extracted from SQL
}
with open(manifest_path, 'w') as f:
    json.dump(manifest, f, indent=2)
"
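The source_tables field is left empty above. A naive regex extraction is sketched below; it is only a heuristic, and a real SQL parser would be more robust:

```python
import re


def extract_source_tables(sql: str) -> list[str]:
    """Collect identifiers that follow FROM or JOIN; a '(' after FROM (a subquery) is not matched."""
    pattern = r"\b(?:FROM|JOIN)\s+([a-zA-Z_][\w.]*)"
    return sorted({m.lower() for m in re.findall(pattern, sql, re.IGNORECASE)})
```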
datafusion-cli --file "$STATE_DIR/state.sql" -c "
SELECT COUNT(*) AS row_count FROM <view_name>;
DESCRIBE <view_name>;
"
Report success with row count and schema.
Read the view's SQL from the manifest:
python3 -c "
import json
with open('$STATE_DIR/views/manifest.json') as f:
    manifest = json.load(f)
view = manifest['views'].get('<view_name>')
if view:
    print(view['sql'])
else:
    print('ERROR: view not found')
"
Then re-materialize:
datafusion-cli --file "$STATE_DIR/state.sql" -c "
COPY ($SQL) TO '$STATE_DIR/views/<view_name>.parquet' STORED AS PARQUET;
"
Update the last_refreshed timestamp in the manifest.
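The timestamp update mirrors the create step; a sketch (the helper name mark_refreshed is illustrative):

```python
import datetime
import json


def mark_refreshed(manifest_path: str, view_name: str) -> None:
    """Stamp last_refreshed with the current time, leaving created_at untouched."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    manifest["views"][view_name]["last_refreshed"] = datetime.datetime.now().isoformat()
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)
```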
Report the new row count.
Show the status of all materialized views:
python3 -c "
import json, os
with open('$STATE_DIR/views/manifest.json') as f:
    manifest = json.load(f)
for name, view in manifest['views'].items():
    parquet = view['parquet_path']
    size = os.path.getsize(parquet) if os.path.exists(parquet) else 0
    size_mb = size / (1024 * 1024)
    print(f\"{name}:\")
    print(f\"  SQL: {view['sql'][:80]}...\")
    print(f\"  Last refreshed: {view['last_refreshed']}\")
    print(f\"  File size: {size_mb:.2f} MB\")
    print(f\"  Path: {parquet}\")
    print()
"
Also verify each view is queryable:
datafusion-cli --file "$STATE_DIR/state.sql" -c "SELECT COUNT(*) AS row_count FROM <view_name>;"
python3 -c "
import json
with open('$STATE_DIR/views/manifest.json') as f:
    manifest = json.load(f)
for name in sorted(manifest['views']):
    view = manifest['views'][name]
    print(f\"  {name} — refreshed {view['last_refreshed'][:10]}\")
"
Remove a materialized view:
rm -f "$STATE_DIR/views/<view_name>.parquet"
python3 -c "
import json
with open('$STATE_DIR/views/manifest.json') as f:
    manifest = json.load(f)
manifest['views'].pop('<view_name>', None)
with open('$STATE_DIR/views/manifest.json', 'w') as f:
    json.dump(manifest, f, indent=2)
"
Remove the corresponding lines from state.sql:
sed -i '' '/-- Materialized view: <view_name>/,/^$/d' "$STATE_DIR/state.sql"
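Note that sed -i '' is BSD/macOS syntax; GNU sed expects plain -i. A portable Python equivalent of the marker-to-blank-line deletion:

```python
def drop_view_block(state_sql: str, view_name: str) -> str:
    """Remove the lines from '-- Materialized view: <name>' through the next blank line.

    Prefix match only: view names sharing a prefix would need a stricter check.
    """
    out, skipping = [], False
    for line in state_sql.splitlines(keepends=True):
        if line.startswith(f"-- Materialized view: {view_name}"):
            skipping = True
        if not skipping:
            out.append(line)
        if skipping and line.strip() == "":
            # Like sed's /^$/ end address, the blank line itself is also dropped.
            skipping = False
    return "".join(out)
```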
If no source tables are registered yet, suggest /datafusion-skills:create-table to register them.

After creating a view:
Your materialized view <view_name> is now available as a table. Query it with /datafusion-skills:query SELECT * FROM <view_name> LIMIT 10. To refresh it when source data changes, run /datafusion-skills:materialized-view refresh <view_name>.