Runs SQL queries on CSV/TSV/Excel files using the Polars engine via qsv_sqlp. Provides stats, frequency analysis, indexing, search, select, and Parquet export for efficient data exploration.
Query tabular data files using SQL via the Polars-powered sqlp command.
Cowork note: If relative paths don't resolve, call qsv_get_working_dir and qsv_set_working_dir to sync the working directory.
Is the query simple (a single-column filter or basic select)?
- Yes: use select + search for simpler operations.
- No: use sqlp for full SQL support.

Does the query involve joins, GROUP BY, window functions, or complex expressions?
- Yes: use sqlp (the Polars SQL engine).

Is the CSV file very large (> 10MB)?
- Yes: consider qsv_to_parquet for faster repeated queries. Note: sqlp can also query CSV files of any size directly.

Recommended workflow:

1. Prepare the file: Run qsv_index and qsv_stats with cardinality: true, stats_jsonl: true to create the index and stats cache.
2. Read the stats cache: Read <FILESTEM>.stats.csv (e.g., data.stats.csv for data.csv) to understand column metadata before writing SQL. This is the most important step for writing efficient queries.
3. Run frequency on key columns: For columns you plan to GROUP BY, filter on, or join on, run qsv_frequency to see actual value distributions. This reveals the best filter values and whether a GROUP BY will produce a manageable result set.
4. Write and run SQL: Use qsv_sqlp with a query informed by the stats and frequency data. The table name in SQL is the filename stem (e.g., data.csv -> SELECT * FROM data). For Parquet files, use read_parquet('data.parquet') as the table source instead.
5. Refine if needed: Check results and adjust the query.
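The naming conventions used across these steps (stats cache path, table name) can be sketched in a small helper. This is an illustration of the conventions stated above, not part of qsv itself; the `.idx` suffix for the index file is an assumption:

```python
from pathlib import Path

def qsv_names(data_path: str) -> dict:
    """Derive qsv naming conventions for a data file:
    stats cache path (<FILESTEM>.stats.csv), index file,
    and the SQL table name (the filename stem)."""
    p = Path(data_path)
    return {
        "stats_cache": str(p.with_name(f"{p.stem}.stats.csv")),
        "index_file": str(p.with_name(f"{p.name}.idx")),  # assumption: index is <file>.idx
        "table_name": p.stem,  # e.g. data.csv -> SELECT * FROM data
    }

names = qsv_names("data/sales_2024.csv")
```

For example, `qsv_names("data/sales_2024.csv")` yields table name `sales_2024` and stats cache `data/sales_2024.stats.csv`.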
After reading the .stats.csv cache, use these columns to inform your SQL:
| Stats Column | How to Use in SQL |
|---|---|
type | Use correct casts and comparisons — don't quote integers, use date functions for Date/DateTime columns |
min / max | Write precise WHERE clauses using actual data range (e.g., WHERE price BETWEEN 10.5 AND 999.99 instead of arbitrary bounds) |
cardinality | Estimate GROUP BY result size — low cardinality (< 100) is fast; high cardinality (> 10K) may need LIMIT or a different approach |
nullcount | Only add COALESCE or IS NOT NULL where nullcount > 0 — skip null handling for columns with zero nulls |
sort_order | Skip ORDER BY if data is already sorted on that column (sort_order = "Ascending"/"Descending") |
mean / stddev | Write outlier filters: WHERE col BETWEEN mean - 3*stddev AND mean + 3*stddev |
median / q1 / q3 | For skewed data (when mean and median diverge), use quartile-based ranges: WHERE col BETWEEN q1 AND q3 instead of mean ± stddev |
skewness | If skewness > 1 or < -1, prefer median/quartile-based filters over mean-based ones |
cv | High CV (> 100%) signals high relative variability — add LIMIT to GROUP BY queries and consider binning continuous values |
outliers_percentage | If > 5%, consider excluding outliers before aggregation: WHERE col BETWEEN lower_inner_fence AND upper_inner_fence |
sparsity | Columns with sparsity > 0.5 are mostly null — avoid using them as join keys or GROUP BY columns |
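The heuristics in the table above can be folded into a small helper that reads the stats cache and emits per-column SQL-writing hints. A minimal sketch using only the standard library, assuming the cache's headers match the stats names used above (field, cardinality, nullcount, skewness):

```python
import csv
import io

def sql_hints(stats_csv_text: str) -> dict:
    """Read a stats cache (CSV text) and derive per-column hints
    using the thresholds from the table above."""
    hints = {}
    for row in csv.DictReader(io.StringIO(stats_csv_text)):
        col_hints = []
        if int(row["cardinality"] or 0) > 10_000:
            col_hints.append("GROUP BY may be large: add LIMIT")
        if int(row["nullcount"] or 0) == 0:
            col_hints.append("skip null handling")
        skew = float(row["skewness"]) if row.get("skewness") else 0.0
        if abs(skew) > 1:
            col_hints.append("prefer median/quartile-based filters over mean-based ones")
        hints[row["field"]] = col_hints
    return hints

# hypothetical cache excerpt for illustration
cache = """field,cardinality,nullcount,skewness
price,25000,0,2.3
status,4,12,
"""
```

Here `sql_hints(cache)` flags `price` (high cardinality, zero nulls, skewed) and leaves `status` without hints.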
Run qsv_frequency --select col --limit 20 before writing WHERE clauses on categorical columns:
- If frequency shows "active" has 90% of rows, filtering on WHERE status = 'active' is wasteful; filter on the rare values instead.
- Before writing WHERE category IN ('A','B','C'), check frequency first to confirm those values exist and to see whether you're missing any.

The sqlp command uses the Polars SQL dialect:
```sql
-- Basic select
SELECT col1, col2 FROM data WHERE col1 > 100

-- Aggregation
SELECT category, COUNT(*) as cnt, AVG(price) as avg_price
FROM data GROUP BY category ORDER BY cnt DESC

-- Window functions
SELECT *, ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) as rank
FROM employees

-- String operations
SELECT * FROM data WHERE col1 LIKE '%pattern%'

-- Date operations
SELECT *, EXTRACT(YEAR FROM date_col) as year FROM data

-- Multiple files (join)
SELECT a.*, b.name FROM file1 a JOIN file2 b ON a.id = b.id

-- CASE expressions
SELECT *, CASE WHEN amount > 1000 THEN 'high' ELSE 'low' END as tier FROM data
```
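The frequency-first advice for WHERE clauses (filter on rare values, not dominant ones) can be emulated locally when reasoning about a distribution. A standard-library sketch, not a replacement for qsv_frequency:

```python
import csv
import io
from collections import Counter

def rare_values(csv_text: str, column: str, max_share: float = 0.1):
    """Return values of `column` covering at most `max_share` of rows --
    the values worth filtering on directly, per the guidance above."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    counts = Counter(r[column] for r in rows)
    n = len(rows)
    return sorted(v for v, c in counts.items() if c / n <= max_share)

# hypothetical data: 'active' covers 90% of rows
data = "status\n" + "active\n" * 9 + "pending\n"
```

Here `rare_values(data, "status")` returns `["pending"]`, suggesting WHERE status = 'pending' as the selective filter.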
- sales_2024.csv -> Table: sales_2024
- my-data.csv -> Table: "my-data" (quote if the stem contains special characters)
- sqlp uses the Polars engine; some PostgreSQL-specific syntax may not be supported
- Use --output file.csv for large result sets
- Use LIMIT to preview large result sets before running full queries
- sqlp can query multiple CSV files in a single SQL statement (useful for joins)
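The table-naming and quoting notes can be captured in a helper. A minimal sketch; the exact identifier grammar Polars accepts unquoted is an assumption here (letters, digits, and underscores, not starting with a digit):

```python
import re
from pathlib import Path

def sqlp_table_name(path: str) -> str:
    """Derive the sqlp table name from a file path: the filename stem,
    double-quoted when it contains characters beyond [A-Za-z0-9_]."""
    stem = Path(path).stem
    if re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", stem):
        return stem
    # escape embedded double quotes, then wrap the identifier
    return '"' + stem.replace('"', '""') + '"'
```

For example, `sqlp_table_name("my-data.csv")` yields `"my-data"` (quoted), while `sales_2024.csv` stays bare.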