Help us improve
Share bugs, ideas, or general feedback.
From qsv-data-wrangling
Runs SQL queries on CSV/TSV/Excel files using Polars engine via qsv_sqlp. Provides stats, frequency analysis, indexing, search, select, and Parquet export for efficient data exploration.
npx claudepluginhub dathere/qsv --plugin qsv-data-wranglingHow this skill is triggered — by the user, by Claude, or both
Slash command
/qsv-data-wrangling:csv-queryThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Query tabular data files using SQL via the Polars-powered `sqlp` command.
Manipulates CSV data with csvlook, csvsql, csvjoin, csvstat. Streams records lazily; handles datasets larger than RAM. Invoke via csvlook.
Accelerates qsv CSV processing with index files, stats cache, Polars engine, and Parquet conversion for large files and smart commands.
Parse, transform, clean, and analyze CSV files: auto-detect formats, filter/sort/merge/pivot, generate stats/outliers, with Python (pandas) and JavaScript examples.
Share bugs, ideas, or general feedback.
Query tabular data files using SQL via the Polars-powered sqlp command.
Cowork note: If relative paths don't resolve, call
qsv_get_working_dirandqsv_set_working_dirto sync the working directory.
Is the query simple (single column filter, basic select)?
select + search for simpler operationssqlp for full SQL supportDoes the query involve joins, GROUP BY, window functions, or complex expressions?
sqlp (Polars SQL engine)Is the CSV file very large (> 10MB)?
qsv_to_parquet for faster repeated queries. Note: sqlp can also query CSV files of any size directly.Prepare the file: Run qsv_index and qsv_stats with cardinality: true, stats_jsonl: true to create index and stats cache.
Read the stats cache: Read <FILESTEM>.stats.csv (e.g., data.stats.csv for data.csv) to understand column metadata before writing SQL. This is the most important step for writing efficient queries.
Run frequency on key columns: For columns you plan to GROUP BY, filter on, or join on, run qsv_frequency to see actual value distributions. This reveals the best filter values and whether a GROUP BY will produce a manageable result set.
Write and run SQL: Use qsv_sqlp with the SQL query informed by stats and frequency data. The table name in SQL is the filename stem (e.g., data.csv -> SELECT * FROM data). For Parquet files, use read_parquet('data.parquet') as the table source instead.
Refine if needed: Check results and adjust the query.
After reading the .stats.csv cache, use these columns to inform your SQL:
| Stats Column | How to Use in SQL |
|---|---|
type | Use correct casts and comparisons — don't quote integers, use date functions for Date/DateTime columns |
min / max | Write precise WHERE clauses using actual data range (e.g., WHERE price BETWEEN 10.5 AND 999.99 instead of arbitrary bounds) |
cardinality | Estimate GROUP BY result size — low cardinality (< 100) is fast; high cardinality (> 10K) may need LIMIT or a different approach |
nullcount | Only add COALESCE or IS NOT NULL where nullcount > 0 — skip null handling for columns with zero nulls |
sort_order | Skip ORDER BY if data is already sorted on that column (sort_order = "Ascending"/"Descending") |
mean / stddev | Write outlier filters: WHERE col BETWEEN mean - 3*stddev AND mean + 3*stddev |
median / q1 / q3 | For skewed data (when mean and median diverge), use quartile-based ranges: WHERE col BETWEEN q1 AND q3 instead of mean ± stddev |
skewness | If skewness > 1 or < -1, prefer median/quartile-based filters over mean-based ones |
cv | High CV (> 100%) signals high relative variability — add LIMIT to GROUP BY queries and consider binning continuous values |
outliers_percentage | If > 5%, consider excluding outliers before aggregation: WHERE col BETWEEN lower_inner_fence AND upper_inner_fence |
sparsity | Columns with sparsity > 0.5 are mostly null — avoid using them as join keys or GROUP BY columns |
Run qsv_frequency --select col --limit 20 before writing WHERE clauses on categorical columns:
frequency shows "active" has 90% of rows, filtering on WHERE status = 'active' is wasteful — filter on the rare values insteadWHERE category IN ('A','B','C'), check frequency first to confirm those values exist and see if you're missing anyThe sqlp command uses Polars SQL dialect:
-- Basic select
SELECT col1, col2 FROM data WHERE col1 > 100
-- Aggregation
SELECT category, COUNT(*) as cnt, AVG(price) as avg_price
FROM data GROUP BY category ORDER BY cnt DESC
-- Window functions
SELECT *, ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) as rank
FROM employees
-- String operations
SELECT * FROM data WHERE col1 LIKE '%pattern%'
-- Date operations
SELECT *, EXTRACT(YEAR FROM date_col) as year FROM data
-- Multiple files (join)
SELECT a.*, b.name FROM file1 a JOIN file2 b ON a.id = b.id
-- CASE expressions
SELECT *, CASE WHEN amount > 1000 THEN 'high' ELSE 'low' END as tier FROM data
sales_2024.csv -> Table: sales_2024my-data.csv -> Table: "my-data" (quote if contains special chars)sqlp uses the Polars engine - some PostgreSQL-specific syntax may not be supported--output file.csv for large result setsLIMIT to preview large result sets before running full queriessqlp can query multiple CSV files in a single SQL statement (useful for joins)