From qsv-data-wrangling
Accelerates qsv CSV processing with index files, stats cache, Polars engine, and Parquet conversion for large files and smart commands.
npx claudepluginhub dathere/qsv --plugin qsv-data-wrangling
How this skill is triggered (by the user, by Claude, or both):
Slash command
/qsv-data-wrangling:qsv-performance
The summary Claude sees in its skill listing (used to decide when to auto-load this skill):
**Created by**: `qsv index`
Index file (.csv.idx)
Created by: qsv index
Used by: count, slice, sample, split, stats, frequency, schema, and others marked with 📇
| Benefit | Without Index | With Index |
|---|---|---|
| Row count | Scan entire file | Instant (stored in index) |
| Random access | Sequential scan | O(1) lookup |
| Multithreaded | Not possible | Enabled for many commands |
| Slicing | Read from start | Jump to position |
Rule: Always run index first if you'll run 2+ commands on the same file.
Auto-indexing: The MCP server auto-indexes files > 10MB.
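The indexing rule above can be sketched as a small planning helper. This is only an illustration, not part of qsv: `plan_commands` and `index_path` are hypothetical names, and the sketch assumes the index sits next to the input as `<file>.idx` (for example `data.csv.idx`).

```python
from pathlib import Path


def index_path(csv_path: str) -> Path:
    """Assumed index location: the input path with '.idx' appended."""
    return Path(csv_path + ".idx")


def plan_commands(csv_path: str, commands: list[list[str]]) -> list[list[str]]:
    """Prepend `qsv index` when 2+ commands target the file and no index exists yet."""
    if len(commands) >= 2 and not index_path(csv_path).exists():
        return [["qsv", "index", csv_path]] + commands
    return commands
```

Each inner list is an argument vector that could be handed to `subprocess.run(argv, check=True)` on a machine where qsv is installed.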
Stats cache (.stats.csv + .stats.csv.data.jsonl)
Created by: qsv stats --cardinality --stats-jsonl
Used by: frequency, schema, tojsonl, sqlp, joinp, pivotp, diff, sample (smart commands)
| Smart Command | What It Uses from Cache |
|---|---|
| frequency | Cardinality to skip all-unique columns |
| schema | Data types for JSON Schema generation |
| sqlp | Column types for Polars optimization |
| joinp | Cardinality for optimal join order |
| pivotp | Cardinality to estimate output width |
| diff | Column types for comparison |
Rule: Run stats --cardinality --stats-jsonl before using any smart command.
Auto-caching: The MCP server auto-adds --stats-jsonl to stats commands.
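As a rough sketch of the stats-cache rule, the helper below builds the cache-priming command and checks whether the two cache files are present. The function names are hypothetical. The file naming (input stem plus `.stats.csv` and `.stats.csv.data.jsonl`) is an assumption based on the suffixes listed above, and the modification-time check is a heuristic of this sketch rather than qsv's own logic.

```python
from pathlib import Path


def stats_cache_command(csv_path: str) -> list[str]:
    """Build the cache-priming command named in the rule above."""
    return ["qsv", "stats", "--cardinality", "--stats-jsonl", csv_path]


def stats_cache_files(csv_path: str) -> list[Path]:
    """Assumed cache file names: input stem plus the two suffixes listed above."""
    base = Path(csv_path).with_suffix("")  # data.csv -> data
    return [Path(f"{base}.stats.csv"), Path(f"{base}.stats.csv.data.jsonl")]


def cache_is_fresh(csv_path: str) -> bool:
    """True only if both cache files exist and are at least as new as the CSV."""
    src = Path(csv_path)
    files = stats_cache_files(csv_path)
    if not src.exists() or not all(f.exists() for f in files):
        return False
    return all(f.stat().st_mtime >= src.stat().st_mtime for f in files)
```

A caller might run `stats_cache_command(path)` only when `cache_is_fresh(path)` returns `False`, then proceed to the smart commands.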
Polars engine commands: sqlp, joinp, pivotp, count (with --polars-len), schema (with --polars)
| Benefit | Standard (csv crate) | Polars Engine |
|---|---|---|
| Processing model | Row-by-row streaming | Vectorized columnar |
| Memory | Streaming (constant) | Columnar (efficient) |
| Parallelism | Single-threaded | Multi-threaded |
| Large files | Limited by memory | Larger-than-memory |
| SQL support | N/A | Full SQL dialect |
Rule: Use Polars commands (sqlp, joinp, pivotp) for files > 100MB or complex queries.
For repeated SQL queries on a large CSV (> 10MB), consider converting it to Parquet with qsv_to_parquet. Parquet is a columnar format that speeds up repeated SQL queries in sqlp; use read_parquet('file.parquet') as the table source. DuckDB is the preferred engine for Parquet queries, and sqlp with SKIP_INPUT mode also works.
Note: sqlp can query CSV of any size directly. Parquet is an optimization for repeated queries, not a requirement.
Parquet works ONLY with sqlp and DuckDB. All other qsv commands require CSV/TSV/SSV input.
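To make the SKIP_INPUT pattern concrete, here is a minimal sketch that assembles such a call as an argument list. `sqlp_parquet_command` is a hypothetical helper, and `file.parquet` and the `LIMIT 10` tail are placeholders. It assumes the Parquet file was already produced (for example with qsv_to_parquet).

```python
def sqlp_parquet_command(parquet_path: str, sql_tail: str = "LIMIT 10") -> list[str]:
    """Build a `qsv sqlp SKIP_INPUT` call that reads Parquet via read_parquet()."""
    if "'" in parquet_path:
        # Keep the sketch simple: refuse paths that would break the SQL string literal.
        raise ValueError("path must not contain single quotes")
    sql = f"SELECT * FROM read_parquet('{parquet_path}') {sql_tail}".strip()
    return ["qsv", "sqlp", "SKIP_INPUT", sql]
```

Passing the SQL as a single list element avoids shell quoting problems when the list is handed to `subprocess.run`.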
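Memory-intensive commands (🤯, load the entire file): dedup, reverse, sort, stats (with extended stats), table, transpose
Memory proportional to column cardinality: frequency, join, schema, tojsonl
Streaming: everything else (select, search, slice, replace, count, etc.)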
File size?
├── < 10MB: Any command works fine
├── 10MB - 100MB:
│   ├── Always: index first
│   ├── Repeated SQL: consider Parquet with qsv_to_parquet
│   ├── Prefer: streaming commands
│   └── OK: memory-intensive if < available RAM
├── 100MB - 1GB:
│   ├── Always: index + stats cache first
│   ├── Repeated SQL: consider Parquet with qsv_to_parquet
│   ├── Prefer: Polars commands (sqlp, joinp, pivotp)
│   ├── Avoid: sort, reverse, table (load entire file)
│   └── Alternative: sqlp with ORDER BY LIMIT instead of sort
└── > 1GB:
    ├── Must: index + stats cache
    ├── Repeated SQL: convert to Parquet with qsv_to_parquet
    ├── Must: Polars commands only for joins/queries
    ├── Avoid: all 🤯 commands
    └── Consider: split into chunks, process, cat rows
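The decision tree above can also be expressed as a lookup function. This is a sketch under stated assumptions: `recommend` is a hypothetical name, sizes use decimal megabytes (the text does not say MB vs MiB), and a file of exactly 100MB or 1GB is placed in the larger bucket.

```python
MB = 1_000_000  # assumption: decimal megabytes
GB = 1_000 * MB


def recommend(size_bytes: int) -> list[str]:
    """Return the advice lines from the decision tree for a given file size."""
    if size_bytes < 10 * MB:
        return ["Any command works fine"]
    if size_bytes < 100 * MB:
        return [
            "Always: index first",
            "Repeated SQL: consider Parquet with qsv_to_parquet",
            "Prefer: streaming commands",
            "OK: memory-intensive if < available RAM",
        ]
    if size_bytes < 1 * GB:
        return [
            "Always: index + stats cache first",
            "Repeated SQL: consider Parquet with qsv_to_parquet",
            "Prefer: Polars commands (sqlp, joinp, pivotp)",
            "Avoid: sort, reverse, table (load entire file)",
            "Alternative: sqlp with ORDER BY LIMIT instead of sort",
        ]
    return [
        "Must: index + stats cache",
        "Repeated SQL: convert to Parquet with qsv_to_parquet",
        "Must: Polars commands only for joins/queries",
        "Avoid: all memory-intensive commands",
        "Consider: split into chunks, process, cat rows",
    ]
```

For example, `recommend(os.path.getsize("data.csv"))` would return the advice for a real file once `os` is imported.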
| Tip | Why |
|---|---|
| Use --output file.csv | Avoids stdout buffering overhead |
| Use count before stats | Fast row count for progress bars |
| Use select early in pipeline | Reduce columns = faster processing |
| Use --no-headers only when needed | Header detection is cheap |
| Use slice --len N for previews | Don't read entire file to inspect |
| Prefer joinp over join | Polars engine is significantly faster |
| Use frequency --limit N | Don't compute all unique values |
| Use stats --cardinality | Enables smart optimizations downstream |
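A few of these tips can be illustrated as command builders. The helper names, the column names, and the file names below are hypothetical; only the flags named in the table (`--output`, `--len`, `--limit`) and the `select`, `slice`, and `frequency` subcommands come from the text.

```python
def narrow_command(csv_path: str, columns: list[str], out_path: str) -> list[str]:
    """Select columns early and write with --output instead of stdout."""
    return ["qsv", "select", ",".join(columns), csv_path, "--output", out_path]


def preview_command(csv_path: str, rows: int = 10) -> list[str]:
    """Preview the first rows without reading the whole file."""
    return ["qsv", "slice", "--len", str(rows), csv_path]


def top_values_command(csv_path: str, limit: int = 10) -> list[str]:
    """Cap the number of values reported per column."""
    return ["qsv", "frequency", "--limit", str(limit), csv_path]
```

A pipeline would typically run `narrow_command` first and point the later commands at its smaller output file.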
The MCP server limits concurrent qsv operations (default: 1). For multiple independent files, the agent can issue separate tool calls.
QSV_MCP_OPERATION_TIMEOUT_MS)