Joins two tabular datasets using qsv tools with strategy selection: joinp for fast equi-joins, sqlp for complex conditions, join for low-memory streaming.
From qsv-data-wranglingnpx claudepluginhub dathere/qsv --plugin qsv-data-wranglingThis skill is limited to using the following tools:
Enables AI agents to execute x402 payments with per-task budgets, spending controls, and non-custodial wallets via MCP tools. Use when agents pay for APIs, services, or other agents.
Join two tabular data files on common columns.
Cowork note: If relative paths don't resolve, call
qsv_get_working_dirandqsv_set_working_dirto sync the working directory.
| Scenario | Best Tool | Why |
|---|---|---|
| Standard equi-join | joinp | Polars engine, fastest |
| Non-equi join (>, <, BETWEEN) | sqlp | SQL supports complex conditions |
| Cross join / cartesian | sqlp | CROSS JOIN syntax |
| Memory-constrained | join | Streaming, lower memory |
| Fuzzy/approximate match | joinp --asof | Nearest-match join |
Index both files: Run qsv_index on both files for fast random access.
Inspect both files: Run qsv_headers on both files to identify column names. Determine which columns to join on.
Profile join columns: Run qsv_stats with cardinality: true, stats_jsonl: true on both files. Check the cardinality of join columns to determine optimal table order.
Choose strategy:
joinp: smaller cardinality table should be on the right for best performancesqlpjoinp --asofExecute join: Use qsv_joinp for standard joins:
joinp --left/--inner/--full/--cross
columns1: "id"
input1: "file1.csv"
columns2: "id"
input2: "file2.csv"
Or use qsv_sqlp for complex joins:
SELECT a.*, b.col1, b.col2
FROM file1 a
JOIN file2 b ON a.id = b.id AND a.date BETWEEN b.start_date AND b.end_date
For ASOF (nearest-match) joins, use qsv_joinp with --asof:
joinp
columns1: "date"
input1: "events.csv"
columns2: "date"
input2: "reference.csv"
asof: true
strategy: "backward"
allow_exact_matches: true
strategy: "backward" (default) — match to the last right row with key < left keystrategy: "forward" — match to the first right row with key > left keystrategy: "nearest" — match to the numerically closest row (supports --tolerance)--left_by/--right_by to restrict matching within subgroups (e.g., per jurisdiction)allow_exact_matches: true to include equal keys (<=, >=); default is strict inequality (<, >)Clean up result: Use qsv_select to remove duplicate join columns or unnecessary columns from the result.
Verify: Run qsv_count on the result. Compare with input counts to validate join behavior:
Before executing a join, read .stats.csv for both files and validate:
| Check | Stats Column | Red Flag | Action |
|---|---|---|---|
| Type match | type | Join columns have different types (e.g., Integer vs String) | Cast one column before joining: sqlp with CAST(col AS INTEGER) |
| Null density | nullcount, sparsity | sparsity > 0.3 on join column | Nulls don't match — expect unmatched rows; consider filtering nulls first |
| Value overlap | min, max | Non-overlapping ranges across files | No rows will match — verify correct join column |
| Skew detection | mode, mode_count | One value dominates (mode_count > 50% of rows) | Join will be heavily skewed many-to-one; verify this is expected |
| Uniqueness | uniqueness_ratio | Both files have uniqueness_ratio < 1.0 on join column | Many-to-many join risk — expect row explosion; verify with qsv_count after |
| Outlier keys | outliers_percentage | outliers_percentage > 5% on numeric join column | Outlier keys may not match across files; consider trimming first |
| Type | joinp Flag | SQL | Behavior |
|---|---|---|---|
| Inner | (default) | JOIN | Only matching rows |
| Left | --left | LEFT JOIN | All left + matching right |
| Full outer | --full | FULL OUTER JOIN | All rows from both |
| Cross | --cross | CROSS JOIN | Cartesian product |
| Anti | --anti | NOT IN / NOT EXISTS | Left rows without match |
| Semi | --semi | EXISTS | Left rows with match (no right cols) |
| ASOF | --asof | (use joinp) | Nearest-key match (temporal/numeric) |
joinp uses the Polars engine and is significantly faster than join for large filesjoinp optimize join executioncolumns1: "col1,col2"columns1: "id", columns2: "customer_id"joinp handles null values in join columns (nulls don't match by default)--try-parsedates — no need to pass it explicitly--left_by and --right_by (e.g., match nearest date per jurisdiction)--tolerance option (nearest strategy only) limits how far the nearest match can be: use duration strings for dates (1d, 30d, 365d) or positive integers for numeric keys--no-sort is set