Profiles CSV/TSV/Excel files: detects format, counts rows/headers, computes basic/advanced statistics (kurtosis, Gini, outliers), shows top value distributions.
From qsv-data-wrangling (npx claudepluginhub dathere/qsv --plugin qsv-data-wrangling). This skill is limited to using the following tools:
Profile the given tabular data file to understand its structure, types, and distributions.
Cowork note: If relative paths don't resolve, call qsv_get_working_dir and qsv_set_working_dir to sync the working directory.
1. Index: Run qsv_index on the file for fast random access in subsequent steps.
2. Detect format: Run qsv_sniff on the file to detect delimiter, encoding, preamble, and row count estimate.
3. Count rows: Run qsv_count to get the exact row count.
4. Get headers: Run qsv_headers to list all column names and positions.
5. Compute statistics: Run qsv_stats with cardinality: true and stats_jsonl: true to generate full column statistics and cache them. Include --everything for comprehensive stats (mean, median, mode, stddev, quartiles, etc.). Basic moarstats auto-runs to enrich the cache with ~18 additional columns.
6. Advanced statistics: Run qsv_moarstats with advanced: true (omit output_file — it updates the stats cache in-place by default). This enriches the stats cache with advanced measures such as kurtosis, bimodality, Gini coefficient, and outlier counts.
7. Show distributions: Run qsv_frequency with limit: 10 to show top value distributions for each column. For high-cardinality columns (cardinality close to row count), note them as likely unique identifiers.
8. Optional: Bivariate correlations (if multiple numeric columns): Run qsv_moarstats with bivariate: true to compute pairwise Pearson/Spearman/Kendall correlations, covariance, and mutual information. Output goes to <FILESTEM>.stats.bivariate.csv. Reveals hidden relationships between columns.
9. Optional: Robust statistics (if data is messy/heavy-tailed and < 100K rows): Run qsv_command with command: "pragmastat" for Hodges-Lehmann center and Shamos spread — robust estimators that tolerate up to 29% corrupted data. Especially useful when mean/stddev are misleading due to outliers. Warning: pragmastat computes median-of-pairwise statistics (O(n²) complexity) and becomes very slow on large datasets. For files > 100K rows, use --subsample 10000 for ~100x speedup, or combine --subsample 10000 --no-bounds for ~200x speedup.
10. Screen for PII/PHI: Run qsv_command with command: "searchset" and args: ["--flag", "pii_match", "${CLAUDE_PLUGIN_ROOT}/resources/pii-regexes.txt"] to scan for sensitive data patterns (SSN, credit cards, email, phone, IBAN). Report any columns with matches.
11. Screen for injection: Run qsv_command with command: "searchset" and args: ["--flag", "injection_match", "${CLAUDE_PLUGIN_ROOT}/resources/injection-regexes.txt"] to scan for CSV/formula injection and SQL injection payloads. Report any columns with matches.
12. Preview data: Run qsv_slice with len: 5 to show the first 5 rows as a sample.
13. Document: Generate a Data Dictionary, Dataset Description, and Tags as JSON.
13a) Primary — use describegpt: Run qsv_describegpt with all: true, format: "JSON" and output: "<filestem>.describegpt.json". If the user provided a Tag Vocabulary file, also pass tag_vocab: "<vocab_file>". This produces a structured JSON file with three top-level objects: Dictionary, Description, and Tags. Each of these contains a response (the main content), optional reasoning, and token_usage metadata. The data dictionary itself is under Dictionary.response.fields, as an array of field descriptors with keys like name, null_count, cardinality, min, max, mean, and stddev. Present the results to the user. When MCP sampling is unavailable but the tool still returns prompts, follow those prompts by issuing a follow-up call with _llm_responses instead of using the agent fallback.
13b) Fallback — agent generation: If describegpt encounters a tool error or times out, or if following its prompts via _llm_responses is not possible, fall back to generating the same artifacts from the statistics (steps 5-6) and frequency distributions (step 7). Save the result as <filestem>.profile.json using the same canonical structure as describegpt, for example:
{
  "Dictionary": {
    "response": {
      "fields": [
        {
          "name": "column_name",
          "type": "Integer",
          "label": "Column Name",
          "description": "1-5 sentence description informed by type, stats, and frequency distribution",
          "null_count": 0,
          "cardinality": 100,
          "min": "0",
          "max": "999",
          "mean": "450.5",
          "stddev": "120.3"
        }
      ],
      "enum_threshold": 20,
      "num_examples": 5,
      "truncate_str": 80,
      "attribution": "agent_fallback"
    },
    "reasoning": "",
    "token_usage": { "prompt": 0, "completion": 0, "total": 0, "elapsed": 0 }
  },
  "Description": {
    "response": "3-10 sentences describing the dataset: what it represents, scope, key characteristics, quality issues, and potential use cases.",
    "reasoning": "",
    "token_usage": { "prompt": 0, "completion": 0, "total": 0, "elapsed": 0 }
  },
  "Tags": {
    "response": ["tag1", "tag2", "tag3"],
    "reasoning": "",
    "token_usage": { "prompt": 0, "completion": 0, "total": 0, "elapsed": 0 }
  }
}
For the fallback dictionary entries (under Dictionary.response.fields):
- label: Human-readable version of the field name (e.g., customer_id → Customer ID)
- description: 1-5 sentence description informed by type, statistics, and frequency distribution
- Statistics fields (null_count, cardinality, min, max, mean, sortiness, stddev, variance, cv, sparsity) where applicable

For the fallback tags (under Tags.response): Infer 5-15 semantic tags based on column names, data types, value distributions, and domain characteristics. If a controlled Tag Vocabulary is provided, constrain choices to that vocabulary only.
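The robust estimators behind the pragmastat step (step 9) are simple to state: the Hodges-Lehmann center is the median of all pairwise (Walsh) averages, and the Shamos spread is the median of all pairwise absolute differences. A minimal Python sketch of the idea — illustrating why the naive computation is O(n²), not a reimplementation of qsv's pragmastat command (whose exact conventions may differ):

```python
from itertools import combinations
from statistics import median

def hodges_lehmann(xs):
    """Center estimate: median of all pairwise means plus the values themselves."""
    walsh = [(a + b) / 2 for a, b in combinations(xs, 2)] + list(xs)
    return median(walsh)

def shamos(xs):
    """Spread estimate: median of all pairwise absolute differences."""
    return median(abs(a - b) for a, b in combinations(xs, 2))

data = [1, 2, 3, 4, 100]              # one gross outlier
assert hodges_lehmann(data) == 3      # stays with the bulk of the data
assert shamos(data) == 2.5            # unlike stddev, barely moved by 100
```

With the same data, the mean (22) and stddev (~39) are dominated by the single outlier, which is exactly the case where the skill recommends pragmastat.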
When profiling, assess these quality dimensions:
| Check | Command | What to Look For |
|---|---|---|
| Null counts | stats --cardinality --stats-jsonl | nullcount column > 0 |
| Empty strings | frequency --limit 10 | Empty string in top values |
| Sparsity | stats | sparsity field (ratio of nulls) |
Red flag: Sparsity > 0.5 means more than half the values are null.
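The sparsity measure above is just the null ratio. A toy illustration with a column as a Python list, treating empty strings as null-like (an assumption for this sketch; qsv applies its own null definition):

```python
def sparsity(values):
    """Fraction of values that are null-like (None or empty string)."""
    if not values:
        return 0.0
    nulls = sum(1 for v in values if v is None or v == "")
    return nulls / len(values)

col = ["a", "", None, "b", "", None]
assert sparsity(col) > 0.5   # 4 of 6 values are null-like → red flag
```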
| Check | Command | What to Look For |
|---|---|---|
| Duplicate rows | dedup --dupes-output dupes.csv | Non-empty dupes file |
| Cardinality | stats --cardinality | cardinality vs row count |
| Unique ratio | stats | If cardinality = row count, column is unique |
Red flag: Key columns (ID, email) with cardinality < row count.
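The cardinality-vs-row-count comparison can be mimicked with a Counter: any key whose count exceeds 1 is duplicated, and cardinality below the row count on a key column is the red flag. Illustrative only; qsv does this via stats --cardinality and dedup:

```python
from collections import Counter

ids = ["A1", "A2", "A3", "A2", "A4", "A2"]
counts = Counter(ids)
dupes = {k: n for k, n in counts.items() if n > 1}

assert len(counts) < len(ids)   # cardinality < row count → not unique
assert dupes == {"A2": 3}       # the offending key and its multiplicity
```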
| Check | Command | What to Look For |
|---|---|---|
| Schema validation | validate schema.json | Validation error count |
| Data types | stats | type column (String, Integer, Float, Date, etc.) |
| Format patterns | search --flag | Rows not matching expected regex |
| Value ranges | stats | min, max outside expected range |
Red flag: Type column shows "String" for what should be numeric data.
| Check | Command | What to Look For |
|---|---|---|
| Date formats | stats | Mixed date types in same column |
| Case consistency | frequency | "NYC" vs "nyc" vs "Nyc" as separate values |
| Encoding | sniff | Non-UTF-8 encoding detected |
| Delimiters | sniff | Unexpected delimiter or quoting |
| Row lengths | fixlengths | Pads short rows to match longest row; compare count before/after to detect ragged rows |
Red flag: Frequency shows same value in different cases/formats.
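The "NYC" vs "nyc" check boils down to grouping frequency values by a case-folded key and flagging any group with more than one spelling — a sketch of the idea, not qsv's implementation:

```python
from collections import defaultdict

values = ["NYC", "nyc", "Nyc", "Boston", "BOSTON", "Chicago"]
groups = defaultdict(set)
for v in values:
    groups[v.casefold()].add(v)   # collapse case variants onto one key

# groups whose members differ only by case are consistency red flags
inconsistent = {k: vs for k, vs in groups.items() if len(vs) > 1}
assert set(inconsistent) == {"nyc", "boston"}
```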
| Check | Command | What to Look For |
|---|---|---|
| Statistical outliers | stats | mean, stddev - values > 3 stddev from mean |
| Outlier counts | moarstats | outliers_total_cnt, outliers_percentage > 5% |
| Distribution shape | moarstats --advanced | kurtosis > 3 (heavy tails), bimodality_coefficient >= 0.555 (bimodal) |
| Inequality | moarstats --advanced | gini_coefficient near 1 (extreme concentration) |
| Value distributions | frequency --limit 20 | Unexpected dominant values |
| Range checks | stats | min/max outside plausible range |
| Cross-field checks | sqlp | SQL WHERE clauses for business rules |
Red flag: Latitude > 90 or < -90, negative ages, future birth dates, kurtosis > 10 (extreme outliers).
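Two of the measures above have precise definitions: a value is a 3-sigma outlier when |x − mean| > 3·stddev, and the Gini coefficient is the mean absolute difference over all pairs divided by twice the mean. A minimal sketch using population stddev (qsv's stats/moarstats may use different conventions, e.g. sample stddev):

```python
from statistics import mean, pstdev

def three_sigma_outliers(xs):
    m, s = mean(xs), pstdev(xs)
    return [x for x in xs if abs(x - m) > 3 * s]

def gini(xs):
    """Mean absolute pairwise difference / (2 * mean); 0 = equality, ~1 = concentration."""
    n, m = len(xs), mean(xs)
    mad = sum(abs(a - b) for a in xs for b in xs) / (n * n)
    return mad / (2 * m)

assert three_sigma_outliers([10] * 20 + [1000]) == [1000]
assert gini([5, 5, 5, 5]) == 0.0     # perfectly equal values
assert gini([0, 0, 0, 100]) == 0.75  # one value holds everything
```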
| Check | Command | What to Look For |
|---|---|---|
| Unsafe names | safenames --verify | Spaces, special chars, reserved words |
| Duplicate headers | headers | Same name appearing twice |
| Naming consistency | headers | Mixed conventions (camelCase vs snake_case) |
Red flag: Column names with spaces or special characters break downstream tools and SQL queries.
| Check | Command | What to Look For |
|---|---|---|
| Standard codes | searchset with domain regex file | Values not matching ISO country, state, zip patterns |
| Format adherence | search --flag with expected pattern | Phone numbers, emails, URLs not matching standard format |
| Controlled vocabularies | frequency | Unexpected values outside known valid set |
Red flag: A "country" column with free-text entries instead of ISO 3166 codes, or a "state" column mixing abbreviations and full names.
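The search --flag approach is simply "report values that fail an expected pattern". The same logic in Python, with a deliberately loose example pattern (any two uppercase letters — a placeholder, not a real ISO 3166-2 whitelist):

```python
import re

STATE = re.compile(r"^[A-Z]{2}$")   # hypothetical format rule for a "state" column

values = ["CA", "NY", "California", "tx", "WA"]
nonconforming = [v for v in values if not STATE.match(v)]

# free-text and wrong-case entries surface immediately
assert nonconforming == ["California", "tx"]
```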
| Check | Command | What to Look For |
|---|---|---|
| Orphaned foreign keys | joinp --left-anti | Rows in child file with no match in parent |
| Missing references | joinp --left-anti (reversed) | Parent records with no children (if expected) |
| Key overlap | sqlp | Cross-file key comparison via SQL |
Red flag: An orders file referencing customer IDs that don't exist in the customers file. Only applicable when profiling related files together.
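The joinp --left-anti check reduces to a set difference: child rows whose foreign key has no match in the parent's key set. A sketch with plain Python (qsv's joinp streams this over CSV files):

```python
customers = {"C1", "C2", "C3"}                      # parent keys
orders = [("O1", "C1"), ("O2", "C9"),
          ("O3", "C2"), ("O4", "C7")]               # (order_id, customer_id)

# left-anti join: keep only orders referencing a missing customer
orphans = [o for o in orders if o[1] not in customers]
assert orphans == [("O2", "C9"), ("O4", "C7")]
```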
Question: Does the data contain personally identifiable or protected health information?
Use searchset with a regex file to scan all columns for sensitive patterns:
qsv_command command: "searchset", input_file: "<file>", args: ["--flag", "pii_match", "${CLAUDE_PLUGIN_ROOT}/resources/pii-regexes.txt"]
The bundled ${CLAUDE_PLUGIN_ROOT}/resources/pii-regexes.txt detects:
| Pattern | Example |
|---|---|
| SSN | 123-45-6789 |
| Mastercard | 5100 1234 5678 9012 |
| Visa | 4111 1111 1111 1111 |
| American Express | 371449635398431 |
| IBAN | GB29NWBK60161331926819 |
| Email | user@example.com |
| US Phone | +1 (555) 123-4567 |
For PHI screening, use the bundled ${CLAUDE_PLUGIN_ROOT}/resources/phi-regexes.txt:
qsv_command command: "searchset", input_file: "<file>", args: ["--flag", "phi_match", "${CLAUDE_PLUGIN_ROOT}/resources/phi-regexes.txt"]
The bundled ${CLAUDE_PLUGIN_ROOT}/resources/phi-regexes.txt detects:
| Pattern | Example |
|---|---|
| MRN (Medical Record Number) | MRN123456 |
| DEA Number | AB1234563 |
| NPI (National Provider Identifier) | 1234567890 (broad — verify with Luhn check) |
| ICD-10-CM Diagnosis Code | J45.20, E11.9 |
| NDC (National Drug Code) | 0002-3456-78 |
For additional PHI patterns (e.g., MBI, state license numbers), create a custom regex file and pass it to searchset the same way.
Red flag: Any matches indicate PII/PHI exposure — flag columns for masking or removal before sharing.
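The "verify with Luhn check" caveat for NPI matches can be done directly: an NPI passes when the standard Luhn checksum holds over its 10 digits after prefixing the constant 80840. A self-contained sketch (the example NPI below is a commonly cited test value, not real provider data):

```python
def luhn_ok(digits: str) -> bool:
    """Standard Luhn checksum over a digit string."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def valid_npi(npi: str) -> bool:
    """NPIs are Luhn-checked with the card-issuer prefix 80840 prepended."""
    return len(npi) == 10 and npi.isdigit() and luhn_ok("80840" + npi)

assert valid_npi("1234567893")       # check digit 3 satisfies the checksum
assert not valid_npi("1234567890")
```

This is useful for pruning false positives: the NPI regex above is broad (any 10 digits), so Luhn verification separates plausible NPIs from arbitrary numbers.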
Question: Does the data contain CSV/formula injection or SQL injection payloads?
Malicious cell values can execute code when opened in spreadsheet applications (Excel, Google Sheets) or cause damage when loaded into databases without parameterized queries.
Use searchset with the bundled injection regex file to scan all columns:
qsv_command command: "searchset", input_file: "<file>", args: ["--flag", "injection_match", "${CLAUDE_PLUGIN_ROOT}/resources/injection-regexes.txt"]
The bundled ${CLAUDE_PLUGIN_ROOT}/resources/injection-regexes.txt detects:
CSV/Formula Injection:
| Pattern | Example | Risk |
|---|---|---|
| Starts with = | =CMD("calc") | Arbitrary command execution in Excel |
| Starts with + followed by a function | +CMD("calc") | Same as = in many spreadsheet apps (positive numbers/phone numbers excluded) |
| Starts with - followed by a function | -SUM(A1:A10) | Formula execution (negative numbers excluded) |
| Starts with @ | @SUM(A1:A10) | Excel function prefix |
| Starts with tab/CR | \t=CMD(...) | Bypasses naive prefix checks |
SQL Injection:
| Pattern | Example | Risk |
|---|---|---|
| SELECT...FROM | '; SELECT * FROM users-- | Data exfiltration |
| UNION SELECT | ' UNION SELECT password FROM users-- | Query hijacking |
| DROP TABLE/DATABASE | '; DROP TABLE users-- | Data destruction |
| INSERT INTO | '; INSERT INTO users VALUES(...)-- | Data tampering |
| DELETE FROM | '; DELETE FROM orders-- | Data deletion |
| UPDATE SET | '; UPDATE users SET role='admin'-- | Data modification |
| Tautology | ' OR 1=1-- | Authentication bypass |
| Stacked queries | '; DELETE FROM orders-- | Arbitrary SQL execution |
Red flag: Any matches indicate potential injection payloads — sanitize cells before sharing the file or loading into a database. For formula injection, prefix dangerous cells with a single quote (') or strip leading =+-@ characters.
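The quote-prefix sanitization recommended above is mechanical: if a cell starts with one of the dangerous prefixes (including the tab/CR bypass), prepend a single quote so spreadsheets treat it as text. A naive sketch — unlike the bundled regexes, it does not exempt plain negative numbers or phone numbers, so a production version needs those exclusions:

```python
DANGEROUS = ("=", "+", "-", "@", "\t", "\r")

def sanitize_cell(value: str) -> str:
    """Prefix a single quote so spreadsheet apps render the cell as text."""
    if value.startswith(DANGEROUS):   # startswith accepts a tuple of prefixes
        return "'" + value
    return value

assert sanitize_cell('=CMD("calc")') == '\'=CMD("calc")'
assert sanitize_cell("\t=CMD(1)") == "'\t=CMD(1)"   # tab-prefixed bypass caught
assert sanitize_cell("hello") == "hello"            # benign cells untouched
```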
Present a summary with:
- Data Dictionary: from describegpt (step 13a), or manually from the stats cache and frequency distributions as fallback (step 13b)
- Dataset Description and Tags: from describegpt --format JSON (step 13a) or the agent fallback (step 13b)

| Problem | Fix Command |
|---|---|
| Inconsistent case | sqlp with UPPER(col) or LOWER(col) |
| Leading/trailing whitespace | sqlp with TRIM(col) |
| Duplicate rows | dedup |
| Ragged rows | fixlengths |
| Unsafe column names | safenames |
| Non-conforming values | searchset + search --flag to identify, sqlp to fix |
| Orphaned foreign keys | joinp --left-anti to find, then remove or fix references |
| Injection payloads | searchset to detect + sqlp to sanitize (prefix with ' or strip leading =+-@) |
| Wrong encoding | input (normalizes to UTF-8) |
| Empty values | sqlp with COALESCE(NULLIF(col, ''), 'N/A') |
| Invalid rows | validate schema.json + filter |
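The COALESCE(NULLIF(col, '')) idiom from the table is standard SQL; a quick demonstration using Python's built-in sqlite3 (for portability of the example — in this workflow you would run the same expression through qsv's sqlp):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (city TEXT)")
con.executemany("INSERT INTO t VALUES (?)", [("NYC",), ("",), (None,)])

# NULLIF turns '' into NULL; COALESCE then substitutes the default
rows = con.execute(
    "SELECT COALESCE(NULLIF(city, ''), 'N/A') FROM t"
).fetchall()
assert rows == [("NYC",), ("N/A",), ("N/A",)]
```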