Validates CSV/TSV/Excel files and data analyses for quality, completeness, uniqueness, accuracy, consistency, outliers, and bias using qsv stats and frequency tools.
From qsv-data-wrangling. Install with `npx claudepluginhub dathere/qsv --plugin qsv-data-wrangling`. This skill is limited to using the qsv tools referenced below.
Validate data files and analyses for accuracy, methodology, and potential biases before sharing with stakeholders. Generates a confidence assessment and improvement suggestions.
Cowork note: If relative paths don't resolve, call qsv_get_working_dir and qsv_set_working_dir to sync the working directory.
The input can be a data file, an analysis based on it, or both.
If a data file is provided, run these checks using qsv:
a. Index and profile: Run qsv_index, then qsv_stats with cardinality: true and stats_jsonl: true, and qsv_sniff to understand the data.
b. Completeness: Read .stats.csv — check nullcount and sparsity for each column. Flag columns with sparsity > 0.5.
c. Uniqueness: Compare cardinality to row count from qsv_count. Flag key columns (ID, email) where cardinality < row count. Run qsv_command with command: "dedup" and args: ["--dupes-output", "dupes.csv"] to find exact duplicates.
d. Validity: Check type column in stats — flag String columns that should be numeric. Run qsv_command with command: "validate" against a JSON Schema if available.
e. Consistency: Run qsv_frequency with limit: 20 on categorical columns — look for case variants ("NYC" vs "nyc"), inconsistent formats, unexpected values.
f. Accuracy: Read .stats.csv for min, max, mean, stddev — flag implausible ranges (negative ages, latitude > 90, future dates). Run qsv_moarstats with advanced: true — check outliers_percentage > 5%, kurtosis > 10 (extreme outliers).
g. Distribution sanity: Read moarstats columns for deeper validation:
   - median_mean_ratio — if < 0.8 or > 1.2, the distribution is significantly skewed; verify the mean isn't misleading
   - winsorized_mean_25pct vs mean — large divergence (> 10%) confirms outliers are distorting the average
   - mad (median absolute deviation) — more robust than stddev for outlier detection; if mad_stddev_ratio > 0.8, stddev is reasonably reliable
   - jarque_bera_pvalue — if < 0.05, the data is NOT normally distributed; flag any analysis that assumes normality
   - mode_count — if the mode accounts for > 50% of values, investigate whether this reflects a data-entry default or missing-value masking
h. Join integrity (if multiple files): Run qsv_joinp with --left-anti to find orphaned foreign keys.
i. Injection screening: Run qsv_command with command: "searchset" and args: ["--flag", "injection_match", "${CLAUDE_PLUGIN_ROOT}/resources/injection-regexes.txt"] to scan for malicious payloads.
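The completeness and uniqueness checks in steps (b) and (c) can also be sketched in plain Python, independent of qsv; the function name is ours, and the 0.5 sparsity threshold mirrors the steps above:

```python
import csv  # used in the file-based usage example below

def profile_columns(rows, key_columns=(), sparsity_threshold=0.5):
    """Flag sparse columns and key columns containing duplicates.

    rows: list of dicts, e.g. from csv.DictReader.
    """
    flags = []
    n = len(rows)
    if n == 0:
        return ["no rows"]
    for col in rows[0]:
        # Completeness: share of empty/whitespace-only values.
        nulls = sum(1 for r in rows if not (r[col] or "").strip())
        if nulls / n > sparsity_threshold:
            flags.append(f"{col}: sparsity {nulls / n:.0%} exceeds {sparsity_threshold:.0%}")
        # Uniqueness: cardinality below row count means duplicates.
        if col in key_columns:
            cardinality = len({r[col] for r in rows})
            if cardinality < n:
                flags.append(f"{col}: {n - cardinality} duplicate value(s) in key column")
    return flags

# Usage on a real file:
# with open("data.csv", newline="") as f:
#     print(profile_columns(list(csv.DictReader(f)), key_columns=("id", "email")))
```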
If an analysis is provided, examine it for methodological soundness.
Systematically review against these pitfalls:
| Pitfall | How to Detect with qsv | Red Flag |
|---|---|---|
| Join explosion | qsv_count before and after join | Row count increased after join |
| Survivorship bias | qsv_frequency on status/lifecycle columns | Missing churned/deleted/failed entities |
| Incomplete period | qsv_sqlp to check date ranges | Partial periods compared to full periods |
| Denominator shifting | qsv_sqlp to verify denominator consistency | Definition changed between periods |
| Average of averages | qsv_sqlp to recalculate from raw data | Pre-aggregated averages with unequal group sizes |
| Selection bias | qsv_frequency on segment definitions | Segments defined by the outcome being measured |
Spot-check using qsv_sqlp:

```sql
-- Recompute a reported total from the raw data
SELECT SUM(subtotal) AS check_total FROM data;

-- Percentage columns should sum to ~100
SELECT SUM(pct) FROM data;
```

| Metric Type | Sanity Check via qsv |
|---|---|
| Counts | qsv_count — does it match known figures? |
| Sums/averages | qsv_stats — are min/max/mean in plausible range? |
| Rates | qsv_sqlp — are values between 0% and 100%? |
| Distributions | qsv_frequency — do segment percentages sum to ~100%? |
| Growth rates | qsv_sqlp — is 50%+ MoM growth realistic? |
| Outliers | qsv_moarstats — outliers_percentage, kurtosis |
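Several rows of this table can be automated with a small helper. A minimal sketch (the function name and the 1% tolerance are ours, not part of qsv):

```python
def sanity_check(rates=(), distribution=(), tolerance=0.01):
    """Warn about rates outside [0, 1] and distributions not summing to ~100%.

    rates: iterable of (name, value) pairs, values as fractions.
    distribution: iterable of segment shares, as fractions.
    """
    warnings = []
    for name, value in rates:
        if not 0.0 <= value <= 1.0:
            warnings.append(f"rate {name}={value} outside [0, 1]")
    distribution = list(distribution)
    if distribution:
        total = sum(distribution)
        if abs(total - 1.0) > tolerance:
            warnings.append(f"distribution sums to {total:.2%}, not ~100%")
    return warnings
```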
If the analysis includes charts, review whether the visualizations accurately represent the underlying data.
Provide specific, actionable improvement suggestions.
Rate the analysis on a 3-level scale:
Ready to share — Analysis is methodologically sound, calculations verified, caveats noted. Minor suggestions for improvement but nothing blocking.
Share with noted caveats — Analysis is largely correct but has specific limitations or assumptions that must be communicated to stakeholders. List the required caveats.
Needs revision — Found specific errors, methodological issues, or missing analyses that should be addressed before sharing. List the required changes with priority order.
- qsv_sniff to verify format and encoding.
- nullcount/sparsity in .stats.csv.
- qsv_count before/after joins, dedup --dupes-output.
- qsv_count after joins.
- qsv_sqlp.
- min/max in .stats.csv.
- qsv_sqlp to check period-over-period.

A many-to-many join silently multiplies rows, inflating counts and sums. Detect: qsv_count before and after join — if the count increased, investigate the join relationship. Prevent: Use COUNT(DISTINCT id) instead of COUNT(*) when counting entities through joins.
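The join-explosion detect/prevent advice can be illustrated in-memory with synthetic Python data (field names are made up):

```python
orders = [{"id": 1, "cust": "a"}, {"id": 2, "cust": "a"}]
# Two payment rows per order: a naive join multiplies rows.
payments = [
    {"order_id": 1, "amt": 10}, {"order_id": 1, "amt": 5},
    {"order_id": 2, "amt": 20}, {"order_id": 2, "amt": 7},
]

joined = [{**o, **p} for o in orders for p in payments if p["order_id"] == o["id"]]

rows_before, rows_after = len(orders), len(joined)
assert rows_after > rows_before  # red flag: the join exploded

naive_count = len(joined)                        # COUNT(*): wrong entity count
distinct_count = len({r["id"] for r in joined})  # COUNT(DISTINCT id): correct
```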
Analyzing only entities that exist today, ignoring churned/deleted/failed ones. Detect: qsv_frequency on status columns — are all lifecycle states represented? Prevent: Ask "who is NOT in this dataset?" before drawing conclusions.
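The "are all lifecycle states represented?" question reduces to a set difference. A sketch in Python, where the expected state list is illustrative:

```python
from collections import Counter

EXPECTED_STATES = {"active", "churned", "deleted", "failed"}

def missing_lifecycle_states(status_column):
    """States expected in the data but absent: a survivorship-bias signal."""
    seen = set(Counter(status_column))  # Counter doubles as a frequency report
    return EXPECTED_STATES - seen
```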
Comparing a partial period to a full period. Detect: qsv_sqlp to check min/max dates per period. Prevent: Filter to complete periods or compare same number of days.
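One way to enforce the "complete periods only" rule, sketched in Python for ISO YYYY-MM-DD date strings. This uses a strict heuristic of our own: a month counts as complete only if every calendar day appears at least once, which may be too strict for sparse data:

```python
import calendar
from collections import defaultdict

def complete_months(dates):
    """Return (year, month) pairs where every calendar day has data."""
    days_seen = defaultdict(set)
    for d in dates:  # "YYYY-MM-DD"
        year, month, day = map(int, d.split("-"))
        days_seen[(year, month)].add(day)
    return sorted(
        ym for ym, days in days_seen.items()
        if len(days) == calendar.monthrange(*ym)[1]
    )
```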
The denominator changes between periods, making rates incomparable. Detect: qsv_sqlp to verify denominator definition consistency. Prevent: Use consistent definitions across all compared periods.
Averaging pre-computed averages gives wrong results when group sizes differ. Detect: Compare qsv_stats mean against qsv_sqlp weighted average. Prevent: Always aggregate from raw data.
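A tiny numeric example of why this fails (numbers are synthetic):

```python
# Group A: 2 values averaging 10; Group B: 8 values averaging 100.
groups = {"A": [10, 10], "B": [100] * 8}

group_means = [sum(v) / len(v) for v in groups.values()]
average_of_averages = sum(group_means) / len(group_means)  # 55.0, wrong

all_values = [x for v in groups.values() for x in v]
true_mean = sum(all_values) / len(all_values)              # 82.0, correct
```

The unweighted average of averages treats a 2-row group and an 8-row group as equal, which is exactly the distortion that aggregating from raw data avoids.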
Trend reverses when data is aggregated vs. segmented. Detect: qsv_sqlp GROUP BY at different granularity levels — does the conclusion change? Prevent: Always check results at segment level before aggregating.
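This reversal is Simpson's paradox. A minimal synthetic illustration, where variant A wins within every segment but loses in the aggregate:

```python
# (successes, trials) per (variant, segment); counts are synthetic
data = {
    ("A", "easy"): (8, 10),   ("A", "hard"): (20, 100),
    ("B", "easy"): (70, 100), ("B", "hard"): (1, 10),
}

def rate(variant, segment=None):
    """Success rate for a variant, optionally within one segment."""
    cells = [v for k, v in data.items()
             if k[0] == variant and segment in (None, k[1])]
    wins = sum(w for w, _ in cells)
    trials = sum(t for _, t in cells)
    return wins / trials

# A wins within every segment...
assert rate("A", "easy") > rate("B", "easy")
assert rate("A", "hard") > rate("B", "hard")
# ...but loses in the aggregate, because its trials skew toward "hard".
assert rate("A") < rate("B")
```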
## Validation Report
### Overall Assessment: [Ready to share | Share with caveats | Needs revision]
### Data Quality Summary
- File: [format, rows, columns, encoding]
- Completeness: [null rates, gaps found]
- Uniqueness: [duplicates found, cardinality issues]
- Validity: [type mismatches, schema violations]
- Accuracy: [outliers, implausible ranges]
### Methodology Review
[Findings about approach, data selection, definitions]
### Issues Found
1. [Severity: High/Medium/Low] [Issue description and impact]
2. ...
### Calculation Spot-Checks
- [Metric]: [Verified / Discrepancy found]
- ...
### Visualization Review
[Any issues with charts or visual presentation]
### Suggested Improvements
1. [Improvement and why it matters]
2. ...
### Required Caveats for Stakeholders
- [Caveat that must be communicated]
- ...
- For a quick profile without full validation, use /data-profile instead.
- If data quality issues are found, use /data-clean to fix them and re-validate.