Help us improve
Share bugs, ideas, or general feedback.
From qsv-data-wrangling
References data quality dimensions with qsv checks and provides remediation decision tree for tabular CSV assessment and fixes.
npx claudepluginhub dathere/qsv --plugin qsv-data-wranglingHow this skill is triggered — by the user, by Claude, or both
Slash command
/qsv-data-wrangling:data-qualityThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
For the full step-by-step profiling workflow, use the `/data-profile` command. This skill provides quick-reference guidance for quality assessment and remediation decisions.
Profiles CSV/TSV/Excel files: detects format, counts rows/headers, computes basic/advanced statistics (kurtosis, Gini, outliers), shows top value distributions.
Performs data quality checks for completeness, uniqueness, freshness, volume, and distribution drift. Generates scorecards and HTML reports for pipelines.
Profiles tables or files (CSV, Excel, Parquet, JSON) to reveal shape, null rates, column distributions, top values, percentiles, data quality issues, and column categories.
Share bugs, ideas, or general feedback.
For the full step-by-step profiling workflow, use the /data-profile command. This skill provides quick-reference guidance for quality assessment and remediation decisions.
| Dimension | Key Question | Primary Check | Red Flag |
|---|---|---|---|
| Completeness | Missing values? | stats — nullcount, sparsity | Sparsity > 0.5 |
| Uniqueness | Unwanted duplicates? | stats --cardinality vs row count | Key column cardinality < row count |
| Validity | Correct formats/types? | stats — type; validate schema.json | String type on numeric column |
| Consistency | Uniform formats? | frequency — case variants; sniff — encoding | Same value in different cases |
| Accuracy | Plausible values? | stats — min/max/stddev | Values > 3 stddev from mean |
| Column Name Quality | Headers safe & descriptive? | safenames --verify | Spaces, special chars, or duplicates in headers |
| Conformity | Values follow standards? | searchset with domain regex | Non-standard codes (country, state, zip, phone) |
| Referential Integrity | Foreign keys valid? | joinp --left-anti | Orphaned references across related files |
| Injection Safety | Malicious payloads? | searchset with injection regex | Formula/SQL injection patterns in cells |
| Documentation | Dataset described? | describegpt --all | No Data Dictionary or Description |
When a quality issue is found, choose the right fix:
| Problem | Severity | Fix Command | When to Skip |
|---|---|---|---|
| Ragged rows | High | fixlengths | Never — breaks downstream tools |
| Wrong encoding | High | input | File is already UTF-8 (check with sniff) |
| Unsafe column names | Medium | safenames | Headers already safe (no spaces/special chars) |
| Leading/trailing whitespace | Medium | sqlp with TRIM(col) | Stats show no difference between min/max lengths and trimmed values |
| Duplicate rows | Medium | dedup (or extdedup for >1GB) | stats --cardinality on key columns shows all unique |
| Inconsistent case | Low | sqlp with UPPER(col) or LOWER(col) | frequency shows no case variants |
| Empty values | Low | sqlp with COALESCE(NULLIF(col, ''), 'N/A') | Nulls are semantically meaningful |
| Non-conforming values | Medium | searchset + search --flag | No domain standard applies |
| Orphaned foreign keys | Medium | joinp --left-anti | Single-file dataset with no references |
| Injection payloads | High | searchset with injection regex + sanitize | Data is internal-only and never opened in spreadsheets or loaded into databases |
| Invalid rows | Low | validate schema.json + filter | No schema available |
Always apply fixes in this order to avoid cascading issues:
1. input (encoding — must be UTF-8 before anything else)
2. safenames (headers — fixes names before column references)
3. fixlengths (structure — ensures consistent field counts)
4. sqlp with TRIM() (whitespace — clean values before dedup)
5. dedup (duplicates — remove after trimming so "foo " and "foo" match)
6. validate (validation — check against schema last)
After running stats --cardinality --stats-jsonl (basic moarstats auto-runs), read the .stats.csv cache to assess quality in one pass:
| Cache Column | Quality Signal |
|---|---|
nullcount | Completeness — 0 is ideal |
sparsity | Completeness — ratio of nulls (0.0–1.0) |
cardinality | Uniqueness — compare to row count |
type | Validity — check expected types |
min / max | Accuracy — plausible range? |
mean / stddev | Accuracy — outlier detection (>3σ) |
outliers_total_cnt | Accuracy — from moarstats; outlier count per column |
mode | Consistency — dominant value expected? |
moarstats --advanced)Run moarstats --advanced to enrich the cache with distribution shape metrics:
| Cache Column | Quality Signal |
|---|---|
kurtosis | >3 heavy tails (outlier-prone), <3 light tails; >10 = extreme outliers |
bimodality_coefficient | >=0.555 suggests bimodal distribution (possible mixed populations) |
jarque_bera_pvalue | <0.05 = NOT normally distributed; flag analyses assuming normality |
gini_coefficient | Near 1 = extreme concentration; near 0 = uniform |
shannon_entropy | Low = concentrated values; high = diverse |
winsorized_mean | Compare to mean — large difference signals outlier influence |
median_mean_ratio | <0.8 or >1.2 = significantly skewed; mean may be misleading |
range_stddev_ratio | Very high = extreme outliers relative to variability |
cv | >100% = high relative variability; data is highly spread relative to mean |
mad_stddev_ratio | >0.8 = stddev is reliable; <<0.8 = outliers inflating stddev |
mode_zscore | Far from 0 = mode is atypical; possible mixed populations |