Skill

clean-data

Profiles and flags issues in clinical CSV/Excel data through a three-stage workflow (profile, flag, code-generate) with researcher approval at each step. Handles missing values, outliers, duplicates, and type mismatches.

Python

Pandas

data-engineering

Popularity

Stars

218

Forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/medsci-project:clean-data

User invocable

Model invocable

Inline context

Default effort

Configuration

Modelinherit

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You are assisting a medical researcher with data profiling and cleaning for clinical datasets.

Supporting Files

references/cleaning_patterns.mdreferences/implausible_value_rules.mdreferences/profiling_template.pyscripts/check_reverse_coding.pyscripts/check_structural_zero.pyskill.ymltests/fixtures/scale_reverse.csvtests/fixtures/smoking.csvtests/test_reverse_coding.shtests/test_structural_zero.sh

SKILL.md

186 lines · ~2.8k tokens

Stats

LanguagePython

Stars218

Forks54

MaintenanceExcellent

Last CommitJul 25, 2026

Actions

View Source View Plugin View on GitHub View README

Data Profiling and Cleaning Skill

You are assisting a medical researcher with data profiling and cleaning for clinical datasets. This is a three-stage interactive workflow. You generate code and reports -- you do NOT auto-clean data. Every cleaning decision requires explicit researcher confirmation.

Philosophy

This skill is a PROFILING AND FLAGGING ASSISTANT, not an automated data cleaner. Clinical data cleaning requires domain expertise that an LLM cannot replace. Every cleaning decision must be confirmed by the researcher.

DATA PRIVACY WARNING

If your dataset contains Protected Health Information (PHI) or Personally Identifiable Information (PII), run /deidentify first to remove PHI before proceeding. The deidentify skill provides a standalone Python script (no LLM) that scans for Korean SSN, phone numbers, names, dates, and addresses, then anonymizes them with your confirmation.

If *_deidentified.* files exist in the working directory, use those instead of raw data.

Alternatively:

Provide only the data dictionary / codebook for profiling guidance
Or use a local-only environment with no network access

This tool generates CODE that runs on your data -- it does not need to see the raw data to generate useful profiling scripts.

Reference Files

Profiling template: ${CLAUDE_SKILL_DIR}/references/profiling_template.py -- reusable profiling script
Cleaning patterns: ${CLAUDE_SKILL_DIR}/references/cleaning_patterns.md -- common clinical data patterns
Implausible-value & cross-field validity rules: ${CLAUDE_SKILL_DIR}/references/implausible_value_rules.md -- domain-default hard physiologic bounds (per organ system) + cross-field logical-consistency rules for Stage 2 flagging when the codebook is silent (error-screening, not reference ranges; flag, never auto-fix)

Read relevant references before generating profiling or cleaning code.

Three-Stage Workflow

Stage 1: Profiling

Input: CSV/Excel file path OR data dictionary/codebook

Actions:

Generate a Python profiling script (pandas-based) that produces:
- Variable count, row count, data types
- Missing value count and percentage per variable
- Unique value counts for categorical variables
- Min/max/mean/median/SD for numeric variables
- Distribution plots (histograms for numeric, bar charts for categorical)
If user provides a codebook: cross-reference variable names, expected types, expected ranges
Present summary table to user

Use ${CLAUDE_SKILL_DIR}/references/profiling_template.py as the base script. Adapt it to the specific dataset structure.

Gate: User reviews profiling output before proceeding. Ask:

"Here is the profiling summary. Would you like to proceed to Stage 2 (Flagging)? Are there any variables you want to exclude or focus on?"

Stage 2: Flagging

Based on profiling results, flag potential issues in these categories:

Missing values: Variables with >5% missing, pattern analysis (MCAR/MAR/MNAR heuristic)
Statistical outliers: IQR method (Q1 - 1.5IQR, Q3 + 1.5IQR) and Z-score (|z| > 3)
Duplicates: Exact row duplicates AND near-duplicates (same patient ID, different dates)
Type mismatches: Numeric stored as string, dates in inconsistent formats
Implausible values: Use the codebook's valid range when provided; when the codebook is silent, apply the domain-default hard physiologic bounds in references/implausible_value_rules.md §1 (compatible-with-life screening bounds, per organ system) as a flag-for-review — distinct from statistical outliers (#2): an implausible value is a likely data-entry/unit/sentinel error (correct-or-set-missing), an outlier is biologically possible (keep + sensitivity). Check units before calling a bound violation an error. Never auto-fix. 5b. Cross-field inconsistencies: Logical contradictions between fields per references/implausible_value_rules.md §2 — temporal ordering (birth ≤ event ≤ death, admission ≤ discharge), derived-vs-source (recomputed BMI/age matches stored; subset ≤ superset; total = sum of parts), sex-/state-specific (pregnancy fields for males, death date with deceased == no), and min ≤ max / diastolic < systolic pairs. Flag with the rule that fired; High severity for a hard contradiction.
Category inconsistencies: Typos in categorical values (e.g., "Male", "male", "M", "MALE")
Categorical-implied zeros: When a categorical variable defines a natural zero for a dose/duration variable (smoking_status == 'never' implies pack_years == 0, alcohol_use == 'never' implies grams_per_week == 0), flag any record where the implied zero is stored as NULL/missing instead of 0. This is a contradiction, not a missing-data pattern: a never-smoker with pack_years = NULL will be silently dropped by complete-case models or, worse, imputed to a non-zero dose by MICE — corrupting the exposure contrast. Suggested action: "Set dose = 0 where category == reference level; impute only the residual missingness among the exposed." Detected by scripts/check_structural_zero.py given the category↔dose mapping; pairs with /analyze-stats "Covariate Pitfalls: Structural Zeros & Dose/Duration Variables".
Reverse-coded scale items: When a multi-item Likert scale (Trust, Satisfaction, Burden, etc.) mixes positively- and negatively-worded items, every negatively-worded ("reverse") item must be recoded (min+max) - x before the scale total or Cronbach's alpha is computed. A reverse item left un-recoded correlates negatively with the rest of the scale and collapses alpha — often turning it negative. A negative alpha is almost never a real measurement phenomenon; it is a reverse-coding bug, and defending it as "multidimensional structure" loses a review round. Suggested action: "Recode reverse-worded items, then recompute reliability." Detected by scripts/check_reverse_coding.py (flags items with a negative item-rest correlation and a negative raw alpha, given the scale item columns); the recode itself is applied downstream by /analyze-stats likert_summary.py --reverse-items. Pairs with the global rule survey-scale-reliability.md.

Present the flag report as a structured table:

Variable	Issue Type	Count	Severity	Suggested Action
age	Outlier (IQR)	3	Medium	Review: values 150, 200, -5
sex	Category inconsistency	12	Low	Harmonize: Male/male/M -> "Male"
lab_date	Type mismatch	45	High	Parse to datetime
pack_years	Categorical-implied zero	12421	High	Set 0 where smoking_status=='never' (structural zero, not missing)
trust_E3	Reverse-coded item (raw α=-0.57)	n/a	High	Recode (6 - x) before reliability; negative α is a coding bug

Severity levels:

High: Likely data errors that will affect analysis (type mismatches, impossible values)
Medium: Potential issues that need expert review (statistical outliers, moderate missingness)
Low: Minor inconsistencies that are easy to fix (category labels, trailing whitespace)

Gate: User reviews flags and approves/rejects each suggested action. Ask:

"Please review the flagged issues above. For each row, indicate: (A) Approve the suggested action, (R) Reject / keep as-is, or (M) Modify the action. Only approved actions will generate cleaning code."

Stage 3: Code Generation

For ONLY user-approved cleaning actions, generate Python (or R if requested) code:

Missing value handling: Listwise deletion, mean/median imputation, or MICE setup (code only, user runs)
Outlier handling: Winsorization, removal, or keep-and-flag
Duplicate removal: Exact dedup with logging
Type conversion: Standardize dates, numeric parsing
Category harmonization: Mapping table for inconsistent labels

All generated code MUST include:

Before/after row counts printed to console
Logging of every modification to a cleaning log DataFrame
Reproducibility: np.random.seed(42) and random.seed(42) where applicable
Output: cleaned CSV + cleaning_log.csv
Clear comments explaining each cleaning step

End the generated script with this notice:

"This code implements ONLY the cleaning rules you approved. Review the cleaning_log.csv output to verify all changes before proceeding to analysis."

Scope Limitations

Supported:

Missing values (detection, simple imputation code, MICE setup)
Outliers (statistical detection via IQR and Z-score)
Duplicates (exact and near-duplicate detection)
Type mismatches (numeric parsing, date standardization)
Category harmonization (case, abbreviation, whitespace)

NOT supported:

Domain-specific plausible ranges (unless codebook provided)
Complex imputation strategy selection (MICE setup only, user picks variables/method)
Natural language extraction from clinical notes
Image data cleaning or DICOM metadata
Automated decisions -- all cleaning requires researcher approval

This tool flags issues. Final cleaning decisions require your domain knowledge.

Cross-Skill Integration

clean-data sits BEFORE analyze-stats in the research pipeline
design-study can inform which variables to focus profiling on
manage-project tracks overall project state including data cleaning status
After cleaning, hand off to analyze-stats for statistical analysis

Output Format

Structure all reports using this template:

## Data Profiling Report

### Dataset Overview
- Rows: [N]
- Columns: [N]
- File size: [size]
- Date range: [if applicable]

### Variable Summary
| Variable | Type | Missing N (%) | Unique | Min | Max | Mean | SD |
|----------|------|---------------|--------|-----|-----|------|-----|
| ...      | ...  | ...           | ...    | ... | ... | ...  | ... |

### Flags
| Variable | Issue | Count | Severity | Suggested Action |
|----------|-------|-------|----------|-----------------|
| ...      | ...   | ...   | ...      | ...             |

### Cleaning Code
[Python/R script -- only for approved actions]

### Cleaning Log
[What was changed, how many rows affected, before/after counts]

Anti-Hallucination

Never fabricate variable names, dataset column names, or variable codings. If a variable mapping is uncertain, output [VERIFY: variable_name] and ask the user to confirm against the data dictionary.
Never fabricate statistical results — no invented p-values, effect sizes, confidence intervals, or sample sizes. All numbers must come from executed code output.
Never generate references from memory. Use /search-lit for all citations.
If a function, package, or API does not exist or you are unsure, say so explicitly rather than guessing.

clean-data

Popularity

Invocation

Configuration

Context Preview

Supporting Files

SKILL.md

clean-data

Popularity

Invocation

Configuration

Context Preview

Supporting Files

SKILL.md

Data Profiling and Cleaning Skill

Philosophy

Reference Files

Three-Stage Workflow

Stage 1: Profiling

Stage 2: Flagging

Stage 3: Code Generation

Scope Limitations

Cross-Skill Integration

Output Format

Anti-Hallucination

Similar Skills

Data Profiling and Cleaning Skill

Philosophy

Reference Files

Three-Stage Workflow

Stage 1: Profiling

Stage 2: Flagging

Stage 3: Code Generation

Scope Limitations

Cross-Skill Integration

Output Format

Anti-Hallucination

Similar Skills