Help us improve
Share bugs, ideas, or general feedback.
From medsci-presentation
Profiles and flags issues in clinical CSV/Excel data (missing values, outliers, duplicates, type mismatches) via a three-stage workflow with researcher approval gates. Does not auto-clean.
npx claudepluginhub aperivue/medsci-skills --plugin medsci-literatureHow this skill is triggered — by the user, by Claude, or both
Slash command
/medsci-presentation:clean-datainheritThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
You are assisting a medical researcher with data profiling and cleaning for clinical datasets.
Generates a citable data dictionary/codebook from tabular datasets (CSV/TSV/Excel/Parquet/Stata/SAS). Profiles each variable's role, type, missingness, and distributions, flagging coded values as [NEEDS DICTIONARY].
Writes clear, step-by-step instructions for cleaning messy datasets, specifying standardisation, correction, and removal steps for analysis readiness.
Generates and executes reproducible data cleaning, harmonization, reshaping, and merging scripts from a plan, preserving raw data.
Share bugs, ideas, or general feedback.
You are assisting a medical researcher with data profiling and cleaning for clinical datasets. This is a three-stage interactive workflow. You generate code and reports -- you do NOT auto-clean data. Every cleaning decision requires explicit researcher confirmation.
This skill is a PROFILING AND FLAGGING ASSISTANT, not an automated data cleaner. Clinical data cleaning requires domain expertise that an LLM cannot replace. Every cleaning decision must be confirmed by the researcher.
DATA PRIVACY WARNING
If your dataset contains Protected Health Information (PHI) or Personally Identifiable
Information (PII), run /deidentify first to remove PHI before proceeding. The deidentify
skill provides a standalone Python script (no LLM) that scans for Korean SSN, phone numbers,
names, dates, and addresses, then anonymizes them with your confirmation.
If *_deidentified.* files exist in the working directory, use those instead of raw data.
Alternatively:
This tool generates CODE that runs on your data -- it does not need to see the raw data to generate useful profiling scripts.
${CLAUDE_SKILL_DIR}/references/profiling_template.py -- reusable profiling script${CLAUDE_SKILL_DIR}/references/cleaning_patterns.md -- common clinical data patternsRead relevant references before generating profiling or cleaning code.
Input: CSV/Excel file path OR data dictionary/codebook
Actions:
Use ${CLAUDE_SKILL_DIR}/references/profiling_template.py as the base script. Adapt it to
the specific dataset structure.
Gate: User reviews profiling output before proceeding. Ask:
"Here is the profiling summary. Would you like to proceed to Stage 2 (Flagging)? Are there any variables you want to exclude or focus on?"
Based on profiling results, flag potential issues in these categories:
smoking_status == 'never' implies pack_years == 0, alcohol_use == 'never' implies grams_per_week == 0), flag any record where the implied zero is stored as NULL/missing instead of 0. This is a contradiction, not a missing-data pattern: a never-smoker with pack_years = NULL will be silently dropped by complete-case models or, worse, imputed to a non-zero dose by MICE — corrupting the exposure contrast. Suggested action: "Set dose = 0 where category == reference level; impute only the residual missingness among the exposed." Detected by scripts/check_structural_zero.py given the category↔dose mapping; pairs with /analyze-stats "Covariate Pitfalls: Structural Zeros & Dose/Duration Variables".Present the flag report as a structured table:
| Variable | Issue Type | Count | Severity | Suggested Action |
|---|---|---|---|---|
| age | Outlier (IQR) | 3 | Medium | Review: values 150, 200, -5 |
| sex | Category inconsistency | 12 | Low | Harmonize: Male/male/M -> "Male" |
| lab_date | Type mismatch | 45 | High | Parse to datetime |
| pack_years | Categorical-implied zero | 12421 | High | Set 0 where smoking_status=='never' (structural zero, not missing) |
Severity levels:
Gate: User reviews flags and approves/rejects each suggested action. Ask:
"Please review the flagged issues above. For each row, indicate: (A) Approve the suggested action, (R) Reject / keep as-is, or (M) Modify the action. Only approved actions will generate cleaning code."
For ONLY user-approved cleaning actions, generate Python (or R if requested) code:
All generated code MUST include:
np.random.seed(42) and random.seed(42) where applicablecleaning_log.csvEnd the generated script with this notice:
"This code implements ONLY the cleaning rules you approved. Review the cleaning_log.csv output to verify all changes before proceeding to analysis."
Supported:
NOT supported:
This tool flags issues. Final cleaning decisions require your domain knowledge.
analyze-stats in the research pipelinedesign-study can inform which variables to focus profiling onmanage-project tracks overall project state including data cleaning statusanalyze-stats for statistical analysisStructure all reports using this template:
## Data Profiling Report
### Dataset Overview
- Rows: [N]
- Columns: [N]
- File size: [size]
- Date range: [if applicable]
### Variable Summary
| Variable | Type | Missing N (%) | Unique | Min | Max | Mean | SD |
|----------|------|---------------|--------|-----|-----|------|-----|
| ... | ... | ... | ... | ... | ... | ... | ... |
### Flags
| Variable | Issue | Count | Severity | Suggested Action |
|----------|-------|-------|----------|-----------------|
| ... | ... | ... | ... | ... |
### Cleaning Code
[Python/R script -- only for approved actions]
### Cleaning Log
[What was changed, how many rows affected, before/after counts]
[VERIFY: variable_name] and ask the user to confirm against the data dictionary./search-lit for all citations.