Cleaning Data - Component Skill

Purpose

Use this skill when:

Have completed the importing-data skill with quality report generated
Need to address data quality issues before analysis (duplicates, outliers, NULL handling, free text categorization)
Want systematic approach to cleaning decisions with documented rationale
Need to create clean tables ready for process skills (exploratory-analysis, guided-investigation, etc.)
Following DataPeeker principle: cleaning is ALWAYS mandatory even if minimal issues found

This skill is a prerequisite for all DataPeeker analysis workflows and consumes the quality report from importing-data.

Prerequisites

Before using this skill, you MUST:

Have completed the importing-data skill successfully
Have access to 05-quality-report.md generated by importing-data
Have raw_* table(s) in data/analytics.db
Be familiar with basic SQLite for validation queries
Understand data quality concepts: duplicates, outliers, NULL handling, categorical standardization

Data Cleaning Process

Create a TodoWrite checklist for the 5-phase data cleaning process:

Phase 1: Quality Report Review - pending
Phase 2: Issue Detection (Agent-Delegated) - pending
Phase 3: Cleaning Strategy Design - pending
Phase 4: Cleaning Execution - pending
Phase 5: Verification & Documentation - pending

Mark each phase as you complete it. Document all findings in numbered markdown files (01-cleaning-scope.md through 05-verification-report.md) within your analysis workspace directory.

Phase 1: Quality Report Review

Goal: Read quality report from importing-data, understand detected issues, prioritize for cleaning based on impact and severity.

Read Quality Report from importing-data

Locate the quality report:

analysis/[session-name]/05-quality-report.md

This report (generated by importing-data Phase 5) contains:

Table schema and row counts
NULL percentages per column
Duplicate counts and examples
Outlier flags (3 MAD threshold) per numeric column
Free text candidates (columns with >50% uniqueness)
Summary of quality concerns

Extract key information:

Which columns have >10% NULLs?
How many duplicate rows exist (exact duplicates)?
Which numeric columns have outliers?
Which text columns need categorization?

Document: Summarize findings from quality report.

Prioritize Issues Using Framework

Issue Prioritization Matrix:

Evaluate each issue on three dimensions:

1. Impact (% of rows/columns affected)

High: >10% of rows affected
Medium: 1-10% of rows affected
Low: <1% of rows affected

2. Severity (effect on analysis validity)

Critical: Makes analysis invalid or misleading (e.g., key column >50% NULL)
Significant: Reduces data quality for important columns (e.g., duplicates, inconsistent categories)
Minor: Affects edge cases only (e.g., outliers that are legitimate)

3. Effort (complexity to resolve)

Low: Simple removal, exclusion, or standardization (1-2 SQL queries)
Medium: Requires sub-agent for categorization or pattern analysis (3-5 queries)
High: Complex deduplication, manual review, or domain expertise needed (>5 queries)

Combine dimensions to assign priority:

Impact	Severity	Effort	Priority	Action Timing
High	Critical	Any	CRITICAL	Must address
High	Significant	Low/Med	HIGH	Must address
Medium	Critical	Low/Med	HIGH	Must address
Any	Any	High	MEDIUM	Address if time permits
Low	Minor	Any	LOW	Document, may skip

Document: Create prioritized issue table in 01-cleaning-scope.md.

Define Cleaning Scope and Objectives

Create analysis/[session-name]/01-cleaning-scope.md with: ./templates/phase-1.md

CHECKPOINT: Before proceeding to Phase 2, you MUST have:

Read and understood 05-quality-report.md from importing-data
Extracted all detected issues (NULLs, duplicates, outliers, free text, FK orphans)
Applied prioritization matrix (impact × severity × effort)
Reviewed FK relationships and orphaned records (if multiple tables)
Defined cleaning objectives with success criteria
01-cleaning-scope.md created with all sections filled

Phase 2: Issue Detection (Agent-Delegated)

Goal: Deep-dive investigation of prioritized data quality issues using sub-agents to prevent context pollution.

CRITICAL: This phase MUST use sub-agent delegation. DO NOT analyze data in main agent context.

Detection 1: Duplicate Records

Use dedicated duplicate detection agents

For exact duplicates:

Invoke the detect-exact-duplicates agent:

Task tool with agent: detect-exact-duplicates
Parameters:
- table_name: raw_[actual_table_name]
- key_columns: [columns that define uniqueness, from Phase 1 scope]

For near-duplicates (fuzzy matching):

Invoke the detect-near-duplicates agent:

Task tool with agent: detect-near-duplicates
Parameters:
- table_name: raw_[actual_table_name]
- text_column: [specific text column flagged in Phase 1]

Repeat for each text column requiring fuzzy matching.

Document findings in 02-detected-issues.md using template below.

Detection 2: Outliers (MAD-Based)

Use dedicated outlier detection agent

For each numeric column flagged in Phase 1:

Invoke the detect-outliers agent:

Task tool with agent: detect-outliers
Parameters:
- table_name: raw_[actual_table_name]
- numeric_col: [specific numeric column from Phase 1 scope]

Repeat for each numeric column requiring outlier analysis.

Document findings in 02-detected-issues.md using template below.

Detection 4: Referential Integrity Validation

If multiple tables exist with FK relationships identified in Phase 1:

Use dedicated FK validation agent

For each FK relationship flagged in Phase 1:

Invoke the detect-foreign-keys agent (focused validation mode):

Task tool with agent: detect-foreign-keys
Parameters:
- database_path: data/analytics.db
- child_table: [specific child table]
- child_column: [FK column]
- parent_table: [specific parent table]
- parent_column: [PK column]

The agent will:

Confirm value overlap percentage (validate Phase 1 findings)
Identify specific orphaned record IDs
Assess orphan patterns (recent vs old, specific categories, etc.)
Quantify impact on analysis (% of records affected in joins)

If single table: Skip this detection, document "N/A - Single table" in detected issues report.

Document findings in 02-detected-issues.md using template below.

Review Sub-Agent Findings

After all sub-agents return findings:

For duplicates:

Are exact duplicates truly identical (all columns match)?
Are near-duplicates legitimate variations or data entry errors?
Which duplicate groups should be merged vs kept separate?

For outliers:

Are outliers data errors or legitimate extreme values?
Do outliers follow any pattern (seasonal, geographic, product-specific)?
Which outliers should be excluded vs capped vs flagged?

For FK orphans (if multiple tables):

Are orphaned records recent (may resolve soon) or old (permanent issue)?
Do orphans follow a pattern (specific categories, time periods)?
Can orphans be matched to parent records through fuzzy matching?
Should orphans be excluded, flagged, or have placeholder parents created?

Document: Observations and preliminary decisions for Strategy Phase.

Create Detected Issues Report

Create analysis/[session-name]/02-detected-issues.md with: ./templates/phase-2.md

CHECKPOINT: Before proceeding to Phase 3, you MUST have:

Delegated duplicate detection to sub-agent (exact and near-duplicates)
Delegated outlier detection to sub-agent (MAD-based, all flagged columns)
Delegated FK validation to sub-agent (if multiple tables with relationships)
Reviewed all sub-agent findings and documented observations
Created 02-detected-issues.md with all sections filled (including FK orphans if applicable)
Identified specific records/issues for Phase 3 strategy decisions
Listed implications for Phase 3 (what user needs to decide)

Phase 3: Cleaning Strategy Design

Goal: Design cleaning approach for each detected issue type, present options with trade-offs, get user confirmation before execution.

Review Detected Issues from Phase 2

From 02-detected-issues.md, identify all issue types requiring decisions:

Exact duplicates: [count]
Near-duplicates: [count]
Outliers per column: [counts]
Free text categorization: [columns]

Decision Framework: Duplicates

For exact duplicates, choose ONE approach:

Option A: Keep First Occurrence

Method: ORDER BY rowid, keep lowest rowid per duplicate group
Pros: Simple, deterministic, fast
Cons: First may not be most complete/accurate
Use when: No quality difference between duplicates

Option B: Keep Most Complete

Method: Rank by completeness (fewest NULLs), keep best per group
Pros: Preserves maximum information
Cons: More complex, requires completeness scoring
Use when: Duplicates have varying data quality

Option C: Merge Records

Method: Combine non-NULL values from all duplicates
Pros: No data loss
Cons: Complex, may create inconsistencies
Use when: Duplicates have complementary information

For near-duplicates, choose ONE approach:

Option A: Auto-Merge High Confidence (>95%)

Method: Apply fuzzy matching agent's high confidence mappings automatically
Pros: Efficient, addresses most obvious issues
Cons: Small risk of incorrect merges
Use when: Trust fuzzy matching agent's assessment

Option B: Manual Review All

Method: Review every near-duplicate group before merging
Pros: Zero incorrect merges
Cons: Time-consuming
Use when: Data quality is critical

Option C: Skip Near-Duplicates

Method: Only handle exact duplicates, leave fuzzy matches as-is
Pros: Conservative, no risk
Cons: Misses data quality improvements
Use when: Near-duplicates are legitimate variations

Document chosen approach in 03-cleaning-strategy.md

Decision Framework: Outliers

For each numeric column with outliers, choose ONE approach:

Option A: Exclude Outliers

Method: Filter out rows where value > 3 MAD from median
Pros: Clean dataset, no extreme values skewing analysis
Cons: Data loss, may exclude legitimate extremes
Use when: Outliers are clearly data errors

Option B: Cap at Threshold

Method: Set outliers to 3 MAD threshold (winsorization)
Pros: Preserves row count, reduces extreme influence
Cons: Distorts actual values
Use when: Want to preserve rows but limit extreme influence

Option C: Flag and Keep

Method: Add outlier_flag column, keep all data
Pros: No data loss, analysts can filter if needed
Cons: Outliers may still skew analysis if not filtered
Use when: Outliers might be legitimate, need analyst judgment

Option D: Keep As-Is

Method: No transformation
Pros: Preserves true data
Cons: Extremes may dominate analysis
Use when: Outliers are legitimate (VIPs, seasonal spikes)

Document chosen approach per column in 03-cleaning-strategy.md

Decision Framework: Free Text Categorization

If free text columns flagged in Phase 2:

Step 1: Invoke categorization agent

Task tool with agent: categorize-free-text
Parameters:
- column_name: [specific text column]
- unique_values: [list from Phase 2 detection]
- business_context: [optional context about what values represent]

Step 2: Review agent's proposed categories

Agent will return:

Proposed category schema (3-10 categories)
Value mappings with confidence levels
Ambiguous/uncategorizable values flagged

Step 3: Choose categorization approach:

Option A: Accept Agent Proposal

Method: Use agent's categories and mappings as-is
Pros: Fast, leverages semantic analysis
Cons: May miss business context
Use when: Agent's categories make sense for analysis

Option B: Modify Categories

Method: Adjust agent's proposal (rename, merge, split categories)
Pros: Incorporates business knowledge
Cons: Requires manual refinement
Use when: Agent's categories are close but need tweaking

Option C: Manual Categorization

Method: Define categories and mappings from scratch
Pros: Full control, perfect fit for business needs
Cons: Time-consuming
Use when: Agent's proposal doesn't fit business model

Option D: Keep As-Is

Method: No categorization
Pros: Preserves original data
Cons: High-cardinality text column harder to analyze
Use when: Free text values are inherently unique (IDs, descriptions)

Document chosen approach in 03-cleaning-strategy.md

Decision Framework: Referential Integrity (If Multiple Tables)

For FK orphaned records identified in Phase 2:

For each FK relationship with orphaned records, choose ONE approach:

Option A: Exclude Orphaned Records

Method: Filter out child records where FK doesn't match any parent PK
Pros: Clean referential integrity, INNER JOINs work correctly
Cons: Data loss
Use when: Orphaned records are data errors with no business value

Option B: Preserve with NULL

Method: Set orphaned FK values to NULL (retain child records)
Pros: Preserves child row count, makes orphans explicit
Cons: Loses relationship information, NULL handling required in queries
Use when: Child records have value even without parent context

Option C: Flag and Keep

Method: Add [fk_column]_orphan_flag column, keep original FK value
Pros: No data loss, analysts can filter as needed
Cons: Referential integrity violated until analyst filters
Use when: Need investigation before deciding, orphans may resolve

Option D: Create Placeholder Parent

Method: Insert synthetic parent record (e.g., id=-1, name="Unknown"), map orphans to it
Pros: Preserves referential integrity AND child rows, INNER JOINs work
Cons: Creates synthetic data, may skew parent-level aggregations
Use when: JOINs required but can't lose child records (e.g., orders with unknown customer)

Document chosen approach per FK relationship in 03-cleaning-strategy.md

Decision Framework: Business Rules (Optional)

If business rules were defined in Phase 1 scope:

For each rule, choose approach for violations:

Option A: Exclude Violating Records

Method: Filter out rows that fail validation
Pros: Clean dataset, only valid data
Cons: Data loss
Use when: Invalid data cannot be corrected

Option B: Cap/Coerce to Valid Range

Method: Adjust values to meet constraints
Pros: Preserves rows
Cons: Changes actual data
Use when: Violations are minor (e.g., age 150 → cap at 120)

Option C: Flag and Keep

Method: Add validation_flag column
Pros: No data loss, transparent
Cons: Invalid data present
Use when: Need to investigate violations before deciding

Document chosen approach per rule in 03-cleaning-strategy.md

Create Cleaning Strategy Document

Create analysis/[session-name]/03-cleaning-strategy.md with: ./templates/phase-3.md

CHECKPOINT: Before proceeding to Phase 4, you MUST have:

Reviewed all detected issues from Phase 2
Chosen approach for duplicates (exact and near)
Chosen approach for outliers (per numeric column)
Reviewed free text categorization agent proposal (if applicable)
Chosen approach for free text categorization
Chosen approach for FK orphans (if multiple tables with relationships)
Defined business rule handling (if applicable)
User confirmed all strategies via checkpoint review
03-cleaning-strategy.md created with all decisions documented

Phase 4: Cleaning Execution

Goal: Execute approved cleaning strategies, create clean_* tables, track all exclusions and transformations.

CRITICAL: All transformations use CREATE TABLE AS SELECT pattern. Keep raw_* tables intact.

Transformation 1: Remove Duplicates

Based on Strategy from Phase 3:

For Exact Duplicates (Keep First approach):

-- Create clean table without exact duplicates
CREATE TABLE clean_[table_name] AS
WITH ranked_records AS (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY [key_columns]
      ORDER BY rowid
    ) as rn
  FROM raw_[table_name]
)
SELECT [all_columns]  -- Exclude rn column
FROM ranked_records
WHERE rn = 1;

For Exact Duplicates (Keep Most Complete approach):

-- Create clean table keeping most complete record per duplicate group
CREATE TABLE clean_[table_name] AS
WITH completeness_scored AS (
  SELECT *,
    ([count non-NULL columns formula]) as completeness_score,
    ROW_NUMBER() OVER (
      PARTITION BY [key_columns]
      ORDER BY completeness_score DESC, rowid
    ) as rn
  FROM raw_[table_name]
)
SELECT [all_columns]  -- Exclude rn and completeness_score
FROM completeness_scored
WHERE rn = 1;

For Near-Duplicates (Auto-Merge High Confidence approach):

-- Create mapping table from categorization agent
CREATE TABLE [column]_near_dup_mapping AS
SELECT
  original_value,
  canonical_value
FROM (VALUES
  ('[value1]', '[canonical]'),
  ('[value2]', '[canonical]'),
  ...
) AS mapping(original_value, canonical_value);

-- Apply mapping
UPDATE clean_[table_name]
SET [text_column] = (
  SELECT canonical_value
  FROM [column]_near_dup_mapping
  WHERE original_value = [text_column]
)
WHERE [text_column] IN (SELECT original_value FROM [column]_near_dup_mapping);

Verification:

-- Verify duplicate removal
SELECT COUNT(*) as clean_count FROM clean_[table_name];
SELECT COUNT(*) - COUNT(DISTINCT [key_columns]) as remaining_dups FROM clean_[table_name];
-- Expected: 0 remaining duplicates

-- Exclusion count
SELECT [raw_count] - [clean_count] as excluded_duplicates;

Document in 04-cleaning-execution.md:

SQL executed
Before count: [N] rows
After count: [N] rows
Duplicates removed: [N] rows ([X]%)

Transformation 2: Handle Outliers

Based on Strategy from Phase 3:

For Outliers (Exclude approach):

-- Calculate thresholds
WITH stats AS (
  SELECT
    [median_calculation] as median,
    [mad_calculation] * 1.4826 as mad
  FROM raw_[table_name]
  WHERE [numeric_col] IS NOT NULL
)
-- Create clean table excluding outliers
CREATE TABLE clean_[table_name] AS
SELECT r.*
FROM raw_[table_name] r
CROSS JOIN stats s
WHERE ABS(r.[numeric_col] - s.median) <= 3 * s.mad
   OR r.[numeric_col] IS NULL;  -- Keep NULLs

For Outliers (Cap at Threshold approach - Winsorization):

WITH stats AS (
  SELECT
    [median_calculation] as median,
    [mad_calculation] * 1.4826 as mad
  FROM raw_[table_name]
  WHERE [numeric_col] IS NOT NULL
)
CREATE TABLE clean_[table_name] AS
SELECT
  [other_columns],
  CASE
    WHEN [numeric_col] > median + 3 * mad THEN median + 3 * mad
    WHEN [numeric_col] < median - 3 * mad THEN median - 3 * mad
    ELSE [numeric_col]
  END as [numeric_col]
FROM raw_[table_name]
CROSS JOIN stats;

For Outliers (Flag and Keep approach):

WITH stats AS (
  SELECT [median_calculation] as median, [mad_calculation] * 1.4826 as mad
  FROM raw_[table_name]
  WHERE [numeric_col] IS NOT NULL
)
CREATE TABLE clean_[table_name] AS
SELECT
  r.*,
  CASE
    WHEN ABS(r.[numeric_col] - s.median) > 3 * s.mad THEN 1
    ELSE 0
  END as [numeric_col]_outlier_flag
FROM raw_[table_name] r
CROSS JOIN stats s;

Verification:

-- Verify outlier handling
SELECT COUNT(*) as clean_count FROM clean_[table_name];

-- For Exclude approach: check no outliers remain
WITH stats AS (...)
SELECT COUNT(*) as remaining_outliers
FROM clean_[table_name], stats
WHERE ABS([numeric_col] - median) > 3 * mad;
-- Expected: 0

-- For Cap approach: check values at thresholds
SELECT MIN([numeric_col]), MAX([numeric_col]) FROM clean_[table_name];

-- For Flag approach: check flag distribution
SELECT [numeric_col]_outlier_flag, COUNT(*)
FROM clean_[table_name]
GROUP BY [numeric_col]_outlier_flag;

Document in 04-cleaning-execution.md:

Approach used per column
SQL executed
Before/after row counts (if excluding)
Outliers affected: [N] rows ([X]%)

Transformation 3: Categorize Free Text

Based on Strategy from Phase 3:

If using agent's proposed categories:

-- Create category mapping from agent output
CREATE TABLE [column]_category_mapping AS
VALUES
  ('[value]', '[Category 1]'),
  ('[value]', '[Category 1]'),
  ('[value]', '[Category 2]'),
  ...
) AS mapping(original_value, category);

-- Apply categorization
CREATE TABLE clean_[table_name] AS
SELECT
  r.[other_columns],
  COALESCE(m.category, 'Other') as [column]_category
FROM raw_[table_name] r
LEFT JOIN [column]_category_mapping m
  ON r.[text_column] = m.original_value;

If handling uncategorizable values:

-- Option A: Exclude uncategorizable
CREATE TABLE clean_[table_name] AS
SELECT
  r.*,
  m.category as [column]_category
FROM raw_[table_name] r
INNER JOIN [column]_category_mapping m
  ON r.[text_column] = m.original_value;
-- INNER JOIN excludes unmapped values

-- Option B: Map to "Other" category (shown in previous query with COALESCE)

Verification:

-- Verify categorization coverage
SELECT
  [column]_category,
  COUNT(*) as count,
  ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) as pct
FROM clean_[table_name]
GROUP BY [column]_category
ORDER BY count DESC;

-- Check for unmapped values (if kept)
SELECT COUNT(*) as unmapped
FROM clean_[table_name]
WHERE [column]_category IS NULL OR [column]_category = 'Other';

Document in 04-cleaning-execution.md:

Category schema used
SQL executed
Distribution by category
Unmapped/excluded values: [N] rows ([X]%)

Transformation 4: Business Rule Validation (if applicable)

Based on Strategy from Phase 3:

-- Exclude rows violating business rules
CREATE TABLE clean_[table_name] AS
SELECT *
FROM raw_[table_name]
WHERE [rule_1_validation]
  AND [rule_2_validation]
  ...;

-- Example rules:
-- WHERE age BETWEEN 0 AND 120
-- AND amount > 0
-- AND date BETWEEN '2020-01-01' AND '2025-12-31'

Verification:

-- Verify no violations remain
SELECT COUNT(*) as violations
FROM clean_[table_name]
WHERE NOT ([rule_1_validation] AND [rule_2_validation] ...);
-- Expected: 0

Document in 04-cleaning-execution.md:

Rules enforced
SQL executed
Violations excluded: [N] rows ([X]%)

Transformation 5: Referential Integrity Enforcement (if multiple tables)

Based on Strategy from Phase 3:

For Orphaned Records (Exclude approach):

-- Remove orphaned child records (Option A from Phase 3)
CREATE TABLE clean_child_table AS
SELECT c.*
FROM raw_child_table c
INNER JOIN raw_parent_table p ON c.fk_column = p.pk_column;
-- INNER JOIN automatically excludes orphans

For Orphaned Records (Preserve with NULL approach):

-- Set orphaned FK values to NULL (Option B from Phase 3)
CREATE TABLE clean_child_table AS
SELECT
  c.*,
  CASE
    WHEN p.pk_column IS NULL THEN NULL
    ELSE c.fk_column
  END as fk_column
FROM raw_child_table c
LEFT JOIN raw_parent_table p ON c.fk_column = p.pk_column;

For Orphaned Records (Flag and Keep approach):

-- Add orphan flag column (Option C from Phase 3)
CREATE TABLE clean_child_table AS
SELECT
  c.*,
  CASE
    WHEN p.pk_column IS NULL AND c.fk_column IS NOT NULL THEN 1
    ELSE 0
  END as fk_column_orphan_flag
FROM raw_child_table c
LEFT JOIN raw_parent_table p ON c.fk_column = p.pk_column;

For Orphaned Records (Create Placeholder Parent approach):

-- Step 1: Create placeholder parent record (Option D from Phase 3)
INSERT INTO raw_parent_table (pk_column, name, other_fields)
VALUES (-1, 'Unknown', NULL, ...);

-- Step 2: Remap orphans to placeholder
CREATE TABLE clean_child_table AS
SELECT
  c.*,
  CASE
    WHEN p.pk_column IS NULL THEN -1
    ELSE c.fk_column
  END as fk_column
FROM raw_child_table c
LEFT JOIN raw_parent_table p ON c.fk_column = p.pk_column;

Verification:

-- Verify no orphans remain (for Exclude approach)
SELECT COUNT(*) as orphans
FROM clean_child_table c
LEFT JOIN clean_parent_table p ON c.fk_column = p.pk_column
WHERE p.pk_column IS NULL AND c.fk_column IS NOT NULL;
-- Expected: 0

-- Verify NULL remapping (for Preserve with NULL approach)
SELECT COUNT(*) as nulled_fks
FROM clean_child_table
WHERE fk_column IS NULL;
-- Expected: [count of orphans from Phase 2]

-- Verify flag accuracy (for Flag and Keep approach)
SELECT fk_column_orphan_flag, COUNT(*)
FROM clean_child_table
GROUP BY fk_column_orphan_flag;
-- Expected: flag=1 count matches orphan count from Phase 2

Document in 04-cleaning-execution.md:

FK relationship handled
Approach used (Exclude/Preserve/Flag/Placeholder)
SQL executed
Orphans affected: [N] rows ([X]%)
JOIN behavior after transformation

[If single table: "N/A - Single table analysis"]

Combined Transformation Approach

If multiple transformations needed, use CTE chain:

CREATE TABLE clean_[table_name] AS
WITH
-- Step 1: Remove duplicates
deduped AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY [key_cols] ORDER BY rowid) as rn
  FROM raw_[table_name]
),
no_dups AS (
  SELECT [all_columns] FROM deduped WHERE rn = 1
),
-- Step 2: Handle outliers
outliers_removed AS (
  SELECT d.*
  FROM no_dups d
  CROSS JOIN (SELECT [median], [mad] FROM ...) stats
  WHERE ABS(d.[numeric_col] - stats.median) <= 3 * stats.mad
),
-- Step 3: Apply categorization
categorized AS (
  SELECT
    o.*,
    COALESCE(m.category, 'Other') as [column]_category
  FROM outliers_removed o
  LEFT JOIN [column]_category_mapping m ON o.[text_col] = m.original_value
),
-- Step 4: Enforce business rules
final AS (
  SELECT *
  FROM categorized
  WHERE [rule_validations]
)
SELECT * FROM final;

Verification of combined transformations:

-- Row count reconciliation
SELECT
  (SELECT COUNT(*) FROM raw_[table_name]) as raw_count,
  (SELECT COUNT(*) FROM clean_[table_name]) as clean_count,
  (SELECT COUNT(*) FROM raw_[table_name]) - (SELECT COUNT(*) FROM clean_[table_name]) as total_excluded;

Create Cleaning Execution Log

Create analysis/[session-name]/04-cleaning-execution.md with: ./templates/phase-4.md

CHECKPOINT: Before proceeding to Phase 5, you MUST have:

Executed all transformations from Phase 3 strategy
Created clean_[table_name] table in data/analytics.db
Verified each transformation with specific checks
Reconciled row counts (raw = clean + exclusions)
Documented all exclusions with reasons and counts
Spot-checked sample records before/after
04-cleaning-execution.md created with all results documented

Phase 5: Verification & Documentation

Goal: Validate cleaning results, quantify quality improvements, document complete audit trail from raw to clean.

Verify Row Count Reconciliation

Critical validation - MUST match exactly:

-- Count raw table
SELECT COUNT(*) as raw_count FROM raw_[table_name];

-- Count clean table
SELECT COUNT(*) as clean_count FROM clean_[table_name];

-- Calculate exclusions from Phase 4 log
-- Expected: raw_count = clean_count + total_exclusions

Document: Confirm reconciliation passes. If mismatch, investigate before proceeding.

Verify Transformation Results

For each transformation applied in Phase 4:

Duplicate Removal Verification:

-- Confirm no duplicates remain
SELECT [key_columns], COUNT(*) as occurrences
FROM clean_[table_name]
GROUP BY [key_columns]
HAVING COUNT(*) > 1;
-- Expected: 0 rows returned

Outlier Handling Verification:

-- For Exclude approach: confirm no outliers remain
WITH stats AS (
  SELECT [median], [mad] FROM ...
)
SELECT COUNT(*) as remaining_outliers
FROM clean_[table_name], stats
WHERE ABS([numeric_col] - median) > 3 * mad;
-- Expected: 0 rows

-- For Cap approach: confirm values at thresholds
SELECT MIN([numeric_col]) as min_val, MAX([numeric_col]) as max_val
FROM clean_[table_name];
-- Expected: min >= (median - 3*MAD), max <= (median + 3*MAD)

-- For Flag approach: check flag accuracy
SELECT [numeric_col]_outlier_flag, COUNT(*)
FROM clean_[table_name]
GROUP BY [numeric_col]_outlier_flag;
-- Expected: distribution matches Phase 4 execution log

Free Text Categorization Verification:

-- Confirm all values categorized
SELECT COUNT(*) as uncategorized
FROM clean_[table_name]
WHERE [column]_category IS NULL;
-- Expected: 0 (unless "keep uncategorized" was strategy)

-- Verify category distribution
SELECT [column]_category, COUNT(*) as count,
  ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) as pct
FROM clean_[table_name]
GROUP BY [column]_category
ORDER BY count DESC;
-- Expected: matches Phase 4 execution results

Business Rule Verification (if applicable):

-- Confirm no violations remain
SELECT COUNT(*) as violations
FROM clean_[table_name]
WHERE NOT ([rule_1_condition] AND [rule_2_condition] ...);
-- Expected: 0 rows

Document all verification results:

✓ Pass: Expected result confirmed
✗ Fail: Unexpected result, requires investigation

Compare Data Quality Metrics

Before vs After comparison:

-- Completeness comparison
SELECT
  'raw' as table_name,
  COUNT(*) as total_rows,
  COUNT([col1]) as [col1]_non_null,
  ROUND(100.0 * COUNT([col1]) / COUNT(*), 2) as [col1]_completeness_pct
FROM raw_[table_name]
UNION ALL
SELECT
  'clean' as table_name,
  COUNT(*),
  COUNT([col1]),
  ROUND(100.0 * COUNT([col1]) / COUNT(*), 2)
FROM clean_[table_name];

Create quality improvement table:

Metric	Raw Table	Clean Table	Improvement
Total rows	[N]	[N]	-[X]% (exclusions)
Completeness ([col1])	[X]%	[X]%	+[X] pct points
Duplicate groups	[N]	0	-[N] (100%)
Outliers ([col2])	[N]	0	-[N] (100%)
Free text unique values	[N]	[N categories]	-[X]% (categorization)

Document: Quality improvements quantified with specific deltas.

Spot Check Sample Records

Select representative samples to manually verify:

-- Records that were in duplicate groups
SELECT * FROM clean_[table_name] WHERE rowid IN ([IDs from Phase 2]);
-- Verify: correct record kept per strategy

-- Records with outliers (if flagged/capped, not excluded)
SELECT * FROM clean_[table_name] WHERE [numeric_col]_outlier_flag = 1;
-- Verify: flag accurate, values capped if applicable

-- Records with categorized free text
SELECT [original_col], [col]_category FROM clean_[table_name] LIMIT 20;
-- Verify: categories make sense, mapping correct

Document: Manual verification confirms automated transformations worked correctly.

Assess Limitations and Confidence

Document what this cleaning did NOT address:

Scope limitations: [Issues identified but not addressed - e.g., "Date range not validated"]
Data coverage: [Time periods, geographies, categories not covered]
Assumptions made: [Business rules assumed without domain validation]
Edge cases: [Unusual values handled a specific way]

Confidence assessment:

High confidence: [Transformations with clear validation - e.g., "Exact duplicate removal"]
Medium confidence: [Transformations with some subjectivity - e.g., "Free text categorization"]
Low confidence / Needs review: [Transformations requiring domain expertise - e.g., "Outliers might be legitimate"]

Document: Honest assessment of what was cleaned and what wasn't, with confidence levels.

Create Verification Report

Create analysis/[session-name]/05-verification-report.md with: ./templates/phase-5.md

CHECKPOINT: Before concluding cleaning-data skill, you MUST have:

Verified row count reconciliation (raw = clean + exclusions)
Validated all transformations with specific queries
Quantified quality improvements with before/after metrics
Spot-checked sample records manually
Documented limitations and assumptions honestly
Assessed confidence level for each transformation
Created complete exclusion accounting table
05-verification-report.md created with all sections filled
clean_* table ready for analysis process skills

Common Rationalizations

"The data looks clean after Phase 2, I can skip Phase 3 strategy design"

Why this is wrong: Detecting issues isn't the same as deciding how to fix them. Different approaches (exclude vs cap vs flag) have different analytical implications.

Do instead: Always complete Phase 3 with explicit decision frameworks. Document why you chose each approach with user confirmation.

"I'll just exclude all outliers automatically, no need to review them"

Why this is wrong: Some outliers are legitimate (VIP customers, seasonal spikes, rare events). Automatic exclusion loses valuable data.

Do instead: Complete Phase 2 detection with agent analysis. Review patterns in Phase 3. Choose approach based on business context, not just statistical threshold.

"The fuzzy matching agent found near-duplicates, I'll merge them all"

Why this is wrong: 90-95% similarity doesn't mean identical. "John Smith" vs "John Smyth" might be the same person OR two different people.

Do instead: Review confidence levels in Phase 3. Auto-merge only high confidence (>95%). Manual review medium confidence. Document decisions.

"I don't need to document exclusions, I can remember what I removed"

Why this is wrong: Undocumented exclusions break audit trail. When results are questioned, you can't explain what data was excluded or why.

Do instead: Complete Phase 4 execution log with exclusion summary table. Document every excluded record with reason and count. Reconcile in Phase 5.

"Verification is just running the same queries again, waste of time"

Why this is wrong: Phase 5 verification checks RESULTS, not execution. Queries can run without errors but produce wrong results (logic bugs, wrong thresholds, incorrect mappings).

Do instead: Always complete Phase 5 with before/after comparisons, spot checks, and manual inspection. Verification catches transformation bugs.

"I found one issue, fixed it, done with cleaning"

Why this is wrong: Data quality issues cluster. If you found duplicates, likely also have outliers, NULLs, and inconsistencies. One fix doesn't make data "clean".

Do instead: Complete all 5 phases systematically. Phase 2 detects ALL issue types. Address all prioritized issues in Phases 3-4.

"The clean table has fewer rows, that's proof it's better"

Why this is wrong: Smaller isn't always better. Excluding 50% of data might remove all the interesting variation. Quality ≠ quantity reduction.

Do instead: Complete Phase 5 with quality improvement quantification. Measure completeness, consistency, validity improvements - not just row count reduction.

"I'll categorize free text myself, faster than using the agent"

Why this is wrong: Manual categorization is inconsistent, misses patterns, and pollutes main agent context with hundreds of unique values.

Do instead: Always delegate free text analysis to categorize-free-text agent in Phase 3. Agent provides structured mapping with confidence levels. Review and adjust if needed.

"Business rules failed for 2%, I'll just delete those rows and move on"

Why this is wrong: 2% violations might indicate systematic data quality issue (bad data entry, import error, logic flaw). Deleting hides the problem.

Do instead: Investigate violations in Phase 2. Document why they violate rules in Phase 3. Consider whether to exclude, correct, or flag in strategy. Document pattern in Phase 5.

"Phase 5 validation passed, I'm done - no need to document limitations"

Why this is wrong: All cleaning has limitations and assumptions. Pretending otherwise misleads analysts who use the clean data.

Do instead: Complete Phase 5 limitations section honestly. Document what was NOT cleaned, assumptions made, edge cases, confidence levels. Transparency builds trust.

Summary

This skill ensures systematic, documented data cleaning with quality validation by:

Prioritized scope definition: Read quality report from importing-data, apply impact × severity × effort framework - ensures high-value issues addressed first, not random fixes.
Structured decision-making: Present options with trade-offs for duplicates, outliers, free text, business rules - gets user confirmation before execution, prevents undocumented assumptions.
Agent-delegated detection: Use dedicated sub-agents (detect-exact-duplicates, detect-near-duplicates, detect-outliers, categorize-free-text) - prevents context pollution while ensuring thorough analysis.
Explicit strategy approval: Document chosen approach per issue type in Phase 3 with rationale - creates decision audit trail, enables strategy review if results questioned.
Transformation transparency: Execute cleaning with CREATE TABLE AS SELECT, preserve raw_* tables, track all exclusions - maintains complete audit trail from raw to clean.
Rigorous verification: Validate transformations, quantify quality improvements, spot-check samples, document limitations - ensures clean data is actually clean and limitations are known.

Follow this process and you'll create well-documented clean tables with validated quality improvements, complete audit trail from raw data to analysis-ready data, and honest assessment of what was cleaned and what limitations remain.

cleaning-data

Cleaning Data - Component Skill

Purpose

Prerequisites

Data Cleaning Process

Phase 1: Quality Report Review

Read Quality Report from importing-data

Prioritize Issues Using Framework

Define Cleaning Scope and Objectives

Phase 2: Issue Detection (Agent-Delegated)

Detection 1: Duplicate Records

Detection 2: Outliers (MAD-Based)

Detection 4: Referential Integrity Validation

Review Sub-Agent Findings

Create Detected Issues Report

Phase 3: Cleaning Strategy Design

Review Detected Issues from Phase 2

Decision Framework: Duplicates

Decision Framework: Outliers

Decision Framework: Free Text Categorization

Decision Framework: Referential Integrity (If Multiple Tables)

Decision Framework: Business Rules (Optional)

Create Cleaning Strategy Document

Phase 4: Cleaning Execution

Transformation 1: Remove Duplicates

Transformation 2: Handle Outliers

Transformation 3: Categorize Free Text

Transformation 4: Business Rule Validation (if applicable)

Transformation 5: Referential Integrity Enforcement (if multiple tables)

Combined Transformation Approach

Create Cleaning Execution Log

Phase 5: Verification & Documentation

Verify Row Count Reconciliation

Verify Transformation Results

Compare Data Quality Metrics

Spot Check Sample Records

Assess Limitations and Confidence

Create Verification Report

Common Rationalizations

"The data looks clean after Phase 2, I can skip Phase 3 strategy design"

"I'll just exclude all outliers automatically, no need to review them"

"The fuzzy matching agent found near-duplicates, I'll merge them all"

"I don't need to document exclusions, I can remember what I removed"

"Verification is just running the same queries again, waste of time"

"I found one issue, fixed it, done with cleaning"

"The clean table has fewer rows, that's proof it's better"

"I'll categorize free text myself, faster than using the agent"

"Business rules failed for 2%, I'll just delete those rows and move on"

"Phase 5 validation passed, I'm done - no need to document limitations"

Summary

Similar Skills