Systematic data quality remediation - detect duplicates/outliers/inconsistencies, design cleaning strategy, execute transformations, verify results (component skill for DataPeeker analysis sessions)
Systematic data quality remediation that detects duplicates, outliers, and inconsistencies, then executes and verifies cleaning transformations. Triggers after importing-data completes with quality report, creating clean tables for analysis workflows.
/plugin marketplace add tilmon-engineering/claude-skills
/plugin install datapeeker@tilmon-eng-skills

This skill inherits all available tools. When active, it can use any tool Claude has access to.
templates/phase-1.md
templates/phase-2.md
templates/phase-3.md
templates/phase-4.md
templates/phase-5.md

Use this skill when:
- importing-data skill completed with quality report generated

This skill is a prerequisite for all DataPeeker analysis workflows and consumes the quality report from importing-data.
Before using this skill, you MUST:
- importing-data skill completed successfully
- 05-quality-report.md generated by importing-data
- raw_* table(s) exist in data/analytics.db

Create a TodoWrite checklist for the 5-phase data cleaning process:
Phase 1: Quality Report Review - pending
Phase 2: Issue Detection (Agent-Delegated) - pending
Phase 3: Cleaning Strategy Design - pending
Phase 4: Cleaning Execution - pending
Phase 5: Verification & Documentation - pending
Mark each phase as you complete it. Document all findings in numbered markdown files (01-cleaning-scope.md through 05-verification-report.md) within your analysis workspace directory.
Goal: Read quality report from importing-data, understand detected issues, prioritize for cleaning based on impact and severity.
Locate the quality report:
analysis/[session-name]/05-quality-report.md
This report (generated by importing-data Phase 5) contains:
Extract key information:
Document: Summarize findings from quality report.
Issue Prioritization Matrix:
Evaluate each issue on three dimensions:
1. Impact (% of rows/columns affected)
2. Severity (effect on analysis validity)
3. Effort (complexity to resolve)
Combine dimensions to assign priority:
| Impact | Severity | Effort | Priority | Action Timing |
|---|---|---|---|---|
| High | Critical | Any | CRITICAL | Must address |
| High | Significant | Low/Med | HIGH | Must address |
| Medium | Critical | Low/Med | HIGH | Must address |
| Any | Any | High | MEDIUM | Address if time permits |
| Low | Minor | Any | LOW | Document, may skip |
Document: Create prioritized issue table in 01-cleaning-scope.md.
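For illustration, a scoped issue table might look like this (hypothetical issues, columns, and percentages):

| # | Issue | Impact | Severity | Effort | Priority |
|---|---|---|---|---|---|
| 1 | Exact duplicates on (customer_id, order_date) | High (8% of rows) | Critical | Low | CRITICAL |
| 2 | Outliers in amount | Medium (2% of rows) | Significant | Low | HIGH |
| 3 | 40 free-text variants in region | Low | Minor | Medium | LOW |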
Create analysis/[session-name]/01-cleaning-scope.md with: ./templates/phase-1.md
CHECKPOINT: Before proceeding to Phase 2, you MUST have:
- 05-quality-report.md from importing-data
- 01-cleaning-scope.md created with all sections filled

Goal: Deep-dive investigation of prioritized data quality issues using sub-agents to prevent context pollution.
CRITICAL: This phase MUST use sub-agent delegation. DO NOT analyze data in main agent context.
Use dedicated duplicate detection agents
For exact duplicates:
Invoke the detect-exact-duplicates agent:
Task tool with agent: detect-exact-duplicates
Parameters:
- table_name: raw_[actual_table_name]
- key_columns: [columns that define uniqueness, from Phase 1 scope]
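If you want to sanity-check the agent's findings without delegating, the core duplicate-group query is simple enough to run directly (a sketch; customer_id and order_date are hypothetical key columns):

-- Find duplicate groups on the candidate key
SELECT customer_id, order_date, COUNT(*) as occurrences
FROM raw_[table_name]
GROUP BY customer_id, order_date
HAVING COUNT(*) > 1
ORDER BY occurrences DESC;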
For near-duplicates (fuzzy matching):
Invoke the detect-near-duplicates agent:
Task tool with agent: detect-near-duplicates
Parameters:
- table_name: raw_[actual_table_name]
- text_column: [specific text column flagged in Phase 1]
Repeat for each text column requiring fuzzy matching.
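Real similarity scoring happens inside the agent; a quick manual probe that catches the simplest near-duplicates is a normalization pass (a sketch assuming the differences are only case and whitespace):

-- Group values that collide after lowercasing and trimming
SELECT LOWER(TRIM([text_column])) as normalized,
       COUNT(DISTINCT [text_column]) as variants,
       COUNT(*) as rows_affected
FROM raw_[table_name]
GROUP BY normalized
HAVING COUNT(DISTINCT [text_column]) > 1;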
Document findings in 02-detected-issues.md using template below.
Use dedicated outlier detection agent
For each numeric column flagged in Phase 1:
Invoke the detect-outliers agent:
Task tool with agent: detect-outliers
Parameters:
- table_name: raw_[actual_table_name]
- numeric_col: [specific numeric column from Phase 1 scope]
Repeat for each numeric column requiring outlier analysis.
Document findings in 02-detected-issues.md using template below.
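Before delegating, a one-query distribution profile helps confirm the column is worth an agent run (a sketch):

-- Quick profile of the numeric column
SELECT COUNT(*) as n,
       COUNT([numeric_col]) as non_null,
       MIN([numeric_col]) as min_val,
       MAX([numeric_col]) as max_val,
       AVG([numeric_col]) as mean_val
FROM raw_[table_name];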
If multiple tables exist with FK relationships identified in Phase 1:
Use dedicated FK validation agent
For each FK relationship flagged in Phase 1:
Invoke the detect-foreign-keys agent (focused validation mode):
Task tool with agent: detect-foreign-keys
Parameters:
- database_path: data/analytics.db
- child_table: [specific child table]
- child_column: [FK column]
- parent_table: [specific parent table]
- parent_column: [PK column]
The agent will:
If single table: Skip this detection, document "N/A - Single table" in detected issues report.
Document findings in 02-detected-issues.md using template below.
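For reference, the orphan check at the heart of this validation reduces to an anti-join of this shape (a sketch; the agent adds row samples and percentages):

-- Count child rows whose FK has no matching parent
SELECT COUNT(*) as orphan_count
FROM [child_table] c
LEFT JOIN [parent_table] p ON c.[child_column] = p.[parent_column]
WHERE p.[parent_column] IS NULL
  AND c.[child_column] IS NOT NULL;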
After all sub-agents return findings:
For duplicates:
For outliers:
For FK orphans (if multiple tables):
Document: Observations and preliminary decisions for Strategy Phase.
Create analysis/[session-name]/02-detected-issues.md with: ./templates/phase-2.md
CHECKPOINT: Before proceeding to Phase 3, you MUST have:
- 02-detected-issues.md with all sections filled (including FK orphans if applicable)

Goal: Design cleaning approach for each detected issue type, present options with trade-offs, and get user confirmation before execution.
From 02-detected-issues.md, identify all issue types requiring decisions:
For exact duplicates, choose ONE approach:
Option A: Keep First Occurrence
Option B: Keep Most Complete
Option C: Merge Records
For near-duplicates, choose ONE approach:
Option A: Auto-Merge High Confidence (>95%)
Option B: Manual Review All
Option C: Skip Near-Duplicates
Document chosen approach in 03-cleaning-strategy.md
For each numeric column with outliers, choose ONE approach:
Option A: Exclude Outliers
Option B: Cap at Threshold
Option C: Flag and Keep
Option D: Keep As-Is
Document chosen approach per column in 03-cleaning-strategy.md
If free text columns flagged in Phase 2:
Step 1: Invoke categorization agent
Task tool with agent: categorize-free-text
Parameters:
- column_name: [specific text column]
- unique_values: [list from Phase 2 detection]
- business_context: [optional context about what values represent]
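The unique_values parameter can be pulled straight from the database before invoking the agent (a sketch):

-- Unique values with frequencies, most common first
SELECT [text_column] as value, COUNT(*) as frequency
FROM raw_[table_name]
GROUP BY [text_column]
ORDER BY frequency DESC;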
Step 2: Review agent's proposed categories
Agent will return:
Step 3: Choose categorization approach:
Option A: Accept Agent Proposal
Option B: Modify Categories
Option C: Manual Categorization
Option D: Keep As-Is
Document chosen approach in 03-cleaning-strategy.md
For FK orphaned records identified in Phase 2:
For each FK relationship with orphaned records, choose ONE approach:
Option A: Exclude Orphaned Records
Option B: Preserve with NULL
Option C: Flag and Keep
Add [fk_column]_orphan_flag column, keep original FK value

Option D: Create Placeholder Parent
Document chosen approach per FK relationship in 03-cleaning-strategy.md
If business rules were defined in Phase 1 scope:
For each rule, choose approach for violations:
Option A: Exclude Violating Records
Option B: Cap/Coerce to Valid Range
Option C: Flag and Keep
Document chosen approach per rule in 03-cleaning-strategy.md
Create analysis/[session-name]/03-cleaning-strategy.md with: ./templates/phase-3.md
CHECKPOINT: Before proceeding to Phase 4, you MUST have:
- 03-cleaning-strategy.md created with all decisions documented

Goal: Execute approved cleaning strategies, create clean_* tables, track all exclusions and transformations.
CRITICAL: All transformations use CREATE TABLE AS SELECT pattern. Keep raw_* tables intact.
Based on Strategy from Phase 3:
For Exact Duplicates (Keep First approach):
-- Create clean table without exact duplicates
CREATE TABLE clean_[table_name] AS
WITH ranked_records AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY [key_columns]
ORDER BY rowid
) as rn
FROM raw_[table_name]
)
SELECT [all_columns] -- Exclude rn column
FROM ranked_records
WHERE rn = 1;
For Exact Duplicates (Keep Most Complete approach):
-- Create clean table keeping most complete record per duplicate group
-- Score completeness in a first CTE so the window function can order by it
-- (ordering by an alias defined in the same SELECT is not portable)
CREATE TABLE clean_[table_name] AS
WITH completeness_scored AS (
SELECT *,
rowid as source_rowid,
([count non-NULL columns formula]) as completeness_score
FROM raw_[table_name]
),
ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY [key_columns]
ORDER BY completeness_score DESC, source_rowid
) as rn
FROM completeness_scored
)
SELECT [all_columns] -- Exclude rn, source_rowid, and completeness_score
FROM ranked
WHERE rn = 1;
For Near-Duplicates (Auto-Merge High Confidence approach):
-- Create mapping table from the near-duplicate agent's output
-- (explicit CREATE + INSERT; SQLite does not support column aliases
-- on a VALUES subquery)
CREATE TABLE [column]_near_dup_mapping (
original_value TEXT,
canonical_value TEXT
);
INSERT INTO [column]_near_dup_mapping (original_value, canonical_value) VALUES
('[value1]', '[canonical]'),
('[value2]', '[canonical]'),
...;
-- Apply mapping
UPDATE clean_[table_name]
SET [text_column] = (
SELECT canonical_value
FROM [column]_near_dup_mapping
WHERE original_value = [text_column]
)
WHERE [text_column] IN (SELECT original_value FROM [column]_near_dup_mapping);
Verification:
-- Verify duplicate removal
SELECT COUNT(*) as clean_count FROM clean_[table_name];
-- COUNT(DISTINCT ...) accepts only one expression in SQLite, so check
-- multi-column keys with a grouped query instead
SELECT [key_columns], COUNT(*) as occurrences
FROM clean_[table_name]
GROUP BY [key_columns]
HAVING COUNT(*) > 1;
-- Expected: 0 rows returned (no remaining duplicates)
-- Exclusion count
SELECT [raw_count] - [clean_count] as excluded_duplicates;
Document in 04-cleaning-execution.md:
Based on Strategy from Phase 3:
For Outliers (Exclude approach):
-- Calculate thresholds
WITH stats AS (
SELECT
[median_calculation] as median,
[mad_calculation] * 1.4826 as mad
FROM raw_[table_name]
WHERE [numeric_col] IS NOT NULL
)
-- Create clean table excluding outliers
CREATE TABLE clean_[table_name] AS
SELECT r.*
FROM raw_[table_name] r
CROSS JOIN stats s
WHERE ABS(r.[numeric_col] - s.median) <= 3 * s.mad
OR r.[numeric_col] IS NULL; -- Keep NULLs
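The [median_calculation] and [mad_calculation] placeholders need real SQL, and SQLite has no built-in MEDIAN. One self-contained way to compute both, using the ORDER BY/LIMIT/OFFSET median trick, is sketched below; substitute it into the stats CTE (or use a percentile extension if one is loaded):

-- Compute median and MAD for [numeric_col] in plain SQLite
WITH vals AS (
SELECT [numeric_col] as v
FROM raw_[table_name]
WHERE [numeric_col] IS NOT NULL
),
med AS (
-- Middle row (odd count) or average of the two middle rows (even count)
SELECT AVG(v) as median FROM (
SELECT v FROM vals ORDER BY v
LIMIT 2 - (SELECT COUNT(*) FROM vals) % 2
OFFSET (SELECT COUNT(*) - 1 FROM vals) / 2
)
),
devs AS (
SELECT ABS(vals.v - med.median) as dev
FROM vals CROSS JOIN med
),
madv AS (
-- Median of absolute deviations, same LIMIT/OFFSET trick
SELECT AVG(dev) as mad_raw FROM (
SELECT dev FROM devs ORDER BY dev
LIMIT 2 - (SELECT COUNT(*) FROM devs) % 2
OFFSET (SELECT COUNT(*) - 1 FROM devs) / 2
)
)
SELECT med.median, madv.mad_raw * 1.4826 as mad
FROM med CROSS JOIN madv;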
For Outliers (Cap at Threshold approach - Winsorization):
WITH stats AS (
SELECT
[median_calculation] as median,
[mad_calculation] * 1.4826 as mad
FROM raw_[table_name]
WHERE [numeric_col] IS NOT NULL
)
CREATE TABLE clean_[table_name] AS
SELECT
[other_columns],
CASE
WHEN [numeric_col] > median + 3 * mad THEN median + 3 * mad
WHEN [numeric_col] < median - 3 * mad THEN median - 3 * mad
ELSE [numeric_col]
END as [numeric_col]
FROM raw_[table_name]
CROSS JOIN stats;
For Outliers (Flag and Keep approach):
WITH stats AS (
SELECT [median_calculation] as median, [mad_calculation] * 1.4826 as mad
FROM raw_[table_name]
WHERE [numeric_col] IS NOT NULL
)
CREATE TABLE clean_[table_name] AS
SELECT
r.*,
CASE
WHEN ABS(r.[numeric_col] - s.median) > 3 * s.mad THEN 1
ELSE 0
END as [numeric_col]_outlier_flag
FROM raw_[table_name] r
CROSS JOIN stats s;
Verification:
-- Verify outlier handling
SELECT COUNT(*) as clean_count FROM clean_[table_name];
-- For Exclude approach: check no outliers remain
WITH stats AS (...)
SELECT COUNT(*) as remaining_outliers
FROM clean_[table_name], stats
WHERE ABS([numeric_col] - median) > 3 * mad;
-- Expected: 0
-- For Cap approach: check values at thresholds
SELECT MIN([numeric_col]), MAX([numeric_col]) FROM clean_[table_name];
-- For Flag approach: check flag distribution
SELECT [numeric_col]_outlier_flag, COUNT(*)
FROM clean_[table_name]
GROUP BY [numeric_col]_outlier_flag;
Document in 04-cleaning-execution.md:
Based on Strategy from Phase 3:
If using agent's proposed categories:
-- Create category mapping from agent output
-- (explicit CREATE + INSERT; SQLite does not support column aliases
-- on a VALUES subquery)
CREATE TABLE [column]_category_mapping (
original_value TEXT,
category TEXT
);
INSERT INTO [column]_category_mapping (original_value, category) VALUES
('[value]', '[Category 1]'),
('[value]', '[Category 1]'),
('[value]', '[Category 2]'),
...;
-- Apply categorization
CREATE TABLE clean_[table_name] AS
SELECT
r.[other_columns],
COALESCE(m.category, 'Other') as [column]_category
FROM raw_[table_name] r
LEFT JOIN [column]_category_mapping m
ON r.[text_column] = m.original_value;
If handling uncategorizable values:
-- Option A: Exclude uncategorizable
CREATE TABLE clean_[table_name] AS
SELECT
r.*,
m.category as [column]_category
FROM raw_[table_name] r
INNER JOIN [column]_category_mapping m
ON r.[text_column] = m.original_value;
-- INNER JOIN excludes unmapped values
-- Option B: Map to "Other" category (shown in previous query with COALESCE)
Verification:
-- Verify categorization coverage
SELECT
[column]_category,
COUNT(*) as count,
ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) as pct
FROM clean_[table_name]
GROUP BY [column]_category
ORDER BY count DESC;
-- Check for unmapped values (if kept)
SELECT COUNT(*) as unmapped
FROM clean_[table_name]
WHERE [column]_category IS NULL OR [column]_category = 'Other';
Document in 04-cleaning-execution.md:
Based on Strategy from Phase 3:
-- Exclude rows violating business rules
CREATE TABLE clean_[table_name] AS
SELECT *
FROM raw_[table_name]
WHERE [rule_1_validation]
AND [rule_2_validation]
...;
-- Example rules:
-- WHERE age BETWEEN 0 AND 120
-- AND amount > 0
-- AND date BETWEEN '2020-01-01' AND '2025-12-31'
Verification:
-- Verify no violations remain
SELECT COUNT(*) as violations
FROM clean_[table_name]
WHERE NOT ([rule_1_validation] AND [rule_2_validation] ...);
-- Expected: 0
Document in 04-cleaning-execution.md:
Based on Strategy from Phase 3:
For Orphaned Records (Exclude approach):
-- Remove orphaned child records (Option A from Phase 3)
CREATE TABLE clean_child_table AS
SELECT c.*
FROM raw_child_table c
INNER JOIN raw_parent_table p ON c.fk_column = p.pk_column;
-- INNER JOIN automatically excludes orphans
For Orphaned Records (Preserve with NULL approach):
-- Set orphaned FK values to NULL (Option B from Phase 3)
CREATE TABLE clean_child_table AS
SELECT
c.[other_columns], -- All columns except fk_column, to avoid a duplicate column name
CASE
WHEN p.pk_column IS NULL THEN NULL
ELSE c.fk_column
END as fk_column
FROM raw_child_table c
LEFT JOIN raw_parent_table p ON c.fk_column = p.pk_column;
For Orphaned Records (Flag and Keep approach):
-- Add orphan flag column (Option C from Phase 3)
CREATE TABLE clean_child_table AS
SELECT
c.*,
CASE
WHEN p.pk_column IS NULL AND c.fk_column IS NOT NULL THEN 1
ELSE 0
END as fk_column_orphan_flag
FROM raw_child_table c
LEFT JOIN raw_parent_table p ON c.fk_column = p.pk_column;
For Orphaned Records (Create Placeholder Parent approach):
-- Step 1: Create clean parent table with a placeholder record (Option D from Phase 3)
-- Insert into a clean_* copy, not raw_parent_table, so raw tables stay intact
CREATE TABLE clean_parent_table AS
SELECT * FROM raw_parent_table;
INSERT INTO clean_parent_table (pk_column, name)
VALUES (-1, 'Unknown');
-- Step 2: Remap orphans to placeholder
CREATE TABLE clean_child_table AS
SELECT
c.[other_columns], -- All columns except fk_column, to avoid a duplicate column name
CASE
WHEN p.pk_column IS NULL AND c.fk_column IS NOT NULL THEN -1
ELSE c.fk_column
END as fk_column -- NULL FKs are not orphans, so they stay NULL
FROM raw_child_table c
LEFT JOIN raw_parent_table p ON c.fk_column = p.pk_column;
Verification:
-- Verify no orphans remain (for Exclude approach)
SELECT COUNT(*) as orphans
FROM clean_child_table c
LEFT JOIN clean_parent_table p ON c.fk_column = p.pk_column
WHERE p.pk_column IS NULL AND c.fk_column IS NOT NULL;
-- Expected: 0
-- Verify NULL remapping (for Preserve with NULL approach)
SELECT COUNT(*) as nulled_fks
FROM clean_child_table
WHERE fk_column IS NULL;
-- Expected: [count of orphans from Phase 2]
-- Verify flag accuracy (for Flag and Keep approach)
SELECT fk_column_orphan_flag, COUNT(*)
FROM clean_child_table
GROUP BY fk_column_orphan_flag;
-- Expected: flag=1 count matches orphan count from Phase 2
Document in 04-cleaning-execution.md:
[If single table: "N/A - Single table analysis"]
If multiple transformations needed, use CTE chain:
CREATE TABLE clean_[table_name] AS
WITH
-- Step 1: Remove duplicates
deduped AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY [key_cols] ORDER BY rowid) as rn
FROM raw_[table_name]
),
no_dups AS (
SELECT [all_columns] FROM deduped WHERE rn = 1
),
-- Step 2: Handle outliers
outliers_removed AS (
SELECT d.*
FROM no_dups d
CROSS JOIN (SELECT [median], [mad] FROM ...) stats
WHERE ABS(d.[numeric_col] - stats.median) <= 3 * stats.mad
),
-- Step 3: Apply categorization
categorized AS (
SELECT
o.*,
COALESCE(m.category, 'Other') as [column]_category
FROM outliers_removed o
LEFT JOIN [column]_category_mapping m ON o.[text_col] = m.original_value
),
-- Step 4: Enforce business rules
final AS (
SELECT *
FROM categorized
WHERE [rule_validations]
)
SELECT * FROM final;
Verification of combined transformations:
-- Row count reconciliation
SELECT
(SELECT COUNT(*) FROM raw_[table_name]) as raw_count,
(SELECT COUNT(*) FROM clean_[table_name]) as clean_count,
(SELECT COUNT(*) FROM raw_[table_name]) - (SELECT COUNT(*) FROM clean_[table_name]) as total_excluded;
Create analysis/[session-name]/04-cleaning-execution.md with: ./templates/phase-4.md
CHECKPOINT: Before proceeding to Phase 5, you MUST have:
- 04-cleaning-execution.md created with all results documented

Goal: Validate cleaning results, quantify quality improvements, document complete audit trail from raw to clean.
Critical validation - MUST match exactly:
-- Count raw table
SELECT COUNT(*) as raw_count FROM raw_[table_name];
-- Count clean table
SELECT COUNT(*) as clean_count FROM clean_[table_name];
-- Calculate exclusions from Phase 4 log
-- Expected: raw_count = clean_count + total_exclusions
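A single query can make that reconciliation explicit (template; [total_exclusions] comes from the Phase 4 execution log):

-- Expected: discrepancy = 0
SELECT
(SELECT COUNT(*) FROM raw_[table_name]) as raw_count,
(SELECT COUNT(*) FROM clean_[table_name]) as clean_count,
(SELECT COUNT(*) FROM raw_[table_name])
- (SELECT COUNT(*) FROM clean_[table_name])
- [total_exclusions] as discrepancy;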
Document: Confirm reconciliation passes. If mismatch, investigate before proceeding.
For each transformation applied in Phase 4:
Duplicate Removal Verification:
-- Confirm no duplicates remain
SELECT [key_columns], COUNT(*) as occurrences
FROM clean_[table_name]
GROUP BY [key_columns]
HAVING COUNT(*) > 1;
-- Expected: 0 rows returned
Outlier Handling Verification:
-- For Exclude approach: confirm no outliers remain
WITH stats AS (
SELECT [median], [mad] FROM ...
)
SELECT COUNT(*) as remaining_outliers
FROM clean_[table_name], stats
WHERE ABS([numeric_col] - median) > 3 * mad;
-- Expected: 0 rows
-- For Cap approach: confirm values at thresholds
SELECT MIN([numeric_col]) as min_val, MAX([numeric_col]) as max_val
FROM clean_[table_name];
-- Expected: min >= (median - 3*MAD), max <= (median + 3*MAD)
-- For Flag approach: check flag accuracy
SELECT [numeric_col]_outlier_flag, COUNT(*)
FROM clean_[table_name]
GROUP BY [numeric_col]_outlier_flag;
-- Expected: distribution matches Phase 4 execution log
Free Text Categorization Verification:
-- Confirm all values categorized
SELECT COUNT(*) as uncategorized
FROM clean_[table_name]
WHERE [column]_category IS NULL;
-- Expected: 0 (unless "keep uncategorized" was strategy)
-- Verify category distribution
SELECT [column]_category, COUNT(*) as count,
ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) as pct
FROM clean_[table_name]
GROUP BY [column]_category
ORDER BY count DESC;
-- Expected: matches Phase 4 execution results
Business Rule Verification (if applicable):
-- Confirm no violations remain
SELECT COUNT(*) as violations
FROM clean_[table_name]
WHERE NOT ([rule_1_condition] AND [rule_2_condition] ...);
-- Expected: 0 rows
Document all verification results:
Before vs After comparison:
-- Completeness comparison
SELECT
'raw' as table_name,
COUNT(*) as total_rows,
COUNT([col1]) as [col1]_non_null,
ROUND(100.0 * COUNT([col1]) / COUNT(*), 2) as [col1]_completeness_pct
FROM raw_[table_name]
UNION ALL
SELECT
'clean' as table_name,
COUNT(*),
COUNT([col1]),
ROUND(100.0 * COUNT([col1]) / COUNT(*), 2)
FROM clean_[table_name];
Create quality improvement table:
| Metric | Raw Table | Clean Table | Improvement |
|---|---|---|---|
| Total rows | [N] | [N] | -[X]% (exclusions) |
| Completeness ([col1]) | [X]% | [X]% | +[X] pct points |
| Duplicate groups | [N] | 0 | -[N] (100%) |
| Outliers ([col2]) | [N] | 0 | -[N] (100%) |
| Free text unique values | [N] | [N categories] | -[X]% (categorization) |
Document: Quality improvements quantified with specific deltas.
Select representative samples to manually verify:
-- Records that were in duplicate groups
SELECT * FROM clean_[table_name] WHERE rowid IN ([IDs from Phase 2]);
-- Verify: correct record kept per strategy
-- Records with outliers (if flagged/capped, not excluded)
SELECT * FROM clean_[table_name] WHERE [numeric_col]_outlier_flag = 1;
-- Verify: flag accurate, values capped if applicable
-- Records with categorized free text
SELECT [original_col], [col]_category FROM clean_[table_name] LIMIT 20;
-- Verify: categories make sense, mapping correct
Document: Manual verification confirms automated transformations worked correctly.
Document what this cleaning did NOT address:
Confidence assessment:
Document: Honest assessment of what was cleaned and what wasn't, with confidence levels.
Create analysis/[session-name]/05-verification-report.md with: ./templates/phase-5.md
CHECKPOINT: Before concluding cleaning-data skill, you MUST have:
- 05-verification-report.md created with all sections filled

Why this is wrong: Detecting issues isn't the same as deciding how to fix them. Different approaches (exclude vs cap vs flag) have different analytical implications.
Do instead: Always complete Phase 3 with explicit decision frameworks. Document why you chose each approach with user confirmation.
Why this is wrong: Some outliers are legitimate (VIP customers, seasonal spikes, rare events). Automatic exclusion loses valuable data.
Do instead: Complete Phase 2 detection with agent analysis. Review patterns in Phase 3. Choose approach based on business context, not just statistical threshold.
Why this is wrong: 90-95% similarity doesn't mean identical. "John Smith" vs "John Smyth" might be the same person OR two different people.
Do instead: Review confidence levels in Phase 3. Auto-merge only high confidence (>95%). Manual review medium confidence. Document decisions.
Why this is wrong: Undocumented exclusions break audit trail. When results are questioned, you can't explain what data was excluded or why.
Do instead: Complete Phase 4 execution log with exclusion summary table. Document every excluded record with reason and count. Reconcile in Phase 5.
Why this is wrong: Phase 5 verification checks RESULTS, not execution. Queries can run without errors but produce wrong results (logic bugs, wrong thresholds, incorrect mappings).
Do instead: Always complete Phase 5 with before/after comparisons, spot checks, and manual inspection. Verification catches transformation bugs.
Why this is wrong: Data quality issues cluster. If you found duplicates, likely also have outliers, NULLs, and inconsistencies. One fix doesn't make data "clean".
Do instead: Complete all 5 phases systematically. Phase 2 detects ALL issue types. Address all prioritized issues in Phases 3-4.
Why this is wrong: Smaller isn't always better. Excluding 50% of data might remove all the interesting variation. Quality ≠ quantity reduction.
Do instead: Complete Phase 5 with quality improvement quantification. Measure completeness, consistency, validity improvements - not just row count reduction.
Why this is wrong: Manual categorization is inconsistent, misses patterns, and pollutes main agent context with hundreds of unique values.
Do instead: Always delegate free text analysis to categorize-free-text agent in Phase 3. Agent provides structured mapping with confidence levels. Review and adjust if needed.
Why this is wrong: 2% violations might indicate systematic data quality issue (bad data entry, import error, logic flaw). Deleting hides the problem.
Do instead: Investigate violations in Phase 2. Document why they violate rules in Phase 3. Consider whether to exclude, correct, or flag in strategy. Document pattern in Phase 5.
Why this is wrong: All cleaning has limitations and assumptions. Pretending otherwise misleads analysts who use the clean data.
Do instead: Complete Phase 5 limitations section honestly. Document what was NOT cleaned, assumptions made, edge cases, confidence levels. Transparency builds trust.
This skill ensures systematic, documented data cleaning with quality validation by:
Prioritized scope definition: Read quality report from importing-data, apply impact × severity × effort framework - ensures high-value issues addressed first, not random fixes.
Structured decision-making: Present options with trade-offs for duplicates, outliers, free text, business rules - gets user confirmation before execution, prevents undocumented assumptions.
Agent-delegated detection: Use dedicated sub-agents (detect-exact-duplicates, detect-near-duplicates, detect-outliers, categorize-free-text) - prevents context pollution while ensuring thorough analysis.
Explicit strategy approval: Document chosen approach per issue type in Phase 3 with rationale - creates decision audit trail, enables strategy review if results questioned.
Transformation transparency: Execute cleaning with CREATE TABLE AS SELECT, preserve raw_* tables, track all exclusions - maintains complete audit trail from raw to clean.
Rigorous verification: Validate transformations, quantify quality improvements, spot-check samples, document limitations - ensures clean data is actually clean and limitations are known.
Follow this process and you'll create well-documented clean tables with validated quality improvements, complete audit trail from raw data to analysis-ready data, and honest assessment of what was cleaned and what limitations remain.