Systematic CSV import process - discover structure, design schema, standardize formats, import to database, detect quality issues (component skill for DataPeeker analysis sessions)
Systematic CSV import process that profiles files, designs schemas, standardizes formats, imports to database, and delegates quality assessment to sub-agents. Use when starting analysis with CSV data to ensure proper schema design and detect quality issues before cleaning.
/plugin marketplace add tilmon-engineering/claude-skills
/plugin install datapeeker@tilmon-eng-skills

This skill inherits all available tools. When active, it can use any tool Claude has access to.
Templates:
- templates/phase-1.md
- templates/phase-2.md
- templates/phase-3.md
- templates/phase-4.md
- templates/phase-5.md

Use this skill when:
- Starting an analysis session with CSV data that needs to be imported into the database
- You want the schema designed and quality issues detected before the cleaning-data skill runs
This skill is a prerequisite for all DataPeeker analysis workflows and is referenced by all process skills.
Note: DataPeeker uses SQLite (data/analytics.db), but this process applies to any SQL database. For SQLite-specific commands, see the using-sqlite skill.
Prerequisites:
- SQLite database available (data/analytics.db)
- Analysis session started (just start-analysis or equivalent)
- Imports target raw_* tables, not final tables (cleaning-data handles finalization)

Create a TodoWrite checklist for the 5-phase data import process:
Phase 1: CSV Discovery & Profiling - pending
Phase 2: Schema Design & Type Inference - pending
Phase 3: Basic Standardization - pending
Phase 4: Import Execution - pending
Phase 5: Quality Assessment & Reporting - pending
Mark each phase as you complete it. Document all findings in numbered markdown files (01-csv-profile.md through 05-quality-report.md) within your analysis workspace directory.
Goal: Understand CSV file structure, detect encoding and delimiter, capture sample data for schema design.
Identify the CSV file(s) to import:
Document: Record CSV file path, size, timestamp in 01-csv-profile.md
Detect file encoding to prevent import errors:
file -I /path/to/file.csv
Common encodings:
- charset=us-ascii or charset=utf-8 → Standard, no conversion needed
- charset=iso-8859-1 or charset=windows-1252 → May need conversion to UTF-8

Document: Record detected encoding. If non-UTF-8, note conversion requirement.
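If a non-UTF-8 charset is reported, a small Python sketch like the one below can confirm a working encoding and write a UTF-8 copy before import. The candidate list and output path are assumptions; adjust them to the file at hand.

# Sketch: confirm a readable encoding and produce a UTF-8 copy (paths and candidates are assumptions)
def convert_to_utf8(src_path, dst_path, candidates=("utf-8", "windows-1252", "iso-8859-1")):
    for enc in candidates:
        try:
            with open(src_path, "r", encoding=enc) as f:
                text = f.read()  # raises UnicodeDecodeError if the encoding does not fit
            break
        except UnicodeDecodeError:
            continue
    else:
        raise ValueError("No candidate encoding decoded the file cleanly")
    # Note: iso-8859-1 accepts any byte sequence, so keep it last in the candidate list.
    with open(dst_path, "w", encoding="utf-8", newline="") as f:
        f.write(text)
    return enc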
Analyze first few lines to detect delimiter:
head -n 5 /path/to/file.csv
Check for:
- Comma (,) - most common
- Tab (\t) - TSV files
- Pipe (|) - less common
- Semicolon (;) - European CSV files

Document: Record detected delimiter character.
Determine if first row contains headers:
Document: Record whether headers present, list header names if found.
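Where the head output is ambiguous, Python's csv.Sniffer can cross-check both the delimiter and the header guess. This is a sketch that assumes the file is readable as UTF-8; Sniffer can misfire on unusual files, so treat its answer as a hint to verify, not a final decision.

import csv

# Sketch: cross-check delimiter and header detection (file path is a placeholder)
with open("/path/to/file.csv", "r", encoding="utf-8", newline="") as f:
    sample = f.read(64 * 1024)  # a few KB of the file is enough to sniff
    dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
    has_header = csv.Sniffer().has_header(sample)

print(f"delimiter={dialect.delimiter!r}, header={has_header}")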
Capture representative samples for schema inference:
# First 10 rows
head -n 11 /path/to/file.csv > /tmp/csv_head_sample.txt
# Last 10 rows
tail -n 10 /path/to/file.csv > /tmp/csv_tail_sample.txt
# Row count
wc -l /path/to/file.csv
Document: Include head and tail samples in 01-csv-profile.md for reference during schema design.
Create analysis/[session-name]/01-csv-profile.md with: ./templates/phase-1.md
CHECKPOINT: Before proceeding to Phase 2, you MUST have:
- 01-csv-profile.md created with all sections filled

Goal: Design the database schema by inferring types from CSV samples and propose a table structure with rationale.
For each column from Phase 1 profiling, infer appropriate data type:
Type Inference Rules (adapt to your database):
Integer types - Use when: every sampled value is a whole number and the column represents a count, quantity, or year (no leading zeros, no formatting characters).
Decimal/Float types - Use when: sampled values contain decimal points or represent measurements, rates, or monetary amounts.
Text/String types - Use when: values are non-numeric or mixed, or when numeric-looking values are really identifiers (ZIP codes, phone numbers, IDs with leading zeros).
Note: Date/time handling varies by database. SQLite stores dates as TEXT. PostgreSQL/MySQL have native DATE/TIMESTAMP types.
Document: For each column, record inferred type with rationale and sample values.
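To back each rationale with evidence rather than intuition, the sampled values can be tested against progressively stricter types. The helper below is a rough sketch of that idea for SQLite's INTEGER/REAL/TEXT; the function name and the leading-zero rule are my own choices, not part of the skill.

# Sketch: infer a SQLite type for one column from its sampled values (illustrative helper)
def infer_sqlite_type(values):
    non_null = [v.strip() for v in values if v and v.strip()]
    if not non_null:
        return "TEXT"  # nothing to infer from; stay permissive

    def all_cast(cast):
        try:
            for v in non_null:
                cast(v)
            return True
        except ValueError:
            return False

    # Leading zeros usually indicate identifiers (ZIP codes, account numbers), not numbers.
    if any(v.startswith("0") and v != "0" and "." not in v for v in non_null):
        return "TEXT"
    if all_cast(int):
        return "INTEGER"
    if all_cast(float):
        return "REAL"
    return "TEXT"

# Example: infer_sqlite_type(["12", "007", "31"]) returns "TEXT" because of the leading zero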
Identify NULL representations in CSV:
Typical representations include empty strings (map to NULL in the database), the literal text "NULL" or "null", and placeholders such as "N/A" or "-" (all map to NULL).
Document: List all NULL representations found and conversion strategy.
Propose CREATE TABLE statement:
CREATE TABLE raw_[table_name] (
[column_1_name] [TYPE], -- Rationale for type choice
[column_2_name] [TYPE], -- Rationale for type choice
...
);
Table naming convention:
- Prefix with raw_ to indicate unprocessed data
- Examples: raw_sales_data, raw_customers

Document: Full CREATE TABLE statement with inline comments explaining each type choice.
Use AskUserQuestion tool to present schema proposal:
Document: User's approval and any requested modifications.
Create analysis/[session-name]/02-schema-design.md with: ./templates/phase-2.md
CHECKPOINT: Before proceeding to Phase 3, you MUST have:
- 02-schema-design.md created with all sections filled

Goal: Define transformation rules for dates, numbers, whitespace, and text formatting to ensure clean, consistent data in the raw_* table.
Target format: ISO 8601
- Dates: YYYY-MM-DD (e.g., 2025-01-15)
- Timestamps: YYYY-MM-DD HH:MM:SS (e.g., 2025-01-15 14:30:00)

Common source formats to convert:
- MM/DD/YYYY or M/D/YYYY → YYYY-MM-DD
- DD/MM/YYYY or D/M/YYYY → YYYY-MM-DD (verify with user which is month vs day)
- YYYY/MM/DD → YYYY-MM-DD (slash to hyphen)
- Mon DD, YYYY → YYYY-MM-DD (e.g., "Jan 15, 2025" → "2025-01-15")
- YYYY-MM-DDTHH:MM:SS → Keep as-is (valid ISO 8601)

Document: List each date column, source format detected, target format, conversion logic.
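A minimal standardize_date helper (the same name the Phase 4 import script references) might look like the sketch below. The accepted format list is an assumption; restrict it to the formats actually observed in Phase 1, and resolve the MM/DD vs DD/MM ambiguity with the user before adding either pattern.

from datetime import datetime

# Sketch: convert observed source formats to ISO 8601 (format list is an assumption)
SOURCE_DATE_FORMATS = ["%m/%d/%Y", "%Y/%m/%d", "%b %d, %Y", "%Y-%m-%d", "%Y-%m-%dT%H:%M:%S"]

def standardize_date(value):
    value = (value or "").strip()
    if not value:
        return None  # NULL handling follows the Phase 2 mapping
    for fmt in SOURCE_DATE_FORMATS:
        try:
            dt = datetime.strptime(value, fmt)
        except ValueError:
            continue
        if fmt == "%Y-%m-%dT%H:%M:%S":
            return value  # already valid ISO 8601; keep as-is
        return dt.strftime("%Y-%m-%d %H:%M:%S") if "%H" in fmt else dt.strftime("%Y-%m-%d")
    raise ValueError(f"Unrecognized date format: {value!r}")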
Remove non-numeric characters:
- $123.45 → 123.45
- 1,234.56 → 1234.56
- 45% → 45 or 0.45 (document choice)
- 25kg, 100m → 25, 100 (document unit in column name/comments)

Decimal handling:
- 1.234,56 → 1234.56 (European decimal format; verify with user)

Document: List each numeric column, formatting issues found, normalization rules.
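Similarly, a standardize_number helper (matching the name used in the Phase 4 script) can strip currency symbols, thousands separators, and units. The percent and European-decimal handling below are one possible choice; follow whatever rules you documented above.

import re

# Sketch: normalize formatted numbers (flag defaults are assumptions; follow the documented rules)
def standardize_number(value, european_decimal=False, percent_as_fraction=False):
    s = (value or "").strip()
    if not s:
        return None
    is_percent = s.endswith("%")
    s = re.sub(r"[^\d.,\-]", "", s)  # drop $, %, units, spaces, etc.
    if european_decimal:
        s = s.replace(".", "").replace(",", ".")  # 1.234,56 -> 1234.56
    else:
        s = s.replace(",", "")  # 1,234.56 -> 1234.56
    number = float(s)
    if is_percent and percent_as_fraction:
        number /= 100  # 45% -> 0.45
    return number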
Whitespace cleaning:
- Trim leading/trailing whitespace
- Normalize line endings: \r\n or \r → \n

Text case standardization (optional, apply selectively):
Document: Which columns get whitespace cleaning, which get case normalization.
Apply NULL representation mapping from Phase 2:
- Convert each identified representation to NULL
- Use NULL for numeric/date columns
- Use NULL or '' for TEXT columns (document choice)

Document: NULL conversion applied, count of conversions per column.
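A small lookup applied before the type-specific helpers keeps the NULL mapping explicit. The token set below is an assumption; use the representations documented in Phase 2.

# Sketch: map documented NULL representations before other standardization (token set is an assumption)
NULL_TOKENS = {"", "null", "n/a", "na", "-"}

def map_null(value):
    return None if value is None or value.strip().lower() in NULL_TOKENS else value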
Create analysis/[session-name]/03-standardization-rules.md with: ./templates/phase-3.md
CHECKPOINT: Before proceeding to Phase 4, you MUST have:
- 03-standardization-rules.md created with verification queries

Goal: Execute the import with standardization rules applied, then verify row count and data integrity.
From Phase 2 schema design, finalize CREATE TABLE statement:
CREATE TABLE IF NOT EXISTS raw_[table_name] (
[column_1] [TYPE],
[column_2] [TYPE],
...
);
Execute:
sqlite3 data/analytics.db < create_table.sql
Verify table created:
-- Check table exists
.tables
-- Inspect schema
PRAGMA table_info(raw_[table_name]);
Document: Paste table creation confirmation and schema output.
Import method options:
Option 1: SQLite .import command (for simple cases). Note: when the target table already exists, .import treats the first CSV row as data; use .import --csv --skip 1 (SQLite 3.32+) or strip the header first so it is not inserted as a row:
sqlite3 data/analytics.db <<EOF
.mode csv
.import /path/to/file.csv raw_[table_name]
EOF
Option 2: Python script (for complex standardization):
import csv
import sqlite3
from datetime import datetime

conn = sqlite3.connect('data/analytics.db')
cursor = conn.cursor()

with open('/path/to/file.csv', 'r', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        # Apply standardization rules from Phase 3
        row['date_column'] = standardize_date(row['date_column'])
        row['amount_column'] = standardize_number(row['amount_column'])
        # ... apply other rules ...
        cursor.execute("""
            INSERT INTO raw_[table_name]
            ([columns]) VALUES ([placeholders])
        """, tuple(row.values()))

conn.commit()
conn.close()
Document: Which method used, any import warnings/errors encountered and resolved.
Row count verification:
-- Count rows in table
SELECT COUNT(*) as row_count FROM raw_[table_name];
Compare to CSV row count from Phase 1. Should match (or CSV count - 1 if CSV had header row).
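The comparison can be scripted rather than eyeballed. In this sketch the table name and path are placeholders and the CSV is assumed to have a header row.

import sqlite3

# Sketch: compare CSV line count to imported row count (table name and path are placeholders)
csv_lines = sum(1 for _ in open("/path/to/file.csv", "r", encoding="utf-8"))
expected = csv_lines - 1  # subtract the header row

conn = sqlite3.connect("data/analytics.db")
(imported,) = conn.execute("SELECT COUNT(*) FROM raw_table_name").fetchone()
conn.close()

print(f"expected={expected} imported={imported} match={expected == imported}")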
Sample data inspection:
-- View first 5 rows
SELECT * FROM raw_[table_name] LIMIT 5;
-- View last 5 rows
SELECT * FROM raw_[table_name]
ORDER BY rowid DESC LIMIT 5;
Column completeness check:
-- Check NULL counts per column
SELECT
COUNT(*) as total_rows,
COUNT([column_1]) as [column_1]_non_null,
COUNT([column_2]) as [column_2]_non_null,
...
FROM raw_[table_name];
Document: Paste actual results showing row counts, sample rows, NULL counts.
Execute all verification queries defined in 03-standardization-rules.md:
Expected: All verification queries return 0 rows (no violations).
If violations found: Document count and examples, then decide whether to fix the standardization rule and re-import, or defer the issue to the cleaning-data phase.
Document: Results of all verification queries.
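One way to run the Phase 3 verification queries consistently is to keep them in a named list and report any that return rows. The two queries here are illustrative placeholders, not the actual rules from your 03-standardization-rules.md.

import sqlite3

# Sketch: run verification queries and flag violations (queries and names are placeholders)
VERIFICATION_QUERIES = {
    "non_iso_dates": "SELECT * FROM raw_table_name WHERE date_column NOT LIKE '____-__-__%' LIMIT 5",
    "negative_amounts": "SELECT * FROM raw_table_name WHERE amount_column < 0 LIMIT 5",
}

conn = sqlite3.connect("data/analytics.db")
for name, sql in VERIFICATION_QUERIES.items():
    rows = conn.execute(sql).fetchall()
    status = "OK" if not rows else f"{len(rows)}+ violations (examples captured)"
    print(f"{name}: {status}")
conn.close()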
Create analysis/[session-name]/04-import-log.md with: ./templates/phase-4.md
CHECKPOINT: Before proceeding to Phase 5, you MUST have:
- 04-import-log.md created with all results documented

Goal: Systematically detect data quality issues using a sub-agent to prevent context pollution, and document findings for the cleaning-data skill.
CRITICAL: This phase MUST use sub-agent delegation. DO NOT analyze data in main agent context.
Use dedicated quality-assessment agent
Invoke the quality-assessment agent (defined in .claude/agents/quality-assessment.md):
Task tool with agent: quality-assessment
Parameters:
- table_name: raw_[actual_table_name]
- columns: [list of all columns from schema]
- numeric_columns: [list of numeric columns for outlier detection]
- text_columns: [list of text columns for uniqueness analysis]
The agent will execute all quality checks (NULL analysis, duplicates, outliers, free text) and return structured findings.
Document agent findings in 05-quality-report.md using template below.
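For reference only, the agent's checks typically reduce to a handful of aggregate queries like the sketch below; the table and column names are placeholders, not the agent's actual implementation, and the queries should run inside the sub-agent so raw rows never enter the main context.

import sqlite3

# Sketch: the flavor of checks delegated to the sub-agent (table/column names are placeholders)
conn = sqlite3.connect("data/analytics.db")

nulls = conn.execute(
    "SELECT COUNT(*) - COUNT(amount_column) FROM raw_table_name").fetchone()[0]

duplicate_groups = conn.execute("""
    SELECT COUNT(*) FROM (
        SELECT column_1, column_2
        FROM raw_table_name
        GROUP BY column_1, column_2
        HAVING COUNT(*) > 1
    )""").fetchone()[0]

# SQLite has no built-in STDDEV, so a min/avg/max spread is a cheap first outlier screen.
spread = conn.execute(
    "SELECT MIN(amount_column), AVG(amount_column), MAX(amount_column) FROM raw_table_name").fetchone()

conn.close()
print({"null_amounts": nulls, "duplicate_groups": duplicate_groups, "amount_spread": spread})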
If multiple tables exist in the database, detect foreign key relationships between them.
Use dedicated detect-foreign-keys agent
Invoke the detect-foreign-keys agent (defined in .claude/agents/detect-foreign-keys.md):
Task tool with agent: detect-foreign-keys
Parameters:
- database_path: data/analytics.db
- table_names: [list of all raw_* tables in database]
When to run FK detection: only when two or more raw_* tables exist in the database; skip it for single-table imports.
The agent will analyze candidate key columns across the tables and return likely foreign key relationships with supporting evidence.
Document FK findings in 05-quality-report.md using template below.
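As an illustration of what FK detection amounts to (the real logic lives in the agent definition), a value-overlap heuristic flags column pairs where nearly every value in one table also appears in another. Table and column names below are placeholders.

import sqlite3

# Sketch: value-overlap heuristic for candidate foreign keys (names are placeholders)
def fk_overlap(conn, child_table, child_col, parent_table, parent_col):
    total, matched = conn.execute(f"""
        SELECT COUNT({child_col}),
               SUM(CASE WHEN {child_col} IN (SELECT {parent_col} FROM {parent_table})
                        THEN 1 ELSE 0 END)
        FROM {child_table}""").fetchone()
    return (matched or 0) / total if total else 0.0

conn = sqlite3.connect("data/analytics.db")
ratio = fk_overlap(conn, "raw_sales_data", "customer_id", "raw_customers", "customer_id")
print(f"customer_id overlap: {ratio:.1%}")  # near 100% suggests a foreign key relationship
conn.close()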
Create analysis/[session-name]/05-quality-report.md with: ./templates/phase-5.md
CHECKPOINT: Before concluding importing-data skill, you MUST have:
- 05-quality-report.md created with all sections filled (including FK relationships)

Anti-pattern: Importing a CSV without profiling it because it "looks clean".
Why this is wrong: Hidden issues (encoding, delimiters, NULL representations) cause import failures or silent data corruption. Even "clean" CSVs have edge cases.
Do instead: Always complete Phase 1 profiling. Takes 5 minutes and prevents hours of debugging broken imports.
Anti-pattern: Guessing column types from their names instead of their values.
Why this is wrong: A column named "year" might contain "2023-Q1" (TEXT). An "amount" column might have "$" symbols. Type inference prevents silent casting failures.
Do instead: Complete Phase 2 type inference with sample analysis. Document rationale for each type choice.
Anti-pattern: Importing dates without standardizing their formats.
Why this is wrong: Mixed formats ("01/15/2025", "2025-01-15", "Jan 15 2025") break date arithmetic and sorting. Standardization is mandatory.
Do instead: Complete Phase 3 with explicit date format conversion rules. Verify with queries after import.
Anti-pattern: Importing without documenting the process.
Why this is wrong: Undocumented imports can't be reproduced. When re-importing updated data, you'll forget the transformations applied.
Do instead: Complete Phase 4 import log with commands, results, and verification. Future-you will thank present-you.
Anti-pattern: Skipping the quality assessment phase.
Why this is wrong: Duplicates, outliers, and NULL patterns are invisible without systematic checks. These surface as bugs during analysis.
Do instead: ALWAYS complete Phase 5 with sub-agent delegation. Quality report saves time in cleaning-data phase.
Anti-pattern: Running quality checks in the main agent context.
Why this is wrong: Analyzing thousands of rows pollutes the main agent context, degrading performance for the entire session. Sub-agents prevent this.
Do instead: ALWAYS delegate Phase 5 to sub-agent with exact sqlite3 commands provided.
Anti-pattern: Skipping phases because the file is too large to profile.
Why this is wrong: Large files are MORE likely to have quality issues, not less. Sampling strategies exist for large files.
Do instead: Use head/tail sampling (Phase 1) and query-based profiling (Phase 5 sub-agent). Don't skip phases.
Anti-pattern: Assuming the import captured every row without verification.
Why this is wrong: Silent data loss happens: header rows get miscounted, empty rows get skipped, encoding issues truncate data.
Do instead: Always verify row count matches (Phase 4). If mismatch, investigate before proceeding.
Anti-pattern: Mixing data cleaning into the import phase.
Why this is wrong: Import handles technical standardization (formats). Cleaning handles semantic issues (business rules, categorization). Mixing them creates confusion.
Do instead: Keep importing-data focused on standardization. Let cleaning-data handle deduplication, outliers, free text.
Anti-pattern: Skipping cleaning-data because the quality report looks clean.
Why this is wrong: cleaning-data is ALWAYS mandatory per design. Even "clean" data needs business rule validation and final verification.
Do instead: Proceed to cleaning-data even if quality report shows minimal issues. Document "no cleaning needed" if appropriate.
This skill ensures systematic, documented CSV import with quality assessment by:
Profiling before importing: Understand encoding, delimiters, headers, and sample data before designing schema - prevents import failures and silent corruption.
Explicit type inference: Analyze samples to infer INTEGER/REAL/TEXT types with documented rationale - prevents type casting failures and ensures correct data representation.
Mandatory standardization: Convert dates to ISO 8601, normalize numbers, clean whitespace, map NULL representations - creates consistent data foundation for analysis.
Verified import execution: Document CREATE TABLE statements, import methods, row count verification - ensures reproducibility and data integrity.
Systematic quality assessment: Delegate NULL detection, duplicate finding, outlier identification, and free text discovery to sub-agent - prevents context pollution while ensuring comprehensive quality checks.
Audit trail maintenance: Create numbered markdown files (01-05) documenting every decision - provides full traceability from raw CSV to raw_* tables.
Follow this process and you'll create clean, well-documented raw_* tables ready for the cleaning-data skill, avoid silent data corruption, and maintain complete audit trail for reproducible imports.