Component skill for systematic data profiling and exploration in DataPeeker analysis sessions
Profiles unfamiliar data through systematic 4-phase SQL analysis: schema discovery, quality checks, distribution patterns, and relationships. Use when starting new analyses or encountering unexpected query results.
```
/plugin marketplace add tilmon-engineering/claude-skills
/plugin install datapeeker@tilmon-eng-skills
```

This skill inherits all available tools. When active, it can use any tool Claude has access to.
This component skill guides systematic data profiling before analysis begins. Use it when:

- Starting a new analysis against unfamiliar data
- Encountering unexpected or suspicious query results
Create a TodoWrite checklist for the 4-phase data profiling process:
- Phase 1: Schema Discovery
- Phase 2: Data Quality Assessment
- Phase 3: Distribution Analysis
- Phase 4: Relationship Identification
Mark each phase as you complete it. Document all findings in a numbered markdown file.
## Phase 1: Schema Discovery

Goal: Understand tables, columns, types, and cardinalities.
```sql
-- Get all tables in the database (database-specific methods):
-- SQLite:     SELECT name FROM sqlite_master WHERE type='table';
-- PostgreSQL: SELECT tablename FROM pg_tables WHERE schemaname = 'public';
-- MySQL:      SHOW TABLES;
-- SQL Server: SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_TYPE = 'BASE TABLE';

-- Standard SQL approach (PostgreSQL, MySQL, SQL Server):
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'your_schema'
ORDER BY table_name;
```
Document: Table names and their likely business meaning.
Note: Use database-specific CLI commands or syntax as appropriate for your database engine.
For each table of interest:
```sql
-- Get column information (database-specific):
-- SQLite:     PRAGMA table_info(table_name);
-- PostgreSQL: \d table_name (CLI) or SELECT * FROM information_schema.columns WHERE table_name = 'table_name';
-- MySQL:      DESCRIBE table_name; or SHOW COLUMNS FROM table_name;
-- SQL Server: EXEC sp_columns table_name; or SELECT * FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'table_name';

-- Standard SQL approach:
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_name = 'your_table'
ORDER BY ordinal_position;
```
Document: Column names, data types, nullability, and defaults for each table.
```sql
-- Count rows in each table
SELECT 'table_name' AS table_name, COUNT(*) AS row_count
FROM table_name;
```
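To profile several tables in one pass, the single-table counts can be stacked with UNION ALL. A minimal sketch, using hypothetical table names:

```sql
-- Count rows across multiple tables in one query
SELECT 'orders' AS table_name, COUNT(*) AS row_count FROM orders
UNION ALL
SELECT 'customers', COUNT(*) FROM customers
ORDER BY row_count DESC;
```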
Document: Row counts for each table; flag unexpectedly empty or oversized tables.
```sql
-- Check column uniqueness
SELECT COUNT(*) AS total_rows,
       COUNT(DISTINCT column_name) AS unique_values,
       COUNT(*) - COUNT(DISTINCT column_name) AS duplicate_count
FROM table_name;
```
Test for each candidate key column.
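Where no single column is unique, a composite key may be. A sketch for testing a two-column candidate (column names are placeholders; `||` concatenation is SQLite/PostgreSQL syntax, MySQL uses CONCAT):

```sql
-- Test whether a pair of columns is unique together
-- Note: rows where either column is NULL drop out of the DISTINCT count
SELECT COUNT(*) AS total_rows,
       COUNT(DISTINCT column_a || '|' || column_b) AS unique_pairs
FROM table_name;
```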
Document: Candidate primary keys and any columns with unexpected duplicates.
## Phase 2: Data Quality Assessment

Goal: Identify missing data, invalid values, and other data quality issues.
```sql
-- Count NULLs for each important column
SELECT
  COUNT(*) AS total_rows,
  COUNT(column1) AS column1_non_null,
  COUNT(*) - COUNT(column1) AS column1_null_count,
  ROUND(100.0 * (COUNT(*) - COUNT(column1)) / COUNT(*), 2) AS column1_null_pct
FROM table_name;
```
Document: NULL counts and percentages for each important column.
```sql
-- Find empty, whitespace-only, or NULL strings
SELECT COUNT(*) AS empty_string_count
FROM table_name
WHERE TRIM(text_column) = ''
   OR text_column IS NULL;
```
Document: Columns with empty string issues.
For numeric columns:
```sql
-- Get min, max, and outlier candidates
SELECT
  MIN(numeric_column) AS min_value,
  MAX(numeric_column) AS max_value,
  AVG(numeric_column) AS avg_value,
  COUNT(*) AS total_count,
  COUNT(CASE WHEN numeric_column < 0 THEN 1 END) AS negative_count,
  COUNT(CASE WHEN numeric_column = 0 THEN 1 END) AS zero_count
FROM table_name;
```
Document: Value ranges, negative and zero counts, and any implausible outliers.
```sql
-- Examine date ranges and formats
SELECT
  MIN(date_column) AS earliest_date,
  MAX(date_column) AS latest_date,
  COUNT(DISTINCT date_column) AS unique_dates,
  COUNT(*) AS total_rows
FROM table_name;
```
Document: Date coverage, gaps, and any implausible values.
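A quick follow-up sanity check for implausible dates; the thresholds here are assumptions to adjust per domain (CURRENT_DATE works in PostgreSQL, MySQL, and SQLite; SQL Server needs CAST(GETDATE() AS DATE)):

```sql
-- Flag dates in the future or suspiciously far in the past
SELECT
  COUNT(CASE WHEN date_column > CURRENT_DATE THEN 1 END) AS future_dates,
  COUNT(CASE WHEN date_column < '1970-01-01' THEN 1 END) AS pre_1970_dates
FROM table_name;
```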
```sql
-- Get frequency distribution for a categorical column
SELECT
  category_column,
  COUNT(*) AS frequency,
  ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM table_name), 2) AS percentage
FROM table_name
GROUP BY category_column
ORDER BY frequency DESC;
```
Document: Category frequencies, dominant values, and rare or unexpected categories.
## Phase 3: Distribution Analysis

Goal: Understand how data is distributed across key dimensions.
For any table with dates/timestamps:
```sql
-- Group by time period to see the distribution
-- Extract year-month (database-specific function):
-- SQLite:     STRFTIME('%Y-%m', date_column)
-- PostgreSQL: TO_CHAR(date_column, 'YYYY-MM')
-- MySQL:      DATE_FORMAT(date_column, '%Y-%m')
-- SQL Server: FORMAT(date_column, 'yyyy-MM')

-- Example using SQLite syntax:
SELECT
  STRFTIME('%Y-%m', date_column) AS year_month,
  COUNT(*) AS row_count
FROM table_name
GROUP BY year_month
ORDER BY year_month;
```
Look for: Gaps in coverage, sudden spikes or drops, and seasonal patterns.
Document: Time coverage and patterns.
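Where window functions are available (SQLite 3.25+, PostgreSQL, MySQL 8+, SQL Server), this sketch puts each month next to the previous month present in the data, so gaps stand out as non-consecutive pairs:

```sql
-- List each month alongside the prior month in the data (SQLite date syntax shown)
WITH monthly AS (
  SELECT STRFTIME('%Y-%m', date_column) AS year_month, COUNT(*) AS row_count
  FROM table_name
  GROUP BY year_month
)
SELECT year_month, row_count,
       LAG(year_month) OVER (ORDER BY year_month) AS previous_month_in_data
FROM monthly
ORDER BY year_month;
```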
For key categorical dimensions:
```sql
-- Distribution by segment
SELECT
  segment_column,
  COUNT(*) AS count,
  COUNT(DISTINCT id_column) AS unique_entities,
  ROUND(AVG(numeric_measure), 2) AS avg_measure
FROM table_name
GROUP BY segment_column
ORDER BY count DESC;
```
Document: Segment sizes, unique entity counts, and how the measure varies by segment.
For continuous numeric measures:
```sql
-- Create distribution buckets
SELECT
  CASE
    WHEN value < 10 THEN '0-9'
    WHEN value < 50 THEN '10-49'
    WHEN value < 100 THEN '50-99'
    WHEN value < 500 THEN '100-499'
    ELSE '500+'
  END AS value_bucket,
  COUNT(*) AS frequency
FROM table_name
GROUP BY value_bucket  -- SQL Server: repeat the CASE expression here instead of the alias
ORDER BY MIN(value);   -- ordering by the label would sort '100-499' before '50-99'
```
Adjust buckets to your data range.
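If hand-picked bucket edges are awkward, equal-frequency buckets via the NTILE window function are an alternative (again SQLite 3.25+, PostgreSQL, MySQL 8+, SQL Server):

```sql
-- Quartile buckets: each bucket holds roughly 25% of rows
SELECT bucket,
       COUNT(*) AS frequency,
       MIN(value) AS bucket_min,
       MAX(value) AS bucket_max
FROM (
  SELECT value, NTILE(4) OVER (ORDER BY value) AS bucket
  FROM table_name
) AS ranked
GROUP BY bucket
ORDER BY bucket;
```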
Document: Distribution shape (skew, concentration, long tails).
For related numeric columns:
```sql
-- Check if two measures move together
SELECT
  segment,
  COUNT(*) AS n,
  ROUND(AVG(measure1), 2) AS avg_measure1,
  ROUND(AVG(measure2), 2) AS avg_measure2
FROM table_name
GROUP BY segment
ORDER BY avg_measure1;
```
Look for: Whether segments with high measure1 also have high measure2.
Document: Observed correlations or anti-correlations.
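PostgreSQL also exposes a built-in Pearson correlation aggregate that reduces this to a single number (most other engines need the formula spelled out by hand):

```sql
-- PostgreSQL: Pearson correlation between two numeric measures
SELECT CORR(measure1, measure2) AS pearson_r
FROM table_name;
```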
## Phase 4: Relationship Identification

Goal: Understand how tables relate to each other.
```sql
-- Find potential foreign keys by comparing cardinalities across tables
SELECT
  'table_a.fk_column references table_b.id_column' AS relationship,
  COUNT(*) AS rows_in_table_a,
  COUNT(DISTINCT a.fk_column) AS unique_fk_values,
  (SELECT COUNT(DISTINCT id_column) FROM table_b) AS unique_ids_in_table_b
FROM table_a a;
```
Test: Do all fk_column values exist in table_b? (Referential integrity)
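An anti-join answers that directly; a minimal sketch:

```sql
-- Count rows in table_a whose fk_column has no match in table_b
SELECT COUNT(*) AS orphaned_rows
FROM table_a a
LEFT JOIN table_b b ON a.fk_column = b.id_column
WHERE b.id_column IS NULL;
```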
Document: Which foreign-key relationships hold and any orphaned values found.
```sql
-- Understand join behavior before relying on it
SELECT
  'table_a to table_b' AS join_direction,
  COUNT(*) AS rows_after_join,  -- exceeds table_a's row count if the join fans out
  COUNT(b.id_column) AS matches_found,
  COUNT(*) - COUNT(b.id_column) AS rows_without_match
FROM table_a a
LEFT JOIN table_b b ON a.fk_column = b.id_column;
```
Document: Match rates, unmatched rows, and any join fan-out observed.
```sql
-- Check if aggregates match between tables
SELECT 'table_a' AS source, SUM(amount) AS total FROM table_a
UNION ALL
SELECT 'table_b' AS source, SUM(amount) AS total FROM table_b;
```
Run this for fact tables whose totals should reconcile.
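When grand totals disagree, breaking the comparison down by period narrows where the discrepancy lives. A sketch in SQLite date syntax (swap STRFTIME per the Phase 3 notes):

```sql
-- Compare monthly totals side by side to localize a discrepancy
WITH a AS (
  SELECT STRFTIME('%Y-%m', date_column) AS year_month, SUM(amount) AS total
  FROM table_a GROUP BY year_month
),
b AS (
  SELECT STRFTIME('%Y-%m', date_column) AS year_month, SUM(amount) AS total
  FROM table_b GROUP BY year_month
)
SELECT a.year_month, a.total AS table_a_total, b.total AS table_b_total
FROM a LEFT JOIN b ON a.year_month = b.year_month
ORDER BY a.year_month;
```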
Document: Whether totals reconcile and the size of any discrepancy.
```sql
-- Find parent-child relationships
SELECT
  parent_id,
  COUNT(*) AS child_count,
  COUNT(DISTINCT child_category) AS distinct_child_types
FROM table_name
GROUP BY parent_id
ORDER BY child_count DESC;
```
Document: Parent-child fan-out patterns and any parents with unusually many children.
After completing all 4 phases, create a summary document:
```markdown
## Data Profile Summary

### Tables
- `table_name` (X rows): Description and purpose

### Key Findings
- Data Quality: [Major issues or "clean"]
- Coverage: [Date ranges, completeness]
- Relationships: [How tables connect]

### Analysis Implications
- [What this data can/cannot answer]
- [Segments available for analysis]
- [Known limitations]

### Recommended Filters
- [Filters to apply for clean data]
- [Time periods to focus on]
```
DON'T:
DO:
Re-run portions of this skill when:
Process skills reference this component skill with:
> If unfamiliar with the data, use the `understanding-data` component skill to profile tables before proceeding.
This ensures analysts don't make assumptions about data structure or quality.
This skill provides database-agnostic guidance with examples in multiple SQL dialects.
Related skills for database-specific implementation:
- `using-sqlite` - SQLite CLI usage, syntax, and optimizations
- `using-postgresql` - PostgreSQL-specific features (if available)