Component skill for systematic SQL query development in DataPeeker analysis sessions
Develops systematic SQL queries through a 5-phase process: clarify requirements, design logic, write SQL, verify results, and document.
/plugin marketplace add tilmon-engineering/claude-skills
/plugin install datapeeker@tilmon-eng-skills

This skill inherits all available tools. When active, it can use any tool Claude has access to.
This component skill guides systematic query development for analytics. Use it when writing SQL queries during an analysis session (consult the understanding-data skill first if the schema is unfamiliar).

Create a TodoWrite checklist for the 5-phase query development process:
Phase 1: Clarify Analytical Question
Phase 2: Design Query Logic
Phase 3: Write SQL
Phase 4: Verify Results
Phase 5: Document Query
Mark each phase as you complete it. Save queries in numbered markdown files.
## Phase 1: Clarify Analytical Question

Goal: Translate a vague analytical question into specific, answerable query requirements.
Before writing any SQL, ensure you can answer:
- What exactly are we measuring?
- What is the unit of analysis?
- What time period?
- What filters apply?
- What groupings/segments?
- What comparison or context?
Before proceeding to Phase 2, write down:
```markdown
## Query Requirements
**Analytical Question:** [Original question in plain language]
**Specific Metric:** [Exact calculation]
- Formula: [e.g., SUM(amount) / COUNT(DISTINCT customer_id)]
- Name: [e.g., "Average Revenue Per Customer"]
**Unit of Analysis:** [e.g., "One row per product category"]
**Time Period:** [e.g., "January 1 - March 31, 2024"]
**Filters:**
- [Filter 1: e.g., "Exclude cancelled orders"]
- [Filter 2: e.g., "Only completed transactions"]
**Grouping:** [e.g., "Group by product_category"]
**Ordering:** [e.g., "Order by total_revenue DESC, show top 10"]
```
Don't proceed until requirements are crystal clear.
## Phase 2: Design Query Logic

Goal: Plan the query structure before writing SQL syntax.
List all tables needed and why:
```markdown
## Tables Needed
1. **orders** - Contains transaction dates and amounts
- Join key: order_id
- Columns: order_date, total_amount, customer_id
2. **customers** - Contains customer segment information
- Join key: customer_id
- Columns: customer_id, segment, region
3. **order_items** - Contains product details (if needed)
- Join key: order_id
- Columns: product_id, quantity, price
```
If multiple tables are needed:
```markdown
## Join Logic
Main table: orders (one row per transaction)
LEFT JOIN customers ON orders.customer_id = customers.customer_id
- Type: LEFT (to keep orders even if customer record missing)
- Cardinality: many-to-one (many orders per customer)
- Risk: None - customer_id should always match
INNER JOIN order_items ON orders.order_id = order_items.order_id
- Type: INNER (only want orders with items)
- Cardinality: one-to-many (multiple items per order)
- Risk: Row explosion - order totals will be duplicated
- Mitigation: Calculate item metrics separately or use DISTINCT
```
Check join cardinality before complex joins to avoid row explosion.
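One quick probe, sketched against the example order_items table above, is to count child rows per join key before writing the join:

```sql
-- If items_per_order is ever > 1, joining orders to order_items
-- will repeat each order row once per item (row explosion)
SELECT
    order_id,
    COUNT(*) as items_per_order
FROM order_items
GROUP BY order_id
ORDER BY items_per_order DESC
LIMIT 5;
```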
Break complex calculations into logical steps:
```markdown
## Calculation Steps
Step 1: Filter to desired date range and conditions
- WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31'
- AND status = 'completed'
Step 2: Calculate base metrics
- Total revenue: SUM(amount)
- Transaction count: COUNT(*)
- Unique customers: COUNT(DISTINCT customer_id)
Step 3: Derive calculated metrics
- Average revenue per customer: total_revenue / unique_customers
Step 4: Group and aggregate
- GROUP BY product_category
Step 5: Order results
- ORDER BY total_revenue DESC
- LIMIT 10
```
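Assembled, those steps might produce a query like the following sketch (using the hypothetical orders columns from the templates above):

```sql
-- Steps 1-5 combined: filter, aggregate, derive, group, order
SELECT
    product_category,
    SUM(amount) as total_revenue,
    COUNT(*) as transaction_count,
    COUNT(DISTINCT customer_id) as unique_customers,
    -- * 1.0 forces floating-point division
    ROUND(SUM(amount) * 1.0 / COUNT(DISTINCT customer_id), 2) as avg_revenue_per_customer
FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31'
  AND status = 'completed'
GROUP BY product_category
ORDER BY total_revenue DESC
LIMIT 10;
```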
Document potential issues:
```markdown
## Edge Cases to Handle
1. **NULL values:** How to handle NULL in date, amount, or category?
- Decision: Exclude rows with NULL in critical columns
2. **Zero division:** What if denominator is zero (e.g., no customers)?
- Decision: Use NULLIF or CASE to avoid division by zero (see the sketch after this list)
3. **Duplicate records:** Could there be duplicate transactions?
- Decision: Check with COUNT(*) vs COUNT(DISTINCT order_id)
4. **Date formatting:** Are dates stored as strings or native DATE type?
- Decision: Use database date functions to extract components, verify format is consistent
```
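For edge case 2, a minimal sketch of the division guard (column names assumed from the earlier templates):

```sql
-- NULLIF(x, 0) returns NULL when x = 0, so the division yields NULL
-- instead of raising a divide-by-zero error
SELECT
    product_category,
    SUM(amount) / NULLIF(COUNT(DISTINCT customer_id), 0) as avg_revenue_per_customer
FROM orders
GROUP BY product_category;
```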
## Phase 3: Write SQL

Goal: Translate the design into clean, commented SQL.
**DO:**

```sql
SELECT
column1,
column2,
SUM(column3) as total
FROM table_name
WHERE condition = 'value'
GROUP BY column1, column2
ORDER BY total DESC;
-- Calculate daily sales metrics for Q1 2024
SELECT
-- Extract date component (function varies by database)
-- SQLite: DATE(order_date)
-- PostgreSQL: order_date::date or DATE(order_date)
-- MySQL: DATE(order_date)
CAST(order_date AS DATE) as sale_date,
COUNT(*) as transaction_count, -- Total orders per day
SUM(amount) as total_revenue, -- Gross revenue before returns
ROUND(AVG(amount), 2) as avg_order_value
FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31'
AND status = 'completed' -- Exclude cancelled/pending
GROUP BY CAST(order_date AS DATE)
ORDER BY sale_date;
-- Good
SELECT product_category as category, SUM(amount) as total_revenue
-- Bad
SELECT product_category as pc, SUM(amount) as t
SELECT
customer_id,
COUNT(order_id) as order_count,
COALESCE(SUM(amount), 0) as total_spent -- Replace NULL with 0
FROM orders
WHERE customer_id IS NOT NULL -- Explicit NULL handling
GROUP BY customer_id;
-- Break complex logic into readable steps
WITH daily_sales AS (
SELECT
DATE(order_date) as sale_date,
SUM(amount) as revenue
FROM orders
WHERE status = 'completed'
GROUP BY DATE(order_date)
),
daily_stats AS (
SELECT
AVG(revenue) as avg_daily_revenue,
-- SQLite doesn't have STDEV() - calculate manually using variance formula
SQRT(AVG(revenue * revenue) - AVG(revenue) * AVG(revenue)) as stddev_revenue
FROM daily_sales
)
SELECT
ds.sale_date,
ds.revenue,
ROUND((ds.revenue - dst.avg_daily_revenue) / dst.stddev_revenue, 2) as z_score
FROM daily_sales ds
CROSS JOIN daily_stats dst
ORDER BY ds.sale_date;
```

**DON'T:**

```sql
-- Bad - unclear what columns you're using
SELECT * FROM orders WHERE amount > 100;
-- Good - explicit about needed columns
SELECT order_id, order_date, amount FROM orders WHERE amount > 100;
-- Risky - might miss 'PENDING' or 'Pending'
WHERE status = 'pending'
-- Better - normalize case
WHERE LOWER(status) = 'pending'
-- Bad - which table's date column?
SELECT date, amount FROM orders JOIN shipments USING (order_id);
-- Good - explicit table prefixes
SELECT o.order_date, o.amount, s.ship_date
FROM orders o
JOIN shipments s ON o.order_id = s.order_id;
-- Wrong - product_category not in GROUP BY
SELECT product_category, SUM(amount) FROM orders GROUP BY customer_id;
-- Correct
SELECT product_category, SUM(amount) FROM orders GROUP BY product_category;
```

## Phase 4: Verify Results

Goal: Ensure the query returns correct, sensible results.
Before trusting query results, verify:
1. **Row count makes sense:**

```sql
-- Check: Does row count match expectations?
SELECT COUNT(*) as row_count FROM (
-- Your query here
);
```

Expected: If grouping by day-of-week, should have 7 rows. If grouping by month, should have 12 rows (or fewer if partial year).
2. **Aggregates are reasonable:**

```sql
-- Check: Are totals in expected range?
SELECT
SUM(amount) as total_revenue,
COUNT(*) as transaction_count,
MIN(amount) as min_amount,
MAX(amount) as max_amount,
AVG(amount) as avg_amount
FROM orders;
```

Ask: Do these values make business sense? Any impossibly high/low values?
3. **No unexpected NULLs:**

```sql
-- Check: Are there NULL values in results?
SELECT
COUNT(*) as total_rows,
COUNT(column1) as column1_non_null,
COUNT(column2) as column2_non_null
FROM (
-- Your query here
);
```

If critical columns have NULLs, investigate why.
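One way to investigate, assuming the hypothetical orders table from earlier:

```sql
-- Sample the offending rows to see what else is missing
SELECT *
FROM orders
WHERE amount IS NULL
LIMIT 10;
```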
4. **Percentages sum to 100% (if applicable):**

```sql
-- Check: Do percentage columns sum to 100?
SELECT SUM(pct_of_total) as total_pct FROM (
SELECT
category,
100.0 * COUNT(*) / SUM(COUNT(*)) OVER () as pct_of_total
FROM orders
GROUP BY category
);
```

Should equal 100.0 (or close, accounting for rounding).
5. **Spot-check specific values:**

Pick a specific row from the results and verify it manually:

```sql
-- Example: Verify March 15 sales total
SELECT SUM(amount) FROM orders WHERE CAST(order_date AS DATE) = '2024-03-15';
```

Compare to what your main query shows for March 15.
6. **Compare to known totals:**

```sql
-- Check: Does grouped data sum to overall total?
-- Total from grouped query:
SELECT SUM(category_total) FROM (
SELECT category, SUM(amount) as category_total
FROM orders
GROUP BY category
);
-- Should equal overall total:
SELECT SUM(amount) FROM orders;
```

Common issues when results look wrong:

**Row explosion from joins:**

```sql
-- Symptom: Totals are too high after joining
-- Cause: One-to-many join duplicates rows
-- Fix: Use DISTINCT or calculate at appropriate grain
-- Wrong - duplicates order amounts
SELECT o.order_id, SUM(o.amount)
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY o.order_id; -- order amount counted once per item!
-- Right - calculate order and item metrics separately
SELECT
o.order_id,
o.amount as order_total,
COUNT(oi.item_id) as item_count
FROM orders o
LEFT JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY o.order_id, o.amount;
```

**Incorrect grouping:**

```sql
-- Symptom: Unexpected row count or aggregates
-- Cause: Missing or extra columns in GROUP BY
-- Fix: Include ALL non-aggregated columns in GROUP BY
-- Wrong - product missing from GROUP BY
SELECT category, product, SUM(sales)
FROM products
GROUP BY category; -- Error or wrong results
-- Right
SELECT category, product, SUM(sales)
FROM products
GROUP BY category, product;
```

**Date/time parsing errors:**

```sql
-- Symptom: Date groupings don't match expectations
-- Cause: Date format inconsistency or wrong extraction
-- Fix: Verify date format and use appropriate database function
-- Check date format first
SELECT DISTINCT date_column FROM table LIMIT 10;
-- Extract date components using database-specific functions
-- Postgres: TO_CHAR, DATE_TRUNC, EXTRACT
-- MySQL: DATE_FORMAT, YEAR, MONTH
-- SQLite: STRFTIME, DATE
-- See database documentation for specific syntax
-- Generic approach: cast to DATE type (Postgres/MySQL)
-- Caution: in SQLite, CAST(x AS DATE) applies numeric affinity and
-- truncates '2024-03-15' to 2024 - use DATE(x) instead
SELECT CAST(date_column AS DATE) FROM table;
```

## Phase 5: Document Query

Goal: Create reproducible documentation for the query and its results.
Save each query in a numbered markdown file:
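For example (hypothetical file names):

```
queries/01-q1-category-revenue.md
queries/02-top-customers.md
queries/03-monthly-trend.md
```

Each file follows this template: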
# [Query Purpose]
## Analytical Question
[What question does this query answer?]
Example: "What are the top 10 product categories by revenue in Q1 2024?"
## Rationale
[Why is this query needed for the analysis?]
Example: "Identifies high-value categories to prioritize for Q2 marketing campaigns."
## Query
```sql
-- Clear, commented SQL
SELECT
product_category,
COUNT(DISTINCT order_id) as order_count,
SUM(amount) as total_revenue,
ROUND(AVG(amount), 2) as avg_order_value
FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31'
AND status = 'completed'
GROUP BY product_category
ORDER BY total_revenue DESC
LIMIT 10;
```

## Results

[Paste actual query results - raw output]
```
product_category | order_count | total_revenue | avg_order_value
Electronics      | 1523        | 345678.90     | 226.89
Home & Garden    | 2341        | 298432.10     | 127.48
Clothing         | 3456        | 234567.80     | 67.89
...
```
## Interpretation

[What do the results show? Key takeaways.]
## Caveats

[Any issues or caveats about the data]
### Documentation Best Practices
**DO:**
- Save queries in version control (git)
- Include exact SQL that was run
- Paste actual results, not summaries
- Note any data quality issues discovered
- Date your analysis files
**DON'T:**
- Paraphrase results (show actual numbers)
- Round or simplify results prematurely
- Skip documenting failed queries (document what didn't work and why)
- Forget to note filters and exclusions
---
## Common Query Patterns
### Pattern 1: Time Series Analysis
**Use case:** Analyze metrics over time (daily, weekly, monthly trends)
```sql
-- Daily trend
SELECT
CAST(order_date AS DATE) as date,
COUNT(*) as order_count,
SUM(amount) as revenue,
ROUND(AVG(amount), 2) as avg_order_value
FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31'
GROUP BY CAST(order_date AS DATE)
ORDER BY date;
-- Weekly trend (implementation varies by database)
-- Postgres: DATE_TRUNC('week', order_date)
-- MySQL: DATE_SUB(order_date, INTERVAL WEEKDAY(order_date) DAY)
-- SQLite: DATE(order_date, 'weekday 0', '-6 days')
SELECT
-- Use database-specific function to get week start date
[week_start_function] as week_start,
COUNT(*) as order_count,
SUM(amount) as revenue
FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31'
GROUP BY week_start
ORDER BY week_start;
-- Monthly trend (implementation varies by database)
-- Postgres: TO_CHAR(order_date, 'YYYY-MM')
-- MySQL: DATE_FORMAT(order_date, '%Y-%m')
-- SQLite: STRFTIME('%Y-%m', order_date)
SELECT
-- Use database-specific function to extract year-month
[year_month_function] as year_month,
COUNT(*) as order_count,
SUM(amount) as revenue
FROM orders
GROUP BY year_month
ORDER BY year_month;
```
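As a concrete instance, assuming SQLite, the monthly trend could read:

```sql
-- SQLite flavor: STRFTIME('%Y-%m', ...) yields '2024-01', '2024-02', ...
SELECT
    STRFTIME('%Y-%m', order_date) as year_month,
    COUNT(*) as order_count,
    SUM(amount) as revenue
FROM orders
GROUP BY year_month
ORDER BY year_month;
```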
### Pattern 2: Segmentation Analysis

**Use case:** Compare metrics across categories or segments

```sql
-- Basic segmentation
SELECT
segment,
COUNT(*) as count,
SUM(amount) as total,
ROUND(AVG(amount), 2) as average,
ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) as pct_of_total
FROM orders
GROUP BY segment
ORDER BY total DESC;
-- Multi-level segmentation
SELECT
region,
customer_type,
COUNT(*) as order_count,
SUM(amount) as revenue
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY region, customer_type
ORDER BY region, revenue DESC;
```

### Pattern 3: Ranking and Top-N

**Use case:** Identify top/bottom performers

```sql
-- Top 10 by revenue
SELECT
product_name,
SUM(amount) as total_revenue
FROM order_items oi
JOIN products p ON oi.product_id = p.product_id
GROUP BY product_name
ORDER BY total_revenue DESC
LIMIT 10;
-- Top 20% of customers (by revenue)
-- NOTE: window functions such as PERCENT_RANK() require SQLite 3.25+
WITH customer_revenue AS (
SELECT
customer_id,
SUM(amount) as total_revenue
FROM orders
GROUP BY customer_id
),
ranked_customers AS (
SELECT
customer_id,
total_revenue,
PERCENT_RANK() OVER (ORDER BY total_revenue DESC) as percentile_rank
FROM customer_revenue
)
-- Must use CTE to filter on window function results
SELECT
customer_id,
total_revenue,
percentile_rank
FROM ranked_customers
WHERE percentile_rank <= 0.20
ORDER BY total_revenue DESC;
-- Top 3 products per category
WITH ranked_products AS (
SELECT
category,
product_name,
SUM(amount) as revenue,
ROW_NUMBER() OVER (PARTITION BY category ORDER BY SUM(amount) DESC) as rank
FROM products
GROUP BY category, product_name
)
SELECT category, product_name, revenue
FROM ranked_products
WHERE rank <= 3
ORDER BY category, rank;
```

### Pattern 4: Period-over-Period Comparison

**Use case:** Compare current period to previous period

```sql
-- Month-over-month comparison
WITH monthly_sales AS (
SELECT
-- Extract year-month (use database-specific function)
[year_month_function] as year_month,
SUM(amount) as revenue
FROM orders
GROUP BY year_month
)
SELECT
year_month,
revenue as current_month_revenue,
LAG(revenue) OVER (ORDER BY year_month) as previous_month_revenue,
revenue - LAG(revenue) OVER (ORDER BY year_month) as revenue_change,
ROUND(100.0 * (revenue - LAG(revenue) OVER (ORDER BY year_month)) /
LAG(revenue) OVER (ORDER BY year_month), 2) as pct_change
FROM monthly_sales
ORDER BY year_month;
-- Year-over-year comparison
-- Extract month and year components (use database-specific functions)
SELECT
[month_function] as month,
SUM(CASE WHEN [year_function] = '2023' THEN amount ELSE 0 END) as revenue_2023,
SUM(CASE WHEN [year_function] = '2024' THEN amount ELSE 0 END) as revenue_2024,
SUM(CASE WHEN [year_function] = '2024' THEN amount ELSE 0 END) -
SUM(CASE WHEN [year_function] = '2023' THEN amount ELSE 0 END) as yoy_change
FROM orders
WHERE [year_function] IN ('2023', '2024')
GROUP BY month
ORDER BY month;
```
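For instance, with SQLite the year and month extractions become STRFTIME calls:

```sql
-- Year-over-year by month, SQLite flavor
SELECT
    STRFTIME('%m', order_date) as month,
    SUM(CASE WHEN STRFTIME('%Y', order_date) = '2023' THEN amount ELSE 0 END) as revenue_2023,
    SUM(CASE WHEN STRFTIME('%Y', order_date) = '2024' THEN amount ELSE 0 END) as revenue_2024
FROM orders
WHERE STRFTIME('%Y', order_date) IN ('2023', '2024')
GROUP BY month
ORDER BY month;
```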
### Pattern 5: Cohort Analysis

**Use case:** Analyze behavior of groups defined by time of first action

```sql
-- Customer cohort by first purchase month
WITH first_purchase AS (
SELECT
customer_id,
-- Extract year-month from first purchase (use database-specific function)
[year_month_function(MIN(order_date))] as cohort_month
FROM orders
GROUP BY customer_id
)
SELECT
fp.cohort_month,
COUNT(DISTINCT o.customer_id) as customers,
COUNT(o.order_id) as total_orders,
SUM(o.amount) as total_revenue,
ROUND(1.0 * COUNT(o.order_id) / COUNT(DISTINCT o.customer_id), 2) as orders_per_customer
FROM first_purchase fp
JOIN orders o ON fp.customer_id = o.customer_id
GROUP BY fp.cohort_month
ORDER BY fp.cohort_month;
```
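In SQLite, the cohort-month extraction in the first_purchase CTE might look like:

```sql
-- SQLite flavor of the first_purchase CTE
WITH first_purchase AS (
    SELECT
        customer_id,
        STRFTIME('%Y-%m', MIN(order_date)) as cohort_month
    FROM orders
    GROUP BY customer_id
)
SELECT * FROM first_purchase LIMIT 5;
```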
### Pattern 6: Distribution Analysis

**Use case:** Understand distribution of values (percentiles, buckets)

```sql
-- Value distribution by buckets
SELECT
CASE
WHEN amount < 25 THEN '$0-24'
WHEN amount < 50 THEN '$25-49'
WHEN amount < 100 THEN '$50-99'
WHEN amount < 250 THEN '$100-249'
ELSE '$250+'
END as amount_bucket,
COUNT(*) as order_count,
ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) as pct_of_orders,
SUM(amount) as total_revenue,
ROUND(100.0 * SUM(amount) / SUM(SUM(amount)) OVER (), 2) as pct_of_revenue
FROM orders
GROUP BY amount_bucket
ORDER BY MIN(amount);
-- Percentile calculation (manual approach for SQLite)
WITH ranked_orders AS (
SELECT
amount,
ROW_NUMBER() OVER (ORDER BY amount) as row_num,
COUNT(*) OVER () as total_count
FROM orders
)
SELECT
'P25' as percentile,
amount
FROM ranked_orders
WHERE row_num = CAST(total_count * 0.25 AS INTEGER)
UNION ALL
SELECT 'P50', amount
FROM ranked_orders
WHERE row_num = CAST(total_count * 0.50 AS INTEGER)
UNION ALL
SELECT 'P75', amount
FROM ranked_orders
WHERE row_num = CAST(total_count * 0.75 AS INTEGER);
```

This skill provides database-agnostic query guidance. For database-specific syntax:
- Use the using-sqlite skill for SQLite-specific date functions (STRFTIME), CLI invocation patterns, and SQLite idioms.

Core SQL patterns (SELECT, JOIN, GROUP BY, CTEs, window functions) are consistent across databases; function syntax (date extraction, string manipulation) varies by database.
## Common Anti-Patterns

**Problem:** Aggregating too early loses detail needed for further analysis.

```sql
-- Bad - can't drill down further
SELECT
category,
SUM(amount) as total
FROM orders
GROUP BY category;
-- Better - keep grain, aggregate in visualization/reporting layer
-- Or use CTE to preserve both detail and summary
WITH category_totals AS (
SELECT category, SUM(amount) as total
FROM orders
GROUP BY category
)
SELECT
o.*,
ct.total as category_total,
ROUND(100.0 * o.amount / ct.total, 2) as pct_of_category
FROM orders o
JOIN category_totals ct ON o.category = ct.category;
```

**Problem:** NULLs behave unexpectedly in comparisons and aggregates.

```sql
-- Bad - NULL != NULL, so this misses NULL values
SELECT * FROM orders WHERE status != 'completed';
-- Good - explicitly handle NULLs
SELECT * FROM orders
WHERE status != 'completed' OR status IS NULL;
-- Bad - COUNT(*) includes NULLs, COUNT(column) doesn't
SELECT COUNT(column) FROM table; -- May not match row count!
-- Good - be explicit
SELECT
COUNT(*) as total_rows,
COUNT(column) as non_null_count,
COUNT(*) - COUNT(column) as null_count
FROM table;
```

**Problem:** SQLite's dynamic typing can cause unexpected behavior.

```sql
-- Bad - comparing number to string
SELECT * FROM orders WHERE amount > '100'; -- Works but risky
-- Good - explicit types
SELECT * FROM orders WHERE CAST(amount AS REAL) > 100.0;
-- Bad - date comparison as strings might fail
WHERE date_column > '2024-1-1' -- Might not work as expected
-- Good - proper date format
WHERE date_column > '2024-01-01' -- ISO format: YYYY-MM-DD
```

**Problem:** Nested subqueries are hard to read and debug.

```sql
-- Bad - deeply nested subqueries
SELECT * FROM (
SELECT * FROM (
SELECT * FROM orders WHERE amount > 100
) WHERE category = 'Electronics'
) WHERE order_date > '2024-01-01';
-- Good - use CTEs for readability
WITH high_value_orders AS (
SELECT * FROM orders WHERE amount > 100
),
electronics_orders AS (
SELECT * FROM high_value_orders WHERE category = 'Electronics'
)
SELECT * FROM electronics_orders
WHERE order_date > '2024-01-01';
```

**Problem:** Forgetting join conditions creates row explosion.

```sql
-- Bad - missing join condition
SELECT o.*, c.*
FROM orders o, customers c; -- Every order paired with every customer!
-- Good - explicit join
SELECT o.*, c.*
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
```

Process skills reference this component skill with:
Use the `writing-queries` component skill to develop SQL queries systematically, ensuring correctness and documentation.
When writing queries during analysis, work through the five phases in order. This systematic approach prevents common SQL errors and ensures reproducible analysis.