Master SQL querying, database design, and data retrieval, and become proficient with relational and modern databases
Write efficient SQL queries, design database schemas, and optimize performance for large datasets. Use this when you need to retrieve data from relational databases, join multiple tables, or analyze millions of records beyond spreadsheet limits.
The SQL & Databases Expert role equips you with the technical skills to retrieve, manipulate, and analyze data directly from databases. Unlike spreadsheets, databases handle millions of records efficiently, enable real-time data access, and support multiple concurrent users. This role bridges the gap between raw data and analysis, teaching you how data is actually stored, organized, and accessed in enterprise environments.
Why This Matters: The vast majority of enterprise data lives in databases. SQL proficiency is among the most in-demand skills in data analytics and opens doors to advanced roles including analytics engineering, data engineering, and business intelligence.
This learning journey transforms you from a spreadsheet analyst into a database-fluent professional who can query, model, and optimize data at enterprise scale.
Timeline: 12-16 weeks of focused learning | Skill Level: Intermediate Developer
-- SQL Statement Structure
SELECT column1, column2 FROM table_name WHERE condition ORDER BY column1;
-- Basic Data Types
INT / BIGINT -- Integer numbers
DECIMAL(10,2) -- Numbers with decimals
VARCHAR(255) -- Text, variable length
CHAR(10) -- Text, fixed length
DATE -- Date only (YYYY-MM-DD)
DATETIME/TIMESTAMP -- Date and time
BOOLEAN -- TRUE/FALSE
-- Database Creation Example
CREATE DATABASE company_analytics;
USE company_analytics;
CREATE TABLE employees (
employee_id INT PRIMARY KEY AUTO_INCREMENT,
first_name VARCHAR(50) NOT NULL,
last_name VARCHAR(50) NOT NULL,
department VARCHAR(50),
salary DECIMAL(10,2),
hire_date DATE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Basic SELECT
SELECT employee_id, first_name, last_name FROM employees;
-- Select all columns
SELECT * FROM employees;
-- WHERE conditions
SELECT * FROM employees WHERE department = 'Sales';
SELECT * FROM employees WHERE salary > 50000;
SELECT * FROM employees WHERE hire_date >= '2023-01-01';
-- Multiple conditions
SELECT * FROM employees
WHERE department = 'Sales' AND salary > 50000;
SELECT * FROM employees
WHERE department = 'Sales' OR department = 'Marketing';
SELECT * FROM employees
WHERE salary BETWEEN 50000 AND 100000;
SELECT * FROM employees
WHERE first_name LIKE 'J%'; -- Names starting with J
SELECT * FROM employees
WHERE department IS NULL; -- Find missing values
SELECT * FROM employees
WHERE employee_id IN (1, 5, 10);
-- ORDER BY
SELECT * FROM employees ORDER BY salary DESC; -- Highest to lowest
SELECT * FROM employees ORDER BY last_name ASC, first_name ASC;
-- LIMIT results
SELECT * FROM employees LIMIT 10; -- First 10 records
SELECT * FROM employees LIMIT 10 OFFSET 20; -- Skip 20, get next 10 (pagination)
-- Combining all
SELECT employee_id, first_name, salary
FROM employees
WHERE department = 'Sales'
ORDER BY salary DESC
LIMIT 5; -- Top 5 sales people by salary
-- Aggregate Functions
SELECT COUNT(*) FROM employees; -- Number of records
SELECT COUNT(DISTINCT department) FROM employees; -- Unique values
SELECT SUM(salary) FROM employees;
SELECT AVG(salary) FROM employees;
SELECT MIN(salary) FROM employees;
SELECT MAX(salary) FROM employees;
-- GROUP BY
SELECT department, COUNT(*) as employee_count
FROM employees
GROUP BY department;
SELECT department, AVG(salary) as avg_salary
FROM employees
GROUP BY department
ORDER BY avg_salary DESC;
-- HAVING (filter groups after aggregation)
SELECT department, COUNT(*) as count
FROM employees
GROUP BY department
HAVING COUNT(*) > 5; -- Departments with more than 5 employees
-- String Functions
SELECT UPPER(first_name) FROM employees;
SELECT LOWER(last_name) FROM employees;
SELECT CONCAT(first_name, ' ', last_name) as full_name FROM employees;
SELECT SUBSTRING(first_name, 1, 3) FROM employees; -- First 3 characters
SELECT LENGTH(first_name) FROM employees;
SELECT TRIM(department) FROM employees; -- Remove spaces
-- Date Functions
SELECT CURDATE(); -- Today's date
SELECT NOW(); -- Current date and time
SELECT DATE_FORMAT(hire_date, '%Y-%m-%d') FROM employees;
SELECT YEAR(hire_date) FROM employees;
SELECT MONTH(hire_date) FROM employees;
SELECT DAY(hire_date) FROM employees;
SELECT DATEDIFF(CURDATE(), hire_date) as days_employed FROM employees;
-- Approximate years employed (YEAR subtraction ignores month and day)
SELECT
first_name,
hire_date,
YEAR(CURDATE()) - YEAR(hire_date) as years_employed
FROM employees;
Database Schema Example:
EMPLOYEES Table:
├── employee_id (Primary Key)
├── first_name
├── last_name
├── department_id (Foreign Key -> DEPARTMENTS)
└── salary
DEPARTMENTS Table:
├── department_id (Primary Key)
├── department_name
└── budget
PROJECTS Table:
├── project_id (Primary Key)
├── project_name
├── department_id (Foreign Key -> DEPARTMENTS)
└── budget
ASSIGNMENTS Table:
├── assignment_id (Primary Key)
├── employee_id (Foreign Key -> EMPLOYEES)
├── project_id (Foreign Key -> PROJECTS)
└── hours_allocated
-- INNER JOIN (most common)
-- Returns only matching records from both tables
SELECT
e.first_name,
e.last_name,
d.department_name,
e.salary
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id;
-- LEFT JOIN (keep all from left table)
-- Keep all employees even if no department assigned
SELECT
e.first_name,
d.department_name
FROM employees e
LEFT JOIN departments d ON e.department_id = d.department_id;
-- RIGHT JOIN (keep all from right table)
SELECT
e.first_name,
d.department_name
FROM employees e
RIGHT JOIN departments d ON e.department_id = d.department_id;
-- FULL OUTER JOIN (all records from both; PostgreSQL/SQL Server.
-- MySQL lacks it: emulate with a UNION of LEFT and RIGHT JOINs)
SELECT
e.first_name,
d.department_name
FROM employees e
FULL OUTER JOIN departments d ON e.department_id = d.department_id;
-- CROSS JOIN (Cartesian product)
SELECT COUNT(*) FROM employees CROSS JOIN departments; -- Every employee x every department
-- Multiple JOINs
SELECT
e.first_name,
d.department_name,
p.project_name,
a.hours_allocated
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id
INNER JOIN assignments a ON e.employee_id = a.employee_id
INNER JOIN projects p ON a.project_id = p.project_id;
-- Subquery in WHERE clause
SELECT * FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
-- Find all employees earning more than average
-- Subquery in SELECT clause
SELECT
department_id,
(SELECT COUNT(*) FROM employees e2 WHERE e2.department_id = e1.department_id) as emp_count
FROM employees e1;
-- Common Table Expression (CTE) - more readable
WITH avg_salary AS (
SELECT AVG(salary) as avg_sal FROM employees
)
SELECT * FROM employees, avg_salary
WHERE salary > avg_salary.avg_sal;
-- Multiple CTEs
WITH department_stats AS (
SELECT
department_id,
COUNT(*) as emp_count,
AVG(salary) as avg_salary
FROM employees
GROUP BY department_id
),
high_salary_depts AS (
SELECT * FROM department_stats
WHERE avg_salary > 75000
)
SELECT * FROM high_salary_depts;
-- GROUP BY with multiple columns
SELECT
department_id,
YEAR(hire_date) as hire_year,
COUNT(*) as count,
AVG(salary) as avg_salary
FROM employees
GROUP BY department_id, YEAR(hire_date)
ORDER BY department_id, hire_year;
-- HAVING clause for group filtering
SELECT
department_id,
COUNT(*) as emp_count,
AVG(salary) as avg_salary
FROM employees
GROUP BY department_id
HAVING COUNT(*) > 5 AND AVG(salary) > 60000
ORDER BY emp_count DESC;
-- Window Functions (available in MySQL 8.0+, PostgreSQL, SQL Server)
SELECT
first_name,
salary,
department_id,
ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) as rank_in_dept,
RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) as salary_rank, -- RANK() leaves gaps after ties; DENSE_RANK() does not
AVG(salary) OVER (PARTITION BY department_id) as dept_avg_salary
FROM employees;
-- Running totals
SELECT
employee_id,
salary,
SUM(salary) OVER (ORDER BY employee_id) as running_total
FROM employees;
-- UNION (removes duplicates)
SELECT first_name, 'Employee' as type FROM employees
UNION
SELECT customer_name, 'Customer' as type FROM customers;
-- UNION ALL (keeps duplicates)
SELECT salary FROM employees
UNION ALL
SELECT contract_rate FROM contractors;
-- INTERSECT (common values; PostgreSQL, SQL Server, MySQL 8.0.31+)
SELECT employee_id FROM employees
INTERSECT
SELECT employee_id FROM project_assignments;
-- EXCEPT (in first but not second; same dialect support as INTERSECT)
SELECT employee_id FROM employees
EXCEPT
SELECT employee_id FROM project_assignments;
-- View execution plan (varies by database)
EXPLAIN SELECT * FROM employees WHERE salary > 100000;
-- More detailed output (MySQL 8.0+; EXPLAIN EXTENDED is deprecated)
EXPLAIN FORMAT=JSON SELECT * FROM employees WHERE salary > 100000;
-- PostgreSQL EXPLAIN ANALYZE
EXPLAIN ANALYZE SELECT * FROM employees WHERE salary > 100000;
-- Key metrics to understand:
-- - Query type (SIMPLE, PRIMARY, SUBQUERY, UNION)
-- - Table access method (seq scan vs index scan)
-- - Rows examined vs rows returned
-- - Filter cost
-- Create index
CREATE INDEX idx_department ON employees(department_id);
CREATE INDEX idx_salary ON employees(salary);
-- Composite index (for multiple conditions)
CREATE INDEX idx_dept_salary ON employees(department_id, salary);
-- When to use indexes:
✓ Columns used in WHERE clause frequently
✓ Columns used in JOIN conditions
✓ Columns used in ORDER BY
✓ Columns used for filtering in large tables
✗ Columns with low selectivity (few unique values)
✗ Small tables (full table scan faster)
✗ Columns with many NULL values
✗ Columns modified frequently
-- View indexes
SHOW INDEX FROM employees; -- MySQL (SHOW INDEXES is a synonym)
-- Drop index
DROP INDEX idx_salary ON employees;
-- BAD: SELECT * (retrieves unnecessary columns)
SELECT * FROM employees WHERE department_id = 1;
-- GOOD: Select only needed columns
SELECT employee_id, first_name, salary FROM employees WHERE department_id = 1;
-- BAD: Multiple subqueries
SELECT * FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees)
AND department_id IN (SELECT department_id FROM departments WHERE budget > 1000000);
-- GOOD: Use CTE instead
WITH dept_budgets AS (
SELECT department_id FROM departments WHERE budget > 1000000
),
avg_salary AS (
SELECT AVG(salary) as avg_sal FROM employees
)
SELECT e.* FROM employees e, avg_salary
WHERE e.salary > avg_salary.avg_sal
AND e.department_id IN (SELECT department_id FROM dept_budgets);
-- BAD: Function in WHERE clause (prevents index use)
SELECT * FROM employees WHERE YEAR(hire_date) = 2023;
-- GOOD: Date range instead
SELECT * FROM employees
WHERE hire_date >= '2023-01-01' AND hire_date < '2024-01-01';
-- BAD: LIKE with leading wildcard (slow)
SELECT * FROM employees WHERE first_name LIKE '%john%';
-- BETTER: Use FULLTEXT search or limit wildcard position
SELECT * FROM employees WHERE first_name LIKE 'john%';
-- Use appropriate data types
✓ INT for IDs (not VARCHAR)
✓ DECIMAL(10,2) for money (not FLOAT)
✓ DATE for dates (not VARCHAR)
✓ SMALLINT for small ranges
✓ VARCHAR(n) with appropriate length
-- Avoid data type conversions
-- BAD: Implicit conversion wastes CPU
SELECT * FROM employees WHERE employee_id = '123'; -- String to INT
-- GOOD: Explicit matching types
SELECT * FROM employees WHERE employee_id = 123;
-- Denormalization for analytics (only after other optimizations are exhausted)
-- Sometimes calculated fields speed up analysis
ALTER TABLE employees ADD COLUMN dept_name VARCHAR(50);
UPDATE employees e
SET dept_name = (SELECT department_name FROM departments d WHERE d.department_id = e.department_id);
Normalization Levels:
0NF (Unnormalized): Repeating groups in table
EMPLOYEES table with multiple phone numbers in same row
1NF (First Normal Form): Remove repeating groups
✓ Atomic values only
✓ No repeating columns
✓ Unique row identifier (Primary Key)
EMPLOYEES Table:
├── employee_id (PK)
├── first_name
└── last_name
PHONE_NUMBERS Table:
├── phone_id (PK)
├── employee_id (FK)
└── phone_number
2NF (Second Normal Form): Remove partial dependencies
✓ 1NF requirements
✓ Non-key attributes depend on entire primary key
✓ Separate tables for composite key dependencies
3NF (Third Normal Form): Remove transitive dependencies
✓ 2NF requirements
✓ Non-key attributes depend only on primary key
✓ No dependencies between non-key attributes
BCNF (Boyce-Codd): Every determinant is a candidate key
(Most data should be 3NF; BCNF for complex scenarios)
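As a concrete sketch, here is a transitive dependency being removed to reach 3NF (the table and column names are illustrative):

```sql
-- Before 3NF: department_name depends on department_id, not on the
-- key employee_id (a transitive dependency)
-- employees(employee_id, first_name, department_id, department_name)

-- After 3NF: department attributes move to their own table
CREATE TABLE departments (
    department_id INT PRIMARY KEY,
    department_name VARCHAR(50) NOT NULL
);

CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    first_name VARCHAR(50) NOT NULL,
    department_id INT,
    FOREIGN KEY (department_id) REFERENCES departments(department_id)
);
```

Each department name is now stored exactly once, so renaming a department is a single-row update instead of a scan across every employee.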
Analytical Database Design:
FACTS DIMENSIONS
sales_fact_table date_dimension
├── fact_id (PK) ←→ ├── date_id
├── date_id (FK) ├── date
├── product_id (FK) ├── year
├── customer_id (FK) ├── quarter
├── amount ├── month
└── quantity └── day_name
product_dimension
├── product_id (PK)
├── product_name
├── category
└── price
customer_dimension
├── customer_id (PK)
├── customer_name
├── segment
└── region
Benefits:
✓ Optimized for analytical queries
✓ Easy to understand and maintain
✓ Fast aggregations
✓ Handles slowly changing dimensions
-- Date Dimension (common in all data warehouses)
CREATE TABLE date_dimension (
date_id INT PRIMARY KEY,
date DATE UNIQUE,
year INT,
quarter INT,
month INT,
day INT,
day_name VARCHAR(20),
month_name VARCHAR(20),
is_weekend BOOLEAN,
fiscal_year INT
);
-- Product Dimension
CREATE TABLE product_dimension (
product_id INT PRIMARY KEY,
product_code VARCHAR(50) UNIQUE,
product_name VARCHAR(255),
category VARCHAR(100),
subcategory VARCHAR(100),
price DECIMAL(10,2),
cost DECIMAL(10,2),
supplier_id INT,
is_active BOOLEAN,
created_date DATE,
updated_date DATE
);
-- Sales Fact Table
CREATE TABLE sales_fact (
fact_id BIGINT PRIMARY KEY AUTO_INCREMENT,
date_id INT,
product_id INT,
customer_id INT,
store_id INT,
quantity INT,
unit_price DECIMAL(10,2),
amount DECIMAL(12,2),
discount_amount DECIMAL(10,2),
net_amount DECIMAL(12,2),
created_timestamp TIMESTAMP,
FOREIGN KEY (date_id) REFERENCES date_dimension(date_id),
FOREIGN KEY (product_id) REFERENCES product_dimension(product_id),
FOREIGN KEY (customer_id) REFERENCES customer_dimension(customer_id),
FOREIGN KEY (store_id) REFERENCES store_dimension(store_id)
);
-- Create indexes on fact table for analytical queries
CREATE INDEX idx_fact_date ON sales_fact(date_id);
CREATE INDEX idx_fact_product ON sales_fact(product_id);
CREATE INDEX idx_fact_customer ON sales_fact(customer_id);
CREATE INDEX idx_fact_store ON sales_fact(store_id);
-- Type 1: Overwrite (lose history)
UPDATE product_dimension
SET category = 'Electronics'
WHERE product_id = 100;
-- Type 2: Add new row (keep history)
-- Assumes the dimension has a surrogate key plus effective_date/end_date
-- columns; values below are illustrative
UPDATE product_dimension
SET end_date = '2024-01-01'
WHERE product_id = 100 AND end_date = '9999-12-31';
INSERT INTO product_dimension
    (product_id, product_name, category, effective_date, end_date)
VALUES
    (100, 'Widget Pro', 'Electronics', '2024-01-01', '9999-12-31');
-- Type 3: Add column for previous value
ALTER TABLE product_dimension ADD COLUMN previous_category VARCHAR(100);
UPDATE product_dimension SET previous_category = category WHERE product_id = 100;
UPDATE product_dimension SET category = 'Electronics' WHERE product_id = 100;
-- Sales by product and month
SELECT
d.year,
d.month_name,
p.product_name,
SUM(sf.quantity) as total_quantity,
SUM(sf.amount) as total_amount
FROM sales_fact sf
JOIN date_dimension d ON sf.date_id = d.date_id
JOIN product_dimension p ON sf.product_id = p.product_id
GROUP BY d.year, d.month, d.month_name, p.product_name
ORDER BY d.year, d.month, p.product_name;
-- Sales trend over time
SELECT
d.date,
SUM(sf.amount) as daily_sales,
AVG(SUM(sf.amount)) OVER (ORDER BY d.date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as avg_7day
FROM sales_fact sf
JOIN date_dimension d ON sf.date_id = d.date_id
GROUP BY d.date
ORDER BY d.date;
-- Top products by revenue
SELECT
p.product_name,
p.category,
SUM(sf.amount) as total_revenue,
SUM(sf.quantity) as total_units,
COUNT(DISTINCT sf.customer_id) as unique_customers
FROM sales_fact sf
JOIN product_dimension p ON sf.product_id = p.product_id
GROUP BY p.product_id, p.product_name, p.category
ORDER BY total_revenue DESC
LIMIT 10;
Scenario: Build analytics database from transactional e-commerce system.
Objectives:
Deliverables:
Skills Applied: Schema design, normalization, star schemas, ETL, query optimization
Scenario: Create data warehouse for HR analytics across company.
Objectives:
Deliverables:
Skills Applied: Multi-source integration, dimensional modeling, slowly changing dimensions, data governance
Scenario: Optimize poorly performing analytics database.
Objectives:
Deliverables:
Skills Applied: Query optimization, indexing, execution plan analysis, performance monitoring
Months 1-3: Basic Competency
├── Write simple SELECT queries
├── Understand JOIN basics
├── Work with single tables effectively
└── Understand basic database concepts
Months 4-6: Intermediate Competency
├── Master complex JOINs and subqueries
├── Optimize simple queries
├── Design basic schemas
├── Work with multiple tables confidently
└── Understand indexes and performance
Months 7-10: Advanced Competency
├── Design normalized schemas
├── Optimize complex queries
├── Understand star schemas
├── Build data warehouses
└── Work with multiple database systems
Months 11-16: Expert Competency
├── Lead database design initiatives
├── Architect enterprise data systems
├── Mentor others on best practices
├── Implement advanced optimization
└── Ready for engineering roles
Entry Level (0-2 years): $60,000 - $85,000
Mid Level (2-5 years): $85,000 - $115,000
Advanced (5+ years): $115,000 - $150,000
Senior/Lead (8+ years): $150,000 - $200,000+
-- Use consistent formatting
-- Capitalize keywords
-- Indent sub-clauses
-- Include comments for complex logic
SELECT
e.employee_id,
e.first_name,
d.department_name,
-- Calculate tenure in years
YEAR(CURDATE()) - YEAR(e.hire_date) as tenure_years
FROM employees e
LEFT JOIN departments d ON e.department_id = d.department_id
WHERE e.salary > 75000
ORDER BY e.last_name;
Security Principles:
├── Principle of Least Privilege
│ └── Users get minimum permissions needed
├── Role-Based Access Control
│ └── Assign permissions by role, not individual
├── Encryption
│ └── Encrypt sensitive data in transit and at rest
├── Audit Logging
│ └── Track who accessed what and when
├── Regular Backups
│ └── Test restore procedures regularly
└── Access Review
└── Quarterly review of permissions
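The first two principles translate directly into SQL. A minimal sketch using MySQL 8.0+ role syntax (the role, user, and database names are examples):

```sql
-- Create a role and grant it the minimum permissions needed
CREATE ROLE analyst_read;
GRANT SELECT ON company_analytics.* TO analyst_read;

-- Assign the role to a user instead of granting privileges individually
CREATE USER 'jsmith'@'%' IDENTIFIED BY 'use_a_strong_password';
GRANT analyst_read TO 'jsmith'@'%';
SET DEFAULT ROLE analyst_read TO 'jsmith'@'%';

-- For quarterly access reviews: inspect what a user can do
SHOW GRANTS FOR 'jsmith'@'%';
```

Granting through roles means an access review only has to audit a handful of roles rather than every individual user.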
Regular Maintenance Tasks:
Daily:
- Monitor query performance
- Check error logs
- Verify backups completed
Weekly:
- Update table statistics
- Review slow query logs
- Check disk space
Monthly:
- Analyze index usage
- Rebuild fragmented indexes
- Review security access
- Performance baseline comparison
Quarterly:
- Full backup verification
- Disaster recovery testing
- Capacity planning
- Schema optimization review
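Several of these tasks map directly to SQL commands. A MySQL-flavored sketch (PostgreSQL has ANALYZE and VACUUM equivalents):

```sql
-- Weekly: refresh optimizer statistics so execution plans stay accurate
ANALYZE TABLE employees;
ANALYZE TABLE sales_fact;

-- Monthly: rebuild fragmented tables and their indexes (InnoDB)
OPTIMIZE TABLE sales_fact;

-- Monthly: check index usage before dropping unused indexes
-- (MySQL sys schema; PostgreSQL equivalent: pg_stat_user_indexes)
SELECT * FROM sys.schema_unused_indexes;
```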
Directory Structure:
├── /schemas/
│ ├── 01_initial_schema.sql
│ └── 02_add_new_tables.sql
├── /migrations/
│ ├── 001_create_employees.sql
│ └── 002_add_indexes.sql
├── /procedures/
│ └── sp_daily_reporting.sql
├── /views/
│ └── vw_sales_summary.sql
└── README.md
Use version control:
- Git for SQL scripts
- Migration frameworks (Flyway, Liquibase)
- Deployment automation
- Rollback procedures
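A file under /migrations/ might look like the sketch below (the filename and column are illustrative); keeping the rollback alongside the forward change makes the procedure repeatable:

```sql
-- migrations/003_add_employee_email.sql
-- Forward migration
ALTER TABLE employees ADD COLUMN email VARCHAR(255);
CREATE UNIQUE INDEX idx_employee_email ON employees(email);

-- Rollback (run only to undo this migration)
-- DROP INDEX idx_employee_email ON employees;
-- ALTER TABLE employees DROP COLUMN email;
```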
Document:
1. Purpose of each table
2. Business rules and logic
3. Expected data ranges
4. Update frequency
5. Data refresh schedule
6. Access requirements
7. Known limitations
8. Contact person for questions
Template:
TABLE: sales_fact
Purpose: Records individual sales transactions
Source: Point of sale system, synced nightly
Refresh: Nightly at 2 AM
Key Fields:
- date_id: Reference to date dimension
- product_id: Reference to product dimension
- amount: Gross sales amount
Owner: Analytics Team
Before deploying to production:
1. Syntax check
- Query runs without errors
- Schema changes valid
2. Logic validation
- Results match expected output
- Edge cases handled
3. Performance testing
- Execution plan reviewed
- Acceptable performance
4. Data quality check
- Row counts match expectations
- No unexpected NULL values
- Referential integrity maintained
5. Backup & rollback
- Backup created before change
- Rollback procedure documented
- Rollback tested (if possible)
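The data quality step can be scripted as a handful of queries against the warehouse tables defined earlier (the date value is an example):

```sql
-- Row counts match expectations for the latest load
SELECT COUNT(*) AS rows_loaded
FROM sales_fact
WHERE date_id = 20240101;

-- No unexpected NULLs in key columns
SELECT COUNT(*) AS null_keys
FROM sales_fact
WHERE product_id IS NULL OR customer_id IS NULL OR date_id IS NULL;

-- Referential integrity: fact rows with no matching dimension row
SELECT sf.fact_id
FROM sales_fact sf
LEFT JOIN product_dimension p ON sf.product_id = p.product_id
WHERE p.product_id IS NULL;
```

All three queries should return zero problem rows (or the expected count) before a change is signed off.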
Set up learning environment
Establish daily practice
Understand your company's databases
Master all JOIN types
Optimize queries
Design first schema
Prepare for advanced roles
Lead database initiatives
Current Role: SQL & Databases Expert ✓ (You are here)
↓
Option A: Deepen SQL expertise
↓
Option B: Move to Phase 3 - Statistics Specialist
↓
Option C: Move to Phase 5 - Programming Expert
↓
Multiple Advanced Roles (4, 6)
↓
Career Leadership Roles (7 - Career Coach)
As a SQL & Databases Expert, you'll find that SQL expertise opens doors to advanced analytics roles and makes you valuable in any data organization.
Q: How long should I spend learning SQL before moving to advanced roles? A: Plan for 12-16 weeks of focused learning at minimum; many analysts benefit from up to six months of practice.
Q: Which database should I learn first? A: Start with MySQL or PostgreSQL (open source, widely used). SQL concepts transfer between systems.
Q: Should I memorize all SQL functions? A: No. Learn common functions well, know reference resources for others.
Q: How do I know a query needs optimization? A: If execution takes more than a couple of seconds, review the execution plan and check whether an index could help.
Q: What's the difference between HAVING and WHERE? A: WHERE filters before grouping. HAVING filters after grouping (use HAVING with GROUP BY).
Q: Can I optimize a query that takes 5 minutes? A: Usually yes. Start with execution plan analysis. Check for missing indexes, unnecessary columns, or redesign opportunities.
Last Updated: November 2024 Difficulty Level: Intermediate Estimated Time to Completion: 12-16 weeks