Identify foreign key relationships between tables using heuristics, value overlap analysis, and referential integrity checks
Identifies foreign key relationships between tables using naming patterns, value overlap analysis, and referential integrity checks. Use this when you need to understand table relationships before joining data or validating referential integrity.
/plugin marketplace add tilmon-engineering/claude-skills/plugin install datapeeker@tilmon-eng-skillsThis skill inherits all available tools. When active, it can use any tool Claude has access to.
This component skill guides systematic foreign key relationship detection in relational databases. Use it when:
Create a TodoWrite checklist for the 5-phase FK detection process:
Phase 1: Candidate Identification - pending
Phase 2: Value Overlap Analysis - pending
Phase 3: Cardinality Assessment - pending
Phase 4: Referential Integrity Validation - pending
Phase 5: Relationship Documentation - pending
Mark each phase as you complete it. Document all findings in structured format.
Goal: Identify columns that are likely foreign keys based on naming patterns, data types, and uniqueness.
Common FK naming patterns:
_id (e.g., customer_id, product_id)Id (e.g., customerId, productId)id (but only in child tables)fk_ (e.g., fk_customer)customer in orders table)-- List all columns across all tables
SELECT
m.name as table_name,
p.name as column_name,
p.type as column_type
FROM sqlite_master m
JOIN pragma_table_info(m.name) p
WHERE m.type = 'table'
AND m.name NOT LIKE 'sqlite_%'
AND (
p.name LIKE '%_id'
OR p.name LIKE '%Id'
OR p.name LIKE 'fk_%'
OR p.name = 'id'
)
ORDER BY m.name, p.name;
Document:
Primary key characteristics:
id, [table]_id, or similar-- Find columns likely to be primary keys
SELECT
m.name as table_name,
p.name as column_name,
p.type as column_type,
p.pk as is_primary_key
FROM sqlite_master m
JOIN pragma_table_info(m.name) p
WHERE m.type = 'table'
AND m.name NOT LIKE 'sqlite_%'
AND (
p.pk = 1 -- Explicitly defined PK
OR p.name = 'id'
OR p.name = m.name || '_id'
)
ORDER BY m.name;
Document:
Heuristic: A column named customer_id likely references a table named customers or customer.
-- Cross-reference FK column names with table names
-- (Pseudo-query - implement with string matching logic)
-- For each FK candidate like 'customer_id':
-- 1. Strip suffix (_id, Id)
-- 2. Look for table named 'customers', 'customer', or similar
-- 3. Record as potential relationship
Document:
orders.customer_id → customers.id)customer_id → customer table)customer_id → customers table)Goal: Validate FK candidates by checking if their values actually exist in the proposed parent table.
For each candidate FK relationship identified in Phase 1:
-- Calculate what percentage of FK values exist in parent table
WITH fk_values AS (
SELECT DISTINCT child_fk_column as value
FROM child_table
WHERE child_fk_column IS NOT NULL
),
pk_values AS (
SELECT DISTINCT parent_pk_column as value
FROM parent_table
WHERE parent_pk_column IS NOT NULL
),
overlap AS (
SELECT COUNT(*) as matching_count
FROM fk_values fk
WHERE fk.value IN (SELECT value FROM pk_values)
)
SELECT
(SELECT COUNT(*) FROM fk_values) as total_fk_values,
(SELECT COUNT(*) FROM pk_values) as total_pk_values,
overlap.matching_count,
ROUND(100.0 * overlap.matching_count / (SELECT COUNT(*) FROM fk_values), 2) as match_percentage
FROM overlap;
Interpret match percentage:
Document:
For relationships with <100% match:
-- Find child records with FK values that don't exist in parent
SELECT
child_table.rowid,
child_table.child_fk_column as orphaned_value,
COUNT(*) OVER (PARTITION BY child_table.child_fk_column) as occurrences
FROM child_table
LEFT JOIN parent_table ON child_table.child_fk_column = parent_table.parent_pk_column
WHERE parent_table.parent_pk_column IS NULL
AND child_table.child_fk_column IS NOT NULL
LIMIT 20;
Document:
-- Find parent records not referenced by any child
SELECT
parent_table.parent_pk_column as unused_pk_value,
COUNT(*) as occurrence_count
FROM parent_table
LEFT JOIN child_table ON parent_table.parent_pk_column = child_table.child_fk_column
WHERE child_table.child_fk_column IS NULL
AND parent_table.parent_pk_column IS NOT NULL
LIMIT 20;
Document:
Goal: Determine the relationship type (one-to-one, one-to-many, many-to-many).
How many child records per parent record?
-- Average number of child records per parent
SELECT
COUNT(*) as total_child_records,
COUNT(DISTINCT child_fk_column) as distinct_fk_values,
ROUND(1.0 * COUNT(*) / NULLIF(COUNT(DISTINCT child_fk_column), 0), 2) as avg_children_per_parent,
MIN(child_count) as min_children,
MAX(child_count) as max_children
FROM child_table
CROSS JOIN (
SELECT
child_fk_column as fk,
COUNT(*) as child_count
FROM child_table
WHERE child_fk_column IS NOT NULL
GROUP BY child_fk_column
);
Interpret cardinality:
Document:
Junction table characteristics:
-- Find tables with multiple FK candidates (potential junction tables)
SELECT
table_name,
COUNT(*) as fk_column_count,
GROUP_CONCAT(column_name, ', ') as fk_columns
FROM (
SELECT
m.name as table_name,
p.name as column_name
FROM sqlite_master m
JOIN pragma_table_info(m.name) p
WHERE m.type = 'table'
AND m.name NOT LIKE 'sqlite_%'
AND (p.name LIKE '%_id' OR p.name LIKE 'fk_%')
)
GROUP BY table_name
HAVING COUNT(*) >= 2
ORDER BY fk_column_count DESC;
Document:
Hierarchical data pattern:
employee.manager_id → employee.id)-- Find columns that might reference the same table
SELECT
table_name,
column_name,
type
FROM (
SELECT
m.name as table_name,
p.name as column_name,
p.type as type
FROM sqlite_master m
JOIN pragma_table_info(m.name) p
WHERE m.type = 'table'
AND m.name NOT LIKE 'sqlite_%'
AND (
p.name LIKE 'parent_%'
OR p.name LIKE 'manager_%'
OR p.name LIKE '%_parent_id'
)
);
Document:
Goal: Quantify integrity violations and assess data quality impact.
For each confirmed FK relationship:
-- Comprehensive referential integrity check
WITH integrity_check AS (
SELECT
COUNT(*) as total_child_records,
COUNT(child_fk_column) as non_null_fk_count,
COUNT(*) - COUNT(child_fk_column) as null_fk_count,
SUM(CASE WHEN p.parent_pk_column IS NULL AND c.child_fk_column IS NOT NULL THEN 1 ELSE 0 END) as orphaned_count
FROM child_table c
LEFT JOIN parent_table p ON c.child_fk_column = p.parent_pk_column
)
SELECT
total_child_records,
non_null_fk_count,
null_fk_count,
ROUND(100.0 * null_fk_count / total_child_records, 2) as null_fk_pct,
orphaned_count,
ROUND(100.0 * orphaned_count / non_null_fk_count, 2) as orphaned_pct,
non_null_fk_count - orphaned_count as valid_fk_count,
ROUND(100.0 * (non_null_fk_count - orphaned_count) / non_null_fk_count, 2) as integrity_pct
FROM integrity_check;
Document:
Business impact depends on:
-- Analyze orphaned records by recency
SELECT
CASE
WHEN date_column >= date('now', '-7 days') THEN 'Last 7 days'
WHEN date_column >= date('now', '-30 days') THEN 'Last 30 days'
WHEN date_column >= date('now', '-90 days') THEN 'Last 90 days'
ELSE 'Older than 90 days'
END as recency,
COUNT(*) as orphaned_count
FROM child_table c
LEFT JOIN parent_table p ON c.child_fk_column = p.parent_pk_column
WHERE p.parent_pk_column IS NULL
AND c.child_fk_column IS NOT NULL
AND c.date_column IS NOT NULL
GROUP BY recency
ORDER BY MIN(c.date_column);
Document:
If multiple columns together form a FK:
-- Check integrity for composite FK
WITH composite_fk_values AS (
SELECT DISTINCT
child_table.fk_column1,
child_table.fk_column2
FROM child_table
WHERE child_table.fk_column1 IS NOT NULL
AND child_table.fk_column2 IS NOT NULL
),
composite_pk_values AS (
SELECT DISTINCT
parent_table.pk_column1,
parent_table.pk_column2
FROM parent_table
)
SELECT
COUNT(*) as total_composite_fk_values,
SUM(CASE WHEN pk.pk_column1 IS NULL THEN 1 ELSE 0 END) as orphaned_count
FROM composite_fk_values fk
LEFT JOIN composite_pk_values pk
ON fk.fk_column1 = pk.pk_column1
AND fk.fk_column2 = pk.pk_column2;
Document:
Goal: Create structured documentation of all discovered relationships for use in cleaning and analysis.
Document each confirmed relationship:
## Foreign Key Relationships
### High Confidence Relationships (>95% integrity)
#### orders.customer_id → customers.id
- **Relationship Type:** Many-to-one
- **Child Table:** orders (1,523 rows)
- **Parent Table:** customers (342 rows)
- **Match Percentage:** 98.2%
- **Cardinality:** Avg 4.5 orders per customer (min: 1, max: 47)
- **NULL FKs:** 12 rows (0.8%)
- **Orphaned FKs:** 15 rows (1.0%)
- **Recommended Join:** LEFT JOIN (to preserve orphaned orders)
- **Cleaning Action:** Investigate 15 orphaned orders, flag for review
### Medium Confidence Relationships (80-95% integrity)
#### products.category_id → categories.id
- **Relationship Type:** Many-to-one
- **Child Table:** products (856 rows)
- **Parent Table:** categories (24 rows)
- **Match Percentage:** 87.3%
- **Cardinality:** Avg 35.7 products per category (min: 2, max: 142)
- **NULL FKs:** 89 rows (10.4%)
- **Orphaned FKs:** 20 rows (2.4%)
- **Recommended Join:** INNER JOIN (if categorized products only needed)
- **Cleaning Action:** Exclude or recategorize 20 orphaned products
### Low Confidence / Unconfirmed (<80% integrity)
#### transactions.merchant_id → merchants.id
- **Relationship Type:** Uncertain
- **Match Percentage:** 67.8%
- **Issue:** Large number of orphaned merchant_id values
- **Recommendation:** Verify with data owner - may be wrong parent table
For each relationship:
## Join Recommendations
### orders ⟶ customers
**Recommended SQL:**
```sql
-- Use LEFT JOIN to preserve all orders (including orphans)
SELECT
o.*,
c.customer_name,
c.customer_segment
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id;
-- Alternative: INNER JOIN if orphans should be excluded
SELECT
o.*,
c.customer_name
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id;
-- Note: Excludes 15 orders (1.0%) with invalid customer_id
Join Impact:
### Document Data Quality Implications
```markdown
## Data Quality Implications
### Orphaned Records Summary
Total orphaned records across all relationships: 35 (2.1% of all child records)
| Child Table | FK Column | Orphan Count | % of Child Table | Impact |
|-------------|-----------|--------------|------------------|--------|
| orders | customer_id | 15 | 1.0% | Low - recent orders, may resolve |
| products | category_id | 20 | 2.4% | Medium - affects category analysis |
### Recommended Cleaning Actions
**High Priority:**
1. products.category_id orphans (20 rows) - CREATE placeholder category "Uncategorized" or exclude from analysis
2. orders.customer_id orphans (15 rows) - FLAG for customer service review
**Medium Priority:**
3. NULL customer_id in orders (12 rows) - Investigate if legitimate (guest checkout?) or data entry error
### Analysis Limitations
Due to referential integrity issues:
- Customer-level aggregations will exclude 1.0% of orders (if using INNER JOIN)
- Category-level product analysis may be incomplete (2.4% of products uncategorized)
- Time-series trends should use LEFT JOIN to preserve all records
importing-data (Phase 5: Quality Assessment)After importing tables, run FK detection to include in quality report:
## Foreign Key Relationships (from detect-foreign-keys skill)
High Confidence:
- orders.customer_id → customers.id (98% integrity, 15 orphans)
- ...
Medium Confidence:
- products.category_id → categories.id (87% integrity, 20 orphans)
cleaning-data (Phase 1: Scope Definition)Use FK findings to inform cleaning scope:
## Referential Integrity Issues
From detect-foreign-keys analysis:
- **orders.customer_id:** 15 orphaned records (1.0%) - Priority: HIGH
- Recommended action: Flag for review, preserve with LEFT JOIN
understanding-data (Phase 4: Relationship Identification)This skill provides the systematic process for Phase 4:
## Phase 4: Relationship Identification
Use the `detect-foreign-keys` component skill to systematically identify and validate all foreign key relationships.
DON'T:
DO:
Re-run this skill when:
After completing this skill, you should have:
This documentation feeds into importing-data quality reports and cleaning-data scope definitions, ensuring relationship-aware data quality management.
Use when working with Payload CMS projects (payload.config.ts, collections, fields, hooks, access control, Payload API). Use when debugging validation errors, security issues, relationship queries, transactions, or hook behavior.