From harness-claude
Models vertices and edges in SQL tables with recursive CTEs for social graphs, dependencies, recommendations, fraud detection. Guides limits and graph DB transitions.
npx claudepluginhub intense-visions/harness-engineering --plugin harness-claudeThis skill uses the workspace's default tool permissions.
> Modeling vertices and edges in SQL tables for social graphs, dependency networks, and recommendation systems with recursive queries -- and knowing when SQL stops being practical.
Designs graph schemas, models relationships, optimizes traversals and queries for SurrealDB and general graph databases. Use for knowledge graphs, social networks, recommendations, fraud detection.
Designs, reviews, and refactors graph database schemas for Neo4j, Memgraph, Neptune using 46 prioritized rules across 8 categories with Cypher examples focused on modeling correctness.
Designs, reviews, and refactors Neo4j graph data models, detecting anti-patterns like generic labels and supernodes, migrating relational schemas to graph, and enforcing constraints and indexes.
Share bugs, ideas, or general feedback.
Modeling vertices and edges in SQL tables for social graphs, dependency networks, and recommendation systems with recursive queries -- and knowing when SQL stops being practical.
Basic graph schema -- vertices (nodes) and edges (relationships):
CREATE TABLE vertices (
id serial PRIMARY KEY,
label varchar NOT NULL,
properties jsonb DEFAULT '{}'
);
CREATE TABLE edges (
id serial PRIMARY KEY,
source_id int NOT NULL REFERENCES vertices(id) ON DELETE CASCADE,
target_id int NOT NULL REFERENCES vertices(id) ON DELETE CASCADE,
label varchar NOT NULL,
weight float DEFAULT 1.0,
properties jsonb DEFAULT '{}'
);
CREATE INDEX idx_edges_source ON edges (source_id);
CREATE INDEX idx_edges_target ON edges (target_id);
Directed vs undirected: For undirected graphs, either store both directions (A->B and B->A) or query with WHERE source_id = ? OR target_id = ?. Storing both directions doubles storage but simplifies and speeds up queries.
Path queries via recursive CTE:
-- All nodes reachable from vertex 1 within 3 hops
WITH RECURSIVE reachable AS (
SELECT target_id AS id, 1 AS hops, ARRAY[source_id, target_id] AS path
FROM edges WHERE source_id = 1
UNION ALL
SELECT e.target_id, r.hops + 1, r.path || e.target_id
FROM edges e
JOIN reachable r ON e.source_id = r.id
WHERE r.hops < 3
AND e.target_id != ALL(r.path) -- cycle detection
)
SELECT DISTINCT id, hops FROM reachable;
The path array tracks visited nodes for cycle detection. The hops < 3 clause bounds traversal depth.
When SQL stops being practical:
At these thresholds, consider a dedicated graph database (Neo4j, Amazon Neptune) or a graph extension (Apache AGE for PostgreSQL).
Social network "people you may know" (friends-of-friends):
CREATE TABLE users (
id serial PRIMARY KEY,
name varchar NOT NULL
);
CREATE TABLE friendships (
user_id int NOT NULL REFERENCES users(id),
friend_id int NOT NULL REFERENCES users(id),
PRIMARY KEY (user_id, friend_id)
);
CREATE INDEX idx_friendships_friend ON friendships (friend_id);
Direct friends of user 1:
SELECT u.* FROM users u
JOIN friendships f ON u.id = f.friend_id
WHERE f.user_id = 1;
Friends-of-friends (2-hop) excluding direct friends:
WITH direct_friends AS (
SELECT friend_id FROM friendships WHERE user_id = 1
),
friends_of_friends AS (
SELECT DISTINCT f2.friend_id
FROM friendships f1
JOIN friendships f2 ON f1.friend_id = f2.user_id
WHERE f1.user_id = 1
AND f2.friend_id != 1
AND f2.friend_id NOT IN (SELECT friend_id FROM direct_friends)
)
SELECT u.*, count(*) AS mutual_friends
FROM friends_of_friends fof
JOIN friendships f ON fof.friend_id = f.friend_id
JOIN direct_friends df ON f.user_id = df.friend_id
JOIN users u ON u.id = fof.friend_id
GROUP BY u.id, u.name
ORDER BY mutual_friends DESC;
The performance cliff: On a graph with 1M edges, 2-hop traversal completes in ~50ms. 3-hop: ~2 seconds. 4-hop: timeout. Each additional hop multiplies the working set by the average node degree.
hops < N bound.source_id and target_id on the edges table. Reverse traversals (who points to X?) are common and require the target index.Apache AGE extension adds the Cypher query language to PostgreSQL:
-- After installing AGE extension
SELECT * FROM cypher('social_graph', $$
MATCH (a:User {name: 'Alice'})-[:FRIENDS_WITH*2..3]-(b:User)
WHERE a <> b
RETURN DISTINCT b.name
$$) AS (name agtype);
AGE allows graph queries without leaving PostgreSQL -- useful when the graph is a secondary concern alongside relational data.
Recursive CTE with CYCLE clause (PostgreSQL 14+):
WITH RECURSIVE paths AS (
SELECT source_id, target_id, 1 AS depth FROM edges WHERE source_id = 1
UNION ALL
SELECT p.source_id, e.target_id, p.depth + 1
FROM paths p JOIN edges e ON p.target_id = e.source_id
WHERE p.depth < 5
)
CYCLE target_id SET is_cycle USING path
SELECT * FROM paths WHERE NOT is_cycle;
JSONB properties on vertices and edges enable flexible schema evolution. Add a GIN index for filtered graph queries: CREATE INDEX ON vertices USING gin (properties);.
Weighted shortest path (Dijkstra in SQL): Technically possible with recursive CTEs tracking cumulative weight, but impractical for large graphs. Each step explores all outgoing edges, and pruning (stopping when a shorter path is already found) requires subquery tricks that the optimizer handles poorly. For weighted shortest path, export to an application-layer library.
Bidirectional search: Start from both source and target, meeting in the middle. Reduces traversal from O(b^d) to O(2 * b^(d/2)) where b is branching factor and d is depth. Implementable in SQL with two recursive CTEs joined on intersection, but complex to write correctly.
Materialized graph views: Pre-compute common traversals (e.g., 2-hop friend connections) in a materialized table. Refresh periodically. Trades storage and freshness for query speed.
Hybrid architecture: Use PostgreSQL for transactional graph storage and integrity. Export to Neo4j or NetworkX for analytics (PageRank, community detection, path algorithms). This avoids a full migration to a graph database while enabling graph-native operations where SQL falls short.
MySQL 8.0+ supports recursive CTEs for graph traversal with compatible syntax.
Key MySQL differences:
CYCLE clause -- cycle detection requires manual path tracking with a VARCHAR or JSON columnJSON column type with generated columns for indexing vertex/edge propertiescte_max_recursion_depth (default 1000) limits traversal depthFraud detection platform modeling transaction networks. 5M accounts as vertices, 50M transactions as edges. The core query: "who transacted with someone who transacted with suspect X?" (2-hop). PostgreSQL with indexed edges: 120ms for 2-hop queries. 3-hop queries: 8 seconds -- unacceptable for real-time alerting. The team moved 3+ hop analysis to Neo4j while keeping transactional data and 1-2 hop queries in PostgreSQL. The hybrid approach saved 6 months of engineering time compared to a full migration to a graph database. PostgreSQL handles ACID transactions on account balances; Neo4j handles network pattern detection.