7-phase High-Level Design framework: clarify requirements, estimate scale, design components, model data, define APIs, deep dive on bottlenecks, and enumerate failure modes. Use for any system design question or production architecture task.
Structured methodology for designing production systems. Works for interview prep and real architecture decisions. Each phase builds on the last — skip none.
System to design: $ARGUMENTS
If no system is provided, ask: "What system are you designing? What's the scale and key constraints?"
Never design in a vacuum. Extract:
Functional requirements — what the system must do:
Non-functional requirements — how the system must perform:
Constraints and assumptions:
Out of scope — explicitly state what you're NOT designing.
Back-of-envelope numbers that drive architecture decisions. Quantify before designing.
Traffic:
DAU = X million
Reads per user per day = Y
Writes per user per day = Z
Read QPS = DAU × Y / 86,400
Write QPS = DAU × Z / 86,400
Peak QPS = average QPS × 3 to 5 (account for traffic spikes)
Storage:
Object size = S bytes (e.g., tweet = ~300 bytes, image = ~200 KB)
Daily writes = Write QPS × 86,400
3-year storage = Daily writes × S × 365 × 3
Bandwidth:
Inbound = Write QPS × avg object size
Outbound = Read QPS × avg response size
Cache:
Hot data = ~20% of objects serve ~80% of reads (Pareto)
Cache size = number of hot objects × avg object size
State your assumptions explicitly. Numbers don't need to be perfect — they need to inform decisions.
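The formulas above can be turned into a small calculator. This is a minimal sketch, not a definitive tool: the `Workload` fields, the `estimate` function, and the sample numbers are all hypothetical, and storage is computed as the daily write count multiplied by the object size so the result comes out in bytes.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class Workload:
    """Hypothetical input bundle; every value is an assumption you state up front."""
    dau: int                  # daily active users
    reads_per_user: int       # reads per user per day
    writes_per_user: int      # writes per user per day
    object_bytes: int         # average stored object size
    response_bytes: int       # average read response size
    peak_factor: float = 3.0  # low end of the 3 to 5 range


def estimate(w: Workload) -> dict:
    """Apply the traffic, storage, and bandwidth formulas to one workload."""
    read_qps = w.dau * w.reads_per_user / SECONDS_PER_DAY
    write_qps = w.dau * w.writes_per_user / SECONDS_PER_DAY
    daily_writes = w.dau * w.writes_per_user
    return {
        "read_qps": read_qps,
        "write_qps": write_qps,
        "peak_read_qps": read_qps * w.peak_factor,
        "daily_writes": daily_writes,
        "storage_3y_bytes": daily_writes * w.object_bytes * 365 * 3,
        "inbound_bytes_per_s": write_qps * w.object_bytes,
        "outbound_bytes_per_s": read_qps * w.response_bytes,
    }


# Illustrative numbers only: 10M DAU, 20 reads and 2 writes per user per day.
result = estimate(Workload(dau=10_000_000, reads_per_user=20, writes_per_user=2,
                           object_bytes=300, response_bytes=2_000))
```

With these sample inputs the read rate comes out at roughly 2,300 QPS and three years of writes at about 6.6 TB, which is the kind of order-of-magnitude answer that decides whether a single database node is plausible.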
Draw the system as boxes and arrows. For each component, state its role and why it exists.
Standard components to consider:
[Clients] -> [CDN] -> [Load Balancer] -> [API Gateway]
                                               |
                           +-------------------+-------------------+
                           |                   |                   |
                      [Service A]         [Service B]         [Service C]
                           |                   |                   |
                        [Cache]         [Message Queue]     [Search Index]
                           |                   |                   |
                     [Primary DB]          [Worker]         [Object Store]
                           |
                    [Read Replica]
For each service/component, explain:
Common patterns:
Define the core entities and relationships before writing APIs.
For each entity:
Access pattern analysis:
Query: "Get user by email" -> Index on email (unique)
Query: "Get posts by user, sorted by time" -> Composite index (user_id, created_at DESC)
Query: "Get all comments for a post" -> Index on post_id
State the dominant query patterns first, then design the schema to serve them.
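One way to check that the schema actually serves the listed access patterns is to create the indexes and inspect the query plan. The sketch below uses Python's built-in `sqlite3` module purely for illustration; the table and column names (`users`, `posts`, `comments`, `created_at`, and so on) and the `plan` helper are assumptions, and a production database will have its own `EXPLAIN` syntax and planner behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users    (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
CREATE TABLE posts    (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL,
                       created_at INTEGER NOT NULL, body TEXT);
CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER NOT NULL, body TEXT);

-- One index per dominant access pattern.
CREATE UNIQUE INDEX idx_users_email     ON users (email);
CREATE INDEX        idx_posts_user_time ON posts (user_id, created_at DESC);
CREATE INDEX        idx_comments_post   ON comments (post_id);
""")


def plan(sql: str) -> str:
    """Return SQLite's query plan for `sql` as one readable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the plan detail


users_plan = plan("SELECT id FROM users WHERE email = 'a@example.com'")
posts_plan = plan("SELECT id, body FROM posts WHERE user_id = 1 ORDER BY created_at DESC")
comments_plan = plan("SELECT body FROM comments WHERE post_id = 1")
```

Each plan string should name the matching index. For the posts query, the composite index also supplies the sort order, so the plan should not need a separate sorting step.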
Define the external contract. Use REST unless there's a specific reason for GraphQL or gRPC.
For each endpoint:
POST /api/v1/users
Authorization: Bearer <token>
Request: { email, password, displayName }
Response: { id, email, displayName, createdAt }
Errors: 400 (validation), 409 (email taken), 500 (server error)
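The contract above can be exercised with a framework-free handler. This is a hedged sketch: `create_user`, the in-memory `users_by_email` store, the minimum password length of 8, and the 201 success status are all assumptions for illustration (the contract itself does not name a success code), and bearer-token checking is left out.

```python
import uuid
from datetime import datetime, timezone


def create_user(body: dict, users_by_email: dict) -> tuple:
    """Handle POST /api/v1/users against a hypothetical in-memory store.

    Returns (status_code, json_payload).
    """
    email = str(body.get("email") or "")
    password = str(body.get("password") or "")
    display_name = str(body.get("displayName") or "")

    # 400: reject malformed input before touching storage.
    if "@" not in email or len(password) < 8 or not display_name:
        return 400, {"error": "validation"}

    # 409: the email is already registered.
    if email in users_by_email:
        return 409, {"error": "email taken"}

    user = {
        "id": str(uuid.uuid4()),
        "email": email,
        "displayName": display_name,
        "createdAt": datetime.now(timezone.utc).isoformat(),
    }
    users_by_email[email] = user  # a real service would store a password hash too
    return 201, user              # the response never echoes the password


store = {}
request = {"email": "ada@example.com", "password": "correct-horse", "displayName": "Ada"}
created_status, created_user = create_user(request, store)
duplicate_status, _ = create_user(request, store)
invalid_status, _ = create_user({"email": "not-an-email"}, store)
```

Walking each documented error through the handler like this is a quick check that the contract lists every status a client can actually receive.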
Include:
API versioning (a URL path prefix such as /v1/ is simplest)
Pick 2-3 hardest sub-problems and solve them in detail. Common picks:
Feed generation:
Search:
Notifications (real-time):
Media upload:
Distributed rate limiting:
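As an illustration of the depth a deep dive should reach, here is a token bucket, one common rate-limiting algorithm chosen for this example rather than taken from the text above. It is a single-process sketch only: the `TokenBucket` class and its parameters are hypothetical, and a distributed version would need to keep the bucket state in a shared store and update it atomically.

```python
import time


class TokenBucket:
    """Single-process token bucket; a stand-in for a distributed limiter."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.clock = clock            # injectable so tests can control time
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        """Refill based on elapsed time, then try to spend `cost` tokens."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Usage with a fake clock: a burst of 2 is allowed, then 1 request per second.
now = [0.0]
bucket = TokenBucket(rate=1.0, capacity=2, clock=lambda: now[0])
burst = [bucket.allow(), bucket.allow(), bucket.allow()]
now[0] = 1.0
after_one_second = [bucket.allow(), bucket.allow()]
```

The design questions worth discussing in the deep dive are the ones this sketch avoids: where the counters live, how updates stay atomic across nodes, and what happens when the shared store is unavailable.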
Enumerate what can fail and how the system handles it.
| Component | Failure | Detection | Recovery |
|---|---|---|---|
| Primary DB | Crash | Health check + replica lag monitor | Promote replica, update DNS (< 30s) |
| Cache (Redis) | Eviction / miss | Cache hit rate metric | Serve from DB, warm cache async |
| Message queue | Consumer lag | Queue depth metric > threshold | Scale consumers, alert |
| External API | Timeout / 5xx | Circuit breaker (half-open after 30s) | Fallback response or queue for retry |
| API server | Memory leak | RSS growth + OOM kill | Horizontal scaling + auto-restart |
For each failure: detection -> recovery -> prevention
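The External API row can be made concrete with a small circuit breaker. This is a minimal sketch under stated assumptions: the `CircuitBreaker` class, its `failure_threshold` parameter, and the fake clock are hypothetical, and only the 30-second half-open delay and the fallback response come from the table.

```python
import time


class CircuitBreaker:
    """Opens after repeated failures and allows a trial call after `reset_after` seconds."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock        # injectable so tests can control time
        self.failures = 0
        self.opened_at = None     # None means the circuit is closed

    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_after:
            return "half-open"
        return "open"

    def call(self, fn, fallback):
        state = self.state()
        if state == "open":
            return fallback()     # short-circuit: do not touch the dependency
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        self.opened_at = None     # success closes the circuit
        return result


def failing_call():
    raise TimeoutError("upstream timeout")


now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_after=30.0, clock=lambda: now[0])
first = breaker.call(failing_call, lambda: "fallback")
second = breaker.call(failing_call, lambda: "fallback")
state_after_failures = breaker.state()
blocked = breaker.call(lambda: "ok", lambda: "fallback")
now[0] = 31.0
state_after_wait = breaker.state()
recovered = breaker.call(lambda: "ok", lambda: "fallback")
```

The breaker covers detection and recovery for this row. Prevention still has to come from elsewhere, for example timeouts and capacity limits on the calling side.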
Also address:
## System Design: [System Name]
### Requirements
**Functional:** ...
**Non-functional:** ...
**Out of scope:** ...
### Capacity Estimates
| Metric | Calculation | Result |
|--------|-------------|--------|
### Architecture Diagram
[ASCII or Mermaid diagram]
### Component Breakdown
| Component | Role | Technology | Rationale |
|-----------|------|------------|-----------|
### Data Model
[Schema with fields, indexes, relationships]
### API Contract
[Key endpoints with request/response]
### Deep Dives
[2-3 detailed sub-problems solved]
### Failure Modes
[Table: component, failure, detection, recovery]
### Tradeoffs
[3-5 explicit tradeoffs made in this design]