System Design

Overview

Design a robust, scalable system from scratch or plan major architectural changes. This command guides you through senior-level system design considerations including requirements, architecture, scalability, reliability, and tradeoffs.

Steps

1. Gather Requirements

Functional Requirements

What features does the system need?
What are the core use cases?
What operations will users perform?
What data needs to be stored and retrieved?
What are the inputs and outputs?

Non-Functional Requirements

Scale: How many users? Requests per second? Data volume?
Performance: Latency requirements? Throughput targets?
Availability: Uptime requirements (99.9%, 99.99%, etc.)?
Consistency: Strong consistency or eventual consistency?
Durability: Data loss tolerance?
Security: Authentication, authorization, encryption needs?
Compliance: GDPR, HIPAA, SOC2, etc.?
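
An availability target translates directly into a yearly downtime budget. The sketch below is plain arithmetic; the function name is illustrative and a 365-day year is assumed.

```python
# Yearly downtime budget implied by an availability target.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the downtime budget in minutes per year for an uptime target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} min/year")
```

At 99.9% the budget is about 8.8 hours per year; at 99.99% it is under an hour, which usually rules out manual failover.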

Constraints

Budget limitations
Timeline constraints
Team expertise
Technology restrictions
Regulatory requirements

2. Capacity Estimation

Calculate:

Traffic Estimates: QPS (queries per second), peak vs. average
Storage Estimates: Data size per record × number of new records per period × retention period, adjusted for expected growth
Bandwidth Estimates: Request size × QPS
Memory Estimates: Cache size, in-memory data structures
Compute Requirements: CPU/RAM per instance × number of instances
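
A back-of-the-envelope calculation might look like the sketch below. The write volume and read-to-write ratio are borrowed from Example 1 (the URL shortener); the record size, retention period, and peak factor are assumptions chosen only for illustration.

```python
# Back-of-the-envelope capacity estimate.
SECONDS_PER_DAY = 24 * 60 * 60

writes_per_day = 100_000_000   # from Example 1: 100M URLs shortened per day
read_write_ratio = 100         # from Example 1: 100:1 read-to-write ratio
bytes_per_record = 500         # assumption: average stored record size
retention_years = 5            # assumption: how long records are kept
peak_factor = 2                # assumption: peak traffic is 2x average

# Traffic: average and peak queries per second.
avg_write_qps = writes_per_day / SECONDS_PER_DAY
avg_read_qps = avg_write_qps * read_write_ratio
peak_read_qps = avg_read_qps * peak_factor

# Storage: record size x records per day x days retained.
storage_bytes = writes_per_day * 365 * retention_years * bytes_per_record

# Bandwidth: response size x read QPS (responses assumed to be one record).
read_bandwidth_bytes_per_s = avg_read_qps * bytes_per_record

print(f"avg write QPS:  {avg_write_qps:,.0f}")
print(f"avg read QPS:   {avg_read_qps:,.0f}")
print(f"peak read QPS:  {peak_read_qps:,.0f}")
print(f"storage:        {storage_bytes / 1e12:.2f} TB")
print(f"read bandwidth: {read_bandwidth_bytes_per_s / 1e6:.1f} MB/s")
```

Under these assumptions the system handles roughly 1,157 writes and 115,741 reads per second on average, and accumulates about 91 TB over five years.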

3. Define APIs

Design clean, RESTful APIs:

Endpoints: What operations are exposed?
Request/Response: What data is sent and received?
Authentication: How are requests authenticated?
Rate Limiting: What are the limits?
Versioning: How will APIs evolve?
Error Handling: What error codes and messages?

Example:

POST /api/v1/users
GET /api/v1/users/{id}
PUT /api/v1/users/{id}
DELETE /api/v1/users/{id}
GET /api/v1/users?page=1&limit=20
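
As a rough sketch of what the paginated list endpoint above could do behind the scenes, the function below validates `page` and `limit` and returns one slice of results. It is framework-agnostic; `User`, `USERS`, `MAX_LIMIT`, and the response shape are hypothetical stand-ins, not a prescribed contract.

```python
from dataclasses import dataclass

MAX_LIMIT = 100  # assumption: cap the page size to protect the backend


@dataclass
class User:
    id: int
    name: str


# Stand-in for a real data store: 45 fake users.
USERS = [User(id=i, name=f"user{i}") for i in range(1, 46)]


def list_users(page: int = 1, limit: int = 20) -> dict:
    """Handle GET /api/v1/users?page=..&limit=.. (illustrative only)."""
    if page < 1 or not 1 <= limit <= MAX_LIMIT:
        return {
            "status": 400,
            "error": f"page must be >= 1 and limit must be in 1..{MAX_LIMIT}",
        }
    start = (page - 1) * limit
    items = USERS[start:start + limit]
    return {
        "status": 200,
        "data": [{"id": u.id, "name": u.name} for u in items],
        "page": page,
        "limit": limit,
        "total": len(USERS),
    }


print(list_users(page=3, limit=20)["data"][0])
```

Offset pagination like this is simple but slows down on deep pages; cursor-based pagination is a common alternative for large collections.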

4. Data Model Design

Database Selection

Relational (SQL): PostgreSQL, MySQL
- Use for: Structured data, ACID transactions, complex queries
NoSQL Document: MongoDB, Couchbase
- Use for: Flexible schema, hierarchical data
NoSQL Key-Value: Redis, DynamoDB
- Use for: Simple lookups, caching, sessions
NoSQL Wide-Column: Cassandra, HBase
- Use for: High write throughput, time-series data
Graph: Neo4j, Amazon Neptune
- Use for: Relationship-heavy data (social graphs, recommendations)

Schema Design

Define entities and relationships
Identify primary keys and foreign keys
Plan for indexes
Consider partitioning/sharding strategy
Design for query patterns
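
The schema points above can be sketched with Python's built-in `sqlite3` module. The `users` and `orders` tables are hypothetical; the composite index is chosen to serve one assumed query pattern, "most recent orders for a user".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default

conn.executescript("""
CREATE TABLE users (
    id    INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE
);
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    user_id     INTEGER NOT NULL REFERENCES users(id),
    created_at  TEXT NOT NULL,
    total_cents INTEGER NOT NULL
);
-- Index designed for the query pattern below: filter by user, sort by date.
CREATE INDEX idx_orders_user_created ON orders (user_id, created_at);
""")

conn.execute("INSERT INTO users (id, email) VALUES (1, 'a@example.com')")
conn.executemany(
    "INSERT INTO orders (user_id, created_at, total_cents) VALUES (?, ?, ?)",
    [(1, "2024-01-01", 1000), (1, "2024-02-01", 2500)],
)

# The query the index was designed for.
recent = conn.execute(
    "SELECT created_at, total_cents FROM orders "
    "WHERE user_id = ? ORDER BY created_at DESC LIMIT 10",
    (1,),
).fetchall()
print(recent)
```

In a sharded deployment, `user_id` would be a natural candidate for the partition key here, since the main query never crosses users.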

5. High-Level Architecture

Architecture Patterns

Monolithic: Single deployable unit
- Pros: Simple, easy to develop/deploy initially
- Cons: Components cannot be scaled or deployed independently; tight coupling
Microservices: Independent services
- Pros: Scalable, independent deployment, technology diversity
- Cons: Complex, distributed system challenges
Serverless: Event-driven functions
- Pros: Auto-scaling, pay-per-use, no server management
- Cons: Cold starts, vendor lock-in, complex debugging
Event-Driven: Async message passing
- Pros: Decoupled, resilient, scalable
- Cons: Complexity, eventual consistency

Core Components

Load Balancer: Distribute traffic (nginx, HAProxy, ALB)
API Gateway: Single entry point, rate limiting, auth
Application Servers: Business logic
Cache Layer: Redis, Memcached
Message Queue: RabbitMQ, Kafka, SQS
Database: Primary and replica(s)
Object Storage: S3, GCS for files/media
CDN: CloudFront, Cloudflare for static assets
Search: Elasticsearch, Algolia
Monitoring: Prometheus, Grafana, DataDog

6. Detailed Component Design

For each major component:

Responsibility: What does it do?
APIs: How do other components interact with it?
Data: What data does it store/process?
Scale: How does it scale?
Failure Modes: What happens when it fails?

7. Scalability Strategy

Horizontal Scaling

Stateless services (can add more instances)
Load balancing across instances
Database sharding
Read replicas for databases

Vertical Scaling

Increase CPU/RAM of instances
Bounded by the largest available machine size; a single instance remains a single point of failure

Caching Strategy

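Cache-Aside: App reads from cache; on a miss it loads from the DB and populates the cache
Write-Through: Writes go to the cache and synchronously to the DB before being acknowledged
Write-Behind: Writes go to the cache and are flushed to the DB asynchronously (risk of data loss if the cache fails first)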
What to cache: Hot data, expensive queries
Cache invalidation: TTL, event-based
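
A minimal sketch of cache-aside reads with TTL-based and event-based invalidation follows. `TTLCache`, `CountingDB`, and `get_user` are illustrative names; a real deployment would typically use Redis or Memcached rather than an in-process dictionary.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """In-process cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:  # TTL-based invalidation
            del self._store[key]
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._store[key] = (self._clock() + self._ttl, value)

    def invalidate(self, key: str) -> None:
        """Event-based invalidation: call this when the DB row changes."""
        self._store.pop(key, None)


class CountingDB:
    """Stand-in for a database that counts how often it is read."""

    def __init__(self, rows: Dict[str, str]):
        self.rows = dict(rows)
        self.reads = 0

    def get(self, key: str) -> Optional[str]:
        self.reads += 1
        return self.rows.get(key)


def get_user(cache: TTLCache, db: CountingDB, key: str) -> Optional[str]:
    """Cache-aside read: try the cache, fall back to the DB, then populate."""
    value = cache.get(key)
    if value is None:
        value = db.get(key)
        if value is not None:
            cache.set(key, value)
    return value
```

The injectable `clock` exists only to make expiry testable without sleeping.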

Database Scaling

Replication: Primary-replica for read scaling (replicas may lag behind the primary)
Sharding: Partition data across databases
Denormalization: Trade storage for read performance
CQRS: Separate read and write models
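
One simple way to route a record to a shard is to hash its key and take the result modulo the shard count, as sketched below. The shard names are hypothetical. A stable hash is used on purpose: Python's built-in `hash()` is randomized per process for strings.

```python
import hashlib
from typing import Sequence

# Hypothetical shard names; in practice these would be connection targets.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]


def shard_for(key: str, shards: Sequence[str] = SHARDS) -> str:
    """Map a key to a shard using a stable hash modulo the shard count."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


for user_id in ("user:1", "user:2", "user:3"):
    print(user_id, "->", shard_for(user_id))
```

The tradeoff is that changing the number of shards remaps most keys; consistent hashing (listed under Load Distribution) reduces that movement.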

8. Reliability & Resilience

Availability

Redundancy: No single point of failure
Replication: Multiple copies of data
Health Checks: Detect and remove unhealthy instances
Auto-scaling: Handle traffic spikes

Failure Handling

Retry Logic: Exponential backoff for transient failures
Circuit Breaker: Stop calling failing services
Fallback: Degrade gracefully
Timeouts: Don't wait forever
Idempotency: Safe to retry operations
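
Two of the techniques above, retry with exponential backoff and the circuit breaker, can be sketched as follows. All names, thresholds, and delays are illustrative assumptions; in production a maintained resilience library is usually preferable to hand-rolled code.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open)
            raise
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result


def retry_with_backoff(func, max_attempts=4, base_delay=0.1, max_delay=2.0,
                       retry_on=(ConnectionError, TimeoutError),
                       sleep=time.sleep):
    """Retry transient failures; the delay cap doubles on each attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))  # jitter avoids synchronized retries
```

Retries are only safe when the wrapped operation is idempotent, which is why idempotency appears in the same list.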

Disaster Recovery

Backups: Regular automated backups
Multi-region: Deploy across regions
RPO/RTO: Recovery Point Objective (maximum tolerable data loss) and Recovery Time Objective (maximum tolerable downtime)
Runbooks: Incident response procedures

9. Security Design

Authentication: OpenID Connect (built on OAuth 2.0), JWT, SSO
Authorization: RBAC, ABAC, policies
Encryption: TLS in transit, encryption at rest
Secrets Management: Vault, AWS Secrets Manager
Network Security: VPC, security groups, WAF
Rate Limiting: Prevent abuse
Audit Logging: Track security events
Penetration Testing: Regular security assessments
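
Rate limiting is often implemented with a token bucket. The sketch below is one possible single-process version; the class name and parameters are assumptions, and a distributed system would need shared state (for example in Redis) instead of instance attributes.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_second`."""

    def __init__(self, rate_per_second: float, capacity: float,
                 clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend tokens if the request fits the budget."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that has passed, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


bucket = TokenBucket(rate_per_second=5, capacity=10)
print([bucket.allow() for _ in range(12)].count(True))
```

A rejected request would typically map to an HTTP 429 response at the API gateway.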

10. Monitoring & Observability

Metrics: Response times, error rates, throughput
Logging: Centralized logging (ELK, Splunk)
Tracing: Distributed tracing (Jaeger, Zipkin)
Alerting: PagerDuty, OpsGenie
Dashboards: Real-time system health visibility
SLOs/SLIs: Define and track service levels
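
To make the SLO/SLI item concrete, the sketch below computes an availability SLI and the share of the error budget still unspent. The function names and the request counts are illustrative assumptions.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return good_requests / total_requests


def error_budget_remaining(good_requests: int, total_requests: int,
                           slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = total_requests * (1 - slo_target)
    actual_failures = total_requests - good_requests
    if allowed_failures == 0:
        return 1.0 if actual_failures == 0 else 0.0
    return 1 - actual_failures / allowed_failures


# Assumed numbers: 1M requests in the window, 500 failures, 99.9% SLO.
sli = availability_sli(999_500, 1_000_000)
remaining = error_budget_remaining(999_500, 1_000_000, slo_target=0.999)
print(f"SLI: {sli:.4%}, error budget remaining: {remaining:.0%}")
```

With a 99.9% target, one million requests allow about 1,000 failures, so 500 failures leave roughly half the budget.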

11. Tradeoffs & Alternatives

For each major decision, document:

Why this approach?
What alternatives were considered?
What are the tradeoffs?
What could change this decision?

Example: "Chose MongoDB over PostgreSQL because..."

12. Create Design Artifacts

Generate:

Architecture Diagram: High-level component view
Sequence Diagrams: Key user flows
Data Flow Diagram: How data moves through system
Deployment Diagram: Infrastructure layout
API Specification: OpenAPI/Swagger docs

Checklist

Examples

Example 1: Design a URL Shortener

/system-design

Design a URL shortening service like bit.ly:
- 100M URLs shortened per day
- Read-heavy (100:1 read-to-write ratio)
- Low latency (<100ms)
- High availability (99.99%)
- Custom short URLs supported

Example 2: Design a Chat System

/system-design

Design a real-time chat application:
- Support 1M concurrent users
- 1-on-1 and group chats
- Message history
- Read receipts
- File sharing
- End-to-end encryption

Example 3: Design a Video Streaming Platform

/system-design

Design a video streaming service:
- 10M daily active users
- Upload and stream videos
- Different quality options (480p, 720p, 1080p, 4K)
- Recommendations based on viewing history
- Social features (likes, comments, shares)

Best Practices

Start High-Level: Begin with big picture, then drill down
Ask Clarifying Questions: Don't assume, verify requirements
Calculate Numbers: Estimate scale and capacity
Consider Tradeoffs: No perfect solution, document choices
Think About Failure: How does each component fail?
Be Pragmatic: Balance perfection with deadlines and budget
Iterate: Design is iterative, refine as you go
Document Decisions: Explain the "why" behind choices

Common Patterns & Techniques

Load Distribution

Round-robin
Least connections
Consistent hashing
Geo-based routing
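
Of the strategies above, consistent hashing is the least obvious, so here is a minimal sketch of a hash ring with virtual nodes. The class name, node names, and the choice of 100 virtual nodes per server are assumptions for illustration.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() varies per process)."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes: int = 100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (position, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each node gets many positions so load spreads more evenly.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def node_for(self, key: str) -> str:
        """Return the first node clockwise from the key's position."""
        if not self._ring:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]  # wrap around the ring


ring = HashRing(["cache-a", "cache-b", "cache-c"])
print(ring.node_for("session:1234"))
```

When a node is removed, only the keys it owned move to other nodes; keys on the remaining nodes keep their assignment.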

Data Consistency

Eventual consistency
Strong consistency
Causal consistency
Read-your-writes consistency

Communication

Synchronous (REST, gRPC)
Asynchronous (message queues)
Pub/Sub (event-driven)
WebSockets (real-time)

Caching

CDN (edge caching)
Application cache (Redis)
Database query cache
Object cache

Related Commands

/architecture-review: Review existing architecture
/refactor-strategy: Plan migration from old to new design
/capacity-planning: Deep dive on capacity estimation
/database-design: Focus on data modeling
