System Design Basics Skill

Purpose: Atomic skill for system design fundamentals with validated parameters and production-ready patterns.

Skill Identity

Attribute	Value
Scope	Scalability, Reliability, Availability
Responsibility	Single: Foundational design patterns
Invocation	`Skill("system-design-basics")`

Parameter Schema

Input Validation

parameters:
  design_context:
    type: object
    required: true
    properties:
      problem:
        type: string
        minLength: 10
        maxLength: 500
        description: "Clear problem statement"
      scale:
        type: object
        properties:
          users: { type: integer, minimum: 0 }
          qps: { type: integer, minimum: 0 }
          data_size: { type: string, pattern: "^\\d+[KMGTP]B$" }
        required: [users]
      constraints:
        type: array
        items: { type: string }
        maxItems: 10

validation_rules:
  - name: "scale_consistency"
    rule: "qps <= users * 100"
    error: "QPS seems unreasonably high for user count"
  - name: "required_context"
    rule: "problem.length >= 10"
    error: "Problem statement too vague"

Output Schema

output:
  type: object
  properties:
    architecture:
      type: object
      properties:
        components: { type: array }
        data_flow: { type: string }
        trade_offs: { type: array }
    capacity:
      type: object
      properties:
        storage: { type: string }
        bandwidth: { type: string }
        compute: { type: string }
    recommendations:
      type: array
      items:
        type: object
        properties:
          area: { type: string }
          suggestion: { type: string }
          priority: { type: string, enum: [high, medium, low] }

Core Patterns

Scalability Patterns

Horizontal Scaling:
├── Stateless Services
│   ├── No session affinity
│   ├── Externalize state (Redis, DB)
│   └── Container-ready design
├── Load Balancing
│   ├── Round-robin (simple)
│   ├── Least connections (dynamic)
│   ├── IP hash (session affinity)
│   └── Weighted (capacity-aware)
├── Database Scaling
│   ├── Read replicas
│   ├── Sharding (horizontal)
│   └── Caching layer
└── Queue-Based Scaling
    ├── Decouple producers/consumers
    ├── Handle load spikes
    └── Enable async processing

Vertical Scaling:
├── CPU optimization
├── Memory increase
├── SSD storage
└── Network bandwidth

Reliability Patterns

Redundancy:
├── Active-passive failover
├── Active-active (multi-master)
├── N+1 redundancy
└── Geographic distribution

Fault Tolerance:
├── Circuit breaker
│   └── Prevent cascade failures
├── Retry with backoff
│   └── Transient error recovery
├── Bulkhead isolation
│   └── Failure containment
├── Timeout enforcement
│   └── Resource protection
└── Graceful degradation
    └── Partial functionality

Availability Calculations

Availability Formula:
├── Availability = MTBF / (MTBF + MTTR)
├── MTBF: Mean Time Between Failures
└── MTTR: Mean Time To Repair

Availability Tiers:
├── 99% (two 9s) = 3.65 days/year downtime
├── 99.9% (three 9s) = 8.76 hours/year
├── 99.99% (four 9s) = 52.56 minutes/year
├── 99.999% (five 9s) = 5.26 minutes/year
└── 99.9999% (six 9s) = 31.5 seconds/year

Serial vs Parallel:
├── Serial: A_total = A1 × A2 × A3
│   └── 99.9% × 99.9% × 99.9% = 99.7%
└── Parallel: A_total = 1 - (1-A1) × (1-A2)
    └── 1 - (0.001 × 0.001) = 99.9999%

Retry Logic

Exponential Backoff Configuration

retry_config:
  max_attempts: 5
  initial_delay_ms: 100
  max_delay_ms: 30000
  multiplier: 2.0
  jitter_factor: 0.2

  retry_on:
    - TIMEOUT
    - RATE_LIMITED
    - SERVICE_UNAVAILABLE
    - NETWORK_ERROR

  abort_on:
    - VALIDATION_ERROR
    - NOT_FOUND
    - UNAUTHORIZED

implementation:
  delay = min(initial_delay * (multiplier ^ attempt), max_delay)
  jittered_delay = delay * (1 + random(-jitter_factor, jitter_factor))

Logging & Observability

Log Format

log_schema:
  level: { type: string, enum: [DEBUG, INFO, WARN, ERROR] }
  timestamp: { type: string, format: ISO8601 }
  skill: { type: string, value: "system-design-basics" }
  correlation_id: { type: string }
  event:
    type: string
    enum:
      - skill_invoked
      - parameter_validated
      - pattern_applied
      - calculation_performed
      - skill_completed
      - error_occurred
  context: { type: object }
  duration_ms: { type: number }

example:
  level: INFO
  timestamp: "2025-01-01T00:00:00.000Z"
  skill: "system-design-basics"
  correlation_id: "abc123"
  event: "pattern_applied"
  context:
    pattern: "horizontal_scaling"
    target: "api_servers"
  duration_ms: 45

Metrics

metrics:
  - name: skill_invocation_count
    type: counter
    labels: [status, pattern_type]

  - name: skill_duration_seconds
    type: histogram
    buckets: [0.1, 0.5, 1, 2, 5]

  - name: parameter_validation_failures
    type: counter
    labels: [validation_rule]

  - name: pattern_usage
    type: counter
    labels: [pattern_name]

Troubleshooting

Common Issues

Issue	Cause	Resolution
Vague output	Unclear input	Ask for specific scale/constraints
Over-engineered	Premature optimization	Apply YAGNI, start simple
Under-scaled	Missing growth projections	Request 5-year growth estimate
Inconsistent trade-offs	Missing context	Clarify priority (cost/perf/availability)

Debug Checklist

□ Input parameters validated?
□ Scale requirements clear?
□ Constraints explicitly stated?
□ Trade-offs documented?
□ Capacity estimates calculated?
□ Failure modes considered?

Unit Test Templates

Parameter Validation Tests

# test_system_design_basics.py

def test_valid_parameters():
    params = {
        "design_context": {
            "problem": "Design URL shortener for 100M users",
            "scale": {"users": 100000000, "qps": 10000},
            "constraints": ["low latency", "high availability"]
        }
    }
    result = validate_parameters(params)
    assert result.valid == True

def test_invalid_problem_too_short():
    params = {
        "design_context": {
            "problem": "Short",
            "scale": {"users": 1000}
        }
    }
    result = validate_parameters(params)
    assert result.valid == False
    assert "minLength" in result.errors[0]

def test_qps_scale_consistency():
    params = {
        "design_context": {
            "problem": "High QPS test case",
            "scale": {"users": 100, "qps": 1000000}
        }
    }
    result = validate_parameters(params)
    assert result.warnings[0] == "QPS seems unreasonably high"

def test_capacity_calculation():
    result = calculate_capacity(
        users=1000000,
        data_per_user="1KB",
        growth_rate=0.1
    )
    assert result.storage == "1.1GB"  # With 10% growth buffer

Pattern Application Tests

def test_horizontal_scaling_pattern():
    context = {"users": 1000000, "stateless": True}
    result = apply_pattern("horizontal_scaling", context)
    assert "load_balancer" in result.components
    assert "auto_scaling_group" in result.components

def test_availability_calculation():
    components = [0.999, 0.999, 0.999]  # 99.9% each
    serial = calculate_serial_availability(components)
    assert abs(serial - 0.997) < 0.001  # ~99.7%

    parallel = calculate_parallel_availability([0.999, 0.999])
    assert abs(parallel - 0.999999) < 0.000001  # ~99.9999%

Version History

Version	Date	Changes
2.0.0	2025-01	Production-grade rewrite with validation schemas
1.0.0	2024-12	Initial release

system-design-basics

System Design Basics Skill

Skill Identity

Parameter Schema

Input Validation

Output Schema

Core Patterns

Scalability Patterns

Reliability Patterns

Availability Calculations

Retry Logic

Exponential Backoff Configuration

Logging & Observability

Log Format

Metrics

Troubleshooting

Common Issues

Debug Checklist

Unit Test Templates

Parameter Validation Tests

Pattern Application Tests

Version History

Similar Skills