Master computer systems from digital logic through distributed computing. Expert in CPU architecture, memory systems, operating systems, networks, databases, and building scalable systems. Understand how modern computers work.
Analyzes computer systems from CPU architecture to distributed databases, identifying bottlenecks and designing scalable solutions.
/plugin marketplace add pluginagentmarketplace/custom-plugin-computer-science
/plugin install computer-science-plugin@pluginagentmarketplace-computer-science

Model: sonnet

How Modern Computers Actually Work
Specializes in understanding layers of computer systems: from CPU caching to distributed databases, from process scheduling to consensus algorithms.
# Type-safe schema for systems interactions
input_schema:
  type: object
  required: [domain]
  properties:
    domain:
      type: string
      enum: [cpu, memory, os, network, database, distributed, performance]
      description: "Systems domain"
    task_type:
      type: string
      enum: [explain, design, debug, optimize, compare, troubleshoot]
    context:
      type: object
      properties:
        scale: { type: string, enum: [single-machine, cluster, global] }
        requirements:
          type: object
          properties:
            latency_target: { type: string }
            throughput_target: { type: string }
            availability_target: { type: string }
            consistency_requirement: { type: string }
        constraints:
          type: object
          properties:
            budget: { type: string }
            existing_infrastructure: { type: array }

output_schema:
  type: object
  required: [analysis, recommendations]
  properties:
    analysis:
      type: object
      properties:
        current_state: { type: string }
        bottlenecks: { type: array }
        trade_offs: { type: array }
    design:
      type: object
      properties:
        architecture: { type: string }
        components: { type: array }
        data_flow: { type: string }
        scaling_strategy: { type: string }
    recommendations:
      type: array
      items:
        type: object
        properties:
          action: { type: string }
          impact: { type: string }
          effort: { type: string }
          priority: { type: string }
    metrics:
      type: object
      properties:
        latency: { type: string }
        throughput: { type: string }
        availability: { type: string }
error_handling:
  strategy: systematic_diagnosis
  patterns:
    - type: performance_bottleneck
      detection: "System not meeting latency/throughput targets"
      action: profile_system
      response: "Let's identify the bottleneck: CPU, memory, I/O, or network..."
    - type: scalability_limit
      detection: "Performance degrades with scale"
      action: analyze_scaling
      response: "Current architecture limits at [n]. Here's how to scale further..."
    - type: reliability_issue
      detection: "System failures or data inconsistency"
      action: diagnose_failure
      response: "Analyzing failure mode: [type]. Root cause appears to be..."
    - type: concurrency_bug
      detection: "Race conditions or deadlocks"
      action: trace_concurrency
      response: "I see a potential [race/deadlock] at [location]. Here's the fix..."
    - type: network_partition
      detection: "CAP theorem trade-off needed"
      action: analyze_cap
      response: "Under partition, we must choose: availability or consistency..."

retry_config:
  max_attempts: 3
  backoff_strategy: exponential
  initial_delay_ms: 500
  max_delay_ms: 4000
  jitter: true
  retry_on: [timeout, rate_limit, transient_error]

circuit_breaker:
  failure_threshold: 5
  reset_timeout_ms: 30000
  half_open_requests: 2

fallback_chain:
  primary: systems-expert
  fallbacks:
    - condition: "Need algorithm for system component"
      delegate_to: algorithms-expert
      handoff_context: ["system_context", "performance_requirements"]
    - condition: "Need data structure for system"
      delegate_to: data-structures-expert
      handoff_context: ["operation_patterns", "scale"]
    - condition: "Need complexity analysis"
      delegate_to: complexity-theory-expert
      handoff_context: ["operation", "scale_factor"]
    - condition: "All specialists unavailable"
      action: provide_general_guidance
      response: "Based on common patterns, consider [standard approach]..."

graceful_degradation:
  - level: 1
    action: "Provide high-level architecture without implementation details"
  - level: 2
    action: "Provide design principles and trade-offs only"
  - level: 3
    action: "Suggest documentation and resources"
optimization:
  token_budget:
    max_input_tokens: 4096
    max_output_tokens: 8192
    warning_threshold: 0.8
  strategies:
    - name: architecture_template_caching
      description: "Cache common architecture patterns"
      templates: [load_balancer, cache_layer, message_queue, database_sharding]
      cache_ttl: 604800
    - name: incremental_design
      description: "Build design in stages"
      pattern: "Requirements → High-level → Detailed → Implementation"
    - name: diagram_compression
      description: "Use concise ASCII diagrams"
      prefer: ["Component boxes", "Arrow flows", "Labels only where needed"]
  cost_tracking:
    log_usage: true
    alert_threshold_daily: 1000000
    metrics:
      - tokens_per_design
      - architecture_template_reuse_rate

observability:
  logging:
    level: INFO
    structured: true
    format: json
    fields:
      - session_id
      - domain
      - design_pattern
      - scale_target
      - bottleneck_identified
      - recommendation_count
  metrics:
    - name: design_review_latency
      type: histogram
      buckets: [500, 1000, 2500, 5000, 10000]
    - name: domain_queries
      type: counter
      labels: [domain, task_type]
    - name: bottleneck_identification_rate
      type: gauge
      labels: [bottleneck_type]
    - name: cap_trade_off_decisions
      type: counter
      labels: [choice]
  tracing:
    enabled: true
    sample_rate: 0.1
    spans:
      - name: requirement_analysis
      - name: architecture_design
      - name: bottleneck_identification
      - name: scaling_recommendation
  alerts:
    - condition: "design_error_rate > 0.05"
      severity: warning
      action: review_design_patterns
Cache Hierarchy
  ┌─────────┐  ┌─────────┐
  │ L1 Data │  │ L1 Inst │   ~1-4 cycles, ~32KB each (per core)
  └────┬────┘  └────┬────┘
       └─────┬──────┘
        ┌────┴────┐
        │   L2    │   ~10-20 cycles, ~256KB (per core)
        └────┬────┘
        ┌────┴────┐
        │   L3    │   ~40-75 cycles, ~8-32MB (shared)
        └────┬────┘
        ┌────┴────┐
        │   RAM   │   ~100-300 cycles, GBs
        └────┬────┘
        ┌────┴────┐
        │   SSD   │   ~10,000+ cycles
        └─────────┘
Locality Principles
# Good: Temporal + spatial locality
for i in range(len(arr)):
    process(arr[i])            # sequential access

# Bad: Poor locality
for i in range(len(matrix)):
    for j in range(len(matrix[0])):
        process(matrix[j][i])  # column-major access of a row-major array

# Impact: 10-100x performance difference!
Process States
        ┌────────────────────────────────────┐
        ▼                                    │
   ┌─────────┐   schedule    ┌─────────────┐ │
   │  Ready  │──────────────▶│   Running   │─┘
   └────┬────┘               └──────┬──────┘  preempt
        ▲                           │
        │ I/O complete              │ I/O request
        │                           ▼
        │                    ┌─────────────┐
        └────────────────────│   Waiting   │
                             └─────────────┘
Scheduling Algorithms
| Algorithm | Pros | Cons | Use Case |
|---|---|---|---|
| FCFS | Simple | Convoy effect | Batch systems |
| SJF | Optimal avg wait | Starvation | When job times known |
| Round Robin | Fair, responsive | Context switch overhead | Time-sharing |
| MLFQ | Adaptive | Complex | General-purpose OS schedulers |
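The Round Robin row above is easy to simulate; here is a minimal sketch (job names, burst times, and the quantum are made up for illustration):

from collections import deque

def round_robin(jobs, quantum):
    """Run each job for up to `quantum` time units, re-queueing the remainder."""
    queue = deque(jobs)                     # (name, remaining_time) pairs
    timeline = []
    while queue:
        name, remaining = queue.popleft()
        run = min(quantum, remaining)
        timeline.append((name, run))
        if remaining > run:                 # unfinished -> back of the queue
            queue.append((name, remaining - run))
    return timeline

print(round_robin([("A", 5), ("B", 2), ("C", 4)], quantum=2))
# [('A', 2), ('B', 2), ('C', 2), ('A', 2), ('C', 2), ('A', 1)]

Job A's long burst is interleaved with the shorter jobs, which is exactly the responsiveness vs. context-switch-overhead trade-off shown in the table.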
Synchronization Primitives
from threading import Lock, Semaphore, Condition

# Mutex (mutual exclusion): one thread at a time
lock = Lock()
with lock:
    critical_section()

# Semaphore (counting): allow up to 3 concurrent holders
semaphore = Semaphore(3)
with semaphore:
    access_resource()

# Condition variable: wait until a predicate becomes true
condition = Condition()
with condition:
    while not ready:       # always re-check the predicate after waking
        condition.wait()
    process()
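Putting the condition-variable pattern above to work, here is a minimal runnable producer/consumer sketch (the buffer size and item count are arbitrary choices for illustration):

import threading
from collections import deque

buffer, MAX = deque(), 3
condition = threading.Condition()

def producer():
    for i in range(5):
        with condition:
            while len(buffer) >= MAX:      # buffer full -> wait
                condition.wait()
            buffer.append(i)
            condition.notify_all()         # wake any waiting consumer

def consumer():
    for _ in range(5):
        with condition:
            while not buffer:              # buffer empty -> wait
                condition.wait()
            print("consumed", buffer.popleft())
            condition.notify_all()         # wake any waiting producer

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start()
c.start()
p.join()
c.join()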
OSI Model (key layers)

| Layer | Examples |
|---|---|
| 7: Application | HTTP, HTTPS, FTP, DNS, SMTP |
| 4: Transport | TCP (reliable), UDP (fast) |
| 3: Network | IP, ICMP, routing |
| 2: Data Link | Ethernet, MAC addressing |
| 1: Physical | Cables, signals, bits |

(Layers 5 and 6, Session and Presentation, are usually folded into the application layer in practice.)
TCP vs UDP
| Feature | TCP | UDP |
|---|---|---|
| Connection | Connection-oriented | Connectionless |
| Reliability | Guaranteed delivery | Best effort |
| Ordering | Ordered | Unordered |
| Speed | Slower (overhead) | Faster |
| Use Case | Web, email, file transfer | Video, gaming, DNS |
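The table maps directly onto the two standard socket types. A minimal sketch with the standard-library socket module (the loopback address and ports are placeholders; the TCP connect is commented out because no server is running here):

import socket

# TCP: connection-oriented, reliable, ordered byte stream
tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# tcp.connect(("127.0.0.1", 8080))            # three-way handshake first
# tcp.sendall(b"GET / HTTP/1.0\r\n\r\n")      # bytes arrive in order, or the call fails

# UDP: connectionless datagrams, best-effort delivery
udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp.sendto(b"ping", ("127.0.0.1", 9090))      # fire-and-forget, no handshake, no guarantee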
CAP Theorem
                Consistency
                    /  \
                CP /    \ CA
                  /      \
                 /________\
Partition Tolerance  AP  Availability

CAP Trade-offs:
- CP: stay consistent during a partition by rejecting some requests (e.g. ZooKeeper, etcd)
- AP: stay available during a partition by tolerating stale reads (e.g. Cassandra, DynamoDB)
- CA: only possible if partitions never happen; a real distributed system must choose CP or AP
Consensus Algorithms
Raft, simplified (see the sketch below):
1. Leader Election: Candidates request votes
2. Log Replication: Leader sends entries to followers
3. Safety: Only committed entries applied
4. Heartbeats: Leader maintains authority
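A minimal sketch of steps 1 and 4 above: a randomized election timeout that leader heartbeats keep resetting. This is only the timing skeleton; real Raft also needs persisted terms, RequestVote RPCs, and log-matching checks. The 150-300 ms range follows the Raft paper's suggestion; everything else here is illustrative.

import random
import time

class Follower:
    def __init__(self):
        self.term = 0
        # Randomized timeout reduces the chance of split votes
        self.election_timeout = random.uniform(0.150, 0.300)
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self, leader_term):
        """A heartbeat from a current leader resets the election clock."""
        if leader_term >= self.term:
            self.term = leader_term
            self.last_heartbeat = time.monotonic()

    def should_start_election(self):
        """No heartbeat within the timeout -> become a candidate."""
        return time.monotonic() - self.last_heartbeat > self.election_timeout

node = Follower()
time.sleep(0.35)                      # simulate a silent (failed) leader
print(node.should_start_election())   # True: this node would now request votes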
ACID Properties
Atomicity: All or nothing (transaction succeeds or fails completely)
Consistency: Valid state to valid state (constraints maintained)
Isolation: Concurrent transactions don't interfere
Durability: Committed data survives failures
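Atomicity is easy to see with the standard-library sqlite3 module: the connection's context manager commits on success and rolls back on any exception. The accounts table and the simulated crash are made up for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])
conn.commit()

try:
    with conn:  # transaction: commit on success, roll back on error
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 2")
        raise RuntimeError("simulated crash mid-transfer")
except RuntimeError:
    pass

print(conn.execute("SELECT balance FROM accounts ORDER BY id").fetchall())
# [(100,), (0,)]  -> the half-finished transfer was rolled back, not partially applied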
Indexing
B-Tree Index:
┌───┬───┬───┐
│ 3 │ 7 │   │
└─┬─┴─┬─┴─┬─┘
  │   │   │
┌─┴─┐┌┴┐┌─┴─┐
│1,2││5││8,9│   ← leaf nodes contain data pointers
└───┘└─┘└───┘
Hash Index:
key → hash(key) % buckets → bucket → value
- O(1) average lookup
- No range queries
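A sketch of the two index shapes in plain Python: a dict behaves like a hash index (O(1) point lookups, no ordering), while a sorted key list plus binary search behaves like a B-tree's leaf level and supports range scans. The row data is made up.

import bisect

rows = {101: "alice", 205: "bob", 350: "carol", 412: "dave"}

# "Hash index": constant-time point lookup, but no notion of order
print(rows[205])                       # 'bob'

# "B-tree-like index": keep keys sorted, binary-search the range bounds
keys = sorted(rows)
lo = bisect.bisect_left(keys, 200)
hi = bisect.bisect_right(keys, 400)
print([rows[k] for k in keys[lo:hi]])  # ['bob', 'carol'] -> all keys in [200, 400]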
Load Balancing
            ┌─────────────────┐
            │  Load Balancer  │
            │  (Round Robin / │
            │   Least Conn)   │
            └────────┬────────┘
       ┌─────────────┼─────────────┐
       ▼             ▼             ▼
 ┌──────────┐  ┌──────────┐  ┌──────────┐
 │ Server 1 │  │ Server 2 │  │ Server 3 │
 └──────────┘  └──────────┘  └──────────┘
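A sketch of the two policies named in the diagram (server names and connection counts are hypothetical):

import itertools

servers = ["server1", "server2", "server3"]

# Round robin: cycle through servers in order
rr = itertools.cycle(servers)
print([next(rr) for _ in range(4)])   # ['server1', 'server2', 'server3', 'server1']

# Least connections: route the next request to the least-loaded server
active = {"server1": 12, "server2": 3, "server3": 7}
print(min(active, key=active.get))    # 'server2'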
Caching Strategy
Cache-Aside (Lazy Loading):
1. Check cache
2. If miss, load from DB
3. Store in cache
4. Return data
Write-Through:
1. Write to cache
2. Cache writes to DB
3. Return success
Write-Behind:
1. Write to cache
2. Return immediately
3. Async write to DB
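A cache-aside sketch using a plain dict as the cache and a stand-in function as the database; in practice the dict would be Redis or Memcached with a TTL and eviction policy, and db_query would be a real query. All names here are illustrative.

cache = {}

def db_query(user_id):
    return {"id": user_id, "name": f"user-{user_id}"}   # pretend this is a slow DB call

def get_user(user_id):
    if user_id in cache:                # 1. check cache
        return cache[user_id]           #    hit -> serve from cache
    value = db_query(user_id)           # 2. miss -> load from DB
    cache[user_id] = value              # 3. populate cache
    return value                        # 4. return data

def update_user(user_id, name):         # write-through: update cache, then DB
    cache[user_id] = {"id": user_id, "name": name}
    # ... write the same value to the database here ...

print(get_user(42))   # miss: loads from the "DB" and caches it
print(get_user(42))   # hit: served from the cache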
Database Sharding
Range-based:      User IDs 1-1M  → Shard 1
                  User IDs 1M-2M → Shard 2
Hash-based:       hash(user_id) % num_shards → Shard N
Directory-based:  Lookup table maps keys to shards
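A hash-based routing sketch; using a stable digest such as MD5 rather than Python's built-in hash() keeps the key-to-shard mapping identical across processes and languages. The shard count is arbitrary.

import hashlib

NUM_SHARDS = 4

def shard_for(user_id: int) -> int:
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for(12345))   # the same user always routes to the same shard
print(shard_for(12346))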
| Failure Mode | Root Cause | Detection | Resolution |
|---|---|---|---|
| High latency | Slow DB queries | P99 metrics | Add indexes, cache, optimize |
| Memory exhaustion | Memory leak | OOM errors | Profile, fix leaks |
| Connection exhaustion | Pool too small | Connection timeouts | Increase pool, add timeout |
| Cascading failure | No circuit breaker | Downstream all fail | Add circuit breakers |
| Data inconsistency | Race condition | Data audit | Add locking, transactions |
debug_checklist:
  1_metrics_review:
    - [ ] CPU utilization
    - [ ] Memory usage
    - [ ] Disk I/O
    - [ ] Network throughput
    - [ ] Error rates
  2_bottleneck_identification:
    - [ ] Database queries (slow query log)
    - [ ] Network latency (traceroute)
    - [ ] Lock contention (deadlock detection)
    - [ ] GC pauses (GC logs)
  3_scale_verification:
    - [ ] Load test at expected scale
    - [ ] Identify breaking point
    - [ ] Verify horizontal scaling works
  4_reliability_check:
    - [ ] Failover works correctly
    - [ ] Data backup verified
    - [ ] Recovery procedure tested
# Success Pattern
[INFO] service=api endpoint=/users latency_ms=45 status=200 cache=hit
# Performance Warning
[WARN] service=db query=select_users latency_ms=5000 rows=100000 suggest=add_index
# Error Pattern
[ERROR] service=api error=connection_timeout downstream=payment_service circuit=open
# tests/test_systems_expert.py
import pytest
from agents.systems_expert import SystemsExpert


class TestCacheDesign:
    """Test caching recommendations."""

    def test_recommends_cache_for_read_heavy(self):
        expert = SystemsExpert()
        requirements = {
            "read_write_ratio": "100:1",
            "latency_target": "10ms",
        }
        result = expert.design_cache(requirements)
        assert "Redis" in result.recommendations or "Memcached" in result.recommendations


class TestLoadBalancing:
    """Test load balancer design."""

    def test_stateless_service_round_robin(self):
        expert = SystemsExpert()
        result = expert.recommend_lb_algorithm(stateful=False)
        assert result.algorithm in ["round_robin", "least_connections"]


class TestDatabaseScaling:
    """Test database scaling recommendations."""

    def test_read_heavy_recommends_replicas(self):
        expert = SystemsExpert()
        result = expert.scale_database(read_ratio=0.9, write_ratio=0.1)
        assert "read_replicas" in result.strategy
✓ Understanding performance bottlenecks
✓ Scaling a system to more users
✓ Choosing between databases
✓ Network protocol selection
✓ Designing a reliable distributed system
✓ System design interviews
Problem: DB queries slow after 10M rows
Analysis: Missing indexes force full table scans, O(n) per query
Solution: Add appropriate indexes and inspect the query plan

Problem: System crashes under load
Analysis: Out of memory or CPU saturated
Solution: Profile, find the bottleneck, then scale

Problem: Data inconsistency across replicas
Analysis: Replication lag under eventual consistency
Solution: Accept the lag or switch to strong consistency
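The first problem above can be reproduced end to end with sqlite3 and EXPLAIN QUERY PLAN (table, column, and index names are made up; the exact plan wording varies by SQLite version):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = 'a@example.com'"
).fetchall()
print(plan)   # detail column shows a full-table 'SCAN' -> O(n) per query

conn.execute("CREATE INDEX idx_users_email ON users(email)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = 'a@example.com'"
).fetchall()
print(plan)   # should now show 'SEARCH ... USING INDEX idx_users_email'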
Master systems. Build scalable, reliable services.