Distributed Systems Skill

Purpose: Atomic skill for distributed systems patterns with formal correctness guarantees.

Skill Identity

Attribute	Value
Scope	Consensus, Replication, Partitioning
Responsibility	Single: Distributed coordination patterns
Invocation	`Skill("distributed-systems")`

Parameter Schema

Input Validation

parameters:
  distributed_context:
    type: object
    required: true
    properties:
      problem_type:
        type: string
        enum: [consensus, replication, partitioning, transaction, conflict_resolution]
        required: true
      cluster_config:
        type: object
        properties:
          nodes: { type: integer, minimum: 1, maximum: 1000 }
          failure_tolerance: { type: integer, minimum: 0 }
          regions: { type: array, items: { type: string } }
        required: [nodes]
      consistency_requirement:
        type: string
        enum: [linearizable, sequential, causal, eventual]
        default: eventual
      network_assumptions:
        type: object
        properties:
          latency_ms: { type: integer, minimum: 1 }
          partition_probability: { type: number, minimum: 0, maximum: 1 }
          byzantine: { type: boolean, default: false }

validation_rules:
  - name: "quorum_possible"
    rule: "nodes >= 2 * failure_tolerance + 1"
    error: "Cannot tolerate f failures with fewer than 2f+1 nodes"
  - name: "byzantine_nodes"
    rule: "!byzantine || nodes >= 3 * failure_tolerance + 1"
    error: "Byzantine tolerance requires 3f+1 nodes"

Output Schema

output:
  type: object
  properties:
    protocol:
      type: object
      properties:
        name: { type: string }
        description: { type: string }
        correctness_guarantee: { type: string }
    configuration:
      type: object
      properties:
        quorum_size: { type: integer }
        timeout_ms: { type: integer }
        heartbeat_ms: { type: integer }
    failure_scenarios:
      type: array
      items:
        type: object
        properties:
          scenario: { type: string }
          behavior: { type: string }
          recovery: { type: string }

Core Patterns

Consensus Algorithms

Raft (Crash Fault Tolerant):
├── Leader Election
│   ├── Term-based voting
│   ├── Randomized timeout: 150-300ms
│   └── At most one leader per term
├── Log Replication
│   ├── Append-only entries
│   ├── Commit on majority ack
│   └── Log matching property
└── Safety Invariants
    ├── Election Safety
    ├── Leader Append-Only
    ├── Log Matching
    ├── Leader Completeness
    └── State Machine Safety

Paxos (Multi-Paxos):
├── Phase 1: Prepare
│   ├── Proposer: Send prepare(n)
│   └── Acceptor: Promise or reject
├── Phase 2: Accept
│   ├── Proposer: Send accept(n, v)
│   └── Acceptor: Accept or reject
└── Phase 3: Learn
    └── Learners: Apply committed value

PBFT (Byzantine Fault Tolerant):
├── Requires: 3f + 1 nodes
├── Phases: Pre-prepare → Prepare → Commit
├── Tolerates: f malicious nodes
└── Use: Blockchain, untrusted environments

Replication Strategies

Synchronous:
├── Write waits for all replicas
├── Guarantees: Strong consistency
├── Trade-off: Higher latency
└── Formula: latency = max(replica_latencies)

Asynchronous:
├── Write returns after primary
├── Guarantees: Eventual consistency
├── Trade-off: Potential data loss
└── Risk: RPO > 0

Semi-Synchronous:
├── Write waits for at least one replica
├── Guarantees: Durability with quorum
├── Trade-off: Balance of consistency/latency
└── Common: MySQL semi-sync, PostgreSQL sync replicas

Partitioning Schemes

Consistent Hashing:
├── Virtual nodes for balance
├── Minimal redistribution on change
├── Formula: position = hash(key) mod ring_size
└── Rebalance: Only k/n keys move

Range Partitioning:
├── Ordered data locality
├── Supports range queries
├── Risk: Hot spots
└── Mitigation: Split busy ranges

Directory-Based:
├── Centralized routing table
├── Maximum flexibility
├── Trade-off: Single point of failure
└── Mitigation: Replicated directory

Retry Logic

Distributed Operation Retry

retry_config:
  consensus_operations:
    max_attempts: 10
    initial_delay_ms: 50
    max_delay_ms: 10000
    multiplier: 2.0
    jitter_factor: 0.25

  idempotency:
    required: true
    key_format: "{request_id}:{operation_type}"
    dedup_window_seconds: 300

  retry_on:
    - LEADER_NOT_FOUND
    - QUORUM_UNREACHABLE
    - TIMEOUT
    - NETWORK_PARTITION_HEALING

  abort_on:
    - INVALID_TERM
    - DUPLICATE_REQUEST
    - PERMANENT_FAILURE

Logging & Observability

Log Format

log_schema:
  level: { type: string }
  timestamp: { type: string, format: ISO8601 }
  skill: { type: string, value: "distributed-systems" }
  node_id: { type: string }
  term: { type: integer }
  event:
    type: string
    enum:
      - election_started
      - leader_elected
      - log_replicated
      - commit_applied
      - partition_detected
      - partition_healed
      - conflict_resolved
  context: { type: object }

examples:
  - level: INFO
    event: leader_elected
    context: { node_id: "node-3", term: 42 }

  - level: WARN
    event: partition_detected
    context: { isolated_nodes: ["node-2", "node-5"] }

  - level: INFO
    event: conflict_resolved
    context: { key: "user:123", strategy: "LWW" }

Metrics

metrics:
  - name: leader_elections_total
    type: counter
    labels: [result]  # success, timeout, split_vote

  - name: replication_lag_seconds
    type: gauge
    labels: [follower_id]

  - name: consensus_latency_seconds
    type: histogram
    labels: [operation_type]
    buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5]

  - name: partition_events_total
    type: counter
    labels: [type]  # detected, healed

Troubleshooting

Common Issues

Issue	Cause	Resolution
Split-brain	Network partition	Implement proper fencing
Leader flapping	Unstable network	Increase election timeout
Stale reads	Replication lag	Route to leader or read-repair
Deadlock	Lock ordering	Timeout + retry with backoff
Data divergence	Concurrent writes	Implement conflict resolution

Debug Checklist

□ Quorum size correct (n/2 + 1)?
□ Clock skew within bounds?
□ Network partition tested?
□ Fencing mechanism in place?
□ Idempotency guaranteed?
□ Recovery procedure documented?

Unit Test Templates

Consensus Tests

# test_distributed_systems.py

def test_quorum_calculation():
    assert calculate_quorum(3) == 2
    assert calculate_quorum(5) == 3
    assert calculate_quorum(7) == 4

def test_failure_tolerance_validation():
    # 5 nodes can tolerate 2 failures
    result = validate_cluster(nodes=5, failure_tolerance=2)
    assert result.valid == True

    # 5 nodes cannot tolerate 3 failures
    result = validate_cluster(nodes=5, failure_tolerance=3)
    assert result.valid == False
    assert "2f+1" in result.error

def test_byzantine_tolerance():
    # PBFT requires 3f+1 nodes
    result = validate_cluster(
        nodes=4, failure_tolerance=1, byzantine=True
    )
    assert result.valid == True  # 3*1+1 = 4

    result = validate_cluster(
        nodes=4, failure_tolerance=2, byzantine=True
    )
    assert result.valid == False  # 3*2+1 = 7 needed

def test_consistent_hashing():
    ring = ConsistentHashRing(virtual_nodes=100)
    ring.add_node("node-1")
    ring.add_node("node-2")

    key = "user:123"
    node = ring.get_node(key)
    assert node in ["node-1", "node-2"]

    # Minimal redistribution
    ring.add_node("node-3")
    # Only ~1/3 of keys should move

Replication Tests

def test_sync_replication():
    result = simulate_sync_replication(
        replicas=3,
        latencies_ms=[10, 20, 30]
    )
    assert result.total_latency_ms == 30  # max of all

def test_async_replication_lag():
    result = simulate_async_replication(
        write_rate=1000,
        apply_rate=800,
        duration_seconds=60
    )
    assert result.lag_seconds > 0
    assert result.lag_messages == 200 * 60  # 200/s gap

def test_conflict_resolution_lww():
    v1 = {"value": "A", "timestamp": 1000}
    v2 = {"value": "B", "timestamp": 1001}
    result = resolve_conflict_lww(v1, v2)
    assert result.value == "B"  # Later timestamp wins

Version History

Version	Date	Changes
2.0.0	2025-01	Production-grade rewrite with formal validation
1.0.0	2024-12	Initial release

distributed-systems

Distributed Systems Skill

Skill Identity

Parameter Schema

Input Validation

Output Schema

Core Patterns

Consensus Algorithms

Replication Strategies

Partitioning Schemes

Retry Logic

Distributed Operation Retry

Logging & Observability

Log Format

Metrics

Troubleshooting

Common Issues

Debug Checklist

Unit Test Templates

Consensus Tests

Replication Tests

Version History

Similar Skills