Production-grade distributed systems skill for consensus protocols, replication, partitioning, and consistency models
Designs production-grade distributed systems with formal correctness guarantees. Triggers when you need consensus protocols, replication strategies, or partitioning schemes for fault-tolerant architectures.
/plugin marketplace add pluginagentmarketplace/custom-plugin-system-design/plugin install custom-plugin-system-design@pluginagentmarketplace-system-designThis skill inherits all available tools. When active, it can use any tool Claude has access to.
assets/config.yamlassets/schema.jsonreferences/GUIDE.mdreferences/PATTERNS.mdscripts/validate.pyPurpose: Atomic skill for distributed systems patterns with formal correctness guarantees.
| Attribute | Value |
|---|---|
| Scope | Consensus, Replication, Partitioning |
| Responsibility | Single: Distributed coordination patterns |
| Invocation | Skill("distributed-systems") |
parameters:
distributed_context:
type: object
required: true
properties:
problem_type:
type: string
enum: [consensus, replication, partitioning, transaction, conflict_resolution]
required: true
cluster_config:
type: object
properties:
nodes: { type: integer, minimum: 1, maximum: 1000 }
failure_tolerance: { type: integer, minimum: 0 }
regions: { type: array, items: { type: string } }
required: [nodes]
consistency_requirement:
type: string
enum: [linearizable, sequential, causal, eventual]
default: eventual
network_assumptions:
type: object
properties:
latency_ms: { type: integer, minimum: 1 }
partition_probability: { type: number, minimum: 0, maximum: 1 }
byzantine: { type: boolean, default: false }
validation_rules:
- name: "quorum_possible"
rule: "nodes >= 2 * failure_tolerance + 1"
error: "Cannot tolerate f failures with fewer than 2f+1 nodes"
- name: "byzantine_nodes"
rule: "!byzantine || nodes >= 3 * failure_tolerance + 1"
error: "Byzantine tolerance requires 3f+1 nodes"
output:
type: object
properties:
protocol:
type: object
properties:
name: { type: string }
description: { type: string }
correctness_guarantee: { type: string }
configuration:
type: object
properties:
quorum_size: { type: integer }
timeout_ms: { type: integer }
heartbeat_ms: { type: integer }
failure_scenarios:
type: array
items:
type: object
properties:
scenario: { type: string }
behavior: { type: string }
recovery: { type: string }
Raft (Crash Fault Tolerant):
├── Leader Election
│ ├── Term-based voting
│ ├── Randomized timeout: 150-300ms
│ └── At most one leader per term
├── Log Replication
│ ├── Append-only entries
│ ├── Commit on majority ack
│ └── Log matching property
└── Safety Invariants
├── Election Safety
├── Leader Append-Only
├── Log Matching
├── Leader Completeness
└── State Machine Safety
Paxos (Multi-Paxos):
├── Phase 1: Prepare
│ ├── Proposer: Send prepare(n)
│ └── Acceptor: Promise or reject
├── Phase 2: Accept
│ ├── Proposer: Send accept(n, v)
│ └── Acceptor: Accept or reject
└── Phase 3: Learn
└── Learners: Apply committed value
PBFT (Byzantine Fault Tolerant):
├── Requires: 3f + 1 nodes
├── Phases: Pre-prepare → Prepare → Commit
├── Tolerates: f malicious nodes
└── Use: Blockchain, untrusted environments
Synchronous:
├── Write waits for all replicas
├── Guarantees: Strong consistency
├── Trade-off: Higher latency
└── Formula: latency = max(replica_latencies)
Asynchronous:
├── Write returns after primary
├── Guarantees: Eventual consistency
├── Trade-off: Potential data loss
└── Risk: RPO > 0
Semi-Synchronous:
├── Write waits for at least one replica
├── Guarantees: Durability with quorum
├── Trade-off: Balance of consistency/latency
└── Common: MySQL semi-sync, PostgreSQL sync replicas
Consistent Hashing:
├── Virtual nodes for balance
├── Minimal redistribution on change
├── Formula: position = hash(key) mod ring_size
└── Rebalance: Only k/n keys move
Range Partitioning:
├── Ordered data locality
├── Supports range queries
├── Risk: Hot spots
└── Mitigation: Split busy ranges
Directory-Based:
├── Centralized routing table
├── Maximum flexibility
├── Trade-off: Single point of failure
└── Mitigation: Replicated directory
retry_config:
consensus_operations:
max_attempts: 10
initial_delay_ms: 50
max_delay_ms: 10000
multiplier: 2.0
jitter_factor: 0.25
idempotency:
required: true
key_format: "{request_id}:{operation_type}"
dedup_window_seconds: 300
retry_on:
- LEADER_NOT_FOUND
- QUORUM_UNREACHABLE
- TIMEOUT
- NETWORK_PARTITION_HEALING
abort_on:
- INVALID_TERM
- DUPLICATE_REQUEST
- PERMANENT_FAILURE
log_schema:
level: { type: string }
timestamp: { type: string, format: ISO8601 }
skill: { type: string, value: "distributed-systems" }
node_id: { type: string }
term: { type: integer }
event:
type: string
enum:
- election_started
- leader_elected
- log_replicated
- commit_applied
- partition_detected
- partition_healed
- conflict_resolved
context: { type: object }
examples:
- level: INFO
event: leader_elected
context: { node_id: "node-3", term: 42 }
- level: WARN
event: partition_detected
context: { isolated_nodes: ["node-2", "node-5"] }
- level: INFO
event: conflict_resolved
context: { key: "user:123", strategy: "LWW" }
metrics:
- name: leader_elections_total
type: counter
labels: [result] # success, timeout, split_vote
- name: replication_lag_seconds
type: gauge
labels: [follower_id]
- name: consensus_latency_seconds
type: histogram
labels: [operation_type]
buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5]
- name: partition_events_total
type: counter
labels: [type] # detected, healed
| Issue | Cause | Resolution |
|---|---|---|
| Split-brain | Network partition | Implement proper fencing |
| Leader flapping | Unstable network | Increase election timeout |
| Stale reads | Replication lag | Route to leader or read-repair |
| Deadlock | Lock ordering | Timeout + retry with backoff |
| Data divergence | Concurrent writes | Implement conflict resolution |
□ Quorum size correct (n/2 + 1)?
□ Clock skew within bounds?
□ Network partition tested?
□ Fencing mechanism in place?
□ Idempotency guaranteed?
□ Recovery procedure documented?
# test_distributed_systems.py
def test_quorum_calculation():
assert calculate_quorum(3) == 2
assert calculate_quorum(5) == 3
assert calculate_quorum(7) == 4
def test_failure_tolerance_validation():
# 5 nodes can tolerate 2 failures
result = validate_cluster(nodes=5, failure_tolerance=2)
assert result.valid == True
# 5 nodes cannot tolerate 3 failures
result = validate_cluster(nodes=5, failure_tolerance=3)
assert result.valid == False
assert "2f+1" in result.error
def test_byzantine_tolerance():
# PBFT requires 3f+1 nodes
result = validate_cluster(
nodes=4, failure_tolerance=1, byzantine=True
)
assert result.valid == True # 3*1+1 = 4
result = validate_cluster(
nodes=4, failure_tolerance=2, byzantine=True
)
assert result.valid == False # 3*2+1 = 7 needed
def test_consistent_hashing():
ring = ConsistentHashRing(virtual_nodes=100)
ring.add_node("node-1")
ring.add_node("node-2")
key = "user:123"
node = ring.get_node(key)
assert node in ["node-1", "node-2"]
# Minimal redistribution
ring.add_node("node-3")
# Only ~1/3 of keys should move
def test_sync_replication():
result = simulate_sync_replication(
replicas=3,
latencies_ms=[10, 20, 30]
)
assert result.total_latency_ms == 30 # max of all
def test_async_replication_lag():
result = simulate_async_replication(
write_rate=1000,
apply_rate=800,
duration_seconds=60
)
assert result.lag_seconds > 0
assert result.lag_messages == 200 * 60 # 200/s gap
def test_conflict_resolution_lww():
v1 = {"value": "A", "timestamp": 1000}
v2 = {"value": "B", "timestamp": 1001}
result = resolve_conflict_lww(v1, v2)
assert result.value == "B" # Later timestamp wins
| Version | Date | Changes |
|---|---|---|
| 2.0.0 | 2025-01 | Production-grade rewrite with formal validation |
| 1.0.0 | 2024-12 | Initial release |
This skill should be used when the user asks to "create a slash command", "add a command", "write a custom command", "define command arguments", "use command frontmatter", "organize commands", "create command with file references", "interactive command", "use AskUserQuestion in command", or needs guidance on slash command structure, YAML frontmatter fields, dynamic arguments, bash execution in commands, user interaction patterns, or command development best practices for Claude Code.
This skill should be used when the user asks to "create an agent", "add an agent", "write a subagent", "agent frontmatter", "when to use description", "agent examples", "agent tools", "agent colors", "autonomous agent", or needs guidance on agent structure, system prompts, triggering conditions, or agent development best practices for Claude Code plugins.
This skill should be used when the user asks to "create a hook", "add a PreToolUse/PostToolUse/Stop hook", "validate tool use", "implement prompt-based hooks", "use ${CLAUDE_PLUGIN_ROOT}", "set up event-driven automation", "block dangerous commands", or mentions hook events (PreToolUse, PostToolUse, Stop, SubagentStop, SessionStart, SessionEnd, UserPromptSubmit, PreCompact, Notification). Provides comprehensive guidance for creating and implementing Claude Code plugin hooks with focus on advanced prompt-based hooks API.