Explains CAP Theorem tradeoffs between consistency and availability during network partitions in distributed systems. Useful for database architecture, replication strategies, and multi-datacenter designs with PostgreSQL examples.
In a distributed system, when a network partition occurs, you must choose between consistency (every read returns the most recent write) and availability (every non-failing node returns a response) -- you cannot have both simultaneously.
Consistency (C): Linearizability -- every read receives the most recent write or an error. All nodes see the same data at the same time. This is NOT the same as ACID consistency (constraint satisfaction). CAP consistency is about distributed agreement on the current value.
Availability (A): Every request to a non-failing node receives a response (not an error), though it may not contain the most recent write. The system continues to operate even if some nodes cannot communicate.
Partition Tolerance (P): The system continues to operate despite arbitrary message loss or delay between nodes. Network partitions are not a choice -- they happen in every distributed system. Cables get cut, switches fail, cloud AZs lose connectivity.
Since partitions are inevitable in distributed systems, the real choice is between C and A during a partition. During normal operation (no partition), you can have all three.
Concrete scenario:
Two PostgreSQL nodes (Primary in US-East, Replica in EU-West). The network link between them fails.
PostgreSQL with synchronous replication:
```ini
# postgresql.conf on the primary
synchronous_standby_names = 'replica1'
```
With this configuration, COMMIT does not return until the replica confirms it has written the WAL to durable storage (the default synchronous_commit = on). If the replica is unreachable, writes block -- the system trades availability for consistency.
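You can confirm which standbys the primary currently treats as synchronous by querying pg_stat_replication. A minimal sketch using psycopg2 (the driver choice, hostname, and credentials are assumptions, not part of the configuration above):

```python
# Inspect replication state on the primary; sync_state is 'sync' for
# synchronous standbys and 'async' for asynchronous ones.
import psycopg2

conn = psycopg2.connect("host=primary.us-east.example.internal dbname=app user=monitor")
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT application_name, state, sync_state FROM pg_stat_replication"
    )
    for name, state, sync_state in cur.fetchall():
        print(f"{name}: state={state} sync_state={sync_state}")
```

If the replica disappears from this view during a partition, new synchronous commits hang until it returns or the configuration is changed.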
Other CP systems: etcd, ZooKeeper, Consul, Google Spanner (uses TrueTime to achieve CP with high availability through consensus).
PostgreSQL with asynchronous replication:
The default replication mode. The primary writes to WAL, sends it to replicas asynchronously, and returns COMMIT immediately. During a partition, the primary keeps writing and replicas serve increasingly stale data.
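Staleness under asynchronous replication is measurable. On the replica, pg_last_xact_replay_timestamp() reports the commit time of the last replayed transaction; a minimal sketch, again assuming psycopg2 and a placeholder hostname:

```python
# Approximate replica staleness in wall-clock time. Caveat: if the primary
# is idle, this value grows even though no data is actually missing.
import psycopg2

replica = psycopg2.connect("host=replica.eu-west.example.internal dbname=app")
with replica, replica.cursor() as cur:
    cur.execute("SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag")
    (lag,) = cur.fetchone()
    print(f"replica is approximately {lag} behind the primary")
```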
Other AP systems: Cassandra (tunable per query), DynamoDB (default mode), CouchDB, DNS.
Most production systems do not pick one side globally. Instead, they tune consistency per operation:
Operation                 Consistency      Why
───────────────────────── ──────────────── ───────────────────────────────────────
Read account balance      Strong (CP)      Financial accuracy required
Read product catalog      Eventual (AP)    Stale price for 2 seconds is acceptable
Read user profile         Eventual (AP)    Name/avatar lag is invisible
Write payment             Strong (CP)      Double-charge prevention
Write analytics event     Eventual (AP)    Losing one event is tolerable
DynamoDB makes this explicit: ConsistentRead: true routes to the leader (strong), ConsistentRead: false routes to any replica (eventual).
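A sketch of that per-request choice using boto3 (the table and key names are hypothetical):

```python
# Per-request consistency in DynamoDB: the same API, two different
# CAP positions depending on the ConsistentRead flag.
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Strong (CP-style) read: served by the leader; never stale, higher latency.
balance = dynamodb.get_item(
    TableName="accounts",
    Key={"account_id": {"S": "acct-123"}},
    ConsistentRead=True,
)

# Eventual (AP-style) read: any replica may answer; cheaper, possibly stale.
profile = dynamodb.get_item(
    TableName="profiles",
    Key={"user_id": {"S": "user-456"}},
    ConsistentRead=False,
)
```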
A SaaS application deploys PostgreSQL in US-East (primary) and EU-West (replica). EU users read from the nearby replica for low latency, accepting replication lag, while every write routes to the US-East primary.
This is an AP configuration for reads and a CP configuration for writes -- a common hybrid approach.
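In application code, the hybrid usually appears as two connection targets: strong operations route to the primary, latency-sensitive reads to the local replica. A minimal sketch assuming psycopg2 and hypothetical hostnames and schema:

```python
# CP path (primary) for writes, AP path (local replica) for reads.
import psycopg2

primary = psycopg2.connect("host=primary.us-east.example.internal dbname=app")
replica = psycopg2.connect("host=replica.eu-west.example.internal dbname=app")

def record_payment(order_id: str, amount_cents: int) -> None:
    # Must reach the primary; blocks or fails during a partition (CP).
    with primary, primary.cursor() as cur:
        cur.execute(
            "INSERT INTO payments (order_id, amount_cents) VALUES (%s, %s)",
            (order_id, amount_cents),
        )

def get_product(product_id: str):
    # Served locally even if the transatlantic link is down (AP),
    # at the cost of possibly stale data.
    with replica, replica.cursor() as cur:
        cur.execute(
            "SELECT name, price_cents FROM products WHERE id = %s",
            (product_id,),
        )
        return cur.fetchone()
```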
Using CAP to justify eventual consistency when strong consistency is achievable. If your system runs on a single PostgreSQL node, CAP does not apply. CAP is about distributed systems with network partitions between nodes.
Treating single-node PostgreSQL as a "CAP choice." A single-node database is not a distributed system. It provides strong consistency by default. CAP becomes relevant only when you add replication or distribute data.
Claiming a system is "CA" (consistent and available, not partition-tolerant). This is impossible in a network. Every real distributed system experiences partitions. A "CA" system is just one that has not been tested under partition conditions.
Using CAP as the sole criterion for database selection. CAP tells you about behavior during partitions. It says nothing about performance, query language, operational complexity, cost, or ecosystem maturity.
"Pick 2 of 3" is misleading. You always need P (partitions happen whether you want them or not). The real choice is C vs. A during partitions. During normal operation, all three are achievable.
CAP says nothing about latency. A system can be "consistent" under CAP but take 10 seconds to respond. CAP guarantees are about correctness, not performance.
CAP applies only during partitions. During normal operation, most systems provide both consistency and availability. The tradeoff is triggered only when nodes cannot communicate.
PACELC extends CAP to address behavior during normal operation: if a Partition occurs (P), the system trades Availability against Consistency (A vs. C); Else (E), during normal operation, it trades Latency against Consistency (L vs. C).
Examples: Cassandra and DynamoDB are commonly classified as PA/EL (favor availability during partitions, low latency otherwise); Spanner and VoltDB as PC/EC (favor consistency in both cases); MongoDB is often described as PA/EC.
PACELC is more useful for engineering decisions because it covers the common case (no partition) where the latency/consistency tradeoff matters most.
Martin Kleppmann's 2015 article "Please stop calling databases CP or AP" argues that CAP is too imprecise for real engineering decisions: under CAP's strict definitions (consistency means linearizability; availability means every non-failing node responds), most real systems qualify as neither CP nor AP, so the labels obscure the actual guarantees.
This critique is correct. Use CAP as a mental model for understanding the fundamental tradeoff, but describe your system's actual guarantees in concrete terms (for example, "replica reads may lag the primary by up to N seconds").
A messaging platform deployed Cassandra across 5 regions for chat history. Default consistency level was ONE (AP -- lowest latency, eventual consistency). Problem: users occasionally saw messages out of order or missed recent messages when reading from a different region than they wrote to.
Solution: Changed write consistency to LOCAL_QUORUM (majority of nodes in the local datacenter must confirm) and read consistency to LOCAL_QUORUM. This provided strong consistency within each region while maintaining availability across regions. Cross-region reads were still eventually consistent, but users rarely read chat history from a different region than they posted from.
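A sketch of what that change looks like with the DataStax Python driver (contact points, keyspace, and schema are hypothetical):

```python
# Per-query consistency in Cassandra: LOCAL_QUORUM requires a majority
# of replicas in the local datacenter, avoiding cross-region round trips.
import datetime

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["cassandra.us-east.example.internal"])
session = cluster.connect("chat")

write = SimpleStatement(
    "INSERT INTO messages (channel_id, sent_at, body) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
session.execute(write, ("general", datetime.datetime.utcnow(), "hello"))

read = SimpleStatement(
    "SELECT body FROM messages WHERE channel_id = %s LIMIT 50",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
recent = session.execute(read, ("general",))
```

Because read and write quorums within a datacenter overlap (majority + majority > replica count), a LOCAL_QUORUM read always sees the latest LOCAL_QUORUM write in the same region.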