From harness-claude
Guides PostgreSQL ACID implementation via WAL, fsync, synchronous_commit tuning for durability-performance tradeoffs. Aids crash recovery debugging and data loss investigation.
npx claudepluginhub intense-visions/harness-engineering --plugin harness-claude

This skill uses the workspace's default tool permissions.
> The mechanisms that make ACID guarantees real: Write-Ahead Logging ensures atomicity and durability, fsync ensures persistence to physical media, and crash recovery replays the WAL to restore a consistent state.
The core invariant: changes are written to the WAL before they are written to data files. This is what makes crash recovery possible.
The write path in PostgreSQL:
1. The transaction writes WAL records describing its changes and modifies pages in shared buffers.
2. On COMMIT, the WAL records are flushed to disk (fsync on the WAL segment file).
3. COMMIT returns success to the client.
4. The modified data pages reach the data files later, via the background writer or the next checkpoint.

The data files may be out of date at any moment -- that is fine, because the WAL contains everything needed to reconstruct the current state.
Inspecting WAL state:
-- Current WAL position
SELECT pg_current_wal_lsn();
-- WAL level (minimal, replica, logical)
SHOW wal_level;
-- Current WAL segment file
SELECT pg_walfile_name(pg_current_wal_lsn());
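To quantify WAL volume, pg_wal_lsn_diff computes the number of bytes between two LSNs. A sketch of measuring generation rate -- the captured start LSN below is illustrative:

```sql
-- Capture a starting position before running the workload
SELECT pg_current_wal_lsn() AS start_lsn;  -- e.g. 0/3000060 (illustrative value)

-- Afterwards, measure how much WAL was generated since that point
SELECT pg_size_pretty(
  pg_wal_lsn_diff(pg_current_wal_lsn(), '0/3000060')
) AS wal_generated;
```

Dividing the byte count by the elapsed time gives the WAL generation rate, which feeds directly into checkpoint and max_wal_size sizing decisions.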
fsync forces the operating system to flush file data from kernel buffers to physical storage. Without fsync, a power failure can lose data that the OS reported as "written."
The synchronous_commit tradeoff:
-- Default: full durability, every COMMIT waits for WAL fsync
SET synchronous_commit = on;
-- Relaxed: COMMIT returns before WAL fsync, ~5x throughput gain
-- Risk: last few milliseconds of committed transactions may be lost on crash
SET synchronous_commit = off;
When synchronous_commit = off is acceptable: high-volume data that can be regenerated or replayed (metrics, page views, event logs), where losing the last few milliseconds of commits after a crash is tolerable.

When it is NOT acceptable: anything where a confirmed commit must survive a crash -- payments, account balances, orders, or any write the application cannot re-send.
-- Critical table: full durability (default)
BEGIN;
SET LOCAL synchronous_commit = on;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
COMMIT;
-- Non-critical table: relaxed durability for throughput
BEGIN;
SET LOCAL synchronous_commit = off;
INSERT INTO page_views (user_id, path, viewed_at)
VALUES (42, '/dashboard', NOW());
COMMIT;
SET LOCAL scopes the setting to the current transaction only -- it does not affect other connections.
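A quick way to confirm the scoping behavior, assuming the session default is on:

```sql
BEGIN;
SET LOCAL synchronous_commit = off;
SHOW synchronous_commit;   -- off, but only inside this transaction
COMMIT;
SHOW synchronous_commit;   -- back to the session default (on)
```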
When PostgreSQL starts after a crash, it replays WAL records from the last completed checkpoint: it reads the checkpoint's redo location from the control file, then re-applies every WAL record from that point to the end of the WAL, restoring all committed changes.
Inspecting checkpoint state:
# Show last checkpoint location and recovery info
pg_controldata /var/lib/postgresql/data | grep -i checkpoint
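The same information is available from SQL without shell access, via pg_control_checkpoint() (PostgreSQL 9.6+):

```sql
-- Checkpoint location, redo start point, and timestamp
SELECT checkpoint_lsn, redo_lsn, checkpoint_time
FROM pg_control_checkpoint();
```

The redo_lsn is where crash recovery begins replaying; the gap between it and pg_current_wal_lsn() approximates how much WAL a crash right now would have to replay.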
Recovery time is proportional to the amount of WAL generated since the last checkpoint. Two settings control this:
-- Maximum time between checkpoints (default: 5 minutes)
SHOW checkpoint_timeout;
-- Maximum WAL size before forcing a checkpoint (default: 1GB)
SHOW max_wal_size;
Shorter checkpoint intervals mean faster recovery but more I/O during normal operation.
A checkpoint writes all dirty buffers to data files and advances the recovery start point:
-- Manual checkpoint (rarely needed)
CHECKPOINT;
-- Enable checkpoint logging for monitoring
ALTER SYSTEM SET log_checkpoints = on;
SELECT pg_reload_conf();
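Checkpoint frequency can also be monitored from the statistics views. On versions before PostgreSQL 17 the counters live in pg_stat_bgwriter (in 17+ most of them moved to pg_stat_checkpointer):

```sql
-- Timed vs. requested checkpoints; a high share of checkpoints_req
-- suggests max_wal_size is too small for the write workload
SELECT checkpoints_timed, checkpoints_req, buffers_checkpoint
FROM pg_stat_bgwriter;
```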
Checkpoint tuning for production:
-- Spread checkpoint I/O over time (0.0-1.0, default 0.9)
SHOW checkpoint_completion_target;
A checkpoint_completion_target of 0.9 means PostgreSQL tries to complete the checkpoint I/O within 90% of the checkpoint interval, avoiding I/O spikes.
A production server crashes and takes 15 minutes to recover. Investigation:
-- Check settings
SHOW checkpoint_timeout; -- 30min (too long!)
SHOW max_wal_size; -- 8GB (too large!)
With a 30-minute checkpoint interval and 8GB of WAL, recovery must replay up to 8GB of changes. Reducing to checkpoint_timeout = 5min and max_wal_size = 1GB brings recovery time under 2 minutes at the cost of more frequent (but smaller) checkpoint I/O.
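The fix from this scenario, applied with ALTER SYSTEM; both parameters are reloadable, so no restart is required:

```sql
ALTER SYSTEM SET checkpoint_timeout = '5min';
ALTER SYSTEM SET max_wal_size = '1GB';
SELECT pg_reload_conf();  -- picks up both settings without a restart
```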
Disabling fsync for benchmarks and forgetting to re-enable it. fsync = off provides dramatic benchmark improvements but guarantees data loss on crash. Never run production with fsync = off.
Setting synchronous_commit = off globally. This should be per-transaction or per-session, not a system-wide default. Critical transactions must have durability guarantees.
Assuming cloud storage means durability is free. EBS volumes can still lose in-flight writes during instance failures. Cloud databases (RDS, Aurora) add replication on top of local fsync, but the application must still use proper transactions.
Ignoring checkpoint warnings in logs. PostgreSQL warns when checkpoints happen too often ("checkpoints are occurring too frequently"). This means max_wal_size is too small for the write workload.
After a checkpoint, the first modification to any page writes the entire page to WAL (not just the change). This protects against partial page writes -- a crash mid-write could leave a page half-old, half-new. The full page image in WAL ensures recovery can restore the complete page.
-- Default: on (do not disable in production)
SHOW full_page_writes;
Disabling full_page_writes reduces WAL volume but risks unrecoverable data corruption if a crash occurs during a partial page write.
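To see how much WAL volume full-page images contribute, pg_waldump --stats breaks records down by type, including FPI bytes; the segment path below is illustrative:

```shell
# Per-record-type WAL statistics, including full-page image (FPI) volume
pg_waldump --stats /var/lib/postgresql/data/pg_wal/000000010000000000000001
```

A large FPI share immediately after checkpoints is expected; it is the cost of the partial-page-write protection described above.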
MySQL's InnoDB uses a redo log (analogous to PostgreSQL's WAL) and a doublewrite buffer:

- Redo log (ib_logfile0, ib_logfile1): same write-ahead principle as PostgreSQL's WAL.
- Doublewrite buffer: pages are written to a doublewrite area before their final location, protecting against partial page writes -- the same purpose full_page_writes serves in PostgreSQL.

Key differences: InnoDB redo logs are fixed-size circular buffers (must be sized correctly or writes stall). PostgreSQL WAL segments are created as needed and archived or recycled.
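On the MySQL side, the corresponding settings can be inspected directly (variable names differ by version; innodb_redo_log_capacity replaced innodb_log_file_size in MySQL 8.0.30):

```sql
-- MySQL: redo log sizing and doublewrite buffer status
SHOW VARIABLES LIKE 'innodb_redo_log_capacity';  -- older versions: innodb_log_file_size
SHOW VARIABLES LIKE 'innodb_doublewrite';
```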
AWS RDS and Aurora replicate WAL to 3 Availability Zones before acknowledging a commit. This provides durability beyond what single-node fsync offers -- the data survives even if the entire AZ is lost. However, the replication adds latency (~1-3ms per commit compared to local-only fsync).
Aurora specifically decouples storage from compute: 6 copies of data across 3 AZs, with a write quorum of 4/6 and a read quorum of 3/6. Aurora can lose an entire AZ and still accept writes, and can lose an AZ plus one additional storage node without losing data (reads remain available).
A real-time analytics platform ingested 50K events/second. With default synchronous_commit = on, the database sustained 12K inserts/second -- a 4x gap. Analysis showed events were individually committed (auto-commit mode) and each waited for WAL fsync.
Two changes closed the gap: (1) batching inserts in groups of 100 within a single transaction (cutting fsync calls 100x), and (2) SET synchronous_commit = off for the analytics connection pool (acceptable because missed events could be re-sent from the message queue). Throughput reached 60K inserts/second, with at most a sub-second window of committed-but-unflushed events at risk on a crash.
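A sketch of the batching change, reusing the page_views table from earlier; one transaction amortizes a single WAL fsync across many rows:

```sql
BEGIN;
INSERT INTO page_views (user_id, path, viewed_at) VALUES
  (42, '/dashboard', NOW()),
  (43, '/reports',   NOW()),
  (44, '/settings',  NOW());  -- ...up to ~100 rows per batch
COMMIT;  -- one WAL flush for the whole batch instead of one per row
```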
The key lesson: relaxed durability (synchronous_commit = off) is applied only to non-critical data paths, never globally.