Skill

nw-formal-verification-tlaplus

TLA+ and PlusCal for specifying distributed system invariants. Decision heuristics for when formal verification adds value, key patterns, state explosion management, and alternatives comparison.

From nw
Install
1
Run in your terminal
$
npx claudepluginhub nwave-ai/nwave --plugin nw
Tool Access

This skill uses the workspace's default tool permissions.

Skill Content

Formal Verification with TLA+

When to Recommend Formal Verification

Decision Tree

Is the system distributed or concurrent?
|
+-- No --> Complex state machine with high failure cost?
|          +-- No --> NOT cost-effective. Use property-based testing.
|          +-- Yes --> CONSIDER TLA+
|
+-- Yes --> Consensus, coordination, or distributed transactions?
|           +-- Yes --> RECOMMEND TLA+
|           +-- No --> Could concurrency bug cause data loss or safety issues?
|                      +-- Yes --> RECOMMEND TLA+
|                      +-- No --> OFFER as option

Strong Indicators (Recommend)

DomainWhy TLA+ Adds ValueEvidence
Distributed consensus (Paxos, Raft)Subtle interleaving bugs in leader electionRaft TLA+ spec ~400 lines, found implementation bugs
Financial distributed transactionsAtomicity violations cause monetary lossAWS DynamoDB replication verified
Leader election, distributed lockingSplit-brain, deadlock, stale-lockAWS lock manager verified
Eventual consistency / CRDTsConvergence proofs requiredTLA+ CRDT framework verifies SEC
Safety-critical state machinesRegulatory requirementsDO-178C, CENELEC recognize formal methods
Multi-party coordination (sagas, 2PC)Compensation ordering, partial failure2PC is canonical TLA+ example
Data replication protocolsOrdering, consistency under failureElasticsearch, MongoDB, Cosmos DB verified

When NOT to Use

  • Simple CRUD (bugs are in implementation, not design)
  • Single-process without complex state machines
  • Prototypes/MVPs (design will change before verification completes)
  • Performance optimization (TLA+ models correctness, not performance)

Cost-Benefit Reference

  • Learning curve: 2-3 weeks to useful results (AWS engineers, all levels)
  • Typical spec effort: 2-4 weeks part-time for a distributed protocol
  • ROI highest when: bug cost is high, system is long-lived, protocol is novel, concurrency testing is impractical

Core Concepts for Architects

What TLA+ Specifies

TLA+ describes what a system should do (allowed behaviors), not how to implement it. Specifications are mathematical objects checked for correctness before any code exists.

Safety vs. Liveness

Property TypeMeaningExpressionExample
SafetyNothing bad happensInvariant: predicate true in every reachable state"Two processes never hold same lock"
LivenessSomething good eventually happensTemporal: <> (eventually), []<> (infinitely often)"Every request eventually gets response"

Safety violations produce counterexample traces (the debugging artifact). Liveness requires fairness conditions.

PlusCal vs. Raw TLA+

PlusCal compiles to TLA+ with programming-like syntax. Start with PlusCal for first 2-3 specs, then learn raw TLA+ for cases PlusCal cannot express.

Key PlusCal constructs: variables (state) | labels (atomic action boundaries) | either/or (nondeterministic choice) | await (blocking) | process \in 1..N (concurrent processes) | fair process (weak fairness)

Labels define concurrency granularity: everything between two labels is one atomic step. Two processes interleave only at label boundaries.

State Explosion Management

State space grows exponentially: (states per node)^(nodes) x (message permutations).

Containment Strategies

StrategyTechniqueImpact
Bound parametersStart with 2-3 nodes, 2-4 messagesMost bugs appear at small N
Symmetry reductionSYMMETRY Permutations(Nodes)Up to N! reduction
Reduce labelsMerge labels where fine-grained atomicity unnecessaryOrders of magnitude
State constraintsCONSTRAINT Len(log[n]) < MaxLogLengthPrune uninteresting states
AbstractionModel protocol not implementation (TCP -> message set)Dramatic reduction
DecompositionMultiple focused specs, not one monolithEach independently checkable
Progressive refinement2 nodes -> 3 nodes -> add failures -> add livenessIncremental verification
Simulation modejava -jar tla2tools.jar -simulate -depth 100Trades completeness for speed

Memory and Time Budgets

Unique StatesExpected TimeMemoryApproach
< 10KSeconds< 1 GBExhaustive, single thread
10K - 1MMinutes1-4 GBExhaustive, -workers auto
1M - 100MHours4-32 GBExhaustive with constraints
100M - 1BDays32-64 GBLarge instance or simulation
> 1BWeeks60+ GBSimulation, TLAPS, or decompose

Estimation Before Running

  1. Count distinct variable values in model
  2. Multiply domains together for baseline
  3. Start TLC with smallest parameters, observe state count
  4. Extrapolate: doubling a parameter typically squares or cubes the space

Key Specification Patterns

Two-Phase Commit (2PC)

  • Variables: rmState, tmState, tmPrepared, msgs
  • Safety: no RM commits while another aborts (Consistency)
  • State space: 3 RMs ~718 states, 5 RMs ~21,488 states
  • Common mistake: not modeling RM spontaneous abort or unreliable network

Distributed Consensus (Raft)

  • Variables: currentTerm, votedFor, log, state, votesGranted, msgs
  • Safety: at most one leader per term (ElectionSafety)
  • Safety: logs with same index+term are identical (LogMatching)
  • State space: 3 nodes, MaxTerm=2 ~10K-100K states

Saga (Compensating Transactions)

  • Variables: stepState, sagaState, compensateIdx
  • Safety: steps execute in order, compensations in reverse (OrderInvariant)
  • Safety: no completed steps remain after abort (CompensationComplete)
  • Common mistake: not enforcing reverse compensation order

Distributed Lock with Lease

  • Variables: lockHolder, leaseExpiry, clock, nodeState
  • Safety: at most one holder (MutualExclusion)
  • Models crash (node loses awareness) and lease expiry
  • Common mistake: not distinguishing server-side lock state from node belief

CRDT Convergence (G-Counter)

  • Variables: counters (vector per node)
  • Safety: counters monotonically non-decreasing
  • Liveness: all nodes eventually converge after merge
  • Common mistake: merge not commutative, associative, and idempotent

Alternatives Comparison

ToolBest ForLearningDistributed SystemsTemporal Properties
TLA+/PlusCalDistributed protocols, consensus2-3 weeksExcellentNative
AlloyData models, structural properties1-2 weeksAdequateLimited (Alloy 6)
Property-Based TestingImplementation correctnessHours-daysWith stateful testingNone
XState/StatechartsUI workflows, single-processDaysNot applicableNone
Session Types/ScribbleCommunication patternsModerateGood (message patterns)Implicit
TLAPS (proofs)Critical certificationMonthsExcellentFull

Combined Workflow (TLA+ + PBT)

  1. Write TLA+ spec during DESIGN wave; identify invariants
  2. Model-check with TLC to verify design
  3. Implement code during DELIVER wave
  4. Reuse TLA+ invariants as PBT properties
  5. PBT verifies implementation conforms to verified design

Architecture Decision Record Template

## Decision: Use TLA+ for [Component Name]

### Context
[Component] implements [protocol] with [N] participants
and [concurrency/distribution characteristics].

### Problem
Informal reasoning about [failure/interleaving scenario]
is insufficient because [reason].

### Decision
Formally specify [component] in TLA+/PlusCal and verify:
- Safety: [specific invariants]
- Liveness: [specific temporal properties]

### Model Parameters
- Nodes: [2-3 for initial verification]
- Messages: [bounded to N]
- Failure modes: [crash, partition, message loss]

### Estimated Effort
- Specification: [1-2 weeks]
- Model checking: [hours to days]

### Not Modeling (out of scope)
- [performance, serialization, UI, etc.]

Architect's Checklist

  1. Identify components with concurrency, distribution, or complex state machines
  2. Determine safety properties (what must NEVER happen)
  3. Determine liveness properties (what must EVENTUALLY happen)
  4. Estimate model parameters and state space
  5. Assess cost-effectiveness vs. alternatives (decision tree above)
  6. Document decision in ADR with specific invariants and properties
  7. Scope verification: focused specs per subsystem, not one monolith

Industry Precedent

OrganizationSystems VerifiedOutcome
AWS14 projects across 10 systems (DynamoDB, S3, EBS)Found subtle bugs in every system; management actively encourages adoption
Azure Cosmos DBAll 5 consistency levelsSpecs became authoritative reference, replaced ambiguous docs
MongoDBReplication, reconfiguration, transactionsLogless reconfig deployed since 4.4, no protocol bugs
ElasticsearchCluster coordination, data replication4 TLA+ specs + Isabelle proofs, open-sourced
CockroachDBTransaction layerTLA+ specs in repository under docs/tla-plus/
Stats
Parent Repo Stars299
Parent Repo Forks37
Last CommitMar 20, 2026