From grimoire
Diagnoses non-deterministic test failures and eliminates root causes (timing, shared state, concurrency, external dependency, resource leaks, randomness) instead of retrying or skipping.
Slash command: /grimoire:fix-flaky-test
Quarantine the flaky test, identify the non-determinism root cause from the six categories, and eliminate it — rather than retrying, skipping, or ignoring.
Adopted by: Google publishes its flaky test management methodology publicly (Google Testing Blog, 2016–2022), and its Test Infrastructure team built automated flaky test detection into Google's CI. Microsoft, Netflix, Spotify, and Meta each have internal flaky test detection and quarantine systems. The foundational empirical research is the academic study by Luo et al. (FSE 2014), which analyzed 201 flaky tests across 51 open source projects.
Impact: Google (2016): 16% of all test failures at Google are caused by flaky tests, not product bugs; flaky tests cause developers to re-run CI 40% more often than in codebases with low flakiness rates. Luo et al. (FSE 2014): 45% of flaky tests are caused by async wait issues and test order dependency, both fixable with deterministic patterns. Netflix (2019): reducing the flaky test rate from 5% to < 0.5% reduced developer "ignore and re-run" behavior by 70% and improved defect detection confidence.
Why best: Retrying flaky tests masks the non-determinism without fixing it: retry loops consume CI time and train developers to distrust test failures. Skipping flaky tests removes coverage permanently. The only productive response is elimination. Flaky tests that are not fixed accumulate: Google data shows flaky tests grow at 1.5× the rate of new tests if not actively managed.
Sources: Listfield "Where do our flaky tests come from?" (Google Testing Blog, 2017); Google "Flaky Tests at Google and How We Mitigate Them" (Google Testing Blog, 2016); Luo, Hariri, Eloussi & Marinov "An Empirical Analysis of Flaky Tests" (FSE, 2014); Micco "Flaky Tests at Google" (Google Testing Blog, 2017)
Before diagnosing, quarantine the test so it stops blocking CI without losing coverage tracking:
# Python (pytest): mark as quarantined, not skipped
import pytest

@pytest.mark.flaky(reruns=0)  # explicit zero reruns; do NOT raise this, reruns mask the problem
@pytest.mark.quarantine  # custom marker; tracks quarantined tests
def test_something_flaky():
    ...
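// Jest: move to a separate quarantine suite
// quarantine.test.js (excluded from the main CI run, e.g. via testPathIgnorePatterns; included in a separate flaky-detection job)
// No describe.skip here: a skipped test would not run in the flaky-detection job either
describe('QUARANTINED', () => {
  it('something flaky', () => { ... });
});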
// JUnit 5
@Tag("quarantine")
@Test
void somethingFlaky() { ... }
Quarantine rules:
Do NOT: add @Retry(3) / --retries / flake_tolerance. Retries mask the flakiness, increase CI time, and prevent the root cause from being found.
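One way to keep quarantined tests out of the main run while still executing them in a separate job is a small conftest.py hook. The following is a minimal sketch, assuming the custom `quarantine` marker shown above; the `--run-quarantine` option name is hypothetical and chosen only for this example.

```python
# conftest.py: a sketch, not a drop-in implementation.
# "--run-quarantine" is a hypothetical option name chosen for this example.


def pytest_addoption(parser):
    parser.addoption(
        "--run-quarantine",
        action="store_true",
        default=False,
        help="run only tests marked @pytest.mark.quarantine",
    )


def pytest_configure(config):
    # Register the custom marker so pytest does not warn about an unknown mark.
    config.addinivalue_line(
        "markers", "quarantine: known-flaky test under investigation"
    )


def pytest_collection_modifyitems(config, items):
    want_quarantined = config.getoption("--run-quarantine")
    selected, deselected = [], []
    for item in items:
        is_quarantined = item.get_closest_marker("quarantine") is not None
        if is_quarantined == want_quarantined:
            selected.append(item)
        else:
            deselected.append(item)
    if deselected:
        # Report filtered tests as "deselected" so they stay visible in the summary.
        config.hook.pytest_deselected(items=deselected)
    items[:] = selected
```

With this in place, a plain `pytest` run would execute only non-quarantined tests, and `pytest --run-quarantine` would execute only the quarantined ones in the flaky-detection job.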
Before fixing, reproduce the failure reliably:
# Run the test 50 times and count passes vs. failures
for i in $(seq 50); do pytest tests/test_foo.py::test_something -q >/dev/null 2>&1 && echo pass || echo FAIL; done | sort | uniq -c
# OR: use pytest-repeat
pytest tests/test_foo.py::test_something --count=50
If you cannot reproduce the failure in 50 runs, it may have been a one-time infrastructure flake (network timeout, resource contention). Monitor for recurrence before investing in diagnosis.
If you CAN reproduce in 50 runs, proceed to root cause identification.
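How much a clean 50-run batch tells you depends on the underlying failure rate. The sketch below works through the arithmetic under a simplifying assumption that is not from the source: each run fails independently with the same probability. The function names are illustrative.

```python
import math


def detection_probability(failure_rate: float, runs: int) -> float:
    """Chance of seeing at least one failure in `runs` independent runs."""
    return 1.0 - (1.0 - failure_rate) ** runs


def runs_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Smallest number of runs that shows a failure with the given confidence."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


print(round(detection_probability(0.05, 50), 3))  # 0.923
print(round(detection_probability(0.01, 50), 3))  # 0.395
print(runs_needed(0.01))                          # 299
```

Under that assumption, 50 runs catch a test that fails 5% of the time about nine times out of ten, but a 1% flake slips through more often than not, which is one reason to keep monitoring after a clean batch.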
Luo et al. (FSE 2014) categorize 45% of flaky tests as async/timing or order-dependency issues. Check each category:
Category 1: Async / timing dependency
Symptom: test passes locally, fails in CI; failure rate increases under load.
# Problem: fixed sleep instead of waiting for condition
time.sleep(2)
assert element.is_visible() # fails if render takes > 2s
# Fix: wait for the condition explicitly (wait.until stands in for your framework's explicit-wait helper)
wait.until(lambda: element.is_visible(), timeout=10)
// Problem: not awaiting async operation
const result = fetchData(); // returns Promise, not value
expect(result).toBe('done'); // always fails or always passes by accident
// Fix: await
const result = await fetchData();
expect(result).toBe('done');
Signs: test includes sleep(), setTimeout(), fixed delays; test passes on fast machines, fails on slow CI.
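If the test framework offers no explicit-wait helper, a small polling function can replace a fixed sleep. This is a minimal sketch using only the standard library; `wait_until` and its parameters are hypothetical names for illustration, not part of pytest or any UI driver.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.05):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value, or raises TimeoutError if the deadline is reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(interval)
```

A call such as `wait_until(lambda: element.is_visible(), timeout=10)` returns as soon as the condition holds, so the test is fast on a quick machine and still tolerant on slow CI; the timeout only bounds the worst case.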
Category 2: Test order dependency / shared state
Symptom: test passes in isolation (pytest -k test_name) but fails in the full suite.
# Diagnose: re-run the full suite with the previous run's seed to repeat the exact order (pytest-randomly)
pytest --randomly-seed=last
# Find which test contaminates state
pytest -p no:randomly # disable shuffling to check for a fixed-order failure (the seed option is unavailable once the plugin is disabled)
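Once a failing order is known, the contaminating test can be located by bisecting the tests that ran before the victim. The sketch below assumes exactly one polluter whose effect does not depend on the other tests; `find_polluter` and the `fails_after` callback are hypothetical names. In practice `fails_after` would run the given subset followed by the victim and report whether the victim failed.

```python
from typing import Callable, Sequence


def find_polluter(
    candidates: Sequence[str],
    fails_after: Callable[[Sequence[str]], bool],
) -> str:
    """Binary-search for the single test that makes the victim fail.

    `candidates` are the tests that ran before the victim, in order.
    `fails_after(subset)` must return True if the victim fails when run
    after `subset`.
    """
    if not fails_after(candidates):
        raise ValueError("victim does not fail after the full candidate list")
    lo, hi = 0, len(candidates)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails_after(candidates[lo:mid]):
            hi = mid  # polluter is in the first half
        else:
            lo = mid  # polluter must be in the second half
    return candidates[lo]
```

This needs roughly log2(n) suite runs instead of n. If several tests pollute the same state together, the single-polluter assumption breaks and the result should be treated as a starting point only.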
Root causes:
Fix patterns:
# Setup/teardown to isolate state
@pytest.fixture(autouse=True)
def reset_global_cache():
    cache.clear()
    yield
    cache.clear()
# Use transactions that roll back after each test (database tests)
@pytest.fixture
def db_session():
    session = Session()
    session.begin()
    yield session
    session.rollback()
    session.close()
Category 3: Concurrency / race condition
Symptom: test involves threads, async tasks, or parallel execution; fails intermittently with assertion errors or deadlocks.
# Problem: test reads shared state while another thread writes
def test_counter():
    counter.increment()  # thread A
    assert counter.value == 1  # may read between increment calls if B also increments
# Fix: synchronize before asserting
def test_counter():
    counter.increment()
    counter.wait_for_completion()  # barrier
    assert counter.value == 1
If the production code has a race condition that the test is exposing: fix the production code, not just the test.
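A runnable variant of the "synchronize before asserting" idea, using only the standard library. The `Counter` class here is a hypothetical stand-in for the code under test; the point of the sketch is that the test joins every worker thread before it reads the shared value, and that the production class guards its own state with a lock.

```python
import threading


class Counter:
    """Hypothetical thread-safe counter standing in for the code under test."""

    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0

    def increment(self):
        with self._lock:  # the production code protects its own shared state
            self.value += 1


def test_counter_concurrent_increments():
    counter = Counter()
    workers = [threading.Thread(target=counter.increment) for _ in range(8)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join(timeout=5)  # barrier: wait for every writer to finish
    assert not any(worker.is_alive() for worker in workers)
    assert counter.value == 8
```

The bounded `join` keeps a deadlock from hanging the suite: if a worker never finishes, the `is_alive` assertion fails with a clear signal instead.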
Category 4: External dependency / network
Symptom: test calls real network, file system, or clock; fails when service is slow or unavailable.
# Problem: real HTTP call in unit test
def test_user_creation():
    response = requests.post('https://api.service.com/users', ...)
    assert response.status_code == 201
# Fix: mock the external dependency
def test_user_creation(requests_mock):
    requests_mock.post('https://api.service.com/users', status_code=201)
    response = create_user(...)
    assert response.status_code == 201
Exception: integration tests that intentionally call real services should run in a separate suite with retry tolerance — not mixed into the unit test suite.
Category 5: Resource leak / environment pollution
Symptom: test fails only after the test suite has been running for a while; OOM errors; file descriptor exhaustion.
Diagnosis:
# Monitor resource usage during test run
pytest tests/ --tb=short 2>&1 | grep -E "ResourceWarning|MemoryError|Too many open files"
Fix: ensure all resources are closed in teardown:
import os
import pytest

@pytest.fixture
def temp_file():
    f = open('test.tmp', 'w')
    yield f
    f.close()
    os.unlink('test.tmp')  # always clean up
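Outside of pytest fixtures, the same cleanup guarantee can be written as a context manager. This is a minimal sketch using only the standard library; the `temp_file` helper is illustrative. It differs from the fixture above in two ways that are assumptions of this example: each call gets a unique path, so parallel workers do not collide on a shared file name, and cleanup sits in a `finally` block so it also runs when the body raises.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def temp_file(suffix=".tmp"):
    """Yield an open, uniquely named temporary file and always remove it."""
    # delete=False so that we control when the file is removed.
    f = tempfile.NamedTemporaryFile("w", suffix=suffix, delete=False)
    try:
        yield f
    finally:
        f.close()          # release the file descriptor
        os.unlink(f.name)  # remove the file even if the body raised
```

Inside a pytest suite, the built-in `tmp_path` fixture gives each test its own directory and is usually the simpler choice.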
Category 6: Randomness / non-deterministic data
Symptom: test uses random, uuid, datetime.now(), or shuffled collections.
# Problem: test depends on random order
items = get_items() # returns items in random order
assert items[0].name == 'Alice' # fails when order changes
# Fix: sort before asserting, or use set comparison
assert {item.name for item in items} == {'Alice', 'Bob'}
# Fix: seed random in tests
import random
random.seed(42)
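For `datetime.now()` and `random`, an alternative to global seeding is to pass the clock and the random generator in as parameters. The sketch below uses hypothetical functions (`is_expired`, `pick_reviewer`) to show the pattern; a local `random.Random(seed)` instance also avoids changing global random state for other tests.

```python
import random
from datetime import datetime, timedelta, timezone


def is_expired(expires_at, now=None):
    """Return True if `expires_at` has passed; `now` can be injected in tests."""
    current = now if now is not None else datetime.now(timezone.utc)
    return current >= expires_at


def pick_reviewer(names, rng):
    """Choose one name using the supplied random.Random instance."""
    return rng.choice(sorted(names))  # sort first so set ordering cannot matter


def test_is_expired_with_fixed_clock():
    fixed_now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    assert is_expired(fixed_now - timedelta(seconds=1), now=fixed_now)
    assert not is_expired(fixed_now + timedelta(seconds=1), now=fixed_now)


def test_pick_reviewer_is_repeatable():
    names = {"Alice", "Bob", "Carol"}
    first = pick_reviewer(names, random.Random(42))
    second = pick_reviewer(names, random.Random(42))
    assert first == second
    assert first in names
```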
After fixing:
# Run 100 times to confirm flakiness is gone
pytest tests/test_foo.py::test_something --count=100
All 100 runs must pass. If failure rate drops but doesn't reach 0%, the fix is incomplete — the root cause has multiple contributing factors.
After 100 clean runs:
test: fix flaky test_something - async timing dependency
After fixing, look for the same pattern in adjacent tests:
# Find all tests using fixed sleeps (common source of timing flakiness)
grep -rn "time.sleep\|setTimeout" tests/
# Find all tests not using db transactions (order-dependency risk)
grep -rn "def test_" tests/integration/ | grep -v "db_session"
File a follow-up ticket to address the pattern, not just the instance.
@Retry(3) as the "fix": reduces visible failures at the cost of 2–3× CI time and zero improvement in code quality; the flakiness remains and will resurface as the suite grows.
time.sleep(5) to fix a timing issue: adds a 5-second fixed delay; the test still fails when the system is slower than 5 seconds under load; use a condition wait with a timeout instead.
Fixing test_order_list for shared state contamination without checking test_order_create, test_order_update: the same fixture anti-pattern exists in 3 other tests; find and fix the pattern.
write-regression-test and bisect-regression instead