From rune
Test orchestration pipeline for arc Phase 7.7. Provides 4-tier testing (unit, property-based, integration, E2E/browser) with diff-scoped discovery, service startup, and structured reporting. Tier 1.5 property-based testing detects roundtrip/validator/idempotent patterns and generates invariant tests with fast-check, hypothesis, proptest, or rapid. Includes extended tier with checkpoint/resume, contract validation, visual regression, design token compliance, accessibility checks, test history persistence, regression detection, and flaky test identification. Auto-loaded by arc orchestrator during test phase. Trigger keywords: testing, test pipeline, unit test, integration test, E2E test, test discovery, test report, QA, quality assurance, scenario schema, checkpoint, fixture, visual regression, design token, accessibility, test history, regression detection, flaky test, extended tier, contract validation, property-based testing, PBT, fast-check, hypothesis, proptest, invariant.
npx claudepluginhub vinhnxv/rune --plugin rune

This skill uses the workspace's default tool permissions.
- references/accessibility-check.md
- references/batch-execution.md
- references/checkpoint-protocol.md
- references/design-token-check.md
- references/evidence-protocol.md
- references/file-route-mapping.md
- references/fixture-protocol.md
- references/flaky-detection.md
- references/history-protocol.md
- references/property-based-testing.md
- references/regression-detection.md
- references/scenario-schema.md
- references/scope-detection.md
- references/secret-scrubbing.md
- references/service-startup.md
- references/test-discovery.md
- references/test-report-template.md
- references/test-strategy-template.md
- references/testing-plan-schema.md
- references/visual-regression.md
This skill provides the knowledge base for the arc pipeline's testing phase. It is auto-loaded by the arc orchestrator and injected into test runner agents.
                 /\
               /E2E \           ← Slow, few (max 3 routes)
              /------\
             /Integr. \         ← Moderate speed, moderate count
            /----------\
           / PBT (1.5)  \       ← Fast, invariant-based (when PBT lib available)
          /--------------\
         / Unit Tests (1) \     ← Fast, many (diff-scoped)
        /------------------\
Execution order: Unit → PBT → Integration → E2E (serial by tier, parallel within tier).

Failure cascade: Tier failures are non-blocking. All enabled tiers execute regardless of prior tier results, subject to scope detection and service health.
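The serial, non-blocking tier order can be illustrated with a minimal sketch. Only the tier order and the rule that a failed tier never stops later tiers come from the text; the `Tier`, `TierResult`, and `runTiers` names are hypothetical, and the runner callback is synchronous here purely to keep the example short (real runners would be asynchronous).

```typescript
// Hypothetical names throughout; only the order and the non-blocking rule are from the text.
type Tier = "unit" | "pbt" | "integration" | "e2e";

interface TierResult {
  tier: Tier;
  status: "passed" | "failed" | "skipped";
}

const TIER_ORDER: Tier[] = ["unit", "pbt", "integration", "e2e"];

// Runs tiers one after another. A failed tier is recorded, and later tiers still run.
function runTiers(
  enabled: Set<Tier>,
  runTier: (tier: Tier) => boolean,
): TierResult[] {
  const results: TierResult[] = [];
  for (const tier of TIER_ORDER) {
    if (!enabled.has(tier)) {
      // Disabled tiers (for example by scope detection) are reported as skipped.
      results.push({ tier, status: "skipped" });
      continue;
    }
    let passed = false;
    try {
      passed = runTier(tier);
    } catch {
      passed = false; // assumption: a crashing runner counts as a failed tier
    }
    results.push({ tier, status: passed ? "passed" : "failed" });
  }
  return results;
}
```

In this sketch a unit-tier failure still leaves the integration and E2E tiers to run, which matches the non-blocking cascade described above.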
PBT tests invariants with randomly generated inputs, catching edge cases that example-based tests miss. Runs between unit (Tier 1) and integration (Tier 2) tests.
Skip conditions: No PBT library in dependencies AND no PBT-suitable patterns detected in changed code.
Discovery: Check package.json for fast-check, requirements.txt/pyproject.toml for hypothesis, Cargo.toml for proptest, go.mod for rapid. If library present, run PBT tier. If absent but patterns detected, suggest adding the library.
Timeout: 2x unit test timeout (PBT generation is CPU-intensive). Configurable via talisman.testing.tiers.pbt.timeout_multiplier (default: 2).
See property-based-testing.md for library selection, code templates, discovery protocol, and common property patterns.
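A rough sketch of the discovery and timeout rules above, assuming the manifest files have already been read into memory. The manifest-to-library mapping and the default multiplier of 2 come from the text; `decidePbtTier`, `pbtTimeoutMs`, and the input shape are hypothetical, and the substring check is a deliberate simplification of real manifest parsing.

```typescript
// Manifest -> library mapping taken from the Discovery paragraph.
const PBT_LIBRARIES: Array<{ manifests: string[]; library: string }> = [
  { manifests: ["package.json"], library: "fast-check" },
  { manifests: ["requirements.txt", "pyproject.toml"], library: "hypothesis" },
  { manifests: ["Cargo.toml"], library: "proptest" },
  { manifests: ["go.mod"], library: "rapid" },
];

type PbtDecision =
  | { action: "run"; library: string }
  | { action: "suggest" } // library absent, but PBT-suitable patterns were detected
  | { action: "skip" };   // library absent and no patterns detected

// `manifestContents` maps file name -> file text (hypothetical input shape).
function decidePbtTier(
  manifestContents: Record<string, string>,
  patternsDetected: boolean,
): PbtDecision {
  for (const { manifests, library } of PBT_LIBRARIES) {
    for (const name of manifests) {
      const text = manifestContents[name];
      // Naive substring match; a real implementation would parse each manifest format.
      if (text !== undefined && text.includes(library)) {
        return { action: "run", library };
      }
    }
  }
  return patternsDetected ? { action: "suggest" } : { action: "skip" };
}

// PBT timeout = unit timeout x multiplier (default 2, per the Timeout paragraph).
function pbtTimeoutMs(unitTimeoutMs: number, timeoutMultiplier = 2): number {
  return unitTimeoutMs * timeoutMultiplier;
}
```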
| Role | Model | Rationale |
|---|---|---|
| Test orchestration (team lead) | Opus | Complex coordination, strategy |
| Unit test runner | Sonnet | Fast execution, low complexity |
| Contract validator | Sonnet | API/schema validation, non-blocking |
| Integration test runner | Sonnet | Moderate complexity, service interaction |
| E2E browser tester | Sonnet | Browser interaction, snapshot analysis |
| Extended test runner | Sonnet | Long-running scenarios, checkpoint support |
| Failure analyst | Opus (inherit) | Root cause analysis, multi-file reasoning |
Strict enforcement: Team lead (Opus) NEVER executes test commands directly. All test execution happens via Sonnet teammates.
See scope-detection.md for the shared resolveTestScope() algorithm.
Summary:
- { files: string[], source: "pr"|"branch"|"current", label: string }
- PR diff (gh) → branch diff → current-branch diff → fallback warn
- [a-zA-Z0-9._/-]+
- /rune:test-browser standalone

See test-discovery.md for the full algorithm.
Summary:
- From resolveTestScope(), NOT from raw git diff
- (lib/, utils/, core/) → trigger full unit suite

See service-startup.md for the full protocol.
Summary:
See file-route-mapping.md for framework patterns.
See test-report-template.md for the output spec.
Test runner agents MUST echo-back their test strategy before execution:
I will verify:
AC-1 (user authentication) → unit test: test_login_flow
AC-2 (rate limiting) → integration test: test_rate_limiter
AC-5 (audit logging) → no test available (WARN: criteria not covered)
This echo-back is logged to the test strategy document and lets the orchestrator detect criteria misalignment BEFORE tests run. If a criterion has no test, it's flagged early rather than discovered post-execution.
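The echo-back can be pictured as a simple mapping from acceptance criteria to planned tests. The sketch below is illustrative only: the `Criterion` and `PlannedTest` shapes and the `buildEchoBack` function are hypothetical, and only the output line format follows the example above.

```typescript
// Hypothetical data shapes; the source does not define them.
interface Criterion {
  id: string;          // e.g. "AC-1"
  description: string; // e.g. "user authentication"
}

interface PlannedTest {
  criterionId: string;
  tier: string;     // e.g. "unit", "integration"
  testName: string; // e.g. "test_login_flow"
}

// Builds the echo-back lines and collects criteria that have no test at all.
function buildEchoBack(
  criteria: Criterion[],
  tests: PlannedTest[],
): { lines: string[]; uncovered: string[] } {
  const lines: string[] = ["I will verify:"];
  const uncovered: string[] = [];
  for (const c of criteria) {
    const matches = tests.filter((t) => t.criterionId === c.id);
    if (matches.length === 0) {
      uncovered.push(c.id);
      lines.push(`${c.id} (${c.description}) → no test available (WARN: criteria not covered)`);
      continue;
    }
    for (const t of matches) {
      lines.push(`${c.id} (${c.description}) → ${t.tier} test: ${t.testName}`);
    }
  }
  return { lines, uncovered };
}
```

A non-empty `uncovered` list is the signal an orchestrator could act on before any test runs.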
The fix loop classifies failures with discipline failure codes for pattern tracking (see failure-codes.md for full F1-F17 registry):
| F-Code | Name | Meaning | Recovery Action |
|---|---|---|---|
| F3 | PROOF_FAILURE | Implementation is wrong — code doesn't meet criterion | Fix code, re-run |
| F8 | INFRASTRUCTURE_FAILURE | Test itself is broken or infra is misconfigured | Fix test/infra, re-run |
| F17 | CONVERGENCE_STAGNATION | Same test fails same assertion across 2+ fix attempts | Escalate immediately — stop retrying |
F17 detection: When the same test fails with the same assertion message across 2+ fix attempts, the fix loop breaks immediately instead of retrying. This prevents wasting cycles on unfixable issues.
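One way to picture the F17 check is a counter keyed on test name plus assertion message. This is a sketch under stated assumptions: `FixAttempt` and `isConvergenceStagnation` are hypothetical names, and each history entry is assumed to represent one failure observed after one fix attempt.

```typescript
// Hypothetical record of a failure observed after one fix attempt.
interface FixAttempt {
  testName: string;
  assertionMessage: string;
}

// Returns true once the same test has failed with the same assertion message
// in `threshold` or more attempts (2 by default, matching "2+ fix attempts").
function isConvergenceStagnation(history: FixAttempt[], threshold = 2): boolean {
  const counts = new Map<string, number>();
  for (const attempt of history) {
    // NUL separator avoids accidental collisions between name and message.
    const key = `${attempt.testName}\u0000${attempt.assertionMessage}`;
    const seen = (counts.get(key) ?? 0) + 1;
    if (seen >= threshold) return true;
    counts.set(key, seen);
  }
  return false;
}
```

A fix loop could call this after every failed rerun and break out as soon as it returns true.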
F-code → discipline metrics: Classification feeds into discipline metrics (Shard 5 T5.1) for failure pattern tracking across pipeline runs. Patterns like "F3 on auth tests" recurring across arcs indicate systemic implementation gaps.
Implementation status: F-code classification is emitted via warn() during fix loops and recorded in convergence history (checkpoint). Structured metrics persistence to a cross-run metrics store is planned as part of the discipline metrics pipeline (Shard 5 T5.1) but is not yet implemented; current tracking is per-run via checkpoint data and echo entries.
    Test runner detects failure
      → Write structured failure to tier result file
      → Continue remaining tests in tier
      → After all tiers complete:
        → Team lead reads tier results (summary only, Glyph Budget pattern)
        → If failures detected:
          → Spawn test-failure-analyst (Opus, 3-min deadline)
          → Analyst reads: failure traces + source code + error logs
          → Analyst receives plan context via test strategy document (which includes
            planFilePath from checkpoint, enabling spec-aware root cause analysis)
          → Analyst produces: root cause + fix proposal + confidence
          → With plan context, analyst can identify "test fails because criterion
            AC-X was never implemented", not just "test fails on line Y"
          → If analyst times out: attach raw test output instead
Phase 7.7 uses sequential batched execution instead of parallel background agents. Each batch = 1 foreground agent (blocking call, zero idle risk).
- Execution order: unit batches → PBT → contract → integration → e2e → extended
- Batch sizing: TARGET_BATCH_DURATION_MS / avg_test_duration (clamped to 1-20)
- Fix loop: On failure, lead analyzes + fixes + reruns (max 2 retries)
- Checkpoint: testing-plan.json is both plan AND checkpoint (atomic writes)
- Fresh context: Stop hook re-injects per batch for unlimited context budget
See batch-execution.md for the full algorithm. See testing-plan-schema.md for the JSON schema.
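The batch-sizing rule is simple arithmetic, sketched below. The formula and the 1-20 clamp come from the text; the function name, the rounding direction (floor), and the fallback when no timing data exists are assumptions, and the source does not state the value of TARGET_BATCH_DURATION_MS.

```typescript
const MIN_BATCH_SIZE = 1;
const MAX_BATCH_SIZE = 20;

// Batch size = target batch duration / average test duration, clamped to [1, 20].
function computeBatchSize(targetBatchDurationMs: number, avgTestDurationMs: number): number {
  if (!(avgTestDurationMs > 0)) {
    // Assumption: with no usable timing data, fall back to the largest allowed batch.
    return MAX_BATCH_SIZE;
  }
  // Assumption: round down so a batch does not overshoot the target duration.
  const raw = Math.floor(targetBatchDurationMs / avgTestDurationMs);
  return Math.min(MAX_BATCH_SIZE, Math.max(MIN_BATCH_SIZE, raw));
}
```

For example, with an illustrative 60-second target, tests averaging 5 seconds give batches of 12, very fast tests are capped at 20, and a single test slower than the target still runs in a batch of 1.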
These rules are MANDATORY — not suggestions. Violation halts the pipeline.
- skipped_budget_exhausted

After all batches complete, verify:
See scenario-schema.md for the YAML test scenario format.
Summary:
- .rune/test-scenarios/*.yml
- name, tier (unit/pbt/integration/e2e/extended/contract)
- testing.scenarios.max_per_run (default 50)
- testing.scenarios.enabled (default true)

See checkpoint-protocol.md for the checkpoint/resume protocol.
Summary:
- tmp/arc/{id}/extended-checkpoint.json
- extendedResumeState
- testing.extended_tier.checkpoint_interval_ms (default 300_000ms)
- testing.extended_tier.timeout_ms (default 3_600_000ms)
- testing.extended_tier.enabled AND extended scenarios exist

See fixture-protocol.md for test data fixture execution.
Summary:
- testing.fixtures.enabled

See visual-regression.md for the visual regression protocol.
Summary:
- testing.visual_regression.baseline_dir
- agent-browser compare --baseline <path> --current <path> --format json
- testing.visual_regression.threshold (default 0.95 = 95% similarity)
- diffData.similarity < threshold (below 95% similarity)
- test-results-e2e.md (non-blocking)
- testing.visual_regression.enabled

See design-token-check.md for design token compliance checks.
Summary:
- testing.design_tokens.enabled

See accessibility-check.md for accessibility validation protocol.
Summary:
- test-results-e2e.md
- testing.accessibility.enabled

See history-protocol.md for test history persistence format.
Summary:
- .rune/test-history/test-history.jsonl (JSONL rolling window)
- testing.history.max_entries (default 50)
- testing.history.enabled (default true)

See regression-detection.md for regression signal detection.
Two complementary regression signals are evaluated in STEP 9.5. They use different config keys, different algorithms, and different data granularities — they are NOT the same check:
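The difference between the two signals can be made concrete with a small sketch that uses the formulas and defaults listed under Signal 1 and Signal 2 in this section. The function names and the `TestRun` shape are hypothetical, and the `failingNow` guard is an assumption: the per-test signal is taken to apply only to a test that is failing in the current run.

```typescript
// Signal 1: global pass-rate drop between the previous run and the current run.
function globalPassRateRegression(
  previousPassRate: number,
  currentPassRate: number,
  threshold = 0.05, // testing.history.pass_rate_drop_threshold
): { passRateDrop: number; regressed: boolean } {
  const passRateDrop = previousPassRate - currentPassRate;
  return { passRateDrop, regressed: passRateDrop > threshold };
}

// Hypothetical per-test history entry.
interface TestRun {
  passed: boolean;
}

// Signal 2: a test that mostly passed in its recent history counts as a regression.
function perTestRegression(
  failingNow: boolean,
  history: TestRun[],
  threshold = 7, // testing.history.regression_threshold
  windowSize = 10, // "out of last 10"
): boolean {
  if (!failingNow) return false; // assumption: only currently failing tests are classified
  const recentRuns = history.slice(-windowSize);
  const passCount = recentRuns.filter((run) => run.passed).length;
  return passCount >= threshold;
}
```

The first function looks at one aggregate number per run, while the second needs a run-by-run series for each individual test, which is why the two checks can disagree.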
Signal 1: Global pass-rate drop (arc-phase-test.md STEP 9.5, inline):

- testing.history.pass_rate_drop_threshold (float, 0.0–1.0, default 0.05 = 5% drop)
- passRateDrop = previousPassRate - currentPassRate; if passRateDrop > threshold → warn
- updateCheckpoint({ test_regression_detected: true, regression_pass_rate_drop: passRateDrop }) + warn

Signal 2: Per-test historical series (regression-detection.md, per-test algorithm):

- testing.history.regression_threshold (integer, default 7): minimum recent passing runs out of last 10 to classify as a regression
- passCount = recentRuns.filter(passed).length; if passCount >= threshold → regression

See flaky-detection.md for flaky test identification.
Summary:
- pass_in_some_runs AND fail_in_others
- flaky_scores map
- testing.flaky_detection.enabled (default true)

/^[a-zA-Z0-9._\-\/ ]+$/
Validates test runner commands. Blocks semicolons, pipes, backticks, $().
Applied to ALL commands parsed from project config files (package.json, pytest.ini).
/^[a-zA-Z0-9._\-\/]+$/
Validates all file paths. Rejects .. traversal. Always quote: "$file".
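Both whitelists can be applied with a few lines of code. The two regular expressions below are copied from the text; the function names are hypothetical, and the explicit `..` segment check is one possible way to implement the traversal rule, since `.` is itself an allowed character.

```typescript
// Patterns copied from the text: the command pattern allows spaces, the path pattern does not.
const SAFE_COMMAND = /^[a-zA-Z0-9._\-\/ ]+$/;
const SAFE_PATH = /^[a-zA-Z0-9._\-\/]+$/;

// Rejects anything outside the whitelist, which covers ; | ` $ ( ) and similar.
function isSafeCommand(command: string): boolean {
  return SAFE_COMMAND.test(command);
}

// Whitelist check plus an explicit rejection of ".." path segments.
function isSafePath(path: string): boolean {
  if (!SAFE_PATH.test(path)) return false;
  return !path.split("/").some((segment) => segment === "..");
}
```

Note that the command pattern as written also rejects colons, so a script name such as `test:unit` would not pass this check unchanged.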
E2E URLs MUST be scoped to localhost or the talisman.testing.tiers.e2e.base_url host.
External URLs are rejected to prevent agent-browser from navigating to untrusted sites.
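A minimal sketch of the URL scoping rule, using the standard `URL` class. The function name is hypothetical, and two details are assumptions rather than documented behavior: hosts are compared by hostname with the port ignored, and only `http:` and `https:` schemes are accepted.

```typescript
// Returns true only for URLs on localhost or on the configured base_url host.
// `baseUrl` stands in for talisman.testing.tiers.e2e.base_url and may be absent.
function isAllowedE2eUrl(candidate: string, baseUrl?: string): boolean {
  let target: URL;
  try {
    target = new URL(candidate);
  } catch {
    return false; // not an absolute, parseable URL
  }
  // Assumption: only web schemes are meaningful for browser navigation.
  if (target.protocol !== "http:" && target.protocol !== "https:") return false;

  const allowedHosts = new Set<string>(["localhost"]);
  if (baseUrl) {
    try {
      allowedHosts.add(new URL(baseUrl).hostname);
    } catch {
      // An unparseable base_url simply adds no extra host.
    }
  }
  // Exact hostname match, so "localhost.example.com" is not treated as localhost.
  return allowedHosts.has(target.hostname);
}
```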
Values matching AWS_*, *_KEY, *_SECRET, *_TOKEN, Bearer, sk-*, and ghp_*, as well as JWT tokens and emails, are redacted before agent ingestion. See secret-scrubbing.md for the regex patterns and the scrubSecrets() implementation.