```bash
npx claudepluginhub suriyel/longtaskforagent --plugin long-task
```

This skill uses the workspace's default tool permissions.
Run cross-feature and system-wide testing after all features are implemented and passing. Per-feature ST test cases (functional, boundary, UI, security) have already been executed during each Worker cycle via `long-task-feature-st`. This phase focuses on what per-feature testing **cannot** cover: cross-feature interactions, multi-feature E2E workflows, system-wide NFR verification, compatibility, and exploratory testing.
Announce at start: "I'm using the long-task-st skill. All features are passing — time for cross-feature system testing."
Core principle: Per-feature ST test cases prove individual features meet their requirements. System testing proves the whole works together across feature boundaries.
Do NOT skip any applicable test category. A "Go" verdict requires evidence from EVERY category that applies to this project. "It probably works" is not evidence.

Manual test escalation: When any test scenario in Steps 3-8 cannot be automated (requires a physical device, human visual judgment, or external human action), use `AskUserQuestion` to present the test to the human and collect their verdict. Format:
```
Manual Verification Required: {test description}
Category: {ST step — e.g., Integration, E2E, NFR, Compatibility}
What to verify: {specific check with acceptance criteria}
Expected result: {threshold or criterion from SRS}
Please report: PASS or FAIL on first line, then what you observed.
To skip: respond SKIP {reason}
```
Accepted verdicts are PASS, FAIL, or SKIP. If the response is unparseable, re-prompt once; if it is still unparseable, record the test as BLOCKED. Record PASS as MANUAL-PASS and FAIL as MANUAL-FAIL in the ST report. MANUAL-FAIL has the same severity implications as an automated failure: it blocks the Go verdict for Critical/Major items. SKIP {reason} is recorded as BLOCKED with the reason (not silently skipped).
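A minimal sketch of that verdict handling, assuming a hypothetical string response collected via `AskUserQuestion`:

```python
def parse_manual_verdict(response: str) -> tuple[str, str]:
    """Map a human reply to MANUAL-PASS / MANUAL-FAIL / BLOCKED.

    Expects PASS, FAIL, or SKIP {reason} on the first line, per the
    prompt format above. Returns (verdict, detail).
    """
    first, _, rest = response.strip().partition("\n")
    token = first.strip().upper()
    if token == "PASS":
        return "MANUAL-PASS", rest.strip()
    if token == "FAIL":
        return "MANUAL-FAIL", rest.strip()
    if token.startswith("SKIP"):
        reason = first.strip()[4:].strip() or "no reason given"
        return "BLOCKED", f"skipped: {reason}"
    # Unparseable: the caller re-prompts once, then records BLOCKED.
    return "UNPARSEABLE", first
```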
You MUST create a TodoWrite task for each step and complete them in order:
```bash
python scripts/check_st_readiness.py feature-list.json
```
"status": "passing" — if any failing, invoke long-task:long-task-work insteaddocs/plans/*-srs.md); Design document exists (docs/plans/*-design.md)long-task-guide.md; if the project uses a file-based config, ensure it is sourced before running the checksenv-guide.md (skip if CLI/library only)
Start services using the "Start All Services" section of `env-guide.md`; run each command with output redirected:

```bash
[start command] > /tmp/svc-<slug>-start.log 2>&1 &
sleep 3
head -30 /tmp/svc-<slug>-start.log
```
- Verify each service's health check from `env-guide.md`; it must respond before proceeding
- If a service fails to start, fix it and update `env-guide.md` (Services table + Start command); if the fix needs >2 steps, extract it to `scripts/svc-<slug>-start.sh` and reference that from `env-guide.md`
- Record started-service PIDs in `task-progress.md` (required for Step 11 cleanup)

Read the inputs:
- `feature-list.json`: note tech_stack, quality_gates, constraints[], assumptions[]
- UCD if the project has UI features (`docs/plans/*-ucd.md`)
- `task-progress.md`: session history context

Create `docs/plans/YYYY-MM-DD-st-plan.md` with:
| Category | Applies When | Skip When |
|---|---|---|
| Regression | Always | Never |
| Integration | 2+ features with shared data/state/APIs | Single isolated feature |
| E2E Scenarios | SRS has multi-step user workflows | Pure library/utility projects |
| Performance | SRS has NFR-xxx with response time / throughput targets | No performance NFRs |
| Security | Security NFRs OR project handles user input / auth / external data | Isolated offline tools |
| Compatibility | SRS specifies platform / browser / runtime targets | Single-platform CLI tools |
| Exploratory | Always | Never |
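Category selection reduces to a simple rule check; a sketch under the assumption that the project traits are already known (the parameter names are illustrative, not part of this skill):

```python
def applicable_categories(n_features: int, shared_boundaries: bool,
                          multi_step_workflows: bool, perf_nfrs: bool,
                          handles_untrusted_input: bool,
                          platform_targets: int) -> list[str]:
    """Derive applicable ST categories per the table above."""
    cats = ["Regression", "Exploratory"]  # always apply, never skipped
    if n_features >= 2 and shared_boundaries:
        cats.append("Integration")
    if multi_step_workflows:
        cats.append("E2E Scenarios")
    if perf_nfrs:
        cats.append("Performance")
    if handles_untrusted_input:
        cats.append("Security")
    if platform_targets > 1:  # single-platform CLI tools skip this
        cats.append("Compatibility")
    return cats
```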
Map EVERY SRS requirement to an ST test approach. Reference the per-feature test case documents from Worker Step 9:
| Req ID | Requirement | Feature ST Status | System ST Category | ATS Categories | Test Approach | Priority |
|--------|-------------|-------------------|--------------------|----------------|---------------|----------|
| FR-001 | ... | docs/test-cases/feature-1-xxx.md (PASS) | E2E | FUNC,BNDRY,SEC | Scenario: ... | High |
| NFR-001 | ... | docs/test-cases/feature-5-xxx.md (PASS) | Performance | PERF | Load test: ... | Critical |
| IFR-001 | ... | N/A (cross-feature) | Integration | FUNC,BNDRY | Contract test: ... | High |
Every FR-xxx, NFR-xxx, IFR-xxx must appear in the RTM. A requirement without a test approach is a coverage gap (see the sketch below).
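A sketch of that gap check, assuming requirement IDs follow the FR-/NFR-/IFR-NNN pattern used above:

```python
import re
from pathlib import Path

REQ_ID = re.compile(r"\b(?:FR|NFR|IFR)-\d{3}\b")

def rtm_gaps(srs_path: str, st_plan_path: str) -> set[str]:
    """Requirement IDs present in the SRS but missing from the RTM."""
    srs_ids = set(REQ_ID.findall(Path(srs_path).read_text()))
    rtm_ids = set(REQ_ID.findall(Path(st_plan_path).read_text()))
    return srs_ids - rtm_ids  # any ID left in the result is a gap
```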
If any requirements require manual verification (from the ATS automation-feasibility (自动化可行性) column or Feature-ST manual cases), include a Manual column in the RTM:
| Req ID | ... | Manual | Test Approach | Priority |
|--------|-----|--------|---------------|----------|
| FR-010 | ... | Yes: visual-judgment | Manual visual verification via AskUserQuestion | High |
ATS compliance gate (if ATS document exists):
```bash
python scripts/check_ats_coverage.py docs/plans/*-ats.md --feature-list feature-list.json --strict
```
Must exit 0. Any ATS category gap = finding to resolve before proceeding.
Entry (must ALL be true): all features passing, environment provisioned, all required configs present.
Exit (must ALL be true for Go verdict): all regression/integration/E2E tests pass, all NFR thresholds met with measured evidence, no Critical or Major defects open, RTM shows 100% requirement coverage, ATS category compliance verified (if ATS exists: check_ats_coverage.py --strict exits 0).
Run full project test suite using commands from long-task-guide.md
Verify ALL tests pass — zero failures, zero errors
Verify line and branch coverage thresholds met project-wide
Check for new warnings, deprecation notices, dependency conflicts
Any failure → STOP — this is a regression. Diagnose before proceeding.
Record: total tests, passed/failed, line/branch coverage vs thresholds.
Run full-codebase mutation testing. Per-feature mutation during Worker cycles may have only scoped feature tests (when active features > mutation_full_threshold); this step verifies mutation score holds project-wide with the full test suite.
- Run the `mutation_full` command from `long-task-guide.md`
- Compare the score against `quality_gates.mutation_score_min` from `feature-list.json`
- See `references/st-recipes.md` section "Full Mutation Regression" for per-tool commands and result interpretation
Record: mutation score vs threshold, surviving mutant count, tool output summary.
Test cross-feature interactions. Read references/st-recipes.md for language-specific patterns and real-vs-contract test classification.
Terminology (see st-recipes.md §1 for details): a Real test exercises the actual dependency (real database, real service, real API); a Contract test verifies only the expected request/response shape against a stub.
For external third-party boundaries: if test credentials are available in required_configs or environment, write real integration tests. Otherwise, use contract tests and record the reason in the Mock Authorization column.
For each pair of features sharing data, state, or API boundaries:
- Follow the `dependencies[]` graph in `feature-list.json`; verify features work in dependency order; test each dependency edge

Classification table (include in ST plan):
| Boundary | Features | Type | Real Tests | Contract Tests | Mock Authorization | Status |
|---|---|---|---|---|---|---|
| shared DB | F1 → F3 | Internal | 2 | 1 | N/A | PASS |
| REST API | F2 → F4 | Internal | 1 | 0 | N/A | PASS |
| GitHub API | F5 → ext | External | 1 | 0 | N/A (user provided token) | PASS |
| Stripe API | F7 → ext | External | 0 | 2 | User confirmed no sandbox | PASS |
Minimum per internal boundary:
External boundary protocol:
- Check `required_configs` and the environment for test credentials / a sandbox environment

Write integration tests in `tests/integration/` or `tests/st/`. Tag each test:
```python
# Integration: Feature A → Feature B (shared DB) [Real]
def test_feature_a_data_consumed_by_feature_b():
    # Write through Feature A's interface, read through Feature B's;
    # assert the record round-trips via the real shared database.
    ...

# Contract: Feature C → External API [Contract]
def test_external_api_response_shape():
    # Assert the stubbed response carries every field Feature C consumes.
    ...
```
Run and record results per boundary.
Before E2E scenario testing, verify at least one complete data flow path through the entire system. This catches integration bugs that per-boundary tests miss.
At least ONE smoke test must exercise a real end-to-end data path (input → processing → storage → retrieval → output) using only real services. No mocks.

Scaling:
| Project Size | Smoke Tests |
|---|---|
| Tiny (1-5 features) | 1 critical path |
| Small (5-15) | 1-2 critical paths |
| Medium (15-50) | 2-3 critical paths |
| Large (50+) | 3-5 critical paths covering major subsystems |
Record: smoke test description, real services used, pass/fail, execution evidence.
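For illustration, a smoke test for a hypothetical notes service that walks the full real path (the endpoint, port, and payload are assumptions, not part of this skill):

```python
# tests/st/test_smoke_critical_path.py (hypothetical example)
# Input -> processing -> storage -> retrieval -> output, real services only.
import requests

BASE = "http://localhost:8000"  # assumed URL; take the real one from env-guide.md

def test_note_roundtrip_through_real_stack():
    # Input: create via the public API (exercises processing + storage)
    created = requests.post(f"{BASE}/notes", json={"text": "smoke"}, timeout=10)
    assert created.status_code == 201
    note_id = created.json()["id"]

    # Retrieval + output: read it back through the same public surface
    fetched = requests.get(f"{BASE}/notes/{note_id}", timeout=10)
    assert fetched.status_code == 200
    assert fetched.json()["text"] == "smoke"
```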
Test complete user workflows that span multiple features from SRS acceptance criteria. Single-feature scenarios are already covered by per-feature ST test cases.
For each user persona in SRS Stakeholders:
For each scenario: set up initial state, execute step-by-step, verify intermediate states AND final outcome, clean up.
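A minimal pytest sketch of that shape; the `api` client fixture and the two-feature workflow are hypothetical:

```python
# tests/e2e/test_signup_to_report.py (hypothetical two-feature workflow)
import pytest

@pytest.fixture
def fresh_account(api):  # 'api' stands in for an assumed project client fixture
    account = api.create_account("e2e-user")  # set up initial state
    yield account
    api.delete_account(account.id)            # clean up

def test_signup_then_generate_report(api, fresh_account):
    upload = api.upload_data(fresh_account.id, rows=[{"x": 1}])
    assert upload.accepted                    # verify intermediate state
    report = api.generate_report(fresh_account.id)
    assert report.status == "complete"        # verify final outcome
```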
UI E2E Testing (only if "ui": true features exist): Use Chrome DevTools MCP for browser-based E2E verification.
Write E2E tests in tests/e2e/ or tests/st/. Run and record results.
Per-feature NFR checks were handled in feature-level ST. This step focuses on system-wide aggregate NFR measurement. For each NFR-xxx in SRS, verify with measured evidence — not estimates.
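As one example, a sketch measuring p95 latency against an assumed response-time NFR (the endpoint and the 300 ms figure are illustrative):

```python
# Hypothetical NFR check: p95 latency of GET /search under 300 ms.
import statistics
import time
import requests

def measure_p95_ms(url: str, samples: int = 100) -> float:
    latencies = []
    for _ in range(samples):
        start = time.perf_counter()
        requests.get(url, timeout=10)
        latencies.append((time.perf_counter() - start) * 1000)
    return statistics.quantiles(latencies, n=20)[18]  # 95th percentile

def test_nfr_search_latency():
    p95 = measure_p95_ms("http://localhost:8000/search?q=smoke")
    assert p95 < 300, f"measured p95 {p95:.0f} ms exceeds the 300 ms threshold"
```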
See `references/st-recipes.md` for benchmarking tools. Record: measured value vs SRS threshold.

Compatibility testing: skip if SRS does not specify platform/browser/runtime targets.
Record: per-platform/browser PASS/FAIL matrix.
Charter-based, time-boxed sessions to find issues that scripted tests miss.
Create one charter per major feature area:
```
Charter: Explore [feature area]
with [technique: stress/edge/abuse/workflow variation]
to discover [bugs/usability issues/undocumented behavior]
```
For each charter: time-box 15-30 minutes; follow intuition — try unexpected inputs, unusual sequences, rapid interactions; log observations in real-time (Bug / Question / Note with severity).
If an exploratory charter identifies issues requiring physical device access or human visual judgment beyond what Chrome DevTools MCP can verify: use AskUserQuestion to collect the tester's findings with the charter context.
After all charters: consolidate findings; cross-reference with RTM for requirement gaps; add new defects to triage queue.
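To keep the real-time log consolidatable, one option is a structured record per observation (a sketch; the field names are a suggestion, not part of this skill):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Finding:
    charter: str            # e.g., "Explore import flow with stress inputs"
    kind: str               # "Bug" | "Question" | "Note"
    severity: str           # Critical / Major / Minor / Cosmetic (bugs only)
    observation: str
    req_id: Optional[str] = None  # RTM cross-reference, if known

def triage_queue(findings: list[Finding]) -> list[Finding]:
    """Bugs only, worst first, ready for the defect triage step."""
    order = {"Critical": 0, "Major": 1, "Minor": 2, "Cosmetic": 3}
    bugs = [f for f in findings if f.kind == "Bug"]
    return sorted(bugs, key=lambda f: order.get(f.severity, 4))
```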
If ANY defects were found in Steps 3-8:
| Severity | Definition | Action |
|---|---|---|
| Critical | System crash, data loss, security breach | BLOCK release — fix immediately |
| Major | Core workflow broken, NFR threshold failed | BLOCK release — fix before release |
| Minor | Non-core affected, workaround exists | Document — fix now or defer (decide with user) |
| Cosmetic | Visual/text issue, no functional impact | Document — defer to next release |
Escape analysis — for each defect, classify where it should have been caught:
| Escaped From | Meaning | Systemic Action |
|---|---|---|
| Unit | TDD should have caught this | Add unit test; review coverage for similar gaps |
| Feature-ST | Per-feature acceptance testing gap | Add test case via increment skill |
| Mock-Leaked | Mock test passed but real integration fails | Replace mock with real integration test |
| Integration | Cross-feature boundary not tested | Add integration test for boundary |
| Spec | Requirement ambiguous or missing | Clarify SRS via increment skill |
Include the "Escaped From" column in the defect table:
| # | Severity | Escaped From | Category | Description | Status | Fix Ref |
|---|---|---|---|---|---|---|
Fix loop (if Critical/Major defects exist):
"status": "failing" in feature-list.json; document in task-progress.mdlong-task:long-task-work to fixFor Minor/Cosmetic deferrals: document in ST report with severity, description, workaround.
Before writing, verify: every SRS requirement appears in RTM; every NFR has a measured value meeting the threshold; every applicable category has results; all defects are classified.
Generate docs/plans/YYYY-MM-DD-st-report.md with these sections:
- ATS compliance (`check_ats_coverage.py --strict` output)
- Real test case counts (total / passed / failed) aggregated from all feature ST documents (`docs/test-cases/feature-*.md` Real Test Case Execution Summary tables); if any manual test cases exist, include a Manual Test Cases row aggregating manual counts (total / MANUAL-PASS / MANUAL-FAIL / BLOCKED) from all Feature-ST documents and System-ST execution steps
- Per-feature ST results (`docs/test-cases/feature-*.md` Real Test Case Execution Summary tables)
- Artifacts (`docs/plans/*-st-plan.md`, `docs/plans/*-st-report.md`, test files)

Validate the feature list:

```bash
python scripts/validate_features.py feature-list.json
```

Step 11 cleanup:
- Stop services using the `env-guide.md` "Stop All Services" section: kill by PID (from `task-progress.md`, preferred) or by port (fallback)
- If a stop command fails, fix it and update the `env-guide.md` Stop command; if >2 steps are needed, extract to `scripts/svc-<slug>-stop.sh` and reference from `env-guide.md`
- Run the `env-guide.md` "Verify Services Stopped" commands: ports must not respond
- Record cleanup in `task-progress.md`

Check retrospective readiness:

```bash
python scripts/check_retrospective_readiness.py
```
If exit 0 (records found) AND retro_authorized is true in feature-list.json:
invoke `long-task:long-task-retrospective`.

If exit 1 (no records) OR `retro_authorized` is absent/false, skip to the Verdict.
Determine the Go/No-Go verdict based on exit criteria. Record in the ST report:
MANUAL-FAIL results are treated identically to automated FAIL for verdict determination. A Critical/Major defect discovered via manual testing blocks Go verdict, same as automated.
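The verdict then reduces to a small decision over the exit criteria; a sketch with assumed input shapes:

```python
def verdict(defects: list[dict], nfrs_met: bool, rtm_coverage: float,
            ats_ok: bool, tests_green: bool) -> str:
    """Go / Conditional-Go / No-Go per the exit criteria above.
    Manual defects sit in `defects` alongside automated ones."""
    blocking = [d for d in defects
                if d["severity"] in ("Critical", "Major") and d["status"] != "fixed"]
    if blocking or not (tests_green and nfrs_met and ats_ok) or rtm_coverage < 1.0:
        return "No-Go"
    deferred = [d for d in defects
                if d["severity"] in ("Minor", "Cosmetic") and d["status"] == "deferred"]
    return "Conditional-Go" if deferred else "Go"
```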
If verdict is Go or Conditional-Go:
invoke `long-task:long-task-finalize`.

If the verdict is No-Go, skip Finalize (loop back to Worker for fixes; Finalize runs after the eventual Go).
| Project Size | Features | ST Depth |
|---|---|---|
| Tiny (1-5) | 1-5 features | Regression + lightweight integration + 1 smoke test + 2-3 E2E scenarios + 1-2 exploratory charters |
| Small (5-15) | 5-15 features | Full regression + integration per shared boundary + 1-2 smoke tests + E2E per persona + NFR spot-checks + 3-5 charters |
| Medium (15-50) | 15-50 features | Full regression + systematic integration + 2-3 smoke tests + comprehensive E2E + full NFR + compatibility matrix + 5-10 charters |
| Large (50+) | 50+ features | Full regression + integration test suite + 3-5 smoke tests + E2E automation + full NFR load testing + full compatibility + security audit + 10+ charters |
- Any failed Real test case in the ST report's Real Test Cases row is an unresolved defect that blocks a Go verdict
- `check_ats_coverage.py --strict` must exit 0; every required category must have test coverage

Called by: using-long-task (when feature-list.json exists AND all features passing), or long-task-work (Step 12 when no failing features remain)
Reads: feature-list.json, docs/plans/*-srs.md, docs/plans/*-design.md, docs/plans/*-ucd.md (if UI), docs/test-cases/feature-*.md (per-feature ST from long-task-feature-st), task-progress.md, project config file (if applicable)
May invoke: long-task:long-task-work (if Critical/Major defects found → fix loop), long-task:long-task-finalize (after Go/Conditional-Go verdict)
Produces: docs/plans/YYYY-MM-DD-st-plan.md, docs/plans/YYYY-MM-DD-st-report.md
Read on-demand (via Read tool, NOT Skill tool): references/st-recipes.md