Executes end-user verification tests on real infrastructure using Setup/Action/Assert sequences. Classifies tasks as CLI, GUI, or SUBJECTIVE to determine auto-approval or human checkpoints, with evidence capture.
Install: `npx claudepluginhub deepeshbodh/human-in-loop --plugin humaninloop`. This skill uses the workspace's default tool permissions.
References: `references/EVIDENCE-CAPTURE.md`, `references/REPORT-TEMPLATES.md`, `references/TASK-PARSING.md`, `references/TESTING-EVIDENCE.md`
Execute verification tasks that validate real infrastructure behavior through structured Setup/Action/Assert sequences. Classify tasks at runtime (CLI/GUI/SUBJECTIVE) to determine whether to auto-approve or present human checkpoints. This skill transforms tasks marked with **TEST:** into executable verification sequences with captured evidence.
Violating the letter of the rules is violating the spirit of the rules.
Verification testing exists to catch failures before they reach production. Every shortcut in this process is a potential production incident waiting to happen.
Identify tasks containing verification markers (**TEST:**, **TEST:VERIFY**, **TEST:CONTRACT**, or **HUMAN VERIFICATION**, all mapped to the unified format):
- [ ] **TN.X**: **TEST:** - {Description}
- **Setup**: {Prerequisites} (optional)
- **Action**: {Command or instruction}
- **Assert**: {Expected outcome}
- **Capture**: {console, screenshot, logs} (optional)
Supported markers (all normalized to unified format):
- **TEST:** - Unified format (preferred)
- **TEST:VERIFY** - Legacy format
- **TEST:CONTRACT** - Legacy format
- **HUMAN VERIFICATION** - Legacy format (maps Setup/Action/Verify fields)

See references/TASK-PARSING.md for field marker extraction rules.
Execute in strict order. No skipping steps. No reordering.
1. Parse Task
Extract structured data (Setup, Action, Assert, Capture fields) from the task format above.
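The parsing step can be sketched as follows; the regex and the alias table are illustrative, not the actual extraction rules from references/TASK-PARSING.md:

```python
import re

# Map legacy field names onto the unified format; "Verify" from the legacy
# HUMAN VERIFICATION layout becomes the unified "Assert" field.
FIELD_ALIASES = {"Setup": "Setup", "Action": "Action",
                 "Assert": "Assert", "Verify": "Assert", "Capture": "Capture"}

def parse_test_task(lines):
    """Extract Setup/Action/Assert/Capture entries from a task's bullet lines."""
    task = {"Setup": [], "Action": [], "Assert": [], "Capture": []}
    for line in lines:
        m = re.match(r"\s*-\s*\*\*(\w+)\*\*:\s*(.+)", line)
        if m and m.group(1) in FIELD_ALIASES:
            task[FIELD_ALIASES[m.group(1)]].append(m.group(2).strip())
    return task

task = parse_test_task([
    "  - **Setup**: npm install",
    "  - **Action**: npm test (timeout 120s)",
    '  - **Verify**: Console contains "0 failing"',
])
# The legacy "Verify" field lands in the unified "Assert" list.
```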
2. Execute Setup
Run setup commands sequentially. Fail fast if any setup fails. Record all output for debugging. Setup failures block action execution.
3. Execute Actions
Run each action respecting modifiers:
| Modifier | Example | Behavior |
|---|---|---|
| (background) | `npm start (background)` | Run async, track PID |
| (timeout Ns) | `curl ... (timeout 10s)` | Override 60s default |
| (in {path}) | `make build (in ./backend)` | Change directory |
Capture all console output. Track background processes. Enforce timeouts. See references/EVIDENCE-CAPTURE.md for capture details.
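A minimal sketch of an action runner honoring these modifiers; the modifier grammar here is simplified, and the skill's real parsing may differ:

```python
import re
import shlex
import subprocess

def run_action(action: str, default_timeout: int = 60):
    """Run one action, honoring (background), (timeout Ns), (in {path})."""
    cwd, timeout, background = None, default_timeout, False
    m = re.search(r"\(in (\S+)\)", action)
    if m:
        cwd, action = m.group(1), action.replace(m.group(0), "").strip()
    m = re.search(r"\(timeout (\d+)s\)", action)
    if m:
        timeout, action = int(m.group(1)), action.replace(m.group(0), "").strip()
    if "(background)" in action:
        background, action = True, action.replace("(background)", "").strip()

    if background:
        # Async: return the PID so the process can be tracked and cleaned up.
        proc = subprocess.Popen(shlex.split(action), cwd=cwd,
                                stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        return {"pid": proc.pid, "proc": proc}
    # Foreground: enforce the timeout and capture all console output.
    done = subprocess.run(shlex.split(action), cwd=cwd, timeout=timeout,
                          capture_output=True, text=True)
    return {"stdout": done.stdout, "stderr": done.stderr, "rc": done.returncode}

out = run_action("echo hello (timeout 5s)")
```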
4. Evaluate Asserts
Check each assert against captured evidence:
| Pattern | Verification |
|---|---|
| Console contains "{text}" | Substring match |
| Console contains "{text}" (within Ns) | Timed match |
| File exists: {path} | `test -f {path}` |
| Response status: {code} | HTTP status check |
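A sketch of how these assert patterns might be matched against captured evidence; the evidence dict keys are assumptions, and the timed `(within Ns)` variant degrades to a plain substring check here:

```python
import os
import re

def evaluate_assert(assertion: str, evidence: dict) -> bool:
    """Match one assert pattern against captured evidence."""
    m = re.match(r'Console contains "(.+)"', assertion)
    if m:
        # Also matches the "(within Ns)" variant; timing is omitted in this sketch.
        return m.group(1) in evidence.get("console", "")
    m = re.match(r"File exists: (.+)", assertion)
    if m:
        return os.path.isfile(m.group(1))   # equivalent of `test -f {path}`
    m = re.match(r"Response status: (\d+)", assertion)
    if m:
        return evidence.get("status") == int(m.group(1))
    raise ValueError(f"Unrecognized assert pattern: {assertion}")

ok = evaluate_assert('Console contains "Server started"',
                     {"console": "Server started on :3000"})
```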
5. Generate Report
See references/REPORT-TEMPLATES.md for templates.
6. Present Checkpoint
Ask human to approve, reject, or retry. Human decision gates cycle completion. No proceeding without explicit human approval.
Before execution, classify the task based on Action and Assert content:
| Classification | Criteria | Checkpoint Behavior |
|---|---|---|
| CLI | Backtick commands + measurable asserts | May auto-approve if 100% pass |
| GUI | UI actions (click, tap) or screenshot capture | Always human checkpoint |
| SUBJECTIVE | Qualitative terms (looks, feels, appears) | Always human checkpoint |
Default to SUBJECTIVE if uncertain (safe fallback).
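The classification rules can be sketched as keyword checks; the term lists below are illustrative, not exhaustive, and subjective language is tested first so the safe fallback wins ties:

```python
import re

GUI_TERMS = ("click", "tap", "screenshot", "scroll", "drag")
SUBJECTIVE_TERMS = ("looks", "feels", "appears", "seems")

def classify_task(action: str, asserts: list[str]) -> str:
    """Classify a task as CLI, GUI, or SUBJECTIVE; default to SUBJECTIVE."""
    text = (action + " " + " ".join(asserts)).lower()
    if any(t in text for t in SUBJECTIVE_TERMS):
        return "SUBJECTIVE"        # qualitative language: always human checkpoint
    if any(t in text for t in GUI_TERMS):
        return "GUI"               # UI interaction: always human checkpoint
    if re.search(r"`[^`]+`", action) and asserts:
        return "CLI"               # backtick command + measurable asserts
    return "SUBJECTIVE"            # uncertain: safe fallback
```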
| Status | Meaning |
|---|---|
| PASS | All asserts passed |
| FAIL | One or more asserts failed |
| PARTIAL | Mixed results, needs judgment |
| TIMEOUT | Action exceeded time limit |
| ERROR | Execution error (not assertion) |
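One way to roll per-assert results up into these statuses, consistent with the rule that unevaluated asserts never count as passes:

```python
def overall_status(assert_results, timed_out=False, error=None) -> str:
    """Combine per-assert results (True/False/None) into one report status.

    None marks an assert that was never evaluated; it blocks PASS.
    """
    if error is not None:
        return "ERROR"              # execution failure, not an assertion failure
    if timed_out:
        return "TIMEOUT"
    if assert_results and all(r is True for r in assert_results):
        return "PASS"
    if all(r is False for r in assert_results):
        return "FAIL"
    return "PARTIAL"                # mixed or unevaluated: needs judgment
```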
| Type | Capture Method |
|---|---|
| console | stdout/stderr from commands |
| screenshot | Platform-specific screen capture |
| logs | Contents of specified log files |
| timing | Duration of each action |
Before presenting the checkpoint, verify completion of ALL items.
No presenting partial results. No skipping evidence capture. No proceeding without human approval.
When dispatched as part of an implementation cycle verification, execute quality gates alongside TEST: task verification. Quality gates are command-based checks that always auto-resolve.
Quality gates are read from the tasks.md `## Quality Gates` section and/or the plan.md build configuration (`quality_gates` frontmatter section). Add a `quality_gates` section to the verification-report YAML frontmatter:
```yaml
verification:
  test_tasks:
    total: 2
    passed: 2
  quality_gates:
    lint:
      status: pass
      command: "pnpm lint"
    build:
      status: pass
      command: "pnpm build"
    tests:
      status: pass
      command: "pnpm test"
      passed: 47
      failed: 0
      skipped: 2
```
Each quality gate entry includes the command run and its status. For test suites, include pass/fail/skip counts when available.
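A sketch of a gate runner producing this frontmatter shape; gate names and commands come from tasks.md, and the `true`/`false` commands below are stand-in examples, not real gates:

```python
import subprocess

def run_quality_gates(gates: dict[str, str]) -> dict:
    """Run each gate command; a zero exit code means the gate passed."""
    results = {}
    for name, command in gates.items():
        proc = subprocess.run(command, shell=True, capture_output=True, text=True)
        results[name] = {
            "status": "pass" if proc.returncode == 0 else "fail",
            "command": command,
        }
    return results

# `true` exits 0 and `false` exits 1, standing in for real lint/build commands.
report = run_quality_gates({"lint": "true", "build": "false"})
```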
Quality gates always auto-resolve; no human checkpoint is needed for "did lint pass?" decisions.
Quality gate failures are surfaced through the verification report to the cycle-checkpoint gate, which evaluates them deterministically.
No exceptions. If any of the excuses below arise, STOP immediately: they all mean rationalization is in progress. Return to the execution sequence and follow every step.
| Excuse | Reality |
|---|---|
| "Test obviously passed" | Obvious passes hide subtle failures. Capture evidence anyway. |
| "Already ran this before" | Previous runs are stale. Each execution is independent. Run again. |
| "User wants quick answer" | Quick answers without evidence are unreliable. Process protects user. |
| "Simple test case" | Simple tests catch complex bugs. Full process regardless of simplicity. |
| "Evidence capture is slow" | Slow capture beats fast wrong answer. Time investment protects quality. |
| "Can infer the result" | Inference is not verification. Execute and observe. |
| "Same setup as before" | Environments change. Run setup fresh. Validate assumptions. |
| "Just checking one thing" | One thing has dependencies. Full sequence catches hidden failures. |
**What goes wrong**: Action fails mysteriously because setup was assumed complete.
**Fix**: Always run setup commands. Always capture setup output. Fail explicitly if setup fails.
**What goes wrong**: Background processes from previous tests interfere with the current test.
**Fix**: Track all PIDs. Kill processes after the test (pass or fail). Verify cleanup completed.
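The cleanup discipline can be sketched with a `try/finally` that terminates every tracked process whether the test passes or fails:

```python
import subprocess

def run_with_cleanup(test_fn, background_procs: list):
    """Run a test, then terminate all tracked background processes."""
    try:
        return test_fn()
    finally:
        for proc in background_procs:
            if proc.poll() is None:          # still running
                proc.terminate()
                try:
                    proc.wait(timeout=5)
                except subprocess.TimeoutExpired:
                    proc.kill()              # escalate if SIGTERM is ignored
            assert proc.poll() is not None   # verify cleanup completed

proc = subprocess.Popen(["sleep", "60"])     # stand-in background process
result = run_with_cleanup(lambda: "ok", [proc])
```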
**What goes wrong**: Critical failure information is cut off from the report.
**Fix**: Follow the truncation rules in REPORT-TEMPLATES.md. Always include log file locations. Preserve full evidence for human review.
**What goes wrong**: Claiming PASS when asserts were not actually evaluated.
**Fix**: Each assert MUST have an explicit pass/fail evaluation. No defaulting to PASS. Unevaluated asserts are failures.
**What goes wrong**: Continuing execution when the human explicitly rejected.
**Fix**: Rejection gates completion. Human approval is mandatory. Retry or abort on rejection.
**What goes wrong**: Test runs but the human never sees results. No audit trail. No approval gate.
**Fix**: Every test MUST end with checkpoint presentation. No silent completion. Human-in-loop is the point.