**Evidence must prove what you claim.** Mock data cannot prove production behavior.
Every test MUST capture these at minimum (copy-paste into test setup):
import os
import subprocess

# BASE_URL is assumed to come from your test harness; defined here so the example runs
BASE_URL = os.environ.get("BASE_URL", "http://localhost:8001")

def capture_provenance():
    """REQUIRED: Capture all evidence standards."""
    provenance = {}

    # === GIT PROVENANCE (MANDATORY) ===
    subprocess.run(["git", "fetch", "origin", "main"], timeout=10, capture_output=True)
    provenance["git_head"] = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True).strip()
    provenance["git_branch"] = subprocess.check_output(
        ["git", "branch", "--show-current"], text=True).strip()
    provenance["merge_base"] = subprocess.check_output(
        ["git", "merge-base", "HEAD", "origin/main"], text=True).strip()
    provenance["commits_ahead_of_main"] = int(subprocess.check_output(
        ["git", "rev-list", "--count", "origin/main..HEAD"], text=True).strip())
    provenance["diff_stat_vs_main"] = subprocess.check_output(
        ["git", "diff", "--stat", "origin/main...HEAD"], text=True).strip()

    # === SERVER RUNTIME (MANDATORY for server tests) ===
    port = BASE_URL.split(":")[-1].rstrip("/")
    try:
        lsof_out = subprocess.check_output(["lsof", "-i", f":{port}", "-t"], text=True)
        pids = [pid for pid in lsof_out.strip().split("\n") if pid]
    except subprocess.CalledProcessError:  # lsof exits non-zero when nothing is listening
        pids = []
    provenance["server"] = {
        "pid": pids[0] if pids else None,
        "port": port,
        "process_cmdline": subprocess.check_output(
            ["ps", "-p", pids[0], "-o", "command="], text=True).strip() if pids else None,
        "env_vars": {var: os.environ.get(var) for var in
                     ["WORLDAI_DEV_MODE", "TESTING", "GOOGLE_APPLICATION_CREDENTIALS"]},
    }
    return provenance
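A minimal sketch of wiring this into a bundle; the `write_metadata` helper and `evidence_dir` argument are illustrative, not part of the skill:

```python
import datetime
import json
from pathlib import Path

def write_metadata(evidence_dir: Path) -> None:
    """Persist provenance into the bundle's metadata.json at test start."""
    metadata = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "provenance": capture_provenance(),
    }
    (evidence_dir / "metadata.json").write_text(json.dumps(metadata, indent=2) + "\n")
```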
Quick validation: If your metadata.json is missing ANY of these fields, the test is incomplete:
- `provenance.merge_base`
- `provenance.commits_ahead_of_main`
- `provenance.diff_stat_vs_main`
- `provenance.server.pid`
- `provenance.server.port`
- `provenance.server.process_cmdline`

MANDATORY for ANY integration claim:
Before running ANY test, answer:
| Question | If YES → |
|---|---|
| Testing production/preview server behavior? | MUST use real mode |
| Validating actual API responses? | MUST use real mode |
| Checking data integrity (dice, state, persistence)? | MUST use real mode |
| Proving a bug is fixed in production? | MUST use real mode |
| Development workflow validation only? | Mock mode acceptable |
| Unit testing isolated functions? | Mock mode acceptable |
Production mode is NOT required for valid evidence. Local testing with real services
(real LLM APIs, real Firebase, real dice) is sufficient to prove behavior.
If a run artifact records `production_mode`, then `production_mode: false` is acceptable
for evidence as long as the claim is not about production configuration or prod-only behavior.
| Mode | When to Use | Evidence Value |
|---|---|---|
| `--production-mode` | Final deployment validation | Highest (actual prod config) |
| `--evidence` (local server) | PR validation, feature proof | Valid (real APIs, real data) |
| Mock mode | Unit tests, CI speed | Invalid for behavior claims |
The key requirement is real execution (actual API calls, actual RNG), not a production
environment. Evidence from `--start-local --evidence` is valid proof.
MOCK MODE = INVALID EVIDENCE for:
Mock mode tests ONLY prove:
Mock mode tests NEVER prove:
Required files in every evidence bundle:
| File | Purpose | Required Keys |
|---|---|---|
| `run.json` | Test results | `scenarios[*].name`, `scenarios[*].campaign_id`, `scenarios[*].errors` |
| `metadata.json` | Git/server provenance | `git_provenance`, `server`, `timestamp` |
| `evidence.md` | Human-readable summary | Pass/fail counts matching `run.json` |
| `methodology.md` | Test methodology | Environment, steps, validation |
| `README.md` | Package manifest | Git commit, branch, collection time |
| `request_responses.jsonl` | Raw MCP captures | Full request/response pairs |
| `llm_request_responses.jsonl` | Raw LLM request/response payloads | `type` field (`request` or `response`) |
DEPRECATED: evidence.json - use run.json + metadata.json instead.
Required for base-class local-server traces:
- `request_responses.jsonl` - MCP client ↔ local server (`/mcp`) request/response pairs
- `llm_request_responses.jsonl` - raw LLM-layer request/response capture stream

Every test MUST emit `results["scenarios"]` even for single-scenario runs:
# ❌ BAD - Missing scenarios array causes "Total Scenarios: 0"
results = {"test_result": {...}}
# ✅ GOOD - Always include scenarios array
results = {
"scenarios": [
{
"name": "scenario_name",
"campaign_id": "abc123", # Required for log traceability
"passed": True,
"errors": [], # Always include, even if empty
"checks": {...}
}
],
"test_result": {...} # Optional summary
}
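A quick pre-flight validator along these lines can catch the missing-scenarios mistake before evidence is packaged (a sketch; `validate_run_json` is illustrative):

```python
def validate_run_json(results: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape is valid."""
    problems = []
    scenarios = results.get("scenarios")
    if not scenarios:
        problems.append("missing or empty 'scenarios' array")
    for i, scenario in enumerate(scenarios or []):
        for key in ("name", "campaign_id", "errors"):
            if key not in scenario:
                problems.append(f"scenarios[{i}] missing '{key}'")
    return problems
```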
ALL evidence files MUST have separate checksum files:
# Generate checksums AFTER finalizing content
sha256sum run.json > run.json.sha256
sha256sum metadata.json > metadata.json.sha256
# Verify checksums
sha256sum -c run.json.sha256
Anti-pattern: Embedding checksums inside JSON files (self-invalidating).
Checksum usability requirement: .sha256 files must reference the local basename
(e.g., run.json), not a nested path like artifacts/run_.../run.json.
This ensures sha256sum -c works when run from the evidence directory.
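For example (hypothetical run directory), regenerate the checksum from inside the file's own directory so only the basename is recorded:

```bash
# Recorded line will read "<hash>  run.json", so `sha256sum -c` works locally
( cd artifacts/run_20251227 && sha256sum run.json > run.json.sha256 )
```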
ALL evidence files require checksums, including:
import hashlib
from pathlib import Path

def _write_checksum_for_file(filepath: Path) -> None:
    """Generate SHA256 checksum file for an existing file."""
    content = filepath.read_bytes()
    sha256_hash = hashlib.sha256(content).hexdigest()
    checksum_file = Path(str(filepath) + ".sha256")
    # Two spaces between hash and name, so `sha256sum -c` accepts the line
    checksum_file.write_text(f"{sha256_hash}  {filepath.name}\n")
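Typical usage, assuming `evidence_dir` points at the bundle directory (illustrative):

```python
from pathlib import Path

evidence_dir = Path("/tmp/myrepo/my-branch/my-work/20260101T000000Z")  # hypothetical
for name in ("run.json", "metadata.json", "evidence.md"):
    _write_checksum_for_file(evidence_dir / name)
```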
For integrations relying on streamed server output, the `done` payload evidence must include:

- `streaming_response_signature.digest`
- `streaming_response_signature.algorithm`
- `streaming_response_signature.schema_version`
- `streaming_execution_trace`
- `request_id`

The signature is the SHA-256 digest (or HMAC-SHA256 when `STREAM_RESPONSE_SIGNING_SECRET` is set) over canonical JSON for:
- `request_id`
- `response_text`
- `execution_trace`
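A sketch of how a replay script might recompute the digest. The exact payload shape is an assumption based on the three fields above, serialized with `sort_keys=True` and compact separators as required below:

```python
import hashlib
import hmac
import json
import os

def compute_stream_signature(request_id: str, response_text: str, execution_trace: list) -> str:
    # Canonical JSON: sorted keys, compact separators, UTF-8
    payload = json.dumps(
        {"request_id": request_id, "response_text": response_text,
         "execution_trace": execution_trace},
        sort_keys=True, separators=(",", ":"),
    ).encode("utf-8")
    secret = os.environ.get("STREAM_RESPONSE_SIGNING_SECRET")
    if secret:  # HMAC-SHA256 when a signing secret is configured
        return hmac.new(secret.encode("utf-8"), payload, hashlib.sha256).hexdigest()
    return hashlib.sha256(payload).hexdigest()  # plain SHA-256 otherwise
```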
Evidence reviewers (or replay scripts) must verify:

- `streaming_response_signature` is present in real-mode runs.
- `streaming_execution_trace` exists and records the real provider path (`provider` / `mock_callable`) for phases.
- `mock_services_mode` is false and no phase uses `mock_local_fallback`.
- Signature verification uses identical serialization (`sort_keys=True`, compact separators, UTF-8).

Single-run attribution: If a bundle contains multiple runs, the docs must
name the exact run directory used for claims (e.g., run_YYYYMMDD...). Claims
must be traceable to one run only.
Multi-campaign isolation: If tests create multiple campaigns (e.g., isolated tests for state-sensitive scenarios), evidence.md must include:
Example isolation note in evidence.md:
## ⚠️ Multi-Campaign Isolation Note
This bundle contains **11 campaigns**: 1 shared + 10 isolated.
Each scenario includes its `campaign_id` for traceability.
Per-scenario campaign ID in run.json: When using fresh campaigns per scenario,
the test output must include campaign_id for each scenario entry:
{
"scenarios": [
{
"name": "Skill Check (Stealth)",
"campaign_id": "zuFsywkYErTZpGBGDhDC", // ← Required for log traceability
"dice_audit_events": [...],
"tool_results": [...]
}
]
}
This enables matching server logs (which include campaign_id=...) to specific
scenario results in the evidence bundle.
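For example, to pull the server log lines for the scenario above (log format per the note, `campaign_id=...`):

```bash
grep "campaign_id=zuFsywkYErTZpGBGDhDC" artifacts/server.log
```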
Doc ↔ data alignment: Any item lists in methodology/evidence must be
derived from actual test inputs or game_state_snapshot.json. Hardcoded or
handwritten lists are invalid.
Threshold capture: If pass/fail depends on thresholds (e.g.,
min_narrative_items), those values must be recorded in run.json or the
methodology so reviewers can verify the criteria.
Environment claims: Only claim env vars that are read from the actual environment during the run (or omit them).
Unsupported claims: CI status, Copilot analysis, or external validations must include their own evidence artifacts, otherwise omit those claims.
Bug-fix classification: If a bundle labels a change as "new feature" to avoid before/after evidence, it must include a justification. Otherwise, for bug-fix claims, include a pre-fix reproduction and a post-fix run.
Claim → Artifact Map: Every evidence.md MUST include a section mapping claims to files:
## Claim → Artifact Map
| Claim | File | Key Field(s) |
|-------|------|--------------|
| DC set before roll (Gemini) | run.json | scenarios[].dice_audit_events[].dc_reasoning |
| DC set before roll (Qwen) | run.json | scenarios[].tool_results[].args.dc_reasoning |
| Executed code proof | gemini3_executed_code.log | dc = X before random.randint() |
Coverage Matrix: For multi-model or multi-scenario tests, include a summary table:
## Coverage Matrix
| Scenario | Gemini 3 | Qwen | Key Params |
|----------|----------|------|------------|
| Attack Roll | Pass (dc=15) | Pass (ac=13) | AC-based |
| Skill Check | Pass (dc=13) | Pass (dc=13) | dc_reasoning required |
| Saving Throw | Pass (dc=15) | Pass (dc=17) | dc + dc_reasoning |
Raw Response Retention: Every scenario MUST have its raw LLM output saved:
- `raw_{model}_{scenario}.txt`

Executed Code Capture (code_execution strategy): When the LLM uses `code_execution`:
Tool Request/Response Pairing (two-phase strategy): When LLM uses tool calling:
- Capture `args` (request) and `result` (response) together

Traceability Metadata: Every evidence bundle MUST include:
Evidence Integrity Note: Include a section documenting integrity checks (e.g., that every scenario reports its error array even when empty: `"errors": []`).
What is NOT Proven (Exclusion List): Explicitly state limitations:
## What This Evidence Does NOT Prove
- Production server behavior (tested on local server)
- Performance under load (single-request tests)
- Edge cases not covered by scenarios (e.g., contested checks)
Every evidence bundle MUST capture:
git fetch origin main # Ensure origin/main is current
git rev-parse HEAD # Exact commit being tested
git rev-parse origin/main # Base comparison point
git branch --show-current # Branch name
git diff --name-only origin/main...HEAD # Files changed
Why: Proves exactly what code was running during evidence capture.
For server-based evidence, capture:
# Process info
ps -eo pid,user,etime,args | grep "mvp_site\|python.*main"
# Listening ports
lsof -i :PORT -P -n | grep LISTEN
# Environment variables (sanitized): capture the required vars listed below
# Open files / cwd for the PID found above (lsof field output: p=pid, f=fd, n=name)
lsof -p "$PID" -F pfn 2>/dev/null | grep -E "^p|^fcwd|^n/"
Required env vars to capture:
- `WORLDAI_DEV_MODE`
- `PORT`
- `FIREBASE_PROJECT_ID`

Canonical format with versioning (v1.1.0+):
/tmp/<repo>/<branch>/<test_name>/
├── iteration_001/ # First test run
│ ├── README.md # Package manifest with run_id, iteration
│ ├── README.md.sha256
│ ├── methodology.md # Testing methodology documentation
│ ├── methodology.md.sha256
│ ├── evidence.md # Evidence summary with metrics
│ ├── evidence.md.sha256
│ ├── notes.md # Additional context, TODOs, follow-ups
│ ├── notes.md.sha256
│ ├── metadata.json # Machine-readable: run_id, iteration, bundle_version
│ ├── metadata.json.sha256
│ ├── run.json # Test results
│ ├── run.json.sha256
│ └── artifacts/ # Server logs, lsof, ps output
├── iteration_002/ # Second test run (auto-incremented)
│ └── ...
└── latest -> iteration_002 # Symlink to most recent
Versioning Fields (REQUIRED in metadata.json):
| Field | Description | Example |
|---|---|---|
| `bundle_version` | Evidence format version | `"1.1.0"` |
| `run_id` | Unique identifier for this run | `"llm_guardrails_exploits-003-20260101T221620"` |
| `iteration` | Run number within this test | `3` |
Run ID Format: `{test_name}-{iteration:03d}-{timestamp}`
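A sketch of generating a compliant run ID (the helper name is illustrative):

```python
from datetime import datetime, timezone

def make_run_id(test_name: str, iteration: int) -> str:
    # {test_name}-{iteration:03d}-{timestamp}
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    return f"{test_name}-{iteration:03d}-{ts}"

# e.g. make_run_id("llm_guardrails_exploits", 3)
#      -> "llm_guardrails_exploits-003-20260101T221620"
```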
Example metadata.json with versioning:
{
"test_name": "llm_guardrails_exploits",
"run_id": "llm_guardrails_exploits-003-20260101T221620",
"iteration": 3,
"bundle_version": "1.1.0",
"bundle_timestamp": "2026-01-01T22:16:20.000000+00:00",
"provenance": { ... },
"summary": { ... }
}
Legacy format (still supported for backward compatibility):
/tmp/<repo>/<branch>/<work>/<timestamp>/
├── README.md # Package manifest with git provenance
├── README.md.sha256
├── methodology.md # Testing methodology documentation
├── methodology.md.sha256
├── evidence.md # Evidence summary with metrics
├── evidence.md.sha256
├── notes.md # Additional context, TODOs, follow-ups
├── notes.md.sha256
├── metadata.json # Machine-readable: git_provenance, timestamps
├── metadata.json.sha256
├── pr_diff.txt # Optional (PR mode): full diff origin/main...HEAD
├── pr_diff_summary.txt # Optional (PR mode): diff summary
└── artifacts/ # Copied evidence files (test outputs, logs, etc.)
└── <copied files with checksums>
Which flow to use?
- `/generatetest` or automated test runners: rely on the built-in `save_evidence()` helper to produce metadata, README, and checksums in one pass.

Manual creation (shell-based):
# Set up directory structure
REPO=$(basename $(git rev-parse --show-toplevel))
BRANCH=$(git rev-parse --abbrev-ref HEAD)
WORK="your-work-name"
TIMESTAMP=$(date -u +%Y%m%dT%H%M%SZ)
EVIDENCE_DIR="/tmp/${REPO}/${BRANCH}/${WORK}/${TIMESTAMP}"
mkdir -p "${EVIDENCE_DIR}/artifacts"
# Capture git provenance
git rev-parse HEAD > "${EVIDENCE_DIR}/git_head.txt"
git log -1 --format="%H%n%an <%ae>%n%aI%n%s" > "${EVIDENCE_DIR}/git_commit_info.txt"
git diff --name-only origin/main...HEAD > "${EVIDENCE_DIR}/changed_files.txt"
# Create package manifest and metadata to mirror automated flow
cat > "${EVIDENCE_DIR}/README.md" <<EOF
# Evidence Package Manifest
- Repository: ${REPO}
- Branch: ${BRANCH}
- Work Name: ${WORK}
- Collected At (UTC): ${TIMESTAMP}
EOF
cat > "${EVIDENCE_DIR}/metadata.json" <<EOF
{
"repository": "${REPO}",
"branch": "${BRANCH}",
"work_item": "${WORK}",
"timestamp": "${TIMESTAMP}",
"created_by": "manual_shell_example"
}
EOF
# Create documentation files
echo "# Methodology" > "${EVIDENCE_DIR}/methodology.md"
echo "# Evidence Summary" > "${EVIDENCE_DIR}/evidence.md"
echo "# Notes" > "${EVIDENCE_DIR}/notes.md"
# Generate checksums
cd "${EVIDENCE_DIR}" || { echo "Failed to enter evidence directory" >&2; exit 1; }
shopt -s nullglob
for f in *.md *.txt *.json; do
[ -f "$f" ] && sha256sum "$f" > "${f}.sha256"
done
shopt -u nullglob
# After populating methodology.md, evidence.md, and notes.md with real content,
# regenerate checksums to reflect the final state:
# for f in *.md *.txt *.json; do
# [ -f "$f" ] && sha256sum "$f" > "${f}.sha256"
# done
Alternative format (still valid for specialized tests):
/tmp/{feature}_api_tests_v{N}/
├── full_evidence_transcript.txt # Human-readable log
├── api_completion_test.json # Structured test results
├── api_completion_test.json.sha256
├── post_process_analysis.json # Validation/regression checks
├── post_process_analysis.json.sha256
└── evidence_capture.sh # Reproducible script
Evidence MUST include:
If you claim a specific system instruction or enforcement block was included in a live LLM call:
- Capture the loaded prompt file list (`debug_info.system_instruction_files`). Record `system_instruction_char_count` when available.

Runtime Capture Mechanism (Your Project):
# Start server (system instruction capture is always enabled)
CAPTURE_SYSTEM_INSTRUCTION_MAX_CHARS=120000 \
WORLDAI_DEV_MODE=true \
PORT=8005 \
python -m mvp_site.main serve
When enabled, full system instruction text appears in debug_info.system_instruction_text (optional).
Prompt Tracking (default):
Capture prompt filenames (and char count when available):
- `debug_info.system_instruction_files`: List of prompt files loaded (e.g., `["prompts/master_directive.md", "prompts/game_state_instruction.md"]`)
- `debug_info.system_instruction_char_count`: Total character count of combined prompts (optional)

This proves which prompts were used without the ~100KB overhead per response. The file list provides provenance while keeping evidence bundles manageable.
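One way to spot-check that every captured response carries the lightweight tracking fields, assuming the `request_responses.jsonl` layout shown later in this document:

```bash
# Lines whose debug_info lacks system_instruction_files; no output means full coverage
jq -c 'select(.response.result.debug_info.system_instruction_files == null)' \
  request_responses.jsonl
```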
Evidence Mode Documentation (MANDATORY when using lightweight tracking):
When using lightweight prompt tracking, evidence files MUST include explicit documentation:
{
"evidence_mode": "lightweight_prompt_tracking",
"evidence_mode_notes": "System instruction captured as filenames + char_count (not full text). Full raw_response_text from LLM is captured. Server logs in artifacts/."
}
This ensures reviewers know what capture approach was used and can assess evidence completeness accordingly.
Evidence MUST show:
Evidence MUST include:
New features don't require "before" evidence since there's no prior behavior to compare. Instead, prove:
- The feature was active in the live run (verify via `system_instruction_files` or code paths)

When proving LLM or API behavior, evidence MUST capture the full request/response cycle:
Required captures:
Mandatory 2-layer trace set for testing_mcp/lib/base_test.py runs:
- `request_responses.jsonl` - MCP client ↔ local server (`/mcp`) request/response pairs
- `llm_request_responses.jsonl` (+ `artifacts/server.log`) - local server LLM handling traces

For base-class runs, these artifact files must be full and untrimmed.
When REQUIRE_FULL_TRACE_LOGS=true, missing/invalid trace artifacts are a hard failure.
Why raw capture matters:
Capture format:
request_responses.jsonl # One JSON object per line, each containing:
{
"timestamp": "ISO8601",
"request": { ... full MCP request ... },
"response": { ... full MCP response ... },
"response.result.debug_info": {
// Lightweight tracking (default):
"system_instruction_files": ["prompts/master_directive.md", "..."],
"system_instruction_char_count": 93180,
// Full capture (when CAPTURE_SYSTEM_INSTRUCTION_MAX_CHARS > 0):
"system_instruction_text": "system prompt sent to LLM",
// Raw LLM capture (requires CAPTURE_RAW_LLM=true):
"raw_request_payload": "full LLMRequest JSON sent to LLM (user action, context)",
"raw_response_text": "LLM output before parsing"
}
}
Required debug_info fields for LLM claims:
- `system_instruction_text` - The system prompt (captured by default)
- `raw_request_payload` - The user prompt/action sent to the LLM (requires `CAPTURE_RAW_LLM=true`)
- `raw_response_text` - Raw LLM output before parsing (requires `CAPTURE_RAW_LLM=true`)

Server configuration:
Default Evidence Capture:
Raw LLM capture is enabled by default in the server ($PROJECT_ROOT/llm_service.py):
- `CAPTURE_RAW_LLM` defaults to `"true"` - no env var required

For tests needing higher limits, `server_utils.DEFAULT_EVIDENCE_ENV` provides overrides:
DEFAULT_EVIDENCE_ENV = {
"CAPTURE_RAW_LLM": "true", # Server default, included for explicitness
"CAPTURE_RAW_LLM_MAX_CHARS": "50000", # Test override (server: 20000)
"CAPTURE_SYSTEM_INSTRUCTION_MAX_CHARS": "120000",
}
This ensures raw request/response capture works automatically without manual configuration.
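A sketch of launching the server with these overrides applied; the import path is an assumption based on this document's references and may differ in your tree:

```python
import os
import subprocess

from mvp_site.testing_mcp.lib import server_utils  # assumed import path

# Merge the evidence overrides into the inherited environment
env = {**os.environ, **server_utils.DEFAULT_EVIDENCE_ENV}
subprocess.Popen(
    ["python", "-m", "mvp_site.main", "serve"],  # serve command per the capture example above
    env=env,
)
```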
Test output files should be self-contained with embedded provenance:
{
"test_name": "feature_validation",
"timestamp": "2025-12-27T05:00:00Z",
"provenance": {
"git_head": "abc123def456...",
"git_branch": "feature-branch",
"server_url": "http://localhost:8001"
},
"steps": [ ... ],
"summary": { ... }
}
Why embedded provenance:
These requirements elevate evidence from "probably correct" to "provably correct."
Assertions are not evidence. Capture raw command output AND exit codes:
# ❌ BAD - Assertion only
echo "Fix commit is ancestor of test HEAD"
# ✅ GOOD - Raw output with exit code
echo "Command: git merge-base --is-ancestor $FIX_COMMIT $TEST_HEAD"
git merge-base --is-ancestor $FIX_COMMIT $TEST_HEAD
ANCESTRY_EXIT=$?
echo "Exit code: $ANCESTRY_EXIT"
echo "Interpretation: Exit 0 = TRUE (is ancestor), Exit 1 = FALSE (not ancestor), 128+ = error"
Every git command needs context. Capture pwd to prove which repo:
echo "Working directory: $(pwd)"
echo "Git root: $(git rev-parse --show-toplevel)"
git rev-parse HEAD
PR checks don't prove which commit was tested. Link checks to specific SHA:
# Get the HEAD SHA being tested
HEAD_SHA=$(git rev-parse HEAD)
# Fetch check runs for that specific SHA
# Note: :owner/:repo is auto-inferred from git remote when run in a cloned repo
gh api repos/:owner/:repo/commits/$HEAD_SHA/check-runs \
--jq '.check_runs[] | {name, status, conclusion, html_url}'
Filter out placeholder checks:
- Checks with no `html_url` or with `completedAt = 0001-01-01T00:00:00Z`
- Third-party checks (e.g., cursor.com) that aren't GH Action runs

Server health ≠ server code version. Tie the gunicorn PID to its git state:
# Get gunicorn process listening on port 8005
PID=$(pgrep -f "gunicorn.*:8005" | head -1)
# Get its working directory (cross-platform)
if [ -L "/proc/$PID/cwd" ]; then
SERVER_CWD=$(readlink -f "/proc/$PID/cwd") # Linux
else
SERVER_CWD=$(lsof -a -p "$PID" -d cwd 2>/dev/null | tail -1 | awk '{print $NF}') # macOS
fi
# Verify git HEAD in server's working directory
git -C "$SERVER_CWD" rev-parse HEAD
Spread-out timestamps break provenance chains. Collect all evidence in one pass:
#!/bin/bash
# Single-pass evidence collection
TIMESTAMP=$(date -u +%Y-%m-%dT%H:%M:%SZ)
EVIDENCE_DIR="/tmp/evidence_$(date +%s)"
mkdir -p "$EVIDENCE_DIR"
# Capture all state in rapid succession (< 60 seconds)
echo "Collection started: $TIMESTAMP" > "$EVIDENCE_DIR/log.txt"
curl -s http://localhost:8005/health >> "$EVIDENCE_DIR/server_state.txt"
git rev-parse HEAD >> "$EVIDENCE_DIR/git_state.txt"
# Run test
python test.py >> "$EVIDENCE_DIR/test_output.txt"
echo "Collection ended: $(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "$EVIDENCE_DIR/log.txt"
Automated summaries must match raw data. Always verify:
# If summary says "Copilot comments: 4"
# Raw data MUST show 4 entries with user.login matching
# Example: raw data shows .user.login == "Copilot" (not "github-copilot[bot]")
# ❌ BAD - Exact match misses variations
jq '[.[] | select(.user.login == "github-copilot[bot]")]' # Misses "Copilot"
# ✅ GOOD - Case-insensitive pattern matching
jq '[.[] | select(.user.login | test("copilot"; "i"))]'
When generating evidence documentation (methodology, evidence summary, notes), these rules prevent common mismatches:
Never hardcode documentation content. Generate it from source data:
# ❌ BAD - Hardcoded claim
methodology = "WORLDAI_DEV_MODE=true"
# ✅ GOOD - Read from actual environment
dev_mode = os.environ.get("WORLDAI_DEV_MODE", "not set")
methodology = f"WORLDAI_DEV_MODE: {dev_mode}"
Silent drops hide real mismatches. Track and report edge cases:
# ❌ BAD - Silently skip missing items
items = [registry[id] for id in seeds if id in registry]
# ✅ GOOD - Track and warn
missing_ids = []
items = []
for id in seeds:
    if id in registry:
        items.append(registry[id])
    else:
        missing_ids.append(id)
if missing_ids:
    notes += f"WARNING: Missing IDs: {missing_ids}"
"Found X/Y (need Z)" must use correct Y. Common mistake: using min_required as denominator:
# ❌ BAD - Misleading: "4/2 (need 2)" suggests 200% match
stats_col = f"{found}/{min_required} (need {min_required})"
# ✅ GOOD - Clear: "4/4 (need 2)" shows 4 of 4 found, 2 was minimum
stats_col = f"{found}/{len(total_required)} (need {min_required})"
Don't claim "bug fix" vs "new feature" unless explicit. These are product decisions:
# ❌ BAD - Makes unverifiable claim
methodology += "## New Feature (Not Bug Fix)\nThis is a new feature..."
# ✅ GOOD - Stick to verifiable facts
methodology += "## Test Scope\nValidates equipment display functionality."
Subprocess output alone doesn't prove success. Check returncode:
# ❌ BAD - Ignores failures
result = subprocess.run(cmd, capture_output=True)
print(result.stdout)
# ✅ GOOD - Warns on failure
result = subprocess.run(cmd, capture_output=True)
print(result.stdout)
if result.returncode != 0:
print(f"WARNING: Command exited with code {result.returncode}")
Evidence bundles must reference exactly ONE test run. Ambiguous artifact scope breaks traceability:
# ❌ BAD - Copies entire directory with multiple runs
--artifact /path/to/all_runs/
# ✅ GOOD - Copies specific run only
--artifact /path/to/all_runs/run_20251227T051227_953691
Not all passes are equal. Track evidence quality with explicit pass types:
| Pass Type | Criteria | Evidence Value |
|---|---|---|
| STRONG | All conditions met with robust proof | Highest - conclusive |
| WEAK | Core requirement proven, secondary conditions not met | Valid but with caveats |
| FAIL | Core requirement not proven | Invalid |
Example implementation:
# Track pass strength in evidence
strong_pass = primary_condition and secondary_condition
weak_pass = primary_condition and not secondary_condition
passed = primary_condition # Core requirement
result = {
"status": "PASS" if passed else "FAIL",
"pass_type": "strong" if strong_pass else ("weak" if weak_pass else "fail"),
...
}
Why this matters:
APIs often return only changed fields. Tests MUST handle missing fields correctly:
# ❌ BAD - Treats missing field as False
still_in_combat = response.get("combat_state", {}).get("in_combat") is True
# ✅ GOOD - Check field presence, use fallback logic
combat_state = response.get("combat_state", {})
if "in_combat" in combat_state:
# Explicit value - use it
still_in_combat = combat_state["in_combat"] is True
elif combat_state.get("current_round") is not None:
# Partial update with round info - combat ongoing
still_in_combat = True
else:
# Fall back to previous state
still_in_combat = previous_combat_state
Common partial update scenarios:
- Response includes `current_round` but omits an unchanged `in_combat`
- Response includes `hp_current` but omits an unchanged `level`

Always set explicit model settings to avoid `PROVIDER_SELECTION_NULL_SETTINGS` fallback noise:
from lib.model_utils import settings_for_model, update_user_settings
# ❌ BAD - Relies on server defaults, causes fallback noise in logs
result = process_action(client, user_id=user_id, campaign_id=campaign_id, ...)
# ✅ GOOD - Explicit model pinning before any actions
DEFAULT_MODEL = "gemini-3-flash-preview"
# Pin model settings at test start
update_user_settings(
    client,
    user_id=user_id,
    settings=settings_for_model(DEFAULT_MODEL),
)
# Now process actions with deterministic model selection
result = process_action(client, user_id=user_id, campaign_id=campaign_id, ...)
Why this matters:
LLMs may "shortcut" scenarios by resolving them too quickly. Force extended scenarios when needed:
# ❌ BAD - LLM may end combat in 1-2 actions
SCENARIO = "Fight three bandits"
# ✅ GOOD - Forces multi-round combat
SCENARIO = """You face an Ogre Warlord (CR 5, HP 120, AC 16) and two Ogre Guards (CR 2, HP 59 each).
This is a BOSS FIGHT - it CANNOT be resolved in fewer than 3 combat rounds.
DO NOT end combat prematurely. All enemies have full HP and fight to the death."""
Forcing techniques:
If you're about to:
Evidence JSON files must use relative paths for portability:
// ❌ BAD - Absolute paths break when bundle is moved
{
"artifacts_dir": "/tmp/worktree_worker7/dev123/e2e_test",
"output_file": "/tmp/worktree_worker7/dev123/results.json"
}
// ✅ GOOD - Relative paths work anywhere
{
"artifacts_dir": "./artifacts",
"output_file": "./results.json"
}
Post-processing requirement: After copying test output into a bundle, rewrite any absolute paths so all references are bundle-relative, then regenerate the affected checksums.
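A sketch of such a rewrite pass (the helper and paths are illustrative):

```python
from pathlib import Path

def relativize_paths(bundle_file: Path, old_root: str) -> None:
    """Rewrite absolute paths under old_root as bundle-relative references."""
    text = bundle_file.read_text()
    bundle_file.write_text(text.replace(old_root.rstrip("/") + "/", "./"))
    # Remember to regenerate this file's .sha256 after rewriting

relativize_paths(Path("results.json"), "/tmp/worktree_worker7/dev123")  # hypothetical paths
```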
Evidence bundles must have exactly one layer of checksums:
| Strategy | When to Use |
|---|---|
| Per-file `.sha256` | Simple bundles, few files |
| Root `checksums.sha256` | Complex bundles, many files |
| NEVER both | Causes `.sha256.sha256` pollution |
When packaging artifacts that already have checksums:
# Clean existing checksums before copying
find /path/to/source -name "*.sha256" -delete
# Then copy artifacts to bundle
cp -r /path/to/source "${EVIDENCE_DIR}/artifacts/"
Manual cleanup alternative:
# Remove all .sha256 files from artifact source before copying
find /path/to/source -name "*.sha256" -delete
# Then create fresh checksums at bundle level
cd /bundle/root
find . -type f ! -name "*.sha256" -exec sha256sum {} \; > checksums.sha256
When creating evidence for a PR, always capture the full diff from origin/main:
# ❌ AVOID - Last commit only (may miss PR context after merge)
git diff HEAD~1..HEAD
# ✅ PREFER - Full PR diff from origin/main
git diff origin/main...HEAD > "${EVIDENCE_DIR}/pr_diff.txt"
git diff --stat origin/main...HEAD > "${EVIDENCE_DIR}/pr_diff_summary.txt"
Why this matters:
- `HEAD~1..HEAD` captures only the last commit
- `origin/main...HEAD` always captures the full PR diff

Inferential evidence is insufficient. "Action succeeded, therefore truncation worked" is not proof.
# ❌ INFERENTIAL - Proves nothing about internal behavior
"budget_truncation_proof": "Action succeeded with memories exceeding budget = truncation worked"
# ✅ DIRECT - Runtime logs FROM THE CODE showing selection/exclusion
[MEMORY_BUDGET] Input: 605 memories, 43,816 tokens (budget: 40,000)
[MEMORY_BUDGET] TRUNCATED: 554 selected, 51 excluded, 39,959 tokens used
When claiming internal behavior (truncation, filtering, deduplication), the code MUST produce logs that prove the behavior:
# ❌ BAD - No evidence of truncation behavior
def select_memories_by_budget(memories, max_tokens):
    # ... selection logic ...
    return selected_memories
# ✅ GOOD - Runtime evidence captured in logs
def select_memories_by_budget(memories, max_tokens):
    logging_util.info(
        f"[MEMORY_BUDGET] Input: {len(memories)} memories, "
        f"{total_tokens:,} tokens (budget: {max_tokens:,})"
    )
    # ... selection logic ...
    logging_util.info(
        f"[MEMORY_BUDGET] TRUNCATED: {len(result)} selected, "
        f"{excluded_count} excluded, {final_tokens:,} tokens used"
    )
    return result
Key principle: If you can't point to a log line proving the behavior, add logging to produce that evidence.
For any test claiming internal behavior:
# Start server with log capture
nohup python -m mvp_site.mcp_api --http-only --port 8003 > "$EVIDENCE_DIR/server_logs.txt" 2>&1 &
# Run tests against that server
python test.py --server-url http://127.0.0.1:8003
# Extract proof
grep "MEMORY_BUDGET" "$EVIDENCE_DIR/server_logs.txt" > "$EVIDENCE_DIR/memory_budget_proof.txt"
Beyond basic provenance, include:
cat > "$EVIDENCE_DIR/git_provenance_full.txt" << EOF
=== CURRENT STATE ===
Branch: $(git rev-parse --abbrev-ref HEAD)
Commit: $(git rev-parse HEAD)
=== COMMIT DETAILS ===
$(git log -1 --format="Author: %an <%ae>%nDate: %aI%nSubject: %s")
=== RECENT COMMITS ON BRANCH ===
$(git log --oneline -10)
=== ORIGIN/MAIN REFERENCE ===
origin/main: $(git rev-parse origin/main)
=== DIFF FROM ORIGIN/MAIN ===
$(git diff --stat origin/main...HEAD)
=== COMMITS AHEAD/BEHIND ===
Ahead: $(git rev-list --count origin/main..HEAD)
Behind: $(git rev-list --count HEAD..origin/main)
=== MODIFIED FILES ===
$(git diff --name-only origin/main...HEAD)
EOF
For key evidence files, use per-file checksums alongside the artifact:
# Generate per-file checksums
for file in server_logs.txt memory_budget_proof.txt server_env_capture.txt; do
shasum -a 256 "$file" > "${file}.sha256"
done
# Results in:
# server_logs.txt
# server_logs.txt.sha256
# memory_budget_proof.txt
# memory_budget_proof.txt.sha256
Why per-file: Easier to verify individual artifacts; no need to parse a combined file.
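Verification is then a simple loop over the checksum files:

```bash
# Verify each artifact independently from the evidence directory
for f in *.sha256; do
  shasum -a 256 -c "$f"
done
```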
A PR is self-contained when a stranger on a brand-new computer, with no prior access to the repo, can:
Gist MUST contain (not just link to):
- `git clone` URL + branch checkout command

PR description MUST contain:
What "self-contained" is NOT:
- A committed MP4 under `evidence/<bundle>/` is fine — but PR evidence MUST also include a public, directly downloadable URL for the same MP4.

Rule: GIFs and MP4s MUST be hosted on a public GitHub repo release.
Private repo release assets are inaccessible to anonymous viewers and will not render as inline images in GitHub PR descriptions.
# Upload to a public repo (e.g. agent-orchestrator)
gh release create evidence-pr-<N>-v<M> \
terminal_demo.gif terminal_demo.mp4 \
browser_ui_demo.gif browser_ui_demo.mp4 \
--title "PR #N Evidence vM" \
--notes "Evidence for PR $GITHUB_REPOSITORY#N" \
-R jleechanorg/agent-orchestrator
# Verify accessibility (expect HTTP/2 302 → CDN redirect = accessible)
curl -sI "https://github.com/jleechanorg/agent-orchestrator/releases/download/evidence-pr-<N>-v<M>/browser_ui_demo.gif" | head -2
https://github.com/jleechanorg/agent-orchestrator/releases/download/<tag>/<filename>
# Must return 302, not 404
gh api repos/jleechanorg/agent-orchestrator/releases/tags/<tag> \
--jq '.assets[] | {name: .name, state: .state}'
# All states must be "uploaded"
- `releases/download/...` from a private repo → 404 for viewers
- `.mp4` attached to a PR comment → not downloadable via direct URL
- Files under `evidence/` → requires cloning the (private) repo

Every non-trivial verification requires TWO videos. Neither alone is sufficient.
| Type | Proves | Required when |
|---|---|---|
| Tmux/terminal video | Code ran against a specific commit, tests passed, no tampering | Any code change, test run, deploy, agent session |
| UI/browser video | Visual behavior and click flows occurred as claimed | Any claim touching rendered UI, user flows, or browser behavior |
If your work has no UI component, tmux video alone is acceptable. If your work has both terminal and UI steps, both are required.
Full template: ~/.claude/skills/evidence-standards/tmux-video-evidence.md
Tool: asciinema + agg (.cast → .gif)
Mandatory sections (in order):
| # | Section | Must show |
|---|---|---|
| 1 | Git Provenance | git rev-parse HEAD, branch, merge-base |
| 2 | Commit Log | git log --oneline origin/main..HEAD |
| 3 | Code Diffs | git diff origin/main...HEAD (not just --stat) |
| 4 | PR Status | gh pr view <N> |
| 5 | Live Work | Real test/deploy/command output |
| 6 | Post-run SHA | Same git rev-parse HEAD — must match section 1 |
Hard blocks (evidence rejected if any apply):
echo "PASS" instead of real test runner output--stat only (no actual diff shown)Full template: ~/.claude/skills/evidence-standards/ui-video-evidence.md
Tool: mcp__claude-in-chrome__gif_creator (agent sessions) or Kap (manual)
Mandatory frames (in order):
| # | Frame | Must show |
|---|---|---|
| 1 | URL + Page Load | Full address bar with the route under test |
| 2 | Before State | Initial state before the action |
| 3 | Action | The click, input, or navigation |
| 4 | After State | Result — success message, data change, navigation |
| 5 | Git Linkage | Terminal split or console log showing git rev-parse HEAD |
Frame extraction (MUST): For every video evidence bundle, extract frames with ffmpeg and verify content matches claims:
mkdir -p /tmp/frames
ffmpeg -i <video.webm> -vf fps=1 /tmp/frames/frame_%03d.png
Verify at minimum:
Hard blocks (evidence rejected if any apply):
CI and review MUST reject evidence bundles that claim UI behavior without video proof.
| Claim Type | Required Video | Rejection if Missing |
|---|---|---|
| "Button click works" | UI GIF/recording | Auto-reject |
| "Tests pass" | Tmux recording or CI log link | Warning |
| "Page renders correctly" | UI GIF/recording | Auto-reject |
| "No visual regression" | Before/after UI comparison | Auto-reject |
| "Terminal output shows X" | Tmux recording or inline capture | Warning |
Enforcement levels:
- `CLAUDE.md` - Three Evidence Rule (lines 110-113)
- `generatetest.toml` - Mock mode prohibition (lines 433-441)
- `end2end-testing.md` - Test mode commands (`/teste`, `/tester`, `/testerc`)
- `browser-testing-ocr-validation.md` - OCR evidence for visual claims
- `tmux-video-evidence.md` - Full tmux/asciinema recording template
- `ui-video-evidence.md` - Full UI/browser GIF recording template