Runs a skeptical QA gate for Flutter Flame games: executes static analysis, tests, stub detection, and contract validation. Default mode checks functionality; --strict adds quality scoring and edge-case sweeps.
Slash command: /flutter-flame-harness:flame-harness-evaluator
Phase 6 of the flutter-flame-harness pipeline. Skeptical QA gate that decides PASS or FAIL
against the negotiated contract. Default mode runs the functional check (6.1) only; --strict
adds quality scoring (6.2) and an agent-team edge-case sweep (6.3).
All file schemas (config.md, state.md, handoff/round-N-gen.md, feedback/round-N-qa.md,
build-log.md) and the phase transition table are defined in docs/harness-protocol.md — that
document is the single source of truth (§1 for config.md; §2 for state.md; §3 for
contract.md; §4 for handoff layout; §5 for feedback layout; §6 for log schemas; §7 for the
evaluator → admob / evaluator → build / evaluator → generator transitions). Do not redefine schemas here.
"Run the code, see the app, then judge." Never PASS on code review alone. Execute commands, launch the game on a simulator, play the core loop, capture and study screenshots. Stub detected = automatic FAIL, no exceptions.
Before any check, load:
1. docs/harness/state.md: extract current_round (integer ≥ 1) and confirm next_role: evaluator.
2. docs/harness/config.md: extract app_slug, strict_mode (bool), max_rounds (int).
3. docs/harness/contract.md: parse ## Mandatory Hard Gates and ## Functional Criteria. Confirm ## Status: AGREED is present; if missing, abort with:
   flame-harness-evaluator: contract not AGREED — run flame-harness-contract first.
4. docs/harness/handoff/round-<N>-gen.md (where N = current_round) per protocol §4. If the file is missing, abort with:
   flame-harness-evaluator: handoff for round <N> not found — generator must run first.

Run every step in order. A failure on any Mandatory Hard Gate from contract.md is an
immediate FAIL — do not continue checking remaining criteria.
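The preflight loading described above could be sketched roughly as follows. This is only an illustration: it assumes state.md holds flat `key: value` lines (the real schema lives in docs/harness-protocol.md §2), and the helper names `parse_flat_fields` and `preflight`, as well as the two generic `abort:` messages, are hypothetical.

```python
import re

def parse_flat_fields(text: str) -> dict:
    """Collect top-level `key: value` lines into a dict (values kept as strings).

    Assumption: the file uses flat `key: value` lines; trailing ` # comment`
    text and surrounding double quotes are dropped.
    """
    fields = {}
    for line in text.splitlines():
        m = re.match(r'^([A-Za-z_][A-Za-z0-9_]*):\s*(.*)$', line)
        if m:
            value = m.group(2).split(' #', 1)[0].strip().strip('"')
            fields[m.group(1)] = value
    return fields

def preflight(state_text: str, contract_text: str) -> str:
    """Return 'ok' or the abort message for the first failed precondition."""
    state = parse_flat_fields(state_text)
    round_value = state.get('current_round', '')
    if not round_value.isdigit() or int(round_value) < 1:
        return 'abort: current_round must be an integer >= 1'  # hypothetical message
    if state.get('next_role') != 'evaluator':
        return 'abort: next_role is not evaluator'  # hypothetical message
    if '## Status: AGREED' not in contract_text:
        # Message text taken from the skill description above.
        return ('flame-harness-evaluator: contract not AGREED — '
                'run flame-harness-contract first.')
    return 'ok'
```

The handoff-file check is omitted here because it is a plain existence test on docs/harness/handoff/round-<N>-gen.md.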
cd <projects-dir>/<app_slug>
flutter analyze
Required result: 0 issues. If any issues are reported, record each with file, line, and message. This is a Mandatory Hard Gate.
cd <projects-dir>/<app_slug>
flutter test
Required result: 0 failures. Capture the full output. This is a Mandatory Hard Gate.
grep -rn "TODO\|stub\|placeholder\|스텁\|미구현" \
<projects-dir>/<app_slug>/lib/ --include="*.dart"
Any match is an automatic FAIL. No exceptions (per docs/harness-protocol.md §3 Hard
Gate 3). Record each match with file path, line number, and the matched text.
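For recording matches in the feedback file, the same scan can be approximated in Python so that each hit comes back as structured data. This is a sketch, not part of the harness: `find_stub_markers` is a hypothetical helper, and it mirrors the case-sensitive grep pattern above.

```python
import re

# Same alternatives as the grep pattern above (case-sensitive).
STUB_PATTERN = re.compile(r'TODO|stub|placeholder|스텁|미구현')

def find_stub_markers(path: str, source: str) -> list:
    """Return (path, line_number, matched_text) for every stub marker in source."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for m in STUB_PATTERN.finditer(line):
            hits.append((path, lineno, m.group(0)))
    return hits
```

An empty result list corresponds to this gate passing; any entry is an automatic FAIL.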
grep -rn "[0-9]\{3,\}\.\?[0-9]*" \
<projects-dir>/<app_slug>/lib/game/ --include="*.dart" \
| grep -v "game_config.dart"
Magic numbers (3+ digits) in game logic files other than game_config.dart are a Hard Gate
failure (per protocol §3 Hard Gate 4). Exclude intentional non-tuning constants (e.g., HTTP
status codes) if clearly justified in a comment; document any exclusion in the feedback file.
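A rough Python counterpart of the magic-number scan is sketched below. It assumes the caller has already read the Dart files into a dict; `find_magic_numbers` is a hypothetical name, and the `has_comment` flag only marks lines a reviewer should inspect for a justification, it does not decide the exclusion.

```python
import re

# Equivalent of the grep pattern above: 3+ digits, optional decimal part.
MAGIC = re.compile(r'[0-9]{3,}\.?[0-9]*')

def find_magic_numbers(files: dict) -> list:
    """files maps path -> source text.

    Returns (path, line_number, literal, has_comment) tuples for every
    3+ digit literal outside game_config.dart.
    """
    hits = []
    for path, source in sorted(files.items()):
        if path.endswith('game_config.dart'):
            continue  # tuning constants are allowed to live here
        for lineno, line in enumerate(source.splitlines(), start=1):
            for m in MAGIC.finditer(line):
                hits.append((path, lineno, m.group(0), '//' in line))
    return hits
```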
cd <projects-dir>/<app_slug>
flutter gen-l10n
Then confirm that every configured lib/l10n/app_<locale>.arb (the project's default_language,
plus app_en.arb when default_language ≠ en) contains identical key sets:
python3 -c "
import glob, json, sys
arbs = sorted(glob.glob('lib/l10n/app_*.arb'))
if not arbs: print('NO ARB FILES'); sys.exit(1)
# Compare message keys only; '@'-prefixed entries are ARB metadata (often present only in the template file)
keysets = {f: {k for k in json.load(open(f, encoding='utf-8')) if not k.startswith('@')} for f in arbs}
allkeys = set().union(*keysets.values())
bad = {f: sorted(allkeys - ks) for f, ks in keysets.items() if allkeys - ks}
if bad: print('MISSING KEYS:', bad); sys.exit(1)
print('l10n OK', list(keysets))
"
Missing keys in either locale are a Hard Gate failure (protocol §3 Hard Gate 6).
Verify the ## Platform-Robustness Gates (R1–R11) from contract.md / docs/game-gotchas.md:
# R1 audio: frequent SFX pooled + audio calls guarded; BGM stop on teardown
grep -rn "AudioPool" lib/ || echo "WARN: no AudioPool — frequent SFX may stutter"
grep -rn "FlameAudio\|AudioPool\|\.bgm" lib/ | grep -i "try\|catch" >/dev/null || echo "CHECK: audio not in try/catch"
grep -rn "bgm.stop\|\.stop()" lib/ || echo "CHECK: BGM stop on teardown/background"
# R2 haptics (if used)
ls lib/systems/haptics.dart 2>/dev/null && grep -nE "kIsWeb|isIOS|isAndroid|enabled|Stopwatch|elapsed" lib/systems/haptics.dart
# R3 lifecycle
grep -rn "WidgetsBindingObserver\|didChangeAppLifecycleState\|pauseEngine" lib/ || echo "FAIL: no lifecycle pause"
# R4 performance: no per-frame whereType in update hot paths
grep -rn "whereType" lib/ | grep -i "update" && echo "CHECK: whereType inside update — verify it's cached, not per-frame"
# R5 branding: custom icon + splash + localized display name (not defaults)
ls assets/icons/icon.png 2>/dev/null || echo "FAIL: no custom app icon"
grep -q "flutter_launcher_icons" pubspec.yaml && grep -q "flutter_native_splash" pubspec.yaml || echo "FAIL: launcher-icons/native-splash not configured"
sips -g hasAlpha assets/icons/icon.png 2>/dev/null | grep -qi "hasAlpha: no" || echo "CHECK: icon may have alpha (App Store rejects)"
grep -rn "CFBundleDisplayName" ios/Runner/Info.plist 2>/dev/null || echo "CHECK: iOS display name not set"
grep -rn 'android:label' android/app/src/main/AndroidManifest.xml 2>/dev/null | grep -qiv "runner" || echo "CHECK: Android label still default/runner"
# R6 native config: orientation locked (matches config.orientation, no ~ipad), iPhone-only, export compliance, root back
grep -q "UISupportedInterfaceOrientations" ios/Runner/Info.plist 2>/dev/null && ! grep -q "UISupportedInterfaceOrientations~ipad" ios/Runner/Info.plist || echo "CHECK: orientation not locked / ~ipad still present"
grep -rq "TARGETED_DEVICE_FAMILY = 1" ios/Runner.xcodeproj/project.pbxproj 2>/dev/null || echo "FAIL: not iPhone-only (TARGETED_DEVICE_FAMILY != 1)"
grep -q "ITSAppUsesNonExemptEncryption" ios/Runner/Info.plist 2>/dev/null || echo "CHECK: export-compliance key missing"
grep -rq "PopScope" lib/ || echo "CHECK: no root back-button handler"
# R6 bundle id identical on both platforms (== config.bundle_id, lowercase no _/-)
IOSID=$(grep -oE 'PRODUCT_BUNDLE_IDENTIFIER = [A-Za-z0-9._-]+' ios/Runner.xcodeproj/project.pbxproj 2>/dev/null | head -1 | sed 's/.*= //')
ANDID=$(grep -oE 'applicationId *= *"[^"]+"' android/app/build.gradle.kts 2>/dev/null | head -1 | sed -E 's/.*"([^"]+)"/\1/')
echo "iOS=$IOSID Android=$ANDID (must be byte-identical == config.bundle_id)"
{ [ -n "$IOSID" ] && [ "$IOSID" = "$ANDID" ]; } || echo "FAIL: iOS and Android bundle id differ (or unset)"
echo "$IOSID" | grep -qE '^[a-z0-9.]+$' || echo "FAIL: bundle id has uppercase/_/- (must be lowercase [a-z0-9.])"
# R7 assets & CI: audio present, no missing-asset refs, CI workflow exists
ls assets/audio/*.wav assets/audio/*.mp3 assets/audio/*.ogg 2>/dev/null | grep -q . || echo "FAIL: no audio assets — game ships silent"
ls .github/workflows/*.yml .github/workflows/*.yaml 2>/dev/null | grep -q . || echo "FAIL: no CI workflow"
# R8 Play listing graphics: hi-res icon (512) + feature graphic (1024x500) per locale
ls android/fastlane/metadata/android/*/images/icon.png 2>/dev/null | grep -q . || echo "FAIL: no Play hi-res icon (512x512)"
ls android/fastlane/metadata/android/*/images/featureGraphic.png 2>/dev/null | grep -q . || echo "FAIL: no Play feature graphic (1024x500)"
# R9 durable save: SaveRepository mirrors to Keychain + Block Store + prefs; PreferencesService routes through it
grep -q "flutter_secure_storage" pubspec.yaml && grep -q "play_services_block_store" pubspec.yaml || echo "FAIL: durable-save deps missing (saves won't survive reinstall/device on iOS)"
ls lib/data/save_repository.dart 2>/dev/null || echo "FAIL: no SaveRepository (durable save layer)"
grep -q "FlutterSecureStorage" lib/data/save_repository.dart 2>/dev/null && grep -q "PlayServicesBlockStore" lib/data/save_repository.dart 2>/dev/null || echo "FAIL: SaveRepository missing a durable tier (Keychain/Block Store)"
# R10 accessibility & safety: reduce-motion read, menu Semantics present, no >3Hz flashing
grep -rq "disableAnimations" lib/ || echo "CHECK: OS Reduce Motion (MediaQuery.disableAnimations) not respected"
grep -rq "Semantics(" lib/screens/ lib/ui/ 2>/dev/null || echo "CHECK: no Semantics labels on menu/overlay buttons (screen-reader nav)"
grep -rniE "flash|strobe|invert" lib/ | grep -i "update\|timer\|0\.0[0-9]" && echo "CHECK: verify no full-screen flash faster than 3/sec (photosensitive safety)"
# R11 test depth: >=3 system unit tests + >=1 widget test + >=1 integration test
UNIT=$(ls test/*_test.dart 2>/dev/null | grep -viE 'widget' | wc -l | tr -d ' ')
grep -rlq "testWidgets" test/ 2>/dev/null || echo "FAIL: no widget test (R11 needs >=1)"
ls integration_test/*.dart >/dev/null 2>&1 || echo "FAIL: no integration test (R11 needs >=1 core-loop test)"
[ "${UNIT:-0}" -ge 3 ] || echo "CHECK: fewer than 3 unit test files (R11 wants >=3 system unit tests)"
# every asset path declared in pubspec must exist on disk
python3 - <<'PY'
import re,glob,os,sys
try: txt=open('pubspec.yaml').read()
except FileNotFoundError: sys.exit(0)
m=re.search(r'\n\s*assets:\s*\n((?:\s*-\s.*\n)+)', txt)
missing=[]
if m:
    for line in m.group(1).splitlines():
        p=line.strip().lstrip('- ').strip().strip('"\'')
        if not p: continue
        if p.endswith('/'):
            if not (os.path.isdir(p) and os.listdir(p)): missing.append(p)
        elif not os.path.exists(p): missing.append(p)
print('MISSING ASSETS:',missing) if missing else print('assets OK')
PY
Judge results against docs/game-gotchas.md: missing lifecycle pause (R3) or unguarded audio that
can crash on a missing asset (R1) is a FAIL. A per-frame whereType in a hot update path, or
raw HapticFeedback.* in gameplay without the guarded helper, is a FAIL when the game relies on it.
Also confirm the game does not crash when an audio/image asset is missing (it should degrade).
Branding (R5): default Flutter icon/splash, an alpha-channel icon, or a "Runner"/slug display
name is a FAIL — the app must look shipped. Native config (R6): the app must open directly in
config.orientation (no rotate flash → orientation locked natively, ~ipad removed), be iPhone-only,
have ITSAppUsesNonExemptEncryption=false, and handle the root back-button (SnackBar double-exit).
Assets & CI (R7): no audio (game ships silent), any missing declared asset, or a missing CI
workflow is a FAIL. Store graphics (R8): a missing Play hi-res icon (512×512) or feature graphic
(1024×500) is a FAIL — the listing can't publish without them.
Durable save (R9): persistence on shared_preferences alone is a FAIL — iOS drops it on app
delete, so the player loses progress on reinstall/new device. There must be a SaveRepository
mirroring to iOS Keychain + Android Block Store (durable-first read, all-tier write, try/catch), with
PreferencesService routing through it.
Accessibility & safety (R10): a full-screen effect that flashes faster than 3×/second is a FAIL
(photosensitive-seizure safety). Reduce Motion not read (MediaQuery.disableAnimations) and icon-only
menu/overlay buttons with no Semantics label are CHECK-level — a FAIL only if the game leans on
heavy motion with no damping. Tap targets on menu buttons should be ≥48×48 dp.
Test depth (R11): a passing flutter test that only checks GameConfig constants is a FAIL —
there must be ≥3 system unit tests + ≥1 widget test + ≥1 integration test, and all must pass.
For each criterion in contract.md §§ "Mandatory Hard Gates" and "Functional Criteria", record
a row in the feedback evidence table showing the command run, the result, and any screenshot or
log path. Do not mark a criterion DONE without running a command that directly verifies it.
This step is mandatory. A PASS verdict is not valid without completing it.
Boot the iOS simulator (or Android emulator if iOS is unavailable):
open -a Simulator
xcrun simctl boot "iPhone 16" 2>/dev/null || true
Install and launch the game:
cd <projects-dir>/<app_slug>
flutter run -d "iPhone 16" --no-pub
While the game is running:
mkdir -p <projects-dir>/<app_slug>/docs/harness/screenshots
xcrun simctl io booted screenshot \
<projects-dir>/<app_slug>/docs/harness/screenshots/round-<N>-ios-menu.png
xcrun simctl io booted screenshot \
<projects-dir>/<app_slug>/docs/harness/screenshots/round-<N>-ios-play.png
xcrun simctl io booted screenshot \
<projects-dir>/<app_slug>/docs/harness/screenshots/round-<N>-ios-end.png
Any crash, blank screen, or console error is a Hard Gate failure (protocol §3 Hard Gates 7–8).
6.2 Quality scoring (--strict only)
Run this section only when strict_mode: true in config.md or --strict is passed.
Score the game on four axes, each 0–10:
| Axis | What to assess |
|---|---|
| Game feel / juice | Responsiveness, animations, feedback on actions, audio cues |
| Originality | Freshness relative to competitors identified in the research phase |
| Craft | Code quality, absence of jank, visual polish, consistent design tokens |
| Functionality | All contract criteria met, no edge-case breakage observed |
Also assess:
Scoring threshold:
- Default (strict_mode: false): weighted average ≥ 7 / 10 to advise PASS (advisory only).
- Strict (strict_mode: true): weighted average ≥ 8 / 10 required for PASS.

Weight: Game feel 30 %, Originality 20 %, Craft 25 %, Functionality 25 %.
Record the score for each axis and the weighted total in the feedback file. If the weighted total is below threshold, the verdict is FAIL regardless of 6.1 results, and each axis below 7 must have at least one specific, reproducible fix listed.
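As a worked illustration of the weighting and thresholds above, the arithmetic might look like this. The function and key names (`weighted_total`, `quality_verdict`, `game_feel`, and so on) are hypothetical; only the weights, thresholds, and the "each axis below 7 needs a fix" rule come from the text.

```python
# Weights in percent, as stated above (they sum to 100).
WEIGHTS = {'game_feel': 30, 'originality': 20, 'craft': 25, 'functionality': 25}

def weighted_total(scores: dict) -> float:
    """Weighted average on the 0-10 scale."""
    for axis in WEIGHTS:
        if not 0 <= scores[axis] <= 10:
            raise ValueError(f'{axis} score must be within 0-10')
    return sum(WEIGHTS[axis] * scores[axis] for axis in WEIGHTS) / 100

def quality_verdict(scores: dict, strict_mode: bool) -> dict:
    """Compare the weighted total against the 7 (advisory) or 8 (strict) threshold."""
    threshold = 8 if strict_mode else 7
    total = weighted_total(scores)
    return {
        'total': total,
        'threshold': threshold,
        'meets_threshold': total >= threshold,
        'advisory_only': not strict_mode,
        # Every axis below 7 needs at least one specific, reproducible fix.
        'axes_needing_fix': sorted(a for a in WEIGHTS if scores[a] < 7),
    }
```

For example, scores of 8 / 6 / 9 / 9 give (30·8 + 20·6 + 25·9 + 25·9) / 100 = 8.1, which clears the strict threshold, yet Originality still needs a listed fix because it is below 7.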
6.3 Agent-team edge-case sweep (--strict only)
Run this section only when strict_mode: true in config.md or --strict is passed.
Spawn six specialist agents in parallel via the Agent tool. All six must report PASS for the
overall verdict to be PASS. A single FAIL from any agent is a FAIL verdict.
| Agent role | Brief |
|---|---|
| gameplay-edge | Find gameplay edge cases: score overflow, negative health, unreachable states, off-screen entities, simultaneous collision resolution. |
| balance | Assess difficulty curve: is the game winnable? is it too easy in the first 30 s? does difficulty ramp feel fair? |
| lifecycle/crash | Simulate app backgrounding (home button), screen rotation, incoming call interruption; confirm resume works and no crash. |
| performance | Run flutter run --profile, check frame rendering in DevTools; flag sustained drops below 30 fps. |
| test-generator | Identify the three highest-risk untested paths and write widget/unit tests for them; confirm they pass. |
| adversarial-reviewer | Adversarially review the generated code looking for security issues, credential leaks, or App Store policy violations. |
Each agent must return a structured PASS/FAIL verdict with evidence. Collect all six verdicts before proceeding to Judgment.
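Collecting the six verdicts reduces to an all-must-pass check, which can be sketched as below. `sweep_verdict` is a hypothetical helper, and the shape of the `reports` dict (role name mapped to 'PASS' or 'FAIL') is an assumption; a missing report is treated as not having reported PASS.

```python
AGENT_ROLES = (
    'gameplay-edge', 'balance', 'lifecycle/crash',
    'performance', 'test-generator', 'adversarial-reviewer',
)

def sweep_verdict(reports: dict) -> str:
    """Return 'PASS' only if every one of the six roles reported 'PASS'."""
    for role in AGENT_ROLES:
        if reports.get(role) != 'PASS':
            return 'FAIL'
    return 'PASS'
```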
After completing all applicable sections (6.1, and 6.2 + 6.3 if --strict), write the verdict.
- PASS (with --strict): 6.2 weighted score ≥ threshold AND all six 6.3 agents report PASS.
- FAIL (with --strict): the 6.2 score is below threshold or any 6.3 agent reports FAIL.

Before writing the verdict, check: if current_round == max_rounds, force the judgment. Do
not return FAIL regardless of results: write the verdict on the current state, record the
forced-advance note in the feedback file, and proceed to PASS transitions (per protocol §7
evaluator → max_rounds → admob / build, applying the same skip_admob branch as PASS).
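One way to read the judgment rules, including the forced advance at max_rounds, is the small decision function below. It is a sketch under assumptions: `decide` and its boolean inputs are hypothetical names, and the caller is expected to have computed them from sections 6.1 to 6.3.

```python
def decide(hard_gates_ok: bool, strict: bool, score_ok: bool, sweep_ok: bool,
           current_round: int, max_rounds: int) -> dict:
    """Combine the section results into a routing decision.

    checks_passed  : 6.1 passed and, under --strict, 6.2 and 6.3 passed too
    forced_advance : checks did not pass, but this was the last allowed round
    advance        : follow the PASS transitions (True) or go back to the generator (False)
    """
    checks_passed = hard_gates_ok and (not strict or (score_ok and sweep_ok))
    forced_advance = (not checks_passed) and current_round == max_rounds
    return {
        'checks_passed': checks_passed,
        'forced_advance': forced_advance,
        'advance': checks_passed or forced_advance,
    }
```

When `forced_advance` is true, the feedback file should carry the forced-advance note described above.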
Create docs/harness/feedback/round-<N>-qa.md following the layout in
docs/harness-protocol.md §5. Fill:
- ## Verdict: PASS or FAIL (bold).
- ## Evidence: one row per criterion checked; include screenshot paths for simulator checks.
- ## Failed Criteria: for each FAIL, a specific reproducible fix. If PASS, write "none".

Never leave placeholder text. Every criterion must have a real command output or screenshot path as evidence.
On PASS (or forced advance at max_rounds), the game has been built and has passed QA — but before any deploy work (admob/build/screenshot/submit), there is a human-approval gate by default so the user can actually play and approve the game.
First read skip_admob and auto_deploy from docs/harness/config.md, and decide next_role:
- skip_admob: true → next_role: build
- otherwise → next_role: admob

Then branch on auto_deploy:
Default (auto_deploy: false) — PAUSE for human review.
Write state.md with status: paused, pause_reason: manual_action, and the next_role
decided above (per protocol §7 rule 2). Append a pipeline-log.md row and PRINT a review
checklist for the user:
Game build + QA passed. Check it yourself before deploying:
Play it with cd <app_slug> && flutter run, and review the QA screenshots in docs/harness/screenshots/ and docs/harness/feedback/round-<N>-qa.md. If you are satisfied, run /flame-harness --resume to proceed with deployment (admob→build→screenshot→submit).
status: paused
current_phase: evaluator
next_role: admob # or "build" if skip_admob: true
pause_reason: manual_action
updated_at: "<ISO-8601 UTC now>"
The orchestrator halts on status: paused; on --resume, flame-harness-resume confirms the
user approved and dispatches the stored next_role.
auto_deploy: true — no pause, continue straight to deploy.
Write status: running with the same next_role so the orchestrator auto-continues:
status: running
current_phase: evaluator
next_role: admob # or "build" if skip_admob: true
updated_at: "<ISO-8601 UTC now>"
Leave current_round, created_at, and resume_attempts unchanged. (In the auto_deploy: false
case you set pause_reason: manual_action; in the auto_deploy: true case leave pause_reason
unchanged.)
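The two PASS branches above can be summarised as a function of skip_admob and auto_deploy. The sketch below only computes the fields to overwrite in state.md; `pass_transition` is a hypothetical helper, and updated_at is left to the caller.

```python
def pass_transition(skip_admob: bool, auto_deploy: bool) -> dict:
    """Fields to overwrite in state.md after a PASS (or forced advance)."""
    update = {
        'current_phase': 'evaluator',
        'next_role': 'build' if skip_admob else 'admob',
    }
    if auto_deploy:
        # No pause: the orchestrator continues; pause_reason stays untouched.
        update['status'] = 'running'
    else:
        # Default: stop for the human-approval gate.
        update['status'] = 'paused'
        update['pause_reason'] = 'manual_action'
    return update
```

Fields not returned here (current_round, created_at, resume_attempts) stay as they are.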
On FAIL (and current_round < max_rounds), update docs/harness/state.md per protocol §2 and
the evaluator → generator transition in §7. Increment current_round and set
status: running atomically (protocol §7 rule 2):
status: running
current_phase: evaluator
next_role: generator
current_round: <N+1>
checkpoint: ""
updated_at: "<ISO-8601 UTC now>"
Leave created_at, resume_attempts, and pause_reason unchanged. Reset checkpoint: "" — the
next round is a fresh feedback-driven pass, so the generator must not skip sub-phases (per protocol §2).
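The FAIL transition can be sketched the same way. This assumes state.md has been loaded into a dict with an integer current_round; `fail_transition` is a hypothetical name, and building the whole new state in one step before writing it is one plausible reading of "atomically".

```python
def fail_transition(state: dict, now_iso: str) -> dict:
    """Return the next state after a FAIL with rounds remaining.

    created_at, resume_attempts and pause_reason are carried over unchanged.
    """
    new_state = dict(state)  # copy, so the caller's dict is not mutated
    new_state.update(
        status='running',
        current_phase='evaluator',
        next_role='generator',
        current_round=state['current_round'] + 1,
        checkpoint='',  # fresh pass: the generator must not skip sub-phases
        updated_at=now_iso,
    )
    return new_state
```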
Append one row to docs/harness/build-log.md per protocol §6:
| <N> | evaluator | PASS/FAIL | <duration> | <one-line summary> |
Append one row to docs/harness/pipeline-log.md per protocol §6:
| <ISO-8601 UTC now> | PASS/FAIL | evaluator | round <N>; next: admob|build/generator |
When the PASS path pauses for the human-review gate (default, auto_deploy: false), additionally
append a pause event row so flame-harness-resume can show the user what they are approving:
| <ISO-8601 UTC now> | pause | evaluator | manual_action: play/approve the built game before deploy; next: admob|build |
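For illustration, a pipeline-log.md row in the shape shown above could be produced like this. The helper name `pipeline_log_row` and the exact timestamp format (`YYYY-MM-DDTHH:MM:SSZ`) are assumptions; protocol §6 remains the authority on the schema.

```python
from datetime import datetime, timezone

def pipeline_log_row(verdict: str, round_no: int, next_role: str, now=None) -> str:
    """Format one evaluator row for pipeline-log.md.

    `now` must be a timezone-aware datetime; it defaults to the current UTC time.
    """
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = now.strftime('%Y-%m-%dT%H:%M:%SZ')
    return f'| {stamp} | {verdict} | evaluator | round {round_no}; next: {next_role} |'
```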
- If contract.md is missing or has no ## Status: AGREED, abort immediately.
- If handoff/round-<N>-gen.md is missing, abort with a clear message; do not set a FAIL verdict, because the generator has not run yet.