From dm-game
Question generation for playtests, what to observe vs. ask, metrics to track, and how to interpret playtest data without confirmation bias. Use when planning a playtest session, designing a feedback survey, setting up analytics, or when you have playtest data and need to make decisions from it.
`npx claudepluginhub rbergman/dark-matter-marketplace --plugin dm-game`

This skill uses the workspace's default tool permissions.
**Purpose:** Get useful signal from playtests. Most playtest sessions are wasted — observers confirm what they already believe, ask leading questions, and draw conclusions from noise. This skill provides structured methods to avoid those traps.
Influences: Frameworks here draw on cognitive UX research methodology, metrics-driven iterative design practice, and experience engineering theory (emergent behavior observation, planning under uncertainty).
Use this skill when:
- Planning a playtest session
- Designing a feedback survey
- Setting up analytics
- Interpreting playtest data you already have and making decisions from it
Players are reliable reporters of their experience (what they felt) but unreliable reporters of causes (why they felt it). Design your process accordingly.
Hierarchy of evidence, from most to least reliable:

1. What they did (behavior)
2. What they felt (experience)
3. Why they think they felt it (attribution)
Players attributing frustration to "bad controls" might actually be experiencing a perception failure (they couldn't see the indicator) or a pacing problem (too many new concepts at once). Use behavior to diagnose; use self-report to locate.
Generate questions along the perception → attention → memory pipeline:
- Perception questions: Did they see it?
- Attention questions: Did they focus on the right thing?
- Memory questions: Will they retain it?
| Dev Stage | Focus | Key Questions |
|---|---|---|
| Prototype | Core loop viability | Is the core action inherently interesting? Do they want to do it again? |
| Alpha | System comprehension | Do they understand the rules? Can they make intentional decisions? |
| Beta | Pacing and polish | Does the session arc feel right? Where do they get bored or frustrated? |
| Pre-launch | Edge cases and balance | What breaks? What's exploitable? What did we miss? |
| Observable | What It Tells You |
|---|---|
| First action | What the UI communicates as "start here" |
| Hesitation points | Where clarity fails or cognitive load spikes |
| Repeated failures | Where difficulty exceeds skill (or UI is misleading) |
| Where they look | What's grabbing attention (intended or not) |
| Body language | Leaning in = engaged; leaning back = disengaged; fidgeting = frustrated |
| Utterances | Unprompted comments ("what?", "oh!", "come on") are gold |
| Where they quit | The most valuable data point you'll collect |
| What they skip | Content they ignore reveals priority mismatches |
**Progression and engagement metrics:**

| Metric | What It Measures | Warning Signal |
|---|---|---|
| Session length | Engagement | Bimodal distribution (some quit fast, some stay long) |
| Quit points | Pain points | Cluster of quits at same location/moment |
| Completion rate | Difficulty/clarity | < 70% on intended-critical-path content |
| Time per section | Pacing | Sections taking 2x+ longer than designed |
| Death/failure rate | Difficulty curve | Spike = wall; zero = too easy |
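The quit-point warning signal above can be checked mechanically. This is a minimal sketch, assuming a session log shaped as a list of dicts with a `quit_location` field; the `sessions` data, field names, and the 30% cluster threshold are illustrative assumptions, not part of the skill:

```python
from collections import Counter

def quit_point_clusters(sessions, threshold=0.30):
    """Flag any location where at least `threshold` of sessions ended."""
    counts = Counter(s["quit_location"] for s in sessions)
    total = len(sessions)
    return {loc: n / total for loc, n in counts.items() if n / total >= threshold}

# Hypothetical session log: three of five testers quit at the same boss.
sessions = [
    {"quit_location": "level_2_boss", "length_minutes": 11},
    {"quit_location": "level_2_boss", "length_minutes": 9},
    {"quit_location": "level_2_boss", "length_minutes": 14},
    {"quit_location": "credits", "length_minutes": 95},
    {"quit_location": "level_1", "length_minutes": 3},
]

print(quit_point_clusters(sessions))  # level_2_boss surfaces as a cluster
```

The threshold is a tuning knob: with small playtest groups, even two quits at the same spot may deserve a look.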
**Balance and economy metrics:**

| Metric | What It Measures | Warning Signal |
|---|---|---|
| Pick rate by option | Strategy diversity | One option > 50% pick rate |
| Win rate by strategy | Balance | Any strategy > 55% win rate at comparable skill |
| Average game/match length | Pacing | Games consistently shorter or longer than intended |
| Resource accumulation rate | Economy health | Exponential growth = inflation incoming |
| Strategy churn | Meta health | If dominant strategy shifts too fast, balance is noisy |
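The pick-rate and win-rate thresholds in this table translate directly into a balance check. A hedged sketch, assuming per-match records with an `option` and a `won` flag (the data shape and names are invented for illustration):

```python
from collections import Counter

def balance_flags(matches, pick_cap=0.50, win_cap=0.55):
    """Return options whose pick rate or win rate crosses the warning thresholds."""
    picks, wins = Counter(), Counter()
    for m in matches:
        picks[m["option"]] += 1
        wins[m["option"]] += m["won"]  # bool adds as 0/1
    total = len(matches)
    flags = {}
    for opt, n in picks.items():
        pick_rate, win_rate = n / total, wins[opt] / n
        if pick_rate > pick_cap or win_rate > win_cap:
            flags[opt] = {"pick_rate": pick_rate, "win_rate": win_rate}
    return flags

# Hypothetical match log: sword is over-picked; staff's 100% win rate
# comes from a single match, so treat that flag as noise, not signal.
matches = [
    {"option": "sword", "won": True},
    {"option": "sword", "won": True},
    {"option": "sword", "won": False},
    {"option": "bow", "won": False},
    {"option": "staff", "won": True},
]

print(balance_flags(matches))
```

In practice you would also gate on a minimum sample size per option before trusting a win-rate flag.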
**Onboarding metrics:**

| Metric | What It Measures | Warning Signal |
|---|---|---|
| Time to first meaningful action | Onboarding quality | > 60 seconds before the player does something |
| Tutorial completion rate | Tutorial design | < 90% = tutorial is the problem, not the player |
| Hint/help usage | Clarity | High usage = UI isn't communicating; zero usage = help system is invisible |
| Error rate on intended actions | Usability | Player tries to do the right thing but fails due to UI |
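Time to first meaningful action is easy to pull from an event stream. A minimal sketch, assuming timestamped events with a `kind` field; the event names and 60-second threshold mirror the table but are otherwise illustrative:

```python
def time_to_first_action(events, meaningful=("move", "attack", "interact")):
    """Seconds from session start to the first meaningful player action, or None."""
    start = events[0]["t"]
    for e in events:
        if e["kind"] in meaningful:
            return e["t"] - start
    return None  # the player never acted — itself a warning signal

# Hypothetical event stream: first meaningful action at 72.5 s (> 60 s warning).
events = [
    {"t": 0.0, "kind": "session_start"},
    {"t": 4.2, "kind": "menu_open"},
    {"t": 72.5, "kind": "move"},
]

print(time_to_first_action(events))
```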
The biggest threat to useful playtest data is your own expectations.
Before the session:
After the session:
| Trap | Mechanism | Counter |
|---|---|---|
| Anchoring | First session dominates your impression | Review all sessions before concluding |
| Availability | Dramatic moments overshadow quiet ones | Use metrics, not memory |
| Projection | Attributing your own experience to players | Watch what they do, not what you'd do |
| Sunk cost | Defending features you spent time on | Ask "would we add this today?" not "should we cut this?" |
| Survivorship | Only hearing from players who stayed | Track quit points with equal priority |
If you can only ask one question: "Tell me about a moment that stood out — good or bad."
Then follow up with: "What were you trying to do?" and "What happened next?"
| Signal | Confidence | Action |
|---|---|---|
| Metrics + observation + self-report all agree | High | Act on it |
| Metrics show it, observation confirms, self-report disagrees | Moderate-High | Trust behavior over self-report |
| Self-report says it, but metrics/observation don't show it | Low | Investigate further — the report may point to a different real problem |
| Single session shows it, others don't | Very Low | Note it but don't act — one data point isn't a pattern |
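The triangulation table can be encoded as a small decision helper. This sketch reduces each evidence source to a boolean, which flattens real nuance; it is an approximation of the table's logic, not a substitute for judgment:

```python
def triangulate(metrics, observation, self_report, multiple_sessions=True):
    """Map agreement among evidence sources to a confidence level, per the table."""
    if not multiple_sessions:
        return "very low: note it but don't act"
    if metrics and observation and self_report:
        return "high: act on it"
    if metrics and observation:
        return "moderate-high: trust behavior over self-report"
    if self_report and not (metrics or observation):
        return "low: investigate further"
    return "very low: note it but don't act"

print(triangulate(metrics=True, observation=True, self_report=False))
```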
When you're building alone, you can't run traditional playtests during development. These techniques bridge the gap:
| Technique | How | What It Catches |
|---|---|---|
| The 2-week break | Play your own game after not touching it for 2 weeks | UX failures, forgotten controls, unclear objectives |
| The mute test | Play with sound off | Audio-dependent information, missing visual feedback |
| The squint test | Squint at the screen or reduce resolution | Visual clarity, contrast, UI readability |
| The record-and-review | Record gameplay, watch it the next day | Pacing problems, dead time, repetitive patterns |
| The explain test | Explain what you're doing out loud while playing | Logic gaps, unjustified assumptions, unclear goals |
| The wrong-hand test | Play with your non-dominant hand | Input complexity, timing windows, control accessibility |
When you're ready for external eyes (earlier than you think):
If you're a solo developer shipping updates: