From dm-game
Question generation for playtests, what to observe vs. ask, metrics to track, and how to interpret playtest data without confirmation bias. Use when planning a playtest session, designing a feedback survey, setting up analytics, or when you have playtest data and need to make decisions from it.
`npx claudepluginhub rbergman/dark-matter-marketplace --plugin dm-game`

This skill uses the workspace's default tool permissions.
**Purpose:** Get useful signal from playtests. Most playtest sessions are wasted — observers confirm what they already believe, ask leading questions, and draw conclusions from noise. This skill provides structured methods to avoid those traps.
Influences: Frameworks here draw on cognitive UX research methodology, metrics-driven iterative design practice, and experience engineering theory (emergent behavior observation, planning under uncertainty).
Use this skill when:
- Planning a playtest session
- Designing a feedback survey
- Setting up analytics
- Interpreting playtest data you already have and making decisions from it
Players are reliable reporters of their experience (what they felt) but unreliable reporters of causes (why they felt it). Design your process accordingly.
Hierarchy of evidence, from most to least reliable:

1. What they did (behavior)
2. What they felt (experience)
3. Why they think they felt it (attribution)
Players attributing frustration to "bad controls" might actually be experiencing a perception failure (they couldn't see the indicator) or a pacing problem (too many new concepts at once). Use behavior to diagnose; use self-report to locate.
Generate questions along the perception → attention → memory pipeline:
- Perception questions: Did they see it?
- Attention questions: Did they focus on the right thing?
- Memory questions: Will they retain it?
| Dev Stage | Focus | Key Questions |
|---|---|---|
| Prototype | Core loop viability | Is the core action inherently interesting? Do they want to do it again? |
| Alpha | System comprehension | Do they understand the rules? Can they make intentional decisions? |
| Beta | Pacing and polish | Does the session arc feel right? Where do they get bored or frustrated? |
| Pre-launch | Edge cases and balance | What breaks? What's exploitable? What did we miss? |
| Observable | What It Tells You |
|---|---|
| First action | What the UI communicates as "start here" |
| Hesitation points | Where clarity fails or cognitive load spikes |
| Repeated failures | Where difficulty exceeds skill (or UI is misleading) |
| Where they look | What's grabbing attention (intended or not) |
| Body language | Leaning in = engaged; leaning back = disengaged; fidgeting = frustrated |
| Utterances | Unprompted comments ("what?", "oh!", "come on") are gold |
| Where they quit | The most valuable data point you'll collect |
| What they skip | Content they ignore reveals priority mismatches |
**Progression and engagement metrics:**

| Metric | What It Measures | Warning Signal |
|---|---|---|
| Session length | Engagement | Bimodal distribution (some quit fast, some stay long) |
| Quit points | Pain points | Cluster of quits at same location/moment |
| Completion rate | Difficulty/clarity | < 70% on intended-critical-path content |
| Time per section | Pacing | Sections taking 2x+ longer than designed |
| Death/failure rate | Difficulty curve | Spike = wall; zero = too easy |
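The quit-point warning signal above can be checked mechanically. This is a minimal sketch, assuming a session log shaped as a list of dicts with a `quit_location` field; the `sessions` data, field names, and the 30% cluster threshold are illustrative assumptions, not part of the skill:

```python
from collections import Counter

def quit_point_clusters(sessions, threshold=0.30):
    """Flag any location where at least `threshold` of sessions ended."""
    counts = Counter(s["quit_location"] for s in sessions)
    total = len(sessions)
    return {loc: n / total for loc, n in counts.items() if n / total >= threshold}

# Hypothetical session log: three of five testers quit at the same boss.
sessions = [
    {"quit_location": "level_2_boss", "length_minutes": 11},
    {"quit_location": "level_2_boss", "length_minutes": 9},
    {"quit_location": "level_2_boss", "length_minutes": 14},
    {"quit_location": "credits", "length_minutes": 95},
    {"quit_location": "level_1", "length_minutes": 3},
]

print(quit_point_clusters(sessions))  # level_2_boss surfaces as a cluster
```

The threshold is a tuning knob: with small playtest groups, even two quits at the same spot may deserve a look.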
**Balance and economy metrics:**

| Metric | What It Measures | Warning Signal |
|---|---|---|
| Pick rate by option | Strategy diversity | One option > 50% pick rate |
| Win rate by strategy | Balance | Any strategy > 55% win rate at comparable skill |
| Average game/match length | Pacing | Games consistently shorter or longer than intended |
| Resource accumulation rate | Economy health | Exponential growth = inflation incoming |
| Strategy churn | Meta health | If dominant strategy shifts too fast, balance is noisy |
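The pick-rate and win-rate thresholds in this table translate directly into a balance check. A hedged sketch, assuming per-match records with an `option` and a `won` flag (the data shape and names are invented for illustration):

```python
from collections import Counter

def balance_flags(matches, pick_cap=0.50, win_cap=0.55):
    """Return options whose pick rate or win rate crosses the warning thresholds."""
    picks, wins = Counter(), Counter()
    for m in matches:
        picks[m["option"]] += 1
        wins[m["option"]] += m["won"]  # bool adds as 0/1
    total = len(matches)
    flags = {}
    for opt, n in picks.items():
        pick_rate, win_rate = n / total, wins[opt] / n
        if pick_rate > pick_cap or win_rate > win_cap:
            flags[opt] = {"pick_rate": pick_rate, "win_rate": win_rate}
    return flags

# Hypothetical match log: sword is over-picked; staff's 100% win rate
# comes from a single match, so treat that flag as noise, not signal.
matches = [
    {"option": "sword", "won": True},
    {"option": "sword", "won": True},
    {"option": "sword", "won": False},
    {"option": "bow", "won": False},
    {"option": "staff", "won": True},
]

print(balance_flags(matches))
```

In practice you would also gate on a minimum sample size per option before trusting a win-rate flag.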
**Onboarding metrics:**

| Metric | What It Measures | Warning Signal |
|---|---|---|
| Time to first meaningful action | Onboarding quality | > 60 seconds before the player does something |
| Tutorial completion rate | Tutorial design | < 90% = tutorial is the problem, not the player |
| Hint/help usage | Clarity | High usage = UI isn't communicating; zero usage = help system is invisible |
| Error rate on intended actions | Usability | Player tries to do the right thing but fails due to UI |
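Time to first meaningful action is easy to pull from an event stream. A minimal sketch, assuming timestamped events with a `kind` field; the event names and 60-second threshold mirror the table but are otherwise illustrative:

```python
def time_to_first_action(events, meaningful=("move", "attack", "interact")):
    """Seconds from session start to the first meaningful player action, or None."""
    start = events[0]["t"]
    for e in events:
        if e["kind"] in meaningful:
            return e["t"] - start
    return None  # the player never acted — itself a warning signal

# Hypothetical event stream: first meaningful action at 72.5 s (> 60 s warning).
events = [
    {"t": 0.0, "kind": "session_start"},
    {"t": 4.2, "kind": "menu_open"},
    {"t": 72.5, "kind": "move"},
]

print(time_to_first_action(events))
```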
The biggest threat to useful playtest data is your own expectations.
Before the session:
After the session:
| Trap | Mechanism | Counter |
|---|---|---|
| Anchoring | First session dominates your impression | Review all sessions before concluding |
| Availability | Dramatic moments overshadow quiet ones | Use metrics, not memory |
| Projection | Attributing your own experience to players | Watch what they do, not what you'd do |
| Sunk cost | Defending features you spent time on | Ask "would we add this today?" not "should we cut this?" |
| Survivorship | Only hearing from players who stayed | Track quit points with equal priority |
If you can only ask one question: "Tell me about a moment that stood out — good or bad."
Then follow up with: "What were you trying to do?" and "What happened next?"
| Signal | Confidence | Action |
|---|---|---|
| Metrics + observation + self-report all agree | High | Act on it |
| Metrics show it, observation confirms, self-report disagrees | Moderate-High | Trust behavior over self-report |
| Self-report says it, but metrics/observation don't show it | Low | Investigate further — the report may point to a different real problem |
| Single session shows it, others don't | Very Low | Note it but don't act — one data point isn't a pattern |
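The triangulation table can be encoded as a small decision helper. This sketch reduces each evidence source to a boolean, which flattens real nuance; it is an approximation of the table's logic, not a substitute for judgment:

```python
def triangulate(metrics, observation, self_report, multiple_sessions=True):
    """Map agreement among evidence sources to a confidence level, per the table."""
    if not multiple_sessions:
        return "very low: note it but don't act"
    if metrics and observation and self_report:
        return "high: act on it"
    if metrics and observation:
        return "moderate-high: trust behavior over self-report"
    if self_report and not (metrics or observation):
        return "low: investigate further"
    return "very low: note it but don't act"

print(triangulate(metrics=True, observation=True, self_report=False))
```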
When you're building alone, you can't run traditional playtests during development. These techniques bridge the gap:
| Technique | How | What It Catches |
|---|---|---|
| The 2-week break | Play your own game after not touching it for 2 weeks | UX failures, forgotten controls, unclear objectives |
| The mute test | Play with sound off | Audio-dependent information, missing visual feedback |
| The squint test | Squint at the screen or reduce resolution | Visual clarity, contrast, UI readability |
| The record-and-review | Record gameplay, watch it the next day | Pacing problems, dead time, repetitive patterns |
| The explain test | Explain what you're doing out loud while playing | Logic gaps, unjustified assumptions, unclear goals |
| The wrong-hand test | Play with your non-dominant hand | Input complexity, timing windows, control accessibility |
When you're ready for external eyes (earlier than you think):
If you're a solo developer shipping updates: