From pm-thought-partner
Generates diverse synthetic test inputs using dimension-based tuple generation for LLM pipelines. Use when bootstrapping eval datasets, real data is sparse, or stress-testing failure hypotheses.
npx claudepluginhub breethomas/bette-think
This skill uses the workspace's default tool permissions.
Generate diverse, realistic test inputs that cover the failure space of an LLM pipeline. Dimension-based tuples, not random generation.
When this skill is invoked, start with:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
GENERATE TEST DATA
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Diverse inputs expose the failure space. Random generation doesn't.
What AI feature are we generating test data for?
What kinds of inputs does it take?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Before generating synthetic data, identify where the pipeline is likely to fail. Ask the PM about known failure-prone areas, review existing user feedback, or form hypotheses from available traces. Dimensions (Step 1) must target anticipated failures, not arbitrary variation.
Dimensions are axes of variation specific to the application. The PM defines these — they know where failures happen.
Dimension 1: [Name] — [What it captures]
Values: [value_a, value_b, value_c, ...]
Dimension 2: [Name] — [What it captures]
Values: [value_a, value_b, value_c, ...]
Dimension 3: [Name] — [What it captures]
Values: [value_a, value_b, value_c, ...]
Example for a customer support chatbot:
Query Type: what the user is asking about
Values: [billing, technical issue, account access, feature request, cancellation]
User Expertise: how technical the user is
Values: [non-technical, somewhat technical, power user]
Complexity: how many steps to resolve
Values: [single-step, multi-step, requires escalation]
Start with 3 dimensions. Add more only if initial traces reveal failure patterns along new axes.
Ask the PM: "What are the 3 most important ways inputs vary for your feature? Think about what makes some inputs harder than others."
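As a minimal sketch, the chatbot dimensions above can be written down as a plain mapping from dimension name to allowed values. The names `DIMENSIONS`, `total_combinations`, and `all_tuples` are illustrative assumptions, not part of the skill.

```python
from itertools import product
from math import prod

# Dimensions copied from the customer support chatbot example.
DIMENSIONS = {
    "Query Type": ["billing", "technical issue", "account access",
                   "feature request", "cancellation"],
    "User Expertise": ["non-technical", "somewhat technical", "power user"],
    "Complexity": ["single-step", "multi-step", "requires escalation"],
}

# Size of the full combination space: 5 * 3 * 3.
total_combinations = prod(len(values) for values in DIMENSIONS.values())

# Every possible combination, as a dict keyed by dimension name.
all_tuples = [dict(zip(DIMENSIONS, combo))
              for combo in product(*DIMENSIONS.values())]

print(total_combinations)  # 45
```

Three dimensions of this size give 45 combinations, which is still small enough to review by hand. Each added dimension multiplies that count, which is one reason to start with three.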
A tuple is one combination of dimension values defining a specific test case. Present 20 draft tuples to the PM and iterate until they confirm the tuples reflect realistic scenarios.
(Query Type: Billing, User Expertise: Non-technical, Complexity: Multi-step)
(Query Type: Technical Issue, User Expertise: Power User, Complexity: Single-step)
(Query Type: Cancellation, User Expertise: Non-technical, Complexity: Requires Escalation)
The PM's domain knowledge is essential. They know which combinations actually occur and which are unrealistic.
Claude Code executes: Generate the initial 20 tuples ensuring coverage across dimension values. Present to PM for validation.
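One way to draft tuples with guaranteed value coverage is sketched below. It is an assumption about how the draft could be produced, not the skill's actual procedure: shuffle all combinations, keep any tuple that introduces an unseen value, then top up to the requested count. The function name `draft_tuples` and the fixed seed are hypothetical.

```python
import random
from itertools import product

DIMENSIONS = {
    "Query Type": ["billing", "technical issue", "account access",
                   "feature request", "cancellation"],
    "User Expertise": ["non-technical", "somewhat technical", "power user"],
    "Complexity": ["single-step", "multi-step", "requires escalation"],
}

def draft_tuples(dimensions, n=20, seed=0):
    """Pick up to n distinct tuples so every dimension value appears at least once.

    Assumes n is at least the total number of values across all dimensions.
    """
    rng = random.Random(seed)
    names = list(dimensions)
    pool = [dict(zip(names, combo)) for combo in product(*dimensions.values())]
    rng.shuffle(pool)

    uncovered = {(name, value) for name in names for value in dimensions[name]}
    chosen, rest = [], []
    # First pass: keep any tuple that covers a value not seen yet.
    for candidate in pool:
        newly_covered = uncovered & set(candidate.items())
        if newly_covered:
            chosen.append(candidate)
            uncovered -= newly_covered
        else:
            rest.append(candidate)
    # Second pass: top up with the remaining shuffled tuples.
    chosen.extend(rest[: max(0, n - len(chosen))])
    return chosen[:n]

drafts = draft_tuples(DIMENSIONS, n=20)
```

The output is only a draft. The PM still removes combinations that never occur in practice before the set is used as examples.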
Claude Code executes: Generate additional tuples using the PM-validated set as examples.
Generate 10 random combinations of ({dim1}, {dim2}, {dim3})
for a {application description}.
The dimensions are:
{dim1}: {description}. Possible values: {values}
{dim2}: {description}. Possible values: {values}
{dim3}: {description}. Possible values: {values}
Output each tuple in the format: ({dim1}, {dim2}, {dim3})
Avoid duplicates. Vary values across dimensions.
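The template above can be filled programmatically. The following sketch assumes each dimension is stored with a description and a list of values; `build_tuple_prompt` and the dictionary layout are hypothetical names chosen for illustration.

```python
def build_tuple_prompt(application, dimensions, n=10):
    """Fill the tuple-generation template from a dimension spec."""
    names = list(dimensions)
    tuple_format = "(" + ", ".join(names) + ")"
    lines = [
        f"Generate {n} random combinations of {tuple_format}",
        f"for a {application}.",
        "The dimensions are:",
    ]
    for name, spec in dimensions.items():
        values = ", ".join(spec["values"])
        lines.append(f"{name}: {spec['description']}. Possible values: {values}")
    lines.append(f"Output each tuple in the format: {tuple_format}")
    lines.append("Avoid duplicates. Vary values across dimensions.")
    return "\n".join(lines)

# Dimension spec taken from the customer support chatbot example.
SPEC = {
    "Query Type": {
        "description": "what the user is asking about",
        "values": ["billing", "technical issue", "account access",
                   "feature request", "cancellation"],
    },
    "User Expertise": {
        "description": "how technical the user is",
        "values": ["non-technical", "somewhat technical", "power user"],
    },
    "Complexity": {
        "description": "how many steps to resolve",
        "values": ["single-step", "multi-step", "requires escalation"],
    },
}

tuple_prompt = build_tuple_prompt("customer support chatbot", SPEC)
```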
Keep query generation as a separate step from tuple generation. Generating tuples and queries together in a single step produces repetitive phrasing.
Claude Code executes: Convert each tuple to a realistic user query using a separate prompt per tuple.
We are generating synthetic user queries for a {application}.
{Brief description of what it does.}
Given:
{dim1}: {value}
{dim2}: {value}
{dim3}: {value}
Write a realistic query that a user might enter. The query should
reflect the specified characteristics.
Example: "{one of the PM-written examples}"
Now generate a new query.
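A sketch of the "separate prompt per tuple" step: each tuple is rendered into its own copy of the template above. The application description and the example query are placeholders invented for illustration. In practice both come from the PM.

```python
QUERY_PROMPT = """We are generating synthetic user queries for a {application}.
{description}
Given:
{characteristics}
Write a realistic query that a user might enter. The query should
reflect the specified characteristics.
Example: "{example}"
Now generate a new query."""

def build_query_prompts(application, description, tuples, example):
    """Return one filled prompt per tuple, in the same order as the tuples."""
    prompts = []
    for t in tuples:
        characteristics = "\n".join(f"{name}: {value}" for name, value in t.items())
        prompts.append(QUERY_PROMPT.format(
            application=application,
            description=description,
            characteristics=characteristics,
            example=example,
        ))
    return prompts

validated_tuples = [
    {"Query Type": "billing", "User Expertise": "non-technical",
     "Complexity": "multi-step"},
    {"Query Type": "technical issue", "User Expertise": "power user",
     "Complexity": "single-step"},
]

query_prompts = build_query_prompts(
    application="customer support chatbot",
    # Placeholder description, not taken from a real product.
    description="It answers customer questions about billing and accounts.",
    tuples=validated_tuples,
    # Placeholder standing in for a PM-written example.
    example="I was charged twice this month, how do I get a refund?",
)
```

Each prompt is then sent to the model as an independent request, so the phrasing of one query does not leak into the next.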
Review generated queries with the PM. Discard and regenerate when:
Claude Code executes: Rate realism using an LLM, discard below threshold, regenerate replacements.
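The rate, discard, and regenerate loop might look like the sketch below. The 1 to 5 scale, the threshold of 4, and the retry limit are assumptions, since the skill does not specify them. `rate` and `regenerate` stand in for LLM calls and are stubbed here so the example runs offline.

```python
def filter_and_regenerate(items, rate, regenerate, threshold=4, max_regenerations=2):
    """Keep queries rated at or above threshold; regenerate the rest.

    items: list of (tuple_dict, query) pairs.
    Returns (kept_pairs, dropped_tuples).
    """
    kept, dropped = [], []
    for tuple_, query in items:
        candidate = query
        for attempt in range(max_regenerations + 1):
            if rate(candidate) >= threshold:
                kept.append((tuple_, candidate))
                break
            if attempt < max_regenerations:
                candidate = regenerate(tuple_)
        else:
            # No candidate passed within the retry budget.
            dropped.append(tuple_)
    return kept, dropped

# Stub rater: a real one would ask an LLM to score realism.
def stub_rate(query):
    return 5 if query.endswith("?") else 2

# Stub regenerator: a real one would re-run the per-tuple query prompt.
def stub_regenerate(tuple_):
    return f"Can you help me with {tuple_['Query Type']}?"

pairs = [
    ({"Query Type": "billing"}, "Why was I charged twice?"),
    ({"Query Type": "cancellation"}, "QUERY ABOUT CANCELLATION"),
]
kept, dropped = filter_and_regenerate(pairs, stub_rate, stub_regenerate)
```

Tuples that end up in `dropped` are worth showing to the PM, since repeated failures can mean the combination itself is unrealistic.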
Execute all queries through the full LLM pipeline. Capture complete traces: input, all intermediate steps, tool calls, retrieved docs, final output.
Target: ~100 high-quality, diverse traces. This is a rough heuristic for reaching saturation, the point where additional traces stop revealing new failure modes.
Claude Code executes: Run the queries, capture traces, format for analysis. These traces feed directly into /upgrade-evals for error analysis.
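A minimal sketch of trace capture follows, assuming the pipeline can be called as a function that returns a dict. The `Trace` fields mirror the list above, while `run_pipeline`, its return keys, and the JSON Lines output format are assumptions. The format expected by /upgrade-evals is not described here.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class Trace:
    input: str
    intermediate_steps: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)
    retrieved_docs: list = field(default_factory=list)
    final_output: str = ""

def capture_traces(queries, run_pipeline):
    """Run each query through the pipeline and record a complete trace."""
    traces = []
    for query in queries:
        result = run_pipeline(query)  # hypothetical: returns a dict
        traces.append(Trace(
            input=query,
            intermediate_steps=result.get("steps", []),
            tool_calls=result.get("tool_calls", []),
            retrieved_docs=result.get("retrieved_docs", []),
            final_output=result.get("output", ""),
        ))
    return traces

def to_jsonl(traces):
    """Serialize traces as JSON Lines, one trace per line."""
    return "\n".join(json.dumps(asdict(t)) for t in traces)

# Stub pipeline so the sketch runs without a real LLM.
def fake_pipeline(query):
    return {"steps": ["classify"], "output": f"Echo: {query}"}

traces = capture_traces(["Why was I charged twice?", "Reset my password"],
                        fake_pipeline)
jsonl = to_jsonl(traces)
```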
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
TEST DATA GENERATED
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Feature: [name]
Dimensions: [dim1], [dim2], [dim3]
Tuples generated: [count]
Queries generated: [count]
Queries after filtering: [count]
DIMENSION COVERAGE:
| Dimension | Values Covered | Gaps |
|-----------|---------------|------|
| [dim1] | [X/Y] | [any missing] |
| [dim2] | [X/Y] | [any missing] |
| [dim3] | [X/Y] | [any missing] |
NEXT STEPS:
- Run /upgrade-evals on these traces for error analysis
- Run /build-judge for failure modes that need automated evaluation
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
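The DIMENSION COVERAGE table in the summary above can be computed from the final tuple set. This sketch assumes tuples are dicts keyed by dimension name. `coverage_rows` is a hypothetical helper.

```python
def coverage_rows(dimensions, tuples):
    """Return (dimension, 'covered/total', gaps) rows for the coverage table."""
    rows = []
    for name, values in dimensions.items():
        seen = {t[name] for t in tuples}
        missing = [v for v in values if v not in seen]
        covered = len(values) - len(missing)
        rows.append((name, f"{covered}/{len(values)}", ", ".join(missing) or "none"))
    return rows

dimensions = {
    "Query Type": ["billing", "technical issue", "account access",
                   "feature request", "cancellation"],
    "User Expertise": ["non-technical", "somewhat technical", "power user"],
    "Complexity": ["single-step", "multi-step", "requires escalation"],
}
final_tuples = [
    {"Query Type": "billing", "User Expertise": "non-technical",
     "Complexity": "multi-step"},
    {"Query Type": "technical issue", "User Expertise": "power user",
     "Complexity": "single-step"},
    {"Query Type": "cancellation", "User Expertise": "non-technical",
     "Complexity": "requires escalation"},
]

rows = coverage_rows(dimensions, final_tuples)
print("| Dimension | Values Covered | Gaps |")
print("|-----------|---------------|------|")
for name, covered, gaps in rows:
    print(f"| {name} | {covered} | {gaps} |")
```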
When you have real queries available, don't just sample randomly. Use stratified sampling:
Use synthetic data to fill gaps in underrepresented query types.
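As a sketch of stratified sampling, assume each real query has already been labeled with a stratum such as its query type. The choice of stratum, the per-stratum quota, and the function name are assumptions. The shortfall it reports is where synthetic queries would fill in.

```python
import random
from collections import defaultdict

def stratified_sample(queries, key, per_stratum, seed=0):
    """Sample up to per_stratum queries from each stratum.

    Returns (sample, shortfalls), where shortfalls maps each
    underrepresented stratum to the number of queries still needed.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for query in queries:
        groups[key(query)].append(query)

    sample, shortfalls = [], {}
    for stratum, members in groups.items():
        k = min(per_stratum, len(members))
        sample.extend(rng.sample(members, k))
        if k < per_stratum:
            shortfalls[stratum] = per_stratum - k
    return sample, shortfalls

# Illustrative labeled queries: five billing, one cancellation.
real_queries = (
    [{"type": "billing", "text": f"billing question {i}"} for i in range(5)]
    + [{"type": "cancellation", "text": "cancel my plan"}]
)

sample, shortfalls = stratified_sample(
    real_queries, key=lambda q: q["type"], per_stratum=3
)
```

Here the sample holds three billing queries and the single cancellation query, and `shortfalls` shows that two more cancellation queries are needed. A stratum with no real queries at all never appears in `groups`, so compare the result against the full list of expected strata.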
Methodology: Adapted from Hamel Husain's generate-synthetic-data skill (evals-skills, MIT license)
PM adaptation: PM defines dimensions and validates realism, Claude Code handles generation and pipeline execution