`npx claudepluginhub bang9/ai-tools --plugin whip`

This skill uses the workspace's default tool permissions.
Run multi-agent simulations from a user-provided scenario. Concretize the scenario into test cases, spawn agents, and analyze output patterns for consistency.
Extract from `$ARGUMENTS`:

- `--runs N`: number of simulation runs (default: 5)
- `--agent`: use inline mode (see Execution Mode below)

`$ARGUMENTS` determines which dispatch mode this skill uses. The two modes are mutually exclusive:
| Mode | Activates when | Dispatch mechanism |
|---|---|---|
| Tracked (default) | --agent is absent from $ARGUMENTS | /whip-start Team Flow — IRC, workspace, polling |
| Inline | --agent is present in $ARGUMENTS | Agent tool directly — no whip, no IRC, no lifecycle |
Strict rules:
- No `--agent` in arguments → tracked mode. No exceptions, no inference.
- `--agent` in arguments → inline mode. /whip-start, IRC, and lifecycle steps are all skipped.
- A `--backend` specification (e.g., the user says "use codex") implies tracked mode. Backend selection is a whip concept and is incompatible with `--agent`.
- Never infer `--agent` from task simplicity, speed preference, or any other heuristic. The flag must be explicitly present in the user's input.

If running inside an active whip workspace, use `whip workspace view <workspace-name>` to get the worktree path for reading code artifacts referenced in the scenario. In tracked mode, simulation tasks go in the global workspace (ephemeral; do not pollute the active workspace).
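As a minimal sketch, the dispatch rule reduces to a single explicit-flag check. The flag names come from this skill; the function itself is hypothetical, not part of whip:

```python
# Illustrative only: mode selection as code.
def pick_mode(arguments: list[str]) -> str:
    if "--agent" in arguments:
        return "inline"   # Agent tool directly; no /whip-start, no IRC, no lifecycle
    return "tracked"      # default; --backend (a whip concept) also implies tracked

assert pick_mode(["--runs", "10"]) == "tracked"
assert pick_mode(["--agent", "--runs", "3"]) == "inline"
```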
Read any files, git refs, or codebase artifacts referenced in the scenario, then transform it into concrete test cases:
| Field | Description |
|---|---|
| Name | Short identifier (e.g., deprecated-move-1) |
| Setup | Context the agent receives (file contents, code, instructions) |
| Action | What the agent executes |
| Output contract | Structured format the agent must produce |
The output contract is critical — all agents must produce the same structure so results are mechanically comparable:
### Result
- pattern: [short label for the approach taken]
- output:
[code block, JSON, or other structured output]
- decisions: [key judgment calls made]
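For example, a filled-in test case (contents hypothetical) might look like:

| Field | Example |
|---|---|
| Name | deprecated-move-1 |
| Setup | Contents of a module whose `move()` helper is marked deprecated, pasted inline |
| Action | Refactor the calling code to stop using the deprecated helper |
| Output contract | The `### Result` structure above |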
For A/B comparisons, choose a strategy:
| Strategy | When to use | Agent count |
|---|---|---|
| Sequential | Outputs are structured (code, configs) — one agent runs A then B | N |
| Isolated | Outputs involve judgment or prose — separate agents per version | 2N |
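For example, with 2 test cases and `--runs 5`, N = 10: sequential dispatch spawns 10 agents (each producing both the A and B outputs), while isolated dispatch spawns 20 (one agent per version per run).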
Present the test plan (test cases, output contract, run counts, and dispatch strategy), then wait for user approval before executing.
Hand off dispatch to /whip-start. Prepare one task spec per simulation run and let /whip-start handle IRC, creation, assignment, and monitoring.
Each simulation run becomes one task:
- Name: `sim-{test-case}-{run}`
- Workspace: `global`
- Difficulty: `easy`

After all tasks complete, collect outputs and proceed to analysis.
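For illustration, one such task spec might look like this (the field labels are assumptions for readability, not whip's actual schema):

```
Name: sim-deprecated-move-1
Workspace: global
Difficulty: easy
Prompt: the full self-contained prompt for run 1 of deprecated-move,
        including setup, action, and the output contract
```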
In inline mode (`--agent`), spawn one Agent tool call per run, named `sim-{test-case}-{run}`.
Each prompt must be self-contained — embed all context inline, not file paths:
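A hypothetical self-contained prompt (structure illustrative, not a fixed whip format) might read:

```
You are simulation run sim-deprecated-move-1. Do not read any files.

Setup:
<relevant file contents pasted verbatim>

Action:
<the exact task to execute>

Respond using this exact format:
### Result
- pattern: [short label for the approach taken]
- output: [structured output]
- decisions: [key judgment calls made]
```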
Batching: dispatch runs in parallel with `run_in_background: true`.

Classify outputs into patterns (a mechanical sketch follows the report template below):
## Simulation Report
### Consistency: X/N (Y%)
### Output Patterns
| Pattern | Count | Runs | Description |
|---------|-------|------|-------------|
| A | 8 | #1-6,#8,#10 | [dominant behavior] |
| B | 2 | #7,#9 | [variant behavior] |
### Divergence Analysis
For each non-dominant pattern:
- Runs: [list]
- Root cause: [why]
- Severity: cosmetic | functional | breaking
- Diff from dominant: [key differences]
### Summary
- Total: N runs across M test cases
- Dominant pattern: A (X%)
- Key findings: ...
- Recommendation: [if applicable]
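As a sketch, once each run's `- pattern:` line has been parsed out, consistency can be computed mechanically. This is plain Python, not a whip API; the labels match the example table above:

```python
from collections import Counter

# Hypothetical parsed pattern labels, one per run.
patterns = ["A", "A", "A", "A", "A", "A", "B", "A", "B", "A"]

counts = Counter(patterns)                       # pattern label -> run count
dominant, dominant_n = counts.most_common(1)[0]  # most frequent pattern
consistency = dominant_n / len(patterns)         # share of runs matching it

print(f"Consistency: {dominant_n}/{len(patterns)} ({consistency:.0%})")
# -> Consistency: 8/10 (80%)
```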
Save the full report with raw agent outputs to /tmp/simulate-{slug}-{timestamp}.md and tell the user the path.
In tracked mode, tasks live in the global workspace and dispatch is delegated to /whip-start; once the report is saved, remove the ephemeral simulation tasks with `whip task clean`.