Dice Real-Mode Tests (MCP)

From claude-commands

Validates dice roll integrity end-to-end with real MCP services, covering Gemini code_execution, native_two_phase, distribution tests, and chi-squared authenticity checks.

$ npx claudepluginhub jleechanorg/claude-commands --plugin claude-commands

Popularity

Stars: 27
Forks: 4

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command: /claude-commands:dice-real-mode-tests

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Use this when validating **dice integrity** end-to-end with real services.

SKILL.md

73 lines · ~641 tokens

Stats

Language: Python
Stars: 27
Forks: 4
Maintenance: Excellent
Last Commit: Jun 2, 2026

Tags

integrity-validation

Dice Real-Mode Tests (MCP)

Use this when validating dice integrity end-to-end with real services.

Script Location

testing_mcp/test_dice_rolls_comprehensive.py

Required Standards

Evidence: Follow .claude/skills/evidence-standards.md (Three-Evidence Rule)
Authenticity: Follow .claude/skills/dice-authenticity-standards.md (Chi-squared + RNG verification)

Preview Server (real mode)

python testing_mcp/test_dice_rolls_comprehensive.py \
  --server-url https://<preview-app>.run.app/mcp \
  --evidence \
  --evidence-dir /tmp/<run-id> \
  --models gemini-3-flash-preview,qwen-3-235b-a22b-instruct-2507

Notes:

Distribution tests are skipped if the roll_dice tool is unavailable on the preview server.
For Qwen/native_two_phase, the server's tool_results are authoritative; any mismatching values are overridden.
Use a unique /tmp/<run-id>/ (timestamp or UUID) per run to avoid collisions.
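One way to generate a collision-free evidence directory is sketched below. The helper name make_run_dir and the "dice-" prefix are illustrative assumptions and are not part of the test script.

```python
import uuid
from datetime import datetime, timezone
from pathlib import Path


def make_run_dir(base="/tmp"):
    """Create and return a unique evidence directory under `base`.

    Hypothetical helper: the function name and the "dice-" prefix are
    assumptions made for this sketch.
    """
    # A UTC timestamp keeps directories sortable by start time.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    # A short random suffix separates runs started within the same second.
    run_id = f"dice-{stamp}-{uuid.uuid4().hex[:8]}"
    run_dir = Path(base) / run_id
    # exist_ok=False makes accidental reuse fail loudly instead of mixing evidence.
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir
```

The returned path can then be passed as the value of --evidence-dir.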

Local MCP (real services)

python testing_mcp/test_dice_rolls_comprehensive.py \
  --start-local --real-services --evidence --enable-dice-tool \
  --models gemini-3-flash-preview,qwen-3-235b-a22b-instruct-2507 \
  --evidence-dir /tmp/<run-id>

Outputs:

run.json (scenario results, tool_results, dice_audit_events, warnings)
local_mcp_*.log (server logs)
raw_*.txt (raw model responses when enabled)

What This Covers

Gemini code_execution path (dice_audit_events source=code_execution).
native_two_phase path (Qwen/Cerebras tool_results).
Distribution tests (1d6 / 1d20) when roll_dice tool is available.
Edge cases (invalid notation, 1d0+5).
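To illustrate why inputs such as 1d0+5 count as edge cases, here is a minimal sketch of a dice-notation validator. It is not the server's parser: the function name parse_dice, the accepted grammar (NdS with an optional +M or -M modifier), and the choice to reject a zero-sided die are all assumptions. How the real server responds to these inputs is not stated here.

```python
import re

# Assumed grammar for this sketch: <count>d<sides> with an optional +N or -N modifier.
_NOTATION = re.compile(r"(\d+)d(\d+)([+-]\d+)?")


def parse_dice(notation):
    """Return (count, sides, modifier) or raise ValueError for bad input."""
    match = _NOTATION.fullmatch(notation.strip().lower())
    if match is None:
        raise ValueError(f"invalid dice notation: {notation!r}")
    count = int(match.group(1))
    sides = int(match.group(2))
    modifier = int(match.group(3) or 0)
    # A die with zero sides (as in 1d0+5) matches the pattern but cannot be rolled.
    if count < 1 or sides < 1:
        raise ValueError(f"count and sides must be at least 1: {notation!r}")
    return count, sides, modifier
```

For example, parse_dice("2d6+3") returns (2, 6, 3), while parse_dice("1d0+5") raises ValueError.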

Common Expectations

DICE_ROLLS_MISMATCH / DICE_AUDIT_MISMATCH warnings can appear; when they do, the server overrides the mismatched values with tool_results.
The final run.json should show aligned totals across dice_rolls, dice_audit_events, and tool_results.
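The alignment check can be sketched as follows. The run.json schema is not documented here, so the layout is assumed: a top-level "scenarios" list whose items each hold dice_rolls, dice_audit_events, and tool_results lists of objects with a "total" field. Adjust the key names to the real file.

```python
import json
from pathlib import Path

# The three sources that are expected to agree in a final run.json.
SOURCES = ("dice_rolls", "dice_audit_events", "tool_results")


def totals_by_source(scenario):
    """Collect the list of totals per source (assumed 'total' key on each entry)."""
    return {
        name: [entry["total"] for entry in scenario.get(name, [])]
        for name in SOURCES
    }


def totals_aligned(scenario):
    """True when every source reports the same totals as tool_results."""
    totals = totals_by_source(scenario)
    reference = totals["tool_results"]  # the server treats tool_results as authoritative
    return all(values == reference for values in totals.values())


def misaligned_scenarios(run_json_path):
    """Return names of scenarios whose totals disagree (assumed 'scenarios'/'name' keys)."""
    data = json.loads(Path(run_json_path).read_text())
    return [
        scenario.get("name", "<unnamed>")
        for scenario in data.get("scenarios", [])
        if not totals_aligned(scenario)
    ]
```

An empty list from misaligned_scenarios("/tmp/<run-id>/run.json") would mean every scenario is aligned under these assumptions.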

Chi-Squared Validation

After running distribution tests, validate authenticity:

Chi-Squared   Sample Size   Verdict
< 30          100+          PASS
30-50         100+          WARNING - Investigate
> 50          100+          FAIL - Likely fabrication

Reference: PR #2551 detected fabrication with chi-squared = 411.81
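For reference, the statistic can be computed by hand as a Pearson chi-squared goodness-of-fit test against a uniform distribution. The sketch below is not the project's implementation: the function names and the INSUFFICIENT_SAMPLE label are assumptions, and only the thresholds (30, 50, 100+ samples) are taken from the table above.

```python
from collections import Counter

MIN_SAMPLE = 100  # "100+" sample size from the table above


def chi_squared(rolls, sides):
    """Pearson chi-squared statistic for `rolls` against a fair die with `sides` faces."""
    if not rolls:
        raise ValueError("no rolls to test")
    if any(roll < 1 or roll > sides for roll in rolls):
        raise ValueError("roll outside the range 1..sides")
    expected = len(rolls) / sides  # a fair die spreads rolls evenly over faces
    counts = Counter(rolls)
    return sum(
        (counts.get(face, 0) - expected) ** 2 / expected
        for face in range(1, sides + 1)
    )


def verdict(statistic, sample_size):
    """Map a statistic to the table's verdicts (boundary values count as WARNING)."""
    if sample_size < MIN_SAMPLE:
        return "INSUFFICIENT_SAMPLE"  # assumed label; the table only covers 100+
    if statistic < 30:
        return "PASS"
    if statistic <= 50:
        return "WARNING"
    return "FAIL"
```

Under this mapping, the value 411.81 from PR #2551 lands in FAIL. The fixed thresholds do not adjust for degrees of freedom (5 for 1d6, 19 for 1d20), so the verdict is best read as a coarse screen, not a formal significance test.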

Troubleshooting

If native_two_phase fails: ensure tool_results are present in the responses.
If mismatch errors appear in run.json: check the server log (local_mcp_*.log for local runs) and confirm the override occurred.
If distribution tests skip on preview: this is expected unless the roll_dice tool is exposed.
If chi-squared > 50: check the rng_verified field in the evidence; a value this high may indicate fabrication.