From crucible
Dispatches prompts to multiple LLM providers in parallel via MCP for consensus on high-stakes decisions, synthesizing responses for quality gates and design reviews.
npx claudepluginhub raddue/crucibleThis skill uses the workspace's default tool permissions.
<!-- CANONICAL: shared/dispatch-convention.md -->
Queries external AI agents (Codex, Gemini, OpenCode, Claude Code headless) in parallel for code reviews, bug investigations, architecture choices, security audits, and consensus on complex decisions.
Orchestrates parallel analysis of coding problems across AI models (Claude, GPT, Gemini, Grok) via CLI tools or APIs, collects recommendations, and synthesizes optimal solution.
Share bugs, ideas, or general feedback.
All subagent dispatches use disk-mediated dispatch. See shared/dispatch-convention.md for the full protocol.
Multi-model consensus dispatches high-stakes decision prompts to multiple LLM providers in parallel via an MCP server, then synthesizes the responses. It surfaces blind spots that single-model review misses. Entirely opt-in — without configuration, all skills behave exactly as before.
The consensus system is not a replacement for existing skill logic. It is an amplifier applied at specific, high-leverage decision points where the cost of a missed defect or a wrong judgment is disproportionately high. Skills that integrate consensus call it at defined moments (e.g., the stagnation judge in quality-gate, the Challenger in design review) and treat its output as additional signal, not as an override.
consensus_query (MCP tool)Dispatches a prompt to all configured models in parallel, collects responses, and returns a synthesized result.
| Parameter | Type | Required | Description |
|---|---|---|---|
prompt | string | Yes | The decision prompt to send to each model. Should be self-contained and unambiguous. |
context | string | Yes | Supporting context (artifact content, conversation history, prior findings). Sent verbatim to each model alongside the prompt. |
mode | enum | Yes | One of "review", "verdict", or "investigate". Controls how responses are aggregated. |
metadata | object | No | Optional metadata for traceability. Fields: artifact_type (string), round_number (int), score_progression (list of ints). |
| Field | Type | Description |
|---|---|---|
status | enum | "complete" (all models responded), "partial" (some models responded, count >= min_models), "unavailable" (fewer than min_models responded or tool not configured). |
models_queried | int | Number of models the query was dispatched to. |
models_responded | int | Number of models that returned a valid response within the timeout. |
synthesis | string | Aggregated result text, formatted according to the selected mode. |
agreements | list | Findings or verdicts where 2+ models converged. |
disagreements | list | Findings or verdicts where models explicitly contradicted each other. |
unique_findings | list | Findings raised by exactly one model, with provenance. |
per_model | list | Per-model detail. Each entry: { provider: string, model_id: string, content: string, responded: boolean }. |
When status is "unavailable", all list fields are empty and synthesis contains a short explanation (e.g., "Consensus unavailable: no models configured"). The caller should fall back to single-model behavior.
For adversarial review of artifacts (red-team rounds, design review, code review).
For binary or ternary decisions (stagnation judge, fix verification, pass/fail gates).
STAGNATION or PROGRESS) along with structured reasoning."STAGNATION: 3, PROGRESS: 1")."requires_human_judgment": true in the synthesis.For design exploration (Challenger mode, architectural analysis).
disagreements list with both positions and their reasoning.Consensus is configured via .claude/consensus-config.yaml in the project root. If this file is absent or consensus.enabled is false, all consensus calls return status: "unavailable" and skills fall back to single-model behavior with zero overhead.
consensus:
enabled: true # Master switch. Default: false.
min_models: 2 # Minimum models that must respond for a valid result. Default: 2.
timeout_seconds: 120 # Per-query timeout. Models that haven't responded are marked responded: false. Default: 120.
models:
- provider: anthropic # Provider identifier. Shipped: "anthropic", "google". Planned: "openai", "deepseek".
model: claude-sonnet-4-20250514 # Model identifier within the provider.
api_key_env: ANTHROPIC_API_KEY # Environment variable name holding the API key. NEVER a raw key.
temperature: 0.6 # Sampling temperature for this model. Default: 0.6. Range: 0.0-1.0.
- provider: google
model: gemini-2.5-pro
api_key_env: GOOGLE_API_KEY
temperature: 0.6
modes:
review: true # Enable consensus for review-mode calls. Default: true.
verdict: true # Enable consensus for verdict-mode calls. Default: true.
investigate: true # Enable consensus for investigate-mode calls. Default: true.
| Field | Type | Default | Description |
|---|---|---|---|
consensus.enabled | boolean | false | Master switch. When false, all consensus_query calls immediately return status: "unavailable". |
consensus.min_models | int | 2 | Minimum number of models that must respond for the result to be "complete" or "partial". If fewer respond, status is "unavailable". Must be >= 2. |
consensus.timeout_seconds | int | 120 | Maximum time (seconds) to wait for all models to respond. Models exceeding this are marked responded: false. Must be >= 10 and <= 600. |
consensus.models | list | [] | List of model configurations. At least min_models entries required. |
consensus.models[].provider | string | — | Provider identifier. Must be one of the shipped providers (anthropic, google) or a planned provider once available. |
consensus.models[].model | string | — | Model identifier recognized by the provider's API. |
consensus.models[].api_key_env | string | — | Name of the environment variable containing the API key. The MCP server reads the key from the environment at runtime. Keys are NEVER stored in config files. |
consensus.models[].temperature | float | 0.6 | Sampling temperature. Lower values produce more deterministic responses. |
consensus.modes.review | boolean | true | Whether review mode calls are dispatched to consensus. When false, review-mode calls return "unavailable". |
consensus.modes.verdict | boolean | true | Whether verdict mode calls are dispatched to consensus. |
consensus.modes.investigate | boolean | true | Whether investigate mode calls are dispatched to consensus. |
consensus.enabled is true but consensus.models has fewer entries than min_models, the MCP server logs a warning and treats consensus as unavailable.api_key_env variable is not set in the environment, that model is skipped. If the remaining count drops below min_models, status is "unavailable".temperature field is clamped to [0.0, 1.0].The consensus system is designed to never break existing workflows. Degradation follows a strict hierarchy:
Consensus available — All configured models respond within the timeout. status: "complete". Full multi-model synthesis is returned to the calling skill.
Partially available — Some models fail or time out, but the number of successful responses is >= min_models. status: "partial". Aggregation proceeds with available responses. The models_queried and models_responded fields indicate the gap.
Unavailable (configuration) — No config file, enabled: false, no models configured, or no API keys in environment. status: "unavailable". The calling skill receives an immediate response and proceeds with its existing single-model logic. Zero overhead, zero delay.
MCP server not running — The consensus_query tool is not present in the tool list. Skills that integrate consensus check for tool availability before attempting calls. If the tool is absent, the consensus step is skipped entirely. No errors, no retries.
In all cases, the calling skill's core logic is self-contained. Consensus is additive signal, never a gate.
Consensus multiplies API costs by the number of models queried. The following table estimates cost multipliers relative to a single-model baseline, assuming prompts of roughly equal token count across models.
| Models Configured | Multiplier |
|---|---|
| 2 | 2x |
| 3 | 3x |
| 4 | 4x |
Quality gate integrates consensus at two points: the stagnation judge (verdict mode, every round) and periodic red-team review (review mode, periodic). The table below shows total additional API calls from consensus across different run lengths, assuming periodic red-team consensus fires on rounds 1, 4, 7, 10 (every 3rd round starting from round 1).
| Run Length | Stagnation Verdicts | Periodic Red-Team Reviews | Total Consensus Calls | With 2 Models | With 3 Models | With 4 Models |
|---|---|---|---|---|---|---|
| 5 rounds | 5 | 2 (rounds 1, 4) | 7 | 14 API calls | 21 API calls | 28 API calls |
| 10 rounds | 10 | 4 (rounds 1, 4, 7, 10) | 14 | 28 API calls | 42 API calls | 56 API calls |
Periodic application (rather than every-round) keeps cost manageable. A 10-round quality gate run with 2 models adds 28 API calls total — significant but bounded. Teams should start with 2 models and add more only after validating the signal-to-cost ratio.
consensus.modes.verdict: false) to disable consensus for lower-value decision points.models_responded in results to detect models that consistently time out (wasted cost).api_key_env field holds the variable name, never the key itself.consensus_query must have a code path that works without it. Consensus is additive, never required.consensus_query tool exists in the current tool list before calling it. If absent, skip the consensus step silently.| Skill | Decision Point | Consensus Mode | Frequency |
|---|---|---|---|
| quality-gate | Stagnation judge | verdict | Every judge dispatch |
| quality-gate | Red-team review | review | Rounds 1, 4, 7, 10, 13 |
| design | Challenger (deep dive) | investigate | Every deep-dive challenge |
| Skill | Decision Point | Why Excluded |
|---|---|---|
| inquisitor | 5-dimension dispatch | Subagents write and run tests — requires local filesystem and execution environment. External models cannot participate. |
| quality-gate | Fix verifier | Mechanical check (did the fix address the finding?). Single-model Sonnet is sufficient. Multi-model adds cost without proportional value. |
| quality-gate | Fix agent | Requires filesystem access to modify code/docs. External models cannot write local files. |
| red-team | Standalone full-loop | When invoked directly (not via quality-gate), red-team runs its own stagnation loop. Consensus integration is at the quality-gate level, which owns the iteration. |
| design | Investigation agents | Domain Researcher, Impact Analyst all need filesystem access. External models cannot read the local codebase. Recon provides structural context but runs locally. |
| build | Task implementation | Code generation requires filesystem access and project context. |
An excluded decision point should be reconsidered for consensus integration if:
The crucible-consensus MCP server is a standalone Python process at mcp-servers/crucible-consensus/.
.claude/consensus-config.yaml from the project directoryconsensus_query tool via MCP protocoltimeout_seconds for each model
c. Collects successful responses, discards failures
d. If successful_count >= min_models: runs aggregation prompt, returns synthesis
e. If successful_count < min_models: returns status "unavailable"Each provider adapter handles:
Shipped providers: anthropic, google. Planned: openai, deepseek (deferred to follow-up). Adding a new provider requires one adapter function following the BaseProvider protocol.
Uses asyncio.gather() with per-model timeout. Each model call is independent. Failures are caught per-model and do not abort other model calls.
After collecting responses, the aggregator:
skills/consensus/aggregation-{mode}-prompt.mdThe MCP server is testable with mock model responses. No live API keys needed for the test suite.
Key test scenarios:
consensus.enabled: false → immediate "unavailable" return