Search everything...

Skill

consensus

Dispatches prompts to multiple LLM providers in parallel via MCP for consensus on high-stakes decisions, synthesizing responses for quality gates and design reviews.

OpenAI

Anthropic

ai-ml

code-quality

npx claudepluginhub raddue/crucible

Tool Access

This skill uses the workspace's default tool permissions.

Preview

Supporting Assets

aggregation-investigate-prompt.mdaggregation-review-prompt.mdaggregation-verdict-prompt.mdconsensus-config-example.yaml

SKILL.md

Similar Skills

council

332

Run multi-judge consensus.

20 files

agentops

agents-consilium

Queries external AI agents (Codex, Gemini, OpenCode, Claude Code headless) in parallel for code reviews, bug investigations, architecture choices, security audits, and consensus on complex decisions.

12 files

ai-driven-development

model-council

Orchestrates parallel analysis of coding problems across AI models (Claude, GPT, Gemini, Grok) via CLI tools or APIs, collects recommendations, and synthesizes optimal solution.

2 files

skills

Stats

Stars10

Forks2

Last CommitApr 12, 2026

Actions

View Source View Plugin View on GitHub View README

Help us improve

Share bugs, ideas, or general feedback.

consensus | crucible | ClaudePluginHub

Back to Skills

Skill

consensus

From crucible

Dispatches prompts to multiple LLM providers in parallel via MCP for consensus on high-stakes decisions, synthesizing responses for quality gates and design reviews.

npx claudepluginhub raddue/crucible

Tool Access

This skill uses the workspace's default tool permissions.

Preview

Supporting Assets

aggregation-investigate-prompt.mdaggregation-review-prompt.mdaggregation-verdict-prompt.mdconsensus-config-example.yaml

SKILL.md

Overview

All subagent dispatches use disk-mediated dispatch. See shared/dispatch-convention.md for the full protocol.

Multi-model consensus dispatches high-stakes decision prompts to multiple LLM providers in parallel via an MCP server, then synthesizes the responses. It surfaces blind spots that single-model review misses. Entirely opt-in — without configuration, all skills behave exactly as before.

The consensus system is not a replacement for existing skill logic. It is an amplifier applied at specific, high-leverage decision points where the cost of a missed defect or a wrong judgment is disproportionately high. Skills that integrate consensus call it at defined moments (e.g., the stagnation judge in quality-gate, the Challenger in design review) and treat its output as additional signal, not as an override.

Tool Interface

`consensus_query` (MCP tool)

Dispatches a prompt to all configured models in parallel, collects responses, and returns a synthesized result.

Parameters

Parameter	Type	Required	Description
`prompt`	string	Yes	The decision prompt to send to each model. Should be self-contained and unambiguous.
`context`	string	Yes	Supporting context (artifact content, conversation history, prior findings). Sent verbatim to each model alongside the prompt.
`mode`	enum	Yes	One of `"review"`, `"verdict"`, or `"investigate"`. Controls how responses are aggregated.
`metadata`	object	No	Optional metadata for traceability. Fields: `artifact_type` (string), `round_number` (int), `score_progression` (list of ints).

Return Value

Field	Type	Description
`status`	enum	`"complete"` (all models responded), `"partial"` (some models responded, count >= min_models), `"unavailable"` (fewer than min_models responded or tool not configured).
`models_queried`	int	Number of models the query was dispatched to.
`models_responded`	int	Number of models that returned a valid response within the timeout.
`synthesis`	string	Aggregated result text, formatted according to the selected mode.
`agreements`	list	Findings or verdicts where 2+ models converged.
`disagreements`	list	Findings or verdicts where models explicitly contradicted each other.
`unique_findings`	list	Findings raised by exactly one model, with provenance.
`per_model`	list	Per-model detail. Each entry: `{ provider: string, model_id: string, content: string, responded: boolean }`.

When status is "unavailable", all list fields are empty and synthesis contains a short explanation (e.g., "Consensus unavailable: no models configured"). The caller should fall back to single-model behavior.

Consensus Modes

review

For adversarial review of artifacts (red-team rounds, design review, code review).

Each model receives the prompt and context independently and reviews the artifact.
The aggregator merges findings and deduplicates by semantic similarity (same root cause = one finding).
Confidence levels:
- High — 3 or more models independently identify the same finding.
- Medium — 2 models independently identify the same finding.
- Low — A single model identifies a finding with no corroboration.
Severity uses Crucible's standard taxonomy: Fatal, Significant, Minor.
Every finding carries per-finding provenance — the list of model IDs that surfaced it.
Deduplicated findings are ordered by severity (Fatal first), then by confidence (High first).

verdict

For binary or ternary decisions (stagnation judge, fix verification, pass/fail gates).

Each model renders a verdict (e.g., STAGNATION or PROGRESS) along with structured reasoning.
The aggregator reports the verdict distribution (e.g., "STAGNATION: 3, PROGRESS: 1").
Confidence levels:
- Unanimous — All responding models agree.
- Supermajority — 75%+ of responding models agree.
- Split — No supermajority exists.
Split decisions are flagged with "requires_human_judgment": true in the synthesis.
The majority verdict is reported as the recommendation, but the caller decides how to act on splits.

investigate

For design exploration (Challenger mode, architectural analysis).

Each model investigates the prompt independently, surfacing risks, alternatives, or concerns.
The aggregator merges findings and deduplicates by semantic similarity.
Findings unique to a single model are marked with provenance and highlighted as potential blind-spot discoveries.
Contradictory findings (where two models reach opposing conclusions about the same aspect) are flagged explicitly in the disagreements list with both positions and their reasoning.
The synthesis groups results into: shared concerns, unique discoveries, and contradictions.

Configuration

Consensus is configured via .claude/consensus-config.yaml in the project root. If this file is absent or consensus.enabled is false, all consensus calls return status: "unavailable" and skills fall back to single-model behavior with zero overhead.

Schema

consensus:
  enabled: true                # Master switch. Default: false.
  min_models: 2                # Minimum models that must respond for a valid result. Default: 2.
  timeout_seconds: 120         # Per-query timeout. Models that haven't responded are marked responded: false. Default: 120.
  models:
    - provider: anthropic      # Provider identifier. Shipped: "anthropic", "google". Planned: "openai", "deepseek".
      model: claude-sonnet-4-20250514  # Model identifier within the provider.
      api_key_env: ANTHROPIC_API_KEY   # Environment variable name holding the API key. NEVER a raw key.
      temperature: 0.6         # Sampling temperature for this model. Default: 0.6. Range: 0.0-1.0.
    - provider: google
      model: gemini-2.5-pro
      api_key_env: GOOGLE_API_KEY
      temperature: 0.6
  modes:
    review: true               # Enable consensus for review-mode calls. Default: true.
    verdict: true              # Enable consensus for verdict-mode calls. Default: true.
    investigate: true          # Enable consensus for investigate-mode calls. Default: true.

Field Reference

Field	Type	Default	Description
`consensus.enabled`	boolean	`false`	Master switch. When false, all `consensus_query` calls immediately return `status: "unavailable"`.
`consensus.min_models`	int	`2`	Minimum number of models that must respond for the result to be `"complete"` or `"partial"`. If fewer respond, status is `"unavailable"`. Must be >= 2.
`consensus.timeout_seconds`	int	`120`	Maximum time (seconds) to wait for all models to respond. Models exceeding this are marked `responded: false`. Must be >= 10 and <= 600.
`consensus.models`	list	`[]`	List of model configurations. At least `min_models` entries required.
`consensus.models[].provider`	string	—	Provider identifier. Must be one of the shipped providers (`anthropic`, `google`) or a planned provider once available.
`consensus.models[].model`	string	—	Model identifier recognized by the provider's API.
`consensus.models[].api_key_env`	string	—	Name of the environment variable containing the API key. The MCP server reads the key from the environment at runtime. Keys are NEVER stored in config files.
`consensus.models[].temperature`	float	`0.6`	Sampling temperature. Lower values produce more deterministic responses.
`consensus.modes.review`	boolean	`true`	Whether `review` mode calls are dispatched to consensus. When false, review-mode calls return `"unavailable"`.
`consensus.modes.verdict`	boolean	`true`	Whether `verdict` mode calls are dispatched to consensus.
`consensus.modes.investigate`	boolean	`true`	Whether `investigate` mode calls are dispatched to consensus.

Validation Rules

If consensus.enabled is true but consensus.models has fewer entries than min_models, the MCP server logs a warning and treats consensus as unavailable.
If a referenced api_key_env variable is not set in the environment, that model is skipped. If the remaining count drops below min_models, status is "unavailable".
Duplicate provider+model combinations are rejected at config load time.
The temperature field is clamped to [0.0, 1.0].

Graceful Degradation

The consensus system is designed to never break existing workflows. Degradation follows a strict hierarchy:

Consensus available — All configured models respond within the timeout. status: "complete". Full multi-model synthesis is returned to the calling skill.
Partially available — Some models fail or time out, but the number of successful responses is >= min_models. status: "partial". Aggregation proceeds with available responses. The models_queried and models_responded fields indicate the gap.
Unavailable (configuration) — No config file, enabled: false, no models configured, or no API keys in environment. status: "unavailable". The calling skill receives an immediate response and proceeds with its existing single-model logic. Zero overhead, zero delay.
MCP server not running — The consensus_query tool is not present in the tool list. Skills that integrate consensus check for tool availability before attempting calls. If the tool is absent, the consensus step is skipped entirely. No errors, no retries.

In all cases, the calling skill's core logic is self-contained. Consensus is additive signal, never a gate.

Cost Estimation

Consensus multiplies API costs by the number of models queried. The following table estimates cost multipliers relative to a single-model baseline, assuming prompts of roughly equal token count across models.

Per-Query Cost Multiplier

Models Configured	Multiplier
2	2x
3	3x
4	4x

Estimated Impact on Quality Gate Runs

Quality gate integrates consensus at two points: the stagnation judge (verdict mode, every round) and periodic red-team review (review mode, periodic). The table below shows total additional API calls from consensus across different run lengths, assuming periodic red-team consensus fires on rounds 1, 4, 7, 10 (every 3rd round starting from round 1).

Run Length	Stagnation Verdicts	Periodic Red-Team Reviews	Total Consensus Calls	With 2 Models	With 3 Models	With 4 Models
5 rounds	5	2 (rounds 1, 4)	7	14 API calls	21 API calls	28 API calls
10 rounds	10	4 (rounds 1, 4, 7, 10)	14	28 API calls	42 API calls	56 API calls

Periodic application (rather than every-round) keeps cost manageable. A 10-round quality gate run with 2 models adds 28 API calls total — significant but bounded. Teams should start with 2 models and add more only after validating the signal-to-cost ratio.

Cost Management Recommendations

Start with 2 models to validate consensus value before scaling.
Use mode-level toggles (consensus.modes.verdict: false) to disable consensus for lower-value decision points.
Monitor models_responded in results to detect models that consistently time out (wasted cost).

Red Flags

Never

Use consensus on every red-team round. Consensus is for periodic checkpoints (e.g., rounds 1, 4, 7), not every iteration. Every-round consensus is wasteful and slows feedback loops.
Treat single-model unique findings as less important than multi-model agreements. Unique findings are often the most valuable — they represent blind spots. Confidence level indicates corroboration, not importance.
Pass consensus provenance to the next red-team reviewer. This creates anchoring bias. The next reviewer must form independent judgments. Provenance is for the aggregator and the human operator, not for downstream reviewers.
Store API keys in config files. Keys are referenced by environment variable name only. The api_key_env field holds the variable name, never the key itself.
Log prompt content or API keys. The MCP server must not persist prompt text, context, or API credentials to disk or stdout. Structured metadata (model IDs, response times, token counts) may be logged.

Always

Fall back gracefully when consensus is unavailable. Every skill that calls consensus_query must have a code path that works without it. Consensus is additive, never required.
Include provenance metadata in consensus results. Every finding and verdict must trace back to the model(s) that produced it. Provenance enables debugging, trust calibration, and cost attribution.
Check tool availability before attempting consensus calls. Skills must verify that the consensus_query tool exists in the current tool list before calling it. If absent, skip the consensus step silently.

Integration Points

Active Integration

Skill	Decision Point	Consensus Mode	Frequency
quality-gate	Stagnation judge	verdict	Every judge dispatch
quality-gate	Red-team review	review	Rounds 1, 4, 7, 10, 13
design	Challenger (deep dive)	investigate	Every deep-dive challenge

Excluded (with rationale)

Skill	Decision Point	Why Excluded
inquisitor	5-dimension dispatch	Subagents write and run tests — requires local filesystem and execution environment. External models cannot participate.
quality-gate	Fix verifier	Mechanical check (did the fix address the finding?). Single-model Sonnet is sufficient. Multi-model adds cost without proportional value.
quality-gate	Fix agent	Requires filesystem access to modify code/docs. External models cannot write local files.
red-team	Standalone full-loop	When invoked directly (not via quality-gate), red-team runs its own stagnation loop. Consensus integration is at the quality-gate level, which owns the iteration.
design	Investigation agents	Domain Researcher, Impact Analyst all need filesystem access. External models cannot read the local codebase. Recon provides structural context but runs locally.
build	Task implementation	Code generation requires filesystem access and project context.

Reconsideration Criteria

An excluded decision point should be reconsidered for consensus integration if:

The task becomes self-contained (no filesystem dependency), OR
MCP servers gain filesystem-pass-through capabilities, OR
Empirical evidence shows the single-model output has systematic blind spots at that point

MCP Server Implementation Guide

The crucible-consensus MCP server is a standalone Python process at mcp-servers/crucible-consensus/.

Architecture

Reads .claude/consensus-config.yaml from the project directory
Validates API key environment variables are set
Exposes the consensus_query tool via MCP protocol
On each tool call: a. Dispatches the prompt + context to all configured models in parallel b. Waits up to timeout_seconds for each model c. Collects successful responses, discards failures d. If successful_count >= min_models: runs aggregation prompt, returns synthesis e. If successful_count < min_models: returns status "unavailable"

Provider Adapters

Each provider adapter handles:

API authentication (reading key from environment)
Request formatting (model-specific API shape)
Response extraction (model-specific response parsing)
Error handling (HTTP errors, rate limits, malformed responses)

Shipped providers: anthropic, google. Planned: openai, deepseek (deferred to follow-up). Adding a new provider requires one adapter function following the BaseProvider protocol.

Parallel Dispatch

Uses asyncio.gather() with per-model timeout. Each model call is independent. Failures are caught per-model and do not abort other model calls.

Aggregation

After collecting responses, the aggregator:

Selects the mode-specific prompt template from skills/consensus/aggregation-{mode}-prompt.md
Injects all successful model responses (identified by provider name)
Dispatches to the first configured Anthropic model (primary aggregator)
Parses the structured JSON output from the aggregator response
Returns the parsed output as the tool response

Testing

The MCP server is testable with mock model responses. No live API keys needed for the test suite.

Key test scenarios:

All models succeed → consensus response
N-1 models fail, 1 succeeds, min_models=2 → unavailable
2 of 4 models succeed, min_models=2 → partial consensus
All models fail → unavailable
Timeout handling (one model slow, others fast)
Malformed response from one model (does not crash aggregation)
consensus.enabled: false → immediate "unavailable" return

Similar Skills

council

332

Run multi-judge consensus.

20 files

agentops

agents-consilium

Queries external AI agents (Codex, Gemini, OpenCode, Claude Code headless) in parallel for code reviews, bug investigations, architecture choices, security audits, and consensus on complex decisions.

12 files

ai-driven-development

model-council

Orchestrates parallel analysis of coding problems across AI models (Claude, GPT, Gemini, Grok) via CLI tools or APIs, collects recommendations, and synthesizes optimal solution.

2 files

skills

Stats

Stars10

Forks2

Last CommitApr 12, 2026

Actions

View Source View Plugin View on GitHub View README

Help us improve

Share bugs, ideas, or general feedback.

Overview

All subagent dispatches use disk-mediated dispatch. See shared/dispatch-convention.md for the full protocol.

Tool Interface

`consensus_query` (MCP tool)

Dispatches a prompt to all configured models in parallel, collects responses, and returns a synthesized result.

Parameters

Parameter	Type	Required	Description
`prompt`	string	Yes	The decision prompt to send to each model. Should be self-contained and unambiguous.
`context`	string	Yes	Supporting context (artifact content, conversation history, prior findings). Sent verbatim to each model alongside the prompt.
`mode`	enum	Yes	One of `"review"`, `"verdict"`, or `"investigate"`. Controls how responses are aggregated.
`metadata`	object	No	Optional metadata for traceability. Fields: `artifact_type` (string), `round_number` (int), `score_progression` (list of ints).

Return Value

Field	Type	Description
`status`	enum	`"complete"` (all models responded), `"partial"` (some models responded, count >= min_models), `"unavailable"` (fewer than min_models responded or tool not configured).
`models_queried`	int	Number of models the query was dispatched to.
`models_responded`	int	Number of models that returned a valid response within the timeout.
`synthesis`	string	Aggregated result text, formatted according to the selected mode.
`agreements`	list	Findings or verdicts where 2+ models converged.
`disagreements`	list	Findings or verdicts where models explicitly contradicted each other.
`unique_findings`	list	Findings raised by exactly one model, with provenance.
`per_model`	list	Per-model detail. Each entry: `{ provider: string, model_id: string, content: string, responded: boolean }`.

Consensus Modes

review

For adversarial review of artifacts (red-team rounds, design review, code review).

Each model receives the prompt and context independently and reviews the artifact.
The aggregator merges findings and deduplicates by semantic similarity (same root cause = one finding).
Confidence levels:
- High — 3 or more models independently identify the same finding.
- Medium — 2 models independently identify the same finding.
- Low — A single model identifies a finding with no corroboration.
Severity uses Crucible's standard taxonomy: Fatal, Significant, Minor.
Every finding carries per-finding provenance — the list of model IDs that surfaced it.
Deduplicated findings are ordered by severity (Fatal first), then by confidence (High first).

verdict

For binary or ternary decisions (stagnation judge, fix verification, pass/fail gates).

Each model renders a verdict (e.g., STAGNATION or PROGRESS) along with structured reasoning.
The aggregator reports the verdict distribution (e.g., "STAGNATION: 3, PROGRESS: 1").
Confidence levels:
- Unanimous — All responding models agree.
- Supermajority — 75%+ of responding models agree.
- Split — No supermajority exists.
Split decisions are flagged with "requires_human_judgment": true in the synthesis.
The majority verdict is reported as the recommendation, but the caller decides how to act on splits.

investigate

For design exploration (Challenger mode, architectural analysis).

Each model investigates the prompt independently, surfacing risks, alternatives, or concerns.
The aggregator merges findings and deduplicates by semantic similarity.
Findings unique to a single model are marked with provenance and highlighted as potential blind-spot discoveries.
Contradictory findings (where two models reach opposing conclusions about the same aspect) are flagged explicitly in the disagreements list with both positions and their reasoning.
The synthesis groups results into: shared concerns, unique discoveries, and contradictions.

Configuration

Schema

consensus:
  enabled: true                # Master switch. Default: false.
  min_models: 2                # Minimum models that must respond for a valid result. Default: 2.
  timeout_seconds: 120         # Per-query timeout. Models that haven't responded are marked responded: false. Default: 120.
  models:
    - provider: anthropic      # Provider identifier. Shipped: "anthropic", "google". Planned: "openai", "deepseek".
      model: claude-sonnet-4-20250514  # Model identifier within the provider.
      api_key_env: ANTHROPIC_API_KEY   # Environment variable name holding the API key. NEVER a raw key.
      temperature: 0.6         # Sampling temperature for this model. Default: 0.6. Range: 0.0-1.0.
    - provider: google
      model: gemini-2.5-pro
      api_key_env: GOOGLE_API_KEY
      temperature: 0.6
  modes:
    review: true               # Enable consensus for review-mode calls. Default: true.
    verdict: true              # Enable consensus for verdict-mode calls. Default: true.
    investigate: true          # Enable consensus for investigate-mode calls. Default: true.

Field Reference

Field	Type	Default	Description
`consensus.enabled`	boolean	`false`	Master switch. When false, all `consensus_query` calls immediately return `status: "unavailable"`.
`consensus.min_models`	int	`2`	Minimum number of models that must respond for the result to be `"complete"` or `"partial"`. If fewer respond, status is `"unavailable"`. Must be >= 2.
`consensus.timeout_seconds`	int	`120`	Maximum time (seconds) to wait for all models to respond. Models exceeding this are marked `responded: false`. Must be >= 10 and <= 600.
`consensus.models`	list	`[]`	List of model configurations. At least `min_models` entries required.
`consensus.models[].provider`	string	—	Provider identifier. Must be one of the shipped providers (`anthropic`, `google`) or a planned provider once available.
`consensus.models[].model`	string	—	Model identifier recognized by the provider's API.
`consensus.models[].api_key_env`	string	—	Name of the environment variable containing the API key. The MCP server reads the key from the environment at runtime. Keys are NEVER stored in config files.
`consensus.models[].temperature`	float	`0.6`	Sampling temperature. Lower values produce more deterministic responses.
`consensus.modes.review`	boolean	`true`	Whether `review` mode calls are dispatched to consensus. When false, review-mode calls return `"unavailable"`.
`consensus.modes.verdict`	boolean	`true`	Whether `verdict` mode calls are dispatched to consensus.
`consensus.modes.investigate`	boolean	`true`	Whether `investigate` mode calls are dispatched to consensus.

Validation Rules

If consensus.enabled is true but consensus.models has fewer entries than min_models, the MCP server logs a warning and treats consensus as unavailable.
If a referenced api_key_env variable is not set in the environment, that model is skipped. If the remaining count drops below min_models, status is "unavailable".
Duplicate provider+model combinations are rejected at config load time.
The temperature field is clamped to [0.0, 1.0].

Graceful Degradation

The consensus system is designed to never break existing workflows. Degradation follows a strict hierarchy:

Consensus available — All configured models respond within the timeout. status: "complete". Full multi-model synthesis is returned to the calling skill.
Partially available — Some models fail or time out, but the number of successful responses is >= min_models. status: "partial". Aggregation proceeds with available responses. The models_queried and models_responded fields indicate the gap.
Unavailable (configuration) — No config file, enabled: false, no models configured, or no API keys in environment. status: "unavailable". The calling skill receives an immediate response and proceeds with its existing single-model logic. Zero overhead, zero delay.
MCP server not running — The consensus_query tool is not present in the tool list. Skills that integrate consensus check for tool availability before attempting calls. If the tool is absent, the consensus step is skipped entirely. No errors, no retries.

In all cases, the calling skill's core logic is self-contained. Consensus is additive signal, never a gate.

Cost Estimation

Per-Query Cost Multiplier

Models Configured	Multiplier
2	2x
3	3x
4	4x

Estimated Impact on Quality Gate Runs

Run Length	Stagnation Verdicts	Periodic Red-Team Reviews	Total Consensus Calls	With 2 Models	With 3 Models	With 4 Models
5 rounds	5	2 (rounds 1, 4)	7	14 API calls	21 API calls	28 API calls
10 rounds	10	4 (rounds 1, 4, 7, 10)	14	28 API calls	42 API calls	56 API calls

Cost Management Recommendations

Start with 2 models to validate consensus value before scaling.
Use mode-level toggles (consensus.modes.verdict: false) to disable consensus for lower-value decision points.
Monitor models_responded in results to detect models that consistently time out (wasted cost).

Red Flags

Never

Use consensus on every red-team round. Consensus is for periodic checkpoints (e.g., rounds 1, 4, 7), not every iteration. Every-round consensus is wasteful and slows feedback loops.
Treat single-model unique findings as less important than multi-model agreements. Unique findings are often the most valuable — they represent blind spots. Confidence level indicates corroboration, not importance.
Pass consensus provenance to the next red-team reviewer. This creates anchoring bias. The next reviewer must form independent judgments. Provenance is for the aggregator and the human operator, not for downstream reviewers.
Store API keys in config files. Keys are referenced by environment variable name only. The api_key_env field holds the variable name, never the key itself.
Log prompt content or API keys. The MCP server must not persist prompt text, context, or API credentials to disk or stdout. Structured metadata (model IDs, response times, token counts) may be logged.

Always

Fall back gracefully when consensus is unavailable. Every skill that calls consensus_query must have a code path that works without it. Consensus is additive, never required.
Include provenance metadata in consensus results. Every finding and verdict must trace back to the model(s) that produced it. Provenance enables debugging, trust calibration, and cost attribution.
Check tool availability before attempting consensus calls. Skills must verify that the consensus_query tool exists in the current tool list before calling it. If absent, skip the consensus step silently.

Integration Points

Active Integration

Skill	Decision Point	Consensus Mode	Frequency
quality-gate	Stagnation judge	verdict	Every judge dispatch
quality-gate	Red-team review	review	Rounds 1, 4, 7, 10, 13
design	Challenger (deep dive)	investigate	Every deep-dive challenge

Excluded (with rationale)

Skill	Decision Point	Why Excluded
inquisitor	5-dimension dispatch	Subagents write and run tests — requires local filesystem and execution environment. External models cannot participate.
quality-gate	Fix verifier	Mechanical check (did the fix address the finding?). Single-model Sonnet is sufficient. Multi-model adds cost without proportional value.
quality-gate	Fix agent	Requires filesystem access to modify code/docs. External models cannot write local files.
red-team	Standalone full-loop	When invoked directly (not via quality-gate), red-team runs its own stagnation loop. Consensus integration is at the quality-gate level, which owns the iteration.
design	Investigation agents	Domain Researcher, Impact Analyst all need filesystem access. External models cannot read the local codebase. Recon provides structural context but runs locally.
build	Task implementation	Code generation requires filesystem access and project context.

Reconsideration Criteria

An excluded decision point should be reconsidered for consensus integration if:

The task becomes self-contained (no filesystem dependency), OR
MCP servers gain filesystem-pass-through capabilities, OR
Empirical evidence shows the single-model output has systematic blind spots at that point

MCP Server Implementation Guide

The crucible-consensus MCP server is a standalone Python process at mcp-servers/crucible-consensus/.

Architecture

Reads .claude/consensus-config.yaml from the project directory
Validates API key environment variables are set
Exposes the consensus_query tool via MCP protocol
On each tool call: a. Dispatches the prompt + context to all configured models in parallel b. Waits up to timeout_seconds for each model c. Collects successful responses, discards failures d. If successful_count >= min_models: runs aggregation prompt, returns synthesis e. If successful_count < min_models: returns status "unavailable"

Provider Adapters

Each provider adapter handles:

API authentication (reading key from environment)
Request formatting (model-specific API shape)
Response extraction (model-specific response parsing)
Error handling (HTTP errors, rate limits, malformed responses)

Shipped providers: anthropic, google. Planned: openai, deepseek (deferred to follow-up). Adding a new provider requires one adapter function following the BaseProvider protocol.

Parallel Dispatch

Uses asyncio.gather() with per-model timeout. Each model call is independent. Failures are caught per-model and do not abort other model calls.

Aggregation

After collecting responses, the aggregator:

Selects the mode-specific prompt template from skills/consensus/aggregation-{mode}-prompt.md
Injects all successful model responses (identified by provider name)
Dispatches to the first configured Anthropic model (primary aggregator)
Parses the structured JSON output from the aggregator response
Returns the parsed output as the tool response

Testing

The MCP server is testable with mock model responses. No live API keys needed for the test suite.

Key test scenarios:

All models succeed → consensus response
N-1 models fail, 1 succeeds, min_models=2 → unavailable
2 of 4 models succeed, min_models=2 → partial consensus
All models fail → unavailable
Timeout handling (one model slow, others fast)
Malformed response from one model (does not crash aggregation)
consensus.enabled: false → immediate "unavailable" return

consensus

Tool Access

Preview

Supporting Assets

SKILL.md

Similar Skills

Help us improve

Help us improve

consensus

Tool Access

Preview

Supporting Assets

SKILL.md

Overview

Tool Interface

consensus_query (MCP tool)

Parameters

Return Value

Consensus Modes

review

verdict

investigate

Configuration

Schema

Field Reference

Validation Rules

Graceful Degradation

Cost Estimation

Per-Query Cost Multiplier

Estimated Impact on Quality Gate Runs

Cost Management Recommendations

Red Flags

Never

Always

Integration Points

Active Integration

Excluded (with rationale)

Reconsideration Criteria

MCP Server Implementation Guide

Architecture

Provider Adapters

Parallel Dispatch

Aggregation

Testing

Similar Skills

Help us improve

Overview

Tool Interface

consensus_query (MCP tool)

Parameters

Return Value

Consensus Modes

review

verdict

investigate

Configuration

Schema

Field Reference

Validation Rules

Graceful Degradation

Cost Estimation

Per-Query Cost Multiplier

Estimated Impact on Quality Gate Runs

Cost Management Recommendations

Red Flags

Never

Always

Integration Points

Active Integration

Excluded (with rationale)

Reconsideration Criteria

MCP Server Implementation Guide

Architecture

Provider Adapters

Parallel Dispatch

Aggregation

Testing

`consensus_query` (MCP tool)

`consensus_query` (MCP tool)