From its-hub
Guides users through inference-time scaling using its_hub: algorithm selection (Self-Consistency, Best-of-N, Beam Search, Particle Filtering), budget tuning, reward model setup, and troubleshooting.
npx claudepluginhub red-hat-ai-innovation-team/its_hub --plugin its-hub
Slash command
/its-hub:inference-scaling-guide
its_hub generates multiple LLM responses and selects the best one using voting, scoring, or search. All algorithms share the same interface: `ainfer(lm, prompt, budget)` (async) or `infer(...)` (sync).
For API reference and conceptual overviews, consult the docs at https://ai-innovation.team/its_hub and the docs/ directory. This skill covers practical knowledge, decision frameworks, and troubleshooting.
| Need | Algorithm | Why |
|---|---|---|
| Fast improvement, tool calling | Self-Consistency | Voting is cheap, no reward model needed, excellent for tool-call consensus |
| Highest quality single response | Best-of-N | Scores every candidate, picks the best — requires a reward model |
| Step-by-step reasoning | Beam Search | Evaluates partial solutions at each step — requires process reward model + GPU |
| Complex multi-path reasoning | Particle Filtering | Maintains diverse reasoning paths — requires process reward model + GPU |
| Long multi-step tasks | Entropic Particle Filtering | Avoids premature convergence on long sequences — requires process reward model + GPU |
For tool calling with Self-Consistency, `tool_vote="tool_hierarchical"` is the recommended starting point.

The `budget` parameter controls how many LLM calls are made per prompt:
| Algorithm | Budget meaning | Starting point | Diminishing returns |
|---|---|---|---|
| Self-Consistency | Number of parallel generations | 5-8 | Beyond 16 for most tasks |
| Best-of-N | Number of candidates to score | 4-8 | Beyond 16 |
| Beam Search | Total generations (= beam_width × steps) | 16-32 | Depends on step count |
| Particle Filtering | Number of particles | 8-16 | Beyond 32 |
Budget vs cost: each budget unit = 1 LLM call. Budget 8 costs 8x a single call. Start low, increase only if quality improves.
Budget vs latency: Self-Consistency and Best-of-N run in parallel (latency ≈ single call). Beam Search and Particle Filtering generate one step at a time, so latency grows with the number of reasoning steps (latency ≈ steps × step time).
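The cost rule above is plain multiplication. A minimal sketch, assuming a flat price per call; the `estimate_cost` helper and the price used are hypothetical and not part of its_hub:

```python
def estimate_cost(budget: int, cost_per_call: float) -> float:
    """Approximate cost of one prompt: one LLM call per budget unit."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    return budget * cost_per_call


# Hypothetical price of $0.002 per call, for illustration only.
for budget in (1, 4, 8, 16):
    print(f"budget={budget:>2}  cost=${estimate_cost(budget, 0.002):.3f}")
```

Running this for a few budgets before a large job makes it easier to decide where quality gains stop justifying the spend.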
Score complete responses. Used by Best-of-N.
LLM Judge (easiest setup — uses an LLM to score):
import os

from its_hub import LLMJudge, OpenAICompatibleLanguageModel

# The judge LM scores complete responses; the API key is read from the environment
judge_lm = OpenAICompatibleLanguageModel(
    endpoint="https://api.openai.com/v1",
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="gpt-4o-mini",
)
judge = LLMJudge(lm=judge_lm, fallback_score=5.0)
The judge model can be the same as the generation model, but using a stronger model as judge improves quality.
Score each reasoning step. Used by Beam Search and Particle Filtering. Requires a local GPU.
from its_hub.core.reward_models.local_vllm_prm import LocalVllmProcessRewardModel

prm = LocalVllmProcessRewardModel(
    model_name="Qwen/Qwen2.5-Math-PRM-7B",
    device="cuda:0",
    aggregation_method="prod",  # or "mean", "min", "max"
)
Aggregation methods:
- `prod`: Product of step scores (strict: one bad step kills the score)
- `mean`: Average of step scores (forgiving)
- `min`: Worst step score (conservative)
- `max`: Best step score (optimistic)

Start with `prod` for math, `mean` for general reasoning.
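To see how the four methods differ on the same input, here is a small standalone sketch. It is an illustration of the arithmetic only, not the code used inside `LocalVllmProcessRewardModel`; the `aggregate` function and the step scores are made up for the example:

```python
import math
from statistics import mean


def aggregate(step_scores: list[float], method: str = "prod") -> float:
    """Collapse per-step scores into a single score for the whole response."""
    if not step_scores:
        raise ValueError("step_scores must not be empty")
    if method == "prod":
        return math.prod(step_scores)
    if method == "mean":
        return mean(step_scores)
    if method == "min":
        return min(step_scores)
    if method == "max":
        return max(step_scores)
    raise ValueError(f"unknown aggregation method: {method}")


# Made-up step scores: three strong steps and one weak one.
scores = [0.9, 0.9, 0.2, 0.9]
for method in ("prod", "mean", "min", "max"):
    print(method, round(aggregate(scores, method), 3))
```

With one weak step among strong ones, `prod` and `min` pull the overall score down sharply while `mean` and `max` largely hide the weak step, which is why `prod` suits math, where a single wrong step invalidates the answer.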
Self-Consistency supports voting on tool calls, not just text:
sc = SelfConsistency(tool_vote="tool_hierarchical")
result = sc.infer(lm, messages, budget=5, tools=tools, tool_choice="auto")
Tool voting modes:
- `tool_name`: Vote on which tool to call
- `tool_args`: Vote on tool arguments
- `tool_hierarchical` (recommended): First vote on tool name, then on arguments within the winning tool
- `exclude_args=["timestamp", "id"]`: Exclude non-semantic arguments from voting

Best-of-N also works with tool calls when using an LLM Judge that understands tool-call quality.
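The two-stage idea behind hierarchical voting can be sketched in plain Python. This is an illustration of the concept under simplifying assumptions (tool calls as plain dicts with `name` and `args` keys), not the its_hub implementation; `hierarchical_vote` and the sample data are hypothetical:

```python
import json
from collections import Counter


def hierarchical_vote(calls: list[dict], exclude_args: tuple[str, ...] = ()) -> dict:
    """Pick the most common tool name, then the most common arguments for it."""
    # Stage 1: vote on the tool name alone.
    winning_name, _ = Counter(call["name"] for call in calls).most_common(1)[0]

    # Stage 2: among calls to the winning tool, vote on the arguments,
    # ignoring non-semantic keys such as timestamps or request IDs.
    def key(call: dict) -> str:
        kept = {k: v for k, v in call["args"].items() if k not in exclude_args}
        return json.dumps(kept, sort_keys=True)

    candidates = [call for call in calls if call["name"] == winning_name]
    winning_key, _ = Counter(key(call) for call in candidates).most_common(1)[0]
    # Return the first sampled call that matches the winning arguments.
    return next(call for call in candidates if key(call) == winning_key)


# Five made-up tool calls, as if sampled with budget=5.
samples = [
    {"name": "get_weather", "args": {"city": "Paris", "id": 1}},
    {"name": "get_weather", "args": {"city": "Paris", "id": 2}},
    {"name": "get_weather", "args": {"city": "Lyon", "id": 3}},
    {"name": "search_web", "args": {"query": "Paris weather", "id": 4}},
    {"name": "search_web", "args": {"query": "Paris weather", "id": 5}},
]
winner = hierarchical_vote(samples, exclude_args=("id",))
print(winner["name"], winner["args"]["city"])
```

In this sample a flat vote over name and arguments together would tie at two votes each, while voting on the tool name first gives `get_weather` a clear 3-2 win. Excluding `id` matters as well: with it included, every call would look unique.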
For Beam Search and Particle Filtering, configure how the LLM generates incrementally:
from its_hub import StepGeneration

sg = StepGeneration(
    max_steps=32,           # Maximum reasoning steps
    step_token="\n\n",      # Split on double newlines
    stop_token=r"\boxed",   # Stop when final answer found
)
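To see how the three parameters interact, the sketch below applies them to text that has already been generated. `StepGeneration` itself drives incremental generation against the model, so this is only an approximation of the splitting and stopping rules; `split_steps` and the sample response are hypothetical:

```python
def split_steps(text: str, step_token: str = "\n\n", stop_token: str = r"\boxed",
                max_steps: int = 32) -> list[str]:
    """Split a response into steps, ending at the first step containing stop_token."""
    steps = []
    for chunk in text.split(step_token):
        if not chunk.strip():
            continue  # skip empty chunks produced by repeated separators
        steps.append(chunk)
        if stop_token in chunk or len(steps) >= max_steps:
            break
    return steps


response = "Let x = 3.\n\nThen 2x = 6.\n\nSo the answer is \\boxed{6}.\n\nExtra text."
for i, step in enumerate(split_steps(response), start=1):
    print(i, repr(step))
```

Anything after the step that contains the stop token is dropped, and a low `max_steps` can cut the response off before the final answer appears.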
Tuning:
- `max_steps`: Higher for complex problems. 16-32 is typical for math.
- `step_token`: Use `"\n\n"` for chain-of-thought, `"\n"` for more granular steps.
- `stop_token`: Match your expected answer format (`\boxed` for math, custom for other tasks).

All algorithms accept an optional orchestrator for controlling parallelism:
from its_hub import LMOrchestrator
orchestrator = LMOrchestrator(max_concurrency=4)
sc = SelfConsistency(orchestrator=orchestrator)
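What a concurrency cap does can be shown with a plain `asyncio.Semaphore`. This is a generic sketch of the idea, not how `LMOrchestrator` is implemented; `run_with_cap` and the fake call are hypothetical stand-ins:

```python
import asyncio


async def run_with_cap(n_calls: int, max_concurrency: int) -> int:
    """Run n_calls fake LLM calls and return the peak number in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak = 0

    async def fake_llm_call() -> None:
        nonlocal in_flight, peak
        async with semaphore:  # waits while max_concurrency calls are running
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for network latency
            in_flight -= 1

    await asyncio.gather(*(fake_llm_call() for _ in range(n_calls)))
    return peak


# A budget of 8 with a cap of 4 never has more than 4 calls in flight.
peak = asyncio.run(run_with_cap(n_calls=8, max_concurrency=4))
print("peak concurrent calls:", peak)
```

The trade-off is latency: with a cap below the budget, the calls run in several waves instead of one.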
When to tune:
- Rate-limited APIs: set `max_concurrency` to stay under the limit
- Gateway deployments: implement a custom `AbstractOrchestrator` with your own rate limiting

| Symptom | Cause | Fix |
|---|---|---|
| All responses identical | Temperature too low or budget too low | Increase temperature (0.7-1.0) or budget |
| Self-Consistency ties | Budget too low for the task | Increase budget to an odd number (5, 7, 9) |
| Best-of-N picks poor response | Judge model not strong enough | Use a stronger judge model or tune the prompt |
| Beam Search OOM | PRM too large for GPU | Use a smaller PRM or move it to a different GPU (device="cuda:1") |
| Particle Filtering slow | Sequential step generation | Reduce max_steps or switch to Self-Consistency for speed |
| Rate limit errors | Too many parallel calls | Set LMOrchestrator(max_concurrency=N) |
| Empty or null results | LM endpoint unreachable or API key invalid | Verify endpoint with a single lm.agenerate_single() call |
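The tie symptom in the table is easy to reproduce with a plain majority vote. The sketch below is a generic illustration, not the its_hub voting code; `majority_vote` and the sample answers are made up:

```python
from collections import Counter


def majority_vote(answers: list[str]) -> tuple[str, bool]:
    """Return the most common answer and whether it is tied for first place."""
    counts = Counter(answers).most_common()
    top_answer, top_count = counts[0]
    tied = sum(1 for _, count in counts if count == top_count) > 1
    return top_answer, tied


# Budget 4: two answers split the vote evenly.
print(majority_vote(["42", "41", "42", "41"]))

# Budget 5: the extra sample breaks the tie.
print(majority_vote(["42", "41", "42", "41", "42"]))
```

An odd budget rules out ties only when the samples split between two distinct answers. With three or more distinct answers a tie is still possible (for example 2-2-1 at budget 5), in which case a larger budget is the more reliable fix.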
Always close the LM after use:
import asyncio

# Async context manager (recommended): the LM is closed on exit
async with OpenAICompatibleLanguageModel(...) as lm:
    result = await algorithm.ainfer(lm, prompt, budget=5)

# Sync usage: close explicitly once inference is done
lm = OpenAICompatibleLanguageModel(...)
result = algorithm.infer(lm, prompt, budget=5)
asyncio.run(lm.close())
- Run `scripts/benchmark.py` on MATH500 or AIME-2024 to compare algorithms for your model

Detailed documentation for specific topics lives in the docs/ directory:
- `docs/algorithms.md`: Full code examples for every algorithm (Self-Consistency, Best-of-N, Beam Search, Particle Filtering, Entropic PF), tool-calling integration, step generation config, and reward model setup
- `docs/orchestration.md`: Concurrency control, custom orchestrator implementation for gateway deployments, async/sync usage patterns
- `docs/benchmarking.md`: How to benchmark algorithms on MATH500 and AIME-2024, budget scaling analysis
- `docs/iaas-service.md`: Running the Inference-as-a-Service HTTP server
- `docs/quick-start.md`: Getting started from zero