Use when selecting AI models for different tasks, designing cost-aware routing (cheap→expensive cascade), implementing model fallbacks, and optimizing the capability/cost/latency tradeoff across model tiers.
```
npx claudepluginhub kienbui1995/magic-powers --plugin magic-powers
```

This skill uses the workspace's default tool permissions.
Not all LLM tasks are equal. Sending every request to the most capable — and most expensive — model is a failure of system design. Model routing assigns each task to the cheapest model that can handle it reliably, uses cascade escalation when a cheaper model is insufficient, and maintains fallback chains to keep systems available when a model is down.
| Tier | Models | Cost | Best For |
|---|---|---|---|
| Frontier/Opus | Claude Opus 4.6, GPT-4o, Gemini 1.5 Pro | $$$$ | Complex reasoning, multi-step planning, nuanced judgment |
| Standard/Sonnet | Claude Sonnet 4.6, GPT-4o-mini, Gemini 1.5 Flash | $$ | Most tasks: coding, analysis, writing, RAG Q&A |
| Fast/Haiku | Claude Haiku 4.5, GPT-3.5-turbo | $ | Classification, simple extraction, routing decisions, short summaries |
| Reasoning | o1, o3-mini, Claude extended thinking | $$$-$$$$ | Math, logic, code correctness, multi-step reasoning |
| Embeddings | text-embedding-3-small/large, voyage-large | $ | Semantic search, clustering, similarity |
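The tier table above can be captured as a plain config object so routing code has a single source of truth. A minimal sketch — the tier keys, structure, and `default_model` helper are illustrative, not part of any SDK:

```python
# Hypothetical tier config mirroring the table above.
MODEL_TIERS = {
    "frontier": {"models": ["claude-opus-4-6", "gpt-4o", "gemini-1.5-pro"], "cost": "$$$$"},
    "standard": {"models": ["claude-sonnet-4-6", "gpt-4o-mini", "gemini-1.5-flash"], "cost": "$$"},
    "fast":     {"models": ["claude-haiku-4-5", "gpt-3.5-turbo"], "cost": "$"},
}

def default_model(tier: str) -> str:
    # First listed model in a tier is the default choice for that tier.
    return MODEL_TIERS[tier]["models"][0]
```

Keeping tier membership in data rather than scattered string literals makes it easy to swap models as pricing changes.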
Route tasks to the cheapest model that can handle them reliably:
```python
# Module-level table so both methods can reference the same routing keys.
ROUTING_TABLE = {
    "simple_extraction": "claude-haiku-4-5",    # extract structured data
    "classification": "claude-haiku-4-5",       # categorize input
    "short_summary": "claude-haiku-4-5",        # <500-word summary
    "rag_qa": "claude-sonnet-4-6",              # RAG Q&A with context
    "code_generation": "claude-sonnet-4-6",     # write/review code
    "complex_analysis": "claude-opus-4-6",      # deep analysis
    "multi_step_planning": "claude-opus-4-6",   # agent planning
    "math_logic": "claude-opus-4-6",            # reasoning tasks
}

class TaskRouter:
    def route(self, task: Task) -> str:
        # Classify task complexity first (use haiku for this — it's cheap)
        complexity = self.classify_complexity(task)
        return ROUTING_TABLE.get(complexity, "claude-sonnet-4-6")

    def classify_complexity(self, task: Task) -> str:
        # Use haiku to classify — fast and cheap
        response = haiku.classify(
            task.description,
            categories=list(ROUTING_TABLE.keys()),
        )
        return response.category
```
Try cheap model first; escalate only when output quality is insufficient:
```python
async def cascade_generate(prompt: str, quality_threshold: float = 0.8) -> str:
    models = [
        "claude-haiku-4-5",   # try cheapest first
        "claude-sonnet-4-6",  # escalate if haiku insufficient
        "claude-opus-4-6",    # escalate for complex cases
    ]
    for model in models:
        response = await llm.generate(prompt, model=model)
        quality = await evaluate_quality(response, prompt)
        if quality >= quality_threshold:
            log_routing_decision(model, quality, escalated=(model != models[0]))
            return response.text
        log_escalation(from_model=model, quality=quality)
    return response.text  # return best available even if below threshold
```
87% cost reduction potential: OpenAI's analysis shows 87% of queries can be handled by cheaper models in well-designed cascades. Only ~13% need frontier models.
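The cascade above calls an `evaluate_quality` function it does not define. In practice this is often an LLM-as-judge call, but a cheap heuristic pre-filter can short-circuit obvious failures before paying for a judge. A sketch — the refusal markers and score weights are illustrative assumptions:

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "as an ai")

def heuristic_quality(response_text: str, prompt: str) -> float:
    """Cheap pre-filter: score 0.0-1.0 before invoking an LLM judge."""
    text = response_text.strip().lower()
    if not text:
        return 0.0  # empty output always fails
    score = 1.0
    if any(marker in text for marker in REFUSAL_MARKERS):
        score -= 0.5  # likely refusal -> escalate to a stronger model
    if len(text.split()) < 5:
        score -= 0.3  # suspiciously short answer
    return max(score, 0.0)
```

Responses that pass the heuristic can still be sent to a proper judge; responses that fail it trigger escalation without that extra cost.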
Route based on the content characteristics of the request:
```python
def content_based_route(query: str, context: dict) -> str:
    # Length heuristics
    if len(query.split()) < 20 and not context.get("requires_reasoning"):
        return "claude-haiku-4-5"

    # Topic-based routing
    if any(kw in query.lower() for kw in ["calculate", "prove", "math", "algorithm"]):
        return "claude-opus-4-6"   # reasoning intensive
    if any(kw in query.lower() for kw in ["summarize", "extract", "list", "classify"]):
        return "claude-haiku-4-5"  # structured, simple

    # Context length routing
    total_tokens = count_tokens(query) + count_tokens(str(context))
    if total_tokens > 50000:
        return "claude-sonnet-4-6"  # long context needs capable model

    return "claude-sonnet-4-6"  # default: balanced
```
Always define fallbacks for production reliability:
```python
MODEL_FALLBACK_CHAIN = {
    "claude-opus-4-6": ["claude-sonnet-4-6", "claude-haiku-4-5"],
    "claude-sonnet-4-6": ["claude-haiku-4-5", "gpt-4o-mini"],
    "claude-haiku-4-5": ["gpt-3.5-turbo"],
}

async def generate_with_fallback(prompt: str, preferred_model: str) -> GenerationResult:
    models_to_try = [preferred_model] + MODEL_FALLBACK_CHAIN.get(preferred_model, [])
    for model in models_to_try:
        try:
            result = await llm.generate(prompt, model=model)
            return GenerationResult(
                text=result,
                model_used=model,
                used_fallback=(model != preferred_model),
            )
        except (ModelUnavailableError, RateLimitError) as e:
            log_fallback(from_model=model, reason=str(e))
            continue
    raise AllModelsFailedError(f"All models in chain failed: {models_to_try}")
```
Track routing decisions to optimize over time:
```python
# Log every routing decision
def log_routing_decision(model, quality_score, latency_ms, cost_usd, task_type):
    metrics.record({
        "model": model,
        "quality": quality_score,
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
        "task_type": task_type,
        "timestamp": now(),
    })

# Weekly analysis: what % of tasks need each model tier?
# If haiku handles 70% of tasks at acceptable quality → good routing
# If haiku handles only 20% → routing too conservative, tighten thresholds
```
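The weekly analysis described above can be computed directly from the logged records. A minimal aggregation sketch, assuming each record carries the `model` field written by `log_routing_decision`:

```python
from collections import Counter

def tier_share(records: list[dict]) -> dict[str, float]:
    """Fraction of tasks served by each model, from routing logs."""
    counts = Counter(r["model"] for r in records)
    total = sum(counts.values())
    return {model: n / total for model, n in counts.items()}
```

If the haiku-tier share comes back around 0.7 at acceptable quality, routing is healthy; a share near 0.2 suggests the routing thresholds are too conservative.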
- llm-cost-optimization for comprehensive cost reduction strategy
- agentic-reliability for fallback chain implementation patterns
- llm-observability to track routing decisions in production
- @ai-engineer and @ai-product use this when designing multi-model systems