From openrouter-pack
Provides reference architectures for production OpenRouter LLM gateway setups with caching, rate limiting, and observability, from simple to enterprise scale.
Slash command: /openrouter-pack:openrouter-reference-architecture
OpenRouter serves as a unified LLM gateway, abstracting provider complexity. A production architecture wraps it with caching, rate limiting, cost controls, observability, and async processing. This skill provides three reference architectures: simple (single service), standard (microservice), and enterprise (event-driven).
┌─────────────┐     ┌──────────────────────────┐     ┌──────────────┐
│  Your App   │────▶│  OpenRouter Client       │────▶│  OpenRouter  │
│             │     │  - Retry (SDK built-in)  │     │  /api/v1     │
│             │◀────│  - Cost tracking         │◀────│              │
│             │     │  - Structured logging    │     └──────────────┘
└─────────────┘     └──────────────────────────┘
import logging
import os

from openai import OpenAI

log = logging.getLogger("llm")

# The OpenAI SDK is pointed at OpenRouter's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
    max_retries=3,  # SDK built-in retries
    timeout=30.0,
    default_headers={"HTTP-Referer": "https://my-app.com", "X-Title": "my-app"},
)

def complete(prompt, model="openai/gpt-4o-mini", **kwargs):
    """Send a single-turn prompt and return the response text."""
    kwargs.setdefault("max_tokens", 1024)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )
    usage = response.usage  # can be missing on some responses
    if usage is not None:
        log.info("[%s] %d+%d tokens", response.model,
                 usage.prompt_tokens, usage.completion_tokens)
    return response.choices[0].message.content
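The client box in the diagram lists cost tracking, while `complete` only logs token counts. One way to close that gap is to convert the logged usage into an estimated dollar figure. The sketch below is an assumption, not part of the source: `PRICES_PER_MTOK`, `estimate_cost`, and `CostTracker` are hypothetical names, and the prices are placeholders rather than real OpenRouter pricing.

```python
from dataclasses import dataclass, field

# Hypothetical USD prices per million tokens as (input, output).
# Placeholder values only: look up current prices for the models you use.
PRICES_PER_MTOK = {
    "openai/gpt-4o-mini": (0.15, 0.60),
}

def estimate_cost(model, prompt_tokens, completion_tokens, prices=PRICES_PER_MTOK):
    """Return the estimated USD cost, or None if the model has no known price."""
    if model not in prices:
        return None
    in_price, out_price = prices[model]
    return (prompt_tokens * in_price + completion_tokens * out_price) / 1_000_000

@dataclass
class CostTracker:
    """Accumulates estimated spend per model in memory."""
    totals: dict = field(default_factory=dict)

    def record(self, model, prompt_tokens, completion_tokens):
        cost = estimate_cost(model, prompt_tokens, completion_tokens)
        if cost is not None:
            self.totals[model] = self.totals.get(model, 0.0) + cost
        return cost
```

Inside `complete`, a call such as `tracker.record(model, usage.prompt_tokens, usage.completion_tokens)` would keep a running total. Keying on the requested `model` rather than `response.model` may be safer, since the returned identifier is not guaranteed to match the price table's keys.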
┌─────────────┐     ┌─────────────────────┐     ┌──────────────┐
│ API Gateway │────▶│  AI Service         │────▶│  OpenRouter  │
│ (auth,      │     │  ┌─────────────┐    │     │  /api/v1     │
│  rate-limit │     │  │ Router      │    │     └──────────────┘
│  logging)   │     │  │ (task→model)│    │
└─────────────┘     │  └─────────────┘    │
                    │  ┌─────────────┐    │
                    │  │ Cache       │◀──▶│── Redis
                    │  │ (TTL-based) │    │
                    │  └─────────────┘    │
                    │  ┌─────────────┐    │
                    │  │ Budget      │◀──▶│── SQLite/Postgres
                    │  │ Enforcer    │    │
                    │  └─────────────┘    │
                    └─────────────────────┘
from fastapi import FastAPI
from pydantic import BaseModel

# Assumes `client` is the OpenAI client configured for OpenRouter, and that
# `cache`, `budget`, and `estimate_tokens` are helpers defined elsewhere.
app = FastAPI()

class CompletionRequest(BaseModel):
    prompt: str
    task_type: str = "general"  # classification, code, analysis, etc.
    max_tokens: int = 1024
    user_id: str = "anonymous"

DEFAULT_MODEL = "openai/gpt-4o-mini"

ROUTING_TABLE = {
    "classification": "openai/gpt-4o-mini",
    "code": "anthropic/claude-3.5-sonnet",
    "analysis": "anthropic/claude-3.5-sonnet",
    "general": "openai/gpt-4o-mini",
    "budget": "meta-llama/llama-3.1-8b-instruct",
}

@app.post("/v1/complete")
def complete(req: CompletionRequest):
    # Plain `def`: the client call below is blocking, so FastAPI runs this
    # handler in its threadpool instead of stalling the event loop.
    model = ROUTING_TABLE.get(req.task_type, DEFAULT_MODEL)

    # Check cache first (for deterministic requests)
    cached = cache.get(model, req.prompt)
    if cached is not None:
        return {"content": cached, "model": model, "cached": True}

    # Check budget
    budget.check(req.user_id, model, estimate_tokens(req.prompt), req.max_tokens)

    # Call OpenRouter; add the default model as a fallback only when it
    # differs from the primary model.
    fallbacks = [model] if model == DEFAULT_MODEL else [model, DEFAULT_MODEL]
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": req.prompt}],
        max_tokens=req.max_tokens,
        extra_body={"models": fallbacks, "route": "fallback"},
    )
    content = response.choices[0].message.content

    # Record cost and cache
    budget.record(req.user_id, response.id)
    cache.set(model, req.prompt, content)
    return {
        "content": content,
        "model": response.model,
        "tokens": response.usage.prompt_tokens + response.usage.completion_tokens,
        "cached": False,
    }
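The handler above calls `cache`, `budget`, and `estimate_tokens` without defining them. As a rough illustration of what those helpers might look like, the sketch below uses in-memory stand-ins for the Redis cache and the SQLite/Postgres budget store. Every name here (`TTLCache`, `TokenBudget`, `BudgetExceeded`, `record_tokens`) and every default value is an assumption. In particular, the source records spend by `response.id`, which implies a cost lookup that is not shown; this sketch counts tokens instead.

```python
import hashlib
import time

class TTLCache:
    """In-memory stand-in for the Redis cache; entries expire after `ttl` seconds."""

    def __init__(self, ttl=300.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._store = {}  # key -> (expires_at, value)

    @staticmethod
    def _key(model, prompt):
        # Hash model and prompt together so keys stay short and fixed-size.
        return hashlib.sha256(f"{model}\x00{prompt}".encode("utf-8")).hexdigest()

    def get(self, model, prompt):
        key = self._key(model, prompt)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # drop the expired entry
            return None
        return value

    def set(self, model, prompt, value):
        self._store[self._key(model, prompt)] = (self.clock() + self.ttl, value)


class BudgetExceeded(Exception):
    """Raised when a request would push a user past their token budget."""


class TokenBudget:
    """Tracks tokens used per user against a fixed limit (no reset logic shown)."""

    def __init__(self, limit_tokens=100_000):
        self.limit_tokens = limit_tokens
        self.used = {}  # user_id -> tokens consumed so far

    def check(self, user_id, model, prompt_tokens, max_tokens):
        # `model` is accepted to match the handler's call but unused here.
        # Worst case: the full prompt plus a completion of max_tokens.
        projected = self.used.get(user_id, 0) + prompt_tokens + max_tokens
        if projected > self.limit_tokens:
            raise BudgetExceeded(f"{user_id} would exceed {self.limit_tokens} tokens")

    def record_tokens(self, user_id, tokens):
        self.used[user_id] = self.used.get(user_id, 0) + tokens


def estimate_tokens(text):
    # Rough heuristic (about 4 characters per token); not a real tokenizer.
    return max(1, len(text) // 4)
```

In a FastAPI service, `BudgetExceeded` would typically be translated into an HTTP error response (for example a 429) by an exception handler, and the cache and budget state would live in shared stores so that multiple service instances see the same data.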
┌──────────┐    ┌───────────┐    ┌──────────────┐    ┌──────────────┐
│   API    │───▶│   Queue   │───▶│   Workers    │───▶│  OpenRouter  │
│ Gateway  │    │  (Redis/  │    │ (auto-scale) │    │  /api/v1     │
└──────────┘    │   SQS)    │    │ ┌──────────┐ │    └──────────────┘
                └───────────┘    │ │ Router   │ │
                      │          │ │ Cache    │ │
                      ▼          │ │ Budget   │ │
                ┌───────────┐    │ │ Audit    │ │
                │  Results  │◀───│ └──────────┘ │
                │  Store    │    └──────────────┘
                └───────────┘
                      │
                ┌───────────┐    ┌──────────────┐
                │  Metrics  │───▶│  Dashboard   │
                │  (OTEL)   │    │  Alerts      │
                └───────────┘    └──────────────┘
# Worker that processes queued AI requests.
# Assumes `client` is the OpenAI client configured for OpenRouter.
import json

import redis

r = redis.Redis()

def worker_loop():
    """Process AI requests from the queue."""
    while True:
        _, raw = r.brpop("ai:requests")  # blocks until a request arrives
        try:
            request = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed payloads instead of crashing the worker
        try:
            response = client.chat.completions.create(
                model=request["model"],
                messages=request["messages"],
                max_tokens=request.get("max_tokens", 1024),
                extra_body={
                    "models": [request["model"], "openai/gpt-4o-mini"],
                    "route": "fallback",
                },
            )
            result = {
                "id": request["id"],
                "content": response.choices[0].message.content,
                "model": response.model,
                "status": "complete",
            }
        except Exception as e:
            result = {"id": request["id"], "error": str(e), "status": "failed"}
        # Publish the result on a per-request list that expires after one hour.
        key = f"ai:results:{request['id']}"
        r.lpush(key, json.dumps(result))
        r.expire(key, 3600)
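The worker above marks a request as failed on the first exception, whereas the enterprise tier is described as using a queue with retries and a dead-letter queue (DLQ). A minimal sketch of that retry policy follows. It uses `collections.deque` objects as stand-ins for Redis lists so it runs without a server; the `attempts` and `last_error` fields, `MAX_ATTEMPTS`, and the dead-letter list are assumptions, not part of the source.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed retry limit

def handle_failure(request, error, queue, dead_letters, max_attempts=MAX_ATTEMPTS):
    """Requeue a failed request, or dead-letter it once attempts run out.

    `queue` and `dead_letters` are deques standing in for Redis lists such as
    "ai:requests" and a hypothetical "ai:dead-letter".
    Returns "retried" or "dead-lettered".
    """
    attempts = request.get("attempts", 0) + 1
    # Copy rather than mutate, so the caller's dict is left untouched.
    updated = {**request, "attempts": attempts, "last_error": str(error)}
    if attempts < max_attempts:
        queue.appendleft(updated)  # mirrors r.lpush("ai:requests", ...)
        return "retried"
    dead_letters.appendleft(updated)
    return "dead-lettered"
```

With Redis, the two `appendleft` calls would become `lpush` calls with the request serialized as JSON. A production version would probably also delay retries (for example with exponential backoff) rather than requeue immediately.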
| Factor | Simple | Standard | Enterprise |
|---|---|---|---|
| Team size | 1-3 | 3-10 | 10+ |
| Requests/day | <1K | 1K-100K | 100K+ |
| Latency needs | Tolerant | Low | Mixed (sync+async) |
| Budget tracking | Basic | Per-user | Per-user + department |
| Failure handling | SDK retries | Fallback chain | Queue + retry + DLQ |
| Observability | Logging | Metrics + logging | Full OTEL tracing |

| Error | Cause | Fix |
|---|---|---|
| Single point of failure | No redundancy in AI service | Deploy 2+ instances behind load balancer |
| Queue backlog | Worker throughput < incoming rate | Auto-scale workers; implement backpressure |
| Cache stampede | Many requests for same uncached key | Use cache locking or singleflight pattern |
| Budget bypass | Direct calls skipping middleware | All calls must go through the AI service |
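
The "Cache stampede" row suggests cache locking or the singleflight pattern. As a rough, thread-based illustration of singleflight: concurrent callers asking for the same key share one in-flight call instead of each hitting the upstream API. The `SingleFlight` class and its `do` method are hypothetical names for this sketch, and it only deduplicates within a single process.

```python
import threading

class _Call:
    """State shared between the leader and the callers waiting on it."""

    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None

class SingleFlight:
    """Collapse concurrent calls for the same key into one execution."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> _Call

    def do(self, key, fn):
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._inflight[key] = call
        if leader:
            # The first caller runs fn; everyone else waits for its outcome.
            try:
                call.result = fn()
            except BaseException as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()
        else:
            call.done.wait()
        if call.error is not None:
            raise call.error
        return call.result
```

On a cache miss, the service could wrap the upstream call as `flight.do(cache_key, lambda: call_openrouter(...))`, where `call_openrouter` is a placeholder for the actual request. Across multiple service instances, a shared lock (for example a Redis key set with `NX` and an expiry) would be needed instead.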
npx claudepluginhub jeremylongshore/claude-code-plugins-plus-skills --plugin openrouter-pack