From langchain-py-pack
Rate-limit LangChain 1.0 calls correctly across multi-worker deployments — Redis-backed limiters, asyncio.Semaphore, narrow exception whitelists, and provider-specific throttle handling. Use when hitting 429s in production, scaling workers horizontally, or tuning throughput against Anthropic, OpenAI, or Gemini tier limits. Trigger with "langchain rate limit", "langchain 429", "langchain semaphore", "langchain token bucket", "anthropic rpm", "openai rpm throttling", "InMemoryRateLimiter", "redis rate limiter".
```shell
npx claudepluginhub flight505/skill-forge --plugin langchain-py-pack
```
A team deploys 10 Cloud Run workers. Each worker initializes its ChatAnthropic
with InMemoryRateLimiter(requests_per_second=10) — they read the docs, they
picked a safe-looking number, they shipped. Thirty seconds later the dashboard
lights up with 429s: the cluster is pushing 100 RPS to Anthropic's 50 RPM
tier-1 ceiling, not the 10 RPS they configured. The name is the clue —
InMemoryRateLimiter is in-process. Each worker has its own counter. Ten
workers × 10 RPS = 100 RPS to the provider. This is pain-catalog entry P29
and it lands on every team that scales past one pod.
Three more traps wait on the same code path:
- .with_fallbacks([backup]) defaults exceptions_to_handle=(Exception,),
  which on Python <3.12 swallows KeyboardInterrupt. Ctrl+C during a 429
  retry storm silently falls through to the backup chain and keeps billing.
- ChatOpenAI and ChatAnthropic default max_retries=6. That is retries, not
  attempts: 7 total requests per logical call on flaky networks. One
  .invoke() can bill 7x.
- Anthropic's RPM ceiling counts cached reads and cache writes the same as
  uncached calls, so a high cache hit rate buys no RPM headroom.

This skill covers measuring demand before picking a limit; the
InMemoryRateLimiter vs Redis-backed limiter vs asyncio.Semaphore decision
tree; the narrow exceptions_to_handle whitelist; max_retries=2 math; and
the provider-specific limit taxonomy (RPM, ITPM, OTPM, concurrent,
cached-vs-uncached). Pin: langchain-core 1.0.x, langchain-anthropic 1.0.x,
langchain-openai 1.0.x. Pain-catalog anchors: P07, P08, P29, P30, P31.
For .batch(max_concurrency=...) tuning, see the sibling skill
langchain-performance-tuning — this skill is about provider-facing rate caps.
Requirements:

- langchain-core >= 1.0, < 2.0
- pip install langchain-anthropic langchain-openai
- a redis >= 4.5 client and a Redis server reachable from every worker
- langchain-model-inference — the chat-model factory from that skill is where rate_limiter= gets attached

## Step 1 — Measure demand before sizing

Do not guess at requests_per_second. Instrument first, size second.
Attach a BaseCallbackHandler that logs per-call input_tokens,
output_tokens, and cache_read_input_tokens from response.generations[].message.usage_metadata:
```python
chain.with_config({"callbacks": [DemandLogger()]})
```
Collect 24-48 hours of representative traffic. Roll up: p50 and p95 RPM, p95 ITPM, p95 OTPM, cache hit rate. Size the limiter at 70% of the binding constraint's tier ceiling on your p95.
See Measuring Demand for the full
DemandLogger implementation, pandas roll-up, OTEL integration, load-test
harness, and multi-tenant sizing strategies.
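Before reaching for the full reference, a minimal DemandLogger sketch is enough to start collecting. This assumes LangChain 1.0's BaseCallbackHandler and the usage_metadata shape where Anthropic's cache_read_input_tokens surfaces under input_token_details["cache_read"]; the import guard exists only so the sketch also runs standalone:

```python
import time

try:
    from langchain_core.callbacks import BaseCallbackHandler
except ImportError:  # stand-in so the sketch runs without langchain installed
    class BaseCallbackHandler:
        pass


class DemandLogger(BaseCallbackHandler):
    """Record per-call token usage for a later p50/p95 roll-up."""

    def __init__(self):
        self.records = []

    def on_llm_end(self, response, **kwargs):
        for generations in response.generations:
            for gen in generations:
                usage = getattr(getattr(gen, "message", None), "usage_metadata", None) or {}
                details = usage.get("input_token_details") or {}
                self.records.append({
                    "ts": time.time(),
                    "input_tokens": usage.get("input_tokens", 0),
                    "output_tokens": usage.get("output_tokens", 0),
                    # Anthropic's cache_read_input_tokens lands here in LangChain
                    "cache_read": details.get("cache_read", 0),
                })
```

Attach it via chain.with_config({"callbacks": [DemandLogger()]}) and roll the records up into per-minute RPM / ITPM / OTPM percentiles after the 24-48h window.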
## Step 2 — InMemoryRateLimiter for single-process dev only; never multi-worker prod

LangChain 1.0 ships InMemoryRateLimiter as a first-class BaseChatModel parameter:
```python
from langchain_anthropic import ChatAnthropic
from langchain_core.rate_limiters import InMemoryRateLimiter

limiter = InMemoryRateLimiter(
    requests_per_second=0.58,  # 35 RPM = 70% of Anthropic tier-1 50 RPM
    check_every_n_seconds=0.1,
    max_bucket_size=5,  # burst capacity
)

llm = ChatAnthropic(
    model="claude-sonnet-4-6",
    rate_limiter=limiter,
    max_retries=2,
    timeout=30,
)
```
InMemoryRateLimiter is per-process. Safe for:

- single-process scripts (python script.py)
- single-worker servers (uvicorn --workers 1)

Unsafe for (this is P29):

- anything multi-worker or multi-replica (--workers 4, Cloud Run, K8s)

## Step 3 — Redis-backed limiter for multi-worker deployments

For multi-worker deployments, cluster-wide rate limiting requires shared state.
Redis is the default answer — an atomic Lua script for sliding-window, or GCRA
via CL.THROTTLE (redis-cell module).
```python
import redis
from langchain_anthropic import ChatAnthropic
# RedisRateLimiter class defined in references/redis-limiter-pattern.md
from your_app.limiters import RedisRateLimiter

client = redis.Redis.from_url("redis://redis.internal:6379/0")
limiter = RedisRateLimiter(
    client,
    key="anthropic:prod",
    requests_per_second=35 / 60,  # 35 RPM cluster-wide, not per-worker
)

llm = ChatAnthropic(
    model="claude-sonnet-4-6",
    rate_limiter=limiter,
    max_retries=2,
    timeout=30,
)
```
Key scoping decisions:

- key="anthropic:prod" — all tenants share one global budget (simplest)
- key=f"anthropic:tenant:{tenant_id}" — per-tenant quota (requires cleanup for dead tenants)

See Redis Limiter Pattern for the full
RedisRateLimiter implementation (atomic Lua sliding window), the GCRA
alternative via CL.THROTTLE, failure modes (Redis down, clock skew), and
per-tenant cleanup strategy.
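For orientation before the reference, here is a minimal sketch of what such a limiter can look like: a sliding window in a Redis sorted set, checked atomically by a Lua script via redis-py's register_script. The class name, key layout, and 60 s window are illustrative choices, not the reference implementation; the import fallback only lets the sketch run without langchain installed.

```python
import asyncio
import time
import uuid

try:
    from langchain_core.rate_limiters import BaseRateLimiter
except ImportError:  # stand-in so the sketch runs without langchain installed
    class BaseRateLimiter:
        pass

# Atomically: drop entries older than the window, then admit if under the limit.
SLIDING_WINDOW_LUA = """
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])
redis.call('ZREMRANGEBYSCORE', KEYS[1], 0, now - window)
if redis.call('ZCARD', KEYS[1]) < limit then
  redis.call('ZADD', KEYS[1], now, ARGV[4])
  redis.call('PEXPIRE', KEYS[1], window)
  return 1
end
return 0
"""


class RedisRateLimiter(BaseRateLimiter):
    """Cluster-wide sliding-window limiter (sketch)."""

    def __init__(self, client, key, requests_per_second,
                 window_seconds=60, poll_interval=0.1):
        self._key = key
        self._window_ms = int(window_seconds * 1000)
        # e.g. 35/60 rps over a 60 s window -> 35 requests per window
        self._limit = max(1, round(requests_per_second * window_seconds))
        self._poll = poll_interval
        self._script = client.register_script(SLIDING_WINDOW_LUA)

    def _try_acquire(self) -> bool:
        now_ms = int(time.time() * 1000)
        member = f"{now_ms}-{uuid.uuid4().hex}"  # unique sorted-set member
        return bool(self._script(
            keys=[self._key],
            args=[now_ms, self._window_ms, self._limit, member],
        ))

    def acquire(self, *, blocking: bool = True) -> bool:
        while not self._try_acquire():
            if not blocking:
                return False
            time.sleep(self._poll)
        return True

    async def aacquire(self, *, blocking: bool = True) -> bool:
        while not self._try_acquire():
            if not blocking:
                return False
            await asyncio.sleep(self._poll)
        return True
```

Because the admit decision runs inside one Lua script, every worker in the cluster races against the same atomic check — there is no per-process counter to multiply.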
## Step 4 — asyncio.Semaphore for per-worker in-flight concurrency cap

The rate limiter throttles request rate. A semaphore throttles in-flight count. Use both:
```python
import asyncio

# Cluster: 35 RPM (Redis enforces)
# Worker: 20 in-flight at once (semaphore enforces)
worker_sem = asyncio.Semaphore(20)

async def bounded_invoke(inp):
    async with worker_sem:
        return await llm.ainvoke(inp)

# Fanout
results = await asyncio.gather(*[bounded_invoke(x) for x in inputs])
```
Why both: a semaphore prevents a single worker from queueing hundreds of pending limiter acquires against Redis (head-of-line blocking on the event loop). The limiter prevents the cluster from exceeding the provider tier. They solve different problems.
Semaphore sizing: apply Little's law — steady-state in-flight ≈ request rate × latency. If p95 request latency is 2s and the worker's RPS cap is 10, in-flight count ≈ 10 × 2 = 20. Overshoot is wasted memory; undershoot leaves throughput on the table.
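The sizing arithmetic above fits in a throwaway helper (the function name is illustrative, not part of any API):

```python
import math

def semaphore_size(worker_rps_cap: float, p95_latency_s: float) -> int:
    # Little's law: steady-state in-flight = arrival rate x time in system.
    # Round up so the cap never starves the limiter's allowed rate.
    return max(1, math.ceil(worker_rps_cap * p95_latency_s))

semaphore_size(10, 2.0)  # -> 20, matching the worked numbers above
```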
## Step 5 — with_fallbacks(exceptions_to_handle=...) — never (Exception,)

.with_fallbacks([backup]) defaults to catching Exception. This is P07 — on
Python <3.12, Exception edge-cases include KeyboardInterrupt propagation.
Ctrl+C during a retry storm silently hands off to the backup and keeps running.
Always narrow the tuple:
```python
from anthropic import (
    RateLimitError, APITimeoutError, APIConnectionError, InternalServerError,
)

resilient = (prompt | claude | parser).with_fallbacks(
    [prompt | gpt4o | parser],
    exceptions_to_handle=(
        RateLimitError, APITimeoutError,
        APIConnectionError, InternalServerError,
    ),
    # NEVER: Exception, BaseException, AuthenticationError,
    # BadRequestError, ValidationError
)
```
The whitelist is only transient provider errors. AuthenticationError,
BadRequestError, and ValidationError are bugs in your code/credentials —
fallback produces the same crash. See the sibling skill's reference
langchain-sdk-patterns/references/fallback-exception-list.md for the full
per-provider whitelist (Anthropic, OpenAI, Gemini).
## Step 6 — max_retries=2, never the default max_retries=6

max_retries is retries, not attempts. Default max_retries=6 on
ChatOpenAI / ChatAnthropic means initial + 6 retries = 7 billed requests
per logical call (P30). On a flaky network, one .invoke() costs 7x what you
budgeted.
```python
# BAD — default
llm = ChatOpenAI(model="gpt-4o")  # max_retries=6

# GOOD — production default
llm = ChatOpenAI(
    model="gpt-4o",
    max_retries=2,  # initial + 2 retries = 3 total billed requests max
    timeout=30,
    rate_limiter=redis_limiter,
)
```
Trade resilience off to the fallback layer — with_fallbacks is strictly
cheaper than retry amplification when the primary is genuinely unhealthy.
Instrument retry count via callback and alert if retry rate exceeds ~5%.
See Backoff and Retry for the full math,
Retry-After header handling, and circuit-breaker pattern for sustained
overload.
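The amplification is worth making concrete. A sketch of the worst-case arithmetic, assuming a plain exponential backoff shape — the base delay and factor here are illustrative, and the real anthropic/openai clients add jitter and caps:

```python
def retry_cost(max_retries: int, base_delay: float = 0.5, factor: float = 2.0):
    """Worst-case billed requests and cumulative backoff wait for one logical call."""
    billed = 1 + max_retries  # every retry is a fresh billed request
    wait = sum(base_delay * factor ** i for i in range(max_retries))
    return billed, wait

retry_cost(6)  # -> (7, 31.5): default bills up to 7x and stalls half a minute
retry_cost(2)  # -> (3, 1.5): bounded cost; hand sustained failure to the fallback
```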
## Provider limit taxonomy

Different providers expose different limit types. Know which one binds your workload before you size:
| Limit | Meaning | Who enforces | Binds for |
|---|---|---|---|
| RPM | Requests/minute (counts every call) | All three providers | Short chat replies |
| ITPM | Input tokens/minute | Anthropic, OpenAI (as TPM combined) | Long document Q&A |
| OTPM | Output tokens/minute | Anthropic separately; OpenAI as combined TPM | Long completions |
| Concurrent | In-flight request cap | Mainly OpenAI higher tiers | Burst traffic |
| Cached reads | Cache-read input tokens (Anthropic) | Anthropic separate budget line | Cache-heavy workloads (but still counts toward RPM — P31) |
Critical for Anthropic cache workloads (P31): RPM counts uniformly across
cached reads, cache writes, and uncached calls. A workload at 90% cache hit
rate still trips the 50 RPM ceiling at 51 requests/min. Separate monitors for
cache_read_input_tokens vs input_tokens (minus cache read/write) give
early warning.
## Limiter decision tree

```
┌─ Single process (dev, notebooks, sync CLI, --workers 1)?
│   └─ InMemoryRateLimiter
│
├─ Multi-process but single host (same-machine pool, local gunicorn)?
│   └─ Redis-backed limiter (even localhost Redis beats InMemoryRateLimiter —
│      which still has per-process counters)
│
├─ Multi-host cluster (Cloud Run --min-instances>1, K8s, ECS)?
│   └─ Redis-backed limiter (mandatory)
│
├─ Multi-region or cross-cloud?
│   └─ Regional Redis per zone + provider-side account quota
│      (cross-region Redis latency adds 30-200ms per acquire)
│
└─ Any of the above + multi-tenant SaaS?
    └─ Two-level Redis limiter: per-tenant + global, acquire both
```
Always pair with asyncio.Semaphore(N) per-worker for in-flight concurrency.
2026-04-21 snapshot — re-verify against the official console before shipping.
| Provider | Free tier RPM | Tier-1 RPM | High tier RPM | Source |
|---|---|---|---|---|
| Anthropic | 5 | 50 (Build 1) | 4000 (Build 4) | https://docs.anthropic.com/en/api/rate-limits |
| OpenAI | 3 | 500 | 10000 (Tier 5) | https://platform.openai.com/docs/guides/rate-limits |
| Google Gemini | 15 | 2000 (Paid 1) | 30000 (Paid 3) | https://ai.google.dev/gemini-api/docs/rate-limits |
Tiers change quarterly. A limiter sized six months ago on a different tier is a liability. See Provider Tier Matrix for the full matrix including ITPM / OTPM / cached-read separation, binding-limit math, and the pre-ship verification checklist.
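The binding-limit math reduces to: find the dimension where measured p95 demand consumes the largest fraction of its tier ceiling, then size the limiter at 70% of that ceiling. A sketch with illustrative names and numbers:

```python
def size_limiter(p95_demand: dict, tier_ceilings: dict, safety: float = 0.70):
    """Pick the binding constraint and return (dimension, sized limiter target)."""
    utilization = {dim: p95_demand[dim] / tier_ceilings[dim] for dim in p95_demand}
    binding = max(utilization, key=utilization.get)  # least headroom wins
    return binding, safety * tier_ceilings[binding]

# p95 demand of 30 RPM / 40k ITPM against tier-1 ceilings of 50 RPM / 50k ITPM:
# ITPM runs at 80% utilization vs RPM's 60%, so ITPM binds; target 70% of 50k.
size_limiter({"rpm": 30, "itpm": 40_000}, {"rpm": 50, "itpm": 50_000})
```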
Production checklist:

- DemandLogger callback attached to your chains for 24-48h before sizing
- InMemoryRateLimiter in dev / notebooks / single-worker only
- RedisRateLimiter (sliding-window Lua or CL.THROTTLE GCRA) for any multi-worker deployment, keyed per-tenant or global
- asyncio.Semaphore(N) per-worker in-flight cap paired with the cluster-wide limiter
- max_retries=2 on every ChatAnthropic / ChatOpenAI / ChatGoogleGenerativeAI
- .with_fallbacks(exceptions_to_handle=(RateLimitError, APITimeoutError, APIConnectionError, InternalServerError)) — never (Exception,)

| Error | Cause | Fix |
|---|---|---|
| anthropic.RateLimitError: 429 THROTTLED at cluster RPM = N × InMemoryRateLimiter ceiling | InMemoryRateLimiter is per-process; N workers each send at their limit (P29) | Switch to Redis-backed limiter (Step 3) |
| 429 on cache writes while ITPM dashboard shows headroom | Anthropic RPM counts cache writes uniformly (P31) | Budget at RPM level with limiter; separate cached vs uncached metrics |
| One .invoke() bills as 7 requests on flaky networks | Default max_retries=6 (P30) | max_retries=2 + fallback layer for resilience |
| Ctrl+C during retry storm silently falls through to backup chain | exceptions_to_handle=(Exception,) catches KeyboardInterrupt on Python <3.12 (P07) | Narrow tuple to (RateLimitError, APITimeoutError, APIConnectionError, InternalServerError) |
| Limiter queue p95 wait > 500ms | Limiter is oversubscribed for real traffic | Re-measure demand (Step 1); upgrade provider tier OR shed load |
| redis.exceptions.ConnectionError blocks all LLM calls | Redis unavailable and limiter is fail-closed | Instrument Redis health; decide fail-open (log loudly) vs fail-closed (shed load) — for provider safety, prefer fail-closed |
| retry-after header climbing 2→4→8→16 | Pushing past tier; backoff amplifying, not absorbing | Lower limiter target RPS by 20%; upgrade tier if sustained |
| google.api_core.exceptions.ResourceExhausted on Gemini | Gemini free tier 15 RPM is brutal | Upgrade to paid Gemini tier 1 (2000 RPM) or use Redis limiter at 10 RPM |
## Worked example

Ten workers, single region, Redis in same VPC. Target: 35 RPM cluster-wide (70% of 50 RPM ceiling), 20 in-flight per worker.
```python
import asyncio, os, redis
from langchain_anthropic import ChatAnthropic
from anthropic import (
    RateLimitError, APITimeoutError, APIConnectionError, InternalServerError,
)
from your_app.redis_limiter import RedisRateLimiter  # see references

_client = redis.Redis.from_url(os.environ["REDIS_URL"])
anthropic_limiter = RedisRateLimiter(
    _client, key="anthropic:prod",
    requests_per_second=35 / 60,  # 35 RPM cluster-wide
)

llm = ChatAnthropic(
    model="claude-sonnet-4-6",
    rate_limiter=anthropic_limiter,  # cluster gate
    max_retries=2,                   # not 6 (P30)
    timeout=30,
)

chain = (prompt | llm | parser).with_fallbacks(
    [prompt | gpt4o_backup | parser],
    exceptions_to_handle=(  # narrow tuple (P07)
        RateLimitError, APITimeoutError,
        APIConnectionError, InternalServerError,
    ),
)

worker_sem = asyncio.Semaphore(20)  # per-worker in-flight cap

async def invoke_bounded(inp):
    async with worker_sem:
        return await chain.ainvoke(inp)
```
Cluster behavior: every worker's limiter call hits the same Redis key. At 35
RPM cluster-wide, individual workers see fair-share throughput. max_retries=2
caps worst-case amplification at 3 billed requests per logical call.
Two-level Redis limiter. Per-tenant limit prevents noisy neighbors; global limit protects the provider tier.
See Redis Limiter Pattern for the two-level acquire implementation (acquire tenant key first, then global key; release tenant if global fails) and the per-tenant cleanup cron.
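The acquire ordering matters: take the tenant token first, and hand it back if the global budget refuses, or throttled tenants slowly leak global capacity. A non-blocking sketch — release() is hypothetical here (per the reference pattern's rollback; a token-bucket limiter like InMemoryRateLimiter has no such method):

```python
def acquire_two_level(tenant_limiter, global_limiter) -> bool:
    """Two-level non-blocking acquire: tenant quota first, then global."""
    if not tenant_limiter.acquire(blocking=False):
        return False  # noisy-neighbor cap hit; global budget untouched
    if not global_limiter.acquire(blocking=False):
        tenant_limiter.release()  # hypothetical rollback of the tenant token
        return False
    return True
```

Callers that get False can queue, shed, or retry with backoff; what they must not do is hold a tenant token while waiting on the global one.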
## When InMemoryRateLimiter is fine

For local debugging, notebook work, or a sync CLI tool:
```python
from langchain_core.rate_limiters import InMemoryRateLimiter

limiter = InMemoryRateLimiter(requests_per_second=0.5, max_bucket_size=3)
llm = ChatAnthropic(model="claude-sonnet-4-6", rate_limiter=limiter, max_retries=2)
```
Do not carry this into production without re-reading Step 2.
References:

- InMemoryRateLimiter API
- CL.THROTTLE (redis-cell module)
- docs/pain-catalog.md (entries P07, P08, P29, P30, P31)
- Sibling skills: langchain-sdk-patterns (batch concurrency, fallback exception whitelist), langchain-performance-tuning (.batch(max_concurrency=...) tuning for throughput)