From langchain-py-pack
Triage LangChain 1.0 / LangGraph 1.0 production incidents — LLM-specific SLOs, provider outage runbook, latency spike decision tree, cost-overrun response, agent loop containment. Use during an on-call page, in a post-mortem, or writing the team's first LLM runbook. Trigger with "langchain incident", "llm on-call", "langchain slo", "langchain outage", "langchain cost spike", "langchain agent loop".
Install:

```
npx claudepluginhub flight505/skill-forge --plugin langchain-py-pack
```
3:07am. PagerDuty: "LangChain p95 latency > 10s for 5 minutes." You open LangSmith,
filter by service="triage-agent" over the last 15 minutes, and the first trace
is 43 seconds long — an agent is on step 24 of 25 iterations, bouncing between
the same two tools on a vague user prompt ("help me with my account"). The cost
dashboard shows $400 spent in the last 10 minutes, up from a $6/hour baseline.
This is P10: create_react_agent defaults to recursion_limit=25 with no cost
cap; vague prompts never converge; the spend hits before GraphRecursionError
surfaces. First move is not to push a code fix — it is to flip
recursion_limit=5 via config reload and add a middleware token-budget cap per
session, then deal with the stuck sessions.
Or: same alert, different signature. p95 is healthy at 1.8s, but p99 is 12s and
spiky. The spikes correlate with instance starts in Cloud Run. P36: Python +
LangChain + embedding preloads = 5–15s cold start; Cloud Run scales to zero by
default, so first-request p99 is 10x p95. First move is --min-instances=1
(or a keepalive pinger), not more CPU.
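When --min-instances=1 is off the table, the keepalive pinger is a one-file workaround. A minimal framework-agnostic sketch — the interval is illustrative, and ping is an injectable callable (any HTTP client hitting the service's health endpoint):

```python
import time

def keepalive(ping, interval_s=240, max_pings=None):
    """Keep one instance warm by hitting a cheap health endpoint.

    ping:       zero-arg callable performing the HTTP GET (injectable so
                any client library works).
    interval_s: Cloud Run scales to zero after idle; any interval
                comfortably below the idle window keeps the instance up.
    max_pings:  stop after N pings (None = run forever); useful for tests.
    """
    count = 0
    while max_pings is None or count < max_pings:
        try:
            ping()
        except Exception:
            pass  # a failed ping must not kill the keepalive loop
        count += 1
        if max_pings is None or count < max_pings:
            time.sleep(interval_s)
    return count
```

Run it from any always-on box (a cron job or a sidecar); it is a stopgap, not a substitute for fixing cold-start cost.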
The shape of the page decides the first move. This runbook gives you:

- .with_fallbacks(backup) so failover is a config flip, not a code change.
- recursion_limit tuning and middleware token-budget caps so runaway agents stop burning cost before the GraphRecursionError.
- Post-incident capture (langchain-debug-bundle) and a write-up template.

Pinned: langchain-core 1.0.x, langgraph 1.0.x, langsmith 0.3+. Primary pain anchors: P10 (agent runaway), P36 (cold start). Adjacent: P29 (per-process rate limiter), P30 (max_retries=6 means 7 attempts), P31 (Anthropic cache RPM).

Assumes:

- langchain-observability skill applied — metrics are grounded in what LangSmith callbacks emit.
- backup model factory from langchain-rate-limits — the failover playbook assumes .with_fallbacks() is already wired.

HTTP-style SLOs miss what users actually feel. Define four, publish them, and wire burn-rate alerts to the symptom.
| SLO | Threshold | Alert condition | First-response action |
|---|---|---|---|
| p95 TTFT (time to first streamed token) | <1s | burn-rate > 2% over 5min | Check streaming is enabled; check provider status; check cold start (P36) |
| p99 total latency | <10s | burn-rate > 5% over 5min | Check agent loop depth (P10); cold start (P36); provider latency |
| Error rate (5xx + uncaught exceptions) | <0.5% | burn-rate > 1% over 5min | Check provider 429/500; auth token; schema drift on structured output |
| Cost per request | <$0.05 (tier-dependent) | p95 spend/req > $0.20 over 15min | Check agent recursion (P10); retry rate (P30); token-use per req |
Prometheus recording-rule pattern for the p99 latency burn-rate (replicate for TTFT, error rate, and cost):

```yaml
groups:
  - name: langchain_slo
    interval: 30s
    rules:
      - record: langchain:p99_latency_5m
        expr: histogram_quantile(0.99, sum(rate(langchain_request_duration_seconds_bucket[5m])) by (le, service))
      - alert: LangChainP99LatencyBurn
        expr: langchain:p99_latency_5m > 10
        for: 5m
        labels: { severity: page, team: llm }
        annotations:
          summary: "LangChain p99 > 10s for {{ $labels.service }}"
          runbook: "https://runbooks/langchain-incident-runbook#latency"
```
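Replicated for the cost SLO, the fragment below would be appended to the same rules list. The langchain_request_cost_usd_bucket histogram name is an assumption — substitute whatever your callback exporter actually emits:

```yaml
      - record: langchain:p95_cost_per_req_15m
        expr: histogram_quantile(0.95, sum(rate(langchain_request_cost_usd_bucket[15m])) by (le, service))
      - alert: LangChainCostPerReqBurn
        expr: langchain:p95_cost_per_req_15m > 0.20  # the p95 spend/req threshold from the SLO table
        for: 15m
        labels: { severity: page, team: llm }
        annotations:
          summary: "LangChain p95 cost/req > $0.20 for {{ $labels.service }}"
          runbook: "https://runbooks/langchain-incident-runbook#cost"
```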
See LLM SLOs for the canonical set (free / paid / enterprise tiers), burn-rate recipes (fast + slow), and a TTFT-specific rule that requires streaming to be instrumented.
The alert name tells you the root path. Do not mix diagnostics across paths — the first-response action differs.
```
Alert fired
├── Latency (p95/p99 breach, TTFT breach)
│   ├── 1. Provider status page (Anthropic, OpenAI) green? → if red, Step 3
│   ├── 2. Cold start pattern? (p99 >> p95, correlates with instance starts) → P36
│   └── 3. Streaming configured? (TTFT only makes sense with .stream/.astream)
│
├── Cost (spend/req or absolute spend/hour breach)
│   ├── 1. Agent recursion depth? (LangSmith: max steps per trace) → P10
│   ├── 2. Retry rate elevated? (callback log: attempt count / logical call) → P30
│   └── 3. Token-use per req regression? (input + output tokens from callbacks)
│
└── Error rate (5xx + uncaught exceptions)
    ├── 1. Provider 429/500 spike? (distinguish client 4xx from provider 5xx)
    ├── 2. Auth? (API key rotation, expired token, org quota exhausted)
    └── 3. Schema drift on structured output? (Pydantic ValidationError in traces)
```
For each leaf, Latency Triage and Cost Overrun Response give the LangSmith filter query, the exact metric to inspect, and the remediation.
Detection precedes failover. Do not flip fallbacks on an application bug.

- Status-page watchers (status.anthropic.com, status.openai.com) — poll every 30s, surface into Slack.
- CircuitBreaker middleware (see langchain-middleware-patterns if available, or a simple aiobreaker-backed runnable) — opens after N consecutive APIError / APITimeoutError within a window. Once open, calls skip the primary and go straight to the backup. This bounds the latency cost of a down provider.
- .with_fallbacks(backup) — the fallback chain is already wired (see langchain-rate-limits). During an outage, either flip a feature flag that swaps the default factory, or temporarily set the primary's max_retries=0 so the chain reaches the fallback immediately.

P10 is the most common cost-spike cause. create_react_agent defaults to
recursion_limit=25, meaning 25 model calls per user turn — with Claude Sonnet
at ~$3/MTok input, a 10k-token tool-call loop burns real money per minute.
Three containment layers, applied in order:
Set recursion_limit per agent depth — interactive chat agents rarely
need more than 5–8 steps; background research agents can justify 15; never
leave the default 25 in production.
```python
from langgraph.prebuilt import create_react_agent

agent = create_react_agent(llm, tools)

# recursion_limit is runtime config, not a constructor argument — bind it
# once so every invocation inherits the cap. P10: the default is 25.
# (Equivalently, pass {"recursion_limit": 8} as config at invoke time.)
agent = agent.with_config(recursion_limit=8)
```
Middleware token-budget cap per session — a callback that tracks
cumulative input + output tokens for a session id and raises a custom
BudgetExceeded exception once the cap is hit. The agent terminates
cleanly; the user sees a polite "I could not finish this task in budget,
try rephrasing" instead of a spinning UI until GraphRecursionError.
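A framework-agnostic sketch of that budget tracker — the class name and cap are illustrative, not the Cost Overrun Response implementation; in LangChain 1.0 the record() call would sit in a callback (e.g. on_llm_end) reading the token usage the provider returned:

```python
class BudgetExceeded(RuntimeError):
    """Raised when a session's cumulative token spend hits its cap."""

class SessionTokenBudget:
    """Track cumulative input+output tokens per session id, enforce a cap."""

    def __init__(self, max_tokens_per_session=50_000):
        self.max_tokens = max_tokens_per_session
        self._spent = {}  # session_id -> cumulative tokens

    def record(self, session_id, input_tokens, output_tokens):
        total = self._spent.get(session_id, 0) + input_tokens + output_tokens
        self._spent[session_id] = total
        if total > self.max_tokens:
            # The caller catches this, terminates the agent cleanly, and
            # returns the polite "could not finish in budget" message.
            raise BudgetExceeded(
                f"session {session_id}: {total} tokens > cap {self.max_tokens}"
            )
        return total
```

The key design choice is that the cap is per session, not per call — a loop of individually cheap calls is exactly what it must catch.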
Circuit on repeated tool calls — a LangGraph edge that routes to END
when the same tool has been called with the same args twice in a row. This
is a cheap heuristic for "the agent is stuck in a loop."
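The repeated-call check itself is tiny. A sketch (the function name and call-log shape are illustrative; in LangGraph this would run inside a conditional edge that routes to END when it fires):

```python
def is_stuck(tool_calls, window=2):
    """Return True when the last `window` tool calls are identical
    (same tool name, same args) — the cheap 'agent is stuck' heuristic.

    tool_calls: list of (tool_name, args_dict) in call order.
    """
    if len(tool_calls) < window:
        return False
    last = tool_calls[-window:]
    return all(call == last[0] for call in last)
```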
Cost Overrun Response has the middleware implementation, the repeat-tool edge pattern, and a per-tenant budget enforcement example.
Within 30 minutes of all-clear:

- Capture a debug bundle (langchain-debug-bundle) if present.
- Verify .with_fallbacks(backup) is wired to a feature flag for one-flip failover.
- Verify recursion_limit is set per agent depth (never default 25 in prod) and the middleware token-budget cap per session is in place.

| Symptom | Likely cause | First-response action |
|---|---|---|
| p95 latency breach, TTFT degraded | Streaming disabled, or provider-side latency | Verify .stream()/.astream() used; check provider status page |
| p99 >> p95, correlates with instance starts | Cloud Run cold start (P36) | --min-instances=1, CPU-always-allocated billing, preload imports |
| Cost-per-req spike, agent traces show 20+ steps | recursion_limit=25 default + vague prompt (P10) | recursion_limit=5–8, add middleware token-budget cap |
| Cost spike, callback log shows 7 attempts per logical call | max_retries=6 inflates cost 7x (P30) | max_retries=2 + circuit breaker; log retries via callbacks |
| 429 storm despite requests_per_second=10 on each of N workers | InMemoryRateLimiter is per-process (P29) | Switch to RedisRateLimiter or provider-side quota |
| Anthropic 429 while token budget has headroom | Cache RPM throttled separately (P31) | Client-side semaphore on RPM, not token count; monitor cached-read vs uncached separately |
| Error-rate spike, all on primary provider | Provider outage | Canary probe confirms; flip failover to .with_fallbacks(backup) via flag |
| ValidationError surge on structured output | Schema drift — model added fields | ConfigDict(extra="ignore") on the Pydantic schema (see langchain-sdk-patterns) |
| Agent never terminates, no GraphRecursionError yet | Stuck in tool-call loop | Add "repeated tool call" edge routing to END; raise BudgetExceeded from middleware |
PagerDuty: "cost-per-req > $0.20 for 15 minutes." LangSmith filtered to the
last 15 minutes shows average trace depth = 22 steps (baseline 4). Single
tenant, single conversation pattern — a user who asked an open-ended question
the agent cannot resolve. First-response action: flip recursion_limit=5 via
config reload (no deploy), add session to the blocklist in middleware, post
internal Slack with the trace URL.
See Cost Overrun Response for the middleware token-budget implementation and the per-tenant budget pattern.
p95 healthy at 1.8s, p99 at 12s, spikes correlate with Cloud Run instance
starts — classic P36. First-response action: gcloud run services update <svc> --min-instances=1, verify heavy imports are at module top level,
schedule follow-up ticket to move embedding preload to a warm-up hook.
See Latency Triage for the cold-start detection recipe and the p95-vs-p99 attribution decision tree.
Anthropic status page goes red. Canary probe error-rate jumps from 0% to 100%
on Anthropic, stays at 0% on OpenAI. Flip the failover flag — the
.with_fallbacks([ChatOpenAI(...)]) chain (already wired via
langchain-rate-limits) takes over. Post user-facing status entry, monitor
cost (OpenAI pricing differs — watch cost-per-req SLO), revert when upstream
recovers.
See Provider Outage Playbook for the circuit-breaker middleware, the canary probe snippet, and the user-comms template.
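For orientation, the circuit-breaker shape is simple enough to sketch without a library. This is a synchronous, illustrative version, not the playbook's middleware — in production an aiobreaker-backed runnable or langchain-middleware-patterns would replace it, and the thresholds are placeholders:

```python
import time

class CircuitBreaker:
    """Open after `fail_max` consecutive failures; while open, skip the
    primary and go straight to the backup. After `reset_after_s`, the
    next call is allowed through as a trial; success closes the circuit.
    """

    def __init__(self, fail_max=5, reset_after_s=60.0, clock=time.monotonic):
        self.fail_max = fail_max
        self.reset_after_s = reset_after_s
        self._clock = clock  # injectable for tests
        self._failures = 0
        self._opened_at = None

    @property
    def is_open(self):
        if self._opened_at is None:
            return False
        # Half-open: allow a trial call once the reset window has passed.
        return (self._clock() - self._opened_at) < self.reset_after_s

    def call(self, primary, backup):
        if self.is_open:
            return backup()  # bounds the latency cost of a down provider
        try:
            result = primary()
        except Exception:
            self._failures += 1
            if self._failures >= self.fail_max:
                self._opened_at = self._clock()
            return backup()
        self._failures = 0
        self._opened_at = None
        return result
```

The point of the open state is latency, not correctness: without it, every request pays the primary's full timeout-and-retry cost before reaching the fallback.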
References:

- create_react_agent and recursion_limit
- docs/pain-catalog.md (primary: P10, P36; adjacent: P29, P30, P31)
- langchain-debug-bundle (post-incident capture), langchain-observability (SLO metrics source), langchain-rate-limits (.with_fallbacks chain), langchain-cost-tuning (token budget caps), langchain-deploy-integration (cold-start fix)