Incident response procedures for LangChain production issues. Use when responding to production incidents, diagnosing outages, or implementing emergency procedures for LLM applications. Trigger with phrases like "langchain incident", "langchain outage", "langchain production issue", "langchain emergency", "langchain down".
```
/plugin marketplace add jeremylongshore/claude-code-plugins-plus-skills
/plugin install langchain-pack@claude-code-plugins-plus
```
Standard operating procedures for responding to LangChain production incidents with diagnosis, mitigation, and recovery steps.
| Level | Description | Response Time | Examples |
|---|---|---|---|
| SEV1 | Complete outage | 15 min | All LLM calls failing |
| SEV2 | Major degradation | 30 min | 50%+ error rate, >10s latency |
| SEV3 | Minor degradation | 2 hours | <10% errors, slow responses |
| SEV4 | Low impact | 24 hours | Intermittent issues |
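These thresholds can be wired directly into alerting so the on-call engineer does not have to classify incidents by hand. A minimal sketch, assuming the error rate and p95 latency are already available from your metrics backend (the function and thresholds below are illustrative, not part of LangChain):

```python
def classify_severity(error_rate: float, p95_latency_s: float) -> str:
    """Map live metrics to the severity levels in the table above.

    error_rate is a 0-1 fraction; p95_latency_s is in seconds. Thresholds
    mirror the SEV definitions and should be tuned per service.
    """
    if error_rate >= 0.95:
        return "SEV1"  # effectively a complete outage
    if error_rate >= 0.50 or p95_latency_s > 10:
        return "SEV2"  # major degradation
    if error_rate >= 0.01 or p95_latency_s > 5:
        return "SEV3"  # minor degradation
    return "SEV4"      # low impact / intermittent
```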
```bash
# Check if LLM provider is responding
curl -s https://status.openai.com/api/v2/status.json | jq '.status.indicator'
curl -s https://status.anthropic.com/api/v2/status.json | jq '.status.indicator'

# Check your error rate
# Prometheus query:
# sum(rate(langchain_llm_requests_total{status="error"}[5m])) / sum(rate(langchain_llm_requests_total[5m]))
```
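The same error-rate query can be run from a script via the Prometheus HTTP API, which is useful in an automated health check. A minimal sketch, assuming Prometheus is reachable at a hypothetical http://prometheus:9090 and that `langchain_llm_requests_total` is the metric your app exports:

```python
import requests

PROM_URL = "http://prometheus:9090"  # assumption: adjust to your environment
QUERY = (
    'sum(rate(langchain_llm_requests_total{status="error"}[5m])) '
    "/ sum(rate(langchain_llm_requests_total[5m]))"
)

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=5)
result = resp.json()["data"]["result"]
error_rate = float(result[0]["value"][1]) if result else 0.0
print(f"Current error rate: {error_rate:.1%}")
```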
```python
# Quick diagnostic script
import asyncio

from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic

async def diagnose_providers():
    """Check all configured providers."""
    results = {}

    # Test OpenAI
    try:
        llm = ChatOpenAI(model="gpt-4o-mini", request_timeout=10)
        await llm.ainvoke("test")
        results["openai"] = "OK"
    except Exception as e:
        results["openai"] = f"FAIL: {e}"

    # Test Anthropic
    try:
        llm = ChatAnthropic(model="claude-3-5-sonnet-20241022", timeout=10)
        await llm.ainvoke("test")
        results["anthropic"] = "OK"
    except Exception as e:
        results["anthropic"] = f"FAIL: {e}"

    return results

# Run
print(asyncio.run(diagnose_providers()))
```
```python
# Emergency fallback configuration
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic

# Original
llm = ChatOpenAI(model="gpt-4o-mini")

# With fallback
primary = ChatOpenAI(model="gpt-4o-mini", max_retries=1, request_timeout=5)
fallback = ChatAnthropic(model="claude-3-haiku-20240307")
llm = primary.with_fallbacks([fallback])
```
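The fallback only engages when the primary raises; a quick way to confirm the wiring before relying on it under load is a single smoke-test invocation:

```python
# If the OpenAI call fails (rate limit, timeout, outage), the same prompt
# is retried transparently against the Anthropic fallback.
response = llm.invoke("Reply with the single word: pong")
print(type(response), response.content)
```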
```bash
# Check recent errors in logs
grep -i "error" /var/log/langchain/app.log | tail -50

# Check LangSmith for failed traces
# Navigate to: https://smith.langchain.com/o/YOUR_ORG/projects/YOUR_PROJECT/runs?filter=error%3Atrue
```
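The same failed-trace view is available programmatically through the LangSmith SDK, which helps when the UI is slow during an incident. A minimal sketch, assuming LANGCHAIN_API_KEY is set and your project name replaces the placeholder:

```python
from datetime import datetime, timedelta
from langsmith import Client

client = Client()  # reads LANGCHAIN_API_KEY from the environment

# Errored runs from the last hour in a given project (placeholder name).
failed_runs = client.list_runs(
    project_name="YOUR_PROJECT",
    error=True,
    start_time=datetime.now() - timedelta(hours=1),
)
for run in failed_runs:
    print(run.name, run.error)
```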
```python
# Analyze error distribution
from collections import Counter
import json

def analyze_errors(log_file: str) -> dict:
    """Analyze error patterns from logs."""
    errors = []
    with open(log_file) as f:
        for line in f:
            if "error" in line.lower():
                try:
                    log = json.loads(line)
                    errors.append(log.get("error_type", "unknown"))
                except json.JSONDecodeError:
                    # Skip non-JSON log lines
                    pass
    return dict(Counter(errors).most_common(10))

# Common error types and causes
ERROR_CAUSES = {
    "RateLimitError": "Exceeded API quota - reduce load or increase limits",
    "AuthenticationError": "Invalid API key - check secrets",
    "Timeout": "Network issues or overloaded provider",
    "OutputParserException": "LLM output format changed - check prompts",
    "ValidationError": "Schema mismatch - update Pydantic models",
}
```
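Tying the two together gives a one-screen triage summary; the log path follows the earlier example and the field names follow the assumptions above:

```python
# Print the top error types alongside their likely causes.
for error_type, count in analyze_errors("/var/log/langchain/app.log").items():
    cause = ERROR_CAUSES.get(error_type, "Unknown - inspect traces")
    print(f"{count:>5}  {error_type:<25} {cause}")
```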
```python
# 1. Reduce load
# Scale down instances or enable circuit breaker

# 2. Emergency rate limiting
import asyncio
import time
from functools import wraps

def emergency_rate_limit(calls_per_minute: int = 10):
    """Emergency rate limiter decorator."""
    interval = 60.0 / calls_per_minute
    last_call = [0.0]

    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            elapsed = time.time() - last_call[0]
            if elapsed < interval:
                await asyncio.sleep(interval - elapsed)
            last_call[0] = time.time()
            return await func(*args, **kwargs)
        return wrapper

    return decorator
```
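Applied at the LLM call site, the decorator throttles traffic without touching the rest of the chain. The wrapped function below is illustrative; the decorator and model come from the snippets above:

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

@emergency_rate_limit(calls_per_minute=10)
async def answer(question: str) -> str:
    result = await llm.ainvoke(question)
    return result.content

# asyncio.run(answer("Is the service healthy?"))
```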
```python
# 3. Enable caching for repeated queries
from langchain_core.globals import set_llm_cache
from langchain_community.cache import InMemoryCache

set_llm_cache(InMemoryCache())
```
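InMemoryCache is per-process, so it only helps a single instance; for a fleet of pods a shared cache absorbs more load. A minimal sketch using the Redis-backed cache from langchain_community, assuming a Redis instance is reachable at a hypothetical redis://redis:6379:

```python
import redis
from langchain_core.globals import set_llm_cache
from langchain_community.cache import RedisCache

# Assumption: a Redis instance shared by all API pods.
redis_client = redis.Redis.from_url("redis://redis:6379/0")
set_llm_cache(RedisCache(redis_=redis_client, ttl=3600))  # cache entries for 1h
```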
```bash
# Check memory usage
ps aux | grep python | head -5

# Check for memory leaks
# Prometheus: process_resident_memory_bytes
```
```python
# Memory profiling
import tracemalloc

tracemalloc.start()

# Run your chain
chain.invoke({"input": "test"})

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')

print("Top 10 memory allocations:")
for stat in top_stats[:10]:
    print(stat)
```
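A single snapshot shows where memory sits; a leak is easier to spot by diffing snapshots across repeated invocations. This follows the same tracemalloc pattern (the loop count and chain are illustrative):

```python
# Compare allocations before and after a burst of requests to find growth.
before = tracemalloc.take_snapshot()
for _ in range(100):
    chain.invoke({"input": "test"})
after = tracemalloc.take_snapshot()

print("Top growth between snapshots:")
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)
```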
```python
# 1. Clear caches
from langchain_core.globals import set_llm_cache
set_llm_cache(None)

# 2. Reduce batch sizes
# Change from: chain.batch(inputs, config={"max_concurrency": 50})
# To:          chain.batch(inputs, config={"max_concurrency": 10})

# 3. Restart pods gracefully
# kubectl rollout restart deployment/langchain-api
```
```bash
# Check token usage
# Prometheus: sum(increase(langchain_llm_tokens_total[1h]))

# OpenAI usage dashboard
# https://platform.openai.com/usage
```
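For a quick in-process view without Prometheus, LangChain's OpenAI callback reports tokens and estimated cost per call. A minimal sketch wrapped around whichever chain is suspected (the chain itself is assumed to exist):

```python
from langchain_community.callbacks import get_openai_callback

with get_openai_callback() as cb:
    chain.invoke({"input": "test"})

print(f"tokens={cb.total_tokens} prompt={cb.prompt_tokens} "
      f"completion={cb.completion_tokens} cost=${cb.total_cost:.4f}")
```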
```python
# Identify high-cost operations
def analyze_costs(traces: list) -> dict:
    """Analyze cost from trace data."""
    by_chain = {}
    for trace in traces:
        chain_name = trace.get("name", "unknown")
        tokens = trace.get("total_tokens", 0)
        if chain_name not in by_chain:
            by_chain[chain_name] = {"count": 0, "tokens": 0}
        by_chain[chain_name]["count"] += 1
        by_chain[chain_name]["tokens"] += tokens
    # Highest token consumers first
    return dict(sorted(by_chain.items(), key=lambda x: x[1]["tokens"], reverse=True))
```
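Fed with trace dictionaries in the shape assumed above (each carrying name and total_tokens keys), the output reads as a ranked cost table; the sample inputs here are purely illustrative:

```python
traces = [
    {"name": "summarize_chain", "total_tokens": 12_000},
    {"name": "qa_chain", "total_tokens": 800},
    {"name": "summarize_chain", "total_tokens": 9_500},
]
for chain_name, stats in analyze_costs(traces).items():
    print(f"{chain_name}: {stats['count']} calls, {stats['tokens']} tokens")
```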
```python
# 1. Emergency budget limit
class BudgetExceeded(Exception):
    pass

daily_spend = 0.0
DAILY_LIMIT = 100.0  # $100

def check_budget(cost: float):
    global daily_spend
    daily_spend += cost
    if daily_spend > DAILY_LIMIT:
        raise BudgetExceeded(f"Daily limit ${DAILY_LIMIT} exceeded")

# 2. Switch to cheaper model
# gpt-4o -> gpt-4o-mini (30x cheaper)
# claude-3-5-sonnet -> claude-3-haiku (12x cheaper)

# 3. Enable aggressive caching
```
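To actually enforce the budget limit from step 1, check_budget needs a per-request cost. One hedged way to wire it in is a custom callback handler, assuming OpenAI-style token_usage in the response metadata and an illustrative flat price per 1K tokens (replace with your model's real pricing):

```python
from langchain_core.callbacks import BaseCallbackHandler
from langchain_core.outputs import LLMResult

COST_PER_1K_TOKENS = 0.002  # assumption: substitute your model's actual rate

class BudgetGuard(BaseCallbackHandler):
    """Calls check_budget after every LLM call, aborting once the cap is hit."""

    raise_error = True  # propagate BudgetExceeded instead of logging it

    def on_llm_end(self, response: LLMResult, **kwargs) -> None:
        usage = (response.llm_output or {}).get("token_usage", {})
        tokens = usage.get("total_tokens", 0)
        check_budget(tokens / 1000 * COST_PER_1K_TOKENS)

# Usage: attach to the model so every call is metered.
# llm = ChatOpenAI(model="gpt-4o-mini", callbacks=[BudgetGuard()])
```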
Use langchain-debug-bundle for detailed evidence collection.