Standard operating procedures for LangChain production incidents: provider outages, error rate spikes, latency degradation, memory issues, and cost overruns.
| Level | Description | Response Time | Example |
|---|---|---|---|
| SEV1 | Complete outage | 15 min | All LLM calls failing |
| SEV2 | Major degradation | 30 min | >50% error rate, >10s latency |
| SEV3 | Minor degradation | 2 hours | <10% errors, slow responses |
| SEV4 | Low impact | 24 hours | Intermittent issues, warnings |
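Where triage is automated, the table above can be encoded directly. A minimal sketch, assuming error rate (0..1) and p95 latency are already being collected; `classifySeverity` and its exact thresholds are illustrative, not part of the pack:

```typescript
type Severity = "SEV1" | "SEV2" | "SEV3" | "SEV4";

// Thresholds mirror the severity table above; tune them to your SLOs
function classifySeverity(errorRate: number, p95LatencySec: number): Severity {
  if (errorRate >= 1.0) return "SEV1"; // complete outage: all LLM calls failing
  if (errorRate > 0.5 || p95LatencySec > 10) return "SEV2"; // major degradation
  if (errorRate > 0 || p95LatencySec > 5) return "SEV3"; // minor degradation
  return "SEV4"; // low impact: intermittent issues, warnings
}
```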
```bash
# Check provider status pages
curl -s https://status.openai.com/api/v2/status.json | jq '.status'
curl -s https://status.anthropic.com/api/v2/status.json | jq '.status'
```
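The same checks can run inside a Node health probe; a sketch using the built-in `fetch` (Node 18+) against the Statuspage endpoints curled above:

```typescript
// Poll each provider's public status page; indicator "none" means all clear
const STATUS_PAGES: Record<string, string> = {
  openai: "https://status.openai.com/api/v2/status.json",
  anthropic: "https://status.anthropic.com/api/v2/status.json",
};

for (const [provider, url] of Object.entries(STATUS_PAGES)) {
  const body = await (await fetch(url)).json();
  console.log(`${provider}: ${body.status.indicator} (${body.status.description})`);
}
```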
```typescript
import { ChatOpenAI } from "@langchain/openai";
import { ChatAnthropic } from "@langchain/anthropic";

// Probe each provider with a cheap one-word call to see which side is down
async function diagnoseProviders() {
  const results: Record<string, string> = {};
  try {
    const openai = new ChatOpenAI({ model: "gpt-4o-mini", timeout: 10000 });
    await openai.invoke("ping");
    results.openai = "OK";
  } catch (e: any) {
    results.openai = `FAIL: ${e.message.slice(0, 100)}`;
  }
  try {
    const anthropic = new ChatAnthropic({ model: "claude-sonnet-4-20250514" });
    await anthropic.invoke("ping");
    results.anthropic = "OK";
  } catch (e: any) {
    results.anthropic = `FAIL: ${e.message.slice(0, 100)}`;
  }
  console.table(results);
  return results;
}

await diagnoseProviders();
```
```typescript
// Enable fallback: switch to the healthy provider
const primary = new ChatOpenAI({
  model: "gpt-4o-mini",
  maxRetries: 1,
  timeout: 5000,
});
const fallback = new ChatAnthropic({
  model: "claude-sonnet-4-20250514",
  maxRetries: 1,
});
const resilientModel = primary.withFallbacks({
  fallbacks: [fallback],
});
// All chains using resilientModel fail over automatically
```
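Usage is unchanged from a single model; a sketch of wiring `resilientModel` into a chain (the prompt is illustrative):

```typescript
import { ChatPromptTemplate } from "@langchain/core/prompts";

// If the OpenAI call fails, the same invocation retries on Anthropic
const prompt = ChatPromptTemplate.fromTemplate("Summarize in one line: {text}");
const chain = prompt.pipe(resilientModel);
const summary = await chain.invoke({ text: "Incident began at 14:02 UTC ..." });
```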
```bash
# Check LangSmith for error spike
# https://smith.langchain.com/o/YOUR_ORG/projects/YOUR_PROJECT/runs?filter=error:true

# Count error lines in recent application logs
tail -n 1000 /var/log/app/langchain.log | grep -ci "error"
```
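The same check can be scripted against the LangSmith SDK; a sketch assuming the `langsmith` client's `listRuns` error filter (verify the parameter names against your SDK version):

```typescript
import { Client } from "langsmith";

// Reads the LangSmith API key from the environment
const client = new Client();
let errored = 0;
for await (const run of client.listRuns({
  projectName: "YOUR_PROJECT",
  error: true, // assumption: only errored runs are returned
  limit: 100,
})) {
  errored++;
}
console.log(`errored runs among the 100 most recent scanned: ${errored}`);
```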
```typescript
// Common error patterns and their usual fixes
const ERROR_CAUSES: Record<string, string> = {
  "RateLimitError": "API quota exceeded -> reduce concurrency",
  "AuthenticationError": "API key invalid -> check secrets",
  "Timeout": "Provider slow -> increase timeout",
  "OutputParserException": "LLM output format changed -> check prompts",
  "ValidationError": "Schema mismatch -> update Zod schemas",
  "ContextLengthExceeded": "Input too long -> truncate or chunk",
};
```
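During triage this table can be applied mechanically; `explainError` is a hypothetical helper that matches a caught error against the patterns above:

```typescript
// Return the likely cause and fix for a caught error, or a truncated fallback
function explainError(e: Error): string {
  for (const [pattern, cause] of Object.entries(ERROR_CAUSES)) {
    if (e.name.includes(pattern) || e.message.includes(pattern)) {
      return `${pattern}: ${cause}`;
    }
  }
  return `Unrecognized: ${e.message.slice(0, 100)}`;
}
```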
```typescript
// 1. Reduce load: lower maxConcurrency on batch operations

// 2. Enable caching for repeated queries
const cache = new Map<string, any>();
async function withCache(chain: any, input: any) {
  const key = JSON.stringify(input);
  if (cache.has(key)) return cache.get(key);
  const result = await chain.invoke(input);
  cache.set(key, result);
  return result;
}

// 3. Enable the fallback model
const model = primary.withFallbacks({ fallbacks: [fallback] });
```
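Note that the cache above is an unbounded in-memory `Map`; for anything long-running, prefer an LRU or TTL cache so the mitigation does not turn into a memory incident of its own.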
```promql
# Prometheus query: p95 LLM latency over the last 5 minutes, flagging above 5s
histogram_quantile(0.95, rate(langchain_llm_latency_seconds_bucket[5m])) > 5
```
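The same threshold can be checked from a script via the Prometheus HTTP API; a sketch, with `PROM_URL` standing in for your Prometheus server:

```typescript
const PROM_URL = "http://prometheus:9090"; // assumption: adjust to your setup
const query =
  "histogram_quantile(0.95, rate(langchain_llm_latency_seconds_bucket[5m]))";

const res = await fetch(
  `${PROM_URL}/api/v1/query?query=${encodeURIComponent(query)}`
);
const data = await res.json();
// each result's value is [timestamp, "value-as-string"]
const p95 = parseFloat(data.data.result[0]?.value[1] ?? "0");
console.log(p95 > 5 ? `latency SEV2: p95=${p95}s` : `p95 OK: ${p95}s`);
```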
```typescript
// Measure per-component latency. MetricsCallback is assumed to come from
// this pack's observability setup; substitute your own metrics callback.
const tracer = new MetricsCallback();
await chain.invoke({ input: "test" }, { callbacks: [tracer] });
console.table(tracer.getReport());
// Check: is it the LLM, retriever, or tool that's slow?
// Model choice also matters: gpt-4o-mini (200ms TTFT) vs gpt-4o (400ms)
```

```bash
# Check OpenAI usage dashboard
# https://platform.openai.com/usage
```
```typescript
// 1. Emergency model downgrade
// gpt-4o ($2.50/1M input tokens) -> gpt-4o-mini ($0.15/1M) = ~17x cheaper

// 2. Enable budget enforcement (BudgetEnforcer: a cost-cap callback;
//    a sketch of one possible implementation follows below)
const budget = new BudgetEnforcer(50.0); // $50 daily limit
const model = new ChatOpenAI({
  model: "gpt-4o-mini",
  callbacks: [budget],
});

// 3. Enable aggressive caching (see langchain-cost-tuning skill)
```
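`BudgetEnforcer` is used above without a definition; a minimal sketch of one possible implementation as a LangChain callback handler (the blended token price is an assumption for illustration):

```typescript
import { BaseCallbackHandler } from "@langchain/core/callbacks/base";
import type { LLMResult } from "@langchain/core/outputs";

class BudgetEnforcer extends BaseCallbackHandler {
  name = "budget_enforcer";
  private spentUsd = 0;

  constructor(private dailyLimitUsd: number) {
    super();
  }

  async handleLLMEnd(output: LLMResult): Promise<void> {
    // tokenUsage is reported by OpenAI-compatible chat models
    const tokens = output.llmOutput?.tokenUsage?.totalTokens ?? 0;
    this.spentUsd += (tokens / 1_000_000) * 0.3; // assumed blended $/1M tokens
    if (this.spentUsd > this.dailyLimitUsd) {
      throw new Error(`Daily LLM budget of $${this.dailyLimitUsd} exceeded`);
    }
  }
}
```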
```bash
# Check process memory
ps aux --sort=-%mem | head -5
# Node.js heap stats
node -e "console.log(process.memoryUsage())"
```
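If heap growth is the suspect, an in-process watchdog catches it before the OOM killer does; a sketch with an assumed 1.5 GB ceiling:

```typescript
const HEAP_LIMIT_BYTES = 1.5 * 1024 ** 3; // assumption: tune to your container

setInterval(() => {
  const { heapUsed, rss } = process.memoryUsage();
  if (heapUsed > HEAP_LIMIT_BYTES) {
    console.error(
      `heapUsed ${(heapUsed / 1e6).toFixed(0)} MB over limit ` +
        `(rss ${(rss / 1e6).toFixed(0)} MB)`
    );
    // e.g. drop in-memory caches here, or exit(1) so the orchestrator restarts the pod
  }
}, 30_000);
```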
Mitigate by lowering `maxConcurrency` on batch operations, then restart the deployment to reclaim memory:

```bash
kubectl rollout restart deployment/langchain-api
```

Use langchain-debug-bundle for detailed evidence collection during incidents.