From langfuse-pack
Troubleshoots Langfuse outages and production issues via triage scripts, severity levels, and response steps for LLM observability. Use for missing traces, auth errors, or rate limits.
npx claudepluginhub jeremylongshore/claude-code-plugins-plus-skills --plugin langfuse-pack
Step-by-step procedures for Langfuse-related incidents, from initial triage (2 min) through resolution and post-incident review. Your application should work without Langfuse -- these procedures focus on restoring observability.
| Severity | Description | Response Time | Example |
|---|---|---|---|
| P1 | Application impacted by tracing | 15 min | SDK throwing unhandled errors, blocking requests |
| P2 | Traces not appearing, no app impact | 1 hour | Missing observability data |
| P3 | Degraded performance from tracing | 4 hours | High latency from flush backlog |
| P4 | Minor issues | 24 hours | Occasional missing traces |
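The response times in the table can be turned into a deadline check for alerting or paging. This is a minimal sketch, not part of any Langfuse SDK: the names `Severity`, `RESPONSE_MINUTES`, `responseDeadline`, and `isOverdue` are hypothetical, and the minute values are copied from the table above.

```typescript
// Hypothetical helper; names are illustrative only.
type Severity = "P1" | "P2" | "P3" | "P4";

// Response times in minutes, taken from the severity table.
const RESPONSE_MINUTES: Record<Severity, number> = {
  P1: 15,
  P2: 60,
  P3: 240,
  P4: 1440,
};

// Time by which someone should have responded to the incident.
function responseDeadline(severity: Severity, detectedAt: Date): Date {
  return new Date(detectedAt.getTime() + RESPONSE_MINUTES[severity] * 60_000);
}

// True once the response window for this severity has elapsed.
function isOverdue(severity: Severity, detectedAt: Date, now: Date = new Date()): boolean {
  return now.getTime() > responseDeadline(severity, detectedAt).getTime();
}
```

For example, `isOverdue("P1", detectedAt)` could gate an escalation reminder in an on-call bot.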
set -euo pipefail
echo "=== Langfuse Incident Triage ==="
echo "Time: $(date -u)"
# 1. Check Langfuse cloud status
echo -n "Status page: "
curl -s -o /dev/null -w "%{http_code}" https://status.langfuse.com || echo "UNREACHABLE"
echo ""
# 2. Test API connectivity
HOST="${LANGFUSE_BASE_URL:-${LANGFUSE_HOST:-https://cloud.langfuse.com}}"
echo -n "API health: "
curl -s -o /dev/null -w "%{http_code} (%{time_total}s)" "$HOST/api/public/health" || echo "FAILED"
echo ""
# 3. Test auth
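if [ -n "${LANGFUSE_PUBLIC_KEY:-}" ] && [ -n "${LANGFUSE_SECRET_KEY:-}" ]; then
  # Strip newlines: GNU base64 wraps output at 76 characters, which would break the header
  AUTH=$(printf '%s' "$LANGFUSE_PUBLIC_KEY:$LANGFUSE_SECRET_KEY" | base64 | tr -d '\n')
  echo -n "Auth test: "
  curl -s -o /dev/null -w "%{http_code}" \
    -H "Authorization: Basic $AUTH" "$HOST/api/public/traces?limit=1" || echo "FAILED"
  echo ""
else
  echo "Auth test: SKIPPED (LANGFUSE_PUBLIC_KEY or LANGFUSE_SECRET_KEY not set)"
fi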
# 4. Check app error logs
echo ""
echo "--- Recent errors ---"
grep -i "langfuse\|trace.*error\|flush.*fail" /var/log/app/*.log 2>/dev/null | tail -10 || echo "No matching log lines found"
| Symptom | Likely Cause | Immediate Action |
|---|---|---|
| No traces appearing | SDK not flushing | Check shutdown handlers; set flushAt: 1 temporarily |
| 401 Unauthorized | Key rotation or mismatch | Verify keys match the correct project |
| 429 Too Many Requests | Rate limited | Increase batch size, reduce flush frequency |
| SDK throwing errors | Unhandled exception | Wrap in try/catch; check SDK version |
| High request latency | Sync flush in hot path | Switch to async; increase requestTimeout |
| Complete Langfuse outage | Service-side issue | Enable fallback mode |
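The table can also be read as a lookup from the status codes printed by the triage script. The sketch below is one possible encoding, assuming the health and auth checks are the only inputs. `suggestAction` and `TriageResult` are hypothetical names, and treating any 5xx health response as a service-side issue is an assumption.

```typescript
// Hypothetical helper; maps triage status codes to rows of the table above.
type TriageResult = { cause: string; action: string };

function suggestAction(healthStatus: number, authStatus: number): TriageResult {
  // curl prints 000 (parsed as 0) when the host is unreachable.
  if (healthStatus === 0 || healthStatus >= 500) {
    return { cause: "Service-side issue", action: "Enable fallback mode" };
  }
  if (authStatus === 401) {
    return { cause: "Key rotation or mismatch", action: "Verify keys match the correct project" };
  }
  if (authStatus === 429) {
    return { cause: "Rate limited", action: "Increase batch size, reduce flush frequency" };
  }
  if (healthStatus === 200 && authStatus === 200) {
    // The API is healthy and the keys work, so missing traces point at the SDK.
    return { cause: "SDK not flushing", action: "Check shutdown handlers; set flushAt: 1 temporarily" };
  }
  return { cause: "Unknown", action: "No matching row; escalate with the triage output" };
}
```

The order of the checks matters: an outage is ruled out first, because auth results are meaningless while the API is down.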
If Langfuse is causing application issues, disable tracing immediately:
// Emergency disable via environment variable
// Set LANGFUSE_ENABLED=false in your deployment
// In your tracing initialization:
if (process.env.LANGFUSE_ENABLED === "false") {
  console.warn("Langfuse tracing DISABLED (emergency fallback)");
  // Don't initialize SDK -- all observe/startActiveObservation calls
  // will still work but produce no-op spans
}
For v3, use the enabled flag:
import { Langfuse } from "langfuse";

const langfuse = new Langfuse({
  enabled: process.env.LANGFUSE_ENABLED !== "false",
});
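Disabling the SDK handles the emergency; wrapping tracing calls in try/catch keeps a tracing failure from reaching the request path in the first place. This is a minimal sketch with hypothetical helpers (`isTracingEnabled`, `safeTrace`). It assumes the tracing call is synchronous, and it takes the environment as a parameter so it can be tested without touching `process.env`.

```typescript
// Hypothetical helpers for illustration; not part of any Langfuse SDK.
type Env = Record<string, string | undefined>;

// Tracing stays on unless LANGFUSE_ENABLED is explicitly "false".
function isTracingEnabled(env: Env): boolean {
  return env.LANGFUSE_ENABLED !== "false";
}

// Runs a tracing callback without letting its failure reach the caller.
// Returns true only if the callback ran and did not throw.
function safeTrace(
  traceFn: () => void,
  enabled: boolean,
  onError: (err: unknown) => void = () => {},
): boolean {
  if (!enabled) return false;
  try {
    traceFn();
    return true;
  } catch (err) {
    // Report the failure, but never rethrow into application code.
    onError(err);
    return false;
  }
}
```

In an application this might be called as `safeTrace(() => recordSpan(), isTracingEnabled(process.env))`, where `recordSpan` stands in for whatever tracing call the code makes.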
Procedure A: Missing Traces
// 1. Verify SDK is initialized
console.log("Langfuse configured:", !!process.env.LANGFUSE_PUBLIC_KEY);
// 2. Check flush is happening
// v4+: Verify NodeSDK is started and shutdown is registered
// v3: Verify flushAsync() or shutdownAsync() is called
// 3. Temporarily set aggressive flush for debugging
// (option names differ between SDK versions; check the constructor options
// of your installed version before relying on these)
const processor = new LangfuseSpanProcessor({
  exportIntervalMillis: 1000,
  maxExportBatchSize: 1,
});
Procedure B: Rate Limit (429) Recovery
// Increase batching to reduce API calls
// (option names differ between SDK versions; check the constructor options
// of your installed version before relying on these)
const processor = new LangfuseSpanProcessor({
  exportIntervalMillis: 30000, // 30s flush
  maxExportBatchSize: 200, // Large batches
});
// Or temporarily enable sampling
const EMERGENCY_SAMPLE_RATE = 0.1; // Only trace 10%
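The snippet above defines `EMERGENCY_SAMPLE_RATE` without showing how it is applied. One way is deterministic sampling keyed on the trace ID, so that every span of a given trace gets the same keep or drop decision. This is a sketch under that assumption: `fnv1a` and `shouldTrace` are hypothetical helpers, and the hash is a plain 32-bit FNV-1a over UTF-16 code units, which matches the byte-wise algorithm only for ASCII IDs.

```typescript
const EMERGENCY_SAMPLE_RATE = 0.1; // Only trace 10%

// 32-bit FNV-1a hash over the string's UTF-16 code units.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep unsigned
  }
  return hash >>> 0;
}

// Deterministic keep/drop decision: the same trace ID always gets the same answer.
function shouldTrace(traceId: string, sampleRate: number = EMERGENCY_SAMPLE_RATE): boolean {
  if (sampleRate >= 1) return true;
  if (sampleRate <= 0) return false;
  // Map the hash into [0, 1) and keep the trace if it falls below the rate.
  return fnv1a(traceId) / 0x100000000 < sampleRate;
}
```

Hashing the ID rather than calling `Math.random()` avoids partially recorded traces when several services make the decision independently.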
Procedure C: Self-Hosted Instance Down
set -euo pipefail
# Check container status (do not abort the script if nothing matches)
docker ps -a | grep langfuse || echo "No langfuse containers found"
# Check logs
docker logs langfuse-langfuse-1 --tail 50
# Check database
docker exec langfuse-postgres-1 pg_isready -U langfuse
# Restart if needed
docker compose restart langfuse
set -euo pipefail
# Verify traces are flowing again
echo "=== Post-Incident Check ==="
HOST="${LANGFUSE_BASE_URL:-${LANGFUSE_HOST:-https://cloud.langfuse.com}}"
# Strip newlines: GNU base64 wraps output at 76 characters, which would break the header
AUTH=$(printf '%s' "$LANGFUSE_PUBLIC_KEY:$LANGFUSE_SECRET_KEY" | base64 | tr -d '\n')
# Check recent trace count
TRACE_COUNT=$(curl -s \
  -H "Authorization: Basic $AUTH" \
  "$HOST/api/public/traces?limit=5" | python3 -c "import sys,json; print(len(json.load(sys.stdin).get('data',[])))" 2>/dev/null || echo "ERROR")
echo "Recent traces: $TRACE_COUNT"
if [ "$TRACE_COUNT" = "0" ] || [ "$TRACE_COUNT" = "ERROR" ]; then
  echo "WARNING: Traces may not be flowing yet"
else
  echo "OK: Traces are appearing"
fi
Document for post-mortem:
| Level | Who | When |
|---|---|---|
| L1 | On-call engineer | All incidents -- run triage |
| L2 | Platform team lead | P1/P2 unresolved after 30 min |
| L3 | Langfuse support | Confirmed service-side issue |
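The escalation table reduces to a small rule. This sketch assumes that "unresolved after 30 min" means 30 minutes or more since detection, and that a confirmed service-side issue goes to L3 regardless of severity. `escalationLevel` is a hypothetical name.

```typescript
// Hypothetical helper encoding the escalation table above.
type Severity = "P1" | "P2" | "P3" | "P4";
type EscalationLevel = "L1" | "L2" | "L3";

function escalationLevel(
  severity: Severity,
  minutesUnresolved: number,
  serviceSideConfirmed: boolean,
): EscalationLevel {
  // A confirmed service-side issue goes to Langfuse support.
  if (serviceSideConfirmed) return "L3";
  // P1/P2 incidents unresolved after 30 minutes go to the platform team lead.
  if ((severity === "P1" || severity === "P2") && minutesUnresolved >= 30) return "L2";
  // Everything else stays with the on-call engineer.
  return "L1";
}
```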
Langfuse support channels:
| Issue | Immediate Fix | Permanent Fix |
|---|---|---|
| SDK crashes app | Set LANGFUSE_ENABLED=false | Wrap all tracing in try/catch |
| Lost traces | Increase batch size | Add shutdown handlers |
| High latency | Disable sync flush | Use async-only patterns |
| Auth failures | Rotate and redeploy keys | Add key validation at startup |
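The "Add key validation at startup" fix in the last row could look like the sketch below. It checks only that both keys are present and look plausible. The `pk-lf-` and `sk-lf-` prefixes are an assumption about the key format, and `validateLangfuseEnv` is a hypothetical name. A format check cannot tell whether the keys belong to the right project, so the auth test in the triage script is still needed for that.

```typescript
// Hypothetical startup check; not part of any Langfuse SDK.
type Env = Record<string, string | undefined>;
type KeyCheck = { ok: boolean; problems: string[] };

function validateLangfuseEnv(env: Env): KeyCheck {
  const problems: string[] = [];
  const publicKey = (env.LANGFUSE_PUBLIC_KEY ?? "").trim();
  const secretKey = (env.LANGFUSE_SECRET_KEY ?? "").trim();

  if (!publicKey) {
    problems.push("LANGFUSE_PUBLIC_KEY is not set");
  } else if (!publicKey.startsWith("pk-lf-")) {
    // Assumed prefix; adjust if your keys use a different format.
    problems.push("LANGFUSE_PUBLIC_KEY does not start with pk-lf-");
  }

  if (!secretKey) {
    problems.push("LANGFUSE_SECRET_KEY is not set");
  } else if (!secretKey.startsWith("sk-lf-")) {
    // Assumed prefix; a public key pasted here is a common mix-up.
    problems.push("LANGFUSE_SECRET_KEY does not start with sk-lf-");
  }

  return { ok: problems.length === 0, problems };
}
```

Logging `problems` at startup and continuing with tracing disabled keeps the application independent of Langfuse configuration.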