From lindy-pack
Provides incident response runbook for Lindy AI agent outages, with severity levels, bash diagnostics, credit checks, and playbooks including fallbacks.
npx claudepluginhub jeremylongshore/claude-code-plugins-plus-skills --plugin lindy-packThis skill is limited to using the following tools:
Incident response procedures for Lindy AI agent failures. Covers platform outages,
Troubleshoots Lindy AI agent errors in triggers (webhook, email, schedule) and actions (Slack, Gmail sends). Provides diagnostics, causes, and fixes with curl examples.
Mandates invoking relevant skills via tools before any response in coding sessions. Covers access, priorities, and adaptations for Claude Code, Copilot CLI, Gemini CLI.
Share bugs, ideas, or general feedback.
Incident response procedures for Lindy AI agent failures. Covers platform outages, individual agent failures, integration breakdowns, credit exhaustion, and webhook endpoint failures.
| Severity | Description | Response Time | Examples |
|---|---|---|---|
| SEV1 | All agents failing, customer impact | 15 minutes | Lindy platform outage, all webhooks failing |
| SEV2 | Critical agent down | 30 minutes | Support bot offline, phone agent unreachable |
| SEV3 | Degraded performance | 2 hours | High latency, intermittent failures |
| SEV4 | Minor issue | 24 hours | Non-critical agent misconfigured |
# Is Lindy up?
curl -s -o /dev/null -w "Lindy API: HTTP %{http_code}\n" \
"https://public.lindy.ai" --max-time 5
# Check status page
echo "Status page: https://status.lindy.ai"
# Is your webhook receiver up?
curl -s -o /dev/null -w "Our endpoint: HTTP %{http_code}\n" \
"https://api.yourapp.com/health" --max-time 5
# Is the webhook auth working?
curl -s -o /dev/null -w "Webhook auth: HTTP %{http_code}\n" \
-X POST "https://api.yourapp.com/lindy/callback" \
-H "Authorization: Bearer $LINDY_WEBHOOK_SECRET" \
-H "Content-Type: application/json" \
-d '{"test": true}' --max-time 5
Log in at https://app.lindy.ai > Settings > Billing
Symptoms: All agents failing, status.lindy.ai shows incident Impact: All Lindy-dependent workflows halted
Runbook:
Fallback code:
async function triggerLindyWithFallback(payload: any) {
try {
const response = await fetch(WEBHOOK_URL, {
method: 'POST',
headers: {
'Authorization': `Bearer ${SECRET}`,
'Content-Type': 'application/json',
},
body: JSON.stringify(payload),
signal: AbortSignal.timeout(10000), // 10s timeout
});
if (!response.ok) throw new Error(`HTTP ${response.status}`);
return { routed: 'lindy' };
} catch (error) {
console.error('Lindy unreachable, activating fallback:', error);
await queueForReplay(payload); // Store for later
await notifyTeam(`Lindy trigger failed: ${error}`);
return { routed: 'fallback' };
}
}
Symptoms: Specific agent tasks showing "Failed" status Impact: One workflow affected, others may be fine
Runbook:
Symptoms: Actions failing with "Not authorized" or "Token expired" Impact: All tasks using that integration fail
Runbook:
Symptoms: Agents stop running, no new tasks created Impact: All agents paused until credits refill
Runbook:
Symptoms: Lindy agent runs but your callback never receives data Impact: Agent completes but results are lost
Runbook:
curl -s https://api.yourapp.com/health| Level | Contact | When |
|---|---|---|
| L1 | On-call engineer | Initial response, diagnostics |
| L2 | Engineering lead | After 30 min SEV1, 1 hour SEV2 |
| L3 | VP Engineering | After 1 hour SEV1 |
| Lindy Support | support@lindy.ai | Confirmed Lindy platform issue |
## Incident Report
**Date**: YYYY-MM-DD
**Severity**: SEV[1-4]
**Duration**: [start time] to [end time] ([total minutes])
**Impact**: [what was affected, customer impact]
### Timeline
- HH:MM — Issue detected via [monitoring/user report]
- HH:MM — On-call paged, diagnostics started
- HH:MM — Root cause identified: [cause]
- HH:MM — Fix applied: [what was done]
- HH:MM — Service restored, monitoring confirmed
### Root Cause
[Technical description of what failed and why]
### Resolution
[What was done to fix it]
### Prevention
- [ ] [Action item 1]
- [ ] [Action item 2]
- [ ] [Action item 3]
| Incident Type | Detection | Automated Response |
|---|---|---|
| Platform outage | Health check fails | Queue events, notify team |
| Agent failure | Task Completed trigger | Slack alert to #ops |
| Auth expiry | Action step fails | Alert + re-auth link |
| Credit exhaustion | Billing check | Pause non-critical agents |
| Endpoint down | Health check | Redirect to fallback |
Proceed to lindy-data-handling for data security and compliance.