From snowflake-pack
Runs Snowflake incident triage with status checks, SQL diagnostics for query/task failures, and decision tree. For outages, pipeline issues, and postmortems.
How this skill is triggered — by the user, by Claude, or both
Slash command
/snowflake-pack:snowflake-incident-runbookThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Rapid incident response procedures for Snowflake infrastructure, pipeline failures, and query issues.
Rapid incident response procedures for Snowflake infrastructure, pipeline failures, and query issues.
| Level | Definition | Response Time | Examples |
|---|---|---|---|
| P1 | Complete outage | < 15 min | All queries failing, auth broken |
| P2 | Degraded service | < 1 hour | High latency, task failures |
| P3 | Minor impact | < 4 hours | Snowpipe delays, non-critical errors |
| P4 | No user impact | Next business day | Monitoring gaps, cost anomalies |
# Check Snowflake status page
curl -s https://status.snowflake.com/api/v2/summary.json | python3 -c "
import sys, json
data = json.load(sys.stdin)
print(f\"Status: {data['status']['description']}\")
for c in data['components']:
if c['status'] != 'operational':
print(f\" DEGRADED: {c['name']} - {c['status']}\")
"
-- Quick connectivity test
SELECT CURRENT_TIMESTAMP(), CURRENT_ACCOUNT(), CURRENT_REGION();
-- If this fails, the issue is connectivity/auth, not query logic
-- Recent failures (last 30 minutes)
SELECT error_code, error_message, COUNT(*) AS occurrences,
MIN(start_time) AS first_seen, MAX(start_time) AS last_seen
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE execution_status = 'FAIL'
AND start_time >= DATEADD(minutes, -30, CURRENT_TIMESTAMP())
GROUP BY error_code, error_message
ORDER BY occurrences DESC;
-- Failed tasks
SELECT name, state, error_message, scheduled_time, completed_time
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(
SCHEDULED_TIME_RANGE_START => DATEADD(hours, -1, CURRENT_TIMESTAMP())
))
WHERE state = 'FAILED'
ORDER BY scheduled_time DESC;
-- Stale streams (data loss risk)
SHOW STREAMS;
-- Check STALE column — if TRUE, stream offset is beyond retention
Query failures?
├─ Auth errors (390100, 390144)
│ → Check credentials, key pair, network policy
├─ Object not found (002003)
│ → Wrong context? Permissions? Object dropped?
├─ Warehouse issues (000606)
│ → Warehouse suspended? Resource monitor hit?
├─ Timeout (100038)
│ → Query too slow? Warehouse too small?
└─ Snowflake platform issue (5xx, connectivity)
→ Check status.snowflake.com → enable fallback
Pipeline failures?
├─ Task failed
│ → Check TASK_HISTORY error_message
│ → Is source stream stale?
├─ Snowpipe not loading
│ → SYSTEM$PIPE_STATUS('pipe_name')
│ → Check S3 event notifications
└─ Dynamic table not refreshing
→ Check DYNAMIC_TABLE_REFRESH_HISTORY
-- Check login failures
SELECT user_name, client_ip, error_code, error_message, event_timestamp
FROM SNOWFLAKE.ACCOUNT_USAGE.LOGIN_HISTORY
WHERE is_success = 'NO'
AND event_timestamp >= DATEADD(minutes, -30, CURRENT_TIMESTAMP())
ORDER BY event_timestamp DESC;
-- If key pair issue — verify public key assignment
DESC USER svc_etl;
-- Check RSA_PUBLIC_KEY and RSA_PUBLIC_KEY_2
-- Check resource monitor status
SHOW RESOURCE MONITORS;
-- Temporarily increase quota
ALTER RESOURCE MONITOR prod_monitor SET CREDIT_QUOTA = 3000;
-- Or switch to a different warehouse
ALTER SESSION SET WAREHOUSE = BACKUP_WH;
-- If stream is stale, data between old and new offset is lost
-- You must recreate the stream and backfill
-- Check stream status
SELECT * FROM TABLE(INFORMATION_SCHEMA.STREAMS())
WHERE stale = TRUE;
-- Recreate stream
DROP STREAM IF EXISTS orders_stream;
CREATE STREAM orders_stream ON TABLE raw_orders;
-- Backfill from Time Travel
INSERT INTO dim_orders
SELECT * FROM raw_orders
AT (TIMESTAMP => '<last_known_good_timestamp>'::TIMESTAMP_NTZ)
WHERE order_id NOT IN (SELECT order_id FROM dim_orders);
-- Use Time Travel to restore table to pre-deployment state
CREATE OR REPLACE TABLE prod_dw.silver.users
CLONE prod_dw.silver.users
AT (TIMESTAMP => '2026-03-22 08:00:00'::TIMESTAMP_NTZ);
-- Or use UNDROP for accidentally dropped objects
UNDROP TABLE prod_dw.silver.users;
UNDROP SCHEMA prod_dw.silver;
UNDROP DATABASE prod_dw;
-- Suspend problematic tasks
ALTER TASK transform_orders SUSPEND;
Internal (Slack):
P1 INCIDENT: Snowflake [Category]
Status: INVESTIGATING
Impact: [Describe user/pipeline impact]
Current action: [What you're doing now]
Next update: [Time]
Incident commander: @[name]
Postmortem Template:
## Incident: [Title]
**Date:** YYYY-MM-DD | **Duration:** X hours | **Severity:** P[1-4]
### Summary
[1-2 sentences]
### Timeline (UTC)
- HH:MM — [Event/detection]
- HH:MM — [Response action]
- HH:MM — [Resolution]
### Root Cause
[Technical explanation referencing specific error codes and query IDs]
### Impact
- Pipelines affected: N
- Data freshness delay: X hours
- Credit overage: Y credits
### Action Items
- [ ] [Preventive measure] — Owner — Due date
| Issue | Cause | Solution |
|---|---|---|
| Can't query ACCOUNT_USAGE | Missing privileges | Use ACCOUNTADMIN or grant IMPORTED PRIVILEGES |
| Time Travel expired | Past retention period | Cannot recover; increase retention proactively |
| Task won't resume | Dependency chain issue | Resume children first, then parent |
| Snowpipe backlog | S3 notification gap | Check SQS queue, run ALTER PIPE x REFRESH |
For data governance, see snowflake-data-handling.
npx claudepluginhub jeremylongshore/claude-code-plugins-plus-skills --plugin snowflake-packCollects diagnostic data from Snowflake QUERY_HISTORY, ACCOUNT_USAGE, sessions, and warehouse metrics for troubleshooting and support tickets.
Assists with Snowflake SQL best practices, data pipelines (Dynamic Tables, Streams, Tasks, Snowpipe), Cortex AI, Snowpark Python, dbt, performance tuning, and security hardening.
Runs Databricks incident triage script via CLI (status, workspace, failed runs, error clusters), decision tree for pipeline failures, mitigation actions, and postmortem templates. For outages and job issues.