Execute Databricks incident response procedures with triage, mitigation, and postmortem. Use when responding to Databricks-related outages, investigating job failures, or running post-incident reviews for pipeline failures. Trigger with phrases like "databricks incident", "databricks outage", "databricks down", "databricks on-call", "databricks emergency", "job failed".
From databricks-packnpx claudepluginhub nickloveinvesting/nick-love-plugins --plugin databricks-packThis skill is limited to using the following tools:
Guides Payload CMS config (payload.config.ts), collections, fields, hooks, access control, APIs. Debugs validation errors, security, relationships, queries, transactions, hook behavior.
Designs, audits, and improves analytics tracking systems using Signal Quality Index for reliable, decision-ready data in marketing, product, and growth.
Enforces A/B test setup with gates for hypothesis locking, metrics definition, sample size calculation, assumptions checks, and execution readiness before implementation.
Rapid incident response procedures for Databricks-related outages.
| Level | Definition | Response Time | Examples |
|---|---|---|---|
| P1 | Production data pipeline down | < 15 min | Critical ETL failed, data not updating |
| P2 | Degraded performance | < 1 hour | Slow queries, partial failures |
| P3 | Non-critical issues | < 4 hours | Dev cluster issues, delayed non-critical jobs |
| P4 | No user impact | Next business day | Monitoring gaps, documentation |
#!/bin/bash
set -euo pipefail
# quick-triage.sh - Run this first during any incident
echo "=== Databricks Quick Triage ==="
echo "Time: $(date)"
echo ""
# 1. Check Databricks status
echo "--- Databricks Status ---"
curl -s https://status.databricks.com/api/v2/status.json | jq '.status.description'
echo ""
# 2. Check workspace connectivity
echo "--- Workspace Connectivity ---"
databricks workspace list / --output json | jq -r '.[] | .path' | head -5
if [ $? -eq 0 ]; then
echo "Workspace: CONNECTED"
else
echo "Workspace: CONNECTION FAILED"
fi
echo ""
# 3. Check recent job failures
echo "--- Recent Job Failures (last 1 hour) ---"
databricks runs list --limit 20 --output json | \
jq -r '.runs[] | select(.state.result_state == "FAILED") | "\(.run_id): \(.run_name) - \(.state.state_message)"'
echo ""
# 4. Check cluster status
echo "--- Running Clusters ---"
databricks clusters list --output json | \
jq -r '.clusters[] | select(.state == "RUNNING" or .state == "ERROR") | "\(.cluster_id): \(.cluster_name) [\(.state)]"'
echo ""
# 5. Check for errors in last hour
echo "--- Recent Errors ---"
# Query system tables via SQL warehouse or notebook
Job/Pipeline failing?
├─ YES: Is it a single job or multiple?
│ ├─ SINGLE JOB → Check job-specific issues
│ │ ├─ Cluster failed to start → Check cluster events
│ │ ├─ Code error → Check task output/logs
│ │ ├─ Data issue → Check source data
│ │ └─ Permission error → Check grants
│ │
│ └─ MULTIPLE JOBS → Likely infrastructure issue
│ ├─ Check Databricks status page
│ ├─ Check workspace quotas
│ └─ Check network connectivity
│
└─ NO: Is it a performance issue?
├─ Slow queries → Check query plan, cluster sizing
├─ Slow cluster startup → Check instance availability
└─ Data freshness → Check upstream pipelines
# 1. Get run details
RUN_ID="your-run-id"
databricks runs get --run-id $RUN_ID
# 2. Get detailed error output
databricks runs get-output --run-id $RUN_ID | jq '.error'
# 3. Check task-level errors
databricks runs get --run-id $RUN_ID | jq '.tasks[] | select(.state.result_state == "FAILED") | {task: .task_key, error: .state.state_message}'
# 4. If notebook task, get notebook output
# (View in UI or use jobs API to get cell outputs)
# 1. Check cluster events
CLUSTER_ID="your-cluster-id"
databricks clusters events --cluster-id $CLUSTER_ID --limit 20
# 2. Common causes and fixes
# - QUOTA_EXCEEDED: Terminate unused clusters
# - CLOUD_PROVIDER_LAUNCH_ERROR: Check instance availability
# - DRIVER_UNREACHABLE: Network/firewall issue
# 3. Quick fix - restart cluster
databricks clusters restart --cluster-id $CLUSTER_ID
# 4. Check cluster logs
databricks clusters get --cluster-id $CLUSTER_ID | jq '.termination_reason'
# 1. Check current user
databricks current-user me
# 2. Check job permissions
databricks permissions get jobs --job-id $JOB_ID
# 3. Check table permissions (run in notebook)
# SHOW GRANTS ON TABLE catalog.schema.table
# 4. Fix: Grant necessary permissions
databricks permissions update jobs --job-id $JOB_ID --json '{
"access_control_list": [{
"user_name": "user@company.com",
"permission_level": "CAN_MANAGE_RUN"
}]
}'
-- Quick data quality check
SELECT
COUNT(*) as total_rows,
COUNT(DISTINCT id) as unique_ids,
SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) as null_amounts,
MIN(created_at) as oldest_record,
MAX(created_at) as newest_record
FROM catalog.schema.table
WHERE created_at > current_timestamp() - INTERVAL 1 DAY;
-- Check for recent changes
DESCRIBE HISTORY catalog.schema.table LIMIT 10;
-- Restore to previous version if needed
RESTORE TABLE catalog.schema.table TO VERSION AS OF 5;
:red_circle: **P1 INCIDENT: [Brief Description]**
**Status:** INVESTIGATING
**Impact:** [Describe user/business impact]
**Started:** [Time]
**Current Action:** [What you're doing now]
**Next Update:** [Time]
**Incident Commander:** @[name]
**Thread:** [link]
**Data Pipeline Delay**
We are experiencing delays in data processing. Some reports may show stale data.
**Impact:** Dashboard data may be up to [X] hours delayed
**Started:** [Time] UTC
**Current Status:** Our team is actively investigating
We will provide updates every 30 minutes.
Last updated: [Timestamp]
#!/bin/bash
# collect-incident-evidence.sh
INCIDENT_ID=$1
RUN_ID=$2
CLUSTER_ID=$3
mkdir -p "incident-$INCIDENT_ID"
# Job run details
databricks runs get --run-id $RUN_ID > "incident-$INCIDENT_ID/run_details.json"
databricks runs get-output --run-id $RUN_ID > "incident-$INCIDENT_ID/run_output.json"
# Cluster info
if [ -n "$CLUSTER_ID" ]; then
databricks clusters get --cluster-id $CLUSTER_ID > "incident-$INCIDENT_ID/cluster_info.json"
databricks clusters events --cluster-id $CLUSTER_ID --limit 50 > "incident-$INCIDENT_ID/cluster_events.json"
fi
# Create summary
cat << EOF > "incident-$INCIDENT_ID/summary.md"
# Incident $INCIDENT_ID
**Date:** $(date)
**Run ID:** $RUN_ID
**Cluster ID:** $CLUSTER_ID
## Evidence Collected
- run_details.json
- run_output.json
- cluster_info.json
- cluster_events.json
EOF
tar -czf "incident-$INCIDENT_ID.tar.gz" "incident-$INCIDENT_ID"
echo "Evidence collected: incident-$INCIDENT_ID.tar.gz"
## Incident: [Title]
**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** P[1-4]
**Incident Commander:** [Name]
### Summary
[1-2 sentence description of what happened]
### Timeline (UTC)
| Time | Event |
|------|-------|
| HH:MM | [First alert/detection] |
| HH:MM | [Investigation started] |
| HH:MM | [Root cause identified] |
| HH:MM | [Mitigation applied] |
| HH:MM | [Incident resolved] |
### Root Cause
[Technical explanation of what went wrong]
### Impact
- **Data Impact:** [Tables affected, rows impacted]
- **Users Affected:** [Number, types]
- **Duration:** [How long data was unavailable/stale]
- **Financial Impact:** [If applicable]
### Detection
- **How detected:** [Alert, user report, monitoring]
- **Time to detect:** [Minutes from issue start]
- **Detection gap:** [What could have caught this sooner]
### Response
- **Time to respond:** [Minutes from detection]
- **What worked:** [Effective response actions]
- **What didn't:** [Ineffective actions, dead ends]
### Action Items
| Priority | Action | Owner | Due Date |
|----------|--------|-------|----------|
| P1 | [Preventive measure] | [Name] | [Date] |
| P2 | [Monitoring improvement] | [Name] | [Date] |
| P3 | [Documentation update] | [Name] | [Date] |
### Lessons Learned
1. [Key learning 1]
2. [Key learning 2]
3. [Key learning 3]
Run the triage script to identify the issue source.
Determine if the issue is Databricks-side or code/data issue.
Apply the appropriate remediation for the error type.
Update internal and external stakeholders.
Document everything for postmortem.
| Issue | Cause | Solution |
|---|---|---|
| Can't access workspace | Token expired | Re-authenticate |
| CLI commands fail | Network issue | Check VPN |
| Logs unavailable | Cluster terminated | Check cluster events |
| Restore fails | Retention exceeded | Check vacuum settings |
databricks runs list --job-id $JOB_ID --limit 5 --output json | \
jq -r '.runs[] | "\(.start_time): \(.state.result_state)"'
databricks clusters restart --cluster-id $CLUSTER_ID && \
echo "Cluster restart initiated"
For data handling, see databricks-data-handling.