Skill

databricks-common-errors

Diagnoses and fixes common Databricks errors like cluster not ready, Spark OOM, Delta concurrent writes, using CLI and SDK commands.

Python

Bash

data-engineering

npx claudepluginhub jeremylongshore/claude-code-plugins-plus-skills --plugin databricks-pack

Tool Access

This skill is limited to using the following tools:

ReadGrepBash(databricks:*)

Preview

Quick-reference diagnostic guide for the most frequent Databricks errors. Covers cluster failures, Spark OOM, Delta Lake conflicts, permissions, schema mismatches, rate limits, and job run failures with real SDK/SQL solutions.

SKILL.md

Similar Skills

github-deep-research

63.9k

Conducts multi-round deep research on GitHub repos via API and web searches, generating markdown reports with executive summaries, timelines, metrics, and Mermaid diagrams.

2 files

bytedance-deer-flow-1

surprise-me

63.9k

Dynamically discovers and combines enabled skills into cohesive, unexpected delightful experiences like interactive HTML or themed artifacts. Activates on 'surprise me', inspiration, or boredom cues.

bytedance-deer-flow-1

image-generation

63.9k

Generates images from structured JSON prompts via Python script execution. Supports reference images and aspect ratios for characters, scenes, products, visuals.

2 files

bytedance-deer-flow-1

Stats

Parent Repo Stars1854

Parent Repo Forks248

Last CommitApr 3, 2026

Actions

View Source View Plugin View on GitHub View README

Databricks Common Errors

Overview

Prerequisites

Databricks CLI configured
Access to cluster/job logs
databricks-sdk installed for programmatic debugging

Instructions

Step 1: Identify the Error Source

# Get failed run details
databricks runs get --run-id $RUN_ID --output json | jq '{
  state: .state.result_state,
  message: .state.state_message,
  tasks: [.tasks[] | {key: .task_key, state: .state.result_state, error: .state.state_message}]
}'

Step 2: Match and Fix

CLUSTER_NOT_READY / INVALID_STATE

ClusterNotReadyException: Cluster 0123-456789-abcde is not in a RUNNING state

Cause: Cluster is starting, terminating, or in error state.

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import State

w = WorkspaceClient()
cluster = w.clusters.get(cluster_id="0123-456789-abcde")

if cluster.state in (State.PENDING, State.RESTARTING):
    w.clusters.ensure_cluster_is_running("0123-456789-abcde")
elif cluster.state == State.TERMINATED:
    w.clusters.start_and_wait(cluster_id="0123-456789-abcde")
elif cluster.state == State.ERROR:
    reason = cluster.termination_reason
    print(f"Cluster error: {reason.code} — {reason.parameters}")
    # Common: CLOUD_PROVIDER_LAUNCH_FAILURE, INSTANCE_POOL_CLUSTER_FAILURE

SPARK_DRIVER_OOM

java.lang.OutOfMemoryError: Java heap space
SparkException: Job aborted due to stage failure

Cause: Driver or executor running out of memory.

# Fix 1: Increase memory via cluster Spark config
spark_conf = {
    "spark.driver.memory": "8g",
    "spark.executor.memory": "8g",
    "spark.sql.shuffle.partitions": "400",  # reduce skew
}

# Fix 2: Never collect() large datasets
# BAD:  all_data = df.collect()
# GOOD: df.write.format("delta").saveAsTable("catalog.schema.results")

# Fix 3: Broadcast small tables instead of shuffling
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_lookup_df), "key")

DELTA_CONCURRENT_WRITE

ConcurrentAppendException: Files were added by a concurrent update
ConcurrentDeleteReadException: A concurrent operation modified files

Cause: Multiple jobs writing to the same Delta table simultaneously.

from delta.tables import DeltaTable
import time

def merge_with_retry(spark, source_df, target_table, merge_key, max_retries=3):
    """MERGE with retry for concurrent write conflicts."""
    for attempt in range(max_retries):
        try:
            target = DeltaTable.forName(spark, target_table)
            (target.alias("t")
                .merge(source_df.alias("s"), f"t.{merge_key} = s.{merge_key}")
                .whenMatchedUpdateAll()
                .whenNotMatchedInsertAll()
                .execute())
            return
        except Exception as e:
            if "Concurrent" in str(e) and attempt < max_retries - 1:
                time.sleep(2 ** attempt)
                continue
            raise

PERMISSION_DENIED

PERMISSION_DENIED: User does not have SELECT on TABLE catalog.schema.table
PermissionDeniedException: User does not have permission MANAGE on cluster

Cause: Missing Unity Catalog grants or workspace permissions.

-- Fix Unity Catalog permissions (requires GRANT privilege)
GRANT USAGE ON CATALOG analytics TO `data-team`;
GRANT USAGE ON SCHEMA analytics.silver TO `data-team`;
GRANT SELECT ON TABLE analytics.silver.orders TO `data-team`;

-- Check current grants
SHOW GRANTS ON TABLE analytics.silver.orders;

# Fix workspace object permissions
databricks permissions update jobs --job-id 123 --json '{
  "access_control_list": [{
    "user_name": "user@company.com",
    "permission_level": "CAN_MANAGE_RUN"
  }]
}'

INVALID_PARAMETER_VALUE

InvalidParameterValue: Instance type xyz not supported in region us-east-1
Invalid spark_version: 13.x.x-scala2.12

Cause: Wrong cluster config for the workspace region.

w = WorkspaceClient()

# List valid node types for this workspace
for nt in sorted(w.clusters.list_node_types().node_types, key=lambda x: x.memory_mb)[:10]:
    print(f"{nt.node_type_id}: {nt.memory_mb}MB, {nt.num_cores} cores")

# List valid Spark versions
for v in w.clusters.spark_versions().versions:
    if "LTS" in v.name:
        print(f"{v.key}: {v.name}")

SCHEMA_MISMATCH

AnalysisException: A schema mismatch detected when writing to the Delta table

Cause: Source schema doesn't match target table.

# Option 1: Enable schema evolution
df.write.format("delta").option("mergeSchema", "true").mode("append").saveAsTable("target")

# Option 2: Identify differences
source_cols = set(df.columns)
target_cols = set(spark.table("target").columns)
print(f"Missing in source: {target_cols - source_cols}")
print(f"Extra in source: {source_cols - target_cols}")

# Option 3: Cast to match target schema
target_schema = spark.table("target").schema
for field in target_schema:
    if field.name in df.columns:
        df = df.withColumn(field.name, col(field.name).cast(field.dataType))

JOB_RUN_FAILED

RunState: FAILED — Run terminated with error

w = WorkspaceClient()
run = w.jobs.get_run(run_id=12345)

print(f"State: {run.state.life_cycle_state}")
print(f"Result: {run.state.result_state}")
print(f"Message: {run.state.state_message}")

# Check each task
for task in run.tasks:
    if task.state.result_state and task.state.result_state.value == "FAILED":
        output = w.jobs.get_run_output(task.run_id)
        print(f"Task '{task.task_key}' failed: {output.error}")
        if output.error_trace:
            print(f"Traceback:\n{output.error_trace[:500]}")

HTTP 429 — RATE_LIMIT_EXCEEDED

See databricks-rate-limits skill for full retry patterns.

from databricks.sdk.errors import TooManyRequests
import time

def call_with_backoff(operation, max_retries=5):
    for attempt in range(max_retries):
        try:
            return operation()
        except TooManyRequests as e:
            wait = e.retry_after_secs or (2 ** attempt)
            print(f"Rate limited, waiting {wait}s...")
            time.sleep(wait)
    raise RuntimeError("Max retries exceeded")

Output

Error identified and categorized
Fix applied from matching error pattern
Resolution verified

Error Handling

Error Code	HTTP	Category	Quick Fix
`CLUSTER_NOT_READY`	-	Compute	`ensure_cluster_is_running()`
`OutOfMemoryError`	-	Spark	Increase memory, avoid `.collect()`
`ConcurrentAppendException`	-	Delta	MERGE with retry, serialize writes
`PERMISSION_DENIED`	403	Auth	`GRANT` in Unity Catalog
`INVALID_PARAMETER_VALUE`	400	Config	Check `list_node_types()`
`AnalysisException`	-	Schema	`mergeSchema=true`
`FAILED` run state	-	Job	Check `get_run_output()` for traceback
`Too Many Requests`	429	Rate Limit	Exponential backoff with `Retry-After`

Examples

Quick Diagnostic Commands

databricks clusters get --cluster-id $CID | jq '{state, termination_reason}'
databricks runs list --job-id $JID --limit 5 | jq '.runs[] | {run_id, state: .state.result_state}'
databricks permissions get jobs --job-id $JID

Escalation Path

Check Databricks Status
Collect evidence with databricks-debug-bundle
Search Community Forum
Contact support with workspace ID and request ID from error response

Resources

Next Steps

For comprehensive debugging, see databricks-debug-bundle.