databricks-cost-tuning | databricks-pack

Stats

Actions

Tags

databricks-cost-tuning | databricks-pack

Databricks Cost Tuning

Overview

Reduce Databricks spending by optimizing cluster configurations, leveraging spot instances, right-sizing SQL warehouses, and implementing cost governance policies. Databricks charges per DBU (Databricks Unit) with rates varying by workload type: Jobs Compute (~~$0.15/DBU), All-Purpose Compute (~~$0.40/DBU), SQL Compute (~~$0.22/DBU), and Serverless (~~$0.07/DBU).

Prerequisites

Databricks Premium or Enterprise workspace
Access to system.billing.usage tables
Workspace admin for cluster policy creation

Instructions

Step 1: Identify Top Cost Drivers

-- Top 10 most expensive clusters this month
SELECT cluster_id, cluster_name, sku_name,
       SUM(usage_quantity) AS total_dbus,
       ROUND(SUM(usage_quantity * list_price), 2) AS cost_usd
FROM system.billing.usage
WHERE usage_date >= date_trunc('month', current_date())
GROUP BY cluster_id, cluster_name, sku_name
ORDER BY cost_usd DESC
LIMIT 10;

Step 2: Enforce Cluster Policies

{
  "name": "cost-optimized-interactive",
  "definition": {
    "autotermination_minutes": { "type": "range", "minValue": 10, "maxValue": 60, "defaultValue": 20 },
    "num_workers": { "type": "range", "minValue": 1, "maxValue": 8 },
    "node_type_id": { "type": "allowlist", "values": ["m5.xlarge", "m5.2xlarge"] },
    "aws_attributes.availability": { "type": "fixed", "value": "SPOT_WITH_FALLBACK" },
    "spark_conf.spark.databricks.cluster.profile": { "type": "fixed", "value": "singleNode" }
  }
}

Step 3: Use Spot Instances for Jobs

# jobs/cost_optimized_job.py
job_config = {
    "name": "nightly-etl",
    "new_cluster": {
        "num_workers": 4,
        "node_type_id": "m5.2xlarge",
        "aws_attributes": {
            "availability": "SPOT_WITH_FALLBACK",
            "first_on_demand": 1,  # Driver on-demand, workers spot
            "spot_bid_price_percent": 100,
        },
        "autotermination_minutes": 0,  # Jobs clusters auto-terminate on completion
    },
}
# Spot instances save 60-90% on worker node costs

Step 4: Right-Size SQL Warehouses

-- Check warehouse utilization (queries/hour and queue time)
SELECT warehouse_id, warehouse_name,
       COUNT(*) AS query_count,
       AVG(total_duration_ms) AS avg_duration_ms,
       MAX(queue_duration_ms) AS max_queue_ms
FROM system.query.history
WHERE start_time > current_timestamp() - INTERVAL 7 DAYS
GROUP BY warehouse_id, warehouse_name;
-- If max_queue_ms is near 0, warehouse is oversized. Reduce cluster size.
-- If max_queue_ms > 30000, warehouse needs more capacity or auto-scaling.  # 30000: 30 seconds in ms

Step 5: Schedule Auto-Stop for Development Clusters

# Set aggressive auto-termination for all dev clusters
databricks clusters list --output JSON | \
  jq -r '.[] | select(.cluster_name | test("dev|sandbox|test")) | .cluster_id' | \
  xargs -I{} databricks clusters edit {} --json '{"autotermination_minutes": 15}'

Error Handling

Issue	Cause	Solution
Spot instance interruption	Cloud provider reclaiming capacity	Use SPOT_WITH_FALLBACK, checkpoint long jobs
Cluster policy too restrictive	Workers can't handle workload	Increase max_workers or allow larger instance types
SQL warehouse idle but running	No auto-stop configured	Enable auto-stop with 10-minute timeout
Billing data not available	System tables not enabled	Enable system table access in account settings

Examples

Basic usage: Apply databricks cost tuning to a standard project setup with default configuration options.

Advanced scenario: Customize databricks cost tuning for production environments with multiple constraints and team-specific requirements.

Output

Configuration files or code changes applied to the project
Validation report confirming correct implementation
Summary of changes made and their rationale

Resources

Official monitoring documentation
Community best practices and patterns
Related skills in this plugin pack