Generates story-driven synthetic data for Databricks using Spark + Faker + Pandas UDFs. Scales serverlessly to millions of rows in Parquet/JSON/CSV/Delta for test/demo datasets.
Install:

```shell
npx claudepluginhub databricks-solutions/ai-dev-kit --plugin databricks-ai-dev-kit
```

This skill uses the workspace's default tool permissions.
> Catalog and schema are **always user-supplied** — never default to any value. If the user hasn't provided them, ask. For any UC write, **always create the schema if it doesn't exist** before writing data.
Generate realistic, story-driven synthetic data for Databricks using Spark + Faker + Pandas UDFs (strongly recommended).
Synthetic data should demonstrate how Databricks helps solve real business problems.
The pattern: Something goes wrong → business impact ($) → analyze root cause → identify affected customers → fix and prevent.
Key principles:
Why no flat distributions: Uniform data has no story — no spikes, no anomalies, no cohorts, no 80/20 concentration, nothing to investigate. It can't show Databricks' value for root cause analysis.
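For example, a flat series versus one with an injected incident — a toy NumPy/pandas sketch (dates, the 200/day baseline, and the 600-ticket spike are all illustrative, not from any real dataset):

```python
import numpy as np
import pandas as pd

# Daily ticket counts with a mid-month outage spike — the kind of anomaly
# a demo can actually investigate (vs. a flat uniform series).
rng = np.random.default_rng(seed=1)
days = pd.date_range("2024-06-01", periods=30, freq="D")
base = rng.poisson(lam=200, size=30)                           # normal daily load
spike = np.where((days.day >= 14) & (days.day <= 16), 600, 0)  # outage window
tickets = pd.DataFrame({"date": days, "ticket_count": base + spike})

peak_day = int(tickets.loc[tickets["ticket_count"].idxmax(), "date"].day)
print(peak_day)  # falls inside the 14–16 outage window
```

Root cause analysis now has something to find: the spike days dominate the series instead of blending into noise.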
| When | Guide |
|---|---|
| User mentions ML model training or complex time patterns | references/1-data-patterns.md — ML-ready data, time multipliers, row coherence |
| Errors during generation | references/2-troubleshooting.md — Fixing common issues |
- `.cache()` or `.persist()` — Not supported on serverless. Write to Delta, read back for joins.
- `.collect()` — Use Spark parallelism. No driver-side iteration; avoid Pandas↔Spark conversions.

Before generating any code, you MUST present a plan for user approval.
You MUST explicitly ask the user which catalog to use. Do not assume or proceed without confirmation.
Example prompt to user:
"Which Unity Catalog should I use for this data?"
When presenting your plan, always show the selected catalog prominently:
📍 Output Location: catalog_name.schema_name
Volume: /Volumes/catalog_name/schema_name/raw_data/
This makes it easy for the user to spot and correct if needed.
Ask the user about:
If user doesn't specify a story: Propose one. Don't generate bland data — suggest an incident, anomaly, or trend that shows Databricks value (e.g., "I'll include a system outage that causes ticket spike and churn — this lets you demo root cause analysis").
Show a clear specification with the business story and your assumptions surfaced:
📍 Output Location: {user_catalog}.support_demo
Volume: /Volumes/{user_catalog}/support_demo/raw_data/
📖 Story: A payment system outage causes support ticket spike. Resolution times
degrade, enterprise customers churn, revenue drops $2.3M. With Databricks we
identify the root cause, affected customers, and prevent future impact.
| Table | Description | Rows | Key Assumptions |
|---|---|---|---|
| customers | Customer profiles with tier, MRR | 10,000 | Enterprise 10% but 60% of revenue |
| tickets | Support tickets with priority, resolution_time | 80,000 | Spike during outage, SLA breaches |
| incidents | System events (outages, deployments) | 50 | Payment outage mid-month |
| churn_events | Customer cancellations with reason | 500 | Spike after poor support experience |
Business metrics:
- `customers.mrr` — Revenue at risk ($)
- `tickets.resolution_hours` — SLA performance
- `churn_events.lost_mrr` — Churn impact ($)

The story this data tells:
Ask user: "Does this story work? Any adjustments?"
Do NOT proceed to code generation until user approves the plan, including the catalog.
After generating data, use get_volume_folder_details to validate the output matches requirements:
```python
from databricks.connect import DatabricksSession, DatabricksEnv
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
import pandas as pd

CATALOG = "<user_catalog>"  # always user-supplied — never default these
SCHEMA = "<user_schema>"

# Setup serverless with dependencies (MUST list all libs used in UDFs)
env = DatabricksEnv().withDependencies("faker", "holidays")
spark = DatabricksSession.builder.withEnvironment(env).serverless(True).getOrCreate()

# Pandas UDF pattern - import lib INSIDE the function
@F.pandas_udf(StringType())
def fake_name(ids: pd.Series) -> pd.Series:
    from faker import Faker  # Import inside UDF
    fake = Faker()
    return pd.Series([fake.name() for _ in range(len(ids))])

# Generate with spark.range, apply UDFs
customers_df = spark.range(0, 10000, numPartitions=16).select(
    F.concat(F.lit("CUST-"), F.lpad(F.col("id").cast("string"), 5, "0")).alias("customer_id"),
    fake_name(F.col("id")).alias("name"),
)

# Write to Volume as Parquet (default for raw data)
# Path is a folder with table name: /Volumes/catalog/schema/raw_data/customers/
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}")
spark.sql(f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.raw_data")
customers_df.write.mode("overwrite").parquet(f"/Volumes/{CATALOG}/{SCHEMA}/raw_data/customers")
```
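The `concat` + `lpad` ID format above is equivalent to this plain-Python sketch, useful for checking downstream joins or parsing logic without a Spark session:

```python
# Same zero-padded ID the Spark expression produces
def customer_id(i: int) -> str:
    return f"CUST-{i:05d}"

print(customer_id(42))    # CUST-00042
print(customer_id(9999))  # CUST-09999
```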
Partitions by scale: spark.range(N, numPartitions=P)
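A simple way to pick `P` — the ~1M-rows-per-partition target and the floor of 16 are rules of thumb (assumptions), not Databricks requirements:

```python
def partitions_for(n_rows: int, rows_per_partition: int = 1_000_000, floor: int = 16) -> int:
    """Heuristic: ~1M rows per partition, with a floor so small runs still parallelize."""
    return max(floor, -(-n_rows // rows_per_partition))  # ceiling division

print(partitions_for(10_000))      # 16 (floor applies)
print(partitions_for(50_000_000))  # 50
# usage: spark.range(0, N, numPartitions=partitions_for(N))
```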
Output formats:
- `df.write.parquet("/Volumes/.../raw_data/table")` — raw data for pipelines
- `df.write.saveAsTable("catalog.schema.table")` — if user wants queryable tables

Generated scripts must be highly performant. Never do these:
| Anti-Pattern | Why It's Slow | Do This Instead |
|---|---|---|
| Python loops on driver | Single-threaded, no parallelism | Use spark.range() + Spark operations |
| `.collect()` then iterate | Brings all data to driver memory | Keep data in Spark, use DataFrame ops |
| Pandas → Spark → Pandas | Serialization overhead, defeats distribution | Stay in Spark, use pandas_udf only for UDFs |
| Read/write temp files | Unnecessary I/O | Chain DataFrame transformations |
| Scalar UDFs | Row-by-row processing | Use pandas_udf for batch processing |
Good pattern: spark.range() → Spark transforms → pandas_udf for Faker → write directly
```python
# Draw rand() ONCE and reuse it. Chaining fresh F.rand() calls in each .when()
# uses independent draws and skews the split (~60/36/4 instead of 60/30/10).
df = df.withColumn("r", F.rand(seed=42))
tier = F.when(F.col("r") < 0.6, "Free").when(F.col("r") < 0.9, "Pro").otherwise("Enterprise")
```
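With one random draw per row, thresholds 0.6 and 0.9 yield a 60/30/10 split — a NumPy-only sketch to sanity-check the scheme (sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
r = rng.random(100_000)  # ONE draw per row, reused for every threshold
tier = np.select([r < 0.6, r < 0.9], ["Free", "Pro"], default="Enterprise")
shares = {t: float((tier == t).mean()) for t in ("Free", "Pro", "Enterprise")}
print(shares)  # ≈ {'Free': 0.60, 'Pro': 0.30, 'Enterprise': 0.10}
```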
Use np.random.lognormal(mean, sigma) — always positive, long tail:
- `lognormal(7.5, 0.8)` → ~$1800 median
- `lognormal(5.5, 0.7)` → ~$245 median
- `lognormal(4.0, 0.6)` → ~$55 median

```python
from datetime import datetime, timedelta

END_DATE = datetime.now()
START_DATE = END_DATE - timedelta(days=180)

spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}")
spark.sql(f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.raw_data")
```
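A quick check of the lognormal parameter choices: the median of `lognormal(mean, sigma)` is `e^mean`, so the quoted dollar medians follow directly (NumPy-only sketch; sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=7)
medians = []
for mean, sigma in [(7.5, 0.8), (5.5, 0.7), (4.0, 0.6)]:
    sample = rng.lognormal(mean=mean, sigma=sigma, size=200_000)
    medians.append(float(np.median(sample)))

print([round(m) for m in medians])  # ≈ [1808, 245, 55], i.e. e^7.5, e^5.5, e^4.0
```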
Write master table to Delta first, then read back for FK joins (no .cache() on serverless):
```python
N_CUSTOMERS, N_ORDERS = 10_000, 80_000  # illustrative row counts

# 1. Write master table
customers_df.write.mode("overwrite").saveAsTable(f"{CATALOG}.{SCHEMA}.customers")

# 2. Read back for FK lookup
customer_lookup = spark.table(f"{CATALOG}.{SCHEMA}.customers").select("customer_idx", "customer_id")

# 3. Generate child table with valid FKs via join
orders_df = spark.range(N_ORDERS).select(
    (F.abs(F.hash(F.col("id"))) % N_CUSTOMERS).alias("customer_idx")
)
orders_with_fk = orders_df.join(customer_lookup, on="customer_idx")
```
Requires Python 3.12 and databricks-connect>=16.4. Use uv:
```shell
uv pip install "databricks-connect>=16.4,<17.4" faker numpy pandas holidays
```
| Issue | Solution |
|---|---|
| `ImportError: cannot import name 'DatabricksEnv'` | Upgrade: `uv pip install "databricks-connect>=16.4"` |
| Python 3.11 instead of 3.12 | Python 3.12 required. Use uv to create an env with the correct version |
| `ModuleNotFoundError: faker` | Add to `withDependencies()`, import inside UDF |
| Faker UDF is slow | Use `pandas_udf` for batch processing |
| Out of memory | Increase `numPartitions` in `spark.range()` |
| Referential integrity errors | Write master table to Delta first, read back for FK joins |
| `PERSIST TABLE is not supported` on serverless | NEVER use `.cache()` or `.persist()` with serverless - write to a Delta table first, then read back |
| `F.window` vs `Window` confusion | Use `from pyspark.sql.window import Window` for `row_number()`, `rank()`, etc. `F.window` buckets rows into time intervals; it is not the analytic window spec. |
| Broadcast variables not supported | NEVER use `spark.sparkContext.broadcast()` with serverless |
See references/2-troubleshooting.md for full troubleshooting guide.