Generate realistic synthetic data using Spark + Faker (strongly recommended). Supports serverless execution, multiple output formats (Parquet/JSON/CSV/Delta), and scales from thousands to millions of rows. For small datasets (<10K rows), can optionally generate locally and upload to volumes. Use when user mentions 'synthetic data', 'test data', 'generate data', 'demo dataset', 'Faker', or 'sample data'.
Generates realistic synthetic data using Spark and Faker for Databricks, supporting serverless execution and multiple output formats.
/plugin marketplace add https://www.claudepluginhub.com/api/plugins/databricks-solutions-databricks-ai-dev-kit/marketplace.json
/plugin install databricks-solutions-databricks-ai-dev-kit@cpd-databricks-solutions-databricks-ai-dev-kit
This skill inherits all available tools. When active, it can use any tool Claude has access to.
references/1-setup-and-execution.md
references/2-generation-approaches.md
references/3-data-patterns.md
references/4-domain-guidance.md
references/5-output-formats.md
references/6-troubleshooting.md
scripts/generate_synthetic_data.py
Catalog and schema are always user-supplied — never default to any value. If the user hasn't provided them, ask. For any UC write, always create the schema if it doesn't exist before writing data.
Generate realistic, story-driven synthetic data for Databricks using Spark + Faker + Pandas UDFs (strongly recommended).
| Topic | Guide | When to Use |
|---|---|---|
| Setup & Execution | references/1-setup-and-execution.md | Setting up environment, choosing compute, installing dependencies |
| Generation Approaches | references/2-generation-approaches.md | Choosing Spark UDFs vs Polars local, writing generation code |
| Data Patterns | references/3-data-patterns.md | Creating realistic distributions, referential integrity, time patterns |
| Domain Guidance | references/4-domain-guidance.md | E-commerce, IoT, financial, support/CRM domain patterns |
| Output Formats | references/5-output-formats.md | Choosing output format, saving to volumes/tables |
| Troubleshooting | references/6-troubleshooting.md | Fixing errors, debugging issues |
| Example Script | scripts/generate_synthetic_data.py | Complete Spark + Pandas UDF example |
Prefer uv for all Python operations. Fall back to pip only if uv is not available.
# Preferred
uv pip install "databricks-connect>=16.4,<17.4" faker numpy pandas holidays
uv run python generate_data.py
# Fallback if uv not available
pip install "databricks-connect>=16.4,<17.4" faker numpy pandas holidays
python generate_data.py
Never use .cache() or .persist() with serverless compute: these operations are NOT supported and will fail with AnalysisException: PERSIST TABLE is not supported on serverless compute. Instead, write master tables to Delta first, then read them back for FK joins.
Before generating any code, you MUST present a plan for user approval.
You MUST explicitly ask the user which catalog to use. Do not assume or proceed without confirmation.
Example prompt to user:
"Which Unity Catalog should I use for this data?"
When presenting your plan, always show the selected catalog prominently:
📍 Output Location: catalog_name.schema_name
Volume: /Volumes/catalog_name/schema_name/raw_data/
This makes it easy for the user to spot and correct if needed.
Ask the user about:
Show a clear specification with YOUR ASSUMPTIONS surfaced. Always start with the output location:
📍 Output Location: {user_catalog}.ecommerce_demo
Volume: /Volumes/{user_catalog}/ecommerce_demo/raw_data/
| Table | Columns | Rows | Key Assumptions |
|---|---|---|---|
| customers | customer_id, name, email, tier, region | 5,000 | Tier: Free 60%, Pro 30%, Enterprise 10% |
| orders | order_id, customer_id (FK), amount, status | 15,000 | Enterprise customers generate 5x more orders |
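The "Enterprise customers generate 5x more orders" assumption maps to weighted sampling of the customer FK. A local pandas/NumPy sketch (the toy row counts, weights, and seed are illustrative, not prescribed by the plan):

```python
import numpy as np
import pandas as pd

# Toy customer master; in the real script this comes from the Delta table.
customers = pd.DataFrame({
    "customer_id": [f"CUST-{i:05d}" for i in range(100)],
    "tier": ["Free"] * 60 + ["Pro"] * 30 + ["Enterprise"] * 10,
})

# Enterprise customers are 5x as likely to appear on any given order.
weights = customers["tier"].map({"Free": 1.0, "Pro": 1.0, "Enterprise": 5.0})
rng = np.random.default_rng(0)
fk = rng.choice(customers["customer_id"], size=300, p=weights / weights.sum())

# Every order's customer_id is drawn from the master, so FK integrity holds.
orders = pd.DataFrame({"order_id": range(300), "customer_id": fk})
```

The same idea scales on Spark: write the customer master to Delta, read it back, and join sampled IDs rather than regenerating them.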
Assumptions I'm making:
Ask user: "Does this look correct? Any adjustments to the catalog, tables, or distributions?"
Do NOT proceed to code generation until user approves the plan, including the catalog.
from databricks.connect import DatabricksSession, DatabricksEnv
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, DoubleType
import pandas as pd
import numpy as np
# Setup with managed dependencies (databricks-connect 16.4+)
env = DatabricksEnv().withDependencies("faker", "pandas", "numpy")
spark = DatabricksSession.builder.withEnvironment(env).serverless(True).getOrCreate()
# Define Pandas UDFs
@F.pandas_udf(StringType())
def fake_name(ids: pd.Series) -> pd.Series:
from faker import Faker
fake = Faker()
return pd.Series([fake.name() for _ in range(len(ids))])
@F.pandas_udf(DoubleType())
def generate_amount(tiers: pd.Series) -> pd.Series:
amounts = []
for tier in tiers:
if tier == "Enterprise":
amounts.append(float(np.random.lognormal(7.5, 0.8)))
elif tier == "Pro":
amounts.append(float(np.random.lognormal(5.5, 0.7)))
else:
amounts.append(float(np.random.lognormal(4.0, 0.6)))
return pd.Series(amounts)
# Generate customers
customers_df = (
    spark.range(0, 10000, numPartitions=16)
    # Single rand draw per row; chained F.rand() calls would each draw a
    # fresh value and skew the intended 60/30/10 tier split
    .withColumn("r", F.rand())
    .select(
        F.concat(F.lit("CUST-"), F.lpad(F.col("id").cast("string"), 5, "0")).alias("customer_id"),
        fake_name(F.col("id")).alias("name"),
        F.when(F.col("r") < 0.6, "Free")
        .when(F.col("r") < 0.9, "Pro")
        .otherwise("Enterprise").alias("tier"),
    )
    .withColumn("arr", generate_amount(F.col("tier")))
)
# Save to Unity Catalog
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}")
spark.sql(f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.raw_data")
customers_df.write.mode("overwrite").parquet(f"/Volumes/{CATALOG}/{SCHEMA}/raw_data/customers")
.withColumn("r", F.rand())  # one draw per row, reused across conditions
F.when(F.col("r") < 0.6, "Free")
.when(F.col("r") < 0.9, "Pro")
.otherwise("Enterprise").alias("tier")
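Why the single draw matters: chaining separate F.rand() calls gives each condition an independent uniform value, skewing the split, while one value checked against ordered thresholds reproduces it exactly. A quick local NumPy check (seed and sample size illustrative):

```python
import numpy as np

# One uniform draw per row; ordered thresholds yield the 60/30/10 split.
rng = np.random.default_rng(42)
r = rng.random(100_000)
tier = np.select([r < 0.6, r < 0.9], ["Free", "Pro"], default="Enterprise")

shares = {t: (tier == t).mean() for t in ("Free", "Pro", "Enterprise")}
```

With independent draws per condition, "Pro" would instead land near 0.4 * 0.9 = 36% and "Enterprise" near 4%.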
@F.pandas_udf(DoubleType())
def generate_amount(tiers: pd.Series) -> pd.Series:
return pd.Series([
float(np.random.lognormal({"Enterprise": 7.5, "Pro": 5.5, "Free": 4.0}[t], 0.7))
for t in tiers
])
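The dict lookup above can be vectorized further with pandas .map and NumPy's array-valued mean parameter, avoiding the per-row Python loop entirely. A sketch of the UDF body as a plain function (the helper name and seeded rng are illustrative):

```python
import numpy as np
import pandas as pd

def generate_amount_batch(tiers: pd.Series, rng=None) -> pd.Series:
    # Map each tier to its lognormal mean, then draw the whole batch at once
    if rng is None:
        rng = np.random.default_rng()
    mus = tiers.map({"Enterprise": 7.5, "Pro": 5.5, "Free": 4.0})
    return pd.Series(rng.lognormal(mus, 0.7), index=tiers.index)
```

The same body works inside an @F.pandas_udf(DoubleType()) wrapper, since the UDF receives and returns pd.Series per batch.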
from datetime import datetime, timedelta
END_DATE = datetime.now()
START_DATE = END_DATE - timedelta(days=180)
F.date_add(F.lit(START_DATE.date()), (F.rand() * 180).cast("int")).alias("order_date")
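For small datasets generated locally before upload, the same 180-day window can be produced with NumPy offsets instead of F.date_add. A sketch (seed and row count illustrative):

```python
from datetime import datetime, timedelta

import numpy as np
import pandas as pd

END_DATE = datetime.now()
START_DATE = END_DATE - timedelta(days=180)

rng = np.random.default_rng(7)
# Uniform day offsets across the window; swap in a weighted draw
# (e.g. weekend or holiday boosts) for more realistic traffic shapes.
offsets = rng.integers(0, 180, size=1_000)
order_dates = pd.to_datetime(START_DATE.date()) + pd.to_timedelta(offsets, unit="D")
```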
# Always include in the script: the catalog must already exist; create schema and volume if missing
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}")
spark.sql(f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.raw_data")
| Mode | Best For | Setup |
|---|---|---|
| DB Connect 16.4+ Serverless | Local dev, Python 3.12+ | DatabricksEnv().withDependencies(...) |
| Serverless Job | Production, scheduled | Job with environments parameter |
| Classic Cluster | Fallback only | Use Databricks CLI to install libraries. databricks libraries install --json '{"cluster_id": "<cluster_id>", "libraries": [{"pypi": {"package": "faker"}}, {"pypi": {"package": "holidays"}}]}' |
See references/1-setup-and-execution.md for detailed setup instructions.
| Format | Use Case | Code |
|---|---|---|
| Parquet (default) | SDP pipeline input | df.write.parquet(path) |
| JSON | Log-style ingestion | df.write.json(path) |
| CSV | Legacy systems | df.write.option("header", "true").csv(path) |
| Delta Table | Direct analytics | df.write.saveAsTable("catalog.schema.table") |
See references/5-output-formats.md for detailed options.
| Issue | Solution |
|---|---|
| ModuleNotFoundError: faker | See references/1-setup-and-execution.md |
| Faker UDF is slow | Use pandas_udf for batch processing |
| Out of memory | Increase numPartitions in spark.range() |
| Referential integrity errors | Write master table to Delta first, read back for FK joins |
| PERSIST TABLE is not supported on serverless | NEVER use .cache() or .persist() with serverless - write to Delta table first, then read back |
| F.window vs Window confusion | Use from pyspark.sql.window import Window for row_number(), rank(), etc. F.window is for streaming only. |
See references/6-troubleshooting.md for full troubleshooting guide.