Profiles Spark DataFrames or Unity Catalog tables and generates DQX data quality rule candidates with summary statistics. Supports sampling, filters, DLT expectations, and AI-assisted variants.
npx claudepluginhub databrickslabs/dqx --plugin dqx

This skill uses the workspace's default tool permissions.
Typical one-shot bootstrap for a new table:
from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.labs.dqx.profiler.generator import DQGenerator
from databricks.sdk import WorkspaceClient
ws = WorkspaceClient()
profiler = DQProfiler(ws)
generator = DQGenerator(ws)
df = spark.read.table("catalog.schema.input")
# Step 1 — profile. Returns summary stats + DQProfile candidates per column.
# Three entry points, pick by what you have on hand:
# - profiler.profile(df, ...) — in-memory DataFrame
# - profiler.profile_table(input_config=..., ...) — single Unity Catalog table by InputConfig
# - profiler.profile_tables_for_patterns( — many tables; returns
# patterns=["catalog.schema.*"], ...) dict[table_fqn -> (stats, profiles)]
summary_stats, profiles = profiler.profile(df)
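# (Illustrative peek at the step-1 output. summary_stats maps column name ->
# per-column stats; profiles is a list of DQProfile candidates. Exact stat keys
# and DQProfile fields vary by DQX version, so treat the names as assumptions.)
first_col = next(iter(summary_stats))
print(first_col, summary_stats[first_col])
print(profiles[0])  # e.g. DQProfile(name=..., column=..., parameters=...)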
# Step 2 — turn candidates into DQX checks (declarative list[dict]).
checks = generator.generate_dq_rules(profiles) # default criticality="error"
# Step 3 — inspect / edit, then persist. See dqx-storage for save targets.
for c in checks:
    print(c)
Profiling is a one-time bootstrap action per dataset. The candidate checks need human review before they are applied; don't auto-apply the raw output to production data.
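In practice, review means editing the generated dicts before persisting them. A minimal sketch; the field names ("criticality", nested "check" -> "function"/"arguments") follow DQX's declarative check format, but verify them against your DQX version:

# Hedged review pass: downgrade range rules inferred from a sample, since
# sampled min/max bounds are rarely production-ready. Field names assumed.
for check in checks:
    fn = check.get("check", {}).get("function")
    if fn == "is_in_range":            # bounds came from profiled data
        check["criticality"] = "warn"  # downgrade until bounds are confirmed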
DQProfiler.profile(df, columns=None, options=None) — columns is a top-level kwarg limiting the profiled columns; the following optional keys are set via the options dict:
- sample_fraction — float 0–1 (e.g. 0.1 for a 10% sample); use on large tables.
- sample_seed — int; pair with sample_fraction for reproducible runs.
- limit — absolute row cap (e.g. 1_000_000).
- filter — SQL string applied before profiling ("event_date >= '2026-01-01'").
- criticality — default for every generated rule ("error" or "warn", default "error").

summary_stats, profiles = profiler.profile(
df,
columns=["order_id", "total_amount", "country_code"],
options={"sample_fraction": 0.1, "sample_seed": 42, "criticality": "warn"},
)
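The table-based entry points from step 1 take the same options. A minimal sketch, assuming InputConfig lives in databricks.labs.dqx.config and takes a location field; verify the import path and field name against your DQX version:

from databricks.labs.dqx.config import InputConfig

# Single Unity Catalog table via InputConfig (field name assumed):
summary_stats, profiles = profiler.profile_table(
    input_config=InputConfig(location="catalog.schema.input"),
    options={"sample_fraction": 0.1},
)

# Many tables by pattern; returns dict[table_fqn -> (stats, profiles)]:
results = profiler.profile_tables_for_patterns(patterns=["catalog.schema.*"])
for table_fqn, (stats, table_profiles) in results.items():
    print(table_fqn, len(table_profiles))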
To generate DLT expectations instead of DQX checks, feed the same profiles to DQDltGenerator:

from databricks.labs.dqx.profiler.dlt_generator import DQDltGenerator

dlt_expectations = DQDltGenerator(ws).generate_dlt_rules(profiles, language="python")
# language can be "python" or "sql"
DQX can also generate rules from natural-language requirements via DSPy-backed LLMs; see the companion skills / docs rather than hand-rolling prompts.
The profiler also runs as a workspace workflow via the Databricks CLI:

databricks labs dqx install # once per workspace
databricks labs dqx profile # all run configs
databricks labs dqx profile --run-config default # one run config
databricks labs dqx profile --run-config default \
  --patterns "main.product001.*;main.product002" \
  --exclude-patterns "*_output;*_quarantine"
The workflow writes the generated candidates + summary stats to the checks_location on the run config (see dqx-storage).
- Review the generated criticality / bounds before rolling to production.
- Output / quarantine tables follow the _dq_output / _dq_quarantine naming suffixes; keep the convention.
- Sanity-check limit or sample_fraction against the current backfill.

Canonical docs: https://databrickslabs.github.io/dqx/docs/guide/data_profiling