Profiles Spark DataFrames or Unity Catalog tables and generates DQX data quality rule candidates with summary statistics. Supports sampling, filters, DLT expectations, and AI-assisted variants.
npx claudepluginhub databrickslabs/dqx --plugin dqx

This skill uses the workspace's default tool permissions.
Typical one-shot bootstrap for a new table:
from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.labs.dqx.profiler.generator import DQGenerator
from databricks.sdk import WorkspaceClient
ws = WorkspaceClient()
profiler = DQProfiler(ws)
generator = DQGenerator(ws)
df = spark.read.table("catalog.schema.input")
# Step 1 — profile. Returns summary stats + DQProfile candidates per column.
# Three entry points, pick by what you have on hand:
# - profiler.profile(df, ...) — in-memory DataFrame
# - profiler.profile_table(input_config=..., ...) — single Unity Catalog table by InputConfig
# - profiler.profile_tables_for_patterns( — many tables; returns
# patterns=["catalog.schema.*"], ...) dict[table_fqn -> (stats, profiles)]
summary_stats, profiles = profiler.profile(df)
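# (Illustrative peek at the step-1 output. summary_stats maps column name ->
# per-column stats; profiles is a list of DQProfile candidates. Exact stat keys
# and DQProfile fields vary by DQX version, so treat the names as assumptions.)
first_col = next(iter(summary_stats))
print(first_col, summary_stats[first_col])
print(profiles[0])  # e.g. DQProfile(name=..., column=..., parameters=...)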
# Step 2 — turn candidates into DQX checks (declarative list[dict]).
checks = generator.generate_dq_rules(profiles) # default criticality="error"
# Step 3 — inspect / edit, then persist. See dqx-storage for save targets.
for c in checks:
    print(c)
Profiling is a one-time bootstrap action per dataset. The candidate checks need human review before they are applied; don't auto-apply the raw output to production data.
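In practice, review means editing the generated dicts before persisting them. A minimal sketch; the field names ("criticality", nested "check" -> "function"/"arguments") follow DQX's declarative check format, but verify them against your DQX version:

# Hedged review pass: downgrade range rules inferred from a sample, since
# sampled min/max bounds are rarely production-ready. Field names assumed.
for check in checks:
    fn = check.get("check", {}).get("function")
    if fn == "is_in_range":            # bounds came from profiled data
        check["criticality"] = "warn"  # downgrade until bounds are confirmed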
DQProfiler.profile(df, columns=None, options=None) — columns is a top-level kwarg limiting the profiled columns; the following optional keys are set via the options dict:
- sample_fraction — float 0–1 (e.g. 0.1 for a 10% sample); use on large tables.
- sample_seed — int; pair with sample_fraction for reproducible runs.
- limit — absolute row cap (e.g. 1_000_000).
- filter — SQL string applied before profiling ("event_date >= '2026-01-01'").
- criticality — default for every generated rule ("error" or "warn", default "error").

summary_stats, profiles = profiler.profile(
df,
columns=["order_id", "total_amount", "country_code"],
options={"sample_fraction": 0.1, "sample_seed": 42, "criticality": "warn"},
)
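The table-based entry points from step 1 take the same options. A minimal sketch, assuming InputConfig lives in databricks.labs.dqx.config and takes a location field; verify the import path and field name against your DQX version:

from databricks.labs.dqx.config import InputConfig

# Single Unity Catalog table via InputConfig (field name assumed):
summary_stats, profiles = profiler.profile_table(
    input_config=InputConfig(location="catalog.schema.input"),
    options={"sample_fraction": 0.1},
)

# Many tables by pattern; returns dict[table_fqn -> (stats, profiles)]:
results = profiler.profile_tables_for_patterns(patterns=["catalog.schema.*"])
for table_fqn, (stats, table_profiles) in results.items():
    print(table_fqn, len(table_profiles))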
To generate DLT expectations instead of DQX checks, feed the same profiles to DQDltGenerator:

from databricks.labs.dqx.profiler.dlt_generator import DQDltGenerator

dlt_expectations = DQDltGenerator(ws).generate_dlt_rules(profiles, language="python")
# language can be "python" or "sql"
DQX can also generate rules from natural-language requirements via DSPy-backed LLMs; see the companion skills / docs rather than hand-rolling prompts.
The profiler also runs as a workspace workflow via the Databricks CLI:

databricks labs dqx install # once per workspace
databricks labs dqx profile # all run configs
databricks labs dqx profile --run-config default # one run config
databricks labs dqx profile --run-config default \
  --patterns "main.product001.*;main.product002" \
  --exclude-patterns "*_output;*_quarantine"
The workflow writes the generated candidates + summary stats to the checks_location on the run config (see dqx-storage).
- Review the generated criticality / bounds before rolling to production.
- Output / quarantine tables follow the _dq_output / _dq_quarantine naming suffixes; keep the convention.
- Sanity-check limit or sample_fraction against the current backfill.

Canonical docs: https://databrickslabs.github.io/dqx/docs/guide/data_profiling