Automated data profiling, quality assessment, and transformation for data sourced from BigQuery or Google Cloud Storage (GCS), using Dataplex within Dataform/dbt pipelines. Covers ingestion, data movement, schema mapping, and cleaning.
> [!IMPORTANT]
> You MUST use this skill for ANY task where the source is BigQuery or GCS, including seemingly simple operations like "move data" or "copy table".
## Step 1: Profile the Source

Perform these checks before generating the implementation_plan.md.

**Check Eligibility**: You MUST confirm the source is a BigQuery table or GCS source.

**Gather Data Profile via Dataplex**:
- GCS sources: you MUST create an external table before running the Dataplex scan (see the external-table sketch below).
- Wait for results: you MUST NOT proceed until the Dataplex profile is available, unless the user denied scan approval.
- Use the profile as input for cleansing and schema-mapping decisions. The transformations MUST NOT be finalized before profile information is available (unless the scan was denied).
Commands:

1. **Obtain user approval**: Present the scripts/dataplex_scanner.py scan command to the user and obtain explicit approval before executing it. Use the following template to present the command:

   ```
   python3 scripts/dataplex_scanner.py ...
   ```

   (Fetch the full arguments from the command template below.)

2. **Run the script**: Run the scripts/dataplex_scanner.py script located in the same directory as this SKILL.md file. This script handles concurrent scan creation, dynamic sampling for large tables, and polling for results. Use --help to learn more. The script will save the full results as JSON files in the specified output directory.

> [!IMPORTANT]
> The location MUST be a specific Google Cloud region like us-central1; multi-regions like us are not supported by Dataplex scans.

If there are multiple tables to scan, provide them all in the --tables argument to run them concurrently.

Use the following command template:

```
python3 scripts/dataplex_scanner.py \
  --tables <project.dataset.table> <project.catalog.namespace.table> \
  --location <location> \
  --output-dir <output_dir>
```

Note: The script accepts table IDs in the format project.dataset.table for BigQuery tables and project.catalog.namespace.table for BigLake Iceberg tables.
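For GCS sources, the external table required before scanning can be created as in this minimal sketch, assuming CSV files; the project, dataset, table, and bucket names are placeholders:

```sql
-- Sketch only: expose GCS files as a BigQuery external table so Dataplex
-- can profile them. Format, names, and URI are illustrative placeholders.
CREATE OR REPLACE EXTERNAL TABLE `my_project.my_dataset.gcs_source_ext`
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-bucket/path/*.csv']
);
```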
**Fetch Schema & Samples**: Use bq commands to fetch the schema and sample data for both source and destination tables.
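For example, a minimal sketch with placeholder table IDs:

```
# Schema of the source and destination tables (bq uses project:dataset.table syntax).
bq show --schema --format=prettyjson my_project:my_dataset.source_table
bq show --schema --format=prettyjson my_project:my_dataset.dest_table

# A few sample rows from the source.
bq head --max_rows=20 my_project:my_dataset.source_table
```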
implementation_plan.md MUST include a Profiling Evidence section. Note: If scan execution was denied by the user, document the denial reason here instead of Job IDs.

```
## Profiling Evidence
- [ ] Dataplex Data Profile Job ID: <JOB_ID>
- [ ] Profile Result Summary: <Brief summary of key findings, e.g., % nulls, distinct values>
```
implementation_plan.md MUST include a step to generate cleansing SQL transformations based on the profile output and the instructions in Step 2: Generate Transformations.

implementation_plan.md MUST also reference Step 3 (Quality Review) under its Verification Plan section.

> [!CAUTION]
> Do not proceed to implementation until both sections are completed. You MUST ensure that the verification phase only validates that your transformations successfully addressed the anomalies found in Step 1.
## Step 2: Generate Transformations

Generate cleansing SQL transformations according to the following rules. A combined sketch of these patterns follows the list.

**Cleansing:**
- Produce NULL only for malformed data (e.g., unparseable dates, zero-length strings for non-nullable integers).
- Normalize mixed units (e.g., 'C' → 'F') to the most common unit. If units are too varied (e.g., mg, liter), leave the values as-is.
- Use COALESCE with SAFE.PARSE_* functions to handle multiple date/time/datetime/timestamp formats. Fetch diverse samples when the source data shows high variance.

**JSON:**
- Use SAFE.PARSE_JSON to cast JSON strings to the JSON type. Never use the deprecated JSON_EXTRACT_* functions.
- Use JSON_VALUE, JSON_QUERY, JSON_QUERY_ARRAY, and JSON_VALUE_ARRAY without the SAFE. prefix (they are safe by default).
- If SAFE.PARSE_JSON returns NULL, keep the original string and note the invalid JSON in the cleaning summary.

**Arrays:**
- Remove NULL elements after SAFE_CAST (e.g., using ARRAY), as BigQuery arrays cannot contain NULLs.
- Do not change casing (e.g., LOWER(), UPPER()) unless explicitly required.
- Filter NULL values using ARRAY_FILTER(array_column, e -> e IS NOT NULL).
- Use ARRAY(SELECT DISTINCT x FROM UNNEST(array_column)) for case-sensitive deduplication.
- Use ARRAY_TRANSFORM or UNNEST/ARRAY_AGG for element-wise changes (e.g., date parsing).
- Use UNNEST to expand an array into rows, or ARRAY_AGG to group rows into an array, as required by the destination schema.

**Schema mapping:**
- Use SAFE_CAST based on the destination schema or the inferred profile.
- Do not change casing (e.g., LOWER(), UPPER()) unless explicitly required.
- For structs, use dot notation (e.g., struct.field) to extract fields, or the STRUCT() constructor to group columns.
- Fill missing fields with NULL and drop fields not present in the destination schema.
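A minimal sketch combining these patterns; every table and column name is an illustrative placeholder:

```sql
SELECT
  -- Try several date formats; the result is NULL only when none parses.
  COALESCE(
    SAFE.PARSE_DATE('%Y-%m-%d', raw_date),
    SAFE.PARSE_DATE('%m/%d/%Y', raw_date)
  ) AS order_date,
  -- Cast a JSON string to the JSON type; rows where SAFE.PARSE_JSON
  -- returns NULL are handled per the JSON rules above.
  SAFE.PARSE_JSON(raw_payload) AS payload,
  -- Cast array elements, then drop NULLs, since BigQuery arrays cannot
  -- contain NULL elements.
  ARRAY_FILTER(
    ARRAY_TRANSFORM(raw_amounts, e -> SAFE_CAST(e AS NUMERIC)),
    e -> e IS NOT NULL
  ) AS amounts,
  -- Extract a struct field with dot notation; cast per destination schema.
  SAFE_CAST(address.zip AS STRING) AS zip_code
FROM `my_project.my_dataset.raw_orders`;
```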
## Step 3: Quality Review

> [!IMPORTANT]
> You MUST verify transformations strictly using the protocol below before completing the task. Never skip this step. Use Dataplex profiling only (unless the scan was denied by the user), not ad-hoc SQL queries.
Quality review protocol:

1. Extract the SELECT query containing all generated transformations (autocleaning, schema mapping, JSON extractions).
2. Create a temporary sample output table (max 1M rows, 1-hour TTL) by running the transformation query (see the sketch after the anomaly table below).
3. Fix any runtime errors and re-run until the query succeeds.
4. Profile the temporary sample output table using Dataplex:
   - Run the scripts/dataplex_scanner.py script on the temporary table.
   - If the scan was denied, run bq sample queries instead to ensure the transformations were successful.
5. Compare profiles (skip if scans were denied): check the new profile against the Step 1 profile for every transformed column:
| Anomaly Type | Threshold |
| --- | --- |
| **NULL increase** | >1% increase compared to source (unless expected) |
| **Value range shift** | Unexpected ranges or formats |
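A minimal sketch for step 2, the temporary sample table; all names are placeholders:

```sql
-- Sketch: materialize a bounded sample of the transformation output with a
-- 1-hour TTL. Replace the inner query with the generated transformation
-- SELECT.
CREATE TABLE `my_project.my_dataset.tmp_quality_sample`
OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
) AS
SELECT *
FROM (
  SELECT * FROM `my_project.my_dataset.raw_orders`  -- transformation query here
)
LIMIT 1000000;
```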
6. Iterate on anomalies: for each anomaly, inspect the affected rows (e.g., where the source value IS NOT NULL but the transformed value IS NULL), fix the transformation, and repeat the review.

Your walkthrough.md MUST include a Quality Review Profiling Evidence section. Note: If scan execution was denied by the user, document the denial reason here instead of Job IDs.
```
## Quality Review Profiling Evidence
- [ ] Post-Transformation Dataplex Profile Job ID: <JOB_ID>
- [ ] Profile Comparison Summary: <Detailed comparison between initial and final profiles per column>
```
> [!CAUTION]
> Do not conclude the task or ask for user review until this section is filled and the profile comparison is documented.
Your walkthrough.md must contain a table for each transformation in the
following format:
| Field | Description |
| --- | --- |
| **Destination schema considered** | The target column/type being matched |
| **Issue Detected** | What data quality problem was found |
| **Transformation Applied** | The SQL logic used to fix it |
| **Benefit** | Why this transformation improves the data |
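For example, a hypothetical filled-in entry (all values illustrative):

| Field | Description |
| --- | --- |
| **Destination schema considered** | order_date DATE |
| **Issue Detected** | Source date strings mix %Y-%m-%d and %m/%d/%Y formats |
| **Transformation Applied** | COALESCE(SAFE.PARSE_DATE('%Y-%m-%d', raw_date), SAFE.PARSE_DATE('%m/%d/%Y', raw_date)) |
| **Benefit** | Destination column becomes a proper DATE; only genuinely malformed values become NULL |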
Include a summary of all quality review steps and profiling evidence.