From mozilla-bigquery-etl-skills
Use this skill to enrich schema.yaml files for BigQuery tables in the bigquery-etl repository. Handles creating schema.yaml when it doesn't exist, finding and filling missing column descriptions (from base schemas, upstream source schema, query context, or application context), validating columns against the query, and generating a summary with recommendations for where to add new descriptions (global.yaml, <dataset_name>.yaml, or app_<name>.yaml). Works with column-description-finder skill.
npx claudepluginhub mozilla/bigquery-etl-skills --plugin bigquery-etl-skills
This skill uses the workspace's default tool permissions.
**Composable:** Orchestrates column-description-finder (for finding descriptions)
**When to use:** Creating a new schema.yaml, filling missing column descriptions, validating schema columns against the query output, or auditing an existing schema for completeness
BEFORE performing any schema work, review the following:
Column description sources (note): Base schema files (global.yaml, app_<name>.yaml, dataset-specific YAML) are fetched live from GitHub by the column-description-finder skill in Step 0c. Do not read these files inline; wait for the skill invocation in Step 0c.
Schema YAML conventions: READ references/schema_yaml_conventions.md
Result file format: READ assets/missing_descriptions_template.yaml
column-description-finder ⚠️ MANDATORY, do not skip
BEFORE looking up any column descriptions, invoke the column-description-finder skill to audit base schema coverage. The skill fetches live base schema files from GitHub and uses deterministic scripts, avoiding the hallucination risk of reading files inline.
0a. Identify the applicable schemas:
If metadata.yaml exists, check for an app_schema: field (e.g., app_newtab). If absent, proceed without an app schema flag.
0b. Confirm schema.yaml and query file:
Locate the query file and read it to understand the SELECT output columns:
- If query.sql exists: read it.
- If query.py exists: note it; Step 3 column validation will be skipped since column output cannot be reliably parsed from Python.
- If no schema.yaml exists, generate one:
./bqetl query schema update <dataset>.<table>
Then read the generated schema.yaml before proceeding to Step 0c.
0c. Invoke the column-description-finder skill:
Run the audit script with the appropriate flags:
# With app schema:
python scripts/audit_base_schema_coverage.py <dataset>.<table> --app-schema <app_schema> --dataset-schema
# Without app schema:
python scripts/audit_base_schema_coverage.py <dataset>.<table> --dataset-schema
Capture the full output; it identifies:
⚠️ Wait for the skill to complete and capture its full output before proceeding to Step 1.
Using the audit output from Step 0c, categorize every column:
| Column status | Source | Action |
|---|---|---|
| Covered by live base schema | audit output (column → source file) | Use the base schema description |
| Has own description | audit output | Retain existing description |
| No description (not in live schemas) | audit output | Check local base schema files first; if found → local_only_base_schema_columns; if not → flag for Step 2 |
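The Step 1 categorization pass can be sketched in Python. The audit-output shape here (a simple column-to-source-file mapping) and the field names are illustrative assumptions, not the real script's output format:

```python
# Sketch of the Step 1 categorization pass over schema fields.
# The audit_matches shape is an assumption for illustration only.

def categorize_columns(schema_fields, audit_matches):
    """Split fields into covered / retained / flagged buckets.

    schema_fields: list of dicts with "name" and optional "description".
    audit_matches: dict mapping column name -> base schema source file.
    """
    covered, retained, flagged = [], [], []
    for field in schema_fields:
        name = field["name"]
        if name in audit_matches:
            covered.append((name, audit_matches[name]))
        elif field.get("description"):
            retained.append(name)
        else:
            flagged.append(name)  # candidate for Step 2 (after the local check)
    return covered, retained, flagged

fields = [
    {"name": "submission_date"},
    {"name": "country", "description": "Client country code."},
    {"name": "custom_metric"},
]
matches = {"submission_date": "global.yaml"}
covered, retained, flagged = categorize_columns(fields, matches)
# covered -> [("submission_date", "global.yaml")]; retained -> ["country"]; flagged -> ["custom_metric"]
```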
Checking local base schema files: The column-description-finder audit fetches schemas live from GitHub. If a base schema file exists locally but not on GitHub main (returns 404 when WebFetched), the audit will not find descriptions from it. For columns marked "no description" by the audit:
- Check the local bigquery_etl/schema/*.yaml files for a match.
- If a match is found, treat it as local-only and capture it in local_only_base_schema_columns in Step 6.
The base schema priority order applied by the audit script is defined in skills/column-description-finder/references/column_definition_yaml_guide.md (app-specific → dataset-specific → global).
Proceed to Step 2 for all columns with no description in any base schema (live or local).
For each column without a current description in schema.yaml, use this priority order:
1–3. Base schemas: descriptions are already identified in the Step 0c audit output. Apply them directly from the audit results (app-specific → dataset-specific → global).
4. Upstream source schema.yaml: for columns that are pass-through dimensions from a source table:
- Identify the source table in query.sql.
- If the source table ends in _live or _stable (a Glean ingestion table): invoke the glean-description-lookup skill to fetch descriptions from the Glean Dictionary; no local schema.yaml exists for those tables.
- Otherwise, locate the source table under sql/ and read its schema.yaml.
- If that schema.yaml is absent or the column has no description there, fall through to priority 5.
5. Query context: examine query.sql to understand what the column computes:
6. Application context: if no query context is clear, derive the description from:
⚠️ For any column whose description came from upstream source schema.yaml (priority 4), query context (priority 5), or application context (priority 6), i.e., not from a base schema, capture it in <table_name>_missing_metadata.yaml (Step 6).
Base schema matching covers top-level fields only (as noted in the Step 0c audit output). For nested fields within RECORD types, apply the same priority order as above: upstream source schema.yaml first, then query context, then application context. Descriptions derived from priorities 4, 5, or 6 for nested fields should also be captured in <table_name>_missing_metadata.yaml. The quality check (Step 4) applies to nested field descriptions as well.
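The nested-field walk described above can be sketched as a recursive traversal that collects dotted paths for fields still lacking a description. The field layout mirrors schema.yaml conventions but is simplified, and the names are illustrative:

```python
# Recursively collect dotted paths of fields (including those nested inside
# RECORD types) that have no description yet. Simplified, illustrative layout.

def missing_descriptions(fields, prefix=""):
    missing = []
    for field in fields:
        path = f"{prefix}{field['name']}"
        if not field.get("description"):
            missing.append(path)
        if field.get("type") == "RECORD":
            missing.extend(
                missing_descriptions(field.get("fields", []), prefix=path + ".")
            )
    return missing

schema = [
    {"name": "metrics", "type": "RECORD", "description": "Glean metrics.", "fields": [
        {"name": "counter", "type": "RECORD", "fields": [
            {"name": "tabs_opened", "type": "INTEGER"},
        ]},
    ]},
]
# missing_descriptions(schema) -> ["metrics.counter", "metrics.counter.tabs_opened"]
```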
Compare columns in schema.yaml against the query's SELECT output. Skip this step if only query.py exists (noted in Step 0b).
⚠️ This command may update schema.yaml on disk. Do NOT re-read schema.yaml after running; continue working from the field list and descriptions established in Steps 1–2. Use the command output (diff) only to identify structural additions and removals; apply those changes to your in-memory enriched schema, then write everything in Step 5.
./bqetl query schema update <dataset>.<table>
Review the diff to identify:
Every description, whether sourced from a base schema, retained from the existing schema.yaml, or derived from context, is verified against:
Failing descriptions are corrected automatically.
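As a hedged illustration of such a quality gate: the actual verification criteria are defined by the skill's references, and the three checks below (non-empty, not a placeholder, not too terse) are assumptions chosen purely for demonstration:

```python
# Illustrative description quality gate. The real criteria live in the
# skill's reference docs; these three checks are demonstration assumptions.

PLACEHOLDERS = {"tbd", "todo", "n/a", "description"}

def passes_quality_check(description):
    text = (description or "").strip()
    if not text:
        return False  # empty description fails outright
    if text.lower().rstrip(".") in PLACEHOLDERS:
        return False  # placeholder text fails
    return len(text.split()) >= 3  # too terse to be meaningful

passes_quality_check("TBD")                          # -> False
passes_quality_check("Date the ping was received.")  # -> True
```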
Write the enriched schema.yaml preserving field order, names, types, and modes. Only description: entries are added or updated.
On write failure: attempt write to schema_enriched_draft.yaml in the same directory and report the alternate path.
Then read back the written file and confirm:
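The read-back confirmation can be sketched as a recursive walk over the written file. This is a hedged sketch assuming PyYAML is available and that the file has a top-level fields: list per schema.yaml conventions:

```python
# Read back the written schema.yaml and report any field (top-level or
# nested) whose description is missing or empty. Assumes PyYAML and a
# top-level "fields" list, per schema.yaml conventions.
import yaml

def verify_written_schema(path):
    with open(path) as f:
        schema = yaml.safe_load(f)

    def check(fields, prefix=""):
        problems = []
        for field in fields:
            name = f"{prefix}{field['name']}"
            if not str(field.get("description", "")).strip():
                problems.append(name)
            problems.extend(check(field.get("fields", []), prefix=name + "."))
        return problems

    return check(schema.get("fields", []))  # empty list means all fields pass
```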
Then validate the schema against BigQuery:
./bqetl query schema validate <dataset>.<table>
On verification or validation failure: report which fields are missing/incorrect and attempt automatic remediation.
<table_name>_missing_metadata.yaml (if non-live-base-schema descriptions were needed)
Create this file if ANY of the following is true:
Write to:
bigquery_etl/schema/missing_metadata/<table_name>_missing_metadata.yaml
Structure: READ assets/missing_descriptions_template.yaml and COPY its format. The block below is a structural preview only; if it differs from the template file, the template is authoritative:
table: "<project>.<dataset>.<table>"
generated_date: "<YYYY-MM-DD>"
# Section 1: Columns whose descriptions came from upstream source schema, query context, or application context
# (not found in any base schema, live or local)
missing_columns:
- name: "<column_name>"
type: "<BigQuery type>"
mode: "<NULLABLE|REQUIRED|REPEATED>"
inferred_description: >-
<Description derived from upstream source schema, query context, or application/product context.>
inference_basis: "<column name semantics | related column | product domain | data type signal | query context | upstream source schema>"
recommended_target: "<global.yaml | app_<name>.yaml | <dataset_name>.yaml>"
recommendation_reason: >-
<Why this column belongs in the recommended target.>
# Section 2: Columns matched in LOCAL-ONLY base schema files
# (descriptions are valid but sourced from files not yet on GitHub main)
local_only_base_schema_columns:
- name: "<column_name>"
type: "<BigQuery type>"
mode: "<NULLABLE|REQUIRED|REPEATED>"
description: >-
<Description sourced from the local-only base schema file.>
source_file: "<app_<name>.yaml | <dataset_name>.yaml>"
source_status: "local-only (not on GitHub main)"
recommended_action: >-
Merge <source_file> to GitHub main to make this description authoritative.
Omit a section entirely if it has no entries.
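A minimal sketch of emitting this structure with PyYAML, honoring the omit-empty-sections rule; the entry contents below are illustrative, not real table data:

```python
# Build the missing-metadata document and drop any empty section, as the
# template requires. Assumes PyYAML; entry contents are illustrative.
import yaml

def build_missing_metadata(table, date, missing_columns, local_only):
    doc = {"table": table, "generated_date": date}
    if missing_columns:  # omit the section entirely if it has no entries
        doc["missing_columns"] = missing_columns
    if local_only:
        doc["local_only_base_schema_columns"] = local_only
    return yaml.safe_dump(doc, sort_keys=False, default_flow_style=False)

output = build_missing_metadata(
    "proj.dataset.table", "2024-01-01",
    [{"name": "engagement_rate", "type": "FLOAT", "mode": "NULLABLE"}],
    [],
)
```

Here the empty `local_only_base_schema_columns` list is omitted from the emitted YAML entirely, rather than being written as an empty key.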
recommended_target rules for missing_columns (this is also the decision logic for choosing which base schema file to recommend):
| Condition | Target |
|---|---|
| Column useful across many datasets/products | global.yaml |
| Column specific to one product (newtab, pocket, ads) | app_<product_name>.yaml |
| Column specific to one dataset | <dataset_name>.yaml |
⚠️ Apply the product-specificity check first. A column that appears only in one dataset is not automatically a <dataset_name>.yaml candidate: if the column represents a product feature or product-scoped concept, it belongs in app_<product_name>.yaml regardless of which dataset it lives in.
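The decision logic in the table, with the product-specificity check applied first, can be sketched as follows. The boolean signals passed in are assumptions for illustration; in practice they are judged from schema and query context:

```python
# Sketch of the recommended_target decision, product-specificity first.
# The input signals are illustrative assumptions, not real audit fields.

def recommend_target(column, is_product_scoped=False, product=None,
                     used_across_datasets=False, dataset=None):
    if is_product_scoped and product:
        # Product check wins even if the column lives in only one dataset.
        return f"app_{product}.yaml"
    if used_across_datasets:
        return "global.yaml"
    return f"{dataset}.yaml"

recommend_target("newtab_clicks", is_product_scoped=True, product="newtab")
# -> "app_newtab.yaml"
recommend_target("row_count", used_across_datasets=True)
# -> "global.yaml"
```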
For auditing recommended_target across multiple tables at once, use the
base-schema-audit skill (decision tree:
skills/base-schema-audit/references/base_schema_classification_guide.md).
If neither condition is met, skip this step and note in the summary:
<table_name>_missing_metadata.yaml not created: all columns matched in live base schemas
Write a summary markdown file to:
bigquery_etl/schema/missing_metadata/<table_name>-metadata-summary.md
Use the Write tool to create this file with the following structure:
# Schema Enricher: Metadata Summary
**Table:** `<project>.<dataset>.<table>`
**Run date:** <YYYY-MM-DD>
---
## Phase Results
| Phase (Step) | Status | Notes |
|---|---|---|
| Discovery (Step 0) | ✅/❌ | Base schemas found |
| Categorization (Step 1) | ✅/❌ | X covered, Y retained, Z flagged |
| Description Fill (Step 2) | ✅/❌ | X from base schemas, Y from upstream schema/query/application context |
| Column Validation (Step 3) | ✅/❌/⏭️ | X/Y match query output (or: skipped, only query.py exists) |
| Quality Check (Step 4) | ✅/❌ | X descriptions pass |
| Write (Step 5) | ✅/❌ | schema.yaml written |
| Verification (Step 5) | ✅/❌ | All fields confirmed |
---
## Base Schema Coverage
| Column | Matched Field | Source File | Source Status | Alias Used |
|---|---|---|---|---|
| <column_name> | <matched_field> | <source_file> | live | no |
| <column_name> | <canonical_name> | <source_file> | live | yes (alias: <column_name>) |
| <column_name> | <matched_field> | <source_file> | local-only | no |
| <column_name> | - | (none) | non-base-schema | - |
---
## Local-Only Base Schema Columns
| Column | Source File | Recommended Action |
|---|---|---|
| <column_name> | <source_file> | Merge <source_file> to GitHub main |
*(Omit this section if all matched base schemas are live)*
---
## Non-Base-Schema Descriptions
| Column | Description | Inference Basis | Recommended Base Schema |
|---|---|---|---|
| <column_name> | <description> | <upstream source schema \| query context \| application context> | <global.yaml \| app_<name>.yaml \| <dataset_name>.yaml> |
*(Omit this section if no non-base-schema descriptions were needed)*
---
## Column Validation Results
_(If Step 3 was skipped because only query.py exists, write: "Column validation skipped; only query.py exists")_
- **Columns added (missing from schema):** [list or "none"]
- **Columns removed (not in query):** [list or "none"]
- **Type corrections:** [list or "none"]
---
## Output Files
- `sql/<project>/<dataset>/<table>/schema.yaml`: enriched with all X descriptions
- `bigquery_etl/schema/missing_metadata/<table_name>_missing_metadata.yaml`: Y non-base-schema descriptions (priorities 4–6) + Z local-only base schema columns
  (or: `<table_name>_missing_metadata.yaml` not created: all columns matched in live base schemas)
- `bigquery_etl/schema/missing_metadata/<table_name>-metadata-summary.md`: this summary file
Before reporting done, confirm all required output files have been written using the Read tool or ls:
- sql/<project>/<dataset>/<table>/schema.yaml: every field has a non-empty description
- bigquery_etl/schema/missing_metadata/<table_name>-metadata-summary.md: summary markdown written
- bigquery_etl/schema/missing_metadata/<table_name>_missing_metadata.yaml: written if any columns required descriptions from priorities 4, 5, or 6 OR came from local-only base schemas; explicitly noted as not created if all columns matched in live base schemas
If any file is missing, create it before proceeding.
| Skill | When to invoke |
|---|---|
| column-description-finder | Always: invoked in Step 0c to audit base schema coverage using live GitHub schema files |
| glean-description-lookup | Step 2, priority 4: when the upstream source table ends in _live or _stable (no local schema.yaml) |
| base-schema-audit | Before enriching many tables: classifies missing columns into correct base schema targets across a full dataset |
| create-pr | After enrichment is complete: stages, commits, and opens a draft PR |
# Generate schema structure from query output (Step 0b)
./bqetl query schema update <dataset>.<table>
# Validate schema against BigQuery (Step 5)
./bqetl query schema validate <dataset>.<table>