Investigate Mozilla Airflow DAG failures. Use when user asks about: failed DAGs, Airflow task logs, DAG run errors, bqetl failures, telemetry-airflow issues, or data pipeline debugging.
Install:

```bash
npx claudepluginhub akkomar/mozdata-claude-plugin --plugin mozdata
```
You help users investigate and debug Mozilla Airflow DAG failures by fetching logs, identifying root causes, and suggesting fixes.
Use get-triage-data from the airflow-triage skill to discover failures:
```bash
../airflow-triage/scripts/get-triage-data              # Last 24 hours
../airflow-triage/scripts/get-triage-data --since 3d   # Last 3 days
```
Fetch and explore task logs from GCS (gs://airflow-remote-logs-prod-prod).
```bash
# List recent runs for a DAG
scripts/fetch-task-log <dag_id> --list-runs

# List tasks in a specific run
scripts/fetch-task-log <dag_id> --list-tasks --run-id <run_id>

# Fetch a task log
scripts/fetch-task-log <dag_id> <task_id> <run_id>

# Fetch only the last N lines
scripts/fetch-task-log <dag_id> <task_id> <run_id> --tail 100
```
When investigating failures, check these repos (all checked out locally):
- `bigquery-etl` - Query definitions, metadata.yaml, DAG generation
- `private-bigquery-etl` - Confidential ETL code
- `telemetry-airflow` - DAGs, operators, GKEPodOperator
- `dataservices-infra` - Infrastructure (GKE, Helm, logging config)

Most DAGs are auto-generated from bigquery-etl. The task ID tells you where to find the source.
Task IDs follow the pattern `<dataset>__<table>__<version>`. Example task ID: `telemetry_derived__clients_daily__v6`
Source query location:
```
bigquery-etl/sql/moz-fx-data-shared-prod/<dataset>/<table>/
├── query.sql       # The SQL query
├── metadata.yaml   # Scheduling config, owner, tags
└── schema.yaml     # Table schema
```
For the example above: `bigquery-etl/sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v6/`
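For scripting, the task ID maps to its source directory mechanically. A minimal sketch using shell parameter expansion; it assumes the default `moz-fx-data-shared-prod` project, so adjust the project segment for tasks that target other projects:

```bash
# Sketch: derive the bigquery-etl source path from a task ID.
task_id="telemetry_derived__clients_daily__v6"
dataset="${task_id%%__*}"   # telemetry_derived
rest="${task_id#*__}"       # clients_daily__v6
table="${rest%%__*}"        # clients_daily
version="${rest##*__}"      # v6
echo "bigquery-etl/sql/moz-fx-data-shared-prod/${dataset}/${table}_${version}/"
```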
DAGs whose names start with `bqetl_` are auto-generated. The DAG configuration is in `bigquery-etl/dags.yaml`.
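To inspect a generated DAG's configuration (schedule, owner, retries) without opening the whole file, a quick grep is usually enough. A sketch, assuming DAG names are top-level keys in `dags.yaml`; `bqetl_main_summary` is a hypothetical name:

```bash
# Sketch: print a generated DAG's config block from dags.yaml.
# The exact fields under each key may vary; widen -A if the block is cut off.
grep -n -A 10 "^bqetl_main_summary:" bigquery-etl/dags.yaml
```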
DAGs not starting with `bqetl_` are manually defined in `telemetry-airflow/dags/<dag_name>.py`.
Some DAGs are in private-bigquery-etl with the same structure: `private-bigquery-etl/sql/<project>/<dataset>/<table>/`
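If a task's source doesn't turn up in bigquery-etl, a filesystem search across both checkouts confirms where it lives. A minimal sketch; the dataset name is illustrative:

```bash
# Sketch: locate a dataset's source directory in either repo.
# Dataset directories sit at sql/<project>/<dataset>, hence -maxdepth 2.
find bigquery-etl/sql private-bigquery-etl/sql \
  -maxdepth 2 -type d -name "telemetry_derived"
```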
Airflow runs across two GCP projects:
| Project | Purpose | Namespace |
|---|---|---|
| `moz-fx-dataservices-high-prod` | Airflow workers, scheduler | `telemetry-airflow-prod` |
| `moz-fx-data-airflow-gke-prod` | GKEPodOperator jobs (queries, scripts) | `default` |
Start with GCS logs via fetch-task-log. Fall back to Cloud Logging if you suspect infrastructure issues or if GCS logs are missing/incomplete.
| Aspect | GCS (fetch-task-log) | Cloud Logging |
|---|---|---|
| Content | Complete Airflow task logs (same as UI) | Raw container stdout/stderr |
| Retention | 360 days | 30 days |
| Best for | Task failures (SQL errors, exceptions) | Pod-level issues (OOM kills, scheduling failures) |
Airflow scheduler/worker logs:
```bash
gcloud logging read 'resource.type="k8s_container" AND resource.labels.namespace_name="telemetry-airflow-prod" AND textPayload=~"<DAG_ID>"' \
  --project=moz-fx-dataservices-high-prod \
  --limit=200
```
GKEPodOperator job logs (query execution errors):
```bash
gcloud logging read 'resource.type="k8s_container" AND resource.labels.namespace_name="default" AND textPayload=~"<DAG_ID>"' \
  --project=moz-fx-data-airflow-gke-prod \
  --limit=200
```
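For pod-level failures that never reach the task log (OOM kills, pods that failed to schedule), filtering by severity instead of DAG ID can surface the relevant events. A sketch; the severity threshold and `--freshness` window are starting points, and the payload fields depend on the cluster's logging config:

```bash
# Sketch: surface recent pod-level errors in the job cluster.
gcloud logging read \
  'resource.type="k8s_container" AND resource.labels.namespace_name="default" AND severity>=ERROR' \
  --project=moz-fx-data-airflow-gke-prod \
  --freshness=1d \
  --limit=50
```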
Useful links:

- Airflow UI: https://workflow.telemetry.mozilla.org/home
- DAG grid: https://workflow.telemetry.mozilla.org/dags/<dag_id>/grid
- Task in a specific run: https://workflow.telemetry.mozilla.org/dags/<dag_id>/grid?dag_run_id=<run_id>&task_id=<task_id>
- Grafana Airflow overview: https://earthangel-b40313e5.influxcloud.net/d/airflow-overview
- Grafana task duration: https://earthangel-b40313e5.influxcloud.net/d/airflow-task-duration

A runbook may exist for the pipeline; search Confluence by `<dag_id>` or the pipeline area (e.g. "main_summary runbook"). If a Confluence MCP tool is available, use it to search directly. Otherwise, suggest the user check:

- https://mozilla-hub.atlassian.net/wiki/search?text=<dag_id>
- https://mozilla-hub.atlassian.net/wiki/search?text=airflow+runbook

Before investigating, search Bugzilla for existing tickets related to the failing DAG. This avoids duplicate work and surfaces prior context.
Use `scripts/search-bugzilla`; it is pinned to bugzilla.mozilla.org and defaults to the `[airflow-triage]` whiteboard filter:
```bash
scripts/search-bugzilla <dag_id>                     # Open [airflow-triage] bugs matching DAG name
scripts/search-bugzilla <dag_id> --status resolved   # Recently resolved matches
scripts/search-bugzilla <dag_id> --status all        # Both
scripts/search-bugzilla "<task_id>" --limit 10       # Narrow by task name
scripts/search-bugzilla --all "<keywords>"           # Drop the whiteboard filter
scripts/search-bugzilla <dag_id> --json              # Structured output
```
To read full context (change history + comments) for a bug ID the script returned, call `mcp__plugin_mozdata_moz__get_bugzilla_bug` with the bug ID. That goes through the Mozilla-hosted MCP server, so no direct network access is needed from the skill.
If a match is found, link it: `https://bugzilla.mozilla.org/show_bug.cgi?id=<bug_id>`

When investigating a failure, search for recent PRs that may have introduced the issue. Check these repositories:
| Repo | What it contains |
|---|---|
| `mozilla/bigquery-etl` | Query definitions, metadata.yaml, DAG generation |
| `mozilla/private-bigquery-etl` | Confidential ETL code |
| `mozilla/telemetry-airflow` | DAGs, operators, GKEPodOperator |
| `mozilla/lookml-generator` | LookML generation from bigquery-etl |
| `mozilla/probe-scraper` | Probe/metric definitions scraping |
Use the `gh` CLI to find PRs merged around the time of the failure. Focus on files related to the failing DAG/task:
```bash
# List recently merged PRs, then inspect their changed files for relevant paths
gh pr list --repo mozilla/bigquery-etl --state merged --limit 10 \
  --search "merged:>2026-04-10" --json number,title,mergedAt,url

# Search PRs mentioning a DAG or table name
gh search prs --repo mozilla/bigquery-etl --merged-at ">2026-04-10" "<dag_id or table_name>"

# View a specific PR's changed files
gh pr view --repo mozilla/bigquery-etl <PR_NUMBER> --json files,title,body
```
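Since the repos are checked out locally (see above), git history is often the fastest way to find changes to a specific query. A sketch, with an illustrative path:

```bash
# Sketch: recent commits touching the failing query's source directory.
cd bigquery-etl
git log --since="7 days ago" --oneline -- \
  sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v6/
```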
If a suspect PR is found, include it in the investigation report with a link and a note on which changed files are relevant.
If the user provides a DAG name, skip straight to step 2. Only discover failures when no DAG name is given.
1. Run `../airflow-triage/scripts/get-triage-data` to discover failures (skip if the DAG name is already known).
2. Search Bugzilla with `scripts/search-bugzilla`. If a promising bug ID comes back, hydrate it with `mcp__plugin_mozdata_moz__get_bugzilla_bug` for history and comments. Link it and note prior context before continuing.
3. Run `scripts/fetch-task-log <dag_id> --list-runs` to find recent runs.
4. Run `scripts/fetch-task-log <dag_id> --list-tasks --run-id <run_id>` to list tasks in the failing run.
5. Run `scripts/fetch-task-log <dag_id> <task_id> <run_id> --tail 100` to get the error.
6. Check for recent suspect PRs with the `gh` CLI (see "GitHub PR Investigation" above).
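Put together, a typical session looks like this sketch; the DAG, task, and run IDs are hypothetical, and the run-id format assumes Airflow's standard `scheduled__<timestamp>` convention:

```bash
# Hypothetical investigation of a failing bqetl DAG; all IDs illustrative.
scripts/search-bugzilla bqetl_main_summary
scripts/fetch-task-log bqetl_main_summary --list-runs
scripts/fetch-task-log bqetl_main_summary --list-tasks \
  --run-id "scheduled__2026-04-12T02:00:00+00:00"
scripts/fetch-task-log bqetl_main_summary \
  telemetry_derived__main_summary__v4 \
  "scheduled__2026-04-12T02:00:00+00:00" --tail 100
```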
When reporting findings, rate your confidence and flag what needs human investigation:
| Level | Meaning | Example |
|---|---|---|
| High confidence | Root cause is clear from logs and source code | SQL syntax error with exact line, permission denied on a specific table |
| Medium confidence | Likely cause identified but could not fully verify | Timeout that could be query performance or upstream delay |
| Low confidence | Symptoms observed but root cause is unclear | Intermittent failure with no clear error, infra-level issue |
Always be explicit about which confidence level applies and what still needs a human to verify.