Investigate Mozilla Airflow DAG failures. Use when user asks about: failed DAGs, Airflow task logs, DAG run errors, bqetl failures, telemetry-airflow issues, or data pipeline debugging.
Install:

```bash
npx claudepluginhub akkomar/mozdata-claude-plugin --plugin mozdata
```
You help users investigate and debug Mozilla Airflow DAG failures by fetching logs, identifying root causes, and suggesting fixes.
Use get-triage-data from the airflow-triage skill to discover failures:
```bash
../airflow-triage/scripts/get-triage-data              # Last 24 hours
../airflow-triage/scripts/get-triage-data --since 3d   # Last 3 days
```
Fetch and explore task logs from GCS (gs://airflow-remote-logs-prod-prod).
```bash
# List recent runs for a DAG
scripts/fetch-task-log <dag_id> --list-runs

# List tasks in a specific run
scripts/fetch-task-log <dag_id> --list-tasks --run-id <run_id>

# Fetch a task log
scripts/fetch-task-log <dag_id> <task_id> <run_id>

# Fetch only the last N lines
scripts/fetch-task-log <dag_id> <task_id> <run_id> --tail 100
```
When investigating failures, check these repos (all checked out locally):
- `bigquery-etl` - Query definitions, metadata.yaml, DAG generation
- `private-bigquery-etl` - Confidential ETL code
- `telemetry-airflow` - DAGs, operators, GKEPodOperator
- `dataservices-infra` - Infrastructure (GKE, Helm, logging config)

Most DAGs are auto-generated from bigquery-etl. The task ID tells you where to find the source.
Task IDs follow the pattern `<dataset>__<table>__<version>`. Example task ID: `telemetry_derived__clients_daily__v6`
Source query location:
```
bigquery-etl/sql/moz-fx-data-shared-prod/<dataset>/<table>/
├── query.sql       # The SQL query
├── metadata.yaml   # Scheduling config, owner, tags
└── schema.yaml     # Table schema
```
For the example above: `bigquery-etl/sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v6/`
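For scripting, the task ID maps to its source directory mechanically. A minimal sketch using shell parameter expansion; it assumes the default `moz-fx-data-shared-prod` project, so adjust the project segment for tasks that target other projects:

```bash
# Sketch: derive the bigquery-etl source path from a task ID.
task_id="telemetry_derived__clients_daily__v6"
dataset="${task_id%%__*}"   # telemetry_derived
rest="${task_id#*__}"       # clients_daily__v6
table="${rest%%__*}"        # clients_daily
version="${rest##*__}"      # v6
echo "bigquery-etl/sql/moz-fx-data-shared-prod/${dataset}/${table}_${version}/"
```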
DAGs whose names start with `bqetl_` are auto-generated. The DAG configuration is in `bigquery-etl/dags.yaml`.
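To inspect a generated DAG's configuration (schedule, owner, retries) without opening the whole file, a quick grep is usually enough. A sketch, assuming DAG names are top-level keys in `dags.yaml`; `bqetl_main_summary` is a hypothetical name:

```bash
# Sketch: print a generated DAG's config block from dags.yaml.
# The exact fields under each key may vary; widen -A if the block is cut off.
grep -n -A 10 "^bqetl_main_summary:" bigquery-etl/dags.yaml
```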
DAGs not starting with `bqetl_` are manually defined in `telemetry-airflow/dags/<dag_name>.py`.
Some DAGs are in private-bigquery-etl with the same structure: `private-bigquery-etl/sql/<project>/<dataset>/<table>/`
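If a task's source doesn't turn up in bigquery-etl, a filesystem search across both checkouts confirms where it lives. A minimal sketch; the dataset name is illustrative:

```bash
# Sketch: locate a dataset's source directory in either repo.
# Dataset directories sit at sql/<project>/<dataset>, hence -maxdepth 2.
find bigquery-etl/sql private-bigquery-etl/sql \
  -maxdepth 2 -type d -name "telemetry_derived"
```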
Airflow runs across two GCP projects:
| Project | Purpose | Namespace |
|---|---|---|
| `moz-fx-dataservices-high-prod` | Airflow workers, scheduler | `telemetry-airflow-prod` |
| `moz-fx-data-airflow-gke-prod` | GKEPodOperator jobs (queries, scripts) | `default` |
Start with GCS logs via fetch-task-log. Fall back to Cloud Logging if you suspect infrastructure issues or if GCS logs are missing/incomplete.
| Aspect | GCS (fetch-task-log) | Cloud Logging |
|---|---|---|
| Content | Complete Airflow task logs (same as UI) | Raw container stdout/stderr |
| Retention | 360 days | 30 days |
| Best for | Task failures (SQL errors, exceptions) | Pod-level issues (OOM kills, scheduling failures) |
Airflow scheduler/worker logs:
```bash
gcloud logging read 'resource.type="k8s_container" AND resource.labels.namespace_name="telemetry-airflow-prod" AND textPayload=~"<DAG_ID>"' \
  --project=moz-fx-dataservices-high-prod \
  --limit=200
```
GKEPodOperator job logs (query execution errors):
```bash
gcloud logging read 'resource.type="k8s_container" AND resource.labels.namespace_name="default" AND textPayload=~"<DAG_ID>"' \
  --project=moz-fx-data-airflow-gke-prod \
  --limit=200
```
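For pod-level failures that never reach the task log (OOM kills, pods that failed to schedule), filtering by severity instead of DAG ID can surface the relevant events. A sketch; the severity threshold and `--freshness` window are starting points, and the payload fields depend on the cluster's logging config:

```bash
# Sketch: surface recent pod-level errors in the job cluster.
gcloud logging read \
  'resource.type="k8s_container" AND resource.labels.namespace_name="default" AND severity>=ERROR' \
  --project=moz-fx-data-airflow-gke-prod \
  --freshness=1d \
  --limit=50
```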
Useful links:

- Airflow UI: https://workflow.telemetry.mozilla.org/home
- DAG grid: https://workflow.telemetry.mozilla.org/dags/<dag_id>/grid
- Task in a specific run: https://workflow.telemetry.mozilla.org/dags/<dag_id>/grid?dag_run_id=<run_id>&task_id=<task_id>
- Grafana Airflow overview: https://earthangel-b40313e5.influxcloud.net/d/airflow-overview
- Grafana task duration: https://earthangel-b40313e5.influxcloud.net/d/airflow-task-duration

A runbook may exist for the pipeline; search Confluence by `<dag_id>` or the pipeline area (e.g. "main_summary runbook"). If a Confluence MCP tool is available, use it to search directly. Otherwise, suggest the user check:

- https://mozilla-hub.atlassian.net/wiki/search?text=<dag_id>
- https://mozilla-hub.atlassian.net/wiki/search?text=airflow+runbook

Before investigating, search Bugzilla for existing tickets related to the failing DAG. This avoids duplicate work and surfaces prior context.
Use `scripts/search-bugzilla`; it is pinned to bugzilla.mozilla.org and defaults to the `[airflow-triage]` whiteboard filter:
```bash
scripts/search-bugzilla <dag_id>                     # Open [airflow-triage] bugs matching DAG name
scripts/search-bugzilla <dag_id> --status resolved   # Recently resolved matches
scripts/search-bugzilla <dag_id> --status all        # Both
scripts/search-bugzilla "<task_id>" --limit 10       # Narrow by task name
scripts/search-bugzilla --all "<keywords>"           # Drop the whiteboard filter
scripts/search-bugzilla <dag_id> --json              # Structured output
```
To read full context (change history + comments) for a bug ID the script returned, call `mcp__plugin_mozdata_moz__get_bugzilla_bug` with the bug ID. That goes through the Mozilla-hosted MCP server, so no direct network access is needed from the skill.
If a match is found, link it: `https://bugzilla.mozilla.org/show_bug.cgi?id=<bug_id>`

When investigating a failure, search for recent PRs that may have introduced the issue. Check these repositories:
| Repo | What it contains |
|---|---|
| `mozilla/bigquery-etl` | Query definitions, metadata.yaml, DAG generation |
| `mozilla/private-bigquery-etl` | Confidential ETL code |
| `mozilla/telemetry-airflow` | DAGs, operators, GKEPodOperator |
| `mozilla/lookml-generator` | LookML generation from bigquery-etl |
| `mozilla/probe-scraper` | Probe/metric definitions scraping |
Use the `gh` CLI to find PRs merged around the time of the failure. Focus on files related to the failing DAG/task:
```bash
# List recently merged PRs, then inspect their changed files for relevant paths
gh pr list --repo mozilla/bigquery-etl --state merged --limit 10 \
  --search "merged:>2026-04-10" --json number,title,mergedAt,url

# Search PRs mentioning a DAG or table name
gh search prs --repo mozilla/bigquery-etl --merged-at ">2026-04-10" "<dag_id or table_name>"

# View a specific PR's changed files
gh pr view --repo mozilla/bigquery-etl <PR_NUMBER> --json files,title,body
```
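Since the repos are checked out locally (see above), git history is often the fastest way to find changes to a specific query. A sketch, with an illustrative path:

```bash
# Sketch: recent commits touching the failing query's source directory.
cd bigquery-etl
git log --since="7 days ago" --oneline -- \
  sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v6/
```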
If a suspect PR is found, include it in the investigation report with a link and a note on which changed files are relevant.
If the user provides a DAG name, skip straight to step 2. Only discover failures when no DAG name is given.
1. Run `../airflow-triage/scripts/get-triage-data` to discover failures (skip if the DAG name is already known).
2. Search Bugzilla with `scripts/search-bugzilla`. If a promising bug ID comes back, hydrate it with `mcp__plugin_mozdata_moz__get_bugzilla_bug` for history and comments. Link it and note prior context before continuing.
3. Run `scripts/fetch-task-log <dag_id> --list-runs` to find recent runs.
4. Run `scripts/fetch-task-log <dag_id> --list-tasks --run-id <run_id>` to list tasks in the failing run.
5. Run `scripts/fetch-task-log <dag_id> <task_id> <run_id> --tail 100` to get the error.
6. Check for recent suspect PRs with the `gh` CLI (see "GitHub PR Investigation" above).
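Put together, a typical session looks like this sketch; the DAG, task, and run IDs are hypothetical, and the run-id format assumes Airflow's standard `scheduled__<timestamp>` convention:

```bash
# Hypothetical investigation of a failing bqetl DAG; all IDs illustrative.
scripts/search-bugzilla bqetl_main_summary
scripts/fetch-task-log bqetl_main_summary --list-runs
scripts/fetch-task-log bqetl_main_summary --list-tasks \
  --run-id "scheduled__2026-04-12T02:00:00+00:00"
scripts/fetch-task-log bqetl_main_summary \
  telemetry_derived__main_summary__v4 \
  "scheduled__2026-04-12T02:00:00+00:00" --tail 100
```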
When reporting findings, rate your confidence and flag what needs human investigation:
| Level | Meaning | Example |
|---|---|---|
| High confidence | Root cause is clear from logs and source code | SQL syntax error with exact line, permission denied on a specific table |
| Medium confidence | Likely cause identified but could not fully verify | Timeout that could be query performance or upstream delay |
| Low confidence | Symptoms observed but root cause is unclear | Intermittent failure with no clear error, infra-level issue |
Always be explicit about which confidence level applies and what still needs a human to verify.