Generate Mozilla Airflow triage summaries. Use when user asks to: triage Airflow failures, generate a daily Airflow status update, create a Slack triage message, check which DAGs are new/ongoing/resolved, or summarize Airflow incidents for a given time period.
Install:
npx claudepluginhub akkomar/mozdata-claude-plugin --plugin mozdata
You generate concise triage summaries of Mozilla Airflow failures, categorized as ongoing, resolved, or new, in a Slack-ready format.
scripts/run-triage wraps the full pipeline. Use it for normal runs; call the
individual scripts below only when you need intermediate JSON for debugging.
# Default run (last 24h)
scripts/run-triage
# Custom window, verbose stage logs to /tmp/triage.log
scripts/run-triage --since 48h -v
# After weekends
scripts/run-triage --since 3d
# Historical triage
scripts/run-triage --as-of 2025-04-13
The four underlying stages (get-triage-data → get-bugs → auto-investigate → generate-slack-message) are still available individually for debugging:
scripts/get-triage-data | scripts/get-bugs # Just failures + bugs
scripts/get-triage-data | scripts/get-bugs | scripts/auto-investigate # + investigation results
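Chained by hand, the full pipeline is equivalent to a run-triage invocation (the output path here is illustrative):
scripts/get-triage-data | scripts/get-bugs | scripts/auto-investigate | scripts/generate-slack-message --out /tmp/airflow-triage.txt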
scripts/get-triage-data finds currently-failing and anomalously-slow tasks. It uses Cloud Logging for discovery (catching every failure attempt, including retries), BQ for classification and exclusions, and GCS for verification.
How it works:
- Discovery — scans Cloud Logging for "Marking task as FAILED" and "Marking task as UP_FOR_RETRY" events within the --since window (a query sketch follows this list). Unlike BQ (which only records final task state), this catches failures that were retried successfully.
- Recovery filtering — uses "Marking task as SUCCESS" events to filter out tasks that have recovered since failing.
- Exclusions — DAGs tagged triage/no_triage and DAGs that are paused or inactive.
- Categorization — new or ongoing, based on whether failures existed before the --since window.
- Resolved — tasks that failed within --since but have since recovered (combines BQ history with Cloud Logging success events).
- Sensor handling — filters out wait_for_ sensor tasks (which fail because their upstream task failed) and annotates root-cause tasks with the list of downstream DAGs they are blocking.
- Owners — read from metadata.yaml in a local bigquery-etl checkout (falls back to DAG-level owners).

Cloud Logging has 30-day retention. For --as-of queries older than 30 days,
the script automatically falls back to BQ-only discovery.
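The discovery query looks roughly like this (a minimal sketch, assuming gcloud access to the Composer environment's GCP project; the project ID below is hypothetical):

# Hypothetical project ID; substitute the real Composer project.
gcloud logging read \
  'textPayload:"Marking task as FAILED" OR textPayload:"Marking task as UP_FOR_RETRY"' \
  --project=moz-fx-data-airflow-prod --freshness=1d --limit=20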
scripts/get-triage-data # Last 24 hours (default)
scripts/get-triage-data --since 48h # Last 48 hours
scripts/get-triage-data --since 3d # Last 3 days
scripts/get-triage-data --as-of 2025-04-13 # Historical triage (skips GCS verification + slow detection)
scripts/get-triage-data --slow-threshold 5 # Flag tasks running 5x longer than avg
scripts/get-triage-data --no-slow # Skip slow-running task detection
scripts/get-triage-data --bqetl-repo /path # Custom bigquery-etl path (default: ~/bigquery-etl)
Output is a JSON array with entries like:
[
{
"dag_id": "bqetl_braze",
"task_id": "checks__fail_braze_derived__products__v1",
"last_failure": "2026-04-14T05:02:01Z",
"first_failure": "2026-04-10T05:00:00Z",
"failure_count": 3,
"owners": "user@mozilla.com",
"owner": "user@mozilla.com",
"category": "ongoing",
"issue_type": "failed"
},
{
"dag_id": "bqetl_main_summary",
"task_id": "telemetry_derived__clients_daily__v6",
"run_id": "scheduled__2026-04-14T02:00:00+00:00",
"start_date": "2026-04-14T02:05:00Z",
"elapsed_seconds": 14400,
"avg_duration_seconds": 3600.0,
"duration_ratio": 4.0,
"sample_count": 25,
"issue_type": "slow"
}
]
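If jq is available, a quick way to separate the two issue types from this output (a convenience sketch, not part of the pipeline):
scripts/get-triage-data | jq '[.[] | select(.issue_type == "failed")]'  # failures only
scripts/get-triage-data | jq '[.[] | select(.issue_type == "slow")]'   # slow tasks only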
scripts/get-bugs fetches open and recently resolved [airflow-triage] Bugzilla bugs, then enriches
failures with matching bug links. Designed to be piped from get-triage-data.
scripts/get-triage-data | scripts/get-bugs # Pipe from get-triage-data
scripts/get-bugs --failures failures.json # Or read from file
scripts/get-bugs --failures failures.json --since 24h # Limit bug lookup window while matching failures
Output is a JSON object with four arrays:
{
"ongoing": [{ "dag_id": "...", "bug_id": 12345, "bug_url": "...", ... }],
"new": [{ "dag_id": "...", ... }],
"resolved": [{ "dag_id": "...", "bug_id": 67890, "bug_url": "...", ... }],
"slow": [{ "dag_id": "...", "duration_ratio": 4.0, ... }]
}
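To sanity-check category sizes before investigating, a jq one-liner over this object (assuming the schema above):
scripts/get-triage-data | scripts/get-bugs | jq 'map_values(length)'  # e.g. {"ongoing": 2, "new": 1, "resolved": 0, "slow": 1}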
scripts/auto-investigate digs into new failures: it fetches detailed GCS logs, maps task IDs to source files in bigquery-etl, searches GitHub for recent suspect PRs, and generates draft descriptions for the Slack message.
Ongoing failures with existing bugs are passed through with the bug summary as the description — no re-investigation needed.
scripts/get-triage-data | scripts/get-bugs | scripts/auto-investigate
scripts/auto-investigate --failures triage.json --bqetl-repo ~/bigquery-etl
Adds these fields to each item:
- description — draft one-liner using the error snippet directly (no regex classification)
- error_lines — key lines from the log for bug filing
- source_path — path to the query file in bigquery-etl (if found)
- suspect_prs — recent merged PRs touching the source file

scripts/generate-slack-message generates copy-pasteable Slack message blocks from investigated triage data. It produces a main message and one thread per DAG, with links to the Airflow UI and Bugzilla.
# Always pass --out so the blocks land in a file with intact URLs.
scripts/auto-investigate --failures triage.json | scripts/generate-slack-message --out /tmp/airflow-triage.txt
scripts/generate-slack-message --failures investigated.json --out /tmp/airflow-triage.txt --date 2026-04-15
Output is plain text blocks separated by ---:
:airflow: Airflow triage YYYY-MM-DD

Format is composer-friendly, not mrkdwn-API. DAG names and task IDs are
shown as *bold* text; URLs sit on their own lines so Slack's composer
auto-links them on paste. The <url|label> mrkdwn syntax is intentionally
not used — it only works for messages posted via the Slack API; pasting it
into the composer produces broken URL-encoded links.
Why --out matters: Terminal UIs (including Claude Code) wrap long URLs
mid-string when displaying stdout. A broken URL doesn't auto-link in Slack
either. --out writes a verbatim copy to a file alongside stdout — always
copy-paste from the file, not from the terminal.
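To peek at the main message without opening the whole file (for orientation only; still copy from the file itself):
sed -n '1,/^---/p' /tmp/airflow-triage.txt  # first block = main message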
When filing a new bug for a failing task, construct the URL by filling in the placeholders in this template:
https://bugzilla.mozilla.org/enter_bug.cgi?assigned_to=nobody%40mozilla.org&bug_ignored=0&bug_severity=--&bug_status=NEW&bug_type=defect&cf_fx_iteration=---&cf_fx_points=---&comment=Airflow%20task%20<DAG_ID>.<TASK_ID>%20failed%20for%20exec_date%20<EXEC_DATE>%0A%0ATask%20link%3A%0A<TASK_LINK>%0A%0ALog%20extract%3A%0A%60%60%60%0A<ERROR_LOG>%0A%60%60%60&component=General&contenttypemethod=list&contenttypeselection=text%2Fplain&defined_groups=1&filed_via=standard_form&flag_type-4=X&flag_type-607=X&flag_type-800=X&flag_type-803=X&flag_type-936=X&form_name=enter_bug&maketemplate=Remember%20values%20as%20bookmarkable%20template&op_sys=Unspecified&priority=--&product=Data%20Platform%20and%20Tools&rep_platform=Unspecified&short_desc=Airflow%20task%20<DAG_ID>.<TASK_ID>%20failed%20for%20exec_date%20<EXEC_DATE>&status_whiteboard=%5Bairflow-triage%5D&target_milestone=---&version=unspecified
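One hedged way to produce URL-encoded values for the placeholders listed below (the enc helper and sample values are illustrative, not part of the skill's scripts; assumes python3 is on PATH):

# Hypothetical helper: URL-encode a single value.
enc() { python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1], safe=""))' "$1"; }
DAG_ID=$(enc "bqetl_braze")
TASK_ID=$(enc "checks__fail_braze_derived__products__v1")
EXEC_DATE=$(enc "2026-04-14T02:00:00+00:00")
# ...then substitute each value into the template string.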
Placeholders to URL-encode and fill in:
- <DAG_ID> — the dag_id
- <TASK_ID> — the task_id
- <EXEC_DATE> — execution date from the run_id
- <TASK_LINK> — Airflow UI link: https://workflow.telemetry.mozilla.org/dags/<dag_id>/grid?dag_run_id=<run_id>&task_id=<task_id>
- <ERROR_LOG> — key error lines from the log (keep brief, ~5-10 lines)

Run the wrapper (auth is checked automatically — if it fails with exit code 2,
tell the user to run gcloud auth login and retry):
scripts/run-triage --since <timeframe> -v
For historical triage:
scripts/run-triage --as-of <date> -v
This produces copy-pasteable Slack message blocks (main message + per-DAG threads)
with all failures categorized, investigated, and linked to Airflow UI and Bugzilla.
Stage stderr is captured in /tmp/triage.log.
Present each failure as a numbered block:
1.
Task: <dag_id>.<task_id>
URL: https://workflow.telemetry.mozilla.org/dags/<dag_id>/grid?dag_run_id=<url_encoded_run_id>&task_id=<task_id>&tab=logs
Error: "<error snippet>"
Category: new|ongoing
Owner: <owner emails>
Bug: <bugzilla_url or "none">
After presenting the list, STOP and ask the user two questions before proceeding:
1. Which failures (if any) should be investigated further?
2. Should bugs be filed for the new failures that don't already have one?
Wait for the user to respond before continuing.
For each failure the user selects, use the airflow-debugging skill's scripts:
- ../airflow-debugging/scripts/fetch-task-log <dag_id> <task_id> <run_id> --tail 100
- ../airflow-debugging/scripts/fetch-task-log <dag_id> --list-tasks --run-id <run_id>
- gh CLI

See ../airflow-debugging/SKILL.md for the full investigation workflow,
including Cloud Logging fallback queries, GCP project/namespace details,
and the structured response format.
For each failure the user wants to file a bug for, construct the Bugzilla URL (see template above) with all placeholders filled in. Present the URL to the user. After they file it, ask for the bug ID.
Skip this phase if the user said no to bug filing, or if all failures already have bugs.
Do not show Slack blocks until Phases 3 and 4 are complete (or skipped by the user). This ensures all bug IDs and corrected descriptions are included.
The pipeline in Phase 1 already produces Slack output via scripts/generate-slack-message
and writes it to /tmp/airflow-triage-YYYYMMDD.txt by default (see --out).
If data changed during investigation or bug filing (new bug IDs, corrected descriptions),
re-run generate-slack-message --out <path> with updated JSON, or manually adjust the blocks
and rewrite the file.
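For example, to attach a freshly filed bug to one new failure and regenerate (the bug ID and file names are hypothetical; field names follow the get-bugs schema above):
jq '(.new[] | select(.task_id == "checks__fail_braze_derived__products__v1")) += {bug_id: 1234567, bug_url: "https://bugzilla.mozilla.org/show_bug.cgi?id=1234567"}' investigated.json > updated.json
scripts/generate-slack-message --failures updated.json --out /tmp/airflow-triage-20260415.txt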
Present the output to the user as copy-pasteable blocks (main message + one thread per DAG). Always end the response with the file path, e.g.:
Copy-paste from /tmp/airflow-triage-YYYYMMDD.txt (use open or cat on macOS). Do not copy from the displayed blocks above: terminal line-wrapping breaks long URLs, and a broken URL won't auto-link in Slack.

A typical main message looks like:

:airflow: Airflow triage <DATE>
*<dag_id>* (owner: @<slack_handle>): *<task_id>* task for <exec_date>. (<bugzilla_url>)
*<dag_id>* (owner: @<slack_handle>): *<task_id>* task for <exec_date>.
The output has blocks separated by ---. Present each block separately so the
user can post the main message first, then reply with each thread.
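To pull out a single block for posting, a gawk sketch (multi-character record separators are a gawk extension; the block index is illustrative):
gawk 'BEGIN { RS = "\n---\n" } NR == 2' /tmp/airflow-triage-YYYYMMDD.txt  # second block = first thread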
When there are no new issues at all, the main message can say:
:airflow: Airflow triage <DATE>
No new issues in Airflow so far today :party-chewbacca:
../airflow-debugging/SKILL.md — full investigation workflow with fetch-task-log, Cloud Logging queries, confidence assessments, and structured reporting