Search everything...

Skill

investigate-data-alerts

Batch-fix Dagster/dbt data alerts: triage, deduplicate by DAG lineage, investigate in parallel with team agents, verify locally, PR, and post Slack summary.

npx claudepluginhub artemis-xyz/theclauu --plugin theclauu

Tool Access

This skill uses the workspace's default tool permissions.

Preview

Batch workflow for triaging, investigating, and fixing Dagster/dbt data alerts. Uses team agents for parallel investigation, DAG lineage for upstream dedup, and local Dagster for verification.

SKILL.md

Similar Skills

design-system

167.4k

Generates design tokens/docs from CSS/Tailwind/styled-components codebases, audits visual consistency across 10 dimensions, detects AI slop in UI.

team-skills-platform

ui-demo

167.4k

Records polished WebM UI demo videos of web apps using Playwright with cursor overlay, natural pacing, and three-phase scripting. Activates for demo, walkthrough, screen recording, or tutorial requests.

team-skills-platform

kotlin-patterns

167.4k

Delivers idiomatic Kotlin patterns for null safety, immutability, sealed classes, coroutines, Flows, extensions, DSL builders, and Gradle DSL. Use when writing, reviewing, refactoring, or designing Kotlin code.

team-skills-platform

Stats

Parent Repo Stars0

Parent Repo Forks0

Last CommitApr 23, 2026

Actions

View Source View Plugin View on GitHub View README

investigate-data-alerts | theclauu | ClaudePluginHub

Back to Skills

Skill

investigate-data-alerts

From theclauu

Batch-fix Dagster/dbt data alerts: triage, deduplicate by DAG lineage, investigate in parallel with team agents, verify locally, PR, and post Slack summary.

npx claudepluginhub artemis-xyz/theclauu --plugin theclauu

Tool Access

This skill uses the workspace's default tool permissions.

Preview

Batch workflow for triaging, investigating, and fixing Dagster/dbt data alerts. Uses team agents for parallel investigation, DAG lineage for upstream dedup, and local Dagster for verification.

SKILL.md

Fix Data Alerts

Batch workflow for triaging, investigating, and fixing Dagster/dbt data alerts. Uses team agents for parallel investigation, DAG lineage for upstream dedup, and local Dagster for verification.

Persona: Senior data engineer running the on-call data alerts rotation. Methodical, evidence-driven, zero wasted work.

Job Ignore List

These jobs/alerts are out of scope for this skill. Silently skip them during intake — do not investigate, do not count them in triage totals, do not mention them unless the user asks.

Pattern	Reason
`nextgen`	Nextgen pipeline — separate workflow
`modal`	Modal pipeline failures — handled by `/investigate-app`
`duckdb`	DuckDB jobs — different infra
`postgres`	Postgres jobs — different infra
`snowflake_query`, `snowflake_task`	Snowflake-native tasks — not Dagster/dbt
`adhoc_`, `_adhoc_`, `_adhoc`	Ad-hoc jobs — not production

Matching rules:

Case-insensitive
Match against both the job name and the alert title/message
* is a glob wildcard (matches any characters)
If an alert mentions multiple jobs and ALL are ignored, skip the entire alert
If an alert mentions a mix of ignored and non-ignored jobs, keep it but only investigate the non-ignored jobs

To update this list, edit this section directly. No code changes needed — the skill reads this table at runtime.

Procedure

Follow these steps exactly in order.

Step 0: Date and Environment Setup

Capture today's date — used for branch names and Slack summary:

DATE=$(date +%Y-%m-%d)
echo "Session date: $DATE"

Hardcode the absolute paths for the session (shell vars do NOT persist between Bash calls):

WORKSPACE=~/DuneAnalytics/dex_trades
DBT_REPO=$WORKSPACE/dbt
GOKU_REPO=$WORKSPACE/gokustats-back-end

Step 0.5: Authenticate Dagster Cloud + Slack

Two credentials are required for auto-scan:

Dagster Cloud API token (primary — used to pull real run logs):

source $GOKU_REPO/.activate && test -n "$DAGSTER_CLOUD_API_TOKEN" && echo "OK" || echo "MISSING"

If missing, ask the user to generate a user token at https://artemis.dagster.cloud/prod/user-settings and persist it:

echo 'export DAGSTER_CLOUD_API_TOKEN="user:PASTE_HERE"' >> $GOKU_REPO/.env.local

(They can also !export DAGSTER_CLOUD_API_TOKEN="user:..." in the chat, but that doesn't propagate to Bash subprocesses — .env.local + source .activate is the reliable path.)

Slack token (for alert discovery):

test -f ~/.config/slack/token && echo "OK" || echo "MISSING"

Needs scopes channels:history, channels:read. If missing, ask the user to install one at ~/.config/slack/token.

DO NOT rely on scripts/download_alert_logs.py or stdout .txt attachments in Slack threads. The Dagster bot no longer attaches stdout files; threads now contain only Milo investigation replies. The script will report "0 files downloaded" and be useless. See "Learnings" section for history.

Step 1: Intake — Collect Alerts

Accept alerts in one of three ways (check in order):

A. Auto-Scan #data-alerts via Slack + Dagster GraphQL (most automated — use when user says "fix data alerts" with no input):

The flow is: Slack → failure thread list → run_ids → Dagster GraphQL → real error chain + stdout.

1. Enumerate failure threads from Slack (last 24h, channel C05S8H76M08)

Write a short Python helper that:

Calls conversations_history(channel="C05S8H76M08", oldest=<24h_ago>)
Filters messages containing "failed", "Error:", :red_circle:
For each, fetches conversations_replies(ts=thread_ts) to capture Milo's replies
Extracts run_id (UUID pattern) and job_name (regex "([a-z][a-z0-9_]*_job)") from parent text
Applies the job ignore list (nextgen, modal, duckdb, postgres, snowflake_query, snowflake_task, adhoc_)

Save parsed threads to ~/data-alerts-logs/<DATE>/threads.json for reuse.

2. Pull real run details from Dagster Cloud GraphQL (primary source of truth)

For each run_id, query https://artemis.dagster.cloud/prod/graphql with header Dagster-Cloud-Api-Token: $DAGSTER_CLOUD_API_TOKEN:

query ($runId: ID!, $cursor: String) {
  runOrError(runId: $runId) {
    __typename
    ... on Run {
      runId jobName status startTime endTime
      stepStats { ... on RunStepStats { stepKey status } }
      eventConnection(limit: 1000, afterCursor: $cursor) {
        cursor hasMore
        events {
          __typename
          ... on ExecutionStepFailureEvent {
            stepKey
            error {
              message className
              errorChain { error { message className } isExplicitLink }
            }
          }
        }
      }
    }
    ... on RunNotFoundError { message }
  }
}

Critical schema facts (schema drift has caused 400s in prior runs):

Run field is eventConnection (NOT logsForRun / pipelineRunLogs)
eventConnection.limit max is 1000; paginate via afterCursor if hasMore is true
message/level on events require inline fragment on MessageEvent (interface)
errorChain on PythonError is essential for wrapped errors — RetryRequestedFromPolicy and DagsterExecutionStepExecutionError wrap the real cause (e.g. Forbidden: 403 Access Denied, ErrorDuneQueryException, AssertionError: No records found for mony...). Without errorChain you only see the wrapper.

3. Optionally fetch stdout tails for each failed step via capturedLogsMetadata:

query ($logKey: [String!]!) {
  capturedLogsMetadata(logKey: $logKey) {
    stdoutDownloadUrl stderrDownloadUrl
  }
}

logKey is [run_id, "compute", step_key]. Then GET the stdoutDownloadUrl (same Dagster-Cloud-Api-Token header) — URL is a presigned S3 link. Keep only the tail (last ~6-8KB) — errors cluster at the end.

Save results to ~/data-alerts-logs/<DATE>/dagster_direct/<NN>_<job>_<run_id[:8]>.txt for reuse.

4. (Optional) Huntress MCP as secondary lookup — works but is behind a 6h sync cron (cron: "15 1,7,13,19 * * *" UTC). Newest runs are typically "Run not found — may not have been synced yet." Only useful if Dagster GraphQL is down. Tool: mcp__huntress__get_run_details(run_id, include_stdout=true). MCP key rotates — if keys are stale, regenerate at https://huntress.vercel.app/setup.

B. Slack Thread URL (user provides a specific thread): Extract channel + message_ts from URL, then follow path A from step 2 onward.

C. Pasted Dagster Output: If user pastes raw stdout/stderr, parse it directly.

D. Manual Description: If user describes failures in plain text, extract job names + run_ids.

Apply Ignore List

After collecting all alerts, immediately filter out ignored jobs using the patterns in the "Job Ignore List" section above. For each alert:

Extract the job name from the alert
Match against every pattern in the ignore list (case-insensitive glob matching)
If the job matches ANY ignore pattern, silently drop it
If an alert thread contains multiple jobs, keep only the non-ignored ones

Report how many alerts were filtered:

Collected N alerts from #data-alerts
Filtered out M ignored jobs (nextgen: X, modal: Y, adhoc: Z, ...)
Proceeding with K alerts for investigation

Extract Fields

For each non-ignored alert, extract:

Field	Source
Job name	`dagster job execute -j <name>` or Slack alert title
Failed model(s)	dbt error output: `Model <name>` or `Error in model <name>`
Error type	compilation, runtime, test failure, freshness, timeout
Error message	The specific error text
Timestamp	When the failure occurred
Slack thread URL	For linking back in the fix summary

Store parsed alerts in a structured list for triage.

Step 2: Triage — Categorize and Deduplicate

2a. Auto-Categorize

Tag each alert:

Category	Signal	Investigation Path
`compilation`	"Compilation Error", syntax errors	Check model SQL + macros
`runtime`	"Runtime Error", "Database Error"	Check data types, upstream tables
`test-failure`	"Failure in test"	Check test config + data
`freshness`	"Freshness check", "stale"	Check upstream extract jobs
`timeout`	"max_runtime_seconds", "timed out"	Check query plan, data volume
`crash`	Python traceback, OOM	Check Dagster job config

2b. DAG Lineage Dedup

This is critical to avoid duplicative investigation work. Before spawning investigators, determine which failures are cascading from a common upstream root cause.

For each failed model, trace upstream dependencies:

cd $DBT_REPO && source $WORKSPACE/dbt/venv/bin/activate && set -a && source $WORKSPACE/dbt/.env.local && set +a && dbt ls --resource-type model --select +<failed_model> --output json 2>/dev/null | jq -r '.unique_id' 2>/dev/null

Build a dependency graph across all failed models
Dedup rule: If model A depends on model B (directly or transitively) and BOTH failed, group them — only investigate model B (the upstream root). Model A's failure is likely a cascade.
Error similarity: If two models in different DAG branches have identical error messages (e.g., same macro bug), group them under one root cause.

Optional — Huntress MCP for lineage visualization: If configured, use the Huntress MCP at https://huntress.vercel.app/setup to query the full DAG lineage. This provides richer dependency data than dbt ls alone. If not configured, fall back to dbt ls (works fine for most cases).

2b.5. Check Milo's In-Flight PRs — Salvage Work

Milo (milo-bot) auto-files fix PRs for many alerts — always check before investigating from scratch.

For each thread, scan replies for Milo-filed PR URLs (github.com/Artemis-xyz/<repo>/pull/<num>). For each found:

gh pr view --repo Artemis-xyz/<repo> <num> --json state,mergedAt,title,reviewDecision,closedAt
# If CLOSED (not merged), inspect closing review:
gh pr view --repo Artemis-xyz/<repo> <num> --json comments --jq '[.comments[] | select(.author.login=="claude") ][-1].body'

Classify each Milo PR:

OPEN + all checks pass + MERGEABLE → candidate to merge. Still review the diff yourself before landing.
CLOSED (rejected) → read the reviewer's comment. Common rejection patterns (seen in practice):
- Masking the alert (widening test interval + downgrading severity) — "test will never block again"
- Non-deterministic dedup (QUALIFY ORDER BY using a column that ties across duplicates)
- Wrong target (addresses a symptom that isn't the actual failing step)
- When reimplementing, write the PR description explicitly to call out: "This doesn't repeat #NNNN's mistake because..."
MERGED → alert should be resolved already; sanity check if it's still firing.

Also check for in-flight PRs on coordinated branches (not from Milo):

# Is there an existing branch/PR that already fixes one of our groups?
cd $GOKU_REPO && git branch -a | grep -iE "stablecoin|usdt0|<other_keyword>"
cd $GOKU_REPO && gh pr list --state open --search "in:title <keyword>" --json number,title,headRefName

If found, do NOT create a parallel PR. Options:

Wait for it to merge, then investigate remaining alerts
Rebase the existing branch onto fresh origin/main, cherry-pick any additional fixes onto it
Coordinate with the PR author

Seen-in-practice example: PR #5397 for stablecoin sensor gating was open while a new batch of alerts for the same root cause came in — the fix was to extend #5397, not create #5397-v2.

2c. Git History — Recent Breakage Check

Before investigating, check if recent commits or PRs introduced the breakage:

# Check dbt repo for recent changes to affected models
cd $DBT_REPO && git log --oneline --since="3 days ago" -- models/**/<model_name>*.sql macros/**/*.sql

# Check gokustats for recent job/sensor changes
cd $GOKU_REPO && git log --oneline --since="3 days ago" -- artemis_dagster/jobs/ artemis_dagster/sensors/

# Check for recently merged PRs that touched affected areas
cd $DBT_REPO && gh pr list --state merged --limit 10 --json title,number,mergedAt,files
cd $GOKU_REPO && gh pr list --state merged --limit 10 --json title,number,mergedAt,files

If a recent PR clearly introduced the issue (e.g., PR merged 2 hours ago touching the exact failing model), flag it prominently in the triage output.

2d. Present Triage Results

Display the triage to the user before proceeding:

Data Alert Triage — <DATE>
═══════════════════════════════════════════════════════════════════
  Total alerts:           N
  Deduplicated groups:    M (after DAG lineage dedup)
  Categories:             X compilation, Y runtime, Z freshness

  Root Cause Groups:
  ─────────────────────────────────────────────────────────────────
  Group 1: <upstream_model> (compilation)
    Affects: model_a, model_b, model_c (cascade)
    Jobs:    daily_chain_job, daily_stablecoin_job
    Repos:   dbt
    Recent PR: #3301 merged 2h ago — LIKELY CAUSE

  Group 2: <model_x> (runtime)
    Affects: model_x only
    Jobs:    daily_defi_job
    Repos:   dbt, gokustats
    Recent PR: none found
  ─────────────────────────────────────────────────────────────────

  Proceed with investigation? (y/n, or adjust groups)
═══════════════════════════════════════════════════════════════════

Do NOT proceed until the user confirms the triage.

Step 3: Worktree Setup

Create worktrees in BOTH repos with the same branch name:

DATE=$(date +%Y-%m-%d)
BRANCH="<engineer>/data-alerts-$DATE"

# dbt worktree
cd $DBT_REPO && git worktree add -b "$BRANCH" "$WORKSPACE/dbt-worktrees/data-alerts-$DATE" origin/main

# gokustats worktree
cd $GOKU_REPO && git worktree add -b "$BRANCH" "$WORKSPACE/worktrees/data-alerts-$DATE/gokustats-back-end" origin/main

Capture absolute worktree paths for agent prompts:

DBT_WT=$WORKSPACE/dbt-worktrees/data-alerts-$DATE
GOKU_WT=$WORKSPACE/worktrees/data-alerts-$DATE/gokustats-back-end

Step 4: Investigate — Team Agents

Create a team for this alert batch:

TeamCreate: team_name="data-alerts-<DATE>", description="Fix data alerts batch <DATE>"

4a. Create Tasks

Create one task per deduplicated root-cause group from Step 2:

TaskCreate for each group:
  title: "Investigate: <root_cause_model> (<category>)"
  description: Full context including error logs, affected models, DAG lineage, git history findings

4b. Spawn Investigators

Spawn one general-purpose teammate per root-cause group. Each investigator gets the full investigate-dagster-error workflow inlined in their prompt.

Agent prompt template for each investigator:

You are investigating a Dagster/dbt data alert. Your name is "investigator-<N>".

TEAM: data-alerts-<DATE>
DBT WORKTREE: <DBT_WT> (absolute path — use this for ALL dbt file operations)
GOKU WORKTREE: <GOKU_WT> (absolute path — use this for ALL gokustats file operations)
MAIN DBT REPO: <DBT_REPO> (for git history only)
MAIN GOKU REPO: <GOKU_REPO> (for git history only)

## CRITICAL PATH RULES

EVERY Bash command targeting dbt: cd <DBT_WT> && <command>
EVERY Bash command targeting gokustats: cd <GOKU_WT> && <command>
EVERY Read/Edit/Write for dbt: use absolute paths under <DBT_WT>/
EVERY Read/Edit/Write for gokustats: use absolute paths under <GOKU_WT>/
Git history commands: use the MAIN repos (not worktrees)

## Environment Bootstrap

For dbt commands in the worktree:
source <DBT_REPO>/venv/bin/activate && set -a && source <DBT_REPO>/.env.local && set +a

For Snowflake queries:
snowsql -c artemis -q "<SQL>"
Always use PC_DBT_DB.PROD.<TABLE> FQN. Always include LIMIT.

## Alert Details

<paste the specific alert(s) for this root-cause group>

## Investigation Procedure

Follow these phases in order:

### Phase 1: Parse Error and Identify Affected Models
- Extract model names, error type, error message from the alert
- Locate model files in <DBT_WT>/models/
- Check if macros are involved (<DBT_WT>/macros/)

### Phase 2: Dependency Analysis
- Trace upstream: find all {{ ref('...') }} and {{ source('...') }} in the model
- Trace downstream: search for models that ref the affected model
- Check Dagster job config in <GOKU_WT>/artemis_dagster/jobs/

### Phase 3: Diagnostic SQL
- Table metadata: INFORMATION_SCHEMA.TABLES for affected tables
- Column types: INFORMATION_SCHEMA.COLUMNS if type mismatch suspected
- Reproduce the error with a targeted query (always LIMIT 100)
- Check Snowflake query history for recent errors on affected warehouses:
  ```sql
  SELECT query_text, error_code, error_message, warehouse_name, start_time
  FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
  WHERE error_code != '000000' AND warehouse_name = '<WH>'
    AND start_time >= DATEADD(day, -2, CURRENT_TIMESTAMP())
  ORDER BY start_time DESC LIMIT 20;

Phase 4: Git History

Recent changes to the model: git log in MAIN DBT REPO
Recent changes to related macros
Recent Dagster job changes: git log in MAIN GOKU REPO
Identify the commit/PR that introduced the bug

Phase 5: Root Cause Summary

Produce:

What failed (model + error type)
Why it failed (technical root cause)
When the bug was introduced (commit, PR, date)
The fix (specific code change needed)

Phase 6: Apply Fix

Edit the affected files IN THE WORKTREE (not main repo)
Each fix = one commit with a descriptive message following conventional commits: fix():
Do NOT push or create PRs — the orchestrator handles that
After committing, report back via SendMessage:
- What you fixed
- Files changed
- Commit hash
- Confidence level (High/Medium/Low)

Coordination

Check TaskList after completing your investigation
If you discover your root cause is the same as another group's, notify the team lead
Mark your task as completed via TaskUpdate when done


Launch all investigators in parallel:

Agent(subagent_type="general-purpose", team_name="data-alerts-", name="investigator-1", run_in_background=true, prompt=) Agent(subagent_type="general-purpose", team_name="data-alerts-", name="investigator-2", run_in_background=true, prompt=) ...


#### 4c. Monitor and Coordinate

- Wait for team agents to complete via automatic message delivery
- If two agents discover the same root cause, consolidate their fixes
- If an agent gets stuck, provide guidance via SendMessage
- Collect results: list of commits per worktree, files changed, confidence levels

---

### Step 5: Verify — Local Testing

After all fixes are committed in the worktrees, verify them.

#### 5a. dbt Compile Check

```bash
cd $DBT_WT && source $DBT_REPO/venv/bin/activate && set -a && source $DBT_REPO/.env.local && set +a && dbt compile --select <affected_models_space_separated>

5b. Run Affected Jobs Locally (dagjob)

For each affected Dagster job, run it using the dagjob alias:

cd $GOKU_WT && source $GOKU_REPO/.activate && dagjob <job_name>

If dagjob isn't available (worktree context), use the raw command:

cd $GOKU_WT && source $GOKU_REPO/.activate && dagster job execute -f artemis_dagster/primary_definitions.py -j <job_name>

5c. Spin Up Local Dagster for UI Verification

IMPORTANT: Never run make dagster — it loads ALL jobs, which is slow and unnecessary.

CRITICAL: Must unset AWS_ACCESS_KEY_ID before starting Dagster locally (conflicts with Snowflake auth).

Source env from the MAIN gokustats repo (worktrees lack venv/env), then launch with DAGSTER_ONLY_JOBS:

cd $GOKU_REPO && source .activate && unset AWS_ACCESS_KEY_ID && DAGSTER_ONLY_JOBS=<comma_separated_job_names> dagster dev -f artemis_dagster/primary_definitions.py

Or use the convenience script:

cd $GOKU_REPO && source .activate && unset AWS_ACCESS_KEY_ID && python scripts/dagster_local.py <job_name_1> <job_name_2> ...

Example:

cd $GOKU_REPO && source .activate && unset AWS_ACCESS_KEY_ID && DAGSTER_ONLY_JOBS=daily_bsc_job,daily_graph_job dagster dev -f artemis_dagster/primary_definitions.py

Note: Even with job filtering, asset loading takes ~5 minutes (all assets must load for job selection to resolve). This is normal.

Then tell the user:

Local Dagster is running with only the affected jobs. Please:
1. Open http://localhost:3000 in your browser
2. Navigate to the affected job(s): <list job names>
3. Trigger a run and verify success
4. Screenshot the successful run(s)
5. Tell me when you're done — I'll proceed to PR creation

Do NOT proceed until the user confirms verification is complete.

5d. Snowflake Spot-Check

After successful job runs, verify data landed correctly:

snowsql -c artemis -q "
SELECT table_name, row_count, last_altered
FROM PC_DBT_DB.INFORMATION_SCHEMA.TABLES
WHERE table_schema = 'PROD'
  AND table_name IN (<affected_table_names_upper>)
ORDER BY last_altered DESC;
"

For freshness alerts, also check the most recent records:

snowsql -c artemis -q "
SELECT MAX(block_timestamp) as latest_record, COUNT(*) as row_count
FROM PC_DBT_DB.PROD.<TABLE>
WHERE block_timestamp >= DATEADD(day, -2, CURRENT_TIMESTAMP())
LIMIT 1;
"

Present results to the user.

Step 6: User Gate — Approval

Present a summary of all fixes before PR creation:

Fix Summary — data-alerts-<DATE>
═══════════════════════════════════════════════════════════════════
  Root Cause Groups Fixed:  M / N

  dbt/ commits:
    abc1234  fix(stablecoin): correct column reference in agg model
    def5678  fix(ethereum): update macro for new schema

  gokustats-back-end/ commits:
    (none — all fixes were in dbt)

  Verification:
    dbt compile:     PASS
    dagjob runs:     3/3 PASS
    Snowflake check: Data landed, freshness OK

  Ready to create PRs? (y/n)
═══════════════════════════════════════════════════════════════════

Do NOT proceed until the user approves.

Step 7: Create PRs

One PR per repo. Each root-cause fix is already a separate commit.

7a. Push and PR for dbt

cd $DBT_WT && git push -u origin <engineer>/data-alerts-$DATE

cd $DBT_WT && gh pr create --title "fix: data alerts batch $DATE (dbt)" --body "$(cat <<'EOF'
## Summary
Batch fix for data alerts on <DATE>.

## Root Causes Fixed
<for each root-cause group:>
- **<model>** (<category>): <1-line description of fix>
  - Introduced by: PR #XXX (if known)
  - Commits: `abc1234`

## Files Changed
<list files>

## Testing
- [x] `dbt compile` — all affected models compile
- [x] Local Dagster job runs — all passed
- [x] Snowflake spot-check — data landed, freshness OK

## Affected Jobs
<list dagster job names>

---
Generated with Claude Code
EOF
)"

7b. Push and PR for gokustats (if changes exist)

Same pattern, only if there are commits in the gokustats worktree.

7c. Return PR URLs

PRs created:
  dbt:             <URL>
  gokustats:       <URL> (or "no changes needed")

Step 8: Slack Summary

Post a fix summary to #data-alerts using the Slack MCP "send message" tool.

Message format:

:white_check_mark: *Data Alerts Fixed — <DATE>*

*Alerts resolved:* N
*Root causes:* M

<for each root-cause group:>
:point_right: *<model_name>* (<category>)
> <1-line description of what broke and why>
> Fix: <1-line description of the fix>
> PR: <dbt PR URL> / <gokustats PR URL>

*Introduced by:* PR #XXX (if known)
*Verified:* Local Dagster runs + Snowflake spot-check

Send to channel #data-alerts (channel ID: C05S8H76M08).

If the original alerts were from a specific Slack thread, reply in that thread instead of posting a new top-level message. Use the Slack MCP "reply to thread" tool with the original thread timestamp.

Show the user the message before sending. Ask: "Post this to #data-alerts? (y/n, or edit)"

Step 9: Cleanup

After PRs are merged (not before):

# Remove worktrees
cd $DBT_REPO && git worktree remove $DBT_WT
cd $GOKU_REPO && git worktree remove $GOKU_WT
git worktree prune

# Delete local branches (remote branches auto-delete on PR merge)
cd $DBT_REPO && git branch -D <engineer>/data-alerts-$DATE
cd $GOKU_REPO && git branch -D <engineer>/data-alerts-$DATE

Do NOT run cleanup automatically. Tell the user: "Run /cleanup when the PRs are merged to remove the worktrees."

Step 10: Update the Learnings Log (REQUIRED)

Before closing the run, append at least one entry to the Session Log under ## Learnings Log — UPDATE THIS EVERY RUN at the bottom of this skill file (~/.claude/skills/investigate-data-alerts/SKILL.md).

Even if the run was uneventful, write one of:

A concrete learning (tooling quirk, API drift, new Milo rejection pattern, useful investigator prompt improvement)
An explicit "no new learnings — playbook worked as written" entry (so future agents know this ran without drift)

If the learning should ALWAYS apply from now on, promote it into the relevant Step above per the "How to promote an entry" procedure — and note the promotion in the log entry.

This step is not optional. Skills that are never updated rot against the underlying systems they describe.

Critical Rules for Dev Testing

NEVER run entire large jobs locally

For jobs with many models (e.g., daily_ez_metrics_job with 3000+ models):

Do NOT run dbt build -s tag:ez_metrics or similar broad selections locally
Instead, grep the stdout file for ERROR|FAIL lines to identify the specific failing models
Then investigate and fix only those specific models

Clone before running large incremental models in dev

Large incremental models (e.g., fact_bsc_transactions_v2 on BAM_TRANSACTION_XLG) will fail or take forever in dev without existing data. Always clone first:

# Option 1: Use the clone script
cd $DBT_REPO && python dbt_scripts/clone_object_into_dev_schema.py <model_name>

# Option 2: Manual clone via SQL (for tables in non-default databases)
snowsql -c artemis -q "CREATE TRANSIENT TABLE PC_DBT_DB.DEV_$USERNAME.<TABLE> CLONE PC_DBT_DB.PROD.<TABLE>;"

Then run the incremental model in dev — it will process only the delta.

Use ANALYTICS_XL warehouse for large dev runs

For models that normally run on large warehouses (XLG, XXL), override the warehouse in dev to avoid timeouts:

dbt run -s <model> --vars '{"snowflake_warehouse_override": "ANALYTICS_XL"}'

Or if the model doesn't support the override var, temporarily edit the config in the worktree.

Notes

User gates at Steps 2, 5c, and 6. Never auto-proceed past triage, verification, or PR approval.
One PR per repo, one commit per root cause. Clean git history for bisect/revert.
Worktrees in both repos always. Even if we think only dbt is affected — Dagster job config changes sometimes needed too.
Source main repo envs for worktrees. Worktrees lack .activate, venv, and .env. Always source from the main repo.
Slack stdout files are the primary error source. The alert message only says "Steps failed: ['open_source_snowflake_dbt_assets']" — the actual failing models and error messages are in the stdout .txt attachment. Always read these first.
Never display secrets. Mask any credentials that appear in logs.
Team agents are general-purpose. They need Bash, Read, Edit, Write for investigation and fixing. Never use Explore or Plan subagent types for fix work.
investigate-dagster-error phases are inlined in the agent prompt (Step 4b) rather than referenced as a skill, since team agents cannot invoke skills.

Investigation Principles

These principles override default behavior. Violations of these have caused bad fixes in the past.

Always fix at the source, never mask downstream

When a test fails (not_null, recency, etc.), trace the bad data upstream to its origin. Do NOT add WHERE x IS NOT NULL filters in downstream models to suppress the symptom. The user will reject fixes that mask the root cause. Ask: "Why does this bad data exist in the first place?" and fix that.

Always verify Dagster job existence before blaming test config

When recency tests fail, check whether a Dagster job actually exists for the failing model's asset group. Common pattern: dbt models exist and have tags, but no define_asset_job selects their group — the fact tables never get materialized. The fix is creating/fixing the job, not adjusting test thresholds or adding ignore_time_component.

Always check upstream job timing

When creating or modifying Dagster job schedules, trace all upstream {{ ref() }} and {{ source() }} dependencies and verify their jobs complete before the new schedule fires. Steps:

Read model SQL → extract refs
Determine which Dagster group each ref belongs to (via CustomDagsterDbtTranslator.get_group_name())
Find that group's job definition and cron schedule
Ensure upstream completes before downstream starts

Always check for orphaned Dagster groups

The CustomDagsterDbtTranslator.get_group_name() assigns groups via tag first, then falls back to directory path. A model at models/staging/foo/bar.sql gets group foo automatically. If no daily_foo_job = define_asset_job(... groups("foo")) exists, those assets are orphaned — they have a group but no job runs them. When investigating stale data, always check this.

Present triage with root cause hypothesis before fixing

Never jump to applying fixes. Present the triage with your root cause hypothesis and let the user validate. Common failure mode: applying a surface-level fix (test config change, WHERE filter) when the actual issue is infrastructure (missing job, wrong schedule, dead source).

Use `errorChain`, not just `error.message`

Dagster wraps errors: RetryRequestedFromPolicy wraps the real cause; DagsterExecutionStepExecutionError wraps an op's Python exception. The top-level error.message is the wrapper ("Exceeded max_retries of 1"). The real cause (e.g. Forbidden: 403 Access Denied, ErrorDuneQueryException: Dune Query NNNN failed, AssertionError: No records found for mony...) is in error.errorChain[0].error.message. Always request errorChain on the GraphQL query.

Sensor-based gating beats cron-based coordination

When an alert's root cause is "downstream cron fired before upstream finished landing," the fix is a completion sensor (make_dbt_models_updated_sensor or make_job_completion_sensor), not a later cron or a test interval widen. Test widens mask the problem; cron tuning is brittle. See the d72b947cb / PR #5397 pattern.

Worktree Environment Gotchas

These have broken the skill's pre-commit / push steps in the past.

Worktrees lack `.env` files and venv

Worktrees under $WORKSPACE/worktrees/ or $WORKSPACE/dbt-worktrees/ don't have .env.local, .env.shared, .env.1pass, or venv-artemis. Before any git operation that triggers pre-commit hooks (which import Django), source env + venv from the MAIN repo:

MAIN=$GOKU_REPO  # or $DBT_REPO
cd $MAIN && set -a && source .env.shared && source .env.local 2>/dev/null; set +a
source $MAIN/venv-artemis/bin/activate
cd $WORKTREE && git commit ...   # hooks now have SECRET_KEY + python

Without this, the django-migration-lint pre-commit hook fails with Executable 'python' not found or ImproperlyConfigured: SECRET_KEY must not be empty.

Cherry-pick cascade when target branch is stale

When rebasing fixes onto an existing PR branch, first git merge origin/main into that branch. Example pitfall: Branch X was cut before PR Y merged; PR Y added the exact lines your cherry-pick touches. Without merging main first, your cherry-pick applies against pre-Y state and becomes a no-op diff (or a confusing merge-commit resolution at the final merge).

Correct order:

cd <target_worktree>
git reset --hard origin/<target_branch>   # drop stale local commits first
git merge origin/main --no-edit           # bring in everything since branch cut
git cherry-pick <fix_commit>              # now applies cleanly
# resolve conflicts if any, then push

Learnings Log — UPDATE THIS EVERY RUN

This section is a living log. Every invocation of /investigate-data-alerts MUST append at least one entry to "Session Log" before closing, even if the run was uneventful. Before you end the run, ask yourself:

Did I discover something about tooling that isn't documented above? → new entry under Tooling
Did I adjust the investigation playbook mid-flight? → new entry under Playbook adjustments
Did a Milo PR teach me a new rejection pattern? → new entry under Milo PR rejection patterns
Did the Dagster/dbt/Huntress API behave differently than expected? → new entry under API drift

If an entry here becomes permanent behavior (i.e., it should always happen), promote it into the relevant Step above and delete the log entry. The log is ephemeral knowledge; the Steps are canonical.

Session log

2026-04-21 — Dagster Slack bot no longer attaches dagster_stdout_*.txt to alert threads. The scripts/download_alert_logs.py helper (on stash in fix/huntress-internal-token-auth branch) returns 0 files. Threads now contain only Milo auto-investigation replies. Bypass: use Dagster GraphQL per-run_id. Promoted to Step 0.5 / Step 1.
2026-04-21 — Dagster GraphQL schema: logsForRun and pipelineRunLogs don't exist at the root; Run.eventConnection is the correct path. eventConnection.limit max is 1000 — paginate via afterCursor. DagsterRunEvent is an interface; message/level need inline fragment on MessageEvent. Promoted to Step 1.A.2.
2026-04-21 — PythonError.errorChain is required to see the actual cause wrapped by RetryRequestedFromPolicy or DagsterExecutionStepExecutionError. Without it you only see wrapper text like "Exceeded max_retries of 1." Promoted to Investigation Principles.
2026-04-21 — Huntress MCP get_run_details works but is 6h-stale (sync cron 15 1,7,13,19 * * * UTC). For fresh alerts, Dagster GraphQL is primary; Huntress is fallback. The /dagster/sync-run HTTP endpoint on Huntress returned 502 (provider not configured on the web service container — only on cron container). Promoted to Step 1.A.4.
2026-04-21 — DAGSTER_CLOUD_API_TOKEN is NOT in .env.1pass. Generate a user token at https://artemis.dagster.cloud/prod/user-settings, write to $GOKU_REPO/.env.local, source via .activate. The !export chat trick does not propagate to Bash-tool subprocesses — file persistence is reliable. Promoted to Step 0.5.
2026-04-21 — Milo filed 5 PRs for the batch; 3 were CLOSED (rejected) by claude[bot] reviewer. Common rejection patterns: (1) widening test interval + downgrading severity (masks the alert permanently); (2) non-deterministic QUALIFY ORDER BY when tiebreak column ties across dups; (3) fix targets a symptom in a different model than the actual failure. Promoted to Step 2b.5.
2026-04-21 — When reimplementing after a rejected Milo PR, explicitly address the rejection reason in the PR description ("This doesn't repeat #NNNN's mistake because..."). This shortcuts re-review. Seen in dbt#3872.
2026-04-21 — For byte-identical duplicate rows, dedup ORDER BY should list every payload column — not because any column breaks the tie (none does), but to make the SQL self-documenting about the assumption that any winner is equivalent.
2026-04-21 — Before creating a PR for a root cause, search existing in-flight PRs by keyword and branch: gh pr list --state open --search "in:title <keyword>" and git branch -a | grep -iE "<keyword>". A parallel PR for an already-open fix is waste. Extend the existing PR instead. Promoted to Step 2b.5.
2026-04-21 — Pre-commit hooks in gokustats worktrees fail without Django SECRET_KEY. Source .env.local + venv-artemis from the MAIN gokustats repo before any commit in a worktree. Promoted to "Worktree Environment Gotchas".

How to promote an entry

When a learning is stable and should always apply:

Find the relevant Step above (Step 0.5, Step 1, Step 2, Principles, etc.)
Integrate the learning into the procedural text
In the Session Log entry, append **Promoted to <location>.**
Leave the log entry in place as history for future agents to see the origin

Delete log entries only when they are both promoted and older than 6 months.

Template for new entries

- **YYYY-MM-DD** — <one-sentence summary of what you learned>. <why it matters / what failed without this knowledge>. <Promoted to Step X.Y> OR <kept in log, re-evaluate next run>

Similar Skills

design-system

167.4k

Generates design tokens/docs from CSS/Tailwind/styled-components codebases, audits visual consistency across 10 dimensions, detects AI slop in UI.

team-skills-platform

ui-demo

167.4k

team-skills-platform

kotlin-patterns

167.4k

team-skills-platform

Stats

Parent Repo Stars0

Parent Repo Forks0

Last CommitApr 23, 2026

Actions

View Source View Plugin View on GitHub View README

Fix Data Alerts

Batch workflow for triaging, investigating, and fixing Dagster/dbt data alerts. Uses team agents for parallel investigation, DAG lineage for upstream dedup, and local Dagster for verification.

Persona: Senior data engineer running the on-call data alerts rotation. Methodical, evidence-driven, zero wasted work.

Job Ignore List

These jobs/alerts are out of scope for this skill. Silently skip them during intake — do not investigate, do not count them in triage totals, do not mention them unless the user asks.

Pattern	Reason
`nextgen`	Nextgen pipeline — separate workflow
`modal`	Modal pipeline failures — handled by `/investigate-app`
`duckdb`	DuckDB jobs — different infra
`postgres`	Postgres jobs — different infra
`snowflake_query`, `snowflake_task`	Snowflake-native tasks — not Dagster/dbt
`adhoc_`, `_adhoc_`, `_adhoc`	Ad-hoc jobs — not production

Matching rules:

Case-insensitive
Match against both the job name and the alert title/message
* is a glob wildcard (matches any characters)
If an alert mentions multiple jobs and ALL are ignored, skip the entire alert
If an alert mentions a mix of ignored and non-ignored jobs, keep it but only investigate the non-ignored jobs

To update this list, edit this section directly. No code changes needed — the skill reads this table at runtime.

Procedure

Follow these steps exactly in order.

Step 0: Date and Environment Setup

Capture today's date — used for branch names and Slack summary:

DATE=$(date +%Y-%m-%d)
echo "Session date: $DATE"

Hardcode the absolute paths for the session (shell vars do NOT persist between Bash calls):

WORKSPACE=~/DuneAnalytics/dex_trades
DBT_REPO=$WORKSPACE/dbt
GOKU_REPO=$WORKSPACE/gokustats-back-end

Step 0.5: Authenticate Dagster Cloud + Slack

Two credentials are required for auto-scan:

Dagster Cloud API token (primary — used to pull real run logs):

source $GOKU_REPO/.activate && test -n "$DAGSTER_CLOUD_API_TOKEN" && echo "OK" || echo "MISSING"

If missing, ask the user to generate a user token at https://artemis.dagster.cloud/prod/user-settings and persist it:

echo 'export DAGSTER_CLOUD_API_TOKEN="user:PASTE_HERE"' >> $GOKU_REPO/.env.local

(They can also !export DAGSTER_CLOUD_API_TOKEN="user:..." in the chat, but that doesn't propagate to Bash subprocesses — .env.local + source .activate is the reliable path.)

Slack token (for alert discovery):

test -f ~/.config/slack/token && echo "OK" || echo "MISSING"

Needs scopes channels:history, channels:read. If missing, ask the user to install one at ~/.config/slack/token.

Step 1: Intake — Collect Alerts

Accept alerts in one of three ways (check in order):

A. Auto-Scan #data-alerts via Slack + Dagster GraphQL (most automated — use when user says "fix data alerts" with no input):

The flow is: Slack → failure thread list → run_ids → Dagster GraphQL → real error chain + stdout.

1. Enumerate failure threads from Slack (last 24h, channel C05S8H76M08)

Write a short Python helper that:

Calls conversations_history(channel="C05S8H76M08", oldest=<24h_ago>)
Filters messages containing "failed", "Error:", :red_circle:
For each, fetches conversations_replies(ts=thread_ts) to capture Milo's replies
Extracts run_id (UUID pattern) and job_name (regex "([a-z][a-z0-9_]*_job)") from parent text
Applies the job ignore list (nextgen, modal, duckdb, postgres, snowflake_query, snowflake_task, adhoc_)

Save parsed threads to ~/data-alerts-logs/<DATE>/threads.json for reuse.

2. Pull real run details from Dagster Cloud GraphQL (primary source of truth)

For each run_id, query https://artemis.dagster.cloud/prod/graphql with header Dagster-Cloud-Api-Token: $DAGSTER_CLOUD_API_TOKEN:

query ($runId: ID!, $cursor: String) {
  runOrError(runId: $runId) {
    __typename
    ... on Run {
      runId jobName status startTime endTime
      stepStats { ... on RunStepStats { stepKey status } }
      eventConnection(limit: 1000, afterCursor: $cursor) {
        cursor hasMore
        events {
          __typename
          ... on ExecutionStepFailureEvent {
            stepKey
            error {
              message className
              errorChain { error { message className } isExplicitLink }
            }
          }
        }
      }
    }
    ... on RunNotFoundError { message }
  }
}

Critical schema facts (schema drift has caused 400s in prior runs):

Run field is eventConnection (NOT logsForRun / pipelineRunLogs)
eventConnection.limit max is 1000; paginate via afterCursor if hasMore is true
message/level on events require inline fragment on MessageEvent (interface)
errorChain on PythonError is essential for wrapped errors — RetryRequestedFromPolicy and DagsterExecutionStepExecutionError wrap the real cause (e.g. Forbidden: 403 Access Denied, ErrorDuneQueryException, AssertionError: No records found for mony...). Without errorChain you only see the wrapper.

3. Optionally fetch stdout tails for each failed step via capturedLogsMetadata:

query ($logKey: [String!]!) {
  capturedLogsMetadata(logKey: $logKey) {
    stdoutDownloadUrl stderrDownloadUrl
  }
}

Save results to ~/data-alerts-logs/<DATE>/dagster_direct/<NN>_<job>_<run_id[:8]>.txt for reuse.

B. Slack Thread URL (user provides a specific thread): Extract channel + message_ts from URL, then follow path A from step 2 onward.

C. Pasted Dagster Output: If user pastes raw stdout/stderr, parse it directly.

D. Manual Description: If user describes failures in plain text, extract job names + run_ids.

Apply Ignore List

After collecting all alerts, immediately filter out ignored jobs using the patterns in the "Job Ignore List" section above. For each alert:

Extract the job name from the alert
Match against every pattern in the ignore list (case-insensitive glob matching)
If the job matches ANY ignore pattern, silently drop it
If an alert thread contains multiple jobs, keep only the non-ignored ones

Report how many alerts were filtered:

Collected N alerts from #data-alerts
Filtered out M ignored jobs (nextgen: X, modal: Y, adhoc: Z, ...)
Proceeding with K alerts for investigation

Extract Fields

For each non-ignored alert, extract:

Field	Source
Job name	`dagster job execute -j <name>` or Slack alert title
Failed model(s)	dbt error output: `Model <name>` or `Error in model <name>`
Error type	compilation, runtime, test failure, freshness, timeout
Error message	The specific error text
Timestamp	When the failure occurred
Slack thread URL	For linking back in the fix summary

Store parsed alerts in a structured list for triage.

Step 2: Triage — Categorize and Deduplicate

2a. Auto-Categorize

Tag each alert:

Category	Signal	Investigation Path
`compilation`	"Compilation Error", syntax errors	Check model SQL + macros
`runtime`	"Runtime Error", "Database Error"	Check data types, upstream tables
`test-failure`	"Failure in test"	Check test config + data
`freshness`	"Freshness check", "stale"	Check upstream extract jobs
`timeout`	"max_runtime_seconds", "timed out"	Check query plan, data volume
`crash`	Python traceback, OOM	Check Dagster job config

2b. DAG Lineage Dedup

This is critical to avoid duplicative investigation work. Before spawning investigators, determine which failures are cascading from a common upstream root cause.

For each failed model, trace upstream dependencies:

cd $DBT_REPO && source $WORKSPACE/dbt/venv/bin/activate && set -a && source $WORKSPACE/dbt/.env.local && set +a && dbt ls --resource-type model --select +<failed_model> --output json 2>/dev/null | jq -r '.unique_id' 2>/dev/null

Build a dependency graph across all failed models
Dedup rule: If model A depends on model B (directly or transitively) and BOTH failed, group them — only investigate model B (the upstream root). Model A's failure is likely a cascade.
Error similarity: If two models in different DAG branches have identical error messages (e.g., same macro bug), group them under one root cause.

2b.5. Check Milo's In-Flight PRs — Salvage Work

Milo (milo-bot) auto-files fix PRs for many alerts — always check before investigating from scratch.

For each thread, scan replies for Milo-filed PR URLs (github.com/Artemis-xyz/<repo>/pull/<num>). For each found:

gh pr view --repo Artemis-xyz/<repo> <num> --json state,mergedAt,title,reviewDecision,closedAt
# If CLOSED (not merged), inspect closing review:
gh pr view --repo Artemis-xyz/<repo> <num> --json comments --jq '[.comments[] | select(.author.login=="claude") ][-1].body'

Classify each Milo PR:

OPEN + all checks pass + MERGEABLE → candidate to merge. Still review the diff yourself before landing.
CLOSED (rejected) → read the reviewer's comment. Common rejection patterns (seen in practice):
- Masking the alert (widening test interval + downgrading severity) — "test will never block again"
- Non-deterministic dedup (QUALIFY ORDER BY using a column that ties across duplicates)
- Wrong target (addresses a symptom that isn't the actual failing step)
- When reimplementing, write the PR description explicitly to call out: "This doesn't repeat #NNNN's mistake because..."
MERGED → alert should be resolved already; sanity check if it's still firing.

Also check for in-flight PRs on coordinated branches (not from Milo):

# Is there an existing branch/PR that already fixes one of our groups?
cd $GOKU_REPO && git branch -a | grep -iE "stablecoin|usdt0|<other_keyword>"
cd $GOKU_REPO && gh pr list --state open --search "in:title <keyword>" --json number,title,headRefName

If found, do NOT create a parallel PR. Options:

Wait for it to merge, then investigate remaining alerts
Rebase the existing branch onto fresh origin/main, cherry-pick any additional fixes onto it
Coordinate with the PR author

Seen-in-practice example: PR #5397 for stablecoin sensor gating was open while a new batch of alerts for the same root cause came in — the fix was to extend #5397, not create #5397-v2.

2c. Git History — Recent Breakage Check

Before investigating, check if recent commits or PRs introduced the breakage:

# Check dbt repo for recent changes to affected models
cd $DBT_REPO && git log --oneline --since="3 days ago" -- models/**/<model_name>*.sql macros/**/*.sql

# Check gokustats for recent job/sensor changes
cd $GOKU_REPO && git log --oneline --since="3 days ago" -- artemis_dagster/jobs/ artemis_dagster/sensors/

# Check for recently merged PRs that touched affected areas
cd $DBT_REPO && gh pr list --state merged --limit 10 --json title,number,mergedAt,files
cd $GOKU_REPO && gh pr list --state merged --limit 10 --json title,number,mergedAt,files

If a recent PR clearly introduced the issue (e.g., PR merged 2 hours ago touching the exact failing model), flag it prominently in the triage output.

2d. Present Triage Results

Display the triage to the user before proceeding:

Data Alert Triage — <DATE>
═══════════════════════════════════════════════════════════════════
  Total alerts:           N
  Deduplicated groups:    M (after DAG lineage dedup)
  Categories:             X compilation, Y runtime, Z freshness

  Root Cause Groups:
  ─────────────────────────────────────────────────────────────────
  Group 1: <upstream_model> (compilation)
    Affects: model_a, model_b, model_c (cascade)
    Jobs:    daily_chain_job, daily_stablecoin_job
    Repos:   dbt
    Recent PR: #3301 merged 2h ago — LIKELY CAUSE

  Group 2: <model_x> (runtime)
    Affects: model_x only
    Jobs:    daily_defi_job
    Repos:   dbt, gokustats
    Recent PR: none found
  ─────────────────────────────────────────────────────────────────

  Proceed with investigation? (y/n, or adjust groups)
═══════════════════════════════════════════════════════════════════

Do NOT proceed until the user confirms the triage.

Step 3: Worktree Setup

Create worktrees in BOTH repos with the same branch name:

DATE=$(date +%Y-%m-%d)
BRANCH="<engineer>/data-alerts-$DATE"

# dbt worktree
cd $DBT_REPO && git worktree add -b "$BRANCH" "$WORKSPACE/dbt-worktrees/data-alerts-$DATE" origin/main

# gokustats worktree
cd $GOKU_REPO && git worktree add -b "$BRANCH" "$WORKSPACE/worktrees/data-alerts-$DATE/gokustats-back-end" origin/main

Capture absolute worktree paths for agent prompts:

DBT_WT=$WORKSPACE/dbt-worktrees/data-alerts-$DATE
GOKU_WT=$WORKSPACE/worktrees/data-alerts-$DATE/gokustats-back-end

Step 4: Investigate — Team Agents

Create a team for this alert batch:

TeamCreate: team_name="data-alerts-<DATE>", description="Fix data alerts batch <DATE>"

4a. Create Tasks

Create one task per deduplicated root-cause group from Step 2:

TaskCreate for each group:
  title: "Investigate: <root_cause_model> (<category>)"
  description: Full context including error logs, affected models, DAG lineage, git history findings

4b. Spawn Investigators

Spawn one general-purpose teammate per root-cause group. Each investigator gets the full investigate-dagster-error workflow inlined in their prompt.

Agent prompt template for each investigator:

You are investigating a Dagster/dbt data alert. Your name is "investigator-<N>".

TEAM: data-alerts-<DATE>
DBT WORKTREE: <DBT_WT> (absolute path — use this for ALL dbt file operations)
GOKU WORKTREE: <GOKU_WT> (absolute path — use this for ALL gokustats file operations)
MAIN DBT REPO: <DBT_REPO> (for git history only)
MAIN GOKU REPO: <GOKU_REPO> (for git history only)

## CRITICAL PATH RULES

EVERY Bash command targeting dbt: cd <DBT_WT> && <command>
EVERY Bash command targeting gokustats: cd <GOKU_WT> && <command>
EVERY Read/Edit/Write for dbt: use absolute paths under <DBT_WT>/
EVERY Read/Edit/Write for gokustats: use absolute paths under <GOKU_WT>/
Git history commands: use the MAIN repos (not worktrees)

## Environment Bootstrap

For dbt commands in the worktree:
source <DBT_REPO>/venv/bin/activate && set -a && source <DBT_REPO>/.env.local && set +a

For Snowflake queries:
snowsql -c artemis -q "<SQL>"
Always use PC_DBT_DB.PROD.<TABLE> FQN. Always include LIMIT.

## Alert Details

<paste the specific alert(s) for this root-cause group>

## Investigation Procedure

Follow these phases in order:

### Phase 1: Parse Error and Identify Affected Models
- Extract model names, error type, error message from the alert
- Locate model files in <DBT_WT>/models/
- Check if macros are involved (<DBT_WT>/macros/)

### Phase 2: Dependency Analysis
- Trace upstream: find all {{ ref('...') }} and {{ source('...') }} in the model
- Trace downstream: search for models that ref the affected model
- Check Dagster job config in <GOKU_WT>/artemis_dagster/jobs/

### Phase 3: Diagnostic SQL
- Table metadata: INFORMATION_SCHEMA.TABLES for affected tables
- Column types: INFORMATION_SCHEMA.COLUMNS if type mismatch suspected
- Reproduce the error with a targeted query (always LIMIT 100)
- Check Snowflake query history for recent errors on affected warehouses:
  ```sql
  SELECT query_text, error_code, error_message, warehouse_name, start_time
  FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
  WHERE error_code != '000000' AND warehouse_name = '<WH>'
    AND start_time >= DATEADD(day, -2, CURRENT_TIMESTAMP())
  ORDER BY start_time DESC LIMIT 20;

Phase 4: Git History

Recent changes to the model: git log in MAIN DBT REPO
Recent changes to related macros
Recent Dagster job changes: git log in MAIN GOKU REPO
Identify the commit/PR that introduced the bug

Phase 5: Root Cause Summary

Produce:

What failed (model + error type)
Why it failed (technical root cause)
When the bug was introduced (commit, PR, date)
The fix (specific code change needed)

Phase 6: Apply Fix

Edit the affected files IN THE WORKTREE (not main repo)
Each fix = one commit with a descriptive message following conventional commits: fix():
Do NOT push or create PRs — the orchestrator handles that
After committing, report back via SendMessage:
- What you fixed
- Files changed
- Commit hash
- Confidence level (High/Medium/Low)

Coordination

Check TaskList after completing your investigation
If you discover your root cause is the same as another group's, notify the team lead
Mark your task as completed via TaskUpdate when done


Launch all investigators in parallel:


#### 4c. Monitor and Coordinate

- Wait for team agents to complete via automatic message delivery
- If two agents discover the same root cause, consolidate their fixes
- If an agent gets stuck, provide guidance via SendMessage
- Collect results: list of commits per worktree, files changed, confidence levels

---

### Step 5: Verify — Local Testing

After all fixes are committed in the worktrees, verify them.

#### 5a. dbt Compile Check

```bash
cd $DBT_WT && source $DBT_REPO/venv/bin/activate && set -a && source $DBT_REPO/.env.local && set +a && dbt compile --select <affected_models_space_separated>

5b. Run Affected Jobs Locally (dagjob)

For each affected Dagster job, run it using the dagjob alias:

cd $GOKU_WT && source $GOKU_REPO/.activate && dagjob <job_name>

If dagjob isn't available (worktree context), use the raw command:

cd $GOKU_WT && source $GOKU_REPO/.activate && dagster job execute -f artemis_dagster/primary_definitions.py -j <job_name>

5c. Spin Up Local Dagster for UI Verification

IMPORTANT: Never run make dagster — it loads ALL jobs, which is slow and unnecessary.

CRITICAL: Must unset AWS_ACCESS_KEY_ID before starting Dagster locally (conflicts with Snowflake auth).

Source env from the MAIN gokustats repo (worktrees lack venv/env), then launch with DAGSTER_ONLY_JOBS:

cd $GOKU_REPO && source .activate && unset AWS_ACCESS_KEY_ID && DAGSTER_ONLY_JOBS=<comma_separated_job_names> dagster dev -f artemis_dagster/primary_definitions.py

Or use the convenience script:

cd $GOKU_REPO && source .activate && unset AWS_ACCESS_KEY_ID && python scripts/dagster_local.py <job_name_1> <job_name_2> ...

Example:

cd $GOKU_REPO && source .activate && unset AWS_ACCESS_KEY_ID && DAGSTER_ONLY_JOBS=daily_bsc_job,daily_graph_job dagster dev -f artemis_dagster/primary_definitions.py

Note: Even with job filtering, asset loading takes ~5 minutes (all assets must load for job selection to resolve). This is normal.

Then tell the user:

Local Dagster is running with only the affected jobs. Please:
1. Open http://localhost:3000 in your browser
2. Navigate to the affected job(s): <list job names>
3. Trigger a run and verify success
4. Screenshot the successful run(s)
5. Tell me when you're done — I'll proceed to PR creation

Do NOT proceed until the user confirms verification is complete.

5d. Snowflake Spot-Check

After successful job runs, verify data landed correctly:

snowsql -c artemis -q "
SELECT table_name, row_count, last_altered
FROM PC_DBT_DB.INFORMATION_SCHEMA.TABLES
WHERE table_schema = 'PROD'
  AND table_name IN (<affected_table_names_upper>)
ORDER BY last_altered DESC;
"

For freshness alerts, also check the most recent records:

snowsql -c artemis -q "
SELECT MAX(block_timestamp) as latest_record, COUNT(*) as row_count
FROM PC_DBT_DB.PROD.<TABLE>
WHERE block_timestamp >= DATEADD(day, -2, CURRENT_TIMESTAMP())
LIMIT 1;
"

Present results to the user.

Step 6: User Gate — Approval

Present a summary of all fixes before PR creation:

Fix Summary — data-alerts-<DATE>
═══════════════════════════════════════════════════════════════════
  Root Cause Groups Fixed:  M / N

  dbt/ commits:
    abc1234  fix(stablecoin): correct column reference in agg model
    def5678  fix(ethereum): update macro for new schema

  gokustats-back-end/ commits:
    (none — all fixes were in dbt)

  Verification:
    dbt compile:     PASS
    dagjob runs:     3/3 PASS
    Snowflake check: Data landed, freshness OK

  Ready to create PRs? (y/n)
═══════════════════════════════════════════════════════════════════

Do NOT proceed until the user approves.

Step 7: Create PRs

One PR per repo. Each root-cause fix is already a separate commit.

7a. Push and PR for dbt

cd $DBT_WT && git push -u origin <engineer>/data-alerts-$DATE

cd $DBT_WT && gh pr create --title "fix: data alerts batch $DATE (dbt)" --body "$(cat <<'EOF'
## Summary
Batch fix for data alerts on <DATE>.

## Root Causes Fixed
<for each root-cause group:>
- **<model>** (<category>): <1-line description of fix>
  - Introduced by: PR #XXX (if known)
  - Commits: `abc1234`

## Files Changed
<list files>

## Testing
- [x] `dbt compile` — all affected models compile
- [x] Local Dagster job runs — all passed
- [x] Snowflake spot-check — data landed, freshness OK

## Affected Jobs
<list dagster job names>

---
Generated with Claude Code
EOF
)"

7b. Push and PR for gokustats (if changes exist)

Same pattern, only if there are commits in the gokustats worktree.

7c. Return PR URLs

PRs created:
  dbt:             <URL>
  gokustats:       <URL> (or "no changes needed")

Step 8: Slack Summary

Post a fix summary to #data-alerts using the Slack MCP "send message" tool.

Message format:

:white_check_mark: *Data Alerts Fixed — <DATE>*

*Alerts resolved:* N
*Root causes:* M

<for each root-cause group:>
:point_right: *<model_name>* (<category>)
> <1-line description of what broke and why>
> Fix: <1-line description of the fix>
> PR: <dbt PR URL> / <gokustats PR URL>

*Introduced by:* PR #XXX (if known)
*Verified:* Local Dagster runs + Snowflake spot-check

Send to channel #data-alerts (channel ID: C05S8H76M08).

Show the user the message before sending. Ask: "Post this to #data-alerts? (y/n, or edit)"

Step 9: Cleanup

After PRs are merged (not before):

# Remove worktrees
cd $DBT_REPO && git worktree remove $DBT_WT
cd $GOKU_REPO && git worktree remove $GOKU_WT
git worktree prune

# Delete local branches (remote branches auto-delete on PR merge)
cd $DBT_REPO && git branch -D <engineer>/data-alerts-$DATE
cd $GOKU_REPO && git branch -D <engineer>/data-alerts-$DATE

Do NOT run cleanup automatically. Tell the user: "Run /cleanup when the PRs are merged to remove the worktrees."

Step 10: Update the Learnings Log (REQUIRED)

Even if the run was uneventful, write one of:

A concrete learning (tooling quirk, API drift, new Milo rejection pattern, useful investigator prompt improvement)
An explicit "no new learnings — playbook worked as written" entry (so future agents know this ran without drift)

If the learning should ALWAYS apply from now on, promote it into the relevant Step above per the "How to promote an entry" procedure — and note the promotion in the log entry.

This step is not optional. Skills that are never updated rot against the underlying systems they describe.

Critical Rules for Dev Testing

NEVER run entire large jobs locally

For jobs with many models (e.g., daily_ez_metrics_job with 3000+ models):

Do NOT run dbt build -s tag:ez_metrics or similar broad selections locally
Instead, grep the stdout file for ERROR|FAIL lines to identify the specific failing models
Then investigate and fix only those specific models

Clone before running large incremental models in dev

Large incremental models (e.g., fact_bsc_transactions_v2 on BAM_TRANSACTION_XLG) will fail or take forever in dev without existing data. Always clone first:

# Option 1: Use the clone script
cd $DBT_REPO && python dbt_scripts/clone_object_into_dev_schema.py <model_name>

# Option 2: Manual clone via SQL (for tables in non-default databases)
snowsql -c artemis -q "CREATE TRANSIENT TABLE PC_DBT_DB.DEV_$USERNAME.<TABLE> CLONE PC_DBT_DB.PROD.<TABLE>;"

Then run the incremental model in dev — it will process only the delta.

Use ANALYTICS_XL warehouse for large dev runs

For models that normally run on large warehouses (XLG, XXL), override the warehouse in dev to avoid timeouts:

dbt run -s <model> --vars '{"snowflake_warehouse_override": "ANALYTICS_XL"}'

Or if the model doesn't support the override var, temporarily edit the config in the worktree.

Notes

User gates at Steps 2, 5c, and 6. Never auto-proceed past triage, verification, or PR approval.
One PR per repo, one commit per root cause. Clean git history for bisect/revert.
Worktrees in both repos always. Even if we think only dbt is affected — Dagster job config changes sometimes needed too.
Source main repo envs for worktrees. Worktrees lack .activate, venv, and .env. Always source from the main repo.
Slack stdout files are the primary error source. The alert message only says "Steps failed: ['open_source_snowflake_dbt_assets']" — the actual failing models and error messages are in the stdout .txt attachment. Always read these first.
Never display secrets. Mask any credentials that appear in logs.
Team agents are general-purpose. They need Bash, Read, Edit, Write for investigation and fixing. Never use Explore or Plan subagent types for fix work.
investigate-dagster-error phases are inlined in the agent prompt (Step 4b) rather than referenced as a skill, since team agents cannot invoke skills.

Investigation Principles

These principles override default behavior. Violations of these have caused bad fixes in the past.

Always fix at the source, never mask downstream

Always verify Dagster job existence before blaming test config

Always check upstream job timing

When creating or modifying Dagster job schedules, trace all upstream {{ ref() }} and {{ source() }} dependencies and verify their jobs complete before the new schedule fires. Steps:

Read model SQL → extract refs
Determine which Dagster group each ref belongs to (via CustomDagsterDbtTranslator.get_group_name())
Find that group's job definition and cron schedule
Ensure upstream completes before downstream starts

Always check for orphaned Dagster groups

Present triage with root cause hypothesis before fixing

Use `errorChain`, not just `error.message`

Sensor-based gating beats cron-based coordination

Worktree Environment Gotchas

These have broken the skill's pre-commit / push steps in the past.

Worktrees lack `.env` files and venv

MAIN=$GOKU_REPO  # or $DBT_REPO
cd $MAIN && set -a && source .env.shared && source .env.local 2>/dev/null; set +a
source $MAIN/venv-artemis/bin/activate
cd $WORKTREE && git commit ...   # hooks now have SECRET_KEY + python

Without this, the django-migration-lint pre-commit hook fails with Executable 'python' not found or ImproperlyConfigured: SECRET_KEY must not be empty.

Cherry-pick cascade when target branch is stale

Correct order:

cd <target_worktree>
git reset --hard origin/<target_branch>   # drop stale local commits first
git merge origin/main --no-edit           # bring in everything since branch cut
git cherry-pick <fix_commit>              # now applies cleanly
# resolve conflicts if any, then push

Learnings Log — UPDATE THIS EVERY RUN

Did I discover something about tooling that isn't documented above? → new entry under Tooling
Did I adjust the investigation playbook mid-flight? → new entry under Playbook adjustments
Did a Milo PR teach me a new rejection pattern? → new entry under Milo PR rejection patterns
Did the Dagster/dbt/Huntress API behave differently than expected? → new entry under API drift

Session log

2026-04-21 — Dagster Slack bot no longer attaches dagster_stdout_*.txt to alert threads. The scripts/download_alert_logs.py helper (on stash in fix/huntress-internal-token-auth branch) returns 0 files. Threads now contain only Milo auto-investigation replies. Bypass: use Dagster GraphQL per-run_id. Promoted to Step 0.5 / Step 1.
2026-04-21 — Dagster GraphQL schema: logsForRun and pipelineRunLogs don't exist at the root; Run.eventConnection is the correct path. eventConnection.limit max is 1000 — paginate via afterCursor. DagsterRunEvent is an interface; message/level need inline fragment on MessageEvent. Promoted to Step 1.A.2.
2026-04-21 — PythonError.errorChain is required to see the actual cause wrapped by RetryRequestedFromPolicy or DagsterExecutionStepExecutionError. Without it you only see wrapper text like "Exceeded max_retries of 1." Promoted to Investigation Principles.
2026-04-21 — Huntress MCP get_run_details works but is 6h-stale (sync cron 15 1,7,13,19 * * * UTC). For fresh alerts, Dagster GraphQL is primary; Huntress is fallback. The /dagster/sync-run HTTP endpoint on Huntress returned 502 (provider not configured on the web service container — only on cron container). Promoted to Step 1.A.4.
2026-04-21 — DAGSTER_CLOUD_API_TOKEN is NOT in .env.1pass. Generate a user token at https://artemis.dagster.cloud/prod/user-settings, write to $GOKU_REPO/.env.local, source via .activate. The !export chat trick does not propagate to Bash-tool subprocesses — file persistence is reliable. Promoted to Step 0.5.
2026-04-21 — Milo filed 5 PRs for the batch; 3 were CLOSED (rejected) by claude[bot] reviewer. Common rejection patterns: (1) widening test interval + downgrading severity (masks the alert permanently); (2) non-deterministic QUALIFY ORDER BY when tiebreak column ties across dups; (3) fix targets a symptom in a different model than the actual failure. Promoted to Step 2b.5.
2026-04-21 — When reimplementing after a rejected Milo PR, explicitly address the rejection reason in the PR description ("This doesn't repeat #NNNN's mistake because..."). This shortcuts re-review. Seen in dbt#3872.
2026-04-21 — For byte-identical duplicate rows, dedup ORDER BY should list every payload column — not because any column breaks the tie (none does), but to make the SQL self-documenting about the assumption that any winner is equivalent.
2026-04-21 — Before creating a PR for a root cause, search existing in-flight PRs by keyword and branch: gh pr list --state open --search "in:title <keyword>" and git branch -a | grep -iE "<keyword>". A parallel PR for an already-open fix is waste. Extend the existing PR instead. Promoted to Step 2b.5.
2026-04-21 — Pre-commit hooks in gokustats worktrees fail without Django SECRET_KEY. Source .env.local + venv-artemis from the MAIN gokustats repo before any commit in a worktree. Promoted to "Worktree Environment Gotchas".

How to promote an entry

When a learning is stable and should always apply:

Find the relevant Step above (Step 0.5, Step 1, Step 2, Principles, etc.)
Integrate the learning into the procedural text
In the Session Log entry, append **Promoted to <location>.**
Leave the log entry in place as history for future agents to see the origin

Delete log entries only when they are both promoted and older than 6 months.

Template for new entries

- **YYYY-MM-DD** — <one-sentence summary of what you learned>. <why it matters / what failed without this knowledge>. <Promoted to Step X.Y> OR <kept in log, re-evaluate next run>

investigate-data-alerts

Tool Access

Preview

SKILL.md

Similar Skills

investigate-data-alerts

Tool Access

Preview

SKILL.md

Fix Data Alerts

Job Ignore List

Procedure

Step 0: Date and Environment Setup

Step 0.5: Authenticate Dagster Cloud + Slack

Step 1: Intake — Collect Alerts

Apply Ignore List

Extract Fields

Step 2: Triage — Categorize and Deduplicate

2a. Auto-Categorize

2b. DAG Lineage Dedup

2b.5. Check Milo's In-Flight PRs — Salvage Work

2c. Git History — Recent Breakage Check

2d. Present Triage Results

Step 3: Worktree Setup

Step 4: Investigate — Team Agents

4a. Create Tasks

4b. Spawn Investigators

Phase 4: Git History

Phase 5: Root Cause Summary

Phase 6: Apply Fix

Coordination

5b. Run Affected Jobs Locally (dagjob)

5c. Spin Up Local Dagster for UI Verification

5d. Snowflake Spot-Check

Step 6: User Gate — Approval

Step 7: Create PRs

7a. Push and PR for dbt

7b. Push and PR for gokustats (if changes exist)

7c. Return PR URLs

Step 8: Slack Summary

Step 9: Cleanup

Step 10: Update the Learnings Log (REQUIRED)

Critical Rules for Dev Testing

NEVER run entire large jobs locally

Clone before running large incremental models in dev

Use ANALYTICS_XL warehouse for large dev runs

Notes

Investigation Principles

Always fix at the source, never mask downstream

Always verify Dagster job existence before blaming test config

Always check upstream job timing

Always check for orphaned Dagster groups

Present triage with root cause hypothesis before fixing

Use errorChain, not just error.message

Sensor-based gating beats cron-based coordination

Worktree Environment Gotchas

Worktrees lack .env files and venv

Cherry-pick cascade when target branch is stale

Learnings Log — UPDATE THIS EVERY RUN

Session log

How to promote an entry

Template for new entries

Similar Skills

Fix Data Alerts

Job Ignore List

Procedure

Step 0: Date and Environment Setup

Step 0.5: Authenticate Dagster Cloud + Slack

Step 1: Intake — Collect Alerts

Apply Ignore List

Extract Fields

Step 2: Triage — Categorize and Deduplicate

2a. Auto-Categorize

2b. DAG Lineage Dedup

2b.5. Check Milo's In-Flight PRs — Salvage Work

2c. Git History — Recent Breakage Check

2d. Present Triage Results

Step 3: Worktree Setup

Step 4: Investigate — Team Agents

4a. Create Tasks

Use `errorChain`, not just `error.message`

Worktrees lack `.env` files and venv

Use `errorChain`, not just `error.message`

Worktrees lack `.env` files and venv