Batch-fix Dagster/dbt data alerts: triage, deduplicate by DAG lineage, investigate in parallel with team agents, verify locally, PR, and post Slack summary.
```bash
npx claudepluginhub artemis-xyz/theclauu --plugin theclauu
```

This skill uses the workspace's default tool permissions.
Batch workflow for triaging, investigating, and fixing Dagster/dbt data alerts. Uses team agents for parallel investigation, DAG lineage for upstream dedup, and local Dagster for verification.
Persona: Senior data engineer running the on-call data alerts rotation. Methodical, evidence-driven, zero wasted work.
### Job Ignore List

These jobs/alerts are out of scope for this skill. Silently skip them during intake — do not investigate, do not count them in triage totals, and do not mention them unless the user asks.
| Pattern | Reason |
|---|---|
| `*nextgen*` | Nextgen pipeline — separate workflow |
| `*modal*` | Modal pipeline failures — handled by /investigate-app |
| `*duckdb*` | DuckDB jobs — different infra |
| `*postgres*` | Postgres jobs — different infra |
| `*snowflake_query*`, `*snowflake_task*` | Snowflake-native tasks — not Dagster/dbt |
| `adhoc_*`, `*_adhoc_*`, `*_adhoc` | Ad-hoc jobs — not production |
Matching rules:
- `*` is a glob wildcard (matches any characters)

To update this list, edit this section directly. No code changes needed — the skill reads this table at runtime.
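As a minimal sketch of the matching rule above, the ignore list can be checked with Python's `fnmatch` (the pattern list mirrors the table; the helper name `is_ignored` is illustrative, not from the repo):

```python
# Hypothetical sketch of the ignore-list filter described above.
# Patterns are copied from the Job Ignore List table; fnmatch gives
# glob semantics where * matches any characters.
from fnmatch import fnmatch

IGNORE_PATTERNS = [
    "*nextgen*", "*modal*", "*duckdb*", "*postgres*",
    "*snowflake_query*", "*snowflake_task*",
    "adhoc_*", "*_adhoc_*", "*_adhoc",
]

def is_ignored(job_name: str) -> bool:
    """Return True if the job matches any ignore-list glob."""
    return any(fnmatch(job_name, p) for p in IGNORE_PATTERNS)
```

Since the skill reads the table at runtime, editing the markdown table is the real source of truth; this sketch only shows the matching semantics.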
Follow these steps exactly in order.
### Step 0: Session Setup

Capture today's date — used for branch names and Slack summary:
```bash
DATE=$(date +%Y-%m-%d)
echo "Session date: $DATE"
```
Hardcode the absolute paths for the session (shell vars do NOT persist between Bash calls):
```bash
WORKSPACE=~/DuneAnalytics/dex_trades
DBT_REPO=$WORKSPACE/dbt
GOKU_REPO=$WORKSPACE/gokustats-back-end
```
### Step 0.5: Credential Check

Two credentials are required for auto-scan:
Dagster Cloud API token (primary — used to pull real run logs):
```bash
source $GOKU_REPO/.activate && test -n "$DAGSTER_CLOUD_API_TOKEN" && echo "OK" || echo "MISSING"
```
If missing, ask the user to generate a user token at https://artemis.dagster.cloud/prod/user-settings and persist it:
```bash
echo 'export DAGSTER_CLOUD_API_TOKEN="user:PASTE_HERE"' >> $GOKU_REPO/.env.local
```
(They can also !export DAGSTER_CLOUD_API_TOKEN="user:..." in the chat, but that doesn't propagate to Bash subprocesses — .env.local + source .activate is the reliable path.)
Slack token (for alert discovery):
```bash
test -f ~/.config/slack/token && echo "OK" || echo "MISSING"
```
Needs scopes `channels:history` and `channels:read`. If missing, ask the user to install a token at `~/.config/slack/token`.
DO NOT rely on scripts/download_alert_logs.py or stdout .txt attachments in Slack threads. The Dagster bot no longer attaches stdout files; threads now contain only Milo investigation replies. The script will report "0 files downloaded" and be useless. See "Learnings" section for history.
### Step 1: Alert Intake

Accept alerts in one of four ways (check in order):
A. Auto-Scan #data-alerts via Slack + Dagster GraphQL (most automated — use when user says "fix data alerts" with no input):
The flow is: Slack → failure thread list → run_ids → Dagster GraphQL → real error chain + stdout.
1. Enumerate failure threads from Slack (last 24h, channel C05S8H76M08)
Write a short Python helper that:
- Calls `conversations_history(channel="C05S8H76M08", oldest=<24h_ago>)`
- Filters parent messages for failure signals: "failed", "Error:", :red_circle:
- Calls `conversations_replies(ts=thread_ts)` to capture Milo's replies
- Extracts `run_id` (UUID pattern) and `job_name` (regex `"([a-z][a-z0-9_]*_job)"`) from the parent text
- Saves parsed threads to `~/data-alerts-logs/<DATE>/threads.json` for reuse
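The extraction step of that helper can be sketched as follows (the UUID and job-name regexes follow the patterns given above; the function name `parse_alert_text` is illustrative):

```python
# Sketch of the parent-text parsing described above. The Slack API calls
# are omitted; this shows only the run_id / job_name extraction.
import re

RUN_ID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")
JOB_NAME_RE = re.compile(r"([a-z][a-z0-9_]*_job)")

def parse_alert_text(text: str) -> dict:
    """Extract run_id (UUID) and job_name from a Slack alert's parent text."""
    run_id = RUN_ID_RE.search(text)
    job = JOB_NAME_RE.search(text)
    return {
        "run_id": run_id.group(0) if run_id else None,
        "job_name": job.group(1) if job else None,
    }
```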
2. Pull real run details from Dagster Cloud GraphQL (primary source of truth)
For each run_id, query https://artemis.dagster.cloud/prod/graphql with header Dagster-Cloud-Api-Token: $DAGSTER_CLOUD_API_TOKEN:
```graphql
query ($runId: ID!, $cursor: String) {
  runOrError(runId: $runId) {
    __typename
    ... on Run {
      runId jobName status startTime endTime
      stepStats { ... on RunStepStats { stepKey status } }
      eventConnection(limit: 1000, afterCursor: $cursor) {
        cursor hasMore
        events {
          __typename
          ... on ExecutionStepFailureEvent {
            stepKey
            error {
              message className
              errorChain { error { message className } isExplicitLink }
            }
          }
        }
      }
    }
    ... on RunNotFoundError { message }
  }
}
```
Critical schema facts (schema drift has caused 400s in prior runs):
- Events live under `eventConnection` (NOT `logsForRun` / `pipelineRunLogs`)
- `eventConnection.limit` max is 1000; paginate via `afterCursor` if `hasMore` is true
- `message`/`level` on events require an inline fragment on `MessageEvent` (interface)
- `errorChain` on `PythonError` is essential for wrapped errors — `RetryRequestedFromPolicy` and `DagsterExecutionStepExecutionError` wrap the real cause (e.g. `Forbidden: 403 Access Denied`, `ErrorDuneQueryException`, `AssertionError: No records found for mony...`). Without `errorChain` you only see the wrapper.

3. Optionally fetch stdout tails for each failed step via `capturedLogsMetadata`:
```graphql
query ($logKey: [String!]!) {
  capturedLogsMetadata(logKey: $logKey) {
    stdoutDownloadUrl stderrDownloadUrl
  }
}
```
logKey is [run_id, "compute", step_key]. Then GET the stdoutDownloadUrl (same Dagster-Cloud-Api-Token header) — URL is a presigned S3 link. Keep only the tail (last ~6-8KB) — errors cluster at the end.
Save results to ~/data-alerts-logs/<DATE>/dagster_direct/<NN>_<job>_<run_id[:8]>.txt for reuse.
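The pagination rule above (limit capped at 1000, follow `afterCursor` while `hasMore`) can be sketched as a small drain loop. `fetch_page` stands in for the GraphQL POST so the loop itself is testable without network access; the function names are illustrative:

```python
# Hedged sketch of draining Run.eventConnection, following afterCursor
# while hasMore is true. fetch_page(cursor) is assumed to return the
# eventConnection payload: {"events": [...], "hasMore": bool, "cursor": str}.
def collect_events(fetch_page) -> list:
    """Accumulate all events across eventConnection pages."""
    events, cursor = [], None
    while True:
        page = fetch_page(cursor)
        events.extend(page["events"])
        if not page["hasMore"]:
            return events
        cursor = page["cursor"]
```

In the real helper, `fetch_page` would POST the query above to `https://artemis.dagster.cloud/prod/graphql` with the `Dagster-Cloud-Api-Token` header.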
4. (Optional) Huntress MCP as secondary lookup — works but is behind a 6h sync cron (cron: "15 1,7,13,19 * * *" UTC). Newest runs are typically "Run not found — may not have been synced yet." Only useful if Dagster GraphQL is down. Tool: mcp__huntress__get_run_details(run_id, include_stdout=true). MCP key rotates — if keys are stale, regenerate at https://huntress.vercel.app/setup.
B. Slack Thread URL (user provides a specific thread): Extract channel + message_ts from URL, then follow path A from step 2 onward.
C. Pasted Dagster Output: If user pastes raw stdout/stderr, parse it directly.
D. Manual Description: If user describes failures in plain text, extract job names + run_ids.
After collecting all alerts, immediately filter out ignored jobs using the patterns in the "Job Ignore List" section above.
Report how many alerts were filtered:
Collected N alerts from #data-alerts
Filtered out M ignored jobs (nextgen: X, modal: Y, adhoc: Z, ...)
Proceeding with K alerts for investigation
For each non-ignored alert, extract:
| Field | Source |
|---|---|
| Job name | dagster job execute -j <name> or Slack alert title |
| Failed model(s) | dbt error output: Model <name> or Error in model <name> |
| Error type | compilation, runtime, test failure, freshness, timeout |
| Error message | The specific error text |
| Timestamp | When the failure occurred |
| Slack thread URL | For linking back in the fix summary |
Store parsed alerts in a structured list for triage.
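One possible shape for that structured list is a small record per alert; the field names below simply mirror the extraction table and are illustrative, not a schema the skill mandates:

```python
# Illustrative record for a parsed alert, mirroring the fields in the
# extraction table above.
from dataclasses import dataclass, field

@dataclass
class Alert:
    job_name: str
    failed_models: list = field(default_factory=list)
    error_type: str = ""      # compilation, runtime, test failure, ...
    error_message: str = ""
    timestamp: str = ""
    thread_url: str = ""      # for linking back in the fix summary
```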
Tag each alert:
| Category | Signal | Investigation Path |
|---|---|---|
| `compilation` | "Compilation Error", syntax errors | Check model SQL + macros |
| `runtime` | "Runtime Error", "Database Error" | Check data types, upstream tables |
| `test-failure` | "Failure in test" | Check test config + data |
| `freshness` | "Freshness check", "stale" | Check upstream extract jobs |
| `timeout` | "max_runtime_seconds", "timed out" | Check query plan, data volume |
| `crash` | Python traceback, OOM | Check Dagster job config |
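The tagging above amounts to a first-match substring scan; a minimal sketch (signal strings copied from the table, helper name illustrative):

```python
# Sketch of the category tagger described in the table above:
# first matching signal wins, scanned in table order.
SIGNALS = [
    ("compilation", ["Compilation Error"]),
    ("runtime", ["Runtime Error", "Database Error"]),
    ("test-failure", ["Failure in test"]),
    ("freshness", ["Freshness check", "stale"]),
    ("timeout", ["max_runtime_seconds", "timed out"]),
    ("crash", ["Traceback", "OOM"]),
]

def categorize(error_text: str) -> str:
    """Tag an alert's error text with its investigation category."""
    for category, needles in SIGNALS:
        if any(n in error_text for n in needles):
            return category
    return "unknown"
```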
### Step 2: Deduplicate by DAG Lineage

This is critical to avoid duplicative investigation work. Before spawning investigators, determine which failures are cascading from a common upstream root cause.
For each failed model, trace upstream dependencies:
```bash
cd $DBT_REPO && source $WORKSPACE/dbt/venv/bin/activate && set -a && source $WORKSPACE/dbt/.env.local && set +a && dbt ls --resource-type model --select +<failed_model> --output json 2>/dev/null | jq -r '.unique_id' 2>/dev/null
```
- Build a dependency graph across all failed models.
- **Dedup rule:** If model A depends on model B (directly or transitively) and BOTH failed, group them — only investigate model B (the upstream root). Model A's failure is likely a cascade.
- **Error similarity:** If two models in different DAG branches have identical error messages (e.g., same macro bug), group them under one root cause.
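As an illustrative sketch (function and variable names are mine, not from the repo), the dedup rule reduces to keeping only failed models with no failed transitive upstream:

```python
# Sketch of the cascade-dedup rule above. deps maps model -> list of
# direct upstream models (from dbt ls / lineage). A failure is a "root"
# if none of its transitive upstreams also failed.
def root_failures(failed: set, deps: dict) -> set:
    """Return the subset of failed models worth investigating."""
    def upstreams(model, seen=None):
        seen = seen if seen is not None else set()
        for up in deps.get(model, []):
            if up not in seen:
                seen.add(up)
                upstreams(up, seen)
        return seen

    return {m for m in failed if not (upstreams(m) & failed)}
```

Everything filtered out here is presumed to be a cascade of one of the returned roots.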
Optional — Huntress MCP for lineage visualization:
If configured, use the Huntress MCP at https://huntress.vercel.app/setup to query the full DAG lineage. This provides richer dependency data than dbt ls alone. If not configured, fall back to dbt ls (works fine for most cases).
### Step 2b.5: Check Milo PRs and In-Flight Branches

Milo (milo-bot) auto-files fix PRs for many alerts — always check before investigating from scratch.
For each thread, scan replies for Milo-filed PR URLs (github.com/Artemis-xyz/<repo>/pull/<num>). For each found:
```bash
gh pr view --repo Artemis-xyz/<repo> <num> --json state,mergedAt,title,reviewDecision,closedAt

# If CLOSED (not merged), inspect closing review:
gh pr view --repo Artemis-xyz/<repo> <num> --json comments --jq '[.comments[] | select(.author.login=="claude") ][-1].body'
```
Classify each Milo PR:
- **OPEN** + all checks pass + MERGEABLE → candidate to merge. Still review the diff yourself before landing.
- **CLOSED** (rejected) → read the reviewer's comment. Common rejection patterns seen in practice include non-deterministic dedup (`QUALIFY ORDER BY` using a column that ties across duplicates).
- **MERGED** → alert should be resolved already; sanity check if it's still firing.

Also check for in-flight PRs on coordinated branches (not from Milo):
```bash
# Is there an existing branch/PR that already fixes one of our groups?
cd $GOKU_REPO && git branch -a | grep -iE "stablecoin|usdt0|<other_keyword>"
cd $GOKU_REPO && gh pr list --state open --search "in:title <keyword>" --json number,title,headRefName
```
If found, do NOT create a parallel PR. Options:
- Merge `origin/main` into the existing branch, then cherry-pick any additional fixes onto it

Seen-in-practice example: PR #5397 for stablecoin sensor gating was open while a new batch of alerts for the same root cause came in — the fix was to extend #5397, not create #5397-v2.
Before investigating, check if recent commits or PRs introduced the breakage:
```bash
# Check dbt repo for recent changes to affected models
cd $DBT_REPO && git log --oneline --since="3 days ago" -- models/**/<model_name>*.sql macros/**/*.sql

# Check gokustats for recent job/sensor changes
cd $GOKU_REPO && git log --oneline --since="3 days ago" -- artemis_dagster/jobs/ artemis_dagster/sensors/

# Check for recently merged PRs that touched affected areas
cd $DBT_REPO && gh pr list --state merged --limit 10 --json title,number,mergedAt,files
cd $GOKU_REPO && gh pr list --state merged --limit 10 --json title,number,mergedAt,files
```
If a recent PR clearly introduced the issue (e.g., PR merged 2 hours ago touching the exact failing model), flag it prominently in the triage output.
Display the triage to the user before proceeding:
```
Data Alert Triage — <DATE>
═══════════════════════════════════════════════════════════════════
Total alerts: N
Deduplicated groups: M (after DAG lineage dedup)
Categories: X compilation, Y runtime, Z freshness

Root Cause Groups:
─────────────────────────────────────────────────────────────────
Group 1: <upstream_model> (compilation)
  Affects: model_a, model_b, model_c (cascade)
  Jobs: daily_chain_job, daily_stablecoin_job
  Repos: dbt
  Recent PR: #3301 merged 2h ago — LIKELY CAUSE

Group 2: <model_x> (runtime)
  Affects: model_x only
  Jobs: daily_defi_job
  Repos: dbt, gokustats
  Recent PR: none found
─────────────────────────────────────────────────────────────────
Proceed with investigation? (y/n, or adjust groups)
═══════════════════════════════════════════════════════════════════
```
Do NOT proceed until the user confirms the triage.
### Step 3: Create Worktrees

Create worktrees in BOTH repos with the same branch name:
```bash
DATE=$(date +%Y-%m-%d)
BRANCH="<engineer>/data-alerts-$DATE"

# dbt worktree
cd $DBT_REPO && git worktree add -b "$BRANCH" "$WORKSPACE/dbt-worktrees/data-alerts-$DATE" origin/main

# gokustats worktree
cd $GOKU_REPO && git worktree add -b "$BRANCH" "$WORKSPACE/worktrees/data-alerts-$DATE/gokustats-back-end" origin/main
```
Capture absolute worktree paths for agent prompts:
```bash
DBT_WT=$WORKSPACE/dbt-worktrees/data-alerts-$DATE
GOKU_WT=$WORKSPACE/worktrees/data-alerts-$DATE/gokustats-back-end
```
### Step 4: Parallel Investigation with Team Agents

#### 4a. Create Team and Tasks

Create a team for this alert batch:
TeamCreate: team_name="data-alerts-<DATE>", description="Fix data alerts batch <DATE>"
Create one task per deduplicated root-cause group from Step 2:
TaskCreate for each group:
title: "Investigate: <root_cause_model> (<category>)"
description: Full context including error logs, affected models, DAG lineage, git history findings
#### 4b. Spawn Investigators

Spawn one general-purpose teammate per root-cause group. Each investigator gets the full investigate-dagster-error workflow inlined in their prompt.
Agent prompt template for each investigator:
You are investigating a Dagster/dbt data alert. Your name is "investigator-<N>".
TEAM: data-alerts-<DATE>
DBT WORKTREE: <DBT_WT> (absolute path — use this for ALL dbt file operations)
GOKU WORKTREE: <GOKU_WT> (absolute path — use this for ALL gokustats file operations)
MAIN DBT REPO: <DBT_REPO> (for git history only)
MAIN GOKU REPO: <GOKU_REPO> (for git history only)
## CRITICAL PATH RULES
EVERY Bash command targeting dbt: cd <DBT_WT> && <command>
EVERY Bash command targeting gokustats: cd <GOKU_WT> && <command>
EVERY Read/Edit/Write for dbt: use absolute paths under <DBT_WT>/
EVERY Read/Edit/Write for gokustats: use absolute paths under <GOKU_WT>/
Git history commands: use the MAIN repos (not worktrees)
## Environment Bootstrap
For dbt commands in the worktree:
source <DBT_REPO>/venv/bin/activate && set -a && source <DBT_REPO>/.env.local && set +a
For Snowflake queries:
snowsql -c artemis -q "<SQL>"
Always use PC_DBT_DB.PROD.<TABLE> FQN. Always include LIMIT.
## Alert Details
<paste the specific alert(s) for this root-cause group>
## Investigation Procedure
Follow these phases in order:
### Phase 1: Parse Error and Identify Affected Models
- Extract model names, error type, error message from the alert
- Locate model files in <DBT_WT>/models/
- Check if macros are involved (<DBT_WT>/macros/)
### Phase 2: Dependency Analysis
- Trace upstream: find all {{ ref('...') }} and {{ source('...') }} in the model
- Trace downstream: search for models that ref the affected model
- Check Dagster job config in <GOKU_WT>/artemis_dagster/jobs/
### Phase 3: Diagnostic SQL
- Table metadata: INFORMATION_SCHEMA.TABLES for affected tables
- Column types: INFORMATION_SCHEMA.COLUMNS if type mismatch suspected
- Reproduce the error with a targeted query (always LIMIT 100)
- Check Snowflake query history for recent errors on affected warehouses:
```sql
SELECT query_text, error_code, error_message, warehouse_name, start_time
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE error_code != '000000' AND warehouse_name = '<WH>'
AND start_time >= DATEADD(day, -2, CURRENT_TIMESTAMP())
ORDER BY start_time DESC LIMIT 20;
```
Produce:
Launch all investigators in parallel:
Agent(subagent_type="general-purpose", team_name="data-alerts-<DATE>", name="investigator-1", run_in_background=true, prompt=<filled template>)
Agent(subagent_type="general-purpose", team_name="data-alerts-<DATE>", name="investigator-2", run_in_background=true, prompt=<filled template>)
...
#### 4c. Monitor and Coordinate
- Wait for team agents to complete via automatic message delivery
- If two agents discover the same root cause, consolidate their fixes
- If an agent gets stuck, provide guidance via SendMessage
- Collect results: list of commits per worktree, files changed, confidence levels
---
### Step 5: Verify — Local Testing
After all fixes are committed in the worktrees, verify them.
#### 5a. dbt Compile Check
```bash
cd $DBT_WT && source $DBT_REPO/venv/bin/activate && set -a && source $DBT_REPO/.env.local && set +a && dbt compile --select <affected_models_space_separated>
```
#### 5b. Dagster Job Runs

For each affected Dagster job, run it using the `dagjob` alias:

```bash
cd $GOKU_WT && source $GOKU_REPO/.activate && dagjob <job_name>
```

If `dagjob` isn't available (worktree context), use the raw command:

```bash
cd $GOKU_WT && source $GOKU_REPO/.activate && dagster job execute -f artemis_dagster/primary_definitions.py -j <job_name>
```
IMPORTANT: Never run make dagster — it loads ALL jobs, which is slow and unnecessary.
#### 5c. Interactive Verification (`dagster dev`)

CRITICAL: Must unset `AWS_ACCESS_KEY_ID` before starting Dagster locally (it conflicts with Snowflake auth).

Source env from the MAIN gokustats repo (worktrees lack venv/env), then launch with `DAGSTER_ONLY_JOBS`:

```bash
cd $GOKU_REPO && source .activate && unset AWS_ACCESS_KEY_ID && DAGSTER_ONLY_JOBS=<comma_separated_job_names> dagster dev -f artemis_dagster/primary_definitions.py
```

Or use the convenience script:

```bash
cd $GOKU_REPO && source .activate && unset AWS_ACCESS_KEY_ID && python scripts/dagster_local.py <job_name_1> <job_name_2> ...
```

Example:

```bash
cd $GOKU_REPO && source .activate && unset AWS_ACCESS_KEY_ID && DAGSTER_ONLY_JOBS=daily_bsc_job,daily_graph_job dagster dev -f artemis_dagster/primary_definitions.py
```
Note: Even with job filtering, asset loading takes ~5 minutes (all assets must load for job selection to resolve). This is normal.
Then tell the user:
Local Dagster is running with only the affected jobs. Please:
1. Open http://localhost:3000 in your browser
2. Navigate to the affected job(s): <list job names>
3. Trigger a run and verify success
4. Screenshot the successful run(s)
5. Tell me when you're done — I'll proceed to PR creation
Do NOT proceed until the user confirms verification is complete.
#### 5d. Snowflake Spot-Check

After successful job runs, verify data landed correctly:
```bash
snowsql -c artemis -q "
SELECT table_name, row_count, last_altered
FROM PC_DBT_DB.INFORMATION_SCHEMA.TABLES
WHERE table_schema = 'PROD'
  AND table_name IN (<affected_table_names_upper>)
ORDER BY last_altered DESC;
"
```
For freshness alerts, also check the most recent records:
```bash
snowsql -c artemis -q "
SELECT MAX(block_timestamp) as latest_record, COUNT(*) as row_count
FROM PC_DBT_DB.PROD.<TABLE>
WHERE block_timestamp >= DATEADD(day, -2, CURRENT_TIMESTAMP())
LIMIT 1;
"
```
Present results to the user.
### Step 6: Fix Summary

Present a summary of all fixes before PR creation:
```
Fix Summary — data-alerts-<DATE>
═══════════════════════════════════════════════════════════════════
Root Cause Groups Fixed: M / N

dbt/ commits:
  abc1234 fix(stablecoin): correct column reference in agg model
  def5678 fix(ethereum): update macro for new schema

gokustats-back-end/ commits:
  (none — all fixes were in dbt)

Verification:
  dbt compile: PASS
  dagjob runs: 3/3 PASS
  Snowflake check: Data landed, freshness OK

Ready to create PRs? (y/n)
═══════════════════════════════════════════════════════════════════
```
Do NOT proceed until the user approves.
### Step 7: Create PRs

One PR per repo. Each root-cause fix is already a separate commit.
```bash
cd $DBT_WT && git push -u origin <engineer>/data-alerts-$DATE

cd $DBT_WT && gh pr create --title "fix: data alerts batch $DATE (dbt)" --body "$(cat <<'EOF'
## Summary
Batch fix for data alerts on <DATE>.

## Root Causes Fixed
<for each root-cause group:>
- **<model>** (<category>): <1-line description of fix>
  - Introduced by: PR #XXX (if known)
  - Commits: `abc1234`

## Files Changed
<list files>

## Testing
- [x] `dbt compile` — all affected models compile
- [x] Local Dagster job runs — all passed
- [x] Snowflake spot-check — data landed, freshness OK

## Affected Jobs
<list dagster job names>

---
Generated with Claude Code
EOF
)"
```
Same pattern, only if there are commits in the gokustats worktree.
PRs created:
dbt: <URL>
gokustats: <URL> (or "no changes needed")
### Step 8: Slack Summary

Post a fix summary to #data-alerts using the Slack MCP "send message" tool.
Message format:
```
:white_check_mark: *Data Alerts Fixed — <DATE>*
*Alerts resolved:* N
*Root causes:* M

<for each root-cause group:>
:point_right: *<model_name>* (<category>)
> <1-line description of what broke and why>
> Fix: <1-line description of the fix>
> PR: <dbt PR URL> / <gokustats PR URL>

*Introduced by:* PR #XXX (if known)
*Verified:* Local Dagster runs + Snowflake spot-check
```
Send to channel #data-alerts (channel ID: C05S8H76M08).
If the original alerts were from a specific Slack thread, reply in that thread instead of posting a new top-level message. Use the Slack MCP "reply to thread" tool with the original thread timestamp.
Show the user the message before sending. Ask: "Post this to #data-alerts? (y/n, or edit)"
### Step 9: Cleanup (After Merge)

After PRs are merged (not before):
```bash
# Remove worktrees
cd $DBT_REPO && git worktree remove $DBT_WT
cd $GOKU_REPO && git worktree remove $GOKU_WT
git worktree prune

# Delete local branches (remote branches auto-delete on PR merge)
cd $DBT_REPO && git branch -D <engineer>/data-alerts-$DATE
cd $GOKU_REPO && git branch -D <engineer>/data-alerts-$DATE
```
Do NOT run cleanup automatically. Tell the user: "Run /cleanup when the PRs are merged to remove the worktrees."
Before closing the run, append at least one entry to the Session Log under ## Learnings Log — UPDATE THIS EVERY RUN at the bottom of this skill file (~/.claude/skills/investigate-data-alerts/SKILL.md).
Even if the run was uneventful, write one of:
If the learning should ALWAYS apply from now on, promote it into the relevant Step above per the "How to promote an entry" procedure — and note the promotion in the log entry.
This step is not optional. Skills that are never updated rot against the underlying systems they describe.
For jobs with many models (e.g., daily_ez_metrics_job with 3000+ models):

- Run `dbt build -s tag:ez_metrics` or similar broad selections locally
- Scan the output for `ERROR|FAIL` lines to identify the specific failing models

Large incremental models (e.g., `fact_bsc_transactions_v2` on `BAM_TRANSACTION_XLG`) will fail or take forever in dev without existing data. Always clone first:
```bash
# Option 1: Use the clone script
cd $DBT_REPO && python dbt_scripts/clone_object_into_dev_schema.py <model_name>

# Option 2: Manual clone via SQL (for tables in non-default databases)
snowsql -c artemis -q "CREATE TRANSIENT TABLE PC_DBT_DB.DEV_$USERNAME.<TABLE> CLONE PC_DBT_DB.PROD.<TABLE>;"
```
Then run the incremental model in dev — it will process only the delta.
For models that normally run on large warehouses (XLG, XXL), override the warehouse in dev to avoid timeouts:
```bash
dbt run -s <model> --vars '{"snowflake_warehouse_override": "ANALYTICS_XL"}'
```
Or if the model doesn't support the override var, temporarily edit the config in the worktree.
Worktrees do not have their own `.activate`, venv, and `.env` files. Always source from the main repo.

### Investigation Principles

These principles override default behavior. Violations of these have caused bad fixes in the past.
When a test fails (not_null, recency, etc.), trace the bad data upstream to its origin. Do NOT add WHERE x IS NOT NULL filters in downstream models to suppress the symptom. The user will reject fixes that mask the root cause. Ask: "Why does this bad data exist in the first place?" and fix that.
When recency tests fail, check whether a Dagster job actually exists for the failing model's asset group. Common pattern: dbt models exist and have tags, but no define_asset_job selects their group — the fact tables never get materialized. The fix is creating/fixing the job, not adjusting test thresholds or adding ignore_time_component.
When creating or modifying Dagster job schedules, trace all upstream `{{ ref() }}` and `{{ source() }}` dependencies and verify their jobs complete before the new schedule fires (asset groups are resolved via `CustomDagsterDbtTranslator.get_group_name()`).

`CustomDagsterDbtTranslator.get_group_name()` assigns groups via tag first, then falls back to directory path. A model at `models/staging/foo/bar.sql` gets group `foo` automatically. If no `daily_foo_job = define_asset_job(... groups("foo"))` exists, those assets are orphaned — they have a group but no job runs them. When investigating stale data, always check this.
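The tag-then-directory fallback can be sketched as follows. This is a hedged illustration of the behavior described above, not the real translator code; in particular the `group:` tag prefix is a hypothetical convention for the example:

```python
# Illustrative sketch of the group-assignment fallback described above
# (the real logic lives in CustomDagsterDbtTranslator.get_group_name).
from pathlib import PurePosixPath

def infer_group(tags: list, model_path: str) -> str:
    """Prefer an explicit group tag; else fall back to the parent directory."""
    for tag in tags:
        if tag.startswith("group:"):  # hypothetical tag convention
            return tag.split(":", 1)[1]
    # models/staging/foo/bar.sql -> "foo"
    return PurePosixPath(model_path).parent.name
```

An orphan check would then scan all inferred groups against the set of groups selected by some `define_asset_job`.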
Never jump to applying fixes. Present the triage with your root cause hypothesis and let the user validate. Common failure mode: applying a surface-level fix (test config change, WHERE filter) when the actual issue is infrastructure (missing job, wrong schedule, dead source).
**Read `errorChain`, not just `error.message`.** Dagster wraps errors: `RetryRequestedFromPolicy` wraps the real cause; `DagsterExecutionStepExecutionError` wraps an op's Python exception. The top-level `error.message` is the wrapper ("Exceeded max_retries of 1"). The real cause (e.g. `Forbidden: 403 Access Denied`, `ErrorDuneQueryException: Dune Query NNNN failed`, `AssertionError: No records found for mony...`) is in `error.errorChain[0].error.message`. Always request `errorChain` in the GraphQL query.
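A minimal unwrapping sketch, assuming the error payload has the GraphQL shape used in this skill (`message` plus an optional `errorChain` list); the helper name is illustrative:

```python
# Sketch of unwrapping a Dagster PythonError: prefer the deepest
# errorChain entry over the wrapper's top-level message.
def real_cause(error: dict) -> str:
    """Return the innermost wrapped error message, or the wrapper's own."""
    chain = error.get("errorChain") or []
    if chain:
        return chain[-1]["error"]["message"]
    return error.get("message", "")
```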
When an alert's root cause is "downstream cron fired before upstream finished landing," the fix is a completion sensor (make_dbt_models_updated_sensor or make_job_completion_sensor), not a later cron or a test interval widen. Test widens mask the problem; cron tuning is brittle. See the d72b947cb / PR #5397 pattern.
### Worktree Environment Gotchas

These have broken the skill's pre-commit / push steps in the past.

**`.env` files and venv.** Worktrees under `$WORKSPACE/worktrees/` or `$WORKSPACE/dbt-worktrees/` don't have `.env.local`, `.env.shared`, `.env.1pass`, or `venv-artemis`. Before any git operation that triggers pre-commit hooks (which import Django), source env + venv from the MAIN repo:
```bash
MAIN=$GOKU_REPO   # or $DBT_REPO
cd $MAIN && set -a && source .env.shared && source .env.local 2>/dev/null; set +a
source $MAIN/venv-artemis/bin/activate
cd $WORKTREE && git commit ...   # hooks now have SECRET_KEY + python
```
Without this, the django-migration-lint pre-commit hook fails with Executable 'python' not found or ImproperlyConfigured: SECRET_KEY must not be empty.
When rebasing fixes onto an existing PR branch, first git merge origin/main into that branch. Example pitfall: Branch X was cut before PR Y merged; PR Y added the exact lines your cherry-pick touches. Without merging main first, your cherry-pick applies against pre-Y state and becomes a no-op diff (or a confusing merge-commit resolution at the final merge).
Correct order:
```bash
cd <target_worktree>
git reset --hard origin/<target_branch>   # drop stale local commits first
git merge origin/main --no-edit           # bring in everything since branch cut
git cherry-pick <fix_commit>              # now applies cleanly
# resolve conflicts if any, then push
```
## Learnings Log — UPDATE THIS EVERY RUN

This section is a living log. Every invocation of /investigate-data-alerts MUST append at least one entry to "Session Log" before closing, even if the run was uneventful. Before you end the run, ask yourself:
If an entry here becomes permanent behavior (i.e., it should always happen), promote it into the relevant Step above and delete the log entry. The log is ephemeral knowledge; the Steps are canonical.
### Session Log

- The Dagster bot no longer attaches `dagster_stdout_*.txt` to alert threads. The `scripts/download_alert_logs.py` helper (on stash in the `fix/huntress-internal-token-auth` branch) returns 0 files. Threads now contain only Milo auto-investigation replies. Bypass: use Dagster GraphQL per-run_id. Promoted to Step 0.5 / Step 1.
- `logsForRun` and `pipelineRunLogs` don't exist at the root; `Run.eventConnection` is the correct path. `eventConnection.limit` max is 1000 — paginate via `afterCursor`. `DagsterRunEvent` is an interface; `message`/`level` need an inline fragment on `MessageEvent`. Promoted to Step 1.A.2.
- `PythonError.errorChain` is required to see the actual cause wrapped by `RetryRequestedFromPolicy` or `DagsterExecutionStepExecutionError`. Without it you only see wrapper text like "Exceeded max_retries of 1." Promoted to Investigation Principles.
- Huntress `get_run_details` works but is 6h-stale (sync cron `15 1,7,13,19 * * *` UTC). For fresh alerts, Dagster GraphQL is primary; Huntress is fallback. The `/dagster/sync-run` HTTP endpoint on Huntress returned 502 (provider not configured on the web service container — only on the cron container). Promoted to Step 1.A.4.
- `DAGSTER_CLOUD_API_TOKEN` is NOT in `.env.1pass`. Generate a user token at https://artemis.dagster.cloud/prod/user-settings, write it to `$GOKU_REPO/.env.local`, source via `.activate`. The `!export` chat trick does not propagate to Bash-tool subprocesses — file persistence is reliable. Promoted to Step 0.5.
- Milo PR rejection patterns include non-deterministic dedup (`QUALIFY ORDER BY` when the tiebreak column ties across dups) and fixes that target a symptom in a different model than the actual failure. Promoted to Step 2b.5.
- Before filing, check for in-flight PRs: `gh pr list --state open --search "in:title <keyword>"` and `git branch -a | grep -iE "<keyword>"`. A parallel PR for an already-open fix is waste. Extend the existing PR instead. Promoted to Step 2b.5.
- Pre-commit hooks in worktrees fail without `SECRET_KEY`. Source `.env.local` + `venv-artemis` from the MAIN gokustats repo before any commit in a worktree. Promoted to "Worktree Environment Gotchas".

### How to promote an entry

When a learning is stable and should always apply, edit the relevant Step above and mark the log entry **Promoted to <location>.**

Delete log entries only when they are both promoted and older than 6 months.
- **YYYY-MM-DD** — <one-sentence summary of what you learned>. <why it matters / what failed without this knowledge>. <Promoted to Step X.Y> OR <kept in log, re-evaluate next run>