From ops-suite
Diagnose why messages are failing in dead letter queues. Surveys DLQs, fetches sample messages, analyzes failure patterns, inspects consumer code, and produces a root cause report. Use when asked about "DLQ triage", "dead letter diagnosis", "failed messages", "message failures", "queue errors".
npx claudepluginhub weorbitant/workbench-dev --plugin ops-suite
Check if `/tmp/ops-suite-session/config.json` exists:
- If it exists, read it.
- If it does not, read `config.yaml`, parse it, and write it to `/tmp/ops-suite-session/config.json` for other skills to reuse.
- If neither exists, tell the user to copy `config.example.yaml` to `config.yaml` and fill in their values. Stop here.

Extract:
- `message_broker`: determines which adapter to load
- `orchestrator`: for connecting to the broker
- `environments`: connection details

Also read the reference at `references/known-patterns.md` (in this skill's directory) for common failure patterns.
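The lookup order above can be sketched in Python. This is a minimal sketch, not the plugin's own loader: the function name `load_session_config` and its parameters are hypothetical, `config.yaml` is assumed to sit in the current working directory, and the fallback path assumes PyYAML is installed.

```python
import json
from pathlib import Path

SESSION_CONFIG = Path("/tmp/ops-suite-session/config.json")


def load_session_config(cache_path=SESSION_CONFIG, yaml_path=Path("config.yaml")):
    """Return the parsed config, preferring the cached JSON copy."""
    cache_path, yaml_path = Path(cache_path), Path(yaml_path)

    # 1. Reuse the session cache when an earlier skill already wrote it.
    if cache_path.exists():
        return json.loads(cache_path.read_text())

    # 2. Neither file exists: the user has to create config.yaml first.
    if not yaml_path.exists():
        raise FileNotFoundError(
            "Copy config.example.yaml to config.yaml and fill in your values."
        )

    # 3. Parse config.yaml and cache it as JSON for other skills to reuse.
    import yaml  # PyYAML (assumed installed); only needed on this path

    config = yaml.safe_load(yaml_path.read_text())
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(config, indent=2))
    return config
```

The returned dictionary would then be expected to carry the `message_broker`, `orchestrator`, and `environments` keys listed above.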
Read the adapter file at adapters/{message_broker}.md (in this skill's directory).
If the adapter does not exist, tell the user that the message broker {message_broker} is not yet supported and stop.
If $ARGUMENTS contains an environment name, use it. Otherwise ask the user.
Store the selected environment config as env.
Determine whether the target queue is a DLQ or a main queue:
- DLQ name markers: `.dead_letter`, `.dlq`, or `.error`

Use the adapter's "list DLQs" command to find all dead letter queues with messages.
If $ARGUMENTS contains a specific queue name, focus on that queue.
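As a rough illustration of the naming check, the sketch below treats the three markers as queue-name suffixes. That reading is an assumption (an adapter may match names differently), and `is_dlq` is a hypothetical helper, not part of ops-suite.

```python
DLQ_MARKERS = (".dead_letter", ".dlq", ".error")


def is_dlq(queue_name: str) -> bool:
    """Return True when the queue name ends with one of the DLQ markers."""
    # str.endswith accepts a tuple and matches any of its members.
    return queue_name.lower().endswith(DLQ_MARKERS)
```

A name that does not match would be handled as a main queue.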
Present the DLQ overview:
DLQs with messages:
| DLQ Name | Messages | Original Queue |
|--------------------------|----------|----------------------|
| {dlq_name} | {count} | {source_queue} |
If there are multiple DLQs, ask the user which one to triage first.
Then continue to Step 4.
This is a missing consumer investigation. Skip Steps 4-5 and go directly to Step 6 (Inspect consumer code), focusing on:
After completing the code inspection, skip to Step 9 to produce the report.
Use the adapter's "peek messages" command to get 3-5 sample messages without consuming them.
For each message, extract:
- Headers (`x-death`, `x-first-death-reason`, `x-first-death-queue`)

Classify failures using the decision tree:
| Indicator | Likely Failure Mode |
|---|---|
| `x-first-death-reason: rejected` | Consumer explicitly rejected the message |
| `x-first-death-reason: expired` | Message TTL exceeded, consumer too slow or down |
| Malformed JSON in body | Serialization error from producer |
| Missing required fields | Schema mismatch between producer and consumer |
| Valid payload, same error repeated | Consumer bug or external dependency failure |
| Messages from different producers | Shared failure (e.g., database down) |
| All messages have same entity ID | Entity-specific data issue |
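The per-message rows of the table can be approximated in code. The sketch below is illustrative only: `match_indicators` and its `required_fields` parameter are hypothetical, and it returns every matching failure mode rather than picking one. The last three rows need a comparison across the whole sample, so they are not covered here.

```python
import json

# Per-message rows of the table above: header value -> likely failure mode.
DEATH_REASONS = {
    "rejected": "Consumer explicitly rejected the message",
    "expired": "Message TTL exceeded, consumer too slow or down",
}


def match_indicators(headers, body, required_fields=()):
    """Return every per-message failure mode whose indicator matches."""
    modes = []

    reason = headers.get("x-first-death-reason")
    if reason in DEATH_REASONS:
        modes.append(DEATH_REASONS[reason])

    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        # The body is not parseable JSON at all.
        modes.append("Serialization error from producer")
        return modes

    # required_fields is supplied by the caller; the skill does not define it.
    if required_fields and (
        not isinstance(payload, dict)
        or any(field not in payload for field in required_fields)
    ):
        modes.append("Schema mismatch between producer and consumer")
    return modes
```

An empty result means none of the per-message indicators fired, which points toward the cross-message rows (repeated error, shared failure, or a single entity ID).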
Use the analyze_messages.py script for bulk analysis if there are many messages:
python3 scripts/analyze_messages.py {messages_file}
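For a sense of what bulk analysis might involve, here is a small tally of death reasons. It is not the `analyze_messages.py` script itself, whose behavior is not shown here; the function names and the input format (a JSON array of objects with a `headers` key) are assumptions made for illustration.

```python
import json
from collections import Counter


def tally_death_reasons(messages):
    """Count x-first-death-reason values across a list of message dicts."""
    return Counter(
        msg.get("headers", {}).get("x-first-death-reason", "unknown")
        for msg in messages
    )


def tally_file(path):
    """Load a JSON array of messages from disk and tally it."""
    with open(path, encoding="utf-8") as handle:
        return tally_death_reasons(json.load(handle))
```

A tally dominated by one reason suggests a single failure mode; a mix suggests several causes worth triaging separately.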
If the codebase is available:
- Search the config for the queue name (e.g. `grep -r "queue_name" src/config/`). Identify the subscription name mapped to this queue.
- Search for `subscribe('{subscription_name}'` in the codebase. Compare all subscriptions declared in config against the actual `subscribe()` calls; any mismatch means orphaned config.
- Find the consumer code (e.g. under `application/amqp/`) and read its `onApplicationBootstrap()` method to see which subscriptions are actually registered.

If the consumer code is missing or the `subscribe()` call doesn't exist:
- Check the file's history: `git log --all --oneline -- {subscriber_file_path}`
- Look for deleted files: `git log --all --oneline --diff-filter=D -- {subscriber_directory}/`
- Run `git show {commit_hash} -- {file_path}` to see what was removed

This step is critical when the subscription exists in config but no code subscribes to it.
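The comparison between declared subscriptions and actual `subscribe()` calls can be sketched as a set difference. This is a hedged illustration: `subscribed_names` and `orphaned_subscriptions` are hypothetical helpers, the `*.ts` file pattern assumes a TypeScript codebase, and the regex only catches calls that pass a string literal.

```python
import re
from pathlib import Path

# Matches subscribe('name' or subscribe("name" and captures the name.
SUBSCRIBE_CALL = re.compile(r"""subscribe\(\s*['"]([^'"]+)['"]""")


def subscribed_names(source_text):
    """Return the subscription names passed to subscribe() in one source string."""
    return set(SUBSCRIBE_CALL.findall(source_text))


def orphaned_subscriptions(declared, source_root, pattern="*.ts"):
    """Return declared subscriptions that no source file subscribes to."""
    found = set()
    for path in Path(source_root).rglob(pattern):
        found |= subscribed_names(path.read_text(encoding="utf-8", errors="ignore"))
    return set(declared) - found
```

Any name in the result is declared in config but never subscribed to in the scanned files, which makes it a candidate for the git history checks above.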
Based on the failure analysis, use read-only skills to gather context automatically:
Use session state from /tmp/ops-suite-session/ — do not re-ask for environment.
Use ops-suite:service-logs with arguments: {consumer_service} {env_name}. Focus on errors related to the DLQ entities. Use session state — do not re-ask for environment.
Present the triage report:
Queue Triage Report
===================
Queue: {dlq_name}
Environment: {env_name}
Messages: {total_count}
Time range: {earliest_timestamp} — {latest_timestamp}
Failure Mode: {classification}
Root Cause: {description}
Evidence:
- {evidence_point_1}
- {evidence_point_2}
- {evidence_point_3}
Sample Message:
Headers: {relevant_headers}
Routing Key: {routing_key}
Body (truncated): {first_200_chars}
Recommendation:
1. {action_1}
2. {action_2}
3. {action_3}
Reprocessable: {yes/no/partial}
{explanation of whether messages can be safely reprocessed after fix}
If messages are reprocessable, add:
Next steps:
→ Run `/ops-suite:queue-reprocess {dlq_name} {env_name}` to move messages back to the main queue.
If the root cause is a missing migration, add:
Next steps:
→ Run `/ops-suite:db-migrate {env_name}` to apply pending migrations.
→ Then run `/ops-suite:queue-reprocess {dlq_name} {env_name}` to reprocess failed messages.
Save triage results to /tmp/ops-suite-session/last-triage.json for use by ops-suite:queue-reprocess.
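Persisting the result could look like the sketch below. The `save_triage` helper and the keys in the usage example are hypothetical, since the schema that `ops-suite:queue-reprocess` expects is not specified here.

```python
import json
from pathlib import Path

SESSION_DIR = Path("/tmp/ops-suite-session")


def save_triage(result, session_dir=SESSION_DIR):
    """Write the triage result to last-triage.json and return its path."""
    session_dir = Path(session_dir)
    session_dir.mkdir(parents=True, exist_ok=True)
    target = session_dir / "last-triage.json"
    target.write_text(json.dumps(result, indent=2))
    return target
```

For example, `save_triage({"queue": "{dlq_name}", "environment": "{env_name}", "reprocessable": "partial"})` would write those three illustrative fields to the session directory.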