Audit a single AWS SQS queue's configuration for misconfigurations that silently drop or re-deliver messages.
Slash command: /sre-skills:sqs-queue-auditor
Configuration-audit skill for a single AWS SQS queue. Takes the `GetQueueAttributes` output for one queue, applies the judgment a senior engineer applies to that one source (the thresholds, the known-bad combinations, the one arithmetic relationship that turns a correct-looking config into silent message loss), and returns a ranked list of findings with recommendations. Then it names exactly where a single queue's configuration stops being able to answer the question.
It reads the static configuration of one queue, plus the attributes of the dead-letter queue that queue's own RedrivePolicy points at. Both are SQS control-plane reads (GetQueueAttributes). That is the entire input. The audit is correct and complete for what a queue's configuration can tell you, and it is explicit about the rest:
Every audit ends by naming these. The boundary is the same one every time: the join across resources, across sources, or across time.
GetQueueAttributes returns every value as a string, and the compound attributes are JSON documents encoded inside those strings. Before any judgment:
- Parse RedrivePolicy (a JSON string) into deadLetterTargetArn and maxReceiveCount. A queue with no RedrivePolicy has no DLQ.
- Read MessageRetentionPeriod, VisibilityTimeout, DelaySeconds as integer seconds (they arrive as strings).
- Parse Policy (a JSON string) into IAM statements, if present.
- Read FifoQueue, ContentBasedDeduplication, SqsManagedSseEnabled, KmsMasterKeyId.

A naive read skips the embedded JSON entirely and never sees the redrive wiring. Parsing it is step zero of the judgment.
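The normalisation step above can be sketched roughly as follows. This is a minimal illustration, not the skill's own code: the function name `parse_queue_attributes`, the output key names, and the sample ARN are all hypothetical. Only the attribute names come from the GetQueueAttributes response described above.

```python
import json


def parse_queue_attributes(attrs: dict) -> dict:
    """Normalise the string-valued Attributes map (hypothetical helper)."""
    # Compound attributes are JSON documents stored inside strings.
    redrive = json.loads(attrs["RedrivePolicy"]) if "RedrivePolicy" in attrs else None
    policy = json.loads(attrs["Policy"]) if "Policy" in attrs else None

    # A policy's Statement may be a single object or a list; normalise to a list.
    statements = []
    if policy is not None:
        raw = policy.get("Statement", [])
        statements = raw if isinstance(raw, list) else [raw]

    return {
        "dlq_arn": redrive["deadLetterTargetArn"] if redrive else None,
        # int() accepts both a JSON number and a numeric string.
        "max_receive_count": int(redrive["maxReceiveCount"]) if redrive else None,
        "retention_s": int(attrs["MessageRetentionPeriod"]),
        "visibility_s": int(attrs["VisibilityTimeout"]),
        "delay_s": int(attrs["DelaySeconds"]),
        "policy_statements": statements,
        # Boolean attributes also arrive as the strings "true" / "false".
        "fifo": attrs.get("FifoQueue") == "true",
        "content_dedup": attrs.get("ContentBasedDeduplication") == "true",
        "sse_sqs": attrs.get("SqsManagedSseEnabled") == "true",
        "kms_key_id": attrs.get("KmsMasterKeyId"),
    }


# Illustrative input only; the queue name and account ID are made up.
sample = {
    "MessageRetentionPeriod": "345600",
    "VisibilityTimeout": "30",
    "DelaySeconds": "0",
    "RedrivePolicy": json.dumps({
        "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:orders-dlq",
        "maxReceiveCount": 5,
    }),
}
parsed = parse_queue_attributes(sample)
print(parsed["max_receive_count"], parsed["retention_s"], parsed["fifo"])
```

Keeping the parse separate from the rules means every later check works on integers and dictionaries, never on raw strings.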
The dead-letter path is where messages are supposed to go when processing fails. Three things break it:
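- No dead-letter queue (R1). Without a RedrivePolicy, a poison message is retried until MessageRetentionPeriod expires, then deleted with no signal. There is no quarantine.
- maxReceiveCount outside the 3-10 band (R2). Set too low (for example, 1), it dead-letters good messages on the first transient failure. Set too high, it is flagged low for verification.
- DLQ retention not longer than source retention (R3). Message age is measured from the original SentTimestamp, and SQS does not reset that timestamp when the message moves to the DLQ. If the DLQ's retention is less than or equal to the source's, a message that fails late in the source's window arrives in the DLQ already near its age limit and is deleted almost immediately. The DLQ looks wired and sized; the messages you most need to inspect are the ones it drops. This is the single most important non-obvious check in the skill.

Three queue-side timing relationships, all derivable from the static config: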
- Poison messages age out before reaching the DLQ (R4). A message that fails on every receive needs at least maxReceiveCount x VisibilityTimeout seconds of wall-clock to exhaust its receive count and dead-letter. If that product exceeds MessageRetentionPeriod, retention wins: the message is deleted by age before it ever reaches the DLQ. The DLQ is configured but unreachable for slow failures. This is pure arithmetic on three attributes and is almost never checked by hand. Fire R4 only on the configured-value inequality (maxReceiveCount x VisibilityTimeout > MessageRetentionPeriod); do not raise it speculatively because backlog or load "might" stretch the wall-clock. The product is already a lower bound, so a config where it stays within MessageRetentionPeriod passes this rule. Queue depth and receive cadence are behind the boundary, not inputs to this check.
- Visibility timeout at the 30s default (R5). Whether 30 seconds is enough depends on consumer processing time, which is not a queue attribute. Flagged low and deferred to the boundary.
- Retention shorter than a plausible outage (R6). A short MessageRetentionPeriod loses messages under an ordinary operational gap. Flagged medium.

Wildcard principal (R7). A Policy statement that allows a wildcard principal ("*") with no narrowing Condition (aws:SourceArn, aws:SourceAccount, aws:PrincipalOrgID) authorises any AWS principal to act on the queue. This is the confused-deputy and public-queue exposure. A wildcard principal with a SourceArn condition (the standard SNS-to-SQS pattern) is fine and must not be flagged.

FIFO dedup (R9). When ContentBasedDeduplication is off on a FIFO queue, every producer must supply an explicit MessageDeduplicationId or duplicate sends are accepted as distinct. Whether the producers actually do this is a property of the producers, not the queue. Flagged low and deferred to the boundary.

Order findings by severity (critical, high, medium, low). For each, give the rule, the attribute(s) it is grounded in, what breaks, and the fix. Then list the boundary: the joins this audit cannot make. A clean config still gets a boundary section, because a clean config is not a clean system.
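One way to picture the output contract described above is the sketch below. The `Finding` dataclass, `render_report`, and the sample finding and boundary strings are hypothetical names and wording chosen for illustration; the skill itself produces prose, not this exact format.

```python
from dataclasses import dataclass

# Order taken from the text: critical, high, medium, low.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass
class Finding:
    rule: str              # e.g. "R3"
    severity: str          # one of the SEVERITY_RANK keys
    attributes: list[str]  # the attribute(s) the finding is grounded in
    what_breaks: str
    fix: str


def render_report(findings: list[Finding], boundary: list[str]) -> str:
    """Render findings most-severe first, then always append the boundary."""
    ordered = sorted(findings, key=lambda f: SEVERITY_RANK[f.severity])
    lines = [
        f"[{f.severity}] {f.rule} ({', '.join(f.attributes)}): {f.what_breaks} Fix: {f.fix}"
        for f in ordered
    ]
    if not lines:
        lines.append("No findings.")
    # The boundary section is unconditional, even for a clean config.
    lines.append("Boundary:")
    lines.extend(f"- {item}" for item in boundary)
    return "\n".join(lines)


# Illustrative data only; the wording of fixes and boundary items is assumed.
findings = [
    Finding("R5", "low", ["VisibilityTimeout"],
            "30s default may be shorter than processing time.",
            "Verify against consumer processing time."),
    Finding("R3", "critical", ["MessageRetentionPeriod"],
            "Late failures are deleted on arrival in the DLQ.",
            "Give the DLQ longer retention than the source."),
]
boundary = ["queue to its consumers", "queue to its metrics over time"]
report = render_report(findings, boundary)
clean_report = render_report([], boundary)
print(report)
```

Because the boundary lines are appended outside the findings loop, a run with zero findings still reports them.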
| Severity | Meaning |
|---|---|
| critical | A configuration that silently loses messages. R3 and R4. |
| high | A configuration that loses messages on a poison input, or exposes the queue. R1, R7. |
| medium | A configuration that loses messages under an ordinary operational gap. R2 (too low), R6. |
| low | A risk flag whose confirmation needs something behind the boundary. R2 (too high), R5, R8, R9. |
The low band is deliberately honest: those findings depend on consumer processing time, data classification, or producer behaviour, none of which is a queue attribute. The skill flags them for verification rather than asserting a bug it cannot prove.
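The two critical rules, by contrast, need nothing beyond the parsed attributes. A minimal sketch of both comparisons follows. The function names are hypothetical. The 15-minute timeout and maxReceiveCount of 1000 are borrowed from the flagship example; the 4-day retention is an assumption (it is the SQS default), not a value stated for that fixture.

```python
DAY = 24 * 60 * 60  # seconds


def r3_fires(source_retention_s: int, dlq_retention_s: int) -> bool:
    """R3: the DLQ does not retain longer than the source (equal also fires)."""
    return dlq_retention_s <= source_retention_s


def r4_fires(max_receive_count: int, visibility_s: int, retention_s: int) -> bool:
    """R4: the minimum time to dead-letter exceeds retention."""
    # Strict inequality on configured values only; no allowance for backlog.
    return max_receive_count * visibility_s > retention_s


# 1000 receives x 900 s = 900000 s, against an assumed 345600 s (4 days).
print(r4_fires(1000, 15 * 60, 4 * DAY))  # True
# 5 receives x 30 s = 150 s, far inside the same retention.
print(r4_fires(5, 30, 4 * DAY))          # False
# Equal retention on source and DLQ still fires R3.
print(r3_fires(4 * DAY, 4 * DAY))        # True
# A 14-day DLQ behind a 4-day source does not.
print(r3_fires(4 * DAY, 14 * DAY))       # False
```

In the first case the product is roughly 10.4 days, so under the assumed retention the message would expire before it could exhaust its receive count.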
| Code | Rule | Severity | Grounded in |
|---|---|---|---|
| R1 | No dead-letter queue on a processing queue | high | RedrivePolicy absent |
| R2 | maxReceiveCount outside the 3-10 band | medium / low | RedrivePolicy.maxReceiveCount |
| R3 | DLQ retention not longer than source retention | critical | source vs DLQ MessageRetentionPeriod |
| R4 | Poison messages age out before reaching the DLQ | critical | VisibilityTimeout x maxReceiveCount vs MessageRetentionPeriod |
| R5 | Visibility timeout at the 30s default | low | VisibilityTimeout |
| R6 | Retention shorter than a plausible outage | medium | MessageRetentionPeriod |
| R7 | Resource policy allows a wildcard principal with no condition | high | Policy |
| R8 | Server-side encryption at rest disabled | low | SqsManagedSseEnabled / KmsMasterKeyId |
| R9 | FIFO queue with content-based dedup off | low | ContentBasedDeduplication |
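R7 is the one rule in the table that walks a nested document rather than comparing scalars. The sketch below shows one possible shape for that walk, assuming the Policy string has already been parsed into a dictionary. The helper names, Sids, and ARNs are hypothetical, and it simplifies in one known way: it treats the mere presence of a narrowing condition key as sufficient, without inspecting the condition operator or value.

```python
# Condition keys named in the rule; IAM condition key names are case-insensitive.
NARROWING_KEYS = {"aws:sourcearn", "aws:sourceaccount", "aws:principalorgid"}


def _as_list(value):
    """IAM allows a single item or a list in several places; normalise."""
    return value if isinstance(value, list) else [value]


def is_wildcard_principal(principal) -> bool:
    if principal == "*":
        return True
    if isinstance(principal, dict):
        return "*" in _as_list(principal.get("AWS", []))
    return False


def has_narrowing_condition(condition: dict) -> bool:
    # Condition shape: {operator: {condition_key: value_or_values}}
    for keys in condition.values():
        if any(key.lower() in NARROWING_KEYS for key in keys):
            return True
    return False


def r7_statements(policy: dict) -> list[str]:
    """Return the Sids of Allow statements open to any principal."""
    flagged = []
    for stmt in _as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        if is_wildcard_principal(stmt.get("Principal")) and not has_narrowing_condition(
            stmt.get("Condition", {})
        ):
            flagged.append(stmt.get("Sid", "<no Sid>"))
    return flagged


# Illustrative policy: one open statement, one standard SNS-to-SQS statement.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "PublicSend", "Effect": "Allow", "Principal": "*",
         "Action": "sqs:SendMessage", "Resource": "*"},
        {"Sid": "SnsFanout", "Effect": "Allow", "Principal": "*",
         "Action": "sqs:SendMessage", "Resource": "*",
         "Condition": {"ArnEquals": {"aws:SourceArn": "arn:aws:sns:us-east-1:123456789012:orders"}}},
    ],
}
flagged = r7_statements(policy)
print(flagged)
```

Only the unconditioned statement is reported; the SNS-style statement with a SourceArn condition passes, matching the rule's stated exception.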
The agent's final message in any invocation must include:
Eight end-to-end examples are committed under examples/, each with fixtures (real GetQueueAttributes shape) and a runnable replay test. Each isolates one rule, except where two genuinely co-occur.
- examples/01-no-dlq.md: a payments queue with no DLQ; poison messages are retried until retention expiry, then dropped (R1).
- examples/02-dlq-retention-shorter-than-source.md: the silent-loss bug; the DLQ retains for less time than the source, so late failures are deleted on arrival (R3).
- examples/03-maxreceivecount-too-low.md: maxReceiveCount=1 dead-letters good messages on the first transient failure (R2).
- examples/04-poison-ages-out-before-dlq.md: the flagship; a 15-minute visibility timeout and maxReceiveCount=1000 mean poison messages age out before reaching a correctly-wired DLQ (R4, plus R2).
- examples/05-default-visibility-short-retention.md: a 30s default visibility timeout and 5-minute retention; two soft flags that defer to the boundary (R5, R6).
- examples/06-public-queue-policy.md: a resource policy with Principal: "*" and no condition, on an unencrypted queue (R7, R8).
- examples/07-fifo-dedup-off.md: a FIFO queue with content-based dedup off, depending on a producer contract the queue cannot verify (R9).
- examples/08-clean-standard.md: the control; a correctly-configured queue produces zero findings and still reports its boundary.

Every example has a replay test in tests/ that runs the audit against committed fixtures, with no external credentials. Run from the skill directory:
for t in tests/replay_*.py; do python "$t" || exit 1; done
The 8 tests cover all nine rules, the severity model, and the clean-control (no false positives), totalling 48 assertions. Tests exit non-zero if the audit produces the wrong findings or drops the boundary. See tests/README.md for the fixture schema and how to add a new replay test.
This skill is wrong in predictable ways. Read FAILURE_MODES.md before relying on it. Highlights:
The audit above runs end-to-end against the GetQueueAttributes output the user already has. No Anyshift dependency.
Every boundary note in this skill is a join: queue to its consumers, queue to its metrics over time, queue to the account's IAM graph, queue to the producers and consumers on either side. The Anyshift MCP can act as a context primer by resolving those joins from a versioned resource graph, so a finding like R5 ("visibility timeout at default, verify against consumer processing time") or R7 ("resource policy is half the access story") can be closed instead of deferred. A measured "with vs without" delta will be published here once the integration has been exercised against the replay fixtures.
npx claudepluginhub anyshift-io/claude-plugins --plugin sre-skills