Use when debugging a production or staging issue, monitoring a hot path, or instrumenting code that should be observable. Do NOT use for local-only debugging of new code (use your IDE). Covers logs/metrics/traces-first method, structured logging, correlation ID propagation, alarm design.
npx claudepluginhub lgerard314/global-marketplace --plugin global-plugin
Debug by reading the system, not by guessing. Apply when investigating live issues, reviewing hot-path handlers, or writing code for shared environments.
Never ship console.log patches to prod. — Why: redeploying to add instrumentation adds risk and delay; structured logs/metrics/traces should already contain the answer.
Always log requestId, service, level, and relevant domain fields — never free-text strings alone. — Why: structured fields can be indexed, filtered, and used in metric filters without custom parsing.

| Thought | Reality |
|---|---|
| "Just console.log and redeploy" | You are coding blind. Every redeployment adds risk and delay. Proper structured logging should already capture what you need. |
| "Logs are strings, grep works" | For a single request at 3 AM across three services with 10k req/min throughput, grep-by-string is a nightmare. Structured fields with an indexed query are the difference between 5 minutes and 50. |
| "Alarm on p50 latency" | The tail is where users suffer. p50 can look healthy while the slowest 5% of users — often the ones with the most data or the most complex accounts — experience a broken product. |
Bad — free-text with no fields for filtering or correlation:
// BAD: unstructured, no requestId, impossible to aggregate or alert on
console.log(`User ${userId} failed to checkout: ${error.message}`);
Good — structured JSON via Pino with requestId and domain fields:
// GOOD: structured fields enable filtering, alerting, and correlation
import { logger } from './logger'; // Pino instance bound to the request context
logger.error(
  {
    requestId: ctx.requestId,
    service: 'checkout',
    userId,
    orderId,
    errorCode: error.code,
    durationMs: Date.now() - startTime,
  },
  'Checkout failed — payment declined',
);
Bad — no correlation ID; requests cannot be traced across services:
// BAD: downstream service has no way to link this call to the originating request
async function fetchInventory(skuId: string): Promise<number> {
  const res = await fetch(`https://inventory.internal/sku/${skuId}/stock`);
  return res.json();
}
Good — correlation ID from AsyncLocalStorage context forwarded in every outbound call:
// GOOD: x-request-id flows into downstream logs and traces
import { getRequestContext } from './request-context'; // AsyncLocalStorage-backed
async function fetchInventory(skuId: string): Promise<number> {
  const { requestId } = getRequestContext();
  const res = await fetch(`https://inventory.internal/sku/${skuId}/stock`, {
    headers: { 'x-request-id': requestId },
  });
  return res.json();
}
Bad — alarming on median; tail latency goes undetected:
# BAD: p50 alarm misses the users suffering at the tail
AlarmName: CheckoutLatencyHigh
MetricName: checkout.latency.p50
Threshold: 500 # ms
# No runbook link
Good — p99 alarm with explicit runbook link:
# GOOD: p99 catches tail; runbook tells on-call what to do
AlarmName: CheckoutLatencyP99High
MetricName: checkout.latency.p99
Threshold: 1000 # ms
AlarmDescription: |
  p99 checkout latency exceeded 1000 ms.
  Runbook: https://runbooks.internal/checkout-latency-high
For the Pino base config, PII redaction setup, and AsyncLocalStorage-backed request middleware, see references/patterns.md — Structured logging (Pino) setup section.
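A minimal sketch of that base logger, assuming Pino; the redacted paths here are illustrative and the full setup lives in the referenced file:
import pino from 'pino';

// Base logger: JSON lines to stdout; request middleware binds requestId via child()
export const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  redact: {
    paths: ['req.headers.authorization', '*.password', '*.cardNumber'], // illustrative PII paths
    censor: '[REDACTED]',
  },
});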
A correlation ID is a UUID generated at the entry point and carried through every downstream operation (HTTP calls, queue messages, DB query comments, trace spans).
The AsyncLocalStorage pattern avoids threading the ID through every function argument: middleware establishes the context; anywhere downstream calls getRequestContext().
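A minimal sketch of such a request-context module, assuming Express; the storage and getRequestContext exports match the imports used in the examples on this page, while the middleware shape is an assumption:
import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';
import type { RequestHandler } from 'express';
import type { Logger } from 'pino';
import { logger } from './logger';

export interface RequestContext {
  requestId: string;
  log: Logger;
}

export const storage = new AsyncLocalStorage<RequestContext>();

export function getRequestContext(): RequestContext {
  const ctx = storage.getStore();
  if (!ctx) throw new Error('getRequestContext() called outside a request scope');
  return ctx;
}

// Middleware: reuse the inbound x-request-id or mint one, then run the rest of
// the request inside the AsyncLocalStorage scope.
export const requestContextMiddleware: RequestHandler = (req, _res, next) => {
  const requestId = req.header('x-request-id') ?? randomUUID();
  storage.run({ requestId, log: logger.child({ requestId }) }, next);
};
Mount it early (app.use(requestContextMiddleware)) so every handler, log line, and outbound call runs inside the scope.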
Propagating into downstream HTTP fetches:
import { getRequestContext } from './request-context';
export async function callDownstreamService<T>(url: string, init?: RequestInit): Promise<T> {
  const { requestId } = getRequestContext();
  const res = await fetch(url, {
    ...init,
    headers: {
      ...init?.headers,
      'x-request-id': requestId, // standard header; some teams use 'traceparent' (W3C)
      'content-type': 'application/json',
    },
  });
  // UpstreamError: an application-defined error class carrying status and URL
  if (!res.ok) throw new UpstreamError(res.status, url);
  return res.json() as Promise<T>;
}
Propagating into SQS / SNS messages:
import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';
import { getRequestContext } from './request-context';
const sqs = new SQSClient({});
export async function enqueueEvent(queueUrl: string, payload: unknown): Promise<void> {
  const { requestId } = getRequestContext();
  await sqs.send(new SendMessageCommand({
    QueueUrl: queueUrl,
    MessageBody: JSON.stringify(payload),
    MessageAttributes: {
      requestId: {
        DataType: 'String',
        StringValue: requestId, // consumer extracts this and re-establishes context
      },
    },
  }));
}
Re-establishing context in the queue consumer:
import { SQSEvent } from 'aws-lambda';
import { storage } from './request-context';
import { logger } from './logger';
export async function handler(event: SQSEvent): Promise<void> {
  for (const record of event.Records) {
    // Reuse the producer's requestId; mint a fresh one if the attribute is missing
    const requestId = record.messageAttributes?.['requestId']?.stringValue ?? crypto.randomUUID();
    const log = logger.child({ requestId, queue: record.eventSourceARN });
    await storage.run({ requestId, log }, () => processRecord(record));
  }
}
W3C traceparent for OpenTelemetry compatibility. With OTel, propagate the standard traceparent header instead of (or alongside) x-request-id. The HTTP instrumentation library injects it automatically; otherwise call propagation.inject(context, carrier) yourself.
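A minimal sketch of the manual form, assuming @opentelemetry/api with a W3C trace-context propagator registered (tracedFetch is a hypothetical wrapper; with auto-instrumentation no such code is needed):
import { context, propagation } from '@opentelemetry/api';

export async function tracedFetch(url: string, init?: RequestInit): Promise<Response> {
  const headers: Record<string, string> = {};
  propagation.inject(context.active(), headers); // writes 'traceparent' (+ 'tracestate')
  return fetch(url, { ...init, headers: { ...init?.headers, ...headers } });
}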
Instrument every handler with the four golden signals (latency, traffic, errors, saturation). Examples in CloudWatch EMF; concepts are universal.
Emitting structured metrics via CloudWatch EMF:
import { createMetricsLogger, Unit } from 'aws-embedded-metrics';
import type { Request, Response } from 'express';

export async function checkoutHandler(req: Request, res: Response): Promise<void> {
  const metrics = createMetricsLogger();
  metrics.setNamespace('MyApp/Checkout');
  metrics.setDimensions({ service: 'checkout', env: process.env.NODE_ENV ?? 'prod' });
  const start = Date.now();
  try {
    await processCheckout(req.body);
    metrics.putMetric('CheckoutSuccess', 1, Unit.Count);
    res.status(200).json({ ok: true });
  } catch (err) {
    metrics.putMetric('CheckoutError', 1, Unit.Count);
    throw err;
  } finally {
    metrics.putMetric('CheckoutDuration', Date.now() - start, Unit.Milliseconds);
    await metrics.flush(); // writes JSON metric payload to stdout → CloudWatch
  }
}
Alternatively, extract metrics from Pino JSON logs via an AWS::Logs::MetricFilter with FilterPattern: '{ $.level = "error" && $.service = "checkout" }' and dimensions keyed on $.errorCode.
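A sketch of that metric filter in CDK; checkoutLogGroup is an assumed construct, and dimensions on metric filters require a recent aws-cdk-lib:
import { FilterPattern, MetricFilter } from 'aws-cdk-lib/aws-logs';

new MetricFilter(this, 'CheckoutErrorCount', {
  logGroup: checkoutLogGroup, // assumed: the service's log group construct
  metricNamespace: 'MyApp/Checkout',
  metricName: 'CheckoutError',
  filterPattern: FilterPattern.all(
    FilterPattern.stringValue('$.level', '=', 'error'),
    FilterPattern.stringValue('$.service', '=', 'checkout'),
  ),
  metricValue: '1',
  dimensions: { errorCode: '$.errorCode' },
});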
Percentiles over averages. Always publish p50/p95/p99 (CloudWatch PERCENTILE statistic). Averages mask bimodal distributions.
What to measure on every handler at minimum:
| Metric | Unit | Purpose |
|---|---|---|
| `<handler>.duration` | Milliseconds (histogram) | Latency; alert on p99 |
| `<handler>.requests` | Count | Throughput; alert on anomalous drops |
| `<handler>.errors` | Count, with errorCode dimension | Error rate; alert on rate increase |
| `<handler>.saturation` | Gauge (concurrent in-flight) | Back-pressure signal; see the sketch after this table |
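A minimal sketch of the saturation gauge, assuming Express alongside the EMF logger shown above (the 10-second publish interval is illustrative):
import { createMetricsLogger, Unit } from 'aws-embedded-metrics';
import type { RequestHandler } from 'express';

let inFlight = 0;

// Count requests currently being processed
export const saturationMiddleware: RequestHandler = (_req, res, next) => {
  inFlight += 1;
  res.once('finish', () => { inFlight -= 1; });
  next();
};

// Publish the gauge periodically
setInterval(async () => {
  const metrics = createMetricsLogger();
  metrics.setNamespace('MyApp/Checkout');
  metrics.putMetric('checkout.saturation', inFlight, Unit.Count);
  await metrics.flush();
}, 10_000);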
For OpenTelemetry SDK setup, manual span pattern, and trace-context-into-Pino injection, see references/patterns.md — Tracing critical paths section.
Alert on user-visible symptoms; every alarm gets a runbook.
Alarm tiers:
| Tier | Condition | Paging policy |
|---|---|---|
| P1 — Critical | Error rate > 5% for 5 min OR p99 latency > 2×SLO for 5 min | Page immediately, escalate after 15 min |
| P2 — Warning | Error rate > 1% for 10 min OR p99 latency > 1.5×SLO for 10 min | Slack alert, page if not acknowledged in 30 min |
| P3 — Info | Unusual traffic drop, circuit breaker opened | Slack only |
CloudWatch alarm with runbook (CDK):
import { Duration } from 'aws-cdk-lib';
import { Alarm, ComparisonOperator, TreatMissingData } from 'aws-cdk-lib/aws-cloudwatch';
import { SnsAction } from 'aws-cdk-lib/aws-cloudwatch-actions';

// checkoutDurationMetric: the Metric for checkout.duration; oncallTopic: the on-call SNS topic
new Alarm(this, 'CheckoutP99LatencyAlarm', {
  alarmName: 'checkout-p99-latency-high',
  alarmDescription: [
    'p99 checkout latency exceeded 1000 ms for 5 consecutive minutes.',
    'Runbook: https://runbooks.internal/checkout-latency-high',
    'Dashboard: https://cloudwatch.aws.amazon.com/...#dashboards:name=Checkout',
  ].join('\n'),
  metric: checkoutDurationMetric.with({
    statistic: 'p99',
    period: Duration.minutes(1),
  }),
  threshold: 1000,
  evaluationPeriods: 5,
  comparisonOperator: ComparisonOperator.GREATER_THAN_THRESHOLD,
  treatMissingData: TreatMissingData.NOT_BREACHING,
  actionsEnabled: true,
}).addAlarmAction(new SnsAction(oncallTopic));
Runbook sections (required): what fired; user impact; dashboard links; numbered investigation steps (check p50 co-movement, Logs Insights error-by-code, downstream status pages, recent deploys); mitigation options (feature-flag fallback, scale read replicas, rollback); escalation path (team lead after 30 min).
Work these steps in order.
Step 1 — Define the blast radius. How many users, which features, since when? Check the error-rate dashboard for the inflection point.
Step 2 — Read the logs. Query structured logs for the time window and affected service. Use a known-bad requestId if you have one, or filter by level=error and service=<affected>. Look for error codes, cause chains, unexpected fields, and patterns in affected requests (same userId prefix, skuId, downstream endpoint).
# CloudWatch Logs Insights query for the affected window
fields @timestamp, requestId, errorCode, durationMs, @message
| filter service = "checkout" and level = "error"
# time window: set on the query itself (console picker or StartQuery start/end)
| sort @timestamp desc
| limit 50
Step 3 — Read the metrics. Pull p99 latency, error rate, and throughput for the affected handler over the last 2 h. Look for correlation with a deploy (vertical annotation), step-function change (config change), or gradual degradation (resource exhaustion/leak).
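A sketch of pulling the p99 series with the AWS SDK v3; the namespace and metric name match the EMF example above, with dimensions omitted for brevity:
import { CloudWatchClient, GetMetricDataCommand } from '@aws-sdk/client-cloudwatch';

const cw = new CloudWatchClient({});

export async function checkoutP99LastTwoHours(): Promise<number[]> {
  const end = new Date();
  const start = new Date(end.getTime() - 2 * 60 * 60 * 1000); // last 2 h
  const res = await cw.send(new GetMetricDataCommand({
    StartTime: start,
    EndTime: end,
    MetricDataQueries: [{
      Id: 'p99',
      MetricStat: {
        Metric: { Namespace: 'MyApp/Checkout', MetricName: 'CheckoutDuration' },
        Period: 60, // 1-minute resolution
        Stat: 'p99',
      },
    }],
  }));
  return res.MetricDataResults?.[0]?.Values ?? [];
}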
Step 4 — Pull a trace. Take one requestId from step 2 and find its trace. Common culprits: bad DB query plan, downstream timeout, new code path calling an extra external service.
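For that lookup to work, the requestId has to be searchable on the trace. A minimal sketch with the OpenTelemetry API, assuming the SDK is initialized and this is called from the request middleware:
import { trace } from '@opentelemetry/api';

export function tagSpanWithRequestId(requestId: string): void {
  // Attach the correlation ID to the active span so traces are searchable by it
  trace.getActiveSpan()?.setAttribute('requestId', requestId);
}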
Step 5 — Form hypotheses (max three). Ordered by probability. Be specific: "Payment provider /charges returning 503 since 14:32 UTC" is a hypothesis; "something wrong with payments" is not.
Step 6 — Verify, then fix. Test the most likely hypothesis with the minimal action (status page, deploy log, specific metric query). Propose a fix only after confirming.
Step 7 — Post-mortem signal. File the log/metric/trace gap that made diagnosis harder than needed.
REQUIRED BACKGROUND: superpowers:systematic-debugging — this skill attaches inside Phase 1 (Root Cause) of that workflow; it does not replace it.
Related skills: resilience-and-error-handling for when to catch and log errors and how to propagate typed error causes; change-risk-evaluation for which metrics and alarms to watch during and after a deploy; queue-and-retry-safety for dead-letter-queue monitoring patterns.
Produce a markdown report with these sections:
- requestId, service, level, and domain fields