Use when debugging a production or staging issue, monitoring a hot path, or instrumenting code that should be observable. Do NOT use for local-only debugging of new code (use your IDE). Covers logs/metrics/traces-first method, structured logging, correlation ID propagation, alarm design.
npx claudepluginhub lgerard314/global-marketplace --plugin global-plugin
Debug by reading the system, not by guessing. Apply when investigating live issues, reviewing hot-path handlers, or writing code for shared environments.
Never ship console.log patches to prod. — Why: redeploying to add instrumentation adds risk and delay; structured logs/metrics/traces should already contain the answer.
Always log requestId, service, level, and relevant domain fields — never free-text strings alone. — Why: structured fields can be indexed, filtered, and used in metric filters without custom parsing.

| Thought | Reality |
|---|---|
| "Just console.log and redeploy" | You are coding blind. Every redeployment adds risk and delay. Proper structured logging should already capture what you need. |
| "Logs are strings, grep works" | For a single request at 3 AM across three services with 10k req/min throughput, grep-by-string is a nightmare. Structured fields with an indexed query are the difference between 5 minutes and 50. |
| "Alarm on p50 latency" | The tail is where users suffer. p50 can look healthy while the slowest 5% of users — often the ones with the most data or the most complex accounts — experience a broken product. |
Bad — free-text with no fields for filtering or correlation:
// BAD: unstructured, no requestId, impossible to aggregate or alert on
console.log(`User ${userId} failed to checkout: ${error.message}`);
Good — structured JSON via Pino with requestId and domain fields:
// GOOD: structured fields enable filtering, alerting, and correlation
import { logger } from './logger'; // Pino instance bound to the request context
logger.error(
  {
    requestId: ctx.requestId,
    service: 'checkout',
    userId,
    orderId,
    errorCode: error.code,
    durationMs: Date.now() - startTime,
  },
  'Checkout failed — payment declined',
);
Bad — no correlation ID; requests cannot be traced across services:
// BAD: downstream service has no way to link this call to the originating request
async function fetchInventory(skuId: string): Promise<number> {
  const res = await fetch(`https://inventory.internal/sku/${skuId}/stock`);
  return res.json();
}
Good — correlation ID from AsyncLocalStorage context forwarded in every outbound call:
// GOOD: x-request-id flows into downstream logs and traces
import { getRequestContext } from './request-context'; // AsyncLocalStorage-backed
async function fetchInventory(skuId: string): Promise<number> {
  const { requestId } = getRequestContext();
  const res = await fetch(`https://inventory.internal/sku/${skuId}/stock`, {
    headers: { 'x-request-id': requestId },
  });
  return res.json();
}
Bad — alarming on median; tail latency goes undetected:
# BAD: p50 alarm misses the users suffering at the tail
AlarmName: CheckoutLatencyHigh
MetricName: checkout.latency.p50
Threshold: 500 # ms
# No runbook link
Good — p99 alarm with explicit runbook link:
# GOOD: p99 catches tail; runbook tells on-call what to do
AlarmName: CheckoutLatencyP99High
MetricName: checkout.latency.p99
Threshold: 1000 # ms
AlarmDescription: |
  p99 checkout latency exceeded 1000 ms.
  Runbook: https://runbooks.internal/checkout-latency-high
For the Pino base config, PII redaction setup, and AsyncLocalStorage-backed request middleware, see references/patterns.md — Structured logging (Pino) setup section.
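A minimal sketch of that base logger, assuming Pino; the redacted paths here are illustrative and the full setup lives in the referenced file:
import pino from 'pino';

// Base logger: JSON lines to stdout; request middleware binds requestId via child()
export const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  redact: {
    paths: ['req.headers.authorization', '*.password', '*.cardNumber'], // illustrative PII paths
    censor: '[REDACTED]',
  },
});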
A correlation ID is a UUID generated at the entry point and carried through every downstream operation (HTTP calls, queue messages, DB query comments, trace spans).
The AsyncLocalStorage pattern avoids threading the ID through every function argument: middleware establishes the context; anywhere downstream calls getRequestContext().
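A minimal sketch of such a request-context module, assuming Express; the storage and getRequestContext exports match the imports used in the examples on this page, while the middleware shape is an assumption:
import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';
import type { RequestHandler } from 'express';
import type { Logger } from 'pino';
import { logger } from './logger';

export interface RequestContext {
  requestId: string;
  log: Logger;
}

export const storage = new AsyncLocalStorage<RequestContext>();

export function getRequestContext(): RequestContext {
  const ctx = storage.getStore();
  if (!ctx) throw new Error('getRequestContext() called outside a request scope');
  return ctx;
}

// Middleware: reuse the inbound x-request-id or mint one, then run the rest of
// the request inside the AsyncLocalStorage scope.
export const requestContextMiddleware: RequestHandler = (req, _res, next) => {
  const requestId = req.header('x-request-id') ?? randomUUID();
  storage.run({ requestId, log: logger.child({ requestId }) }, next);
};
Mount it early (app.use(requestContextMiddleware)) so every handler, log line, and outbound call runs inside the scope.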
Propagating into downstream HTTP fetches:
import { getRequestContext } from './request-context';
export async function callDownstreamService<T>(url: string, init?: RequestInit): Promise<T> {
  const { requestId } = getRequestContext();
  const res = await fetch(url, {
    ...init,
    headers: {
      ...init?.headers,
      'x-request-id': requestId, // standard header; some teams use 'traceparent' (W3C)
      'content-type': 'application/json',
    },
  });
  // UpstreamError: an application-defined error class carrying status and URL
  if (!res.ok) throw new UpstreamError(res.status, url);
  return res.json() as Promise<T>;
}
Propagating into SQS / SNS messages:
import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';
import { getRequestContext } from './request-context';
const sqs = new SQSClient({});
export async function enqueueEvent(queueUrl: string, payload: unknown): Promise<void> {
  const { requestId } = getRequestContext();
  await sqs.send(new SendMessageCommand({
    QueueUrl: queueUrl,
    MessageBody: JSON.stringify(payload),
    MessageAttributes: {
      requestId: {
        DataType: 'String',
        StringValue: requestId, // consumer extracts this and re-establishes context
      },
    },
  }));
}
Re-establishing context in the queue consumer:
import { SQSEvent } from 'aws-lambda';
import { storage } from './request-context';
import { logger } from './logger';
export async function handler(event: SQSEvent): Promise<void> {
  for (const record of event.Records) {
    // Reuse the producer's requestId; mint a fresh one if the attribute is missing
    const requestId = record.messageAttributes?.['requestId']?.stringValue ?? crypto.randomUUID();
    const log = logger.child({ requestId, queue: record.eventSourceARN });
    await storage.run({ requestId, log }, () => processRecord(record));
  }
}
W3C traceparent for OpenTelemetry compatibility. With OTel, propagate the standard traceparent header instead of (or alongside) x-request-id. The HTTP instrumentation library injects it automatically; otherwise call propagation.inject(context, carrier) yourself.
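A minimal sketch of the manual form, assuming @opentelemetry/api with a W3C trace-context propagator registered (tracedFetch is a hypothetical wrapper; with auto-instrumentation no such code is needed):
import { context, propagation } from '@opentelemetry/api';

export async function tracedFetch(url: string, init?: RequestInit): Promise<Response> {
  const headers: Record<string, string> = {};
  propagation.inject(context.active(), headers); // writes 'traceparent' (+ 'tracestate')
  return fetch(url, { ...init, headers: { ...init?.headers, ...headers } });
}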
Instrument every handler with the four golden signals (latency, traffic, errors, saturation). Examples in CloudWatch EMF; concepts are universal.
Emitting structured metrics via CloudWatch EMF:
import { createMetricsLogger, Unit } from 'aws-embedded-metrics';
import type { Request, Response } from 'express';

export async function checkoutHandler(req: Request, res: Response): Promise<void> {
  const metrics = createMetricsLogger();
  metrics.setNamespace('MyApp/Checkout');
  metrics.setDimensions({ service: 'checkout', env: process.env.NODE_ENV ?? 'prod' });
  const start = Date.now();
  try {
    await processCheckout(req.body);
    metrics.putMetric('CheckoutSuccess', 1, Unit.Count);
    res.status(200).json({ ok: true });
  } catch (err) {
    metrics.putMetric('CheckoutError', 1, Unit.Count);
    throw err;
  } finally {
    metrics.putMetric('CheckoutDuration', Date.now() - start, Unit.Milliseconds);
    await metrics.flush(); // writes JSON metric payload to stdout → CloudWatch
  }
}
Alternatively, extract metrics from Pino JSON logs via an AWS::Logs::MetricFilter with FilterPattern: '{ $.level = "error" && $.service = "checkout" }' and dimensions keyed on $.errorCode.
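A sketch of that metric filter in CDK; checkoutLogGroup is an assumed construct, and dimensions on metric filters require a recent aws-cdk-lib:
import { FilterPattern, MetricFilter } from 'aws-cdk-lib/aws-logs';

new MetricFilter(this, 'CheckoutErrorCount', {
  logGroup: checkoutLogGroup, // assumed: the service's log group construct
  metricNamespace: 'MyApp/Checkout',
  metricName: 'CheckoutError',
  filterPattern: FilterPattern.all(
    FilterPattern.stringValue('$.level', '=', 'error'),
    FilterPattern.stringValue('$.service', '=', 'checkout'),
  ),
  metricValue: '1',
  dimensions: { errorCode: '$.errorCode' },
});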
Percentiles over averages. Always publish p50/p95/p99 (CloudWatch PERCENTILE statistic). Averages mask bimodal distributions.
What to measure on every handler at minimum:
| Metric | Unit | Purpose |
|---|---|---|
| `<handler>.duration` | Milliseconds (histogram) | Latency; alert on p99 |
| `<handler>.requests` | Count | Throughput; alert on anomalous drops |
| `<handler>.errors` | Count, with errorCode dimension | Error rate; alert on rate increase |
| `<handler>.saturation` | Gauge (concurrent in-flight) | Back-pressure signal; see the sketch after this table |
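A minimal sketch of the saturation gauge, assuming Express alongside the EMF logger shown above (the 10-second publish interval is illustrative):
import { createMetricsLogger, Unit } from 'aws-embedded-metrics';
import type { RequestHandler } from 'express';

let inFlight = 0;

// Count requests currently being processed
export const saturationMiddleware: RequestHandler = (_req, res, next) => {
  inFlight += 1;
  res.once('finish', () => { inFlight -= 1; });
  next();
};

// Publish the gauge periodically
setInterval(async () => {
  const metrics = createMetricsLogger();
  metrics.setNamespace('MyApp/Checkout');
  metrics.putMetric('checkout.saturation', inFlight, Unit.Count);
  await metrics.flush();
}, 10_000);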
For OpenTelemetry SDK setup, manual span pattern, and trace-context-into-Pino injection, see references/patterns.md — Tracing critical paths section.
Alert on user-visible symptoms; every alarm gets a runbook.
Alarm tiers:
| Tier | Condition | Paging policy |
|---|---|---|
| P1 — Critical | Error rate > 5% for 5 min OR p99 latency > 2×SLO for 5 min | Page immediately, escalate after 15 min |
| P2 — Warning | Error rate > 1% for 10 min OR p99 latency > 1.5×SLO for 10 min | Slack alert, page if not acknowledged in 30 min |
| P3 — Info | Unusual traffic drop, circuit breaker opened | Slack only |
CloudWatch alarm with runbook (CDK):
import { Duration } from 'aws-cdk-lib';
import { Alarm, ComparisonOperator, TreatMissingData } from 'aws-cdk-lib/aws-cloudwatch';
import { SnsAction } from 'aws-cdk-lib/aws-cloudwatch-actions';

// checkoutDurationMetric: the Metric for checkout.duration; oncallTopic: the on-call SNS topic
new Alarm(this, 'CheckoutP99LatencyAlarm', {
  alarmName: 'checkout-p99-latency-high',
  alarmDescription: [
    'p99 checkout latency exceeded 1000 ms for 5 consecutive minutes.',
    'Runbook: https://runbooks.internal/checkout-latency-high',
    'Dashboard: https://cloudwatch.aws.amazon.com/...#dashboards:name=Checkout',
  ].join('\n'),
  metric: checkoutDurationMetric.with({
    statistic: 'p99',
    period: Duration.minutes(1),
  }),
  threshold: 1000,
  evaluationPeriods: 5,
  comparisonOperator: ComparisonOperator.GREATER_THAN_THRESHOLD,
  treatMissingData: TreatMissingData.NOT_BREACHING,
  actionsEnabled: true,
}).addAlarmAction(new SnsAction(oncallTopic));
Runbook sections (required): what fired; user impact; dashboard links; numbered investigation steps (check p50 co-movement, Logs Insights error-by-code, downstream status pages, recent deploys); mitigation options (feature-flag fallback, scale read replicas, rollback); escalation path (team lead after 30 min).
Work these steps in order.
Step 1 — Define the blast radius. How many users, which features, since when? Check the error-rate dashboard for the inflection point.
Step 2 — Read the logs. Query structured logs for the time window and affected service. Use a known-bad requestId if you have one, or filter by level=error and service=<affected>. Look for error codes, cause chains, unexpected fields, and patterns in affected requests (same userId prefix, skuId, downstream endpoint).
# CloudWatch Logs Insights query for the affected window
fields @timestamp, requestId, errorCode, durationMs, @message
| filter service = "checkout" and level = "error"
# time window: set on the query itself (console picker or StartQuery start/end)
| sort @timestamp desc
| limit 50
Step 3 — Read the metrics. Pull p99 latency, error rate, and throughput for the affected handler over the last 2 h. Look for correlation with a deploy (vertical annotation), step-function change (config change), or gradual degradation (resource exhaustion/leak).
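A sketch of pulling the p99 series with the AWS SDK v3; the namespace and metric name match the EMF example above, with dimensions omitted for brevity:
import { CloudWatchClient, GetMetricDataCommand } from '@aws-sdk/client-cloudwatch';

const cw = new CloudWatchClient({});

export async function checkoutP99LastTwoHours(): Promise<number[]> {
  const end = new Date();
  const start = new Date(end.getTime() - 2 * 60 * 60 * 1000); // last 2 h
  const res = await cw.send(new GetMetricDataCommand({
    StartTime: start,
    EndTime: end,
    MetricDataQueries: [{
      Id: 'p99',
      MetricStat: {
        Metric: { Namespace: 'MyApp/Checkout', MetricName: 'CheckoutDuration' },
        Period: 60, // 1-minute resolution
        Stat: 'p99',
      },
    }],
  }));
  return res.MetricDataResults?.[0]?.Values ?? [];
}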
Step 4 — Pull a trace. Take one requestId from step 2 and find its trace. Common culprits: bad DB query plan, downstream timeout, new code path calling an extra external service.
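For that lookup to work, the requestId has to be searchable on the trace. A minimal sketch with the OpenTelemetry API, assuming the SDK is initialized and this is called from the request middleware:
import { trace } from '@opentelemetry/api';

export function tagSpanWithRequestId(requestId: string): void {
  // Attach the correlation ID to the active span so traces are searchable by it
  trace.getActiveSpan()?.setAttribute('requestId', requestId);
}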
Step 5 — Form hypotheses (max three). Ordered by probability. Be specific: "Payment provider /charges returning 503 since 14:32 UTC" is a hypothesis; "something wrong with payments" is not.
Step 6 — Verify, then fix. Test the most likely hypothesis with the minimal action (status page, deploy log, specific metric query). Propose a fix only after confirming.
Step 7 — Post-mortem signal. File the log/metric/trace gap that made diagnosis harder than needed.
REQUIRED BACKGROUND: superpowers:systematic-debugging — this skill attaches inside Phase 1 (Root Cause) of that workflow; it does not replace it.
Related skills: resilience-and-error-handling for when to catch and log errors and how to propagate typed error causes; change-risk-evaluation for which metrics and alarms to watch during and after a deploy; queue-and-retry-safety for dead-letter-queue monitoring patterns.
Produce a markdown report with these sections:
- requestId, service, level, and domain fields