Design monitoring and alerting that catches production issues fast without creating alert fatigue. Use when establishing observability or improving incident response.
npx claudepluginhub sethdford/claude-skills --plugin tech-lead-engineering-excellence
How this skill is triggered: by the user, by Claude, or both
Slash command
/engineering-excellence:monitoring-strategy
The summary Claude sees in its skill listing, used to decide when to auto-load this skill
Build monitoring that surfaces real problems without drowning on-call in noise.
Designs production-grade monitoring, logging, and tracing systems with SLI/SLO management, alerting, and incident response workflows.
Designs monitoring systems: SLOs, uptime checks, error tracking, alert routing, on-call rotations. Use when setting up or fixing monitoring, alert fatigue, or incident gaps.
Creates a complete monitoring setup guide covering golden signals, alerts, dashboards, logs, and tracing. Use when asked to set up monitoring or define alerting strategy.
You are a senior tech lead designing monitoring for $ARGUMENTS. Poor monitoring means bugs reach customers before engineers know. Alert fatigue means on-call ignores pages. Good monitoring is invisible until needed.
Define SLOs (Service Level Objectives), for example "99.9% uptime" or "p95 latency < 100ms." SLOs drive monitoring: alert when the service is at risk of missing one.
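One way to make "at risk of missing an SLO" concrete is to track how much of the error budget is left. The sketch below is an illustration only: the function names and the 25% default cutoff are assumptions, not part of the skill itself.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent).

    slo_target is a success ratio such as 0.999 for a "99.9%" SLO.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


def slo_at_risk(slo_target: float, total_requests: int, failed_requests: int,
                min_budget_remaining: float = 0.25) -> bool:
    """Flag the SLO as at risk once less than min_budget_remaining is left.

    The 0.25 default is a hypothetical cutoff chosen for illustration.
    """
    remaining = error_budget_remaining(slo_target, total_requests, failed_requests)
    return remaining < min_budget_remaining
```

With a 99.9% target and one million requests in the window, the budget is roughly 1,000 failed requests, so 900 failures would flag the SLO as at risk under the assumed cutoff while 500 would not.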
Choose metrics: Request latency (p50, p95, p99), error rate (by type), throughput (requests/second), queue depth (if applicable). Aim for 5-10 key metrics per service.
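As a rough illustration of the metrics listed above, the following sketch derives latency percentiles, error rate, and throughput from one window of raw samples. The function name, argument names, and dictionary keys are hypothetical; a real service would normally get these numbers from its metrics backend rather than computing them by hand.

```python
import statistics


def summarize_requests(latencies_ms, error_count, window_seconds):
    """Summarize one observation window of request data.

    latencies_ms: one latency sample per request in the window.
    error_count: number of those requests that failed.
    window_seconds: length of the window, used for throughput.
    """
    if len(latencies_ms) < 2:
        raise ValueError("need at least two latency samples")
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    # n=100 yields 99 cut points; index k-1 approximates the k-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100)
    total = len(latencies_ms)
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "error_rate": error_count / total,
        "throughput_rps": total / window_seconds,
    }
```

Breaking the error rate down by type, as the step suggests, would mean calling something like this once per error class; that is left out here to keep the sketch short.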
Set alert thresholds carefully: Use historical data. For example: "Error rate is usually 0.1%, and a spike to 0.3% is normal variance, so alert if it exceeds 1%." As a starting point, threshold = normal_level + 3×stddev.
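The threshold rule above (normal level plus three standard deviations) can be sketched as follows. This assumes the normal level is the mean of the historical samples and uses the sample standard deviation; both choices, and the function names, are assumptions made for illustration.

```python
import statistics


def alert_threshold(history, num_stddevs=3.0):
    """Compute threshold = normal_level + num_stddevs * stddev from history.

    history: past values of the metric (for example, error rate in percent).
    The normal level is taken to be the mean, which is an assumption.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    normal_level = statistics.mean(history)
    stddev = statistics.stdev(history)  # sample standard deviation
    return normal_level + num_stddevs * stddev


def should_alert(current_value, history):
    """Return True when the current value exceeds the derived threshold."""
    return current_value > alert_threshold(history)
```

A history that already contains occasional spikes produces a wider threshold, which is the point: values inside normal variance should not page anyone.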
Alert on trends, not absolutes: "Error rate jumped from 0.1% to 2% in 5 minutes" is actionable. "Error rate is 0.5%" is not, because that level can be normal.
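A minimal sketch of a change-based check, in the spirit of the "jumped from 0.1% to 2% in 5 minutes" example: it compares the latest sample with the oldest sample inside the window. The function name, the 5x ratio, and the 0.5 point minimum increase are hypothetical defaults, and the samples are assumed to be sorted by timestamp.

```python
def error_rate_jumped(samples, window_seconds=300, min_ratio=5.0, min_increase=0.5):
    """Detect a sharp rise in error rate within the trailing window.

    samples: list of (timestamp_seconds, error_rate_percent), oldest first.
    min_ratio and min_increase are illustrative defaults, not recommendations.
    """
    if not samples:
        return False
    now_ts, current = samples[-1]
    # Keep only samples that fall inside the trailing window.
    in_window = [(ts, rate) for ts, rate in samples if now_ts - ts <= window_seconds]
    baseline = in_window[0][1]
    increase = current - baseline
    # Require a meaningful absolute rise so tiny baselines do not trigger alerts.
    if increase < min_increase:
        return False
    if baseline == 0:
        return True
    return current / baseline >= min_ratio
```

Requiring both a relative and an absolute increase is a design choice: the ratio alone would fire on a move from 0.01% to 0.05%, which is rarely worth a page.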
Invest in runbooks: When an alert fires, on-call should have a one-pager that answers three questions: what does this alert mean, what do you do about it, and how do you escalate? Runbooks enable fast resolution.