Design monitoring and alerting that catches production issues fast without creating alert fatigue. Use when establishing observability or improving incident response.
From engineering-excellence
npx claudepluginhub sethdford/claude-skills --plugin tech-lead-engineering-excellence
This skill uses the workspace's default tool permissions.
Build monitoring that surfaces real problems without drowning on-call in noise.
You are a senior tech lead designing monitoring for $ARGUMENTS. Poor monitoring means bugs reach customers before engineers know. Alert fatigue means on-call ignores pages. Good monitoring is invisible until needed.
Define SLOs (Service Level Objectives): for example, "99.9% uptime" or "p95 latency < 100ms." SLOs drive monitoring. Alert when the service is at risk of missing an SLO.
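One common way to turn "at risk of missing an SLO" into a number is an error-budget burn rate. The sketch below is only an illustration: it assumes the SLO is measured as a ratio of successful requests, and the function names and the `max_burn_rate` default of 10 are hypothetical choices, not part of this skill.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error budget.

    A value of 1.0 means errors arrive exactly as fast as the SLO allows;
    higher values mean the budget is being spent faster than that.
    """
    if total == 0:
        return 0.0  # no traffic in the window, nothing to judge
    return (failed / total) / error_budget(slo_target)


def slo_at_risk(failed: int, total: int, slo_target: float,
                max_burn_rate: float = 10.0) -> bool:
    """True when the window's burn rate exceeds the (assumed) alerting limit."""
    return burn_rate(failed, total, slo_target) > max_burn_rate
```

With a 99.9% target, 50 failures in 1,000 requests gives a burn rate of roughly 50, which would page; 1 failure in 10,000 gives roughly 0.1, which would not.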
Choose metrics: Request latency (p50, p95, p99), error rate (by type), throughput (requests/second), queue depth (if applicable). 5-10 key metrics per service.
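As a rough illustration of these metrics, the following sketch derives them from a batch of request records covering one time window. The `Request` record and the `summarize` function are hypothetical; a real service would normally get these values from its metrics system rather than compute them by hand.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import quantiles
from typing import Optional


@dataclass
class Request:
    latency_ms: float
    error_type: Optional[str] = None  # None means the request succeeded


def summarize(requests: list, window_seconds: float) -> dict:
    """Compute latency percentiles, error rate by type, and throughput."""
    if len(requests) < 2:
        raise ValueError("need at least two requests to compute percentiles")
    latencies = [r.latency_ms for r in requests]
    # n=100 yields 99 cut points; index i holds the (i + 1)th percentile.
    cuts = quantiles(latencies, n=100, method="inclusive")
    errors = Counter(r.error_type for r in requests if r.error_type is not None)
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "error_rate": sum(errors.values()) / len(requests),
        "errors_by_type": dict(errors),
        "throughput_rps": len(requests) / window_seconds,
    }
```

Queue depth is omitted because it is a gauge read directly from the queue, not something derived from request records.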
Set alert thresholds carefully: Use historical data. "Error rate usually 0.1%, spike to 0.3% is normal variance. Alert if > 1%." Threshold = normal_level + 3×stddev.
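The threshold rule above can be sketched in a few lines. This is a minimal example assuming `history` is a list of past readings of one metric (for instance, hourly error rates); the function names are illustrative, and the sample standard deviation is one reasonable choice among several.

```python
from statistics import mean, stdev


def alert_threshold(history: list, k: float = 3.0) -> float:
    """Return normal_level + k * stddev, using the mean as the normal level."""
    if len(history) < 2:
        raise ValueError("need at least two historical readings")
    return mean(history) + k * stdev(history)


def should_alert(current: float, history: list, k: float = 3.0) -> bool:
    """True when the current reading exceeds the historical threshold."""
    return current > alert_threshold(history, k)
```

Error rates are rarely normally distributed, so it is worth checking the computed threshold against known past incidents before relying on it.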
Alert on trends, not absolutes: "Error rate jumped from 0.1% to 2% in 5 minutes" is actionable. "Error rate is 0.5%" on its own is not, because it may simply be the service's normal level. Alert on change relative to baseline, not on a bare absolute value.
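A change-based check might look like the sketch below. It assumes the caller supplies the error rate for a baseline window and for the most recent window, both as fractions (0.001 means 0.1%). The name `rate_jumped` and the defaults `min_ratio=5.0` and `min_delta=0.005` are hypothetical tuning values, not recommendations from this skill.

```python
def rate_jumped(baseline_rate: float, current_rate: float,
                min_ratio: float = 5.0, min_delta: float = 0.005) -> bool:
    """True when the rate rose sharply relative to its own baseline.

    Both a relative jump (min_ratio) and an absolute jump (min_delta) are
    required, so tiny baselines do not trigger on trivial increases.
    """
    delta = current_rate - baseline_rate
    if delta < min_delta:
        return False  # flat, falling, or only slightly higher
    if baseline_rate == 0:
        return True  # any sizeable rise from zero counts as a jump
    return current_rate / baseline_rate >= min_ratio
```

Under these assumed defaults, a move from 0.1% to 2% fires, while a steady 0.5% does not.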
Invest in runbooks: When an alert fires, on-call should have a one-pager that answers three questions: what does this alert mean, what do you do about it, and how do you escalate? Runbooks enable fast resolution.