From pm-engineering
Creates a complete monitoring setup guide covering golden signals, alerts, dashboards, logs, and tracing. Use when asked to set up monitoring or define alerting strategy.
Install: npx claudepluginhub mohitagw15856/pm-claude-skills --plugin pm-engineering
Slash command: /pm-engineering:monitoring-setup-guide
Produce a complete monitoring setup guide for a service — defining exactly what to measure, how to structure logs, how to configure alerts with actionable thresholds, and how to build dashboards that answer real operational questions. A good monitoring guide eliminates "we don't know what's happening in production" as a root cause category, and gives on-call engineers a single source of truth for what healthy looks like.
Ask for these if not already provided:
Team: [Team name] | Tech lead: [Name]
Stack: [Language/Framework] on [Infrastructure]
Monitoring platform: [Datadog / Prometheus+Grafana / CloudWatch / etc.]
Date: [Date] | Review cycle: Quarterly
Good monitoring answers three questions:
This guide defines the answers for [Service Name]. Every alert must be actionable — if an on-call engineer cannot take a specific action in response to the alert, the alert should not exist.
Key user journeys monitored:
Apply the four golden signals specifically to [Service Name]:
Latency measures how long requests take to complete. Track it separately for successful and failed requests: if you measure only aggregate latency, errors that fail fast drag the numbers down and can hide slow requests, including slow failures.
| Metric | Description | Source | Dimensions |
|---|---|---|---|
| `[service].request.duration_ms` | End-to-end request latency | Application instrumentation | endpoint, method, status_code |
| `[service].db.query_duration_ms` | Database query latency | ORM / query instrumentation | query_name, table |
| `[service].external.request_duration_ms` | Outbound call latency to dependencies | HTTP client instrumentation | target_service, endpoint |
| `[service].queue.processing_duration_ms` | Time to process one message (if applicable) | Consumer instrumentation | queue_name, message_type |
Latency SLO targets:
| Endpoint / operation | p50 target | p95 target | p99 target |
|---|---|---|---|
| `GET /api/v1/[resource]` | < [50] ms | < [200] ms | < [500] ms |
| `POST /api/v1/[resource]` | < [100] ms | < [400] ms | < [1000] ms |
| `GET /health` | < [10] ms | < [20] ms | < [50] ms |
| [Background job name] | < [5] sec | < [15] sec | < [60] sec |
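As an illustration of checking the targets above separately for successful and failed requests, here is a minimal sketch using only the standard library. The function names, the nearest-rank percentile method, and the choice to count only 5xx responses as failures are assumptions for this example, not part of any monitoring platform.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(requests, targets_ms):
    """Compare latency percentiles with targets, split by outcome.

    requests: iterable of (duration_ms, status_code) tuples.
    targets_ms: percentile -> target in ms, e.g. {50: 50, 95: 200, 99: 500}.
    """
    groups = {"ok": [], "failed": []}
    for duration_ms, status_code in requests:
        # Assumption: only 5xx responses count as failed requests.
        key = "failed" if status_code >= 500 else "ok"
        groups[key].append(duration_ms)

    report = {}
    for outcome, durations in groups.items():
        if not durations:
            continue
        report[outcome] = {}
        for pct, target in targets_ms.items():
            value = percentile(durations, pct)
            report[outcome][pct] = {"value_ms": value, "within_target": value < target}
    return report
```

In practice the monitoring platform computes percentiles from histograms; the point of the sketch is that the two outcome groups are evaluated independently.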
Traffic measures demand on the system. Use it to detect unexpected spikes and traffic drops (which can indicate upstream failures), and to plan capacity.
| Metric | Description | Source |
|---|---|---|
| `[service].request.count` | Requests per second | Application / load balancer |
| `[service].request.count_by_endpoint` | RPS broken down by endpoint | Application |
| `[service].queue.messages_consumed_per_second` | Consumer throughput | Queue consumer |
| `[service].queue.depth` | Messages waiting in queue | Queue metrics |
Traffic baselines (update after observing production for 2+ weeks):
| Time period | Expected RPS | Low-traffic floor | Spike ceiling |
|---|---|---|---|
| Peak (weekday business hours) | [N] RPS | [N × 0.5] RPS | [N × 5] RPS |
| Off-peak (nights/weekends) | [N × 0.2] RPS | [N × 0.05] RPS | [N] RPS |
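The baseline table can be turned into a simple classifier. This is a sketch only: `BASELINES` and `classify_traffic` are hypothetical names, and the multipliers are copied from the placeholder table above, so they should be replaced once real production baselines are known.

```python
# Multipliers of the peak-hour RPS (N), taken from the baseline table.
BASELINES = {
    "peak": {"floor": 0.5, "ceiling": 5.0},
    "off_peak": {"floor": 0.05, "ceiling": 1.0},
}


def classify_traffic(current_rps, peak_rps, period):
    """Return "drop", "spike", or "normal" for the given time period."""
    limits = BASELINES[period]
    if current_rps < peak_rps * limits["floor"]:
        return "drop"
    if current_rps > peak_rps * limits["ceiling"]:
        return "spike"
    return "normal"
```

For example, with a peak baseline of 200 RPS, `classify_traffic(90, 200, "peak")` falls below the 100 RPS floor and is reported as a drop.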
Errors measure the fraction of requests that fail. Distinguish between client errors (4xx — caller is doing something wrong) and server errors (5xx — the service is broken).
| Metric | Description | Alert on? |
|---|---|---|
| `[service].request.error_rate` | 5xx errors / total requests | Yes (see alert rules) |
| `[service].request.client_error_rate` | 4xx errors / total requests | Threshold alert: a sudden spike may indicate API misuse |
| `[service].dependency.error_rate` | Errors calling downstream dependencies | Yes (dependency health signal) |
| `[service].queue.dlq_depth` | Messages in dead-letter queue | Yes (indicates processing failures) |
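The split between client and server errors can be sketched as follows. `error_rates` is a hypothetical helper that assumes you have a list of HTTP status codes for the window being measured.

```python
from collections import Counter


def error_rates(status_codes):
    """Return 5xx and 4xx rates as fractions of all requests in the window."""
    classes = Counter(code // 100 for code in status_codes)
    total = sum(classes.values())
    if total == 0:
        # No traffic: report zero rather than dividing by zero.
        return {"server_error_rate": 0.0, "client_error_rate": 0.0}
    return {
        "server_error_rate": classes[5] / total,
        "client_error_rate": classes[4] / total,
    }
```

Keeping the two rates separate means a burst of 404s from a misbehaving client does not trip the alert intended for a broken service.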
Saturation measures how "full" the service is: how close its constrained resources are to maximum capacity.
| Resource | Metric | Alert threshold | Source |
|---|---|---|---|
| CPU | [service].cpu.utilisation_pct | >80% sustained 5 min | Container / VM metrics |
| Memory | [service].memory.utilisation_pct | >85% sustained 5 min | Container / VM metrics |
| DB connections | [service].db.connection_pool.utilisation_pct | >75% | Application / DB metrics |
| Thread pool / goroutines | [service].runtime.goroutine_count / thread_count | >N (establish baseline) | Runtime metrics |
| Disk (if applicable) | [service].disk.utilisation_pct | >75% | Infrastructure |
| Queue depth (if applicable) | [service].queue.depth | >[backlog threshold] | Queue metrics |
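Thresholds such as ">80% sustained 5 min" differ from a single-sample check: every sample in the window must exceed the threshold. A minimal sketch, assuming samples arrive as `(timestamp_seconds, value)` pairs in ascending time order (`breached_sustained` is a hypothetical name):

```python
def breached_sustained(samples, threshold, window_seconds):
    """True if every sample in the trailing window exceeds the threshold.

    samples: list of (timestamp_seconds, value), ascending by timestamp.
    """
    if not samples:
        return False
    window_start = samples[-1][0] - window_seconds
    # Without data reaching back to the window start, "sustained" cannot be claimed.
    if samples[0][0] > window_start:
        return False
    recent = [value for ts, value in samples if ts >= window_start]
    return all(value > threshold for value in recent)
```

A single dip below the threshold resets the condition, which is what keeps short spikes from paging anyone.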
Beyond the golden signals, track metrics that measure whether the service is delivering business value. These matter for SLO reporting and product dashboards.
| Metric | Description | Source | Alert? |
|---|---|---|---|
| `[service].[primary_action].success_rate` | [e.g. "Payment success rate"] | Application | Yes, if it drops >5% vs 1h average |
| `[service].[primary_action].count` | [e.g. "Payments processed per minute"] | Application | Yes, on a sudden drop (traffic anomaly) |
| `[service].[resource].created_per_hour` | [e.g. "New accounts created"] | Application / DB | No (informational) |
| `[service].cache.hit_rate` | Fraction of requests served from cache | Cache instrumentation | Yes, if it drops below [60]% |
| `[service].job.[name].success_rate` | [Background job success rate] | Job framework | Yes, if it drops below [99]% |
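The "drops >5% vs 1h average" condition could be evaluated as below. This sketch assumes the 5% is a relative drop against the mean of the last hour's samples; if your team means percentage points instead, compare `baseline - current` directly. `dropped_vs_baseline` is a hypothetical name.

```python
from statistics import mean


def dropped_vs_baseline(current, history, max_drop=0.05):
    """True if `current` is more than `max_drop` (relative) below the mean of `history`."""
    if not history:
        return False
    baseline = mean(history)
    if baseline <= 0:
        return False
    return (baseline - current) / baseline > max_drop
```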
All logs must be structured JSON. Do not emit unstructured text logs in production. Every log line must include the mandatory fields.
Mandatory fields (every log line):
{
  "timestamp": "2024-01-15T10:23:45.123Z",
  "level": "info",
  "service": "[service-name]",
  "version": "[git-sha-short]",
  "trace_id": "[uuid-from-request-context]",
  "span_id": "[span-uuid]",
  "request_id": "[uuid-per-request]",
  "message": "[human readable description]"
}
Request log (emit for every HTTP request):
{
  "timestamp": "...",
  "level": "info",
  "service": "[service-name]",
  "event": "http_request",
  "method": "POST",
  "path": "/api/v1/[resource]",
  "status_code": 201,
  "duration_ms": 45,
  "user_id": "[uuid; DO NOT log PII directly]",
  "request_id": "[uuid]",
  "trace_id": "[uuid]"
}
Error log (emit for every error with context):
{
  "timestamp": "...",
  "level": "error",
  "service": "[service-name]",
  "event": "error",
  "error_code": "[application-error-code]",
  "error_message": "[description; no sensitive data]",
  "stack_trace": "[stack trace]",
  "request_id": "[uuid]",
  "trace_id": "[uuid]",
  "context": {
    "[key]": "[relevant context without PII]"
  }
}
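One way to emit the mandatory fields from Python's standard `logging` module is a custom formatter. This is a sketch under assumptions: `JsonFormatter` and `build_logger` are hypothetical names, and the trace, span, and request IDs are expected to be supplied through the `extra` argument of each log call.

```python
import json
import logging
from datetime import datetime, timezone

# The guide uses "warn"; Python's logging module calls the level "WARNING".
LEVEL_NAMES = {"WARNING": "warn"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object containing the mandatory fields."""

    def __init__(self, service, version):
        super().__init__()
        self.service = service
        self.version = version

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": LEVEL_NAMES.get(record.levelname, record.levelname.lower()),
            "service": self.service,
            "version": self.version,
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "request_id": getattr(record, "request_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def build_logger(service, version, stream=None):
    """Return a logger that writes structured JSON lines (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service, version))
    logger = logging.getLogger(service)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Usage: `build_logger("checkout", "abc1234").info("payment processed", extra={"trace_id": "...", "span_id": "...", "request_id": "..."})`.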
| Level | Use when | Example |
|---|---|---|
| `error` | Something failed and requires attention; this should eventually page on-call | Database query failed, external API returned 5xx, required config missing |
| `warn` | Something unexpected happened but the service is still functioning | Retry succeeded after failure, cache miss on expected hit, rate limit approaching |
| `info` | Significant business events and request lifecycle | Request received, payment processed, user authenticated, job started/completed |
| `debug` | Detailed diagnostic information; off in production by default | Query parameters, intermediate computation results, cache key lookups |
Never log:
GET /health from access logs)
Distributed tracing is mandatory for any service that calls other services. It enables root-cause analysis across service boundaries.
[ ] Tracing library installed:
    - Go: go.opentelemetry.io/otel
    - Python: opentelemetry-sdk, opentelemetry-instrumentation
    - Node: @opentelemetry/sdk-node
    - Java: opentelemetry-java-instrumentation
[ ] Tracer initialized at service startup with service name and version
[ ] Trace context propagated via W3C Trace Context headers:
    traceparent: 00-[trace-id]-[span-id]-01
    tracestate: [optional vendor-specific]
[ ] Automatic instrumentation enabled for:
    [ ] Inbound HTTP/gRPC requests (creates root span)
    [ ] Outbound HTTP/gRPC calls (creates child spans)
    [ ] Database queries (creates child spans with sanitized query)
    [ ] Cache operations (Redis, Memcached)
    [ ] Message queue produce/consume
[ ] Custom spans added for:
    [ ] Key business operations ([e.g. payment processing, user lookup])
    [ ] Background jobs (each job execution = root span)
    [ ] Third-party API calls with custom attributes
[ ] Span attributes to capture on all spans:
    - user.id (if authenticated; no PII)
    - deployment.environment (production/staging)
    - service.version (git SHA)
    - [service-specific key attributes]
[ ] Trace exporter configured to: [Datadog / Jaeger / Tempo / OTLP endpoint]
[ ] Sampling rate configured:
    - Production: [1–10]% of requests (adjust based on volume and cost)
    - Always sample: errors, slow requests (>p99 threshold), and 100% of [critical endpoint]
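To make the `traceparent` item in the checklist concrete, here is a small sketch of parsing the header and building the value to send on an outbound call. In a real service the OpenTelemetry propagators handle this; the helper names are hypothetical, and the parser accepts only the four-field layout shown above.

```python
import re
import secrets

# version - trace-id (32 hex) - parent span id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header):
    """Return trace_id, parent_id and sampled flag, or None if the header is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero ids are invalid in W3C Trace Context.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(parent):
    """Build the header for an outbound call: same trace id, new span id."""
    span_id = secrets.token_hex(8)  # 16 hex characters
    flags = "01" if parent["sampled"] else "00"
    return f"00-{parent['trace_id']}-{span_id}-{flags}"
```

The trace id stays constant across every hop, which is what lets a tracing backend stitch spans from different services into one trace.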
# Python: OpenTelemetry example
# PaymentError and _do_process are application-defined placeholders.
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("[service-name]")

def process_payment(payment_data):
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("payment.amount_cents", payment_data["amount"])
        span.set_attribute("payment.currency", payment_data["currency"])
        # Never: span.set_attribute("payment.card_number", ...)
        try:
            result = _do_process(payment_data)
            span.set_status(Status(StatusCode.OK))
            return result
        except PaymentError as e:
            # Mark the span as failed and attach the exception before re-raising.
            span.set_status(Status(StatusCode.ERROR, str(e)))
            span.record_exception(e)
            raise
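The sampling rule from the checklist (a percentage of normal traffic, plus every error, slow request, and critical endpoint) can be sketched as a single decision function. This assumes the decision is made after the request finishes, when status and duration are known; it is not OpenTelemetry's sampler API, and `keep_trace` and its parameters are hypothetical.

```python
def keep_trace(trace_id, is_error, duration_ms, endpoint, *,
               sample_rate=0.05, slow_threshold_ms=500.0,
               always_sample_endpoints=frozenset()):
    """Decide whether to keep a finished trace.

    trace_id: 32-character hex string, as carried in the traceparent header.
    """
    # Errors, slow requests and critical endpoints are always kept.
    if is_error or duration_ms > slow_threshold_ms or endpoint in always_sample_endpoints:
        return True
    # Derive a value in [0, 1] from the trace id so that every service
    # handling the same trace reaches the same decision.
    bucket = int(trace_id[-16:], 16) / 2**64
    return bucket < sample_rate
```

Deriving the decision from the trace id rather than a random number avoids traces where only some services kept their spans.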
Every alert must have: a name, a condition, a threshold, a severity, and a clear on-call action. Alerts without a clear action should not exist.
| Alert name | Condition | Threshold | Severity | On-call action |
|---|---|---|---|---|
| `[Service]HighErrorRate` | 5xx error rate, 5-min rolling window | >1% for 2 consecutive windows | P1 | Check recent deploys; inspect error logs; see runbook [link] |
| `[Service]CriticalErrorRate` | 5xx error rate, 2-min rolling window | >5% | P1 (immediate) | Same as above; page immediately, do not wait |
| `[Service]HighP99Latency` | p99 latency on key endpoints | >2× SLO target for 3 min | P2 | Check DB latency, cache hit rate, and downstream dependencies |
| `[Service]LatencySLOBreach` | p99 latency | >SLO target for 5 consecutive minutes | P1 | SLO burn: page on-call, escalate if not resolved in 20 min |
| `[Service]HighCPU` | CPU utilisation | >80% sustained for 5 min | P2 | Check for traffic spike; scale up if needed; check for runaway processes |
| `[Service]HighMemory` | Memory utilisation | >85% sustained for 5 min | P2 | Check for memory leak (especially after deploys); restart pod if OOM imminent |
| `[Service]DBConnectionPoolHigh` | DB connection pool utilisation | >75% | P2 | Check for long-running queries; consider scaling service or increasing pool size |
| `[Service]DLQDepthHigh` | Dead-letter queue depth | >10 messages | P2 | Inspect DLQ messages for error pattern; fix bug and replay if safe |
| `[Service]TrafficDropAnomaly` | RPS, compared to same hour yesterday | >50% drop sustained 5 min | P1 | Upstream may be down; check caller health; check load balancer |
| `[Service]PrimaryActionSuccessRateDrop` | [Business metric success rate] | <[95]% over 10 min | P1 | [Service-specific action, e.g. "Check payment provider status"] |
| `[Service]DownstreamDependencyErrors` | Error rate calling [dependency] | >5% over 5 min | P2 | Check [dependency] status page; enable fallback if available |
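The rule that every alert needs a name, condition, threshold, severity, and on-call action can be enforced mechanically, for example in a CI check over your alert definitions. A minimal sketch; `AlertRule`, `validate`, and the accepted severity set are assumptions for illustration.

```python
from dataclasses import dataclass

# Assumption: severities follow the P1/P2/P3 scheme used in this guide.
SEVERITIES = {"P1", "P2", "P3"}


@dataclass(frozen=True)
class AlertRule:
    name: str
    condition: str
    threshold: str
    severity: str
    action: str


def validate(rule):
    """Return a list of problems; an empty list means the rule is complete."""
    problems = []
    for field in ("name", "condition", "threshold", "action"):
        if not getattr(rule, field).strip():
            problems.append(f"{field} is empty")
    if rule.severity not in SEVERITIES:
        problems.append(f"unknown severity {rule.severity!r}")
    return problems
```

A rule with a blank action fails validation, which matches the guide's position that an alert without a clear action should not exist.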
# Prometheus / Grafana alerting rules (adapt for your platform)
groups:
  - name: [service-name]-alerts
    rules:
      - alert: [Service]HighErrorRate
        expr: |
          (
            sum(rate([service]_http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate([service]_http_requests_total[5m]))
          ) > 0.01
        for: 2m
        labels:
          severity: critical
          team: [team-name]
        annotations:
          summary: "High error rate on [Service Name]"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 1%)"
          runbook_url: "[runbook link]"
      - alert: [Service]HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum(rate([service]_http_request_duration_seconds_bucket[5m])) by (le, endpoint)
          ) > [0.5]
        for: 3m
        labels:
          severity: warning
          team: [team-name]
        annotations:
          summary: "p99 latency elevated on [Service Name]"
          description: "p99 latency on {{ $labels.endpoint }} is {{ $value | humanizeDuration }}"
          runbook_url: "[runbook link]"
# Datadog monitor configuration (Python SDK shown; Terraform is an alternative)
from datadog import initialize, api

# Placeholders: load real keys from the environment or a secret store, not from source code.
initialize(api_key="[key]", app_key="[key]")

api.Monitor.create(
    type="metric alert",
    # Plain string, not an f-string: the braces are Datadog scope syntax.
    query="sum(last_5m):sum:[service].http.errors{service:[service-name]} / sum:[service].http.requests{service:[service-name]} > 0.01",
    name="[Service] High Error Rate",
    message="Error rate exceeded 1%. @pagerduty-[service-oncall]\n\nRunbook: [link]",
    tags=["service:[service-name]", "team:[team-name]"],
    options={
        "thresholds": {"critical": 0.01, "warning": 0.005},
        "notify_no_data": False,
        "evaluation_delay": 60,
    },
)
The primary service dashboard must answer "is the service healthy right now?" at a glance. Use this layout:
┌─────────────────────────────────────────────────────────────────────┐
│ [SERVICE NAME] — Service Health Dashboard [Time range ▼] │
├───────────────┬───────────────┬───────────────┬─────────────────────┤
│ Error rate │ p99 Latency │ RPS (current)│ SLO budget remaining│
│ [BIG NUMBER] │ [BIG NUMBER] │ [BIG NUMBER] │ [BIG NUMBER / days] │
│ vs SLO: 0.1% │ vs SLO: 500ms│ vs avg: [N] │ [Error budget gauge]│
├───────────────┴───────────────┴───────────────┴─────────────────────┤
│ Error rate over time (24h) │
│ [Time series: 5xx rate line, SLO threshold line] │
├─────────────────────────────────┬───────────────────────────────────┤
│ Latency percentiles over time │ Request throughput over time │
│ [Lines: p50, p95, p99, p999] │ [Bars: RPS by endpoint] │
│ [SLO threshold horizontal line]│ │
├─────────────────────────────────┴───────────────────────────────────┤
│ Latency heatmap (all requests — shows distribution shape) │
├─────────────────────────────────┬───────────────────────────────────┤
│ CPU utilisation over time │ Memory utilisation over time │
│ [All instances/pods — lines] │ [All instances/pods — lines] │
│ [Alert threshold: 80%] │ [Alert threshold: 85%] │
├─────────────────────────────────┴───────────────────────────────────┤
│ DB: connection pool utilisation│ DB: query latency (p99 per query)│
├─────────────────────────────────┴───────────────────────────────────┤
│ [Business metric 1 over time] │ [Business metric 2 over time] │
│ e.g. Payment success rate │ e.g. Orders created/min │
└─────────────────────────────────┴───────────────────────────────────┘
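The "SLO budget remaining" tile in the layout above can be derived from request counts. A minimal sketch, assuming a request-based availability SLO measured over a fixed window; `error_budget_remaining` is a hypothetical name.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent in the SLO window.

    slo_target: e.g. 0.999 for a 99.9% SLO (0.1% of requests may fail).
    Returns 1.0 when nothing is spent, 0.0 when the budget is exhausted,
    and a negative number when the SLO has been breached.
    """
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0 if failed_requests else 1.0
    return 1 - failed_requests / allowed_failures
```

With a 99.9% target and 1,000,000 requests in the window, roughly 1,000 failures are allowed, so 250 failures leaves about 75% of the budget.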
Second dashboard — Dependency Health:
┌─────────────────────────────────────────────────────────────────────┐
│ [SERVICE NAME] — Dependency Health │
├─────────────────────────────────────────────────────────────────────┤
│ For each dependency: error rate | latency | current status │
│ [Database] [N]% errors | [N]ms p99 | ● Healthy / ⚠ Degraded │
│ [Redis] [N]% errors | [N]ms p99 | ● Healthy │
│ [External API][N]% errors | [N]ms p99 | ● Healthy │
├─────────────────────────────────────────────────────────────────────┤
│ Outbound call latency over time (one line per dependency) │
├─────────────────────────────────────────────────────────────────────┤
│ Circuit breaker / fallback state (if implemented) │
└─────────────────────────────────────────────────────────────────────┘
An honest assessment of what is missing today and the priority for adding it:
| Gap | Impact | Priority | Effort | Owner | Target date |
|---|---|---|---|---|---|
| [e.g. No distributed tracing — can't see cross-service latency] | High — blind to dependency issues | P1 | [2 days] | [Name] | [Date] |
| [e.g. No business metric alerts — only infra alerts] | High — silent business failures | P1 | [1 day] | [Name] | [Date] |
| [e.g. Logs are unstructured text — not searchable] | Medium — slow incident investigation | P2 | [3 days] | [Name] | [Date] |
| [e.g. No dead-letter queue monitoring] | Medium — failed messages go unnoticed | P2 | [4 hours] | [Name] | [Date] |
| [e.g. Alert thresholds not calibrated to production baseline] | Medium — alert fatigue or missed alerts | P2 | [1 day] | [Name] | [Date] |
| [e.g. No latency heatmap — outliers invisible in averages] | Low — harder to spot tail latency issues | P3 | [2 hours] | [Name] | [Date] |
Total observability debt: [N] items | Estimated effort: [N days]