Monitoring & Observability Skill
Purpose
Standards for monitoring, metrics, alerting, and observability.
Auto-Invoke Triggers
- Setting up monitoring infrastructure
- Defining metrics and KPIs
- Configuring alerts
- Implementing distributed tracing
Three Pillars of Observability
| Pillar | Purpose | Tools |
|---|---|---|
| Logs | Event records | ELK, Loki, CloudWatch |
| Metrics | Numerical measurements | Prometheus, Datadog |
| Traces | Request flow | Jaeger, Zipkin, X-Ray |
Key Metrics (Golden Signals)
The Four Golden Signals
| Signal | Description | Example Metric |
|---|---|---|
| Latency | Response time | p50, p95, p99 latency |
| Traffic | Request volume | Requests per second |
| Errors | Failure rate | Error percentage |
| Saturation | Resource usage | CPU, memory utilization |
RED Method (Request-focused)
- Rate - Requests per second
- Errors - Failed requests per second
- Duration - Request latency
USE Method (Resource-focused)
- Utilization - Resource % used
- Saturation - Queue depth
- Errors - Error count
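The RED signals above can be derived from raw request records. The following is a minimal sketch, assuming each request is recorded with a duration and a success flag; the `Request` type, the `red_summary` helper, and the nearest-rank percentile method are illustrative choices, not part of any particular monitoring tool.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # request latency in seconds
    ok: bool           # False if the request failed


def percentile(sorted_values, p):
    """Nearest-rank percentile for p in (0, 100] over a pre-sorted list."""
    rank = math.ceil(p * len(sorted_values) / 100)
    return sorted_values[rank - 1]


def red_summary(requests, window_s):
    """Rate, Errors, and Duration for the requests seen in one time window."""
    if not requests:
        raise ValueError("no requests in window")
    durations = sorted(r.duration_s for r in requests)
    failed = sum(1 for r in requests if not r.ok)
    return {
        "rate_rps": len(requests) / window_s,       # Rate
        "errors_rps": failed / window_s,            # Errors
        "error_pct": 100 * failed / len(requests),  # golden-signal error percentage
        "p50_s": percentile(durations, 50),         # Duration
        "p99_s": percentile(durations, 99),
    }
```

In production these values usually come from the metrics backend rather than from in-process lists; the sketch only shows the arithmetic behind each signal.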
Application Metrics
HTTP Endpoints
| Metric | Type | Description |
|---|---|---|
| http_requests_total | Counter | Total requests |
| http_request_duration_seconds | Histogram | Request latency |
| http_requests_in_flight | Gauge | Active requests |
| http_response_size_bytes | Histogram | Response size |
Database
| Metric | Type | Description |
|---|---|---|
| db_connections_active | Gauge | Active connections |
| db_query_duration_seconds | Histogram | Query time |
| db_errors_total | Counter | Query errors |
Business Metrics
| Metric | Type | Description |
|---|---|---|
| users_registered_total | Counter | New registrations |
| orders_created_total | Counter | Orders placed |
| payment_amount_total | Counter | Revenue |
Metric Types
| Type | Use Case | Example |
|---|---|---|
| Counter | Cumulative totals | Requests, errors |
| Gauge | Current value | Temperature, queue size |
| Histogram | Value distribution | Latency buckets |
| Summary | Quantiles | p50, p95, p99 |
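To make the difference between the types concrete, here is a minimal pure-Python sketch of a counter, a gauge, and a histogram. The class names and methods are hypothetical and only imitate the general shape of common metrics clients; a summary is omitted because it needs a streaming quantile estimator.

```python
import bisect


class Counter:
    """Cumulative total: it only ever increases."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


class Gauge:
    """Current value: it can go up or down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value

    def inc(self, amount=1.0):
        self.value += amount

    def dec(self, amount=1.0):
        self.value -= amount


class Histogram:
    """Value distribution: counts observations per upper-bound bucket."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the overflow bucket
        self.total = 0.0

    def observe(self, value):
        # bisect_left returns the first bucket whose upper bound is >= value
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value

    def cumulative(self):
        """Running totals, so each entry counts observations <= that bound."""
        running, out = 0, []
        for count in self.counts:
            running += count
            out.append(running)
        return out
```

For example, a histogram with bounds `[0.1, 0.5, 1.0]` that observes 0.05, 0.1, 0.3, and 2.0 ends up with cumulative counts `[2, 3, 3, 4]`.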
Naming Convention
{namespace}_{subsystem}_{name}_{unit}
http_server_request_duration_seconds
db_pool_connections_active
app_users_registered_total
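The convention can be enforced with a small helper. This is a sketch only: the `metric_name` function and the lowercase snake_case rule it checks are assumptions added for illustration, and the unit is treated as optional because `db_pool_connections_active` above has none.

```python
import re

# Assumed rule: each part is lowercase snake_case and starts with a letter.
_PART = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def metric_name(namespace, subsystem, name, unit=""):
    """Join the parts as {namespace}_{subsystem}_{name}_{unit}."""
    parts = [namespace, subsystem, name] + ([unit] if unit else [])
    for part in parts:
        if not _PART.match(part):
            raise ValueError(f"invalid metric name part: {part!r}")
    return "_".join(parts)
```

Calling `metric_name("http", "server", "request_duration", "seconds")` yields `http_server_request_duration_seconds`, matching the first example above.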
Alerting Strategy
Alert Severity
| Severity | Response Time | Action |
|---|---|---|
| Critical | Immediate | Page on-call |
| Warning | Within hours | Create ticket |
| Info | Next business day | Review |
Alert Rules
- Alert on symptoms, not causes
- Include runbook links
- Set appropriate thresholds
- Avoid alert fatigue
- Group related alerts
What to Alert On
| Alert | Condition | Severity |
|---|---|---|
| Service down | Health check fails | Critical |
| High error rate | > 5% errors | Critical |
| High latency | p99 > 2s | Warning |
| High CPU | > 80% for 5min | Warning |
| Disk space | < 20% free | Warning |
| SSL expiry | < 30 days | Warning |
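Several of the conditions in the table can be expressed as simple threshold rules. The sketch below assumes a snapshot dictionary with hypothetical keys (`error_pct`, `p99_latency_s`, `disk_free_pct`, `ssl_days_left`) and shows one way to model the "for 5min" qualifier on the CPU alert; real alerting systems have their own rule languages.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class AlertRule:
    name: str
    severity: str
    condition: Callable[[Dict[str, float]], bool]


# Thresholds come from the table above; the metric keys are illustrative.
RULES = [
    AlertRule("High error rate", "Critical", lambda m: m["error_pct"] > 5),
    AlertRule("High latency", "Warning", lambda m: m["p99_latency_s"] > 2),
    AlertRule("Disk space", "Warning", lambda m: m["disk_free_pct"] < 20),
    AlertRule("SSL expiry", "Warning", lambda m: m["ssl_days_left"] < 30),
]


def firing(metrics):
    """Return (severity, name) for every rule whose condition holds."""
    return [(r.severity, r.name) for r in RULES if r.condition(metrics)]


def sustained_above(samples, threshold, duration_s, now):
    """True if every (timestamp, value) sample in the window exceeds threshold.

    Returns False when the history does not reach back to the window start,
    so a single recent spike cannot fire a "for 5min" style alert.
    """
    window_start = now - duration_s
    if not samples or min(t for t, _ in samples) > window_start:
        return False
    recent = [v for t, v in samples if window_start <= t <= now]
    return bool(recent) and all(v > threshold for v in recent)
```

Requiring the condition to hold across the whole window is one way to reduce alert fatigue from short spikes.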
SLOs and SLIs
Service Level Indicators (SLIs)
- Availability: Successful requests / Total requests
- Latency: % requests < threshold
- Throughput: Requests per second
- Error rate: Failed requests / Total requests
Service Level Objectives (SLOs)
| SLO | Target | Error Budget |
|---|---|---|
| Availability | 99.9% | 43.8 min/month |
| Latency (p99) | < 500ms | - |
| Error rate | < 0.1% | - |
Error Budget
- Monthly allowed downtime: (100% - SLO target) x minutes in the month
- Spend on risky deployments
- Freeze deploys when exhausted
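The error budget arithmetic can be sketched as follows. The 43.8 min/month figure for 99.9% availability corresponds to an average month of 365.25 / 12 days; the function names and the freeze rule (freeze once the remaining budget reaches zero) are illustrative assumptions.

```python
# Average month: 365.25 days / 12, expressed in minutes (43,830).
MINUTES_PER_MONTH = 365.25 * 24 * 60 / 12


def error_budget_minutes(slo_target_pct, period_minutes=MINUTES_PER_MONTH):
    """Allowed downtime for the period at the given availability target."""
    return (1 - slo_target_pct / 100) * period_minutes


def budget_remaining(slo_target_pct, downtime_minutes, period_minutes=MINUTES_PER_MONTH):
    """Budget left after the downtime already spent in this period."""
    return error_budget_minutes(slo_target_pct, period_minutes) - downtime_minutes


def deploys_frozen(slo_target_pct, downtime_minutes, period_minutes=MINUTES_PER_MONTH):
    """Assumed policy: freeze deploys once the budget is exhausted."""
    return budget_remaining(slo_target_pct, downtime_minutes, period_minutes) <= 0


print(round(error_budget_minutes(99.9), 1))  # 43.8
```

With a 30-day month instead, the same target allows 43.2 minutes, so it helps to state which period the budget is measured over.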
Distributed Tracing
Concepts
| Term | Description |
|---|---|
| Trace | End-to-end request journey |
| Span | Single operation in trace |
| Context | Trace ID propagated across services |
What to Trace
- Cross-service calls
- Database queries
- External API calls
- Message queue operations
Trace Propagation
- Pass trace context in headers
- Standard: W3C Trace Context
- Headers: traceparent, tracestate
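A minimal sketch of propagation with the W3C `traceparent` header, whose version 00 layout is `00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>`. The helper names are hypothetical, only version 00 is handled, and header keys are assumed to be lowercase already; a tracing SDK would normally do this for you.

```python
import re
import secrets

_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace with random trace and span ids."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Return the parsed context, or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero ids are invalid per the specification.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_headers(incoming):
    """Headers for an outgoing call: same trace id, fresh span id."""
    ctx = parse_traceparent(incoming.get("traceparent", ""))
    if ctx is None:
        return {"traceparent": new_traceparent()}
    flags = "01" if ctx["sampled"] else "00"
    out = {"traceparent": f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{flags}"}
    if "tracestate" in incoming:
        out["tracestate"] = incoming["tracestate"]  # forwarded unchanged
    return out
```

Keeping the trace id constant while replacing the span id at each hop is what lets the backend stitch spans from different services into one trace.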
Health Checks
Endpoint Types
| Endpoint | Purpose | Response |
|---|---|---|
| /health | Basic liveness | 200 OK |
| /health/ready | Full readiness | 200 + deps status |
| /health/live | Process alive | 200 OK |
Readiness Check Components
- Database connection
- Cache connection
- External service connectivity
- Required configuration present
Health Response Format
{
  "status": "healthy",
  "checks": {
    "database": "healthy",
    "cache": "healthy",
    "external-api": "degraded"
  },
  "version": "1.2.3"
}
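A readiness handler could assemble this response from individual dependency checks. The sketch below is framework-agnostic and rests on assumptions: each check is a zero-argument callable returning a status string, a raised exception counts as "unhealthy", and a "degraded" dependency does not fail the overall status (consistent with the example above). The 503 status code for failures is also an assumption.

```python
import json


def run_checks(checks):
    """Run each check; a raised exception is reported as 'unhealthy'."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = check()
        except Exception:
            results[name] = "unhealthy"
    return results


def health_response(checks, version):
    """Return (http_status, json_body) in the format shown above."""
    results = run_checks(checks)
    healthy = all(status != "unhealthy" for status in results.values())
    body = {
        "status": "healthy" if healthy else "unhealthy",
        "checks": results,
        "version": version,
    }
    return (200 if healthy else 503), json.dumps(body)
```

Whether a degraded optional dependency should fail readiness is a policy decision; stricter services may treat it as unhealthy.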
Dashboard Design
Layout Principles
- Most important metrics at top
- Group related metrics
- Use consistent time ranges
- Include context (deploy markers)
Standard Panels
- Overview - Traffic, errors, latency
- Resources - CPU, memory, disk
- Dependencies - DB, cache, external APIs
- Business - Domain-specific metrics
Best Practices
- Link to runbooks
- Add annotations for deploys
- Use consistent colors
- Set reasonable refresh rates
Runbooks
Structure
- Alert description - What triggered
- Impact - User/business effect
- Diagnosis steps - How to investigate
- Resolution steps - How to fix
- Escalation - Who to contact
Required Runbooks
- Service restart procedure
- Database failover
- Rollback deployment
- Scale up/down
- Incident communication
Best Practices
DO
- Monitor the four golden signals
- Set up alerts before incidents
- Create runbooks for alerts
- Use distributed tracing
- Track SLOs and error budgets
- Review dashboards regularly
DON'T
- Alert on every metric
- Ignore alert fatigue
- Skip health checks
- Forget to trace async operations
- Set unrealistic SLOs
- Neglect runbook maintenance
Observability Checklist
Metrics
Alerting
Tracing
Dashboards