From: sre
Supplies Prometheus queries for golden signals, SLIs, YAML alerting rules with severity levels, and dashboard guidance for SRE monitoring systems.
Install: npx claudepluginhub thebushidocollective/han

This skill cannot use any tools. It operates in read-only mode without the ability to modify files or execute commands.
Building comprehensive monitoring and observability systems.
Latency (time to process requests):
# Request duration
http_request_duration_seconds
# Query
histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket[5m])
)
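histogram_quantile() estimates the quantile by linear interpolation inside the first cumulative bucket whose count reaches the target rank. A rough JavaScript sketch of that estimation; the bucket bounds and counts are invented example data, not from a real service:

```javascript
// Sketch of histogram_quantile()'s linear interpolation over
// cumulative bucket counts (illustrative, not Prometheus source).
function histogramQuantile(q, buckets) {
  // buckets: sorted [{ le, count }], counts cumulative, last le = Infinity
  const total = buckets[buckets.length - 1].count;
  const rank = q * total;
  let prevLe = 0;
  let prevCount = 0;
  for (const { le, count } of buckets) {
    if (count >= rank) {
      // Quantile falls in the +Inf bucket: return the highest finite bound
      if (le === Infinity) return prevLe;
      // Linear interpolation within this bucket
      return prevLe + (le - prevLe) * ((rank - prevCount) / (count - prevCount));
    }
    prevLe = le;
    prevCount = count;
  }
  return prevLe;
}

const buckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1.0, count: 99 },
  { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.95, buckets)); // ≈ 0.778
```

The p95 rank (95 of 100 requests) lands in the 0.5–1.0s bucket, so the estimate interpolates between those bounds.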
Traffic (demand on the system):
# Requests per second
rate(http_requests_total[5m])
# By endpoint
sum(rate(http_requests_total[5m])) by (endpoint)
Errors (rate of failed requests):
# Error rate
rate(http_requests_total{status=~"5.."}[5m])
/
rate(http_requests_total[5m])
# Error budget remaining (pseudo-expression)
1 - (observed_error_rate / allowed_error_rate)
Saturation (resource utilization):
# CPU usage
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/ node_memory_MemTotal_bytes * 100
Availability SLI:
# Successful requests / Total requests
sum(rate(http_requests_total{status=~"[23].."}[30d]))
/
sum(rate(http_requests_total[30d]))
Latency SLI:
# Requests faster than threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[30d]))
/
sum(rate(http_request_duration_seconds_count[30d]))
Throughput SLI:
# Requests processed within capacity
clamp_max(
rate(http_requests_total[5m]) / capacity_requests_per_second,
1.0
)
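The ratio SLIs above pair with an SLO target to yield an error budget. A hypothetical sketch of that arithmetic; the request counts and the 99.9% target are made-up numbers, not from any real service:

```javascript
// How far through the error budget a service is, given a ratio SLI
// and its SLO target (illustrative arithmetic, not a library API).
function errorBudgetRemaining(goodEvents, totalEvents, sloTarget) {
  const sli = goodEvents / totalEvents; // e.g. availability SLI
  const budget = 1 - sloTarget;         // allowed failure fraction
  const consumed = (1 - sli) / budget;  // fraction of budget spent
  return 1 - consumed;                  // share of budget left
}

// 99.9% SLO, 9,995,000 good out of 10,000,000 requests over 30 days:
// half the allowed failures have occurred, so half the budget remains.
console.log(errorBudgetRemaining(9_995_000, 10_000_000, 0.999)); // ≈ 0.5
```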
Alert severity levels:
P0 - Critical: Service down or severe degradation
P1 - High: Significant impact, error budget at risk
P2 - Medium: Degradation, not user-facing yet
P3 - Low: Awareness, no immediate action needed
groups:
  - name: sre
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
            / rate(http_requests_total[5m])
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
      - alert: LatencyP95High
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 1.0
        for: 10m
        labels:
          severity: warning
      - alert: ErrorBudgetBurn
        expr: |
          (1 - sli_availability) > (error_budget_remaining * 10)
        for: 1h
        labels:
          severity: high
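A common way to reason about an ErrorBudgetBurn alert is burn rate: the ratio of the observed error rate to the error rate the SLO allows. A minimal sketch of that idea; the 99.9% SLO and the 1.44% error rate are illustrative assumptions echoing the widely used 14.4x fast-burn threshold, not values from the rules above:

```javascript
// Burn rate: how many times faster than sustainable the error
// budget is being consumed (1 = exactly on budget for the window).
function burnRate(errorRate, sloTarget) {
  return errorRate / (1 - sloTarget);
}

// 99.9% SLO: a sustained 1.44% error rate burns budget at 14.4x,
// a level often used as a fast-burn paging threshold.
console.log(burnRate(0.0144, 0.999)); // ≈ 14.4
```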
const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('my-service');

async function handleRequest(req) {
  const span = tracer.startSpan('handle_request');
  try {
    span.setAttribute('user.id', req.user.id);
    span.setAttribute('request.path', req.path);
    const result = await processRequest(req);
    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (error) {
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: error.message,
    });
    throw error;
  } finally {
    span.end();
  }
}
logger.info('request_processed', {
  request_id: req.id,
  user_id: req.user.id,
  endpoint: req.path,
  method: req.method,
  status_code: res.statusCode,
  duration_ms: duration,
  error: error?.message,
});
For resources, use the USE method: Utilization, Saturation, Errors.
For requests, use the RED method: Rate, Errors, Duration.
# Good - alert on user impact
- alert: HighLatency
  expr: p95_latency > 1s

# Bad - alert on potential cause
- alert: HighCPU
  expr: cpu_usage > 80%
Link every alert to a runbook and dashboard:

annotations:
  runbook: "https://wiki.example.com/runbooks/high-error-rate"
  dashboard: "https://grafana.example.com/d/abc123"