sre-monitoring-and-observability | site-reliability-engineering | ClaudePluginHub

Skill

sre-monitoring-and-observability

From site-reliability-engineering

Supplies Prometheus queries for golden signals, SLIs, YAML alerting rules with severity levels, and dashboard guidance for SRE monitoring systems.

$ npx claudepluginhub thebushidocollective/han --plugin do-site-reliability-engineering

Popularity

Parent stars

152

Parent forks

16

Shared by

2

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/site-reliability-engineering:sre-monitoring

User invocable

Model invocable

Inline context

Default effort

Tool Access

This skill has no tool access — it operates in read-only mode.

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Building comprehensive monitoring and observability systems.

SKILL.md

240 lines · ~1.1k tokens

Stats

Language: TypeScript

Parent stars: 152

Parent forks: 16

Maintenance: Fair

Last Commit: Mar 2, 2026

SRE Monitoring and Observability

Reference queries, alert rules, and instrumentation patterns for building monitoring and observability systems.

Four Golden Signals

Latency

Time to process requests:

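# Request duration histogram (seconds)
http_request_duration_seconds

# 95th-percentile latency over the last 5 minutes, aggregated across all series
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)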
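
To make the percentile query less of a black box, here is a minimal sketch of the bucket interpolation that a histogram quantile relies on. The input shape (`le`, `count`) and the function name are assumptions made for illustration; it assumes cumulative bucket counts with a final `+Inf` bucket and is not Prometheus's actual implementation.

```javascript
// Sketch of the bucket interpolation behind a histogram quantile.
// `buckets` is a hypothetical input shape: cumulative counts keyed by upper
// bound `le`, with a final bucket whose `le` is Infinity.
function histogramQuantile(q, buckets) {
  if (q < 0 || q > 1) return NaN;
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  if (sorted.length < 2) return NaN;
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const i = sorted.findIndex((b) => b.count >= rank);

  // The rank lands in the +Inf bucket: report the highest finite bound.
  if (sorted[i].le === Infinity) return sorted[i - 1].le;

  const lowerBound = i === 0 ? 0 : sorted[i - 1].le;
  const lowerCount = i === 0 ? 0 : sorted[i - 1].count;
  const inBucket = sorted[i].count - lowerCount;
  if (inBucket === 0) return lowerBound;

  // Assume observations are spread evenly inside the bucket.
  return lowerBound + (sorted[i].le - lowerBound) * ((rank - lowerCount) / inBucket);
}

// Illustrative data: 100 requests, 98 of them at or under 1 second.
const sampleBuckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 98 },
  { le: Infinity, count: 100 },
];
const p95 = histogramQuantile(0.95, sampleBuckets);
```

Because the result is interpolated, its accuracy depends on how closely the bucket bounds bracket the latencies you care about.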

Traffic

Demand on the system:

# Requests per second
rate(http_requests_total[5m])

# By endpoint
sum(rate(http_requests_total[5m])) by (endpoint)
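
The queries above depend on `rate()` turning an ever-increasing counter into a per-second value. The following is a simplified sketch of that idea, assuming samples arrive as `{ t, value }` pairs (a hypothetical shape) with `t` in seconds. It treats any drop in the counter as a reset and deliberately omits the window-edge extrapolation that Prometheus applies.

```javascript
// Simplified per-second rate over counter samples.
// Any decrease is treated as a counter reset (the process restarted from 0).
function simpleRate(samples) {
  if (samples.length < 2) return NaN;
  const seconds = samples[samples.length - 1].t - samples[0].t;
  if (seconds <= 0) return NaN;

  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].value - samples[i - 1].value;
    // After a reset, the new value itself is the increase since the restart.
    increase += delta >= 0 ? delta : samples[i].value;
  }
  return increase / seconds;
}

// Illustrative samples with a reset between t=60 and t=120.
const counterSamples = [
  { t: 0, value: 100 },
  { t: 60, value: 160 },
  { t: 120, value: 20 },
  { t: 180, value: 80 },
];
const requestsPerSecond = simpleRate(counterSamples);
```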

Errors

Rate of failed requests:

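# Error ratio: 5xx requests as a fraction of all requests.
# Both sides are summed so their label sets match; without sum(), each 5xx
# series is only divided by itself and the result is always 1.
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))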

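# Error budget remaining (pseudocode): error_rate is the ratio above,
# slo_target is the success target, e.g. 0.999
1 - (error_rate / (1 - slo_target))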
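
A minimal sketch of the error-ratio and error-budget arithmetic, assuming `slo_target` is a success-ratio target such as 0.999, so the allowed failure fraction is `1 - slo_target`. The function names are hypothetical.

```javascript
// Fraction of requests that failed in the window.
function failureRatio(failedRequests, totalRequests) {
  return totalRequests === 0 ? 0 : failedRequests / totalRequests;
}

// Fraction of the error budget still unspent.
// 1 means untouched, 0 means exhausted, negative means overspent.
function budgetRemaining(ratio, sloTarget) {
  const budget = 1 - sloTarget; // allowed failure fraction
  if (budget <= 0) return NaN; // a 100% target leaves no budget to measure against
  return 1 - ratio / budget;
}

// Illustrative numbers: 5 failures in 10,000 requests against a 99.9% target.
const remaining = budgetRemaining(failureRatio(5, 10000), 0.999);
```

With these illustrative numbers, roughly half of the budget is left.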

Saturation

Resource utilization:

# CPU usage
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) 
/ node_memory_MemTotal_bytes * 100

Service Level Indicators (SLIs)

Availability SLI

# Successful requests / Total requests
sum(rate(http_requests_total{status=~"[23].."}[30d]))
/
sum(rate(http_requests_total[30d]))

Latency SLI

# Requests faster than threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[30d]))
/
sum(rate(http_request_duration_seconds_count[30d]))

Throughput SLI

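# Fraction of capacity in use, capped at 1.0
# (capacity_requests_per_second is a placeholder for a single-series metric you define)
clamp_max(
  sum(rate(http_requests_total[5m])) / scalar(capacity_requests_per_second),
  1.0
)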
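
The availability and latency SLIs above are both "good events divided by total events". A small sketch of that calculation over raw request records follows. The record shape (`status`, `durationSeconds`) is an assumption for illustration, and the success rule mirrors the `[23]..` status match used in the query.

```javascript
// Compute availability and latency SLIs from a list of request records.
function computeSlis(requests, latencyThresholdSeconds) {
  const total = requests.length;
  if (total === 0) return { availability: NaN, latency: NaN };

  // Good for availability: 2xx or 3xx responses.
  const good = requests.filter((r) => r.status >= 200 && r.status < 400).length;
  // Good for latency: at or under the threshold, like the le="0.5" bucket.
  const fast = requests.filter((r) => r.durationSeconds <= latencyThresholdSeconds).length;

  return { availability: good / total, latency: fast / total };
}

const sliSample = computeSlis(
  [
    { status: 200, durationSeconds: 0.1 },
    { status: 302, durationSeconds: 0.4 },
    { status: 500, durationSeconds: 0.7 },
    { status: 404, durationSeconds: 0.5 },
  ],
  0.5
);
```

Note that this rule counts 4xx responses as failures, as the query does. Whether that is appropriate depends on the service.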

Alerting

Alert Severity Levels

P0 - Critical: Service down or severe degradation
P1 - High: Significant impact, error budget at risk
P2 - Medium: Degradation, not user-facing yet
P3 - Low: Awareness, no immediate action needed
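
The example alert rules in this document label alerts with `severity` values such as `critical`, `high`, and `warning` rather than P0 to P3. One possible way to connect the two is a lookup table like the sketch below. The mapping itself, the `info` value, and the fallback to P3 are assumptions, not something the source defines.

```javascript
// Hypothetical mapping from alert `severity` labels to the priority levels above.
const SEVERITY_TO_PRIORITY = new Map([
  ['critical', 'P0'],
  ['high', 'P1'],
  ['warning', 'P2'],
  ['info', 'P3'],
]);

// Resolve the priority for an alert object shaped like { labels: { severity } }.
function priorityFor(alert) {
  const severity = alert.labels ? alert.labels.severity : undefined;
  // Unknown or missing severities fall back to the lowest priority (an assumption).
  return SEVERITY_TO_PRIORITY.get(severity) ?? 'P3';
}
```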

Example Alerts

# High error rate
groups:
  - name: sre
    rules:
      - alert: HighErrorRate
        # Aggregate both sides by service so the label sets match
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (service) (rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          
      - alert: LatencyP95High
        expr: |
          histogram_quantile(0.95,
            sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 1.0
        for: 10m
        labels:
          severity: warning
          
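      - alert: ErrorBudgetBurn
        # sli_availability and slo_target are placeholders for recording rules you define.
        # Fires when errors consume budget at more than 10x the sustainable rate.
        expr: |
          (1 - sli_availability) / (1 - slo_target) > 10
        for: 1h
        labels:
          severity: high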
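
Each rule above uses a `for:` duration, which means the condition has to stay true for that long before the alert fires. The sketch below models that behaviour over a series of evaluations. The input shape and state names are chosen for illustration, and it ignores details such as missed evaluations.

```javascript
// Model the effect of a `for:` duration on an alert's state.
// `evaluations` is a time-ordered list of { t, conditionTrue }, with t in seconds.
function evaluateWithFor(evaluations, forSeconds) {
  let activeSince = null;
  return evaluations.map(({ t, conditionTrue }) => {
    if (!conditionTrue) {
      activeSince = null; // any false evaluation resets the timer
      return 'inactive';
    }
    if (activeSince === null) activeSince = t;
    return t - activeSince >= forSeconds ? 'firing' : 'pending';
  });
}

// A brief recovery at t=120 restarts the wait, so firing only begins at t=300.
const alertStates = evaluateWithFor(
  [
    { t: 0, conditionTrue: true },
    { t: 60, conditionTrue: true },
    { t: 120, conditionTrue: false },
    { t: 180, conditionTrue: true },
    { t: 240, conditionTrue: true },
    { t: 300, conditionTrue: true },
  ],
  120
);
```

Longer `for:` values suppress short spikes at the cost of slower detection.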

Dashboards

Overview Dashboard

Service health (red/yellow/green)
Request rate
Error rate
Latency percentiles (p50, p95, p99)
Saturation metrics
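
The red/yellow/green service health tile can be derived from the same signals the dashboard already shows. This is a rough sketch: the red thresholds reuse the 5% error ratio and 1 second p95 from the example alerts, while the yellow thresholds (half of those values) and all names are assumptions to adjust per service.

```javascript
// Hypothetical thresholds; red values match the example alert rules.
const DEFAULT_HEALTH_THRESHOLDS = {
  red: { errorRatio: 0.05, p95Seconds: 1.0 },
  yellow: { errorRatio: 0.025, p95Seconds: 0.5 },
};

// Reduce error ratio and p95 latency to a single health colour.
function serviceHealth({ errorRatio, p95Seconds }, thresholds = DEFAULT_HEALTH_THRESHOLDS) {
  const exceeds = (limit) => errorRatio > limit.errorRatio || p95Seconds > limit.p95Seconds;
  if (exceeds(thresholds.red)) return 'red';
  if (exceeds(thresholds.yellow)) return 'yellow';
  return 'green';
}
```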

Detailed Dashboard

Per-endpoint metrics
Dependency health
Database performance
Cache hit rates
Queue depths

Distributed Tracing

OpenTelemetry

const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('my-service');

// processRequest is the application's own handler (not shown here).
async function handleRequest(req) {
  const span = tracer.startSpan('handle_request');

  try {
    span.setAttribute('user.id', req.user.id);
    span.setAttribute('request.path', req.path);

    const result = await processRequest(req);

    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (error) {
    // Attach the exception to the span before marking it failed
    span.recordException(error);
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: error.message,
    });
    throw error;
  } finally {
    span.end();
  }
}

Structured Logging

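// Assumes a logger that accepts (message, fields), with req, res, duration,
// and error coming from the surrounding request handler.
logger.info('request_processed', {
  request_id: req.id,
  user_id: req.user.id,
  endpoint: req.path,
  method: req.method,
  status_code: res.statusCode,
  duration_ms: duration,
  error: error?.message,
});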
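
For readers without a logging library at hand, the following sketch shows what a structured log call can reduce to: one JSON object per line. The function name and the `timestamp`, `level`, and `event` keys are assumptions for illustration, not the format of any particular logger.

```javascript
// Build a single JSON log line from an event name and a flat field object.
function formatLogLine(level, event, fields, now = new Date()) {
  const entry = { timestamp: now.toISOString(), level, event, ...fields };
  // JSON.stringify omits undefined values, so an absent `error` adds no key.
  return JSON.stringify(entry);
}

const logLine = formatLogLine(
  'info',
  'request_processed',
  { request_id: 'r-1', status_code: 200, duration_ms: 42, error: undefined },
  new Date(0)
);
console.log(logLine);
```

Keeping field names stable across services is what makes such lines queryable later.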

Best Practices

USE Method

For resources:

Utilization: Percentage of time the resource is busy
Saturation: Work queued but not yet serviced
Errors: Count of error events

RED Method

For requests:

Rate: Requests per second
Errors: Failed requests per second
Duration: Request latency distribution
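
As a concrete reading of the RED method, the sketch below derives all three values from request records collected over a fixed window. The record shape and the choice of nearest-rank percentiles are assumptions. A real system would normally compute these from metrics rather than raw records.

```javascript
// Compute RED metrics for requests observed during `windowSeconds`.
function redMetrics(requests, windowSeconds) {
  const durations = requests.map((r) => r.durationSeconds).sort((a, b) => a - b);
  const failed = requests.filter((r) => r.status >= 500).length;

  // Nearest-rank percentile over the sorted durations.
  const percentile = (p) =>
    durations.length === 0
      ? NaN
      : durations[Math.max(0, Math.ceil(p * durations.length) - 1)];

  return {
    rate: requests.length / windowSeconds, // requests per second
    errors: failed / windowSeconds, // failed requests per second
    duration: { p50: percentile(0.5), p95: percentile(0.95) },
  };
}

const redSample = redMetrics(
  [
    { status: 200, durationSeconds: 0.1 },
    { status: 200, durationSeconds: 0.2 },
    { status: 500, durationSeconds: 0.3 },
    { status: 503, durationSeconds: 0.4 },
  ],
  2
);
```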

Alert on Symptoms, Not Causes

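# Good - alert on user impact
- alert: HighLatency
  expr: |
    histogram_quantile(0.95,
      sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
    ) > 1

# Bad - alert on potential cause
- alert: HighCPU
  expr: |
    100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80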

Runbook Links

annotations:
  runbook: "https://wiki.example.com/runbooks/high-error-rate"
  dashboard: "https://grafana.example.com/d/abc123"
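
Runbook links are easy to forget when adding rules, so a small check can help. The sketch below assumes the rule file has already been parsed into a plain object with the `groups` / `rules` layout shown earlier (the parsing step and the function name are not part of the source) and lists alerts that lack a `runbook` annotation.

```javascript
// Return the names of alerting rules that have no `runbook` annotation.
function rulesMissingRunbook(groups) {
  return groups
    .flatMap((group) => group.rules || [])
    .filter((rule) => rule.alert && !(rule.annotations && rule.annotations.runbook))
    .map((rule) => rule.alert);
}

const missingRunbooks = rulesMissingRunbook([
  {
    name: 'sre',
    rules: [
      {
        alert: 'HighErrorRate',
        annotations: { runbook: 'https://wiki.example.com/runbooks/high-error-rate' },
      },
      { alert: 'LatencyP95High', labels: { severity: 'warning' } },
    ],
  },
]);
```

A check like this could run in CI so that every new alert ships with a runbook.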