From site-reliability-engineering
Supplies Prometheus queries for golden signals, SLIs, YAML alerting rules with severity levels, and dashboard guidance for SRE monitoring systems.
npx claudepluginhub thebushidocollective/han --plugin do-site-reliability-engineering
Slash command: /site-reliability-engineering:sre-monitoring
Building comprehensive monitoring and observability systems.
Latency (time to process requests):
# Request duration histogram
http_request_duration_seconds
# 95th-percentile latency over the last 5 minutes
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
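The quantile query above returns an estimate, not an exact percentile: the value is interpolated inside whichever histogram bucket contains the target rank. The sketch below illustrates that idea in simplified form. It is not Prometheus's source code, and the `estimateQuantile` name and the `{ le, count }` input shape are assumptions made for illustration.

```javascript
// Simplified sketch of bucket interpolation (illustrative, not Prometheus's code).
// `buckets` is a hypothetical input: cumulative counts sorted by upper bound `le`,
// with the +Inf bucket last.
function estimateQuantile(q, buckets) {
  if (q < 0 || q > 1 || buckets.length < 2) return NaN;
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  // First bucket whose cumulative count reaches the target rank.
  const i = buckets.findIndex((b) => b.count >= rank);

  // The +Inf bucket has no finite width, so report the highest finite bound.
  if (i === buckets.length - 1) return buckets[buckets.length - 2].le;

  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;
  if (inBucket === 0) return lower;

  // Assume observations are spread evenly inside the bucket.
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket);
}

const sampleBuckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 100 },
  { le: Infinity, count: 100 },
];
console.log(estimateQuantile(0.95, sampleBuckets)); // approximately 0.75
```

Because the result depends on bucket boundaries, a threshold such as 1.0 is most trustworthy when it coincides with a bucket bound.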
Traffic (demand on the system):
# Requests per second
rate(http_requests_total[5m])
# By endpoint
sum(rate(http_requests_total[5m])) by (endpoint)
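Errors (rate of failed requests):
# Error rate: share of requests that returned a 5xx status.
# Both sides are aggregated so their label sets match.
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# SLI compliance (pseudocode): fraction of the error budget still unspent.
# slo_target here is the maximum error rate the SLO allows, e.g. 0.001 for 99.9%.
1 - (error_rate / slo_target)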
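The compliance expression above is pseudocode. One reading of it is "how much of the error budget is left". The sketch below works that out from raw request counts, assuming the SLO is given as a success target (for example 0.999). The function name and parameters are hypothetical.

```javascript
// Sketch: fraction of the error budget remaining for a window.
// Assumption: sloTarget is a success ratio such as 0.999 (99.9%).
function errorBudgetRemaining(totalRequests, failedRequests, sloTarget) {
  // Failures the SLO tolerates over this window.
  const allowedFailures = totalRequests * (1 - sloTarget);
  if (allowedFailures === 0) {
    // A 100% target (or no traffic) leaves no budget to spend.
    return failedRequests === 0 ? 1 : -Infinity;
  }
  // 1 = budget untouched, 0 = budget exhausted, negative = SLO violated.
  return 1 - failedRequests / allowedFailures;
}

// 1,000,000 requests at a 99.9% target allow about 1,000 failures;
// 250 failures leave roughly three quarters of the budget.
console.log(errorBudgetRemaining(1000000, 250, 0.999));
```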
Saturation (resource utilization):
# CPU usage
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/ node_memory_MemTotal_bytes * 100
# Successful requests / Total requests
sum(rate(http_requests_total{status=~"[23].."}[30d]))
/
sum(rate(http_requests_total[30d]))
# Requests faster than threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[30d]))
/
sum(rate(http_request_duration_seconds_count[30d]))
# Requests processed within capacity
clamp_max(
  rate(http_requests_total[5m]) / capacity_requests_per_second,
  1.0
)
P0 - Critical: Service down or severe degradation
P1 - High: Significant impact, error budget at risk
P2 - Medium: Degradation, not user-facing yet
P3 - Low: Awareness, no immediate action needed
# High error rate
groups:
  - name: sre
    rules:
      - alert: HighErrorRate
        # Aggregate both sides by service so the ratio matches on labels
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (service) (rate(http_requests_total[5m]))
            > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
      - alert: LatencyP95High
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 1.0
        for: 10m
        labels:
          severity: warning
      - alert: ErrorBudgetBurn
        # sli_availability and error_budget_remaining are placeholder series,
        # assumed to come from recording rules defined elsewhere
        expr: |
          (1 - sli_availability) > (error_budget_remaining * 10)
        for: 1h
        labels:
          severity: high
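The ErrorBudgetBurn rule above compares the observed error ratio against a multiple of the budget. A common way to reason about such multipliers is the burn rate: how many times faster than "exactly on budget" errors are arriving. The sketch below shows that arithmetic. The function names are hypothetical, and a 30-day (720-hour) SLO window is assumed in the example.

```javascript
// Sketch: burn-rate arithmetic for error-budget alerts.
// Assumption: sloTarget is a success ratio such as 0.999.

// Burn rate 1 means the budget lasts exactly the SLO window;
// burn rate 10 means it is gone in a tenth of the window.
function burnRate(errorRate, sloTarget) {
  return errorRate / (1 - sloTarget);
}

// Share of the whole budget consumed if `rate` holds for alertWindowHours.
function budgetConsumed(rate, alertWindowHours, sloWindowHours) {
  return (rate * alertWindowHours) / sloWindowHours;
}

// A 1% error rate against a 99.9% target burns budget about 10x too fast.
console.log(burnRate(0.01, 0.999));
// Sustained for one hour of a 720-hour window, that is about 1.4% of the budget.
console.log(budgetConsumed(burnRate(0.01, 0.999), 1, 720));
```

Pairing a high burn rate with a short window and a lower burn rate with a longer window catches both sudden outages and slow leaks.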
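// SpanStatusCode must be imported alongside trace
const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('my-service');

// processRequest is the application's own handler (not shown here)
async function handleRequest(req) {
  const span = tracer.startSpan('handle_request');
  try {
    span.setAttribute('user.id', req.user.id);
    span.setAttribute('request.path', req.path);
    const result = await processRequest(req);
    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (error) {
    // Attach the exception to the span, mark it failed, then rethrow
    span.recordException(error);
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: error.message,
    });
    throw error;
  } finally {
    // Always end the span, on success or failure
    span.end();
  }
}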
logger.info('request_processed', {
  request_id: req.id,
  user_id: req.user.id,
  endpoint: req.path,
  method: req.method,
  status_code: res.statusCode,
  duration_ms: duration,
  error: error?.message,
});
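The call above assumes a `logger` that emits structured entries. If no logging library is in place yet, a minimal sketch of what such a logger might produce is one JSON object per line. The `formatLogLine` helper and its field names (`timestamp`, `level`, `event`) are hypothetical, not part of any specific library.

```javascript
// Sketch: format one structured log entry as a single JSON line.
// `now` is injectable so the output is deterministic in tests.
function formatLogLine(level, event, fields, now = new Date()) {
  // Core fields are spread last so caller-supplied fields cannot overwrite them.
  const entry = {
    ...fields,
    timestamp: now.toISOString(),
    level,
    event,
  };
  // JSON.stringify omits properties whose value is undefined,
  // so an absent error simply does not appear in the line.
  return JSON.stringify(entry);
}

const line = formatLogLine('info', 'request_processed', {
  request_id: 'req-1',
  status_code: 200,
  duration_ms: 42,
  error: undefined,
});
console.log(line);
```

Keeping each entry on one line lets log shippers split on newlines and parse each record independently.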
For resources (USE method): utilization, saturation, errors.
For requests (RED method): rate, errors, duration.
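# Good - alert on user impact
- alert: HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
# Bad - alert on potential cause
- alert: HighCPU
  expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80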
annotations:
  runbook: "https://wiki.example.com/runbooks/high-error-rate"
  dashboard: "https://grafana.example.com/d/abc123"