Provides Prometheus queries and templates for SLO/SLI definitions on availability/latency, error budget calculations, and burn rate alerting for service reliability.
This skill provides templates for implementing Service Level Objectives (SLOs) and Service Level Indicators (SLIs) using Prometheus recording rules, error budget calculations, and burn rate alerting. It follows the Google SRE book methodology for multi-window burn rate alerts.
All Prometheus queries use the HTTP API at http://localhost:9090/api/v1/query. Credentials are not required for local Prometheus (HTTP, no auth). Recording rules and alerting rules are YAML blocks that can be added to the Prometheus configuration at docker-compose/prometheus/prometheus.yml.
| Variable | Default | Description |
|---|---|---|
| `OPENSEARCH_ENDPOINT` | https://localhost:9200 | OpenSearch base URL |
| `OPENSEARCH_USER` | admin | OpenSearch username |
| `OPENSEARCH_PASSWORD` | My_password_123!@# | OpenSearch password |
| `PROMETHEUS_ENDPOINT` | http://localhost:9090 | Prometheus base URL |
The availability SLI measures the ratio of successful requests (non-5xx) to total requests. A value of 1.0 means all requests succeeded; 0.99 means 1% failed.
Note on status code labels: the label name varies by OTel SDK version. Older semconv uses `http_status_code`; newer stable semconv uses `http_response_status_code`. Use the Metric Discovery section in the metrics skill to check which label is present, and replace `http_response_status_code` in the queries below with the variant active in your stack.
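If the metrics skill is not at hand, the Prometheus series API lists the label sets for a metric directly, so you can confirm which status-code label your stack emits:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/series" \
  --data-urlencode 'match[]=http_server_duration_seconds_count'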
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=sum(rate(http_server_duration_seconds_count{http_response_status_code!~"5.."}[5m])) / sum(rate(http_server_duration_seconds_count[5m]))'
Per-service availability:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=sum(rate(http_server_duration_seconds_count{http_response_status_code!~"5.."}[5m])) by (service_name) / sum(rate(http_server_duration_seconds_count[5m])) by (service_name)'
The latency SLI measures the ratio of requests completing within a threshold (e.g., 250ms) to total requests. A value of 0.95 means 95% of requests finished within the threshold.
Note on latency thresholds: the `le` bucket boundary depends on the metric's unit. For `_seconds` metrics, use `le="0.25"` for 250ms. For `_milliseconds` metrics, use `le="250"`. Use the Metric Discovery section in the metrics skill to check which metric name is active.
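To confirm which bucket boundaries actually exist before choosing a threshold, group the bucket series by `le`:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=count by (le) (http_server_duration_seconds_bucket)'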
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=sum(rate(http_server_duration_seconds_bucket{le="0.25"}[5m])) / sum(rate(http_server_duration_seconds_count[5m]))'
Per-service latency SLI with a 500ms threshold:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=sum(rate(http_server_duration_seconds_bucket{le="0.5"}[5m])) by (service_name) / sum(rate(http_server_duration_seconds_count[5m])) by (service_name)'
The GenAI SLI measures agent response time against a latency objective, using the gen_ai_client_operation_duration_seconds histogram. For example, the ratio of GenAI operations completing within 5 seconds:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=sum(rate(gen_ai_client_operation_duration_seconds_bucket{le="5.0"}[5m])) by (gen_ai_operation_name) / sum(rate(gen_ai_client_operation_duration_seconds_count[5m])) by (gen_ai_operation_name)'
Per-model GenAI availability (non-error operations):
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=sum(rate(gen_ai_client_operation_duration_seconds_count{gen_ai_operation_name!="error"}[5m])) by (gen_ai_request_model) / sum(rate(gen_ai_client_operation_duration_seconds_count[5m])) by (gen_ai_request_model)'
Recording rules pre-compute SLI values at multiple time windows so that SLO compliance queries are fast and efficient. Add these rule groups to docker-compose/prometheus/prometheus.yml under the rule_files section.
Recording rules follow the pattern:
| Pattern | Example |
|---|---|
| `sli:http_availability:ratio_rate<window>` | `sli:http_availability:ratio_rate5m` |
| `sli:http_latency:ratio_rate<window>` | `sli:http_latency:ratio_rate5m` |
Windows: 5m, 30m, 1h, 6h, 1d, 3d, 30d
groups:
- name: sli_availability
rules:
- record: sli:http_availability:ratio_rate5m
expr: |
sum(rate(http_server_duration_seconds_count{http_response_status_code!~"5.."}[5m])) by (service_name)
/
sum(rate(http_server_duration_seconds_count[5m])) by (service_name)
labels:
sli: availability
- record: sli:http_availability:ratio_rate30m
expr: |
sum(rate(http_server_duration_seconds_count{http_response_status_code!~"5.."}[30m])) by (service_name)
/
sum(rate(http_server_duration_seconds_count[30m])) by (service_name)
labels:
sli: availability
- record: sli:http_availability:ratio_rate1h
expr: |
sum(rate(http_server_duration_seconds_count{http_response_status_code!~"5.."}[1h])) by (service_name)
/
sum(rate(http_server_duration_seconds_count[1h])) by (service_name)
labels:
sli: availability
- record: sli:http_availability:ratio_rate6h
expr: |
sum(rate(http_server_duration_seconds_count{http_response_status_code!~"5.."}[6h])) by (service_name)
/
sum(rate(http_server_duration_seconds_count[6h])) by (service_name)
labels:
sli: availability
- record: sli:http_availability:ratio_rate1d
expr: |
sum(rate(http_server_duration_seconds_count{http_response_status_code!~"5.."}[1d])) by (service_name)
/
sum(rate(http_server_duration_seconds_count[1d])) by (service_name)
labels:
sli: availability
- record: sli:http_availability:ratio_rate3d
expr: |
sum(rate(http_server_duration_seconds_count{http_response_status_code!~"5.."}[3d])) by (service_name)
/
sum(rate(http_server_duration_seconds_count[3d])) by (service_name)
labels:
sli: availability
- record: sli:http_availability:ratio_rate30d
expr: |
sum(rate(http_server_duration_seconds_count{http_response_status_code!~"5.."}[30d])) by (service_name)
/
sum(rate(http_server_duration_seconds_count[30d])) by (service_name)
labels:
sli: availability
- name: sli_latency
rules:
- record: sli:http_latency:ratio_rate5m
expr: |
sum(rate(http_server_duration_seconds_bucket{le="0.25"}[5m])) by (service_name)
/
sum(rate(http_server_duration_seconds_count[5m])) by (service_name)
labels:
sli: latency
- record: sli:http_latency:ratio_rate30m
expr: |
sum(rate(http_server_duration_seconds_bucket{le="0.25"}[30m])) by (service_name)
/
sum(rate(http_server_duration_seconds_count[30m])) by (service_name)
labels:
sli: latency
- record: sli:http_latency:ratio_rate1h
expr: |
sum(rate(http_server_duration_seconds_bucket{le="0.25"}[1h])) by (service_name)
/
sum(rate(http_server_duration_seconds_count[1h])) by (service_name)
labels:
sli: latency
- record: sli:http_latency:ratio_rate6h
expr: |
sum(rate(http_server_duration_seconds_bucket{le="0.25"}[6h])) by (service_name)
/
sum(rate(http_server_duration_seconds_count[6h])) by (service_name)
labels:
sli: latency
- record: sli:http_latency:ratio_rate1d
expr: |
sum(rate(http_server_duration_seconds_bucket{le="0.25"}[1d])) by (service_name)
/
sum(rate(http_server_duration_seconds_count[1d])) by (service_name)
labels:
sli: latency
- record: sli:http_latency:ratio_rate3d
expr: |
sum(rate(http_server_duration_seconds_bucket{le="0.25"}[3d])) by (service_name)
/
sum(rate(http_server_duration_seconds_count[3d])) by (service_name)
labels:
sli: latency
- record: sli:http_latency:ratio_rate30d
expr: |
sum(rate(http_server_duration_seconds_bucket{le="0.25"}[30d])) by (service_name)
/
sum(rate(http_server_duration_seconds_count[30d])) by (service_name)
labels:
sli: latency
Query a recording rule value:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=sli:http_availability:ratio_rate30d'
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=sli:http_latency:ratio_rate1h'
| SLO Target | Error Budget | Allowed Downtime (30 days) | Allowed Downtime (per day) |
|---|---|---|---|
| 99.9% | 0.1% | 43.2 minutes | 1.44 minutes |
| 99.5% | 0.5% | 3.6 hours | 7.2 minutes |
| 99.0% | 1.0% | 7.2 hours | 14.4 minutes |
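These figures follow directly from the window length: allowed downtime = window × (1 - target). For the 30-day column, 30 × 24 × 60 = 43200 minutes, so a 99.9% target allows 43200 × 0.001 = 43.2 minutes.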
The remaining error budget tells you what fraction of your error budget is still available. A value of 1.0 means the full budget remains; 0.0 means the budget is exhausted; negative means you've exceeded it.
Formula: 1 - (1 - SLI) / (1 - SLO_target)
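Worked example: with a 30-day SLI of 0.9995 against a 99.9% target, 1 - (1 - 0.9995) / (1 - 0.999) = 1 - 0.0005 / 0.001 = 0.5, so half the budget remains.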
For a 99.9% SLO target using the 30-day availability SLI:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=1 - ((1 - sli:http_availability:ratio_rate30d) / (1 - 0.999))'
For a 99.5% SLO target:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=1 - ((1 - sli:http_availability:ratio_rate30d) / (1 - 0.995))'
For a 99.0% SLO target:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=1 - ((1 - sli:http_availability:ratio_rate30d) / (1 - 0.99))'
The consumption rate (the same quantity formalized as the burn rate below) shows how fast the error budget is being spent. A value of 1.0 means the budget is being consumed at exactly the sustainable rate; values above 1.0 mean it will run out before the SLO window ends.
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=(1 - sli:http_availability:ratio_rate1h) / (1 - 0.999)'
Per-service error budget consumption over the last day:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=(1 - sli:http_availability:ratio_rate1d) / (1 - 0.999)'
Burn rate measures how fast you are consuming your error budget relative to the SLO. A burn rate of 1.0 means you will exactly exhaust the budget by the end of the SLO window. Higher values mean faster consumption.
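Time to exhaustion follows directly: time = SLO window / burn rate. Over a 30-day window, a burn rate of 2 exhausts the budget in 15 days; a burn rate of 14.4 exhausts it in about 2.1 days.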
Burn rate over a 1-hour window for a 99.9% SLO:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=(1 - sli:http_availability:ratio_rate1h) / (1 - 0.999)'
Burn rate over a 6-hour window:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=(1 - sli:http_availability:ratio_rate6h) / (1 - 0.999)'
The multi-window approach uses two conditions that must both be true before alerting. This reduces false positives by requiring both a short-term spike and a sustained trend.
Detects severe incidents that will exhaust the entire 30-day error budget in ~2 days. Both the 1-hour and 6-hour burn rates must exceed 14.4x:
1-hour burn rate:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=(1 - sli:http_availability:ratio_rate1h) / (1 - 0.999) > 14.4'
6-hour burn rate (confirmation window):
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=(1 - sli:http_availability:ratio_rate6h) / (1 - 0.999) > 14.4'
Detects slow, sustained degradation that will exhaust the error budget by the end of the SLO window. Both the 3-day and 30-day burn rates must exceed 1x:
3-day burn rate:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=(1 - sli:http_availability:ratio_rate3d) / (1 - 0.999) > 1'
30-day burn rate (confirmation window):
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=(1 - sli:http_availability:ratio_rate30d) / (1 - 0.999) > 1'
Add these alerting rules to the Prometheus configuration to trigger alerts when burn rates exceed thresholds. These follow the multi-window pattern from the Google SRE book.
groups:
- name: slo_burn_rate_alerts
rules:
- alert: SLOAvailabilityFastBurn
expr: |
(
(1 - sli:http_availability:ratio_rate1h) / (1 - 0.999) > 14.4
and
(1 - sli:http_availability:ratio_rate6h) / (1 - 0.999) > 14.4
)
for: 2m
labels:
severity: critical
slo: availability
annotations:
summary: "High availability burn rate detected for {{ $labels.service_name }}"
description: "Service {{ $labels.service_name }} is consuming error budget at 14.4x the sustainable rate. At this rate, the 30-day budget will be exhausted in ~2 days."
- alert: SLOAvailabilitySlowBurn
expr: |
(
(1 - sli:http_availability:ratio_rate3d) / (1 - 0.999) > 1
and
(1 - sli:http_availability:ratio_rate30d) / (1 - 0.999) > 1
)
for: 1h
labels:
severity: warning
slo: availability
annotations:
summary: "Sustained availability degradation for {{ $labels.service_name }}"
description: "Service {{ $labels.service_name }} has a burn rate above 1x over 3 days, confirmed by the 30-day window. Error budget will be exhausted before the SLO window ends."
- name: slo_latency_burn_rate_alerts
rules:
- alert: SLOLatencyFastBurn
expr: |
(
(1 - sli:http_latency:ratio_rate1h) / (1 - 0.999) > 14.4
and
(1 - sli:http_latency:ratio_rate6h) / (1 - 0.999) > 14.4
)
for: 2m
labels:
severity: critical
slo: latency
annotations:
summary: "High latency burn rate detected for {{ $labels.service_name }}"
description: "Service {{ $labels.service_name }} latency SLI is degrading at 14.4x the sustainable rate."
- alert: SLOLatencySlowBurn
expr: |
(
(1 - sli:http_latency:ratio_rate3d) / (1 - 0.999) > 1
and
(1 - sli:http_latency:ratio_rate30d) / (1 - 0.999) > 1
)
for: 1h
labels:
severity: warning
slo: latency
annotations:
summary: "Sustained latency degradation for {{ $labels.service_name }}"
description: "Service {{ $labels.service_name }} latency SLI burn rate exceeds 1x over 3 days."
Query active alerts:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/alerts"
Query alerting rules:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/rules"
Query the current availability SLI over the 30-day window for all services:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=sli:http_availability:ratio_rate30d'
Query the current latency SLI over the 30-day window:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=sli:http_latency:ratio_rate30d'
Check which services are meeting the 99.9% availability SLO:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=sli:http_availability:ratio_rate30d >= 0.999'
Check which services are violating the SLO:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=sli:http_availability:ratio_rate30d < 0.999'
Remaining error budget for each service against a 99.9% target:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=1 - ((1 - sli:http_availability:ratio_rate30d) / (1 - 0.999))'
Current burn rate for each service over the last hour:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=(1 - sli:http_availability:ratio_rate1h) / (1 - 0.999)'
Current burn rate over the last day:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=(1 - sli:http_availability:ratio_rate1d) / (1 - 0.999)'
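For dashboards or trend reports, the same expressions work with the range-query API. A sketch graphing the 1-hour burn rate at hourly resolution over the past day (GNU date shown; on macOS use date -v-1d):
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query_range" \
  --data-urlencode 'query=(1 - sli:http_availability:ratio_rate1h) / (1 - 0.999)' \
  --data-urlencode "start=$(date -u -d '1 day ago' +%s)" \
  --data-urlencode "end=$(date -u +%s)" \
  --data-urlencode 'step=1h'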
Follow these steps to implement SLO monitoring for a service:
Choose the SLIs that matter for your service. Most services need at least availability and latency.
Verify the raw metrics exist in Prometheus:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=http_server_duration_seconds_count'
Add the recording rule groups from the Prometheus Recording Rules section to your Prometheus configuration. This pre-computes SLI values at all required time windows (5m, 30m, 1h, 6h, 1d, 3d, 30d).
Save the rules to a file (e.g., slo-rules.yml) and reference it in prometheus.yml:
rule_files:
- "slo-rules.yml"
Reload Prometheus to pick up the new rules:
curl -s -X POST "$PROMETHEUS_ENDPOINT/-/reload"
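Note: the /-/reload endpoint only works when Prometheus is started with --web.enable-lifecycle. If the reload returns an error, restart the service instead (assuming a docker-compose service named prometheus):
docker compose restart prometheus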
Verify the recording rules are loaded:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/rules" | python3 -m json.tool
Choose SLO targets based on your service requirements:
| Service Tier | Availability Target | Latency Target (fraction of requests within threshold) |
|---|---|---|
| Critical (user-facing) | 99.9% | 99.9% within 250ms |
| Standard (internal) | 99.5% | 99.5% within 500ms |
| Best-effort (batch) | 99.0% | 99.0% within 2s |
Add the burn rate alerting rules from the Prometheus Alerting Rules for Burn Rate section. Adjust the SLO target value in the expr field to match your chosen target.
Verify alerts are configured:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/rules" | python3 -m json.tool
Run the compliance report queries from the SLO Compliance Reporting section to verify everything is working:
# Current SLI
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=sli:http_availability:ratio_rate30d'
# Budget remaining
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=1 - ((1 - sli:http_availability:ratio_rate30d) / (1 - 0.999))'
# Burn rate
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=(1 - sli:http_availability:ratio_rate1h) / (1 - 0.999)'
# Active alerts
curl -s "$PROMETHEUS_ENDPOINT/api/v1/alerts"
Replace the local Prometheus endpoint and authentication with AWS SigV4 for all PromQL queries in this skill:
curl -s --aws-sigv4 "aws:amz:REGION:aps" \
--user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY" \
'https://aps-workspaces.REGION.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/query' \
--data-urlencode 'query=sli:http_availability:ratio_rate30d'
For AMP, the query endpoint is https://aps-workspaces.REGION.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/query, and basic auth is replaced by --aws-sigv4 "aws:amz:REGION:aps" together with --user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY".
Error budget query via AMP:
curl -s --aws-sigv4 "aws:amz:REGION:aps" \
--user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY" \
'https://aps-workspaces.REGION.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/query' \
--data-urlencode 'query=1 - ((1 - sli:http_availability:ratio_rate30d) / (1 - 0.999))'
Burn rate query via AMP:
curl -s --aws-sigv4 "aws:amz:REGION:aps" \
--user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY" \
'https://aps-workspaces.REGION.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/query' \
--data-urlencode 'query=(1 - sli:http_availability:ratio_rate1h) / (1 - 0.999)'
For Amazon Managed Prometheus, recording rules and alerting rules are managed via the AMP Rules Management API rather than local configuration files. Use awscurl or the AWS CLI to upload rule groups.
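A sketch using AWS CLI v2, where fileb:// passes the raw YAML bytes of the rule file (the workspace ID, namespace name, and region are placeholders):
aws amp create-rule-groups-namespace \
  --workspace-id WORKSPACE_ID \
  --name slo-rules \
  --data fileb://slo-rules.yml \
  --region REGION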