Investigates breaching Grafana SLOs using gcx CLI: root cause analysis, dimensional breakdowns, alert correlations, runbook access, error budget checks.
npx claudepluginhub grafana/gcx --plugin gcx
Deep-dive investigation of breaching SLOs: dimensional breakdown, alert correlation, runbook access. For experienced operators — no hand-holding.
- Use -o json for agent processing and the default format for user display; show graphs for time-series data.
- Use --from/--to for all time-range commands (never --start/--end).
Fetch the SLO definition:
gcx slo definitions get <UUID> -o json
Extract from the JSON response:
- .metadata.name: SLO name
- .spec.query.type: query type (ratio, freeform, or threshold)
- .spec.query.ratio.successMetric, .spec.query.ratio.totalMetric, .spec.query.ratio.groupByLabels[]
- .spec.query.freeform.query
- .spec.objectives[0].value: objective (0–1); .spec.objectives[0].window: window
- .spec.destinationDatasource.uid: Prometheus datasource UID
- .spec.alerting.fastBurn.annotations, .spec.alerting.slowBurn.annotations: runbook/dashboard URLs
- .metadata.annotations: additional runbook/dashboard references
If no UUID is given, list SLOs and ask which to investigate:
gcx slo definitions list
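The field extraction described above can be done in one pass with jq. This is a minimal sketch, assuming the JSON from gcx slo definitions get uses exactly the field paths listed above; the function name slo_summary and the output key names are hypothetical and not part of gcx.

```shell
# Hypothetical helper: reads one SLO definition (JSON) on stdin and prints
# only the fields used later in the investigation.
# Assumes the field paths listed above; missing fields come out as null.
slo_summary() {
  jq '{
    name:       .metadata.name,
    query_type: .spec.query.type,
    success:    .spec.query.ratio.successMetric,
    total:      .spec.query.ratio.totalMetric,
    group_by:   (.spec.query.ratio.groupByLabels // []),
    freeform:   .spec.query.freeform.query,
    objective:  .spec.objectives[0].value,
    window:     .spec.objectives[0].window,
    datasource: .spec.destinationDatasource.uid,
    runbooks:   ([.spec.alerting.fastBurn.annotations.runbook_url,
                  .spec.alerting.slowBurn.annotations.runbook_url]
                 | map(select(. != null)) | unique)
  }'
}

# Usage (not executed here):
#   gcx slo definitions get <UUID> -o json | slo_summary
```

Collapsing the definition into one small object keeps later steps from re-parsing the full response; the `// []` fallback means an SLO without groupByLabels still yields an empty list rather than null.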
Check the current status:
gcx slo definitions status <UUID> -o wide
This shows SLI, ERROR_BUDGET, BURN_RATE, SLI_1H, SLI_1D, and STATUS.
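For orientation, the burn rate and error budget columns can be related to the objective using the standard SRE definitions. The sketch below assumes those definitions; this document does not state how gcx computes its columns, so treat the results as a sanity check rather than a reimplementation. Both function names are hypothetical.

```shell
# Burn rate: observed error ratio divided by the error ratio the objective allows.
# 1.00 means the budget is being consumed exactly as fast as the objective permits.
# The objective must be below 1, otherwise the allowed error ratio is zero.
burn_rate() {             # usage: burn_rate <objective 0-1> <sli 0-1>
  awk -v o="$1" -v s="$2" 'BEGIN { printf "%.2f\n", (1 - s) / (1 - o) }'
}

# Error budget remaining, in percent, given the SLI measured over the full window.
# Negative values mean the budget is overspent.
budget_remaining_pct() {  # usage: budget_remaining_pct <objective 0-1> <window sli 0-1>
  awk -v o="$1" -v s="$2" 'BEGIN { printf "%.1f\n", 100 * (1 - (1 - s) / (1 - o)) }'
}
```

For example, with a 0.999 objective, a 1h SLI of 0.995 corresponds to a burn rate of 5.00, and a window SLI of 0.9995 leaves 50.0% of the budget.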
Early exit — OK status: If STATUS is OK, report health metrics and stop:
SLO: <name> — Status: OK
SLI: <value> | Error budget remaining: <budget>% | Burn rate: <rate>x
1h SLI: <sli_1h> | 1d SLI: <sli_1d>
No action needed.
Early exit — NODATA status: If STATUS is NODATA, branch to NODATA diagnosis:
SLO: <name> — Status: NODATA
Recording rule metrics unavailable. Likely causes:
- Destination datasource misconfigured (check .spec.destinationDatasource.uid)
- Grafana recording rules not yet evaluated (can take 1–2 minutes after creation)
- Prometheus federation/remote write issue
Check: gcx datasources list --type prometheus
Then verify the destination datasource UID matches what the SLO expects.
Lifecycle states: If status is Creating/Updating/Deleting/Error, report that the SLO is in a transient state and investigate the Grafana backend.
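The branching on STATUS described above can be sketched as a small dispatcher. The status values are the ones named in this document; the function name and the step labels it prints are hypothetical.

```shell
# Hypothetical dispatcher: maps the STATUS column to the next investigation step.
next_step() {   # usage: next_step <STATUS>
  case "$1" in
    OK)      echo "report-health-and-stop" ;;
    NODATA)  echo "diagnose-nodata" ;;
    Creating|Updating|Deleting|Error)
             echo "report-transient-state" ;;
    *)       echo "continue-investigation" ;;   # any other status, e.g. a breaching SLO
  esac
}
```

Only the fall-through case proceeds to the timeline, dimensional breakdown, and alert correlation steps.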
Render the SLI timeline for the last hour:
gcx slo definitions timeline <UUID> --from now-1h --to now
For wider trends:
gcx slo definitions timeline <UUID> --from now-24h --to now
Show the graph output (default). Use it to identify when breaching started and how severe it is.
Resolve the datasource UID. If .spec.destinationDatasource.uid is set, use it. Otherwise auto-discover:
gcx datasources list --type prometheus
For ratio queries — extract success/total metric selectors and groupByLabels, then query dimensional breakdown:
# Success rate by dimension (e.g., cluster, status_code, endpoint)
gcx metrics query <datasource-uid> \
'sum by (<groupByLabel>) (rate(<successMetric>[5m])) / sum by (<groupByLabel>) (rate(<totalMetric>[5m]))' \
--from now-1h --to now --step 1m
# Errors per second by dimension (total minus success) to spot the bad actor
gcx metrics query <datasource-uid> \
'sum by (<groupByLabel>) (rate(<totalMetric>[5m])) - sum by (<groupByLabel>) (rate(<successMetric>[5m]))' \
--from now-1h --to now --step 1m
If groupByLabels is empty, try common dimensions: cluster, namespace, service, status_code, endpoint.
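When several dimensions have to be tried, the ratio templates above can be generated rather than typed by hand. This is a sketch with hypothetical helper names; it only builds PromQL strings. Note that the subtraction query above yields errors per second, so the sketch also builds an error ratio (1 minus the success ratio), which is the form a percentage in the final report needs.

```shell
# Hypothetical helpers that fill the ratio query templates.
# Arguments: <label> <success selector> <total selector> [rate window, default 5m]
success_ratio_query() {
  local label="$1" success="$2" total="$3" win="${4:-5m}"
  printf 'sum by (%s) (rate(%s[%s])) / sum by (%s) (rate(%s[%s]))\n' \
    "$label" "$success" "$win" "$label" "$total" "$win"
}

# Error ratio (0-1) per dimension: 1 minus the success ratio.
error_ratio_query() {
  printf '1 - (%s)\n' "$(success_ratio_query "$@")"
}

# Usage (not executed here; DS_UID, SUCCESS and TOTAL are placeholders you set):
#   for label in cluster namespace service status_code endpoint; do
#     gcx metrics query "$DS_UID" "$(error_ratio_query "$label" "$SUCCESS" "$TOTAL")" \
#       --from now-1h --to now --step 1m
#   done
```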
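For freeform queries: take the raw PromQL expression and add by (<label>) grouping to its aggregation operators. The trailing form shown below is valid PromQL only when the expression ends in an aggregation such as sum(...); for a ratio of two aggregations, add the by clause to both sides: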
gcx metrics query <datasource-uid> \
'<freeform_expression> by (cluster)' \
--from now-1h --to now --step 1m
# Also try other likely breakdown dimensions
gcx metrics query <datasource-uid> \
'<freeform_expression> by (namespace)' \
--from now-1h --to now --step 1m
Use graph output to display dimensional trends visually. Use -o json to extract exact values for the report.
Find alert rules whose name matches the SLO:
gcx alert rules list -o json | jq '[.[] | .rules[]? | select(.name | test("<slo-name>"; "i"))]'
Also try matching on the slo_uuid label if the name-based search returns no results:
gcx alert rules list -o json | jq '[.[] | .rules[]? | select(.labels.slo_uuid == "<UUID>" or (.name | test("<slo-name>"; "i")))]'
Extract for each matching rule: name, state (firing/pending/inactive), labels, and annotations.
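The per-rule extraction can be sketched as a jq projection. It assumes the same output shape as the filters above (an array of groups, each with an optional .rules array); the function name matching_rules is hypothetical.

```shell
# Hypothetical helper: reads `gcx alert rules list -o json` output on stdin and
# prints one compact JSON object per rule whose name matches the given regex.
# The pattern is passed with --arg so SLO names containing quotes do not break the filter.
matching_rules() {   # usage: matching_rules <case-insensitive name regex>
  jq -c --arg pat "$1" '
    .[] | .rules[]?
    | select((.name // "") | test($pat; "i"))
    | {name, state, labels, annotations}'
}

# Usage (not executed here):
#   gcx alert rules list -o json | matching_rules "<slo-name>"
```

The `// ""` guard keeps a rule without a name from aborting the whole filter.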
Collect URLs from:
- .spec.alerting.fastBurn.annotations.runbook_url
- .spec.alerting.fastBurn.annotations.dashboard_url
- .spec.alerting.slowBurn.annotations.runbook_url
- .spec.alerting.slowBurn.annotations.dashboard_url
- .metadata.annotations.*
If a GitHub URL is found in runbook annotations and gh is available:
# Convert GitHub web URL to API path and fetch content
gh api /repos/<owner>/<repo>/contents/<path> --jq '.content' | base64 --decode
For raw GitHub URLs (raw.githubusercontent.com), extract the owner, repo, ref, and path from the URL and call gh api with the equivalent contents endpoint.
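The URL conversion can be sketched in plain shell. The helper name is hypothetical, and the sketch assumes the ref (branch, tag, or commit) is a single path segment; branch names containing "/" and raw URLs of the refs/heads/<branch> form are ambiguous and not handled. The ?ref= query parameter is part of the GitHub contents API and matters when the runbook does not live on the default branch.

```shell
# Hypothetical helper: converts a GitHub file URL into a contents-API path for `gh api`.
# Handles https://github.com/<owner>/<repo>/blob/<ref>/<path>
# and     https://raw.githubusercontent.com/<owner>/<repo>/<ref>/<path>.
# Returns non-zero for any other URL.
github_contents_path() {
  local url="${1%%[?#]*}" rest owner repo ref path   # drop query string and fragment
  case "$url" in
    https://github.com/*/blob/*)
      rest="${url#https://github.com/}"
      owner="${rest%%/*}"; rest="${rest#*/}"
      repo="${rest%%/*}";  rest="${rest#*/blob/}"
      ;;
    https://raw.githubusercontent.com/*)
      rest="${url#https://raw.githubusercontent.com/}"
      owner="${rest%%/*}"; rest="${rest#*/}"
      repo="${rest%%/*}";  rest="${rest#*/}"
      ;;
    *) return 1 ;;
  esac
  ref="${rest%%/*}"; path="${rest#*/}"
  printf '/repos/%s/%s/contents/%s?ref=%s\n' "$owner" "$repo" "$path" "$ref"
}

# Usage (not executed here):
#   gh api "$(github_contents_path "$RUNBOOK_URL")" --jq '.content' | base64 --decode
```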
After completing the investigation, present results in this structure:
SLO: <name>
Target: <objective>% over <window> | Status: BREACHING
SLI: <current>% | Error budget remaining: <budget>% | Burn rate: <rate>x
1h SLI: <sli_1h>% | 1d SLI: <sli_1d>%
[Timeline graph — show default output]
Dimensional Breakdown:
Worst dimension: <label>=<value> at <error_rate>% error rate
[Additional dimensions ranked by error rate]
Related Alert Rules:
- <rule_name>: <state> [labels: <key>=<value>]
Runbook: <url>
Dashboard: <url>
[If runbook fetched]: Key runbook steps:
<relevant excerpt>
Next actions:
1. <most specific actionable step based on findings>
2. <follow-up investigation or escalation path>
3. <if budget near zero: suggest slo-optimize for objective review>
- SLO not found: run gcx slo definitions list and confirm the UUID.
- Datasource not found: run gcx datasources list --type prometheus to find the correct UID.
- Query returns no data: check the datasource UID (.spec.destinationDatasource.uid). Try both the destination datasource and the default Prometheus datasource.
- No matching alert rules: run gcx alert rules list -o json | jq length to confirm total count.
- Runbook fetch fails: if gh is not authenticated or unavailable, report the runbook URL directly and skip content fetching.
- Empty dimensional breakdown: try cluster, namespace, service, endpoint, status_code. Report which ones return data.