Investigates Grafana alerts using the gcx CLI to check states, query datasources like Prometheus, and determine firing causes, scope, and impact. For diagnosing specific firing or pending alerts.
npx claudepluginhub grafana/gcx --plugin gcx
This skill uses the workspace's default tool permissions.
Other skills in the gcx plugin:
- Debugs application issues using Grafana observability data (Prometheus metrics and Loki logs) via a 7-step gcx workflow. For errors, latency spikes, HTTP 500s, and service degradation.
- Queries and manages Grafana dashboards, alert rules, and data sources via the HTTP API. Useful for viewing dashboards, troubleshooting alerts, checking metrics, or on mentions of Grafana, monitoring, or observability.
- Configures Grafana Alerting, IRM, and SLOs: Grafana-managed/Prometheus/Loki alert rules, notification policies, Slack/PagerDuty/email contacts, silences, on-call rotations, incident workflows, and YAML/API provisioning.
Investigate Grafana alerts by analyzing state, querying datasources, and identifying next steps. Be concise and direct - these are experienced operators who need actionable information, not hand-holding.
The user needs gcx installed, with a configured context and appropriate permissions. If gcx is not configured, use the setup-gcx skill first.
Check context if needed (gcx config view). If multiple contexts exist and none specified, ask which to use.
Fetch the alert by listing all alerts and filtering by name. Replace <AlertName> with the actual alert name:
gcx alert rules list -o json | jq -r '.[] | .rules[]? | select(.name == "<AlertName>")'
Server-side filters (use instead of downloading all rules and filtering with jq):
--state firing|pending|inactive — filter by rule state
--group <name> — filter by group name
--folder <uid> — filter by folder UID
Filter by name, state, and cluster/environment as relevant. If there are multiple matches, list them and ask which to investigate. Inform the user which context you're using.
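For example, to narrow server-side to firing rules before picking one out by name (a sketch assuming the same JSON shape as above; values in angle brackets are placeholders):
gcx alert rules list --state firing -o json | jq -r '.[] | .rules[]? | select(.name == "<AlertName>")'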
Check the type field:
type: recording: This is a recording rule, not an alerting rule. Report: "This is a recording rule (pre-calculates metrics), not an alerting rule. It doesn't fire alerts. Current state: [state]. Want details on what it's recording?" Stop here unless they ask for more.
Check the state field:
state: inactive AND the alert's query looks healthy: Report: "Alert is inactive. [Brief what it monitors]. Health: [health]. Last evaluated: [time]. Want to see historical trends?" Stop here unless they ask for more.
state: firing or state: pending: Continue with full investigation below.
You should use the datasourceUID from the alert when you can.
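A quick way to pull the type, state, and health fields from the JSON fetched earlier (a sketch; field names mirror the ones referenced above and may differ in your output):
gcx alert rules list -o json | jq -r '.[] | .rules[]? | select(.name == "<AlertName>") | {name, type, state, health}'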
If you need to query a different datasource (e.g., Loki for log correlation), resolve its UID first:
gcx datasources list --type loki
Annotation URLs often reference datasources by name — always resolve to UID before querying.
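A sketch for grabbing the UID directly (assumes gcx datasources list supports -o json like the other commands and returns objects with name/uid fields; <DatasourceName> is a placeholder):
gcx datasources list --type loki -o json | jq -r '.[] | select(.name == "<DatasourceName>") | .uid'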
Query the datasource. Use -o json to get the data for yourself; use -o graph to show a summary to the user:
# Prometheus
gcx metrics query <datasource-uid> '<query>' --from now-1h --to now --step 1m -o json
gcx metrics query <datasource-uid> '<query>' --from now-1h --to now --step 1m -o graph
# Loki
gcx logs query <datasource-uid> '<query>' --from now-1h --to now -o json
gcx logs query <datasource-uid> '<query>' --from now-1h --to now -o graph
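For instance, a hypothetical error-rate check for the affected service (metric and label names are placeholders; adjust to what the alert's own query uses):
gcx metrics query <datasource-uid> 'sum(rate(http_requests_total{job="<job>", status=~"5.."}[5m]))' --from now-1h --to now --step 1m -o graph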
Analyze the results: What's the current value? Spike or gradual? When did it start?
Extract runbook and dashboard links from the alert's annotations, then provide a concise analysis. If the runbook is hosted on GitHub and gh is available, fetch it with gh api.
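A sketch of that fetch, returning the file as raw text (owner, repo, and path are placeholders taken from the runbook URL):
gh api -H "Accept: application/vnd.github.raw+json" repos/<owner>/<repo>/contents/<path-to-runbook>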
Based on the error class, suggest follow-up queries to the user, for example target availability (up{job="..."}) and pod restart counts (sketched just below).
Recommend incident creation if there's customer impact.
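A hedged sketch of those two follow-ups (the restart metric assumes kube-state-metrics is being scraped; names in angle brackets are placeholders):
# Target availability for the affected job
gcx metrics query <datasource-uid> 'up{job="<job>"}' --from now-1h --to now --step 1m -o graph
# Pod restarts over the last hour (assumes kube-state-metrics)
gcx metrics query <datasource-uid> 'sum by (pod) (increase(kube_pod_container_status_restarts_total{namespace="<namespace>"}[1h]))' --from now-1h --to now --step 5m -o json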
List specific next actions: queries to run, deployments to check, metrics to examine. If there are log or metric queries you can run yourself, ask the user whether they want you to run them. If infrastructure changes are a suspected cause, offer to investigate any infra-as-code repos the user points you to.
If the next suggested actions include looking at logs in any way, use gcx to do it.
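For example, a hypothetical log check scoped to the affected workload (stream label names depend on how your Loki streams are labelled):
gcx logs query <loki-uid> '{namespace="<namespace>", app="<app>"} |= "error"' --from now-1h --to now -o graph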
For recording rules or healthy inactive alerts (early exit):
This is a [recording rule / inactive alert]. [One sentence what it monitors]. State: [state]. Health: [health].
Want to see more details?
For firing/pending alerts (full investigation):
Alert: <name>
State: <firing or pending> [in <cluster/env>]
Monitors: <brief what it checks>
[Show graph visualization]
Current value: <value>
Trend: <spike/gradual/sustained>
Likely causes:
- <cause 1>
- <cause 2>
Impact: <who/what affected>
Runbook: <link>
Dashboard: <link>
Next actions:
- <action 1>
- <action 2>
[If customer impact:] Recommend creating an incident - <why>.
Use minimal formatting. Avoid excessive bold text. No timelines like "within 24 hours". Trust the user to prioritize.
For alert JSON structure, query patterns by alert type, graph interpretation, and runbook fetching, see: