Diagnoses Grafana Cloud Synthetic Monitoring check failures by triaging probe data, classifying failure scope, running per-probe breakdowns, and identifying root causes using gcx CLI.
Install: npx claudepluginhub grafana/gcx --plugin gcx
Investigate Synthetic Monitoring check failures by triaging probe data, classifying failure scope, and identifying root cause. Experienced operators need actionable diagnosis, not hand-holding.
Use -o json for agent processing and the default format for user-facing display.

Prerequisite: gcx configured with an active context and appropriate permissions.
gcx synth checks status <ID>
If the user provided a name instead of ID, list first:
gcx synth checks list -o json | jq -r '.[] | select(.job | test("<name>"; "i")) | [.id, .job, .target, .type] | @tsv'
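The name filter can be exercised offline. This sketch runs the same jq expression over a hypothetical sample of `gcx synth checks list -o json` output (the field names match those used above; the sample values are invented):

```shell
# Illustration only: invented sample of `gcx synth checks list -o json` output.
cat > /tmp/checks.json <<'EOF'
[
  {"id": 42, "job": "api-prod", "target": "https://api.example.com/health", "type": "http"},
  {"id": 43, "job": "web-staging", "target": "https://staging.example.com", "type": "http"}
]
EOF

# Case-insensitive match on the job name; emit id, job, target, type as TSV.
jq -r '.[] | select(.job | test("api"; "i")) | [.id, .job, .target, .type] | @tsv' /tmp/checks.json
```

The `test("<name>"; "i")` form matches substrings case-insensitively, so a partial job name is enough to recover the check ID.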
Early exit — OK: Check success rate >= 50% across all probes.
Report: "Check <job> is healthy. Success rate: <rate>%. <probe_count> probes up."
Stop unless the user asks for more.
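The OK early-exit decision above reduces to averaging per-probe success rates and comparing against the 0.50 threshold. A minimal sketch with invented rates:

```shell
# Hypothetical per-probe success rates (fraction of successful runs in the window).
rates="0.98 0.95 0.12"

# Mean across probes; >= 0.50 means report OK and stop.
avg=$(printf '%s\n' $rates | awk '{ s += $1; n++ } END { printf "%.2f", s / n }')
awk -v a="$avg" 'BEGIN { print (a >= 0.50) ? "OK - early exit" : "FAILING - keep investigating" }'
```

Note one probe at 0.12 still averages out to 0.68 here, which is why the early exit checks the aggregate rate, not each probe individually.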
Early exit — NODATA: No Prometheus metrics available.
Verify enabled: true (gcx synth checks get <ID> -o json | jq .spec.enabled) and report using the NODATA output format.

Otherwise, fetch the full check config:

gcx synth checks get <ID> -o json
Extract: job name, target, check type (http/ping/dns/tcp/traceroute), probe list, frequency, timeout, alertSensitivity, enabled flag.
For HTTP checks also note: any assertion settings, TLS config, expected status codes.
gcx synth checks timeline <ID> --from now-1h --to now
Show the graph output to the user. Then analyze the pattern:
| Pattern | Classification |
|---|---|
| All probes at 0 (or near 0) | Target down |
| Subset of probes at 0, others healthy | Regional / network |
| Intermittent drops across multiple probes | Flapping / timeout |
| All probes drop at a specific point in time | Sudden onset — possible deployment or config change |
| Gradual decline | Degradation — timeout drift or resource exhaustion |
Use a longer window if the failure started more than 1h ago:
gcx synth checks timeline <ID> --from now-6h --to now
Get the probe list for geographic mapping:
gcx synth probes list -o json
Cross-reference probe IDs from the check config against probe regions. Map failing probes to their regions.
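The cross-reference is a simple join on probe ID. This sketch assumes a trimmed probe-inventory shape and an invented list of failing IDs pulled from the check config; adapt the field names to what `gcx synth probes list -o json` actually returns:

```shell
# Assumed shape for illustration: a trimmed probe inventory.
cat > /tmp/probes.json <<'EOF'
[
  {"id": 1, "name": "Amsterdam", "region": "EMEA"},
  {"id": 2, "name": "Dallas",    "region": "AMER"},
  {"id": 3, "name": "Singapore", "region": "APAC"}
]
EOF

# Join failing probe IDs (from the check config) against the inventory.
jq -r --argjson failing '[1, 3]' \
  '.[] | select(.id as $i | $failing | index($i)) | "\(.name) (\(.region))"' /tmp/probes.json
```

The output ("Amsterdam (EMEA)", "Singapore (APAC)") feeds directly into the affected-probes section of the report.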
All probes failing: Target/service issue — likely target down, SSL error, or DNS failure.
Subset of probes failing: Regional or network issue. Note which regions are affected:
Intermittent failures: Flapping. Consider: rate limiting, timeout too tight, flaky connectivity.
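The three classification rules above can be approximated mechanically once per-probe rates are in hand. A sketch over invented probe rates, treating anything under 0.05 as hard-down:

```shell
# Hypothetical "probe rate" pairs; 0.05 is an assumed hard-down threshold.
cat > /tmp/probe_rates.txt <<'EOF'
Amsterdam 0.00
Frankfurt 0.02
Dallas 0.99
EOF

awk '{ n++; if ($2 < 0.05) down++ }
     END {
       if (down == n)     print "Target down"
       else if (down > 0) print "Regional / network"
       else               print "Healthy or intermittent - inspect the timeline"
     }' /tmp/probe_rates.txt
```

This only separates the first two cases; intermittent/flapping still needs the timeline shape, since a probe at 0.50 could be half-down or evenly flapping.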
Resolve datasource UID if not already known:
gcx datasources list --type prometheus
Run per-probe success rate to pinpoint failing probes:
gcx metrics query <datasource-uid> \
'avg by (probe) (probe_success{job="<job>",instance="<target>"})' \
--from now-1h --to now --step 1m -o json
Show as graph for the user:
gcx metrics query <datasource-uid> \
'avg by (probe) (probe_success{job="<job>",instance="<target>"})' \
--from now-1h --to now --step 1m -o graph
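The -o json variant can be post-processed to rank probes numerically. The sample below uses the standard Prometheus range-query response shape; gcx's -o json output may differ, so treat the jq paths as an assumption to adapt:

```shell
# Assumed Prometheus range-query shape; gcx -o json may differ.
cat > /tmp/result.json <<'EOF'
{"data": {"result": [
  {"metric": {"probe": "Amsterdam"}, "values": [[0, "1"], [60, "0"], [120, "0"]]},
  {"metric": {"probe": "Dallas"},    "values": [[0, "1"], [60, "1"], [120, "1"]]}
]}}
EOF

# Mean success per probe as a percentage; anything well under 100% is failing.
jq -r '.data.result[] |
  "\(.metric.probe) \([.values[][1] | tonumber] | add / length * 100 | floor)%"' /tmp/result.json
```

With the sample data this prints "Amsterdam 33%" and "Dallas 100%", which maps straight onto the affected-probes list in the report.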
For HTTP checks, also run HTTP phase latency to locate where time is spent:
gcx metrics query <datasource-uid> \
'avg by (phase) (probe_http_duration_seconds{job="<job>",instance="<target>"})' \
--from now-1h --to now --step 1m -o graph
For SSL/TLS failures or near-expiry concerns:
gcx metrics query <datasource-uid> \
'(probe_ssl_earliest_cert_expiry{job="<job>",instance="<target>"} - time()) / 86400' \
--from now-1h --to now --step 5m -o json
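The query divides seconds-until-expiry by 86400 to get days. The same arithmetic locally, with invented timestamps, shows what the resulting values mean:

```shell
# Same arithmetic as the PromQL above, with hypothetical timestamps.
expiry=1767225600   # assumed cert notAfter (unix seconds)
now=1764633600      # assumed evaluation time
awk -v e="$expiry" -v t="$now" 'BEGIN { printf "%.1f days until expiry\n", (e - t) / 86400 }'
```

A negative result means the certificate has already expired, which on its own explains TLS-phase failures across all probes.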
See references/sm-promql-patterns.md for full PromQL pattern library.
Cross-reference signals against references/failure-modes.md to select the most likely failure mode:
For example, latency concentrated in the connect or tls phase → Timeout.

Synthesize findings into an actionable report (see Output Format below).
Next actions depend on failure mode:
If deeper investigation is needed (e.g., logs, infra repos), ask the user if they want to proceed.
If check config needs changes (probe selection, frequency, assertions), route to synth-manage-checks.
Early exit (OK):
Check: <job> (<target>)
Status: OK
Success rate: <rate>%
Probes up: <count>/<total>
Early exit (NODATA):
Check: <job> (<target>)
Status: NODATA
Enabled: <yes/no>
Next: <datasource check / re-enable instruction>
Full investigation:
Check: <job> (<target>)
Type: <http|ping|dns|tcp|traceroute>
Status: FAILING
Success rate: <rate>% (window: <from> – <to>)
[Timeline graph]
Failure classification: <Target down | Regional/CDN | SSL/TLS | DNS | Timeout | Content/assertion | Private probe infra | Rate limiting>
Affected probes: <count>/<total>
- <probe-name> (<region>): failing since <time>
- <probe-name> (<region>): intermittent
Onset: <time/duration or "unknown">
Diagnosis:
<2-4 sentences describing what the data shows and the most likely cause>
Next actions:
1. <action>
2. <action>
3. <action>
Use minimal formatting. Avoid excessive bold text. Trust the user to prioritize.
Troubleshooting:
- gcx synth checks status returns no rows: the check ID may be wrong; list all checks and confirm.
- gcx synth probes list fails: skip geographic mapping; classify probes by name where possible.
- gcx metrics query fails with a datasource error: note it, skip the PromQL steps, and classify using timeline data only.
- No data in the default window: widen to --from now-6h --to now before concluding NODATA.