From elastic-agent-skills
Assesses APM service health using Elastic SLOs, alerts, ML anomalies, and ES|QL on APM/OTel indices for throughput, latency, error rate, and dependencies. Useful for service status and performance checks.
`npx claudepluginhub elastic/agent-skills --plugin elastic-cloud`

This skill uses the workspace's default tool permissions.
Assess APM service health using [Observability APIs](https://www.elastic.co/docs/solutions/observability/apis), ES|QL against APM indices, Elasticsearch APIs, and (for correlation and APM-specific logic) the Kibana repo. Use SLOs, firing alerts, ML anomalies, throughput, latency (avg/p95/p99), error rate, and dependency health.
Key data sources and techniques:
- Query traces*apm*,traces*otel* and metrics*apm*,metrics*otel* with ES|QL (see Using ES|QL for APM metrics) for throughput, latency, error rate, and dependency-style aggregations. Use Elasticsearch APIs (e.g. POST _query for ES|QL, or Query DSL) as documented in the Elasticsearch repo for indices and search.
- For subpopulations of slow or failing transactions, run correlations against traces*apm*,traces*otel*. See APM Correlations script.
- Use resource attributes (e.g. k8s.pod.name, container.id, host.name) in traces; query infrastructure or metrics indices with ES|QL/Elasticsearch for CPU and memory. OOM and CPU throttling directly impact APM health.
- Filter logs by service.name or trace.id to explain behavior and root cause.

Synthesize health from all of the following when available:
| Signal | What to check |
|---|---|
| SLOs | Burn rate, status (healthy/degrading/violated), error budget. |
| Firing alerts | Open or recently fired alerts for the service or dependencies. |
| ML anomalies | Anomaly jobs; score and severity for latency, throughput, or error rate. |
| Throughput | Request rate; compare to baseline or previous period. |
| Latency | Avg, p95, p99; compare to SLO targets or history. |
| Error rate | Failed/total requests; spikes or sustained elevation. |
| Dependency health | Downstream latency, error rate, availability (ES|QL, APIs, Kibana repo). |
| Infrastructure | CPU usage, memory; OOM and CPU throttling on pods/containers/hosts. |
| Logs | App logs filtered by service or trace ID for context and root cause. |
Treat a service as unhealthy if SLOs are violated, critical alerts are firing, or ML anomalies indicate severe degradation. Correlate with infrastructure (OOM, CPU throttling), dependencies, and logs (service/trace context) to explain why and suggest next steps.
When querying APM data from Elasticsearch (traces*apm*,traces*otel*, metrics*apm*,metrics*otel*), use ES|QL by default where available. Always filter by service.name (and service.environment when relevant), combined with a time range on @timestamp:

WHERE service.name == "my-service-name" AND service.environment == "production"
  AND @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
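As a minimal sketch of how that scoping sits inside a complete query (service name, environment, and window are placeholders reused from above):

```esql
FROM traces*apm*,traces*otel*
| WHERE service.name == "my-service-name" AND service.environment == "production"
    AND @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
| STATS request_count = COUNT(*)
| LIMIT 10
```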
For the exact field names and chart logic Kibana itself uses, see trace_charts_definition.ts (getThroughputChart, getLatencyChart, getErrorRateChart). Use from(index) → where(...) → stats(...) / evaluate(...) with BUCKET(@timestamp, ...) and WHERE service.name == "<service_name>". Add LIMIT n to cap rows and token usage. Prefer coarser BUCKET(@timestamp, ...) (e.g. 1 hour) when only trends are needed; finer buckets increase work and result size.

When only a subpopulation of transactions has high latency or failures, run the apm-correlations script to list
attributes that correlate with those transactions (e.g. host, service version, pod, region). The script tries the Kibana
internal APM correlations API first; if unavailable (e.g. 404), it falls back to Elasticsearch significant_terms on
traces*apm*,traces*otel*.
# Latency correlations (attributes over-represented in slow transactions)
node skills/observability/service-health/scripts/apm-correlations.js latency-correlations --service-name <name> [--start <iso>] [--end <iso>] [--last-minutes 60] [--transaction-type <t>] [--transaction-name <n>] [--space <id>] [--json]
# Failed transaction correlations
node skills/observability/service-health/scripts/apm-correlations.js failed-correlations --service-name <name> [--start <iso>] [--end <iso>] [--last-minutes 60] [--transaction-type <t>] [--transaction-name <n>] [--space <id>] [--json]
# Test Kibana connection
node skills/observability/service-health/scripts/apm-correlations.js test [--space <id>]
Environment: KIBANA_URL and KIBANA_API_KEY (or KIBANA_USERNAME/KIBANA_PASSWORD) for Kibana; for fallback,
ELASTICSEARCH_URL and ELASTICSEARCH_API_KEY. Use the same time range as the investigation.
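A minimal invocation sketch, assuming API-key auth and placeholder hostnames (adjust to your deployment):

```bash
export KIBANA_URL="https://my-kibana.example.com"
export KIBANA_API_KEY="<kibana-api-key>"
# Only needed for the Elasticsearch significant_terms fallback
export ELASTICSEARCH_URL="https://my-elasticsearch.example.com"
export ELASTICSEARCH_API_KEY="<elasticsearch-api-key>"

# Verify connectivity, then run correlations for the investigation window
node skills/observability/service-health/scripts/apm-correlations.js test
node skills/observability/service-health/scripts/apm-correlations.js latency-correlations \
  --service-name my-service --last-minutes 60 --json
```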
Service health progress:
- [ ] Step 1: Identify the service (and time range)
- [ ] Step 2: Check SLOs and firing alerts
- [ ] Step 3: Check ML anomalies (if configured)
- [ ] Step 4: Review throughput, latency (avg/p95/p99), error rate
- [ ] Step 5: Assess dependency health (ES|QL/APIs / Kibana repo)
- [ ] Step 6: Correlate with infrastructure and logs
- [ ] Step 7: Summarize health and recommend actions
Confirm service name and time range. Resolve the service from the request; if multiple are in scope, target the most
relevant. Use ES|QL on traces*apm*,traces*otel* or metrics*apm*,metrics*otel* (e.g.
WHERE service.name == "<name>") or Kibana repo APM routes to obtain service-level data. If the user has not provided
the time range, assume last hour.
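If the service name is ambiguous, a quick way to list candidate services is an ES|QL aggregation over the same indices (a sketch; the one-hour window matches the default above):

```esql
FROM traces*apm*,traces*otel*
| WHERE @timestamp >= NOW() - 1 hour
| STATS docs = COUNT(*) BY service.name
| SORT docs DESC
| LIMIT 50
```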
SLOs: Call the SLOs API to get SLO definitions and status for the service (latency, availability),
healthy/degrading/violated, burn rate, error budget. Alerts: For active APM alerts, call
/api/alerting/rules/_find?search=apm&search_fields=tags&per_page=100&filter=alert.attributes.executionStatus.status:active.
When checking one service, include both rules where params.serviceName matches the service and rules where
params.serviceName is absent (all-services rules). Do not query .alerts* indices for active-state checks. Correlate
with SLO violations or metric changes.
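Both checks can be single Kibana API calls; a sketch assuming API-key auth (the SLO path follows Kibana's SLO find API and should be verified for your version; the alerting call is the one given above):

```bash
# SLO definitions and status (filter the response to the target service)
curl -s -H "Authorization: ApiKey $KIBANA_API_KEY" "$KIBANA_URL/api/observability/slos"

# Active APM alerting rules
curl -s -H "Authorization: ApiKey $KIBANA_API_KEY" \
  "$KIBANA_URL/api/alerting/rules/_find?search=apm&search_fields=tags&per_page=100&filter=alert.attributes.executionStatus.status:active"
```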
If ML anomaly detection is used, query ML job results or anomaly records (via Elasticsearch ML APIs or indices) for the service and time range. Note high-severity anomalies (latency, throughput, error rate); use anomaly time windows to narrow Steps 4–5.
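A sketch of pulling anomaly records directly from the Elasticsearch ML results API (the job ID is an assumption; use whatever APM anomaly jobs exist in your cluster, and reuse the investigation time range):

```bash
curl -s -H "Authorization: ApiKey $ELASTICSEARCH_API_KEY" \
  "$ELASTICSEARCH_URL/_ml/anomaly_detectors/<job_id>/results/records?start=2025-03-01T00:00:00Z&end=2025-03-01T23:59:59Z&record_score=75"
```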
Use ES|QL against traces*apm*,traces*otel* or metrics*apm*,metrics*otel* for the service and time range to get
throughput (e.g. req/min), latency (avg, p95, p99), error rate (failed/total or 5xx/total). Example:
FROM traces*apm*,traces*otel* | WHERE service.name == "<service_name>" AND @timestamp >= ... AND @timestamp <= ... | STATS ....
Compare to prior period or SLO targets. See Using ES|QL for APM metrics.
Obtain dependency and service-map data via ES|QL on traces*apm*,traces*otel*/metrics*apm*,metrics*otel* (e.g.
downstream service/span aggregations) or via APM route handlers in the Kibana repo that expose
dependency/service-map data. For the service and time range, note downstream latency and error rate; flag slow or
failing dependencies as likely causes.
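A hedged ES|QL sketch for the dependency view, assuming the classic APM span fields span.destination.service.resource, span.duration.us, and event.outcome (OTel-mapped data may use different names):

```esql
FROM traces*apm*,traces*otel*
| WHERE service.name == "<service_name>"
    AND @timestamp >= "<start>" AND @timestamp <= "<end>"
    AND span.destination.service.resource IS NOT NULL
| STATS calls = COUNT(*),
        failures = COUNT(*) WHERE event.outcome == "failure",
        avg_duration_us = AVG(span.duration.us)
    BY span.destination.service.resource
| EVAL error_rate = TO_DOUBLE(failures) / calls
| SORT calls DESC
| LIMIT 20
```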
If only a subpopulation of transactions is slow or failing, run

node skills/observability/service-health/scripts/apm-correlations.js latency-correlations|failed-correlations --service-name <name> [--start ...] [--end ...]

to get correlated attributes. Filter by those attributes and fetch trace samples or errors to confirm root cause. See APM Correlations script.

For infrastructure, use resource attributes (e.g. k8s.pod.name, container.id, host.name) and query infrastructure/metrics indices with ES|QL or Elasticsearch for CPU and memory. OOM and CPU throttling directly impact APM health; correlate their time windows with APM degradation. Filter logs by service.name == "<service_name>" or trace.id == "<trace_id>" to explain behavior and root cause (exceptions, timeouts, restarts); see the log-filtering sketch after the final step below.

Finally, state health (healthy / degraded / unhealthy) with reasons and list concrete next steps.
Scope with WHERE service.name == "<service_name>" and a time range. Throughput and error rate (1-hour buckets; LIMIT caps rows and tokens):

FROM traces*apm*,traces*otel*
| WHERE service.name == "api-gateway"
    AND @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
| STATS request_count = COUNT(*), failures = COUNT(*) WHERE event.outcome == "failure" BY bucket = BUCKET(@timestamp, 1 hour)
| EVAL error_rate = TO_DOUBLE(failures) / request_count
| SORT bucket
| LIMIT 500
Latency percentiles and exact field names: see Kibana trace_charts_definition.ts.
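As a starting point, a hedged latency sketch assuming the classic APM transaction fields transaction.duration.us and processor.event (confirm exact field names against trace_charts_definition.ts):

```esql
FROM traces*apm*,traces*otel*
| WHERE service.name == "<service_name>"
    AND @timestamp >= "<start>" AND @timestamp <= "<end>"
    AND processor.event == "transaction"
| STATS avg_us = AVG(transaction.duration.us),
        p95_us = PERCENTILE(transaction.duration.us, 95),
        p99_us = PERCENTILE(transaction.duration.us, 99)
    BY bucket = BUCKET(@timestamp, 1 hour)
| SORT bucket
| LIMIT 500
```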
Use ES|QL on traces*apm*,traces*otel*/metrics*apm*,metrics*otel* for throughput, latency, and error rate; query dependency/service-map data (ES|QL or Kibana repo).

Use resource attributes on spans/traces to get the runtimes (pods, containers, hosts) for the service, then check CPU and memory for those resources in the same time window as the APM issue:
- Identify the resources via k8s.pod.name, k8s.namespace.name, container.id, or host.name.
- Check CPU and memory metrics (e.g. system.cpu.total.norm.pct); look for OOMKilled events, CPU throttling, or sustained high CPU/memory that align with APM latency or error spikes.
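A hedged sketch of the CPU check, assuming system metrics land in metrics-* with the field named above (memory and OOM signals vary by collector, so adapt accordingly):

```esql
FROM metrics-*
| WHERE k8s.pod.name == "<pod_name>"
    AND @timestamp >= "<start>" AND @timestamp <= "<end>"
    AND system.cpu.total.norm.pct IS NOT NULL
| STATS avg_cpu = AVG(system.cpu.total.norm.pct),
        max_cpu = MAX(system.cpu.total.norm.pct)
    BY bucket = BUCKET(@timestamp, 10 minutes)
| SORT bucket
| LIMIT 200
```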
service.name == "<service_name>" and time
range to get application logs (errors, warnings, restarts) in the service context.trace.id from the APM trace and filter logs by
trace.id == "<trace_id>" (or equivalent field in your log schema). Logs with that trace ID show the full request
path and help explain failures or latency.traces*apm*,traces*otel*/metrics*apm*,metrics*otel* (8.11+ or Serverless), filtering by service.name (and
service.environment when relevant). For active APM alerts, call
/api/alerting/rules/_find?search=apm&search_fields=tags&per_page=100&filter=alert.attributes.executionStatus.status:active.
When checking one service, evaluate both rule types: rules where params.serviceName matches the target service, and
rules where params.serviceName is absent (all-services rules). Treat either as applicable to the service before
declaring health. Do not query .alerts* indices when determining currently active alerts; use the Alerting API
response above as the source of truth. For APM correlations, run the apm-correlations script (see
APM Correlations script); for dependency/service-map data, use ES|QL or Kibana repo route
handlers. For Elasticsearch index and search behavior, see the Elasticsearch APIs in the Elasticsearch repo.