From elastic-agent-skills
Queries Elastic traces, metrics, and logs to monitor LLM/agentic performance, token/costs, quality, and orchestration. For LLM monitoring, GenAI observability, AI cost/quality.
Install with `npx claudepluginhub elastic/agent-skills --plugin elastic-cloud`. This skill uses the workspace's default tool permissions.
Answer user questions about monitoring LLMs and agentic components using data ingested into Elastic only. Focus on LLM performance, cost and token utilization, response quality, and call chaining or agentic workflow orchestration. Use ES|QL, Elasticsearch APIs, and (where needed) Kibana APIs. Do not rely on the Kibana UI; the skill works without it. A given deployment typically uses one or more ingestion paths (APM/OTLP traces and/or integration metrics/logs); discover what is available before querying.
Trace data lands in traces-apm* when collected by the Elastic APM Agent, and in traces-generic.otel-default (and similar) when collected by OpenTelemetry. Use the generic pattern traces* to find all trace data regardless of source. When the application is instrumented with OpenTelemetry (e.g. Elastic Distributions of OpenTelemetry (EDOT), OpenLLMetry, OpenLIT, or Langtrace exporting to OTLP), LLM and agent spans land in these trace data streams; metrics may land in metrics-apm* or metrics-generic. Query traces* and metrics* data streams for per-request and aggregated LLM signals. Integration data arrives in its own data streams (metrics*, logs* with dataset/namespace per integration). Check which data streams exist (GET _data_stream, or GET traces*/_mapping, GET metrics*/_mapping) and optionally sample a document to see which LLM-related fields are present. Do not assume both APM and integration data exist; LLM data may live only in traces* or metrics data streams. Also check alerting rules and SLOs defined on that data (traces*, or integration metrics): firing alerts or violated/degrading SLOs point to potential degraded performance.
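A minimal discovery sketch using the data stream and field-mapping APIs; the span.attributes.gen_ai.* path is an assumption to verify against your own mapping:

```
GET _data_stream/traces*
GET _data_stream/metrics-*

GET traces*/_mapping/field/span.attributes.gen_ai.*
```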
Spans from OTel/EDOT (and compatible SDKs) carry span attributes that may follow OpenTelemetry GenAI semantic conventions or provider-specific names. In Elasticsearch, attributes typically appear under span.attributes (exact key names depend on ingestion). Common attributes:
| Purpose | Example attribute names (OTel GenAI) |
|---|---|
| Operation / provider | gen_ai.operation.name, gen_ai.provider.name |
| Model | gen_ai.request.model, gen_ai.response.model |
| Token usage | gen_ai.usage.input_tokens, gen_ai.usage.output_tokens |
| Request config | gen_ai.request.temperature, gen_ai.request.max_tokens |
| Errors | error.type |
| Conversation / agent | gen_ai.conversation.id; tool/agent spans as child spans |
Cost is not in the OTel spec; some instrumentations add custom attributes (e.g. llm.response.cost.usd_estimate).
Discover actual field names from the index mapping or a sample document (e.g. span.attributes.* or flattened keys).
Use duration and event.outcome on spans for latency and success/failure. Use trace.id, span.id, and parent/child span relationships to analyze call chaining and agentic workflows (e.g. one root span, multiple LLM or tool-call child spans).
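To confirm which attribute names are actually present, pulling a single LLM span and inspecting it works well; the exists field below is an assumption, so substitute whatever your mapping shows:

```
GET traces*/_search
{
  "size": 1,
  "query": { "exists": { "field": "span.attributes.gen_ai.operation.name" } }
}
```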
Integrations (OpenAI, Azure OpenAI, Azure AI Foundry, Bedrock, Bedrock AgentCore, Vertex AI, etc.) ship metrics (and where supported logs) to Elastic. Metrics typically include token usage, request counts, latency, and—where the integration supports it—cost-related fields. Logs may include prompt/response or guardrail events. Exact field names and data streams are defined by each integration package; discover them from the integration docs or from the target data stream mapping.
To answer a question, work through the relevant tasks:
- **Discover data streams:** GET _data_stream and filter for traces*, metrics-apm* (or metrics*), and metrics-* / logs-* that match known LLM integration datasets (e.g. from Elastic LLM observability).
- **Verify LLM fields:** in traces*, run a small search or use the mapping to see if spans contain gen_ai.* or llm.* (or similar) attributes. Confirm the presence of token, model, and duration fields.
- **Performance:** query traces* filtered by span attributes (e.g. gen_ai.operation.name or gen_ai.provider.name when present). Compute throughput (count per time bucket), latency (e.g. duration.us or span duration), and error rate (event.outcome == "failure") by model, service, or time.
- **Cost and tokens:** in traces*, sum gen_ai.usage.input_tokens and gen_ai.usage.output_tokens (or equivalent attribute names) by time, model, or service. If a cost attribute exists (e.g. custom llm.response.cost.*), sum it for cost views.
- **Quality:** inspect event.outcome, error.type, and span attributes (e.g. gen_ai.response.finish_reasons) in traces* to identify failures, timeouts, or content filters (see the failure-breakdown sketch after the checklist below). Correlate with prompts/responses if captured in attributes (e.g. gen_ai.input.messages, gen_ai.output.messages) and not redacted.
- **Orchestration:** analyze call chains in traces*. Filter by root service or trace attributes; group by trace.id and use parent/child span relationships (e.g. parent.id, span.id) to reconstruct chains (e.g. orchestration span → multiple LLM or tool-call spans). Aggregate by span name or gen_ai.operation.name to see the distribution of steps (e.g. retrieval, LLM, tool use). Duration per span and per trace gives bottleneck and end-to-end latency.
- **Query hygiene:** always filter on a time range (@timestamp). When present, add service.name and optionally service.environment. For LLM-specific spans, filter by span attributes once you know the field names (e.g. a keyword field for gen_ai.provider.name or gen_ai.operation.name). Use LIMIT, coarse time buckets when only trends are needed, and avoid full scans over large windows.

LLM observability progress:
- [ ] Step 1: Determine available data (traces*, metrics-apm* or metrics*, or integration data streams)
- [ ] Step 2: Discover LLM-related field names (mapping or sample doc)
- [ ] Step 3: Run ES|QL or Elasticsearch queries for the user's question (performance, cost, quality, orchestration)
- [ ] Step 4: Check for active alerts or SLOs defined on LLM-related data (Alerting API, SLOs API); field names from
Step 2 help identify related rules; firing alerts or violated/degrading SLOs indicate potential degraded performance
- [ ] Step 5: Summarize findings from ingested data only; include alert/SLO status when relevant
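For the quality step, a failure-breakdown sketch; the gen_ai.* attribute paths and the relative time range are assumptions to adjust against the actual mapping:

```esql
FROM traces*
| WHERE @timestamp >= NOW() - 1 day
    AND span.attributes.gen_ai.request.model IS NOT NULL
    AND event.outcome == "failure"
| STATS failure_count = COUNT(*)
    BY error.type, span.attributes.gen_ai.request.model
| SORT failure_count DESC
| LIMIT 50
```

For Step 4, the Kibana Alerting and SLOs APIs can be called directly against the Kibana host; these paths exist in recent Kibana versions, but verify them for your deployment:

```
GET /api/alerting/rules/_find?search=gen_ai
GET /api/observability/slos
```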
Example (token usage per hour and model). Assume span attributes are available as span.attributes.gen_ai.usage.input_tokens and span.attributes.gen_ai.usage.output_tokens (adjust to actual field names from the mapping):

```esql
FROM traces*
| WHERE @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
    AND span.attributes.gen_ai.provider.name IS NOT NULL
| STATS
    input_tokens = SUM(span.attributes.gen_ai.usage.input_tokens),
    output_tokens = SUM(span.attributes.gen_ai.usage.output_tokens)
    BY hour = BUCKET(@timestamp, 1 hour), span.attributes.gen_ai.request.model
| SORT hour
| LIMIT 500
```
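If the token attributes are mapped as keyword rather than a numeric type (this varies by ingestion path), cast them first, e.g. SUM(TO_LONG(span.attributes.gen_ai.usage.input_tokens)).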
Example (request count, error rate, and average latency by model):

```esql
FROM traces*
| WHERE @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
    AND span.attributes.gen_ai.request.model IS NOT NULL
| STATS
    request_count = COUNT(*),
    failures = COUNT(*) WHERE event.outcome == "failure",
    avg_duration_us = AVG(span.duration.us)
    BY span.attributes.gen_ai.request.model
| EVAL error_rate = TO_DOUBLE(failures) / request_count // cast so the ratio is not integer division
| LIMIT 100
```
Get trace IDs that contain at least one LLM span and count spans per trace to see chain length:
```esql
FROM traces*
| WHERE @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
    AND span.attributes.gen_ai.operation.name IS NOT NULL
| STATS span_count = COUNT(*), total_duration_us = SUM(span.duration.us) BY trace.id
| WHERE span_count > 1
| SORT total_duration_us DESC
| LIMIT 50
```
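To drill into one slow chain from that list, a sketch (the trace id is a placeholder; span.name and parent.id depend on your mapping):

```esql
FROM traces*
| WHERE trace.id == "<trace-id>"
| KEEP @timestamp, span.id, parent.id, span.name, span.duration.us, span.attributes.gen_ai.operation.name
| SORT @timestamp
| LIMIT 200
```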
The Amazon Bedrock AgentCore integration ships metrics to the metrics-aws_bedrock_agentcore.metrics-* data stream (a time series index). Use TS for aggregations on time series data streams (Elasticsearch 9.2+); use a time range with TRANGE (9.3+). The integration's dashboards and alerting rule templates show the expected field names and query shapes.
Example: token usage (counter), invocations (counter), and average latency (gauge) by hour and agent:
```esql
TS metrics-aws_bedrock_agentcore.metrics-*
| WHERE TRANGE(7 days)
    AND aws.dimensions.Operation == "InvokeAgentRuntime"
| STATS
    total_tokens = SUM(RATE(aws.bedrock_agentcore.metrics.TokenCount.sum)),
    total_invocations = SUM(RATE(aws.bedrock_agentcore.metrics.Invocations.sum)),
    avg_latency_ms = AVG(AVG_OVER_TIME(aws.bedrock_agentcore.metrics.Latency.avg))
    BY hour = TBUCKET(1 hour), aws.bedrock_agentcore.agent_name
| SORT hour DESC
```
For Elasticsearch 8.x or when TS is not available, use FROM with BUCKET(@timestamp, 1 hour) and SUM/AVG over
the metric fields (as in the integration's alert rule templates). For other LLM integrations (OpenAI, Azure OpenAI,
Vertex AI, etc.), use that integration’s data stream index pattern and field names from its package (see
Elastic LLM observability).
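A minimal 8.x-style fallback sketch under the same assumptions (AgentCore field names taken from the TS example above; verify against the integration's mapping):

```esql
FROM metrics-aws_bedrock_agentcore.metrics-*
| WHERE @timestamp >= NOW() - 7 days
    AND aws.dimensions.Operation == "InvokeAgentRuntime"
| STATS
    total_tokens = SUM(aws.bedrock_agentcore.metrics.TokenCount.sum),
    total_invocations = SUM(aws.bedrock_agentcore.metrics.Invocations.sum),
    avg_latency_ms = AVG(aws.bedrock_agentcore.metrics.Latency.avg)
    BY hour = BUCKET(@timestamp, 1 hour), aws.bedrock_agentcore.agent_name
| SORT hour DESC
| LIMIT 500
```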
Answer only from data ingested into Elastic (traces*, metrics, or integration metrics/logs). Do not describe or rely on other vendors' UIs or products. Always confirm field names via _mapping or a sample document; naming may differ (e.g. gen_ai.* vs llm.* or integration-specific fields).