Correlates traces, logs, and metrics across OpenSearch and Prometheus using shared OTel semantic convention fields such as traceId/spanId, enabling end-to-end observability investigations from metric spikes to error logs.
This skill teaches how to correlate traces, logs, and metrics across all three telemetry signals using shared OTel semantic convention fields. Correlation enables end-to-end investigations: start from a metric spike, trace it to a specific request, and find the associated logs — or start from an error log and reconstruct the full trace that produced it.
All OpenSearch queries use the PPL API at /_plugins/_ppl with HTTPS and basic authentication. Prometheus queries use the HTTP API at localhost:9090. Credentials are read from the .env file (default: admin / My_password_123!@#).
| Variable | Default | Description |
|---|---|---|
| OPENSEARCH_ENDPOINT | https://localhost:9200 | OpenSearch base URL |
| OPENSEARCH_USER | admin | OpenSearch username |
| OPENSEARCH_PASSWORD | My_password_123!@# | OpenSearch password |
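A minimal .env sketch matching these defaults (the exact file layout is an assumption; the PROMETHEUS_ENDPOINT variable is inferred from the Prometheus examples below and the localhost:9090 default mentioned above):

# Minimal .env sketch (assumed layout; adjust quoting to whatever loads this file)
OPENSEARCH_ENDPOINT=https://localhost:9200
OPENSEARCH_USER=admin
OPENSEARCH_PASSWORD='My_password_123!@#'
# PROMETHEUS_ENDPOINT is not listed in the table above; it is assumed by the
# Prometheus examples below, defaulting to the localhost:9090 address mentioned earlier
PROMETHEUS_ENDPOINT=http://localhost:9090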
Traces and logs share traceId and spanId fields. When an application emits a log within an active span, the OTel SDK automatically injects the current trace context into the log record. This creates a direct link between log entries and the spans that produced them.
| Field | Signal | Type | Description |
|---|---|---|---|
| traceId | Traces, Logs | keyword | Hex-encoded 128-bit trace identifier shared between spans and log records |
| spanId | Traces, Logs | keyword | Hex-encoded 64-bit span identifier shared between spans and log records |
| traceFlags | Logs | integer | W3C trace flags (e.g., 01 = sampled) carried on log records |
- Trace index (otel-v1-apm-span-*): traceId and spanId identify each span
- Log index (logs-otel-v1-*): traceId and spanId link the log to the span that was active when the log was emitted

Prometheus exemplars attach trace context to individual metric samples. When the OTel SDK records a metric observation inside an active span, it can attach the trace_id and span_id as exemplar labels. This links a specific metric data point back to the trace that produced it.
Exemplar data model:
| Field | Description |
|---|---|
| trace_id | Hex-encoded trace identifier from the span active during metric recording |
| span_id | Hex-encoded span identifier from the span active during metric recording |
| filtered_attributes | Additional key-value pairs attached to the exemplar |
| timestamp | Time when the exemplar was recorded |
| value | The metric sample value associated with this exemplar |
All three signals (traces, logs, metrics) share resource attributes that identify the originating service. These attributes are set by the OTel SDK and propagated through the pipeline:
| Resource Attribute | Traces Field | Logs Field | Prometheus Label | Description |
|---|---|---|---|---|
| service.name | serviceName | resource.attributes.service.name | service_name | Service that produced the telemetry |
| service.namespace | resource.service.namespace | resource.attributes.service.namespace | service_namespace | Namespace grouping related services |
| service.version | resource.service.version | resource.attributes.service.version | service_version | Service version string |
| service.instance.id | resource.service.instance.id | resource.attributes.service.instance.id | service_instance_id | Unique instance identifier |
| deployment.environment.name | resource.deployment.environment.name | resource.attributes.deployment.environment.name | deployment_environment_name | Deployment environment (e.g., production, staging) |
The OTel Collector's resourcedetection processor enriches telemetry with environment context, and the Prometheus promote_resource_attributes configuration (in docker-compose/prometheus/prometheus.yml) promotes these resource attributes to metric labels so they are queryable in PromQL.
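To confirm the promotion worked, the promoted attributes should show up as metric label names in Prometheus. A quick check via the labels API (a sketch; the grep pattern is illustrative):

curl -s "$PROMETHEUS_ENDPOINT/api/v1/labels" | jq -r '.data[]' | grep -E 'service_name|service_version|deployment_environment'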
The following GenAI resource attributes are promoted to Prometheus metric labels via the promote_resource_attributes configuration, enabling metric queries filtered by agent or model:
| Resource Attribute | Prometheus Label | Description |
|---|---|---|
| gen_ai.agent.id | gen_ai_agent_id | Agent identifier |
| gen_ai.agent.name | gen_ai_agent_name | Human-readable agent name (only available if the SDK sets this as a resource attribute; most SDKs set it as a span attribute instead) |
| gen_ai.provider.name | gen_ai_provider_name | LLM provider (e.g., bedrock, openai) |
| gen_ai.request.model | gen_ai_request_model | Model requested for the operation |
| gen_ai.response.model | gen_ai_response_model | Model that actually served the response |
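To see which values a promoted GenAI label currently holds (useful before filtering by agent or model), the Prometheus label-values endpoint can be queried directly; a sketch:

curl -s "$PROMETHEUS_ENDPOINT/api/v1/label/gen_ai_request_model/values"
curl -s "$PROMETHEUS_ENDPOINT/api/v1/label/gen_ai_agent_name/values"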
Given a trace ID, find all log entries emitted during that trace:
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "source=logs-otel-v1-* | where traceId = '\''<TRACE_ID>'\'' | fields traceId, spanId, severityText, body, `resource.attributes.service.name`, `@timestamp` | sort `@timestamp`"}'
Given a span ID, find all log entries emitted during that specific span:
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "source=logs-otel-v1-* | where spanId = '\''<SPAN_ID>'\'' | fields traceId, spanId, severityText, body, `resource.attributes.service.name`, `@timestamp` | sort `@timestamp`"}'
Use PPL join to combine trace spans with their correlated logs in a single query:
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "source=otel-v1-apm-span-* | where traceId = '\''<TRACE_ID>'\'' | join left=s right=l ON s.traceId = l.traceId logs-otel-v1-* | fields s.spanId, s.name, s.serviceName, s.durationInNanos, l.severityText, l.body, l.`@timestamp`"}'
Reconstruct the complete request timeline by interleaving spans and logs sorted by timestamp. Run both queries and merge results by time:
Step 1 — Get all spans for the trace:
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "source=otel-v1-apm-span-* | where traceId = '\''<TRACE_ID>'\'' | eval signal = '\''span'\'' | fields traceId, spanId, serviceName, name, startTime, endTime, durationInNanos, `status.code`, signal | sort startTime"}'
Step 2 — Get all logs for the trace:
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "source=logs-otel-v1-* | where traceId = '\''<TRACE_ID>'\'' | eval signal = '\''log'\'' | fields traceId, spanId, `resource.attributes.service.name`, severityText, body, `@timestamp`, signal | sort `@timestamp`"}'
Merge both result sets by timestamp to see the full chronological sequence of spans and log entries for the request.
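One way to do the merge from the command line, as a minimal sketch: it assumes the two responses above were saved to spans.json and logs.json, and that the PPL API returns the default schema/datarows shape, so the timestamp column positions follow from the field lists in the two queries.

# Sketch: interleave spans and logs chronologically (column indices assume the field order above)
jq -s '
  (.[0].datarows | map({time: .[4], signal: "span", row: .}))   # startTime is the 5th projected field in step 1
  + (.[1].datarows | map({time: .[5], signal: "log", row: .}))  # @timestamp is the 6th projected field in step 2
  | sort_by(.time)
' spans.json logs.json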
When you find an error log, extract its traceId and query the Trace_Index to reconstruct the full trace:
Step 1 — Find error logs and get their traceId:
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "source=logs-otel-v1-* | where severityText = '\''ERROR'\'' | fields traceId, spanId, severityText, body, `resource.attributes.service.name`, `@timestamp` | sort - `@timestamp` | head 10"}'
Step 2 — Query the Trace_Index with the extracted traceId to get all spans:
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "source=otel-v1-apm-span-* | where traceId = '\''<TRACE_ID_FROM_LOG>'\'' | fields traceId, spanId, parentSpanId, serviceName, name, startTime, endTime, durationInNanos, `status.code` | sort startTime"}'
When a log entry has a spanId, query the Trace_Index to find the exact span that was active when the log was emitted:
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "source=otel-v1-apm-span-* | where spanId = '\''<SPAN_ID_FROM_LOG>'\'' | fields traceId, spanId, parentSpanId, serviceName, name, startTime, endTime, durationInNanos, `status.code`, `attributes.gen_ai.operation.name`"}'
Use the Prometheus exemplars API to retrieve trace context attached to metric samples. This links a metric observation back to the specific trace that produced it:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query_exemplars" \
--data-urlencode 'query=http_server_duration_seconds_bucket' \
--data-urlencode 'start=2024-01-01T00:00:00Z' \
--data-urlencode 'end=2024-01-02T00:00:00Z'
The response contains exemplar objects with trace_id and span_id in the labels field:
{
"status": "success",
"data": [
{
"seriesLabels": { "service_name": "my-agent", "__name__": "http_server_duration_seconds_bucket" },
"exemplars": [
{
"labels": { "trace_id": "abc123...", "span_id": "def456..." },
"value": "0.25",
"timestamp": 1704067200.000
}
]
}
]
}
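Given that shape, trace IDs can be extracted directly with jq; a minimal sketch:

curl -s "$PROMETHEUS_ENDPOINT/api/v1/query_exemplars" \
--data-urlencode 'query=http_server_duration_seconds_bucket' \
--data-urlencode 'start=2024-01-01T00:00:00Z' \
--data-urlencode 'end=2024-01-02T00:00:00Z' \
| jq -r '.data[].exemplars[].labels.trace_id' | sort -u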
Query exemplars for GenAI operation duration, filtered to agent invocations (invoke_agent operations):
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query_exemplars" \
--data-urlencode 'query=gen_ai_client_operation_duration_seconds_bucket{gen_ai_operation_name="invoke_agent"}' \
--data-urlencode 'start=2024-01-01T00:00:00Z' \
--data-urlencode 'end=2024-01-02T00:00:00Z'
After extracting a trace_id from an exemplar response, query the Trace_Index for the full trace:
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "source=otel-v1-apm-span-* | where traceId = '\''<TRACE_ID_FROM_EXEMPLAR>'\'' | fields traceId, spanId, parentSpanId, serviceName, name, startTime, endTime, durationInNanos, `status.code` | sort startTime"}'
Filter metrics by GenAI resource labels before correlating to traces via exemplars:
By agent invocation (invoke_agent operations):
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=rate(gen_ai_client_operation_duration_seconds_count{gen_ai_operation_name="invoke_agent"}[5m])'
By model:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=sum(rate(gen_ai_client_token_usage_count[5m])) by (gen_ai_request_model)'
Then query exemplars for the filtered metric to get trace IDs for correlation.
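For example, combining the filtered GenAI histogram with the same jq extraction (a sketch; substitute a start/end window around the period of interest):

curl -s "$PROMETHEUS_ENDPOINT/api/v1/query_exemplars" \
--data-urlencode 'query=gen_ai_client_operation_duration_seconds_bucket{gen_ai_operation_name="invoke_agent"}' \
--data-urlencode 'start=<START>' \
--data-urlencode 'end=<END>' \
| jq -r '.data[].exemplars[].labels.trace_id' | sort -u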
The service.name resource attribute is the primary key for correlating telemetry across all three signals at the service level.
Find all traces from a specific service:
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "source=otel-v1-apm-span-* | where serviceName = '\''my-service'\'' | stats count() as span_count, avg(durationInNanos) as avg_duration by serviceName"}'
Find all logs from the same service:
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "source=logs-otel-v1-* | where `resource.attributes.service.name` = '\''my-service'\'' | stats count() by severityText"}'
Find all metrics from the same service in Prometheus:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=rate(http_server_duration_seconds_count{service_name="my-service"}[5m])'
Query metrics filtered by GenAI resource attributes that are promoted to Prometheus labels:
By agent:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=sum(rate(gen_ai_client_operation_duration_seconds_count{gen_ai_operation_name="invoke_agent"}[5m])) by (gen_ai_operation_name)'
By provider and model:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=sum(rate(gen_ai_client_token_usage_count[5m])) by (gen_ai_request_model)'
This correlation works because resource attributes are propagated consistently through the pipeline:
- The OTel SDK sets resource attributes (service.name, service.version, etc.) on all telemetry
- The resourcedetection processor enriches telemetry with environment context (Docker, system info)
- The promote_resource_attributes configuration (in docker-compose/prometheus/prometheus.yml) promotes resource attributes to metric labels, making them queryable in PromQL

This ensures the same service.name value appears in traces (serviceName field), logs (resource.attributes.service.name field), and metrics (service_name label) — enabling service-level correlation across all backends.
Investigate a metric anomaly by correlating from metrics → traces → logs.
Step 1 — Detect the spike via PromQL:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=rate(http_server_duration_seconds_count[5m])'
Look for services with unusually high request rates or latency.
Step 2 — Query exemplars to get trace IDs from the spike window:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query_exemplars" \
--data-urlencode 'query=http_server_duration_seconds_bucket' \
--data-urlencode 'start=<SPIKE_START>' \
--data-urlencode 'end=<SPIKE_END>'
Extract trace_id values from the exemplar response.
Step 3 — Query the Trace_Index for those traces:
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "source=otel-v1-apm-span-* | where traceId = '\''<TRACE_ID_FROM_EXEMPLAR>'\'' | fields traceId, spanId, parentSpanId, serviceName, name, startTime, endTime, durationInNanos, `status.code` | sort startTime"}'
Step 4 — Query the Log_Index for correlated logs:
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "source=logs-otel-v1-* | where traceId = '\''<TRACE_ID_FROM_EXEMPLAR>'\'' | fields traceId, spanId, severityText, body, `resource.attributes.service.name`, `@timestamp` | sort `@timestamp`"}'
Start from an error log and trace back to the root cause.
Step 1 — Find error logs:
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "source=logs-otel-v1-* | where severityText = '\''ERROR'\'' | fields traceId, spanId, severityText, body, `resource.attributes.service.name`, `@timestamp` | sort - `@timestamp` | head 10"}'
Step 2 — Extract the traceId from the error log and reconstruct the full trace tree:
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "source=otel-v1-apm-span-* | where traceId = '\''<TRACE_ID_FROM_ERROR_LOG>'\'' | fields traceId, spanId, parentSpanId, serviceName, name, startTime, endTime, durationInNanos, `status.code` | sort startTime"}'
Step 3 — Identify the root cause span (look for error status or exceptions):
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "source=otel-v1-apm-span-* | where traceId = '\''<TRACE_ID_FROM_ERROR_LOG>'\'' AND `status.code` = 2 | fields traceId, spanId, serviceName, name, `events.attributes.exception.type`, `events.attributes.exception.message` | sort startTime"}'
Step 4 — Get all logs for the error span to see the full context:
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "source=logs-otel-v1-* | where traceId = '\''<TRACE_ID_FROM_ERROR_LOG>'\'' | fields traceId, spanId, severityText, body, `@timestamp` | sort `@timestamp`"}'
Investigate a slow agent invocation by correlating spans, child operations, logs, and metrics.
Step 1 — Find slow invoke_agent spans:
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "source=otel-v1-apm-span-* | where `attributes.gen_ai.operation.name` = '\''invoke_agent'\'' AND durationInNanos > 5000000000 | fields traceId, spanId, `attributes.gen_ai.agent.name`, durationInNanos, startTime | sort - durationInNanos | head 10"}'
Step 2 — Get all child spans to identify the bottleneck:
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "source=otel-v1-apm-span-* | where traceId = '\''<TRACE_ID>'\'' | fields traceId, spanId, parentSpanId, name, `attributes.gen_ai.operation.name`, durationInNanos, startTime | sort startTime"}'
Look for child spans with high durationInNanos — these are the bottleneck operations (e.g., slow tool calls, slow LLM responses).
Step 3 — Check tool calls within the slow trace:
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "source=otel-v1-apm-span-* | where traceId = '\''<TRACE_ID>'\'' AND `attributes.gen_ai.operation.name` = '\''execute_tool'\'' | fields spanId, `attributes.gen_ai.tool.name`, `attributes.gen_ai.tool.call.arguments`, durationInNanos | sort - durationInNanos"}'
Step 4 — Get correlated logs for the slow spans:
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "source=logs-otel-v1-* | where traceId = '\''<TRACE_ID>'\'' | fields spanId, severityText, body, `@timestamp` | sort `@timestamp`"}'
Step 5 — Check GenAI token usage metrics for the agent via PromQL:
curl -s "$PROMETHEUS_ENDPOINT/api/v1/query" \
--data-urlencode 'query=sum(rate(gen_ai_client_token_usage_count[5m])) by (gen_ai_operation_name, gen_ai_request_model)'
Check if the agent is consuming unusually high token counts, which may explain slow response times.
When correlating across signals, use describe or _mapping to discover available fields dynamically. This is especially useful when index schemas differ from the defaults.
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "describe otel-v1-apm-span-000001"}'
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "describe logs-otel-v1-000001"}'
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "describe otel-v2-apm-service-map-000001"}'
Use the field names from describe output to construct correlation queries when the default field names don't match your index schema.
When you have a set of traceIds from span queries, use IN to fetch all correlated logs in one query:
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "source=logs-otel-v1-* | where traceId IN ('\''<TRACE_ID_1>'\'', '\''<TRACE_ID_2>'\'', '\''<TRACE_ID_3>'\'') | fields traceId, spanId, severityText, body, `resource.attributes.service.name`, `@timestamp` | sort `@timestamp`"}'
This is more efficient than querying logs one traceId at a time.
Some logs may have empty or null traceId. Include those alongside correlated logs:
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "source=logs-otel-v1-* | where `resource.attributes.service.name` = '\''frontend'\'' | where (traceId IN ('\''<TRACE_ID_1>'\'', '\''<TRACE_ID_2>'\'') OR traceId = '\'''\'' OR isnull(traceId)) | sort - `@timestamp` | head 50"}'
Identify which remote services a given service calls using coalesce() across OTel attribute variants. Different instrumentation libraries (Node.js, Go, Python, .NET) use different attributes:
curl -sk -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" \
-X POST "$OPENSEARCH_ENDPOINT/_plugins/_ppl" \
-H 'Content-Type: application/json' \
-d '{"query": "source=otel-v1-apm-span-* | where serviceName = '\''checkout'\'' | where kind = '\''SPAN_KIND_CLIENT'\'' | eval _remoteService = coalesce(`attributes.net.peer.name`, `attributes.server.address`, `attributes.rpc.service`, `attributes.db.system`, `attributes.gen_ai.system`, '\''unknown'\'') | stats count() as calls, avg(durationInNanos) as avg_latency by _remoteService | sort - calls"}'
When correlating across different service types, these are the key fields by protocol:
| Protocol | Remote Service Field | Operation Field |
|---|---|---|
| gRPC | attributes.net.peer.name or attributes.server.address | attributes.rpc.method |
| HTTP | attributes.http.host or attributes.server.address | attributes.http.route |
| Database | attributes.db.system + attributes.server.address | attributes.db.statement |
| Envoy/Istio | attributes.upstream_cluster | span name |
| LLM/GenAI | attributes.gen_ai.system + attributes.server.address | attributes.gen_ai.request.model |
| Message Queue | attributes.messaging.destination.name | span name |
Replace the local OpenSearch endpoint and authentication with AWS SigV4 for all PPL queries in this skill:
curl -s --aws-sigv4 "aws:amz:REGION:es" \
--user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY" \
-X POST https://DOMAIN-ID.REGION.es.amazonaws.com/_plugins/_ppl \
-H 'Content-Type: application/json' \
-d '{"query": "source=otel-v1-apm-span-* | where traceId = '\''<TRACE_ID>'\'' | fields traceId, spanId, parentSpanId, serviceName, name, startTime, endTime, durationInNanos, `status.code` | sort startTime"}'
- Endpoint: https://DOMAIN-ID.REGION.es.amazonaws.com
- Authentication: --aws-sigv4 "aws:amz:REGION:es" with --user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY"
- The PPL path (/_plugins/_ppl) and query syntax are identical to the local stack
- No -k flag needed — AWS managed endpoints use valid TLS certificates

Replace the local Prometheus endpoint and authentication with AWS SigV4 for all PromQL and exemplar queries:
Query exemplars:
curl -s --aws-sigv4 "aws:amz:REGION:aps" \
--user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY" \
'https://aps-workspaces.REGION.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/query_exemplars' \
--data-urlencode 'query=http_server_duration_seconds_bucket' \
--data-urlencode 'start=2024-01-01T00:00:00Z' \
--data-urlencode 'end=2024-01-02T00:00:00Z'
Query metrics:
curl -s --aws-sigv4 "aws:amz:REGION:aps" \
--user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY" \
'https://aps-workspaces.REGION.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/query' \
--data-urlencode 'query=rate(http_server_duration_seconds_count{service_name="my-service"}[5m])'
- Endpoint: https://aps-workspaces.REGION.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/query
- Authentication: --aws-sigv4 "aws:amz:REGION:aps" with --user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY"