From agent-almanac
Instruments apps with OpenTelemetry for distributed tracing and Jaeger/Tempo integration. Debugs latency in microservices, analyzes request flows, correlates traces with logs/metrics.
npx claudepluginhub pjt222/agent-almanac
Implement OpenTelemetry distributed tracing to track requests across microservices and identify performance bottlenecks.
See Extended Examples for complete configuration files and templates.
Deploy Jaeger or Grafana Tempo to receive and store traces.
Option A: Jaeger all-in-one (development/testing):
# docker-compose.yml
version: '3.8'
services:
  jaeger:
    image: jaegertracing/all-in-one:1.51
    ports:
      - "5775:5775/udp"   # Zipkin compact thrift
      - "6831:6831/udp"   # Jaeger compact thrift
      - "6832:6832/udp"   # Jaeger binary thrift
      - "5778:5778"       # Serve configs
      - "16686:16686"     # Jaeger UI
      - "14268:14268"     # Jaeger HTTP thrift
      - "14250:14250"     # Jaeger gRPC
      - "9411:9411"       # Zipkin compatible endpoint
      - "4317:4317"       # OTLP gRPC (published since COLLECTOR_OTLP_ENABLED=true)
      - "4318:4318"       # OTLP HTTP
    environment:
      - COLLECTOR_ZIPKIN_HOST_PORT=:9411
      - COLLECTOR_OTLP_ENABLED=true
    restart: unless-stopped
Option B: Grafana Tempo (production, scalable):
# docker-compose.yml
version: '3.8'
services:
  tempo:
    image: grafana/tempo:2.3.0
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml
      - tempo-data:/tmp/tempo
    ports:
      - "3200:3200"   # Tempo HTTP
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "9411:9411"   # Zipkin
    restart: unless-stopped

volumes:
  tempo-data:
Tempo configuration (tempo.yaml):
server:
  http_listen_port: 3200

distributor:
  receivers:
    jaeger:
      # ... (see EXAMPLES.md for complete configuration)
For production with S3 storage:
storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: s3.amazonaws.com
      region: us-east-1
    wal:
      path: /tmp/tempo/wal
    pool:
      max_workers: 100
      queue_depth: 10000
Expected: Tracing backend accessible, ready to receive traces via OTLP, Jaeger UI or Grafana shows "no traces" initially.
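To confirm the backend accepts traces before instrumenting real services, a minimal Python smoke test is sketched below. It assumes the OTLP gRPC port from the compose files above is published on localhost:4317; the script and the service name smoke-test are hypothetical.

# verify_tracing_backend.py - minimal sketch, assumes OTLP gRPC at localhost:4317
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "smoke-test"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("backend-smoke-test") as span:
    span.set_attribute("test.purpose", "verify OTLP ingestion")

provider.shutdown()  # flush the batch processor before exit
print("Span exported; search for service 'smoke-test' in Jaeger UI or Grafana")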
On failure:
- Confirm the ports are listening: netstat -tulpn | grep -E '(4317|16686|3200)'
- Check container logs: docker logs jaeger or docker logs tempo
- Probe the OTLP HTTP endpoint: curl http://localhost:4318/v1/traces -v
- Validate the Tempo configuration: tempo -config.file=/etc/tempo.yaml -verify-config
Use OpenTelemetry auto-instrumentation for common frameworks to minimize code changes.
Python with Flask:
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install
# app.py
from flask import Flask
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
# ... (see EXAMPLES.md for complete configuration)
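A fuller sketch of the truncated app.py, for orientation only: the OTLP endpoint tempo:4317, the service name flask-api, and the /orders route are assumptions, and the Flask instrumentation package comes from the opentelemetry-bootstrap command above.

# app.py - sketch, not the EXAMPLES.md version
from flask import Flask
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor

provider = TracerProvider(resource=Resource.create({"service.name": "flask-api"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="tempo:4317", insecure=True))
)
trace.set_tracer_provider(provider)

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)  # creates a span per incoming request

@app.route("/orders/<order_id>")
def get_order(order_id):
    return {"order_id": order_id, "status": "shipped"}

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)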
Go with Gin framework:
go get go.opentelemetry.io/otel
go get go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc
go get go.opentelemetry.io/otel/sdk/trace
go get go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin
package main
import (
    "context"

    "github.com/gin-gonic/gin"
    "go.opentelemetry.io/otel"
    // ... (see EXAMPLES.md for complete configuration)
Node.js with Express:
npm install @opentelemetry/api \
@opentelemetry/sdk-node \
@opentelemetry/auto-instrumentations-node \
@opentelemetry/exporter-trace-otlp-grpc
// tracing.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
// ... (see EXAMPLES.md for complete configuration)
Expected: Traces from instrumented services appear in Jaeger UI or Grafana, HTTP requests automatically create spans.
On failure:
- Verify the exporter endpoint, e.g. OTEL_EXPORTER_OTLP_ENDPOINT=http://tempo:4317
- Turn on SDK debug logging: OTEL_LOG_LEVEL=debug (Python), OTEL_LOG_LEVEL=DEBUG (Node.js)
Create custom spans for business logic, database queries, and external calls.
Python manual spans:
from opentelemetry import trace
tracer = trace.get_tracer(__name__)
def process_order(order_id):
    # Create a span for the entire operation
    # ... (see EXAMPLES.md for complete configuration)
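A fuller sketch of process_order, not the version in EXAMPLES.md: fetch_order and charge_payment are placeholder helpers, and the span and attribute names are assumptions. It also illustrates the attribute, event, and error-recording practices listed below.

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

def fetch_order(order_id):
    # Placeholder for a real database lookup
    return {"id": order_id, "total": 42.0}

def charge_payment(order):
    # Placeholder for a real payment-gateway call
    pass

def process_order(order_id):
    # Parent span covering the whole operation
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        try:
            # Child span for the database lookup
            with tracer.start_as_current_span("db.fetch_order") as db_span:
                db_span.set_attribute("db.system", "postgresql")
                order = fetch_order(order_id)
                if order is None:
                    span.add_event("cache_miss")
            # Child span for the external payment call
            with tracer.start_as_current_span("payment.charge"):
                charge_payment(order)
            span.set_status(Status(StatusCode.OK))
            return order
        except Exception as exc:
            span.record_exception(exc)  # keeps the stack trace on the span
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            raise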
Go manual spans:
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
    "go.opentelemetry.io/otel/trace"
    // ... (see EXAMPLES.md for complete configuration)
Span attributes best practices:
- Use semantic convention attributes: http.method, http.status_code, db.system, db.statement
- Add business context: user.id, order.id, product.category
- Add infrastructure context: instance.id, region, availability_zone
- Record failures with span.RecordError(err) and span.SetStatus(codes.Error, message)
- Mark notable moments with events, e.g. span.AddEvent("cache_miss")
Expected: Custom spans appear in trace view, parent-child relationships correct, attributes visible in span details, errors highlighted.
On failure:
- Confirm every span is ended (defer span.End() in Go, with blocks in Python)
Ensure trace context flows across service boundaries and async operations.
HTTP headers propagation (W3C Trace Context):
# Client side (Python with requests)
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject
tracer = trace.get_tracer(__name__)
# ... (see EXAMPLES.md for complete configuration)
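A minimal sketch of the client side under these assumptions: check_stock, the product.id attribute, and the inventory-service URL are hypothetical. inject() copies the active span context into the outgoing headers as traceparent/tracestate so the downstream service continues the same trace.

import requests
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer(__name__)

def check_stock(product_id):
    with tracer.start_as_current_span("check_stock") as span:
        span.set_attribute("product.id", product_id)
        headers = {}
        inject(headers)  # adds the W3C traceparent/tracestate headers
        resp = requests.get(
            f"http://inventory-service:8080/stock/{product_id}",
            headers=headers,
            timeout=5,
        )
        span.set_attribute("http.status_code", resp.status_code)
        return resp.json()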
// Server side (Go with Gin)
import (
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/propagation"
)
// ... (see EXAMPLES.md for complete configuration)
Message queue propagation (Kafka):
# Producer
from opentelemetry.propagate import inject
from kafka import KafkaProducer
producer = KafkaProducer(bootstrap_servers=['kafka:9092'])
# ... (see EXAMPLES.md for complete configuration)
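A minimal sketch of the producer side, assuming a hypothetical topic named orders and using kafka-python's headers parameter to carry the injected trace context to the consumer shown next.

import json
from kafka import KafkaProducer
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer(__name__)
producer = KafkaProducer(bootstrap_servers=["kafka:9092"])

def publish_order_event(order_id):
    with tracer.start_as_current_span("publish_order_event"):
        carrier = {}
        inject(carrier)  # capture the current trace context
        # Kafka headers are (key, bytes) tuples, so encode the carrier values
        headers = [(k, v.encode("utf-8")) for k, v in carrier.items()]
        producer.send(
            "orders",  # hypothetical topic name
            value=json.dumps({"order_id": order_id}).encode("utf-8"),
            headers=headers,
        )
        producer.flush()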
# Consumer
from opentelemetry.propagate import extract
def process_message(msg):
    # Extract trace context from Kafka headers
    headers = {k: v.decode('utf-8') for k, v in msg.headers}
    ctx = extract(headers)
    # Continue the trace
    with tracer.start_as_current_span("process_order_event", context=ctx):
        order_id = json.loads(msg.value)['order_id']
        handle_order(order_id)
Async operations (Python asyncio):
import asyncio
from opentelemetry import trace, context
async def async_operation():
    # Capture current context
    token = context.attach(context.get_current())
    try:
        with tracer.start_as_current_span("async_database_query"):
            await asyncio.sleep(0.1)  # Simulated async work
            return "result"
    finally:
        context.detach(token)
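Context also flows automatically into tasks started from a traced coroutine, because the Python SDK stores the active span in contextvars. A minimal sketch with hypothetical names handle_request and fetch_user:

import asyncio
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

async def fetch_user(user_id):
    # Inherits the parent span's context via contextvars automatically
    with tracer.start_as_current_span("fetch_user") as span:
        span.set_attribute("user.id", user_id)
        await asyncio.sleep(0.05)
        return {"id": user_id}

async def handle_request(user_id):
    with tracer.start_as_current_span("handle_request"):
        # Both fetch_user spans become children of handle_request in the trace
        return await asyncio.gather(fetch_user(user_id), fetch_user(user_id + 1))

asyncio.run(handle_request(1))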
Expected: Traces span multiple services, trace IDs consistent across service boundaries, parent-child relationships preserved.
On failure:
- Set a global propagator in every service, e.g. otel.propagation.set_global_textmap(TraceContextTextMapPropagator())
- Inspect outgoing requests for a traceparent header value
Implement sampling to reduce trace volume and cost while maintaining visibility.
Sampling strategies:
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import (
    ParentBased,
    TraceIdRatioBased,
    StaticSampler,
    Decision
    # ... (see EXAMPLES.md for complete configuration)
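As a minimal sketch of wiring a head-based sampler (the 10% ratio is an assumption to tune for your traffic; ParentBased honors any sampling decision made upstream):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample ~10% of new root traces; follow the parent's decision for propagated ones
sampler = ParentBased(root=TraceIdRatioBased(0.10))
provider = TracerProvider(sampler=sampler)
trace.set_tracer_provider(provider)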
Tail-based sampling with Tempo:
Configure in tempo.yaml:
overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics]
      storage:
        path: /tmp/tempo/generator/wal
        remote_write:
          - url: http://prometheus:9090/api/v1/write
            send_exemplars: true

# Tail sampling (requires tempo-query)
ingestion_rate_limit_bytes: 5000000
ingestion_burst_size_bytes: 10000000
Use Grafana Tempo's TraceQL to surface the traces that matter (errors, slow requests, specific services):
# Sample traces with errors
{ status = error }
# Sample slow traces (>1s)
{ duration > 1s }
# Sample specific services
{ resource.service.name = "checkout-service" }
Expected: Trace volume reduced to target percentage, error traces always sampled, sampling decision visible in trace metadata.
On failure:
- Check Tempo ingestion limits (ingestion_rate_limit_bytes, ingestion_burst_size_bytes)
- Watch the otel_traces_dropped_total metric for spans dropped by the exporter
Link traces to metrics and logs for unified observability.
Add trace IDs to logs (Python):
import logging
from opentelemetry import trace
# Custom log formatter with trace context
class TraceFormatter(logging.Formatter):
    def format(self, record):
        # ... (see EXAMPLES.md for complete configuration)
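A fuller sketch of such a formatter, not the EXAMPLES.md version; the log format string is an assumption.

import logging
from opentelemetry import trace

class TraceFormatter(logging.Formatter):
    def format(self, record):
        span = trace.get_current_span()
        ctx = span.get_span_context()
        if ctx.is_valid:
            # 32-hex trace ID / 16-hex span ID, matching what the backend stores
            record.trace_id = format(ctx.trace_id, "032x")
            record.span_id = format(ctx.span_id, "016x")
        else:
            record.trace_id = "-"
            record.span_id = "-"
        return super().format(record)

handler = logging.StreamHandler()
handler.setFormatter(TraceFormatter(
    "%(asctime)s %(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
))
logging.getLogger().addHandler(handler)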
Generate metrics from traces (Tempo):
# tempo.yaml
metrics_generator:
  registry:
    external_labels:
      cluster: production
  storage:
    # ... (see EXAMPLES.md for complete configuration)
This generates Prometheus metrics:
- traces_service_graph_request_total - request count between services
- traces_span_metrics_duration_seconds - span duration histogram
- traces_spanmetrics_calls_total - span call counts
Query traces from metrics (Grafana):
Add exemplar support to Prometheus datasource in Grafana:
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceName: Tempo
In Grafana dashboard, enable exemplars:
{
  "fieldConfig": {
    "defaults": {
      "custom": {
        "showExemplars": true
      }
    }
  }
}
Expected: Clicking metric exemplars opens trace, logs show trace IDs, traces link to logs, unified debugging across signals.
On failure:
- Query histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) and check that exemplar points appear
Common pitfalls:
- Not passing context to downstream calls breaks traces. Always pass context explicitly.
- Missing defer span.End() (Go) or with blocks (Python) causes spans to remain open and memory leaks.
- Skipping span.RecordError() loses valuable debugging information. Always record errors in spans.
- Use a ParentBased sampler to honor upstream sampling.
Related skills:
- correlate-observability-signals - Unified debugging with metrics, logs, and traces linked by trace IDs
- setup-prometheus-monitoring - Generate metrics from traces using Tempo metrics generator
- configure-log-aggregation - Add trace IDs to logs for correlation with distributed traces
- build-grafana-dashboards - Visualize trace-derived metrics and exemplar links in dashboards