Distributed Tracing

Patterns and practices for implementing distributed tracing across microservices and understanding request flows in distributed systems.

When to Use This Skill

- Implementing distributed tracing in microservices
- Debugging cross-service request issues
- Understanding trace propagation
- Choosing tracing infrastructure
- Correlating logs, metrics, and traces

Why Distributed Tracing?

Problem: A single request flows through multiple services.
When something fails, how do you find out where?

Without tracing:
User → API → ??? → ??? → Error somewhere

With tracing:
User → API (50ms) → OrderService (20ms) → PaymentService (ERROR: timeout)
         └── Full visibility into request flow

Core Concepts

Traces, Spans, and Context

Trace: End-to-end request journey
├── Span: Single operation within a service
│   ├── SpanID: Unique identifier
│   ├── ParentSpanID: Link to parent span
│   ├── TraceID: Shared across all spans
│   ├── Operation Name: What is being done
│   ├── Start/End Time: Duration
│   ├── Status: Success/Error
│   ├── Attributes: Key-value metadata
│   └── Events: Point-in-time annotations
│
└── Context: Propagated across service boundaries
    ├── TraceID
    ├── SpanID
    ├── Trace Flags
    └── Trace State
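
The hierarchy above can be sketched as plain data structures. This is a minimal illustration using only the Python standard library, not the OpenTelemetry SDK; the names `SpanContext`, `Span`, `child`, and `new_root_span` are hypothetical and chosen for this example.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class SpanContext:
    """The part of a span that is propagated across service boundaries."""
    trace_id: str          # 32 hex chars, shared by every span in the trace
    span_id: str           # 16 hex chars, unique per span
    trace_flags: int = 1   # bit 0 set means "sampled"
    trace_state: str = ""  # vendor-specific key=value pairs


@dataclass
class Span:
    """A single operation within a service."""
    name: str
    context: SpanContext
    parent_span_id: Optional[str] = None
    start_time: float = field(default_factory=time.time)
    end_time: Optional[float] = None
    status: str = "UNSET"
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)

    def child(self, name: str) -> "Span":
        """Start a child span: same trace ID, new span ID, parent link set."""
        ctx = SpanContext(
            self.context.trace_id,
            secrets.token_hex(8),
            self.context.trace_flags,
            self.context.trace_state,
        )
        return Span(name, ctx, parent_span_id=self.context.span_id)


def new_root_span(name: str) -> Span:
    """A root span has no parent and mints a fresh trace ID."""
    return Span(name, SpanContext(secrets.token_hex(16), secrets.token_hex(8)))


root = new_root_span("GET /orders")
db = root.child("db.query orders")
```

The child shares the root's trace ID and records the root's span ID as its parent, which is what lets a backend reassemble the tree.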

Trace Visualization

TraceID: abc123

Service A (API Gateway)
├──────────────────────────────────────────────────────┤ 200ms
    │
    └─► Service B (Order Service)
        ├───────────────────────────────────┤ 150ms
            │
            ├─► Service C (Inventory)
            │   ├───────────────┤ 50ms
            │
            └─► Service D (Payment)
                ├───────────────────────┤ 80ms
                    │
                    └─► External API
                        ├─────────┤ 60ms

OpenTelemetry

Overview

OpenTelemetry = Unified observability framework

Components:
┌─────────────────────────────────────────────────────┐
│  Application                                        │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐ │
│  │    SDK      │  │   Tracer    │  │   Meter     │ │
│  │             │  │   Provider  │  │   Provider  │ │
│  └─────────────┘  └─────────────┘  └─────────────┘ │
└─────────────────────────────────────────────────────┘
           │               │               │
           └───────────────┼───────────────┘
                           ▼
              ┌─────────────────────────┐
              │    OTLP Exporter        │
              └─────────────────────────┘
                           │
                           ▼
              ┌─────────────────────────┐
              │    Collector            │
              │  (Optional)             │
              └─────────────────────────┘
                           │
           ┌───────────────┼───────────────┐
           ▼               ▼               ▼
      ┌─────────┐    ┌─────────┐    ┌─────────┐
      │ Jaeger  │    │  Zipkin │    │  Tempo  │
      └─────────┘    └─────────┘    └─────────┘

Trace Context Propagation

HTTP Headers (W3C Trace Context):
traceparent: 00-{trace-id}-{span-id}-{flags}
tracestate: vendor1=value1,vendor2=value2

Example:
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
              │   │                               │                └─ sampled
              │   │                               └─ parent span id
              │   └─ trace id (128-bit)
              └─ version

Propagation across services:
┌─────────────┐                      ┌─────────────┐
│  Service A  │  ─── HTTP ──────────►│  Service B  │
│             │  traceparent: 00-... │             │
│ Create Span │                      │ Extract     │
│ Inject      │                      │ Create Span │
└─────────────┘                      └─────────────┘
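
The inject and extract steps in the diagram can be sketched with the standard library alone. The function names `inject` and `extract` and the `TraceParent` tuple are illustrative, not a library API. The sketch assumes header names are already lowercased and only accepts the four-field layout shown above.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version-traceid-parentid-flags, all lowercase hex
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def extract(headers: dict) -> Optional[TraceParent]:
    """Receiving side: parse the incoming traceparent header, if valid."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # W3C Trace Context treats version ff and all-zero IDs as invalid
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceParent(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool) -> None:
    """Sending side: write the current span's identity into outgoing headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


# Service B receives a request, then calls Service C with its own new span ID
incoming = {"traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"}
parent = extract(incoming)
outgoing = {}
inject(outgoing, parent.trace_id, secrets.token_hex(8), parent.sampled)
```

The trace ID and sampled flag pass through unchanged; only the span ID field is replaced at each hop.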

Span Attributes

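Semantic conventions (standard attributes). Names in parentheses are older, now-deprecated forms that existing instrumentation may still emit:

HTTP:
- http.request.method (http.method): GET, POST, etc.
- url.full (http.url): Full URL
- http.response.status_code (http.status_code): 200, 404, 500
- http.route: /users/{id}

Database:
- db.system.name (db.system): postgresql, mysql
- db.query.text (db.statement): SELECT * FROM...
- db.operation.name (db.operation): SELECT, INSERT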

RPC:
- rpc.system: grpc
- rpc.service: OrderService
- rpc.method: CreateOrder

Custom:
- user.id: 12345
- order.total: 99.99
- feature.flag: experiment_v2

Tracing Backends

Jaeger

Features:
- Open source (CNCF)
- Built-in UI
- Multiple storage backends
- OpenTelemetry native

Architecture:
┌─────────────┐  ┌─────────────┐  ┌────────────────┐
│   Agent     │─►│  Collector  │─►│    Storage     │
│ (optional)  │  │             │  │ (Cassandra/    │
└─────────────┘  └─────────────┘  │ Elasticsearch) │
                       │          └────────────────┘
                       ▼
                ┌─────────────┐
                │    Query    │
                │   Service   │
                └─────────────┘
                       │
                       ▼
                ┌─────────────┐
                │     UI      │
                └─────────────┘

Zipkin

Features:
- Mature, battle-tested
- Simple architecture
- Low resource overhead
- Good ecosystem support

Best for:
- Simpler setups
- Lower resource environments
- Teams familiar with Zipkin

Grafana Tempo

Features:
- Object storage backend (cheap)
- Deep Grafana integration
- Log-based trace discovery
- Exemplars support

Best for:
- Grafana-heavy environments
- Cost-sensitive deployments
- Large-scale traces

Cloud Native Options

Provider   Service                Integration
AWS        X-Ray                  Native AWS services
GCP        Cloud Trace            Native GCP services
Azure      Application Insights   Native Azure services
Datadog    APM                    Full-stack observability

Sampling Strategies

Why Sample?

High-traffic systems generate millions of spans.
Storing all spans is expensive and often unnecessary.

Sampling: Collect a subset of traces

Goal: Keep enough data to debug issues
      while managing costs

Sampling Types

1. Head-based sampling (at trace start):
   - Decision made when trace begins
   - Consistent across services
   - Simple but may miss rare events

2. Tail-based sampling (after trace complete):
   - Decision made after seeing full trace
   - Can keep interesting traces (errors, slow)
   - Requires buffering spans
   - More complex infrastructure

3. Priority sampling:
   - Assign priority based on attributes
   - Keep all errors, sample normal traffic

Sampling Policies

Rate-based:
- Sample 10% of all traces
- Simple, predictable cost

Priority-based:
- 100% of errors
- 100% of slow requests (>1s)
- 5% of normal requests

Adaptive:
- Adjust rate based on traffic
- Target specific traces/second
- Handle traffic spikes
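
The priority-based and adaptive policies above can be sketched as two small functions. This assumes the spans of a trace have already been buffered (as in tail-based sampling); `FinishedSpan`, `keep_trace`, and `adaptive_rate` are hypothetical names, and the thresholds mirror the figures listed above.

```python
import random
from dataclasses import dataclass


@dataclass
class FinishedSpan:
    name: str
    duration_ms: float
    is_error: bool = False


def keep_trace(spans, slow_ms=1000.0, baseline_rate=0.05, rng=random.random):
    """Decide whether to keep a trace once all of its spans are known."""
    if not spans:
        return False
    if any(s.is_error for s in spans):
        return True                      # 100% of errors
    if max(s.duration_ms for s in spans) > slow_ms:
        return True                      # 100% of slow requests (>1s)
    return rng() < baseline_rate         # 5% of normal requests


def adaptive_rate(target_traces_per_sec, observed_traces_per_sec):
    """Sampling rate that aims for a fixed number of kept traces per second."""
    if observed_traces_per_sec <= 0:
        return 1.0
    return min(1.0, target_traces_per_sec / observed_traces_per_sec)
```

Passing `rng` as a parameter keeps the random branch testable. During a spike from 1,000 to 10,000 traces per second, `adaptive_rate(100, ...)` drops the rate from 0.1 to 0.01 so the stored volume stays near the target.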

Correlation Patterns

Logs-Traces-Metrics

Three Pillars of Observability:

Logs ◄──────────► Traces ◄──────────► Metrics
  │                  │                   │
  │ trace_id         │ exemplars         │
  │ span_id          │                   │
  └──────────────────┴───────────────────┘

Correlation:
1. Add trace_id/span_id to log entries
2. Add exemplars (trace links) to metrics
3. Click from metric → trace → logs

Log Correlation

Structured log with trace context:

{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "ERROR",
  "message": "Payment failed",
  "trace_id": "0af7651916cd43dd8448eb211c80319c",
  "span_id": "b7ad6b7169203331",
  "service": "payment-service",
  "user_id": "12345",
  "error": "Card declined"
}

Query in log aggregator:
trace_id:"0af7651916cd43dd8448eb211c80319c"
→ See all logs for this request
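
One way to get trace context onto every log line is a logging filter that reads the active context. The sketch below uses only the standard library; the `current_trace` context variable is a stand-in for whatever the tracing SDK exposes, and the class names are illustrative.

```python
import contextvars
import io
import json
import logging

# Holds (trace_id, span_id) for the request currently being handled
current_trace = contextvars.ContextVar("current_trace", default=(None, None))


class TraceContextFilter(logging.Filter):
    """Copy the active trace context onto every log record."""

    def filter(self, record):
        record.trace_id, record.span_id = current_trace.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, including the trace context."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
            "span_id": record.span_id,
            "service": "payment-service",
        })


stream = io.StringIO()  # stand-in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
handler.addFilter(TraceContextFilter())

logger = logging.getLogger("payment")
logger.addHandler(handler)
logger.propagate = False

# Normally set by middleware when the request's span starts
current_trace.set(("0af7651916cd43dd8448eb211c80319c", "b7ad6b7169203331"))
logger.error("Payment failed")
entry = json.loads(stream.getvalue())
```

Because the filter runs for every record, application code keeps calling `logger.error(...)` as usual and the IDs are attached without changes at each call site.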

Exemplars (Metrics to Traces)

Metric with exemplar:
http_request_duration{service="api"} = 2.5s
  └── exemplar: trace_id=abc123

When latency spikes:
1. See metric spike in dashboard
2. Click on data point
3. Jump directly to slow trace
4. See exactly what caused latency
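
The mechanism behind that workflow is that each histogram bucket remembers a trace ID from one of its observations. A minimal sketch, assuming a hypothetical `LatencyHistogram` class rather than a real metrics client:

```python
import bisect


class LatencyHistogram:
    """Per-bucket counts plus the most recent exemplar for each bucket."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)                   # upper bounds, seconds
        self.counts = [0] * (len(self.bounds) + 1)     # last slot is +Inf
        self.exemplars = [None] * (len(self.bounds) + 1)

    def _bucket(self, seconds):
        # Index of the first upper bound that is >= the observed value
        return bisect.bisect_left(self.bounds, seconds)

    def observe(self, seconds, trace_id=None):
        i = self._bucket(seconds)
        self.counts[i] += 1
        if trace_id is not None:
            self.exemplars[i] = (trace_id, seconds)    # keep the latest sample

    def exemplar_for(self, seconds):
        """Trace to open for a data point in this latency range, if any."""
        return self.exemplars[self._bucket(seconds)]


hist = LatencyHistogram([0.1, 0.5, 1.0, 5.0])
hist.observe(0.04, trace_id="aaa111")
hist.observe(0.2)                       # unsampled request, no exemplar
hist.observe(2.5, trace_id="abc123")
slow = hist.exemplar_for(2.5)
```

Only sampled requests can supply an exemplar, so a bucket may have a count but no linked trace.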

Instrumentation Patterns

Automatic Instrumentation

Zero-code instrumentation:
- HTTP clients/servers
- Database clients
- Message queues
- gRPC

Pros: Easy, comprehensive
Cons: Less control, more noise

Manual Instrumentation

Add spans for business logic:

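from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

# start_as_current_span makes this span the parent of any spans created inside the block
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", order_id)
    span.set_attribute("order.items", len(items))

    result = process(order)

    if result.error:
        # result.error is assumed to be an exception instance
        span.record_exception(result.error)
        span.set_status(Status(StatusCode.ERROR, str(result.error)))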

Pros: Precise, business-relevant
Cons: More code, maintenance

Hybrid Approach (Recommended)

1. Auto-instrument infrastructure:
   - HTTP, database, queue calls

2. Manual instrument business logic:
   - Key operations
   - Business metrics
   - Error context

Best Practices

Span Design

Good span names:
- HTTP GET /api/orders/{id}
- ProcessPayment
- db.query users

Bad span names:
- Handler (too generic)
- /api/orders/12345 (cardinality explosion)
- doStuff (meaningless)
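
When the web framework does not expose a route template, one fallback is to normalize ID-like path segments before naming the span. The helper below is a rough sketch with a hypothetical name, `span_name`. It treats only all-digit and UUID segments as IDs, which will not cover every scheme.

```python
import re

_UUID = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def span_name(method: str, path: str) -> str:
    """Replace ID-like path segments so the span name stays low-cardinality."""
    segments = []
    for segment in path.split("?")[0].split("/"):
        if segment.isdigit() or _UUID.match(segment):
            segments.append("{id}")
        else:
            segments.append(segment)
    return "HTTP " + method.upper() + " " + "/".join(segments)


name = span_name("get", "/api/orders/12345?expand=items")
```

The raw ID is not lost: it belongs in a span attribute such as `order.id`, where it can be searched without multiplying the number of distinct span names.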

Attribute Guidelines

Do:
- Use semantic conventions
- Add business context (user_id, order_id)
- Keep cardinality low in span names and in metric labels derived from spans
- Include error details

Don't:
- Add PII (personally identifiable information)
- Put high-cardinality values (IDs, raw URLs) in span names
- Add large payloads
- Include sensitive data

Performance Considerations

1. Use async span export
2. Sample appropriately
3. Limit attribute count
4. Use span processor batching
5. Consider span limits
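
Items 1 and 4 amount to keeping export work off the request path. The sketch below shows the core of that idea with a bounded buffer; `BatchingBuffer` is a hypothetical class, and the default sizes are assumptions chosen to resemble typical SDK batch-processor settings.

```python
from collections import deque


class BatchingBuffer:
    """Queue finished spans cheaply; export them later in batches."""

    def __init__(self, export, max_queue=2048, max_batch=512):
        self._export = export          # callable that takes a list of spans
        self._queue = deque()
        self._max_queue = max_queue
        self._max_batch = max_batch
        self.dropped = 0

    def on_end(self, span) -> None:
        """Called on the request path: constant time, never blocks."""
        if len(self._queue) >= self._max_queue:
            self.dropped += 1          # shed load instead of slowing the app
            return
        self._queue.append(span)

    def drain(self) -> int:
        """Called off the request path, for example by a timer thread."""
        sent = 0
        while self._queue:
            size = min(self._max_batch, len(self._queue))
            batch = [self._queue.popleft() for _ in range(size)]
            self._export(batch)
            sent += len(batch)
        return sent


batches = []
buffer = BatchingBuffer(batches.append, max_queue=5, max_batch=2)
for i in range(7):
    buffer.on_end({"span": i})
sent = buffer.drain()
```

Dropping spans when the queue is full is a deliberate trade: losing some telemetry is usually preferable to adding latency or memory pressure to the application.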

Troubleshooting with Traces

Common Patterns

Finding slow requests:
1. Query traces by duration > threshold
2. Identify slow spans
3. Check span attributes for context

Finding errors:
1. Query traces by status = ERROR
2. See error span and context
3. Check exception details

Finding dependencies:
1. View service map from traces
2. Identify critical paths
3. Find hidden dependencies
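
The three patterns above can be sketched over an in-memory list of spans. The data reuses the durations from the Trace Visualization section; the ERROR statuses and the function names are invented for illustration. Self time here assumes child spans run one after another.

```python
from collections import defaultdict

# (span_id, parent_id, service, duration_ms, status)
spans = [
    ("a", None, "API Gateway", 200, "OK"),
    ("b", "a", "Order Service", 150, "OK"),
    ("c", "b", "Inventory", 50, "OK"),
    ("d", "b", "Payment", 80, "ERROR"),
    ("e", "d", "External API", 60, "ERROR"),
]


def self_times(spans):
    """Time spent in each span itself, excluding its children."""
    child_total = defaultdict(int)
    for _, parent_id, _, duration, _ in spans:
        if parent_id is not None:
            child_total[parent_id] += duration
    return {service: duration - child_total[span_id]
            for span_id, _, service, duration, _ in spans}


def root_cause_candidates(spans):
    """Failing spans with no failing children: where the error started."""
    failing = {sid for sid, _, _, _, status in spans if status == "ERROR"}
    failing_parents = {pid for sid, pid, _, _, _ in spans if sid in failing}
    return [service for sid, _, service, _, _ in spans
            if sid in failing and sid not in failing_parents]


def dependencies(spans):
    """Caller-to-callee edges, the raw material for a service map."""
    service_of = {sid: service for sid, _, service, _, _ in spans}
    return {(service_of[pid], service)
            for _, pid, service, _, _ in spans if pid is not None}


slowest = max(self_times(spans).items(), key=lambda item: item[1])
```

Self time matters because the API Gateway span is the longest overall, yet most of its 200ms is spent waiting on downstream calls.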

Related Skills

observability-patterns - Three pillars overview
slo-sli-error-budget - Using traces for SLIs
incident-response - Using traces in incidents
