Instrumentation Planning

Strategic planning for application instrumentation before implementation.

When to Use This Skill

Planning instrumentation for new services
Reviewing instrumentation strategy
Establishing naming conventions
Managing telemetry cardinality
Setting instrumentation budgets

Instrumentation Strategy Framework

What to Instrument

Instrumentation Layers:

┌─────────────────────────────────────────────────────────────────┐
│  Layer 1: Automatic/Library Instrumentation                     │
│  - HTTP clients/servers (auto-captured)                         │
│  - Database clients (auto-captured)                             │
│  - Message queue clients (auto-captured)                        │
│  - Framework-provided metrics                                   │
│  Effort: Low | Coverage: Broad | Customization: Limited        │
├─────────────────────────────────────────────────────────────────┤
│  Layer 2: Business Transaction Instrumentation                  │
│  - Key user journeys                                            │
│  - Business operations (checkout, signup, etc.)                │
│  - Revenue-generating flows                                     │
│  - SLA-bound operations                                         │
│  Effort: Medium | Coverage: Targeted | Value: High             │
├─────────────────────────────────────────────────────────────────┤
│  Layer 3: Debug/Diagnostic Instrumentation                      │
│  - Algorithmic hot paths                                        │
│  - Cache behavior                                               │
│  - Circuit breaker states                                       │
│  - Retry/fallback paths                                         │
│  Effort: Medium | Coverage: Deep | Use: Troubleshooting        │
├─────────────────────────────────────────────────────────────────┤
│  Layer 4: Business Metrics                                      │
│  - Domain-specific counters                                     │
│  - Conversion rates                                             │
│  - Feature usage                                                │
│  - Customer behavior                                            │
│  Effort: High | Coverage: Custom | Value: Business Insights    │
└─────────────────────────────────────────────────────────────────┘

Instrumentation Decision Matrix

instrumentation_decisions:
  always_instrument:
    - "Inbound HTTP/gRPC requests"
    - "Outbound HTTP/gRPC calls"
    - "Database queries"
    - "Message publish/consume"
    - "Authentication/authorization"
    - "External API calls"
    - "Cache operations"

  consider_instrumenting:
    - "Complex business logic"
    - "Feature flags evaluation"
    - "Background jobs"
    - "Scheduled tasks"
    - "File I/O operations"
    - "CPU-intensive operations"

  avoid_instrumenting:
    - "Every method call (too noisy)"
    - "Tight loops (performance impact)"
    - "Data transformation (low value)"
    - "Validation helpers"
    - "Utility functions"

  decision_criteria:
    business_value:
      weight: 0.3
      question: "Does this help understand business outcomes?"

    debugging_value:
      weight: 0.25
      question: "Does this help diagnose production issues?"

    slo_relevance:
      weight: 0.25
      question: "Does this contribute to SLI measurement?"

    cost_impact:
      weight: 0.2
      question: "Is the cardinality/volume acceptable?"

Naming Conventions

Metric Naming

metric_naming:
  format: "[namespace]_[subsystem]_[name]_[unit]"

  rules:
    case: "snake_case"
    unit_suffix: "Always include (_seconds, _bytes, _total)"
    base_units: "Use base units (seconds not milliseconds)"
    counter_suffix: "_total for counters"

  examples:
    good:
      - "http_server_requests_total"
      - "http_server_request_duration_seconds"
      - "http_server_response_size_bytes"
      - "db_connections_current"
      - "order_processing_duration_seconds"
      - "payment_transactions_total"

    bad:
      - "requests (no unit, no namespace)"
      - "HttpRequestDuration (wrong case)"
      - "order_latency_ms (use base units)"
      - "totalOrders (camelCase, no unit)"

  label_naming:
    case: "snake_case"
    avoid:
      - "Embedded values in name (path=/users)"
      - "High cardinality labels"
    good_labels:
      - "method, status_code, path"
      - "service, version, environment"
    bad_labels:
      - "user_id (high cardinality)"
      - "request_id (high cardinality)"
      - "timestamp (not a dimension)"

Span Naming

span_naming:
  format: "[operation] [resource]"

  rules:
    - "Use verb + noun pattern"
    - "Keep names low cardinality"
    - "Include operation type, not specific values"
    - "Be consistent across services"

  examples:
    http:
      pattern: "HTTP {METHOD} {route_template}"
      good: "HTTP GET /users/{id}"
      bad: "HTTP GET /users/12345"

    database:
      pattern: "{operation} {table}"
      good: "SELECT orders"
      bad: "SELECT * FROM orders WHERE id=123"

    messaging:
      pattern: "{operation} {queue/topic}"
      good: "PUBLISH order-events"
      bad: "publish message to order-events queue"

    rpc:
      pattern: "{service}/{method}"
      good: "OrderService/CreateOrder"
      bad: "grpc call to order service"

  attributes:
    required:
      - "service.name"
      - "service.version"
      - "deployment.environment"

    recommended:
      http:
        - "http.method"
        - "http.route"
        - "http.status_code"
        - "http.target"

      database:
        - "db.system"
        - "db.name"
        - "db.operation"
        - "db.statement (sanitized)"

      messaging:
        - "messaging.system"
        - "messaging.destination"
        - "messaging.operation"

Log Field Naming

log_naming:
  format: "snake_case for all fields"

  standard_fields:
    timestamp: "ISO 8601 format"
    level: "INFO, WARN, ERROR, etc."
    message: "Human-readable description"
    service: "Service name"
    trace_id: "Correlation ID"
    span_id: "Current span"

  domain_fields:
    pattern: "{domain}_{field}"
    examples:
      - "order_id"
      - "customer_id"
      - "payment_amount"
      - "product_sku"

  avoid:
    - "Nested objects (flatten for indexing)"
    - "Arrays of unknown length"
    - "Large text blobs"
    - "Sensitive data (PII, secrets)"

Cardinality Management

Understanding Cardinality

Cardinality = Number of unique time series

Example:
http_requests_total{method="GET", path="/api/users", status="200"}

Cardinality = methods × paths × statuses
            = 5 × 100 × 10
            = 5,000 time series

With user_id (1M users):
            = 5 × 100 × 10 × 1,000,000
            = 5,000,000,000 time series ← EXPLOSION!

Cardinality Budget

cardinality_budget:
  planning:
    total_budget: 100000  # Target max time series per service
    allocation:
      automatic_instrumentation: 30%  # 30,000
      business_transactions: 40%      # 40,000
      custom_metrics: 20%             # 20,000
      buffer: 10%                     # 10,000

  per_metric_limits:
    low_cardinality:
      max_series: 100
      example: "status codes, methods"

    medium_cardinality:
      max_series: 1000
      example: "endpoints, operations"

    high_cardinality:
      max_series: 10000
      example: "aggregated by hour"
      requires: "Justification and approval"

  monitoring:
    - "Alert on cardinality growth > 10% per day"
    - "Weekly cardinality reviews"
    - "Automatic label value limiting"

Cardinality Reduction Techniques

cardinality_reduction:
  bucketing:
    before: "path=/users/12345"
    after: "path=/users/{id}"
    technique: "Path template extraction"

  sampling:
    description: "Sample high-volume, low-value traces"
    strategies:
      head_sampling: "Decide at trace start"
      tail_sampling: "Decide after seeing full trace"
      adaptive: "Adjust rate based on volume"

  aggregation:
    description: "Pre-aggregate before export"
    example: "Count by status, not per request"

  value_limiting:
    description: "Cap unique values per label"
    example: "Max 100 unique paths, then 'other'"

  dropping:
    description: "Drop low-value dimensions"
    candidates:
      - "Instance ID (use service name)"
      - "Request ID (not for metrics)"
      - "Full URLs (use route templates)"

Instrumentation Budget

Performance Impact

performance_budget:
  cpu_overhead:
    target: "< 1% CPU increase"
    measurement: "Profile with/without instrumentation"

  memory_overhead:
    target: "< 50MB additional heap"
    components:
      - "Metric registries"
      - "Span buffers"
      - "Log buffers"

  latency_overhead:
    target: "< 1ms per request"
    hot_paths: "< 100μs"

  data_volume:
    metrics:
      target: "< 1GB/day per service"
      calculation: "series × scrape_interval × 8 bytes"

    traces:
      target: "< 10GB/day per service (with sampling)"
      sampling_rate: "1-10% for high-volume services"

    logs:
      target: "< 5GB/day per service"
      strategies: "Sampling, level gating"

Cost Planning

cost_planning:
  estimation_formula:
    metrics:
      monthly_cost: "time_series × $0.003 (typical cloud pricing)"
      example: "10,000 series × $0.003 = $30/month"

    traces:
      monthly_cost: "spans_per_month × $0.000005"
      example: "100M spans × $0.000005 = $500/month"

    logs:
      monthly_cost: "GB_per_month × $0.50"
      example: "500GB × $0.50 = $250/month"

  optimization_strategies:
    - "Increase scrape interval (15s → 60s)"
    - "Reduce trace sampling rate"
    - "Log level gating in production"
    - "Shorter retention for debug data"
    - "Downsample old metrics"

Instrumentation Plan Template

instrumentation_plan:
  service: "{Service Name}"
  version: "1.0"
  date: "{Date}"
  owner: "{Team}"

  objectives:
    - "Track SLIs for order processing"
    - "Enable distributed tracing for debugging"
    - "Monitor payment success rates"

  automatic_instrumentation:
    framework: "OpenTelemetry .NET"
    enabled:
      - "ASP.NET Core (HTTP server)"
      - "HttpClient (HTTP client)"
      - "Entity Framework Core (database)"
      - "Azure.Messaging.ServiceBus"
    configuration:
      sampling_rate: 0.1  # 10% of traces
      batch_export_interval: 5000  # ms

  custom_spans:
    - name: "ProcessOrder"
      purpose: "Track order processing duration"
      attributes:
        - "order.id"
        - "order.item_count"
        - "order.total_amount"
      events:
        - "inventory.reserved"
        - "payment.processed"

    - name: "ValidatePayment"
      purpose: "Track payment validation steps"
      attributes:
        - "payment.method"
        - "payment.provider"
      sensitive: false

  custom_metrics:
    counters:
      - name: "orders_total"
        labels: ["status", "payment_method"]
        purpose: "Count orders by outcome"
        cardinality_estimate: 20

      - name: "payment_failures_total"
        labels: ["reason", "provider"]
        purpose: "Track payment failure reasons"
        cardinality_estimate: 50

    histograms:
      - name: "order_processing_duration_seconds"
        labels: ["order_type"]
        purpose: "Track order processing latency"
        buckets: [0.1, 0.5, 1, 2, 5, 10]
        cardinality_estimate: 10

    gauges:
      - name: "pending_orders_current"
        labels: []
        purpose: "Current pending order count"
        cardinality_estimate: 1

  cardinality_summary:
    estimated_total: 81
    budget: 1000
    status: "Within budget"

  log_strategy:
    production_level: "INFO"
    structured_fields:
      standard:
        - "trace_id"
        - "span_id"
        - "service"
        - "environment"
      domain:
        - "order_id"
        - "customer_id (hashed)"
    sampling:
      debug_logs: "1% in production"

  cost_estimate:
    monthly:
      metrics: "$30"
      traces: "$200"
      logs: "$150"
      total: "$380"

  review_schedule:
    frequency: "Quarterly"
    metrics_to_review:
      - "Cardinality growth"
      - "Data volume"
      - "Cost vs budget"

Related Skills

observability-patterns - Three pillars overview
distributed-tracing - Trace implementation details
slo-sli-error-budget - What to measure for SLOs

Last Updated: 2025-12-26

instrumentation-planning

Instrumentation Planning

When to Use This Skill

Instrumentation Strategy Framework

What to Instrument

Instrumentation Decision Matrix

Naming Conventions

Metric Naming

Span Naming

Log Field Naming

Cardinality Management

Understanding Cardinality

Cardinality Budget

Cardinality Reduction Techniques

Instrumentation Budget

Performance Impact

Cost Planning

Instrumentation Plan Template

Related Skills

Similar Skills