Validates and optimizes observability instrumentation (traces, logs, metrics). Use when validating OpenTelemetry setup, checking trace coverage, ensuring log correlation, or auditing instrumentation completeness. Supports SLO accuracy and incident debugging.
npx claudepluginhub andercore-labs/claudes-kitchen --plugin operational-excellence

This skill uses the workspace's default tool permissions.
**SCOPE:** Trace, log, and metric instrumentation validation for production services.
PHILOSOPHY: "You can't debug what you can't see. Instrumentation gaps = blind spots during incidents."
Observability Pillars:
├── Traces (request flow, latency, errors)
├── Logs (structured events, correlation)
└── Metrics (aggregated counters, gauges)
Validation Focus:
├── Endpoint coverage (all routes instrumented?)
├── External calls (DB, HTTP, queues traced?)
├── Trace correlation (logs ↔ traces linked?)
└── Metric accuracy (status codes, durations)
New service deployment | SLO setup | Incident debugging gaps | OpenTelemetry migration | Instrumentation audit | False SLO signals
| Issue | Impact | Detection |
|---|---|---|
| Missing endpoint instrumentation | Blind spots in availability metrics | Routes absent from `http.route` tag values |
| No outbound call tracing | Cannot detect downstream failures | Missing http.client.requests |
| Missing status codes | Incorrect error rate calculations | Status code tags missing |
| No trace correlation | Cannot debug from metrics → logs | No trace_id in logs |
Check all HTTP endpoints instrumented:
# List all instrumented routes
dog metric query "sum:http.server.requests{service:my-service,env:prod} by {http.route}"
# Expected: All production API routes
# Example output:
# /api/v1/users
# /api/v1/orders
# /api/v1/checkout
# /health
# /metrics
Compare against code:
# Express.js routes (-E enables the (a|b) alternation syntax)
grep -rE "router\.(get|post|put|delete|patch)" src/ | grep -v test
# FastAPI routes
grep -rE "@app\.(get|post|put|delete|patch)" src/ | grep -v test
# Fastify routes
grep -rE "fastify\.(get|post|put|delete|patch)" src/ | grep -v test
Validation checklist:
- All routes emit `http.server.requests`
- Requests carry an `http.route` tag (not just the raw URL path)

Check outbound HTTP calls:
# List external dependencies
dog metric query "sum:http.client.requests{service:my-service,env:prod} by {http.url}"
# Expected: All downstream services
# Example: https://api.stripe.com, https://payment-gateway
Check database operations:
# PostgreSQL
dog metric query "sum:db.client.operations{service:my-service,db.system:postgresql}"
# MongoDB
dog metric query "sum:db.client.operations{service:my-service,db.system:mongodb}"
# Redis
dog metric query "sum:db.client.operations{service:my-service,db.system:redis}"
Check queue operations:
# Kafka
dog metric query "sum:messaging.operation{service:my-service,messaging.system:kafka}"
# RabbitMQ
dog metric query "sum:messaging.operation{service:my-service,messaging.system:rabbitmq}"
Validation checklist:
- Outbound HTTP calls emit `http.client.requests`
- Database operations emit `db.client.operations` per `db.system`
- Queue operations emit `messaging.operation` per `messaging.system`
Verify trace IDs in logs:
# Sample recent logs
kubectl logs deployment/my-service -n prod --tail=10
# Expected: JSON structured logs with trace_id
{
"timestamp": "2025-01-15T14:30:00Z",
"level": "info",
"message": "User checkout completed",
"trace_id": "1234567890abcdef",
"span_id": "abcdef1234567890"
}
Verify Datadog correlation:
1. Datadog APM → Traces → Find error trace
2. Click "View Logs" button
3. Logs auto-filter by trace_id
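If correlation still fails even though logs carry a `trace_id`, one common cause is ID format: OpenTelemetry emits 128-bit hex trace IDs, while Datadog's `dd.trace_id` log attribute expects the lower 64 bits as a decimal string. A conversion sketch (the helper name is illustrative):

```python
def otel_to_datadog_trace_id(otel_trace_id_hex: str) -> str:
    """Convert a 128-bit hex OpenTelemetry trace ID to the decimal
    lower-64-bit form Datadog expects in the dd.trace_id field."""
    full_id = int(otel_trace_id_hex, 16)
    lower_64 = full_id & ((1 << 64) - 1)  # keep only the low 64 bits
    return str(lower_64)

print(otel_to_datadog_trace_id("1234567890abcdef1234567890abcdef"))
```

Many Datadog SDK integrations perform this mapping automatically; the helper is only needed when wiring log correlation by hand.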
Validation checklist:
- Every log line carries a `trace_id` field
- Every log line carries a `span_id` field

Check logs are JSON formatted:
# Sample logs
kubectl logs deployment/my-service -n prod --tail=5
# Expected: Valid JSON per line
{"timestamp":"2025-01-15T14:30:00Z","level":"info","message":"Request processed"}
Required fields:
| Field | Purpose | Example |
|---|---|---|
| timestamp | ISO8601 time | 2025-01-15T14:30:00Z |
| level | Log severity | info, warn, error |
| message | Human-readable | User checkout completed |
| trace_id | Trace correlation | 1234567890abcdef |
| span_id | Span correlation | abcdef1234567890 |
| service | Service name | my-service |
| env | Environment | prod |
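A minimal logger emitting the fields above can be sketched with the Python standard library. In a real service `trace_id` and `span_id` would come from the active span; here they are passed in explicitly for illustration:

```python
import json
import sys
from datetime import datetime, timezone

def log(level: str, message: str, trace_id: str = "", span_id: str = "",
        service: str = "my-service", env: str = "prod") -> str:
    """Emit one JSON log line containing all required correlation fields."""
    line = json.dumps({
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "level": level,
        "message": message,
        "trace_id": trace_id,
        "span_id": span_id,
        "service": service,
        "env": env,
    })
    print(line, file=sys.stdout)
    return line

log("info", "User checkout completed",
    trace_id="1234567890abcdef", span_id="abcdef1234567890")
```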
Validation checklist:
- Every log line is valid JSON (one object per line)
- All required fields from the table above are present
Scan for common PII patterns:
# Credit card patterns
kubectl logs deployment/my-service -n prod | grep -E "\b[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}\b"
# Email addresses
kubectl logs deployment/my-service -n prod | grep -E "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"
# Phone numbers
kubectl logs deployment/my-service -n prod | grep -E "\b[0-9]{3}[-.]?[0-9]{3}[-.]?[0-9]{4}\b"
# JWT tokens
kubectl logs deployment/my-service -n prod | grep -E "eyJ[A-Za-z0-9_-]*\."
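The greps above can be bundled into a single scanner, e.g. for a CI check over sampled logs. A sketch using the same patterns (the function and pattern names are illustrative):

```python
import re

# Same patterns as the greps above, compiled once.
PII_PATTERNS = {
    "credit_card": re.compile(r"\b[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}\b"),
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "phone": re.compile(r"\b[0-9]{3}[-.]?[0-9]{3}[-.]?[0-9]{4}\b"),
    "jwt": re.compile(r"eyJ[A-Za-z0-9_-]*\."),
}

def scan_line(line: str) -> list:
    """Return the names of PII patterns that match a log line."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(line)]

print(scan_line('{"message": "card 4111-1111-1111-1111 charged"}'))
```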
Validation checklist:
- No credit card numbers, email addresses, or phone numbers in logs
- No JWTs, API keys, or other secrets in logs
Symptom:
dog metric query "sum:http.server.requests{service:my-service} by {http.route}"
# Returns: http.route:/* (all routes as wildcard)
Impact: Cannot measure per-endpoint SLOs. High-latency endpoints hidden.
Fix:
// Express - correct middleware order
app.use(routerMiddleware); // Must come first
app.use(otelMiddleware);
Symptom:
dog metric query "sum:db.client.operations{service:my-service}"
# Returns: No data found
Impact: Cannot detect DB slowness as root cause.
Fix:
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { PgInstrumentation } from '@opentelemetry/instrumentation-pg';

registerInstrumentations({ instrumentations: [new PgInstrumentation()] });
Symptom:
dog metric query "sum:http.server.requests{service:my-service,http.response.status_code:5*}"
# Returns: No data
# But logs show 500 errors
Impact: Availability SLO shows 100% while users see errors.
Fix: Verify HTTP instrumentation captures status codes automatically.
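The distortion is easy to quantify: untagged 5xx responses simply fall out of the error-rate numerator. A toy calculation (the numbers are illustrative):

```python
# Illustrative numbers: 10,000 requests, 200 of them 5xx.
total_requests = 10_000
server_errors = 200

# With status codes captured, availability reflects reality.
true_availability = 1 - server_errors / total_requests

# Without status-code tags, the 5xx bucket is empty and the
# SLO dashboard reports a perfect score.
reported_availability = 1 - 0 / total_requests

print(round(true_availability, 4), reported_availability)  # 0.98 1.0
```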
Symptom:
kubectl logs deployment/my-service | grep "trace_id"
# Returns: (nothing)
Impact: Cannot jump from trace → logs during debugging.
Fix:
import { trace } from '@opentelemetry/api';
const span = trace.getActiveSpan();
logger.info('Event', {
trace_id: span?.spanContext().traceId,
span_id: span?.spanContext().spanId
});
import { NodeSDK } from '@opentelemetry/sdk-node';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express';
import { PgInstrumentation } from '@opentelemetry/instrumentation-pg';
const sdk = new NodeSDK({
instrumentations: [
new HttpInstrumentation(), // HTTP inbound/outbound
new ExpressInstrumentation(), // Express routes
new PgInstrumentation() // PostgreSQL
]
});
sdk.start();
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor

FastAPIInstrumentor.instrument_app(app)
RequestsInstrumentor().instrument()
Psycopg2Instrumentor().instrument()
import { NodeSDK } from '@opentelemetry/sdk-node';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { FastifyInstrumentation } from '@opentelemetry/instrumentation-fastify';
const sdk = new NodeSDK({
instrumentations: [
new HttpInstrumentation(),
new FastifyInstrumentation()
]
});
sdk.start();
# Step 1: Verify metrics exist
dog metric query "sum:http.server.requests{service:NEW_SERVICE,env:prod}"
# Step 2: Check route coverage
dog metric query "sum:http.server.requests{service:NEW_SERVICE} by {http.route}"
# Step 3: Verify status codes
dog metric query "sum:http.server.requests{service:NEW_SERVICE} by {http.response.status_code}"
# Step 4: Check external calls
dog metric query "sum:http.client.requests{service:NEW_SERVICE} by {http.url}"
# Step 5: Verify log structure
kubectl logs deployment/NEW_SERVICE -n prod --tail=10 | jq .
# Deploy to staging
# Generate traffic (load test)
# Verify new routes appear
dog metric query "sum:http.server.requests{service:my-service,env:stg} by {http.route}"
# Check for gaps
# Expected: All new routes with status codes
# Daily: Compare routes in code vs Datadog
diff <(grep -r "router\.get" src/ | wc -l) \
<(dog metric query "sum:http.server.requests{service:my-service} by {http.route}" | wc -l)
# Weekly: Check external dependencies
dog metric query "sum:http.client.requests{service:my-service} by {http.url}"
# Monthly: Verify trace correlation
kubectl logs deployment/my-service -n prod --tail=100 | grep -c "trace_id"
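The daily check above only diffs line counts, so a renamed route can mask a missing one. A set comparison is more precise; a sketch assuming the two route lists have already been extracted (from the grep output and the Datadog query, respectively):

```python
def find_instrumentation_gaps(code_routes, metric_routes):
    """Compare routes declared in code against routes seen in metrics."""
    return {
        # Declared in code but never reported: an instrumentation gap.
        "uninstrumented": set(code_routes) - set(metric_routes),
        # Reported in metrics but absent from code: likely stale or renamed.
        "stale": set(metric_routes) - set(code_routes),
    }

gaps = find_instrumentation_gaps(
    code_routes={"/api/v1/users", "/api/v1/orders", "/api/v1/checkout"},
    metric_routes={"/api/v1/users", "/api/v1/orders", "/health"},
)
print(gaps)
```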
OpenTelemetry: https://opentelemetry.io/docs/instrumentation/
Datadog APM: https://docs.datadoghq.com/tracing/
Semantic Conventions: https://opentelemetry.io/docs/specs/semconv/http/
Trace Context: https://www.w3.org/TR/trace-context/
Detailed Patterns: See trace-validation.md, log-validation.md