Validates and optimizes observability instrumentation (traces, logs, metrics). Use when validating OpenTelemetry setup, checking trace coverage, ensuring log correlation, or auditing instrumentation completeness. Supports SLO accuracy and incident debugging.
npx claudepluginhub andercore-labs/claudes-kitchen --plugin operational-excellence

This skill uses the workspace's default tool permissions.
**SCOPE:** Trace, log, and metric instrumentation validation for production services.
PHILOSOPHY: "You can't debug what you can't see. Instrumentation gaps = blind spots during incidents."
Observability Pillars:
├── Traces (request flow, latency, errors)
├── Logs (structured events, correlation)
└── Metrics (aggregated counters, gauges)
Validation Focus:
├── Endpoint coverage (all routes instrumented?)
├── External calls (DB, HTTP, queues traced?)
├── Trace correlation (logs ↔ traces linked?)
└── Metric accuracy (status codes, durations)
New service deployment | SLO setup | Incident debugging gaps | OpenTelemetry migration | Instrumentation audit | False SLO signals
| Issue | Impact | Detection |
|---|---|---|
| Missing endpoint instrumentation | Blind spots in availability metrics | Routes absent from `http.route` tag values |
| No outbound call tracing | Cannot detect downstream failures | Missing http.client.requests |
| Missing status codes | Incorrect error rate calculations | Status code tags missing |
| No trace correlation | Cannot debug from metrics → logs | No trace_id in logs |
Check all HTTP endpoints instrumented:
# List all instrumented routes
dog metric query "sum:http.server.requests{service:my-service,env:prod} by {http.route}"
# Expected: All production API routes
# Example output:
# /api/v1/users
# /api/v1/orders
# /api/v1/checkout
# /health
# /metrics
Compare against code:
# Express.js routes (-E enables the (a|b) alternation syntax)
grep -rE "router\.(get|post|put|delete|patch)" src/ | grep -v test
# FastAPI routes
grep -rE "@app\.(get|post|put|delete|patch)" src/ | grep -v test
# Fastify routes
grep -rE "fastify\.(get|post|put|delete|patch)" src/ | grep -v test
Validation checklist:
- All routes emit `http.server.requests`
- Requests carry an `http.route` tag (not just the raw URL path)

Check outbound HTTP calls:
# List external dependencies
dog metric query "sum:http.client.requests{service:my-service,env:prod} by {http.url}"
# Expected: All downstream services
# Example: https://api.stripe.com, https://payment-gateway
Check database operations:
# PostgreSQL
dog metric query "sum:db.client.operations{service:my-service,db.system:postgresql}"
# MongoDB
dog metric query "sum:db.client.operations{service:my-service,db.system:mongodb}"
# Redis
dog metric query "sum:db.client.operations{service:my-service,db.system:redis}"
Check queue operations:
# Kafka
dog metric query "sum:messaging.operation{service:my-service,messaging.system:kafka}"
# RabbitMQ
dog metric query "sum:messaging.operation{service:my-service,messaging.system:rabbitmq}"
Validation checklist:
- Outbound HTTP calls emit `http.client.requests`
- Database operations emit `db.client.operations` per `db.system`
- Queue operations emit `messaging.operation` per `messaging.system`
Verify trace IDs in logs:
# Sample recent logs
kubectl logs deployment/my-service -n prod --tail=10
# Expected: JSON structured logs with trace_id
{
"timestamp": "2025-01-15T14:30:00Z",
"level": "info",
"message": "User checkout completed",
"trace_id": "1234567890abcdef",
"span_id": "abcdef1234567890"
}
Verify Datadog correlation:
1. Datadog APM → Traces → Find error trace
2. Click "View Logs" button
3. Logs auto-filter by trace_id
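If correlation still fails even though logs carry a `trace_id`, one common cause is ID format: OpenTelemetry emits 128-bit hex trace IDs, while Datadog's `dd.trace_id` log attribute expects the lower 64 bits as a decimal string. A conversion sketch (the helper name is illustrative):

```python
def otel_to_datadog_trace_id(otel_trace_id_hex: str) -> str:
    """Convert a 128-bit hex OpenTelemetry trace ID to the decimal
    lower-64-bit form Datadog expects in the dd.trace_id field."""
    full_id = int(otel_trace_id_hex, 16)
    lower_64 = full_id & ((1 << 64) - 1)  # keep only the low 64 bits
    return str(lower_64)

print(otel_to_datadog_trace_id("1234567890abcdef1234567890abcdef"))
```

Many Datadog SDK integrations perform this mapping automatically; the helper is only needed when wiring log correlation by hand.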
Validation checklist:
- Every log line carries a `trace_id` field
- Every log line carries a `span_id` field

Check logs are JSON formatted:
# Sample logs
kubectl logs deployment/my-service -n prod --tail=5
# Expected: Valid JSON per line
{"timestamp":"2025-01-15T14:30:00Z","level":"info","message":"Request processed"}
Required fields:
| Field | Purpose | Example |
|---|---|---|
| timestamp | ISO8601 time | 2025-01-15T14:30:00Z |
| level | Log severity | info, warn, error |
| message | Human-readable | User checkout completed |
| trace_id | Trace correlation | 1234567890abcdef |
| span_id | Span correlation | abcdef1234567890 |
| service | Service name | my-service |
| env | Environment | prod |
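A minimal logger emitting the fields above can be sketched with the Python standard library. In a real service `trace_id` and `span_id` would come from the active span; here they are passed in explicitly for illustration:

```python
import json
import sys
from datetime import datetime, timezone

def log(level: str, message: str, trace_id: str = "", span_id: str = "",
        service: str = "my-service", env: str = "prod") -> str:
    """Emit one JSON log line containing all required correlation fields."""
    line = json.dumps({
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "level": level,
        "message": message,
        "trace_id": trace_id,
        "span_id": span_id,
        "service": service,
        "env": env,
    })
    print(line, file=sys.stdout)
    return line

log("info", "User checkout completed",
    trace_id="1234567890abcdef", span_id="abcdef1234567890")
```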
Validation checklist:
- Every log line is valid JSON (one object per line)
- All required fields from the table above are present
Scan for common PII patterns:
# Credit card patterns
kubectl logs deployment/my-service -n prod | grep -E "\b[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}\b"
# Email addresses
kubectl logs deployment/my-service -n prod | grep -E "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"
# Phone numbers
kubectl logs deployment/my-service -n prod | grep -E "\b[0-9]{3}[-.]?[0-9]{3}[-.]?[0-9]{4}\b"
# JWT tokens
kubectl logs deployment/my-service -n prod | grep -E "eyJ[A-Za-z0-9_-]*\."
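The greps above can be bundled into a single scanner, e.g. for a CI check over sampled logs. A sketch using the same patterns (the function and pattern names are illustrative):

```python
import re

# Same patterns as the greps above, compiled once.
PII_PATTERNS = {
    "credit_card": re.compile(r"\b[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}\b"),
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "phone": re.compile(r"\b[0-9]{3}[-.]?[0-9]{3}[-.]?[0-9]{4}\b"),
    "jwt": re.compile(r"eyJ[A-Za-z0-9_-]*\."),
}

def scan_line(line: str) -> list:
    """Return the names of PII patterns that match a log line."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(line)]

print(scan_line('{"message": "card 4111-1111-1111-1111 charged"}'))
```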
Validation checklist:
- No credit card numbers, email addresses, or phone numbers in logs
- No JWTs, API keys, or other secrets in logs
Symptom:
dog metric query "sum:http.server.requests{service:my-service} by {http.route}"
# Returns: http.route:/* (all routes as wildcard)
Impact: Cannot measure per-endpoint SLOs. High-latency endpoints hidden.
Fix:
// Express - correct middleware order
app.use(routerMiddleware); // Must come first
app.use(otelMiddleware);
Symptom:
dog metric query "sum:db.client.operations{service:my-service}"
# Returns: No data found
Impact: Cannot detect DB slowness as root cause.
Fix:
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { PgInstrumentation } from '@opentelemetry/instrumentation-pg';

registerInstrumentations({ instrumentations: [new PgInstrumentation()] });
Symptom:
dog metric query "sum:http.server.requests{service:my-service,http.response.status_code:5*}"
# Returns: No data
# But logs show 500 errors
Impact: Availability SLO shows 100% while users see errors.
Fix: Verify HTTP instrumentation captures status codes automatically.
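The distortion is easy to quantify: untagged 5xx responses simply fall out of the error-rate numerator. A toy calculation (the numbers are illustrative):

```python
# Illustrative numbers: 10,000 requests, 200 of them 5xx.
total_requests = 10_000
server_errors = 200

# With status codes captured, availability reflects reality.
true_availability = 1 - server_errors / total_requests

# Without status-code tags, the 5xx bucket is empty and the
# SLO dashboard reports a perfect score.
reported_availability = 1 - 0 / total_requests

print(round(true_availability, 4), reported_availability)  # 0.98 1.0
```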
Symptom:
kubectl logs deployment/my-service | grep "trace_id"
# Returns: (nothing)
Impact: Cannot jump from trace → logs during debugging.
Fix:
import { trace } from '@opentelemetry/api';
const span = trace.getActiveSpan();
logger.info('Event', {
trace_id: span?.spanContext().traceId,
span_id: span?.spanContext().spanId
});
import { NodeSDK } from '@opentelemetry/sdk-node';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express';
import { PgInstrumentation } from '@opentelemetry/instrumentation-pg';
const sdk = new NodeSDK({
instrumentations: [
new HttpInstrumentation(), // HTTP inbound/outbound
new ExpressInstrumentation(), // Express routes
new PgInstrumentation() // PostgreSQL
]
});
sdk.start();
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor

FastAPIInstrumentor.instrument_app(app)
RequestsInstrumentor().instrument()
Psycopg2Instrumentor().instrument()
import { NodeSDK } from '@opentelemetry/sdk-node';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { FastifyInstrumentation } from '@opentelemetry/instrumentation-fastify';
const sdk = new NodeSDK({
instrumentations: [
new HttpInstrumentation(),
new FastifyInstrumentation()
]
});
sdk.start();
# Step 1: Verify metrics exist
dog metric query "sum:http.server.requests{service:NEW_SERVICE,env:prod}"
# Step 2: Check route coverage
dog metric query "sum:http.server.requests{service:NEW_SERVICE} by {http.route}"
# Step 3: Verify status codes
dog metric query "sum:http.server.requests{service:NEW_SERVICE} by {http.response.status_code}"
# Step 4: Check external calls
dog metric query "sum:http.client.requests{service:NEW_SERVICE} by {http.url}"
# Step 5: Verify log structure
kubectl logs deployment/NEW_SERVICE -n prod --tail=10 | jq .
# Deploy to staging
# Generate traffic (load test)
# Verify new routes appear
dog metric query "sum:http.server.requests{service:my-service,env:stg} by {http.route}"
# Check for gaps
# Expected: All new routes with status codes
# Daily: Compare routes in code vs Datadog
diff <(grep -r "router\.get" src/ | wc -l) \
<(dog metric query "sum:http.server.requests{service:my-service} by {http.route}" | wc -l)
# Weekly: Check external dependencies
dog metric query "sum:http.client.requests{service:my-service} by {http.url}"
# Monthly: Verify trace correlation
kubectl logs deployment/my-service -n prod --tail=100 | grep -c "trace_id"
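The daily check above only diffs line counts, so a renamed route can mask a missing one. A set comparison is more precise; a sketch assuming the two route lists have already been extracted (from the grep output and the Datadog query, respectively):

```python
def find_instrumentation_gaps(code_routes, metric_routes):
    """Compare routes declared in code against routes seen in metrics."""
    return {
        # Declared in code but never reported: an instrumentation gap.
        "uninstrumented": set(code_routes) - set(metric_routes),
        # Reported in metrics but absent from code: likely stale or renamed.
        "stale": set(metric_routes) - set(code_routes),
    }

gaps = find_instrumentation_gaps(
    code_routes={"/api/v1/users", "/api/v1/orders", "/api/v1/checkout"},
    metric_routes={"/api/v1/users", "/api/v1/orders", "/health"},
)
print(gaps)
```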
OpenTelemetry: https://opentelemetry.io/docs/instrumentation/
Datadog APM: https://docs.datadoghq.com/tracing/
Semantic Conventions: https://opentelemetry.io/docs/specs/semconv/http/
Trace Context: https://www.w3.org/TR/trace-context/
Detailed Patterns: See trace-validation.md, log-validation.md