From vfm-agent-company
Production observability from Google SRE and Netflix. Use when implementing structured logging, setting up metrics collection (Prometheus, Datadog, CloudWatch), configuring distributed tracing (Jaeger, OpenTelemetry), creating dashboards (Grafana), defining alert rules, or building observability pipelines. Triggers on logging, monitoring, tracing, metrics, alerts, dashboards, Prometheus, Grafana, OpenTelemetry, or production observability.
Install: `npx claudepluginhub duylinhdang1998/claude-template-agent --plugin vfm-agent-company`. This skill uses the workspace's default tool permissions.
**Purpose**: Implement comprehensive observability for production systems with logs, metrics, and distributed tracing
**Agent**: Google SRE / Netflix Backend Architect
**Use When**: Setting up monitoring, debugging production issues, or ensuring system reliability
Structured Logging (Pino):

```javascript
import pino from 'pino'

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label })
  }
})

// Structured logs (JSON)
logger.info({ userId: 123, action: 'login' }, 'User logged in')
logger.error({ error: err, userId: 123 }, 'Failed to process payment')

// Request logging middleware
app.use((req, res, next) => {
  const startTime = Date.now()
  req.log = logger.child({
    requestId: crypto.randomUUID(),
    method: req.method,
    url: req.url,
    ip: req.ip
  })
  req.log.info('Request started')
  res.on('finish', () => {
    req.log.info({
      statusCode: res.statusCode,
      duration: Date.now() - startTime
    }, 'Request completed')
  })
  next()
})
```
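Because each structured log entry is a standalone JSON object on one line, logs can be filtered with any JSON tooling. A minimal sketch in plain Node (no pino; the field names mirror the example above):

```javascript
// Each structured log line is a standalone JSON object.
const lines = [
  '{"level":"info","userId":123,"action":"login","msg":"User logged in"}',
  '{"level":"error","userId":123,"msg":"Failed to process payment"}'
]

// Parse every line and keep only error-level records.
const errors = lines
  .map((line) => JSON.parse(line))
  .filter((record) => record.level === 'error')

console.log(errors.length)   // 1
console.log(errors[0].msg)   // Failed to process payment
```

This is why log aggregators (Loki, CloudWatch Logs Insights, Datadog) can index and query structured fields without regex parsing.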
Logging best practices: emit one JSON object per line, attach a request ID to every entry, log at the appropriate level, and never log secrets or credentials.

Metrics (Prometheus):
```javascript
import { register, Counter, Histogram, Gauge } from 'prom-client'

// HTTP request counter
const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status']
})

// HTTP request duration
const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
})

// Active connections
const activeConnections = new Gauge({
  name: 'active_connections',
  help: 'Number of active connections'
})

// Metrics middleware
app.use((req, res, next) => {
  const start = Date.now()
  activeConnections.inc()
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000
    httpRequestsTotal.inc({
      method: req.method,
      route: req.route?.path || req.path,
      status: res.statusCode
    })
    httpRequestDuration.observe({
      method: req.method,
      route: req.route?.path || req.path,
      status: res.statusCode
    }, duration)
    activeConnections.dec()
  })
  next()
})

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType)
  res.end(await register.metrics())
})
```
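Prometheus pulls from the `/metrics` endpoint above on a schedule; a minimal scrape-config sketch (job name, interval, and target are assumptions for this example):

```yaml
# prometheus.yml (illustrative values)
scrape_configs:
  - job_name: 'api'            # assumed job name
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ['localhost:3000']   # assumed app host:port
```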
Key metrics to track: request rate, error rate, and duration (the RED method), plus saturation signals such as active connections.

Distributed Tracing (OpenTelemetry):
```javascript
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node'
import { SimpleSpanProcessor } from '@opentelemetry/sdk-trace-base'
import { registerInstrumentations } from '@opentelemetry/instrumentation'
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http'
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express'
import { JaegerExporter } from '@opentelemetry/exporter-jaeger'
import { trace, context, SpanStatusCode } from '@opentelemetry/api'

// Set up tracer
const provider = new NodeTracerProvider()
provider.addSpanProcessor(
  new SimpleSpanProcessor(
    new JaegerExporter({
      endpoint: 'http://localhost:14268/api/traces'
    })
  )
)
provider.register()

// Auto-instrument HTTP and Express
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation()
  ]
})

// Manual instrumentation
const tracer = trace.getTracer('my-service')

app.post('/api/orders', async (req, res) => {
  const span = tracer.startSpan('create-order')
  try {
    span.setAttribute('userId', req.user.id)
    span.setAttribute('orderTotal', req.body.total)

    // Create order
    const order = await db.order.create({ data: req.body })

    // Child span for payment, parented via the context API
    const paymentSpan = tracer.startSpan(
      'process-payment',
      undefined,
      trace.setSpan(context.active(), span)
    )
    await processPayment(order.id)
    paymentSpan.end()

    span.setStatus({ code: SpanStatusCode.OK })
    res.json(order)
  } catch (error) {
    span.recordException(error)
    span.setStatus({ code: SpanStatusCode.ERROR })
    res.status(500).json({ error: 'Failed to create order' })
  } finally {
    span.end()
  }
})
```
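Trace context crosses service boundaries via the W3C `traceparent` header (`version-traceid-spanid-flags`). The HTTP instrumentation above propagates it automatically; a minimal parser sketch just to show the wire format (the example header is the canonical one from the W3C spec):

```javascript
// Parse a W3C traceparent header: version-traceid-spanid-flags
function parseTraceparent(header) {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header)
  if (!match) return null
  const [, version, traceId, spanId, flags] = match
  // Bit 0 of the flags byte is the "sampled" flag.
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 }
}

const ctx = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01')
console.log(ctx.traceId)   // 4bf92f3577b34da6a3ce929d0e0e4736
console.log(ctx.sampled)   // true
```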
Popular tools: Prometheus and Grafana for metrics, Jaeger and OpenTelemetry for tracing, Datadog and CloudWatch as managed platforms, Sentry for error tracking.

Error Tracking (Sentry):
```javascript
import * as Sentry from '@sentry/node'

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  tracesSampleRate: 0.1 // Sample 10% of transactions
})

// Sentry middleware (Handlers API, @sentry/node v7)
app.use(Sentry.Handlers.requestHandler())
app.use(Sentry.Handlers.tracingHandler())

// Error handler (must be last)
app.use(Sentry.Handlers.errorHandler())

// Manual error tracking
try {
  await dangerousOperation()
} catch (error) {
  Sentry.captureException(error, {
    user: { id: userId, email: userEmail },
    tags: { operation: 'payment' },
    extra: { orderId: order.id }
  })
}
```
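`Sentry.init` also accepts a `beforeSend` hook that can scrub sensitive fields before an event leaves the process (returning `null` drops the event). A minimal sketch of such a scrubber, written as a plain function so it can be shown standalone; the key names are assumptions:

```javascript
// Redact sensitive keys from a Sentry-style event's extra data.
// (Key names here are illustrative assumptions.)
const SENSITIVE_KEYS = ['password', 'token', 'authorization']

function scrubEvent(event) {
  if (!event.extra) return event
  for (const key of SENSITIVE_KEYS) {
    if (key in event.extra) event.extra[key] = '[redacted]'
  }
  return event
}

// Wiring it up would look like: Sentry.init({ dsn, beforeSend: scrubEvent })
const scrubbed = scrubEvent({ extra: { orderId: 42, token: 'abc123' } })
console.log(scrubbed.extra.token)     // [redacted]
console.log(scrubbed.extra.orderId)   // 42
```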
Health Checks:

```javascript
// Liveness probe (is the app running?)
app.get('/health/live', (req, res) => {
  res.json({ status: 'ok' })
})

// Readiness probe (is the app ready to serve traffic?)
app.get('/health/ready', async (req, res) => {
  try {
    // Check database
    await db.$queryRaw`SELECT 1`
    // Check Redis
    await redis.ping()
    // Check external APIs (fetch has no timeout option; use an AbortSignal)
    await fetch('https://api.example.com/health', { signal: AbortSignal.timeout(2000) })
    res.json({ status: 'ok', checks: { db: 'ok', redis: 'ok', api: 'ok' } })
  } catch (error) {
    res.status(503).json({ status: 'error', error: error.message })
  }
})
```
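The two endpoints map directly onto Kubernetes probe configuration; a sketch (port and timings are illustrative assumptions):

```yaml
# Kubernetes container probes pointing at the endpoints above
livenessProbe:
  httpGet:
    path: /health/live
    port: 3000          # assumed app port
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /health/ready
    port: 3000
  periodSeconds: 10
  failureThreshold: 3
```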
Alerting (Prometheus):

```yaml
# Prometheus alert rules
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses over 5 minutes
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"
      - alert: SlowResponse
        # p95 latency estimated from histogram buckets
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 10m
        annotations:
          summary: "95th percentile response time > 1s"
```
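`histogram_quantile` estimates a quantile from cumulative bucket counts by linearly interpolating inside the bucket where the target rank falls. A simplified sketch of that logic (not the Prometheus implementation, just the same interpolation idea):

```javascript
// Simplified histogram_quantile over cumulative buckets,
// each { le: upperBound, count: cumulativeCount }, sorted by le,
// with the last bucket's le being +Inf (as in Prometheus histograms).
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count
  const rank = q * total
  let prevLe = 0
  let prevCount = 0
  for (const { le, count } of buckets) {
    if (count >= rank) {
      // Cannot interpolate past the last finite bound.
      if (le === Infinity) return prevLe
      // Linear interpolation within this bucket.
      return prevLe + (le - prevLe) * ((rank - prevCount) / (count - prevCount))
    }
    prevLe = le
    prevCount = count
  }
  return prevLe
}

const buckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 100 },
  { le: Infinity, count: 100 }
]
console.log(histogramQuantile(0.95, buckets)) // ≈ 0.75
```

This is also why bucket boundaries matter: the estimate can be off by up to a bucket's width, so choose buckets (as in the Histogram above) around your SLO thresholds.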
Remember: You can't fix what you can't see. Implement observability from day one.
Created: 2026-02-04 Maintained By: Google SRE