From vfm-agent-company
Production observability from Google SRE and Netflix. Use when implementing structured logging, setting up metrics collection (Prometheus, Datadog, CloudWatch), configuring distributed tracing (Jaeger, OpenTelemetry), creating dashboards (Grafana), defining alert rules, or building observability pipelines. Triggers on logging, monitoring, tracing, metrics, alerts, dashboards, Prometheus, Grafana, OpenTelemetry, or production observability.
Install: `npx claudepluginhub duylinhdang1998/claude-template-agent --plugin vfm-agent-company`. This skill uses the workspace's default tool permissions.
**Purpose**: Implement comprehensive observability for production systems with logs, metrics, and distributed tracing
**Agent**: Google SRE / Netflix Backend Architect
**Use When**: Setting up monitoring, debugging production issues, or ensuring system reliability
Structured Logging (Pino):

```javascript
import pino from 'pino'

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label })
  }
})

// Structured logs (JSON)
logger.info({ userId: 123, action: 'login' }, 'User logged in')
logger.error({ error: err, userId: 123 }, 'Failed to process payment')

// Request logging middleware
app.use((req, res, next) => {
  const startTime = Date.now()
  req.log = logger.child({
    requestId: crypto.randomUUID(),
    method: req.method,
    url: req.url,
    ip: req.ip
  })
  req.log.info('Request started')
  res.on('finish', () => {
    req.log.info({
      statusCode: res.statusCode,
      duration: Date.now() - startTime
    }, 'Request completed')
  })
  next()
})
```
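Because each structured log entry is a standalone JSON object on one line, logs can be filtered with any JSON tooling. A minimal sketch in plain Node (no pino; the field names mirror the example above):

```javascript
// Each structured log line is a standalone JSON object.
const lines = [
  '{"level":"info","userId":123,"action":"login","msg":"User logged in"}',
  '{"level":"error","userId":123,"msg":"Failed to process payment"}'
]

// Parse every line and keep only error-level records.
const errors = lines
  .map((line) => JSON.parse(line))
  .filter((record) => record.level === 'error')

console.log(errors.length)   // 1
console.log(errors[0].msg)   // Failed to process payment
```

This is why log aggregators (Loki, CloudWatch Logs Insights, Datadog) can index and query structured fields without regex parsing.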
Logging best practices: emit one JSON object per line, attach a request ID to every entry, log at the appropriate level, and never log secrets or credentials.

Metrics (Prometheus):
```javascript
import { register, Counter, Histogram, Gauge } from 'prom-client'

// HTTP request counter
const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status']
})

// HTTP request duration
const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
})

// Active connections
const activeConnections = new Gauge({
  name: 'active_connections',
  help: 'Number of active connections'
})

// Metrics middleware
app.use((req, res, next) => {
  const start = Date.now()
  activeConnections.inc()
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000
    httpRequestsTotal.inc({
      method: req.method,
      route: req.route?.path || req.path,
      status: res.statusCode
    })
    httpRequestDuration.observe({
      method: req.method,
      route: req.route?.path || req.path,
      status: res.statusCode
    }, duration)
    activeConnections.dec()
  })
  next()
})

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType)
  res.end(await register.metrics())
})
```
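Prometheus pulls from the `/metrics` endpoint above on a schedule; a minimal scrape-config sketch (job name, interval, and target are assumptions for this example):

```yaml
# prometheus.yml (illustrative values)
scrape_configs:
  - job_name: 'api'            # assumed job name
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ['localhost:3000']   # assumed app host:port
```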
Key metrics to track: request rate, error rate, and duration (the RED method), plus saturation signals such as active connections.

Distributed Tracing (OpenTelemetry):
```javascript
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node'
import { SimpleSpanProcessor } from '@opentelemetry/sdk-trace-base'
import { registerInstrumentations } from '@opentelemetry/instrumentation'
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http'
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express'
import { JaegerExporter } from '@opentelemetry/exporter-jaeger'
import { trace, context, SpanStatusCode } from '@opentelemetry/api'

// Set up tracer
const provider = new NodeTracerProvider()
provider.addSpanProcessor(
  new SimpleSpanProcessor(
    new JaegerExporter({
      endpoint: 'http://localhost:14268/api/traces'
    })
  )
)
provider.register()

// Auto-instrument HTTP and Express
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation()
  ]
})

// Manual instrumentation
const tracer = trace.getTracer('my-service')

app.post('/api/orders', async (req, res) => {
  const span = tracer.startSpan('create-order')
  try {
    span.setAttribute('userId', req.user.id)
    span.setAttribute('orderTotal', req.body.total)

    // Create order
    const order = await db.order.create({ data: req.body })

    // Child span for payment, parented via the context API
    const paymentSpan = tracer.startSpan(
      'process-payment',
      undefined,
      trace.setSpan(context.active(), span)
    )
    await processPayment(order.id)
    paymentSpan.end()

    span.setStatus({ code: SpanStatusCode.OK })
    res.json(order)
  } catch (error) {
    span.recordException(error)
    span.setStatus({ code: SpanStatusCode.ERROR })
    res.status(500).json({ error: 'Failed to create order' })
  } finally {
    span.end()
  }
})
```
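Trace context crosses service boundaries via the W3C `traceparent` header (`version-traceid-spanid-flags`). The HTTP instrumentation above propagates it automatically; a minimal parser sketch just to show the wire format (the example header is the canonical one from the W3C spec):

```javascript
// Parse a W3C traceparent header: version-traceid-spanid-flags
function parseTraceparent(header) {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header)
  if (!match) return null
  const [, version, traceId, spanId, flags] = match
  // Bit 0 of the flags byte is the "sampled" flag.
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 }
}

const ctx = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01')
console.log(ctx.traceId)   // 4bf92f3577b34da6a3ce929d0e0e4736
console.log(ctx.sampled)   // true
```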
Popular tools: Prometheus and Grafana for metrics, Jaeger and OpenTelemetry for tracing, Datadog and CloudWatch as managed platforms, Sentry for error tracking.

Error Tracking (Sentry):
```javascript
import * as Sentry from '@sentry/node'

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  tracesSampleRate: 0.1 // Sample 10% of transactions
})

// Sentry middleware (Handlers API, @sentry/node v7)
app.use(Sentry.Handlers.requestHandler())
app.use(Sentry.Handlers.tracingHandler())

// Error handler (must be last)
app.use(Sentry.Handlers.errorHandler())

// Manual error tracking
try {
  await dangerousOperation()
} catch (error) {
  Sentry.captureException(error, {
    user: { id: userId, email: userEmail },
    tags: { operation: 'payment' },
    extra: { orderId: order.id }
  })
}
```
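`Sentry.init` also accepts a `beforeSend` hook that can scrub sensitive fields before an event leaves the process (returning `null` drops the event). A minimal sketch of such a scrubber, written as a plain function so it can be shown standalone; the key names are assumptions:

```javascript
// Redact sensitive keys from a Sentry-style event's extra data.
// (Key names here are illustrative assumptions.)
const SENSITIVE_KEYS = ['password', 'token', 'authorization']

function scrubEvent(event) {
  if (!event.extra) return event
  for (const key of SENSITIVE_KEYS) {
    if (key in event.extra) event.extra[key] = '[redacted]'
  }
  return event
}

// Wiring it up would look like: Sentry.init({ dsn, beforeSend: scrubEvent })
const scrubbed = scrubEvent({ extra: { orderId: 42, token: 'abc123' } })
console.log(scrubbed.extra.token)     // [redacted]
console.log(scrubbed.extra.orderId)   // 42
```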
Health Checks:

```javascript
// Liveness probe (is the app running?)
app.get('/health/live', (req, res) => {
  res.json({ status: 'ok' })
})

// Readiness probe (is the app ready to serve traffic?)
app.get('/health/ready', async (req, res) => {
  try {
    // Check database
    await db.$queryRaw`SELECT 1`
    // Check Redis
    await redis.ping()
    // Check external APIs (fetch has no timeout option; use an AbortSignal)
    await fetch('https://api.example.com/health', { signal: AbortSignal.timeout(2000) })
    res.json({ status: 'ok', checks: { db: 'ok', redis: 'ok', api: 'ok' } })
  } catch (error) {
    res.status(503).json({ status: 'error', error: error.message })
  }
})
```
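The two endpoints map directly onto Kubernetes probe configuration; a sketch (port and timings are illustrative assumptions):

```yaml
# Kubernetes container probes pointing at the endpoints above
livenessProbe:
  httpGet:
    path: /health/live
    port: 3000          # assumed app port
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /health/ready
    port: 3000
  periodSeconds: 10
  failureThreshold: 3
```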
Alerting (Prometheus):

```yaml
# Prometheus alert rules
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses over 5 minutes
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"
      - alert: SlowResponse
        # p95 latency estimated from histogram buckets
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 10m
        annotations:
          summary: "95th percentile response time > 1s"
```
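`histogram_quantile` estimates a quantile from cumulative bucket counts by linearly interpolating inside the bucket where the target rank falls. A simplified sketch of that logic (not the Prometheus implementation, just the same interpolation idea):

```javascript
// Simplified histogram_quantile over cumulative buckets,
// each { le: upperBound, count: cumulativeCount }, sorted by le,
// with the last bucket's le being +Inf (as in Prometheus histograms).
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count
  const rank = q * total
  let prevLe = 0
  let prevCount = 0
  for (const { le, count } of buckets) {
    if (count >= rank) {
      // Cannot interpolate past the last finite bound.
      if (le === Infinity) return prevLe
      // Linear interpolation within this bucket.
      return prevLe + (le - prevLe) * ((rank - prevCount) / (count - prevCount))
    }
    prevLe = le
    prevCount = count
  }
  return prevLe
}

const buckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 100 },
  { le: Infinity, count: 100 }
]
console.log(histogramQuantile(0.95, buckets)) // ≈ 0.75
```

This is also why bucket boundaries matter: the estimate can be off by up to a bucket's width, so choose buckets (as in the Histogram above) around your SLO thresholds.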
Remember: You can't fix what you can't see. Implement observability from day one.
Created: 2026-02-04 Maintained By: Google SRE