Patterns for metrics, tracing, logging, alerting, dashboards, and SLO definition
npx claudepluginhub navraj007in/architecture-cowork-plugin --plugin architect
Comprehensive observability strategy: metrics collection, distributed tracing, structured logging, alerting rules, dashboard design, and SLO/SLA definitions.
Observability answers "what is happening in production?" across three pillars: metrics, traces, and logs.
This skill provides framework-agnostic patterns and thresholds for building each pillar.
For services handling requests, collect three metrics per endpoint (the RED method):
| Metric | Prometheus | Purpose | Query |
|---|---|---|---|
| Rate | http_requests_total | requests per second | rate(http_requests_total[1m]) |
| Errors | http_requests_total{status=~"5.."} | failed % | rate(http_requests_total{status=~"5.."}[1m]) / rate(http_requests_total[1m]) |
| Duration | http_request_duration_seconds | latency histogram | histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[1m])) |
Implementation pattern (Node.js + Prometheus client):
import promClient from 'prom-client';
const httpRequestDuration = new promClient.Histogram({
name: 'http_request_duration_seconds',
help: 'HTTP request latency in seconds',
labelNames: ['method', 'route', 'status_code'],
buckets: [0.001, 0.01, 0.05, 0.1, 0.5, 1, 2, 5]
});
const httpRequestTotal = new promClient.Counter({
name: 'http_requests_total',
help: 'Total HTTP requests',
labelNames: ['method', 'route', 'status_code']
});
// Middleware (Express)
app.use((req, res, next) => {
const startTime = Date.now();
res.on('finish', () => {
const duration = (Date.now() - startTime) / 1000;
    httpRequestDuration
      .labels(req.method, req.route?.path ?? req.url, String(res.statusCode))
      .observe(duration);
    httpRequestTotal
      .labels(req.method, req.route?.path ?? req.url, String(res.statusCode))
      .inc();
});
next();
});
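Prometheus pulls metrics over HTTP, so the app must also expose a scrape endpoint; a minimal sketch using prom-client's default registry (matching the `/metrics` path in the scrape config later in this document):

```ts
// Expose metrics for Prometheus to scrape
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.end(await promClient.register.metrics());
});
```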
For background workers or batch services, collect three metrics per resource (the USE method: Utilization, Saturation, Errors):
| Metric | Prometheus | Purpose | Alert threshold |
|---|---|---|---|
| Utilization | process_cpu_seconds_total, process_resident_memory_bytes | % of capacity used | CPU > 70%, Memory > 80% |
| Saturation | job_queue_length, db_connection_pool_active | items waiting | queue > 100 items |
| Errors | task_failures_total, db_connection_errors_total | failed operations | > 1% of operations |
Example (background worker):
const jobQueueLength = new promClient.Gauge({
name: 'job_queue_length',
help: 'Number of jobs waiting in queue'
});
const taskFailures = new promClient.Counter({
name: 'task_failures_total',
help: 'Total failed tasks',
labelNames: ['task_type', 'error_code']
});
// Worker loop
async function processJob(job: Job) {
try {
jobQueueLength.set(await queue.length());
await job.execute();
} catch (error) {
taskFailures.labels(job.type, error.code).inc();
throw error;
}
}
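Setting the gauge inside the job loop can go stale between jobs; prom-client also supports computing gauge values at scrape time via a `collect()` callback (a sketch, assuming the same `queue.length()` helper):

```ts
// Alternative: queue depth computed at scrape time rather than per processed job
const queueDepth = new promClient.Gauge({
  name: 'job_queue_length',
  help: 'Number of jobs waiting in queue',
  async collect() {
    this.set(await queue.length());
  }
});
```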
Monitor these four golden signals universally:
| Signal | Metric | MVP | Growth | Enterprise |
|---|---|---|---|---|
| Latency | p50, p95, p99 | p99 < 500ms | p99 < 200ms | p99 < 100ms |
| Traffic | requests/sec | > 0 | track trends | auto-scale rule |
| Errors | error rate % | < 1% | < 0.5% | < 0.1% |
| Saturation | queue depth, CPU % | manual review | alert @ 70% CPU | auto-scale @ 60% |
Prometheus dashboard queries:
# Latency (p99)
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Request rate
sum(rate(http_requests_total[1m]))
# Error rate
sum(rate(http_requests_total{status=~"5.."}[1m])) / sum(rate(http_requests_total[1m]))
# CPU utilization
rate(process_cpu_seconds_total[1m]) * 100
Standard attribute names ensure interoperability across tools (Datadog, Jaeger, New Relic, etc.).
HTTP attributes:
http.request.method "GET" | "POST" | ...
http.url "https://example.com/users?id=123"
http.target "/users?id=123" (path + query)
http.host "example.com"
http.status_code 200, 404, 500, ...
http.request.body.size bytes
http.response.body.size bytes
Database attributes:
db.system "postgresql" | "mysql" | "mongodb" | ...
db.name "users_db"
db.statement "SELECT * FROM users WHERE id = ?"
db.connection_string "postgresql://..." (scrubbed of credentials)
db.operation "SELECT" | "INSERT" | "UPDATE" | "DELETE"
db.rows_affected N
Service/span attributes:
service.name "api-server"
service.version "0.1.0"
trace.id unique identifier
span.id unique per span
span.parent_id parent span ID (absent for the root span)
span.kind "SERVER" | "CLIENT" | "INTERNAL" | "PRODUCER" | "CONSUMER"
span.status "OK" | "ERROR" | "UNSET"
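A sketch of attaching these conventions with the OpenTelemetry API (the span name and attribute values are illustrative):

```ts
import { trace, SpanKind, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('api-server');

// Record a DB call using the standard db.* attributes
const span = tracer.startSpan('SELECT users_db.users', {
  kind: SpanKind.CLIENT,
  attributes: {
    'db.system': 'postgresql',
    'db.name': 'users_db',
    'db.statement': 'SELECT * FROM users WHERE id = ?',
    'db.operation': 'SELECT'
  }
});
try {
  // ...execute the query...
  span.setStatus({ code: SpanStatusCode.OK });
} catch (err) {
  span.setStatus({ code: SpanStatusCode.ERROR });
  throw err;
} finally {
  span.end();
}
```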
Pass trace IDs across service boundaries for end-to-end visibility.
HTTP headers (W3C Trace Context standard):
traceparent: version-traceId-spanId-traceFlags
Example: traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
Implementation (Node.js + OpenTelemetry):
import { trace, SpanKind } from '@opentelemetry/api';

const tracer = trace.getTracer('api-server');

// Outbound HTTP client call
const span = tracer.startSpan('external-api-call', {
  kind: SpanKind.CLIENT,
  attributes: {
    'http.method': 'GET',
    'http.url': 'https://dependency-api.example.com/data'
  }
});
// With @opentelemetry/instrumentation-http registered, the traceparent
// header is injected into outbound fetch/axios calls automatically
const response = await fetch('https://dependency-api.example.com/data');
span.end();
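When auto-instrumentation is not in play, the OpenTelemetry propagation API can inject and extract the traceparent header manually (a sketch; plain-object header carriers assumed):

```ts
import { context, propagation } from '@opentelemetry/api';

// Server side: continue the caller's trace from incoming request headers
const extractedContext = propagation.extract(context.active(), req.headers);

// Client side: inject the current trace context into outgoing headers
const headers: Record<string, string> = {};
propagation.inject(context.active(), headers);
await fetch('https://dependency-api.example.com/data', { headers });
```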
Control what fraction of requests are traced (100% tracing is expensive; smart sampling targets slow/errored requests).
Sampling strategies by stage:
| Stage | Sampling rule | Example |
|---|---|---|
| MVP | 100% (trace all) | ratio=1.0 — smaller user base, capture everything |
| Growth | Error + tail latency | Trace all errors + slowest 5% (ratio=0.05 for success paths) |
| Enterprise | Intelligent sampling | Datadog intelligent sampling, New Relic tail-based sampling |
Implementation (Node.js):
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
import { TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

const provider = new NodeTracerProvider({
  sampler: new TraceIdRatioBasedSampler(0.1) // 10% sampling for growth stage
});
Tail-based sampling rules (Enterprise):
# Example: sample all errors + requests with latency between 1s and 5s
tail_sampling:
  policies:
    - name: error-policy
      type: status_code
      status_code:
        status_codes: [ERROR]
    - name: latency-policy
      type: latency
      latency:
        threshold_ms: 1000
        upper_threshold_ms: 5000
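The processor takes effect only when wired into a collector pipeline; a minimal sketch assuming OTLP receivers and exporters:

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp]
```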
Use standardized levels consistently:
| Level | When to use | Example |
|---|---|---|
| DEBUG | Development only; very detailed state | db_query_params: {...}, cache_hit: true |
| INFO | Notable events in request flow | "user created", "payment processed", "email sent" |
| WARN | Recoverable issue needing attention | "retry after 3 failures", "fallback to default value" |
| ERROR | Request/operation failed, needs investigation | "database connection timeout", "payment gateway returned 500" |
| FATAL | Service cannot continue, immediate human action needed | "out of disk space", "database unreachable" |
Use consistent field names across all services for dashboard aggregation:
Request context (always include in every log):
{
"trace_id": "0af7651916cd43dd",
"span_id": "b7ad6b716920",
"request_id": "req-xyz",
"user_id": "user-123",
"tenant_id": "org-456",
"session_id": "sess-789"
}
Business context:
{
"entity_type": "order",
"entity_id": "order-001",
"action": "create",
"status": "pending",
"amount": 123.45,
"currency": "USD"
}
Execution context:
{
"service": "api-server",
"version": "0.1.0",
"environment": "production",
"function": "processPayment",
"duration_ms": 234
}
Error context (only when level is ERROR or FATAL):
{
"error_code": "PAYMENT_TIMEOUT",
"error_message": "Payment gateway did not respond within 30s",
"error_stack": "...stack trace...",
"retry_count": 2,
"retriable": true
}
Node.js (Winston):
import winston from 'winston';
const logger = winston.createLogger({
level: process.env.LOG_LEVEL || 'info',
format: winston.format.json(),
defaultMeta: {
service: 'api-server',
version: process.env.APP_VERSION,
environment: process.env.NODE_ENV
},
transports: [
new winston.transports.Console({
format: winston.format.simple()
}),
new winston.transports.File({
filename: 'logs/error.log',
level: 'error',
format: winston.format.json()
})
]
});
// Log with context
logger.info('order created', {
trace_id: req.traceId,
user_id: req.userId,
entity_type: 'order',
entity_id: orderId,
amount: 123.45,
duration_ms: Date.now() - startTime
});
logger.error('payment failed', {
trace_id: req.traceId,
error_code: error.code,
error_message: error.message,
retry_count: retries,
retriable: error.retriable
});
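To avoid repeating request context on every call, winston's child loggers bind it once (a sketch; `req.traceId` and `req.userId` assumed set by middleware):

```ts
// Request-scoped logger: bound fields appear on every subsequent entry
const reqLogger = logger.child({ trace_id: req.traceId, user_id: req.userId });
reqLogger.info('order created', { entity_type: 'order', entity_id: orderId });
```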
Python (structlog):
import structlog
logger = structlog.get_logger()
logger.info(
"order_created",
trace_id=trace_id,
user_id=user_id,
entity_type="order",
entity_id=order_id,
amount=123.45,
duration_ms=elapsed_time
)
logger.error(
"payment_failed",
trace_id=trace_id,
error_code=error.code,
error_message=str(error),
retry_count=retries
)
Log aggregation (Loki/ELK/Datadog):
All logs are JSON-serialized. Log aggregation systems parse and index these fields automatically.
Example query (Grafana Loki):
{service="api-server"} | json | level="error" | error_code != ""
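Parsed fields also drive log-based metric queries, e.g. error log rate grouped by code (assuming the JSON fields above):

```
sum by (error_code) (rate({service="api-server"} | json | level="error" [5m]))
```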
Use these thresholds as baselines; adjust per service's SLA.
MVP stage:
| Alert | Threshold | Duration | Action |
|---|---|---|---|
| Service Down | Status code 5xx > 10% | 1 min | Page on-call |
| High Latency | p95 > 500ms | 5 min | Monitor; page if sustained |
| Error Spike | 5x baseline error rate | 2 min | Page on-call |
Growth stage:
| Alert | Threshold | Duration | Action |
|---|---|---|---|
| Service Down | Status code 5xx > 5% | 2 min | Page on-call |
| High Latency | p95 > 200ms | 10 min | Page on-call |
| High Latency | p99 > 500ms | 5 min | Page on-call |
| Error Spike | Error rate > 1% | 5 min | Create incident |
| CPU Saturation | > 70% for 10 min | 10 min | Auto-scale or page |
| Memory Leak | Memory rising > 50 MiB/hour | sustained | Page on-call |
Enterprise stage:
| Alert | Threshold | Duration | Action |
|---|---|---|---|
| Service Degradation | Status code 5xx > 0.5% | 1 min | Create incident |
| Latency SLO Miss | p99 > SLO target | 5 min | Create incident |
| Error Budget Burn | Consumed > 10%/day | realtime | Page on-call |
| CPU Saturation | > 60% for 5 min | sustained | Auto-scale |
| Database Connection Pool | Active > 80% | 5 min | Page DBA |
| Disk Space | Free < 10% | realtime | Critical alert |
File location: monitoring/alerts/rules.yaml or integrated into Prometheus config
groups:
- name: api-server
interval: 30s
rules:
# MVP threshold
- alert: HighErrorRate
expr: |
(sum(rate(http_requests_total{status=~"5.."}[1m]))
/ sum(rate(http_requests_total[1m]))) > 0.1
for: 1m
annotations:
summary: "High error rate (>10%) in api-server"
description: "Error rate: {{ $value | humanizePercentage }}"
# Growth threshold
- alert: HighLatencyP95
expr: |
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.2
for: 10m
annotations:
summary: "High latency (p95 > 200ms)"
# CPU saturation (growth stage)
- alert: HighCPUUsage
expr: |
rate(process_cpu_seconds_total[1m]) > 0.7
for: 10m
annotations:
summary: "CPU utilization > 70%"
      # Memory leak detection (resident memory is a gauge, so use delta, not rate)
      - alert: MemoryLeak
        expr: |
          delta(process_resident_memory_bytes[1h]) > 52428800  # 50 MiB growth per hour
for: 30m
annotations:
summary: "Possible memory leak (memory growing > 50 MiB/hour)"
| Term | Means | Owner | Consequence |
|---|---|---|---|
| SLO (Service Level Objective) | Internal goal (e.g., "99% uptime") | Engineering team | Guides investment + on-call |
| SLA (Service Level Agreement) | Promise to customers; has penalties | Product/Legal | Financial penalties |
SLOs are typically more aggressive than SLAs (e.g., SLO 99.9% vs. SLA 99% leaves a 0.9-point buffer).
## [Service Name] SLO
### Availability
- **Target:** 99.5% uptime
- **Budget:** 216 minutes (3.6 hours) downtime per 30-day month
- **Measurement:** HTTP status 2xx or 3xx / total requests
### Latency
- **Target:** 95th percentile < 200ms
- **Measurement:** p95(request duration) measured over 1-minute windows
### Error Rate
- **Target:** Error rate < 0.5%
- **Measurement:** 5xx responses / total requests
### Duration & Review
- **Quarter:** Q2 2026 (Apr-Jun)
- **Review:** Monthly; escalate if budget consumed > 33%/month
Once the SLO is defined, calculate the "error budget" (how much failure is acceptable):
Availability SLO: 99.5%
Allowed downtime per month: (1 - 0.995) × 30 days × 24 hours = 3.6 hours ≈ 216 minutes
If 5 minutes of unexpected downtime occurs on April 3:
Remaining budget: 216 - 5 = 211 minutes for the rest of April
% consumed: (5 / 216) × 100 ≈ 2.3% of monthly budget
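A tiny helper makes the arithmetic reusable (a sketch; function and field names are illustrative):

```ts
// Error budget arithmetic for an availability SLO over a 30-day month
function errorBudget(sloTarget: number, downtimeMinutes: number) {
  const monthMinutes = 30 * 24 * 60;                     // 43,200 min
  const budgetMinutes = (1 - sloTarget) * monthMinutes;  // allowed downtime
  return {
    budgetMinutes,
    remainingMinutes: budgetMinutes - downtimeMinutes,
    percentConsumed: (downtimeMinutes / budgetMinutes) * 100
  };
}

// 99.5% SLO with 5 min of downtime → 216 min budget, ~2.3% consumed
console.log(errorBudget(0.995, 5));
```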
Decision rule: while budget remains, keep shipping normally; if the burn rate threatens to exhaust it (e.g., the > 33%/month escalation above), freeze risky releases and prioritize reliability work.
Define SLOs per service type: a REST API service, background worker, web frontend, and database each warrant their own availability, latency, and error-rate targets.
Prometheus configuration (prometheus.yml):
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'api-server'
static_configs:
- targets: ['localhost:3000']
metrics_path: '/metrics'
scrape_interval: 10s
- job_name: 'database'
static_configs:
- targets: ['localhost:9187'] # postgres_exporter
Grafana dashboard (JSON):
{
"dashboard": {
"title": "API Server — RED Metrics",
"panels": [
{
"title": "Request Rate",
"targets": [{
"expr": "sum(rate(http_requests_total[1m]))"
}]
},
{
"title": "Error Rate",
"targets": [{
"expr": "sum(rate(http_requests_total{status=~\"5..\"}[1m])) / sum(rate(http_requests_total[1m]))"
}]
},
{
"title": "Latency P99",
"targets": [{
"expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))"
}]
}
]
}
}
Managed observability platforms (Datadog, New Relic, AWS CloudWatch) bundle auto-instrumentation, dashboards, and alerting out of the box.
Datadog:
import { datadogRum } from '@datadog/browser-rum';
import tracer from 'dd-trace';
// Backend
tracer.init(); // Auto-instruments HTTP, DB, cache
tracer.trace('custom-operation', () => {
// custom logic
});
// Frontend
datadogRum.init({
applicationId: 'app-id',
clientToken: 'token',
site: 'datadoghq.com',
service: 'web-app',
env: 'production',
sessionSampleRate: 100,
sessionReplaySampleRate: 20,
trackUserInteractions: true
});
New Relic:
// newrelic must be required before other modules so auto-instrumentation
// (HTTP, Express, pg, etc.) can hook them as they load
const newrelic = require('newrelic');
// Record a custom metric
newrelic.recordMetric('Custom/RequestCount', 1);
newrelic.startSegment('custom-segment', false, () => {
// custom logic
});
AWS CloudWatch:
import { CloudWatchClient, PutMetricDataCommand } from "@aws-sdk/client-cloudwatch";
const client = new CloudWatchClient({ region: 'us-east-1' });
await client.send(new PutMetricDataCommand({
Namespace: 'api-server',
MetricData: [{
MetricName: 'RequestCount',
Value: 1,
Unit: 'Count'
}]
}));