From quality-attributes
Design observability (metrics, logs, traces) for understanding system behavior in production. Use when debugging distributed systems or building monitoring.
npx claudepluginhub sethdford/claude-skills --plugin architect-quality-attributes
How this skill is triggered: by the user, by Claude, or both
Slash command
/quality-attributes:observability-design
The summary Claude sees in its skill listing, used to decide when to auto-load this skill
Design comprehensive observability across metrics, logs, and traces to understand system behavior and debug issues.
Provides patterns for observability strategies covering logs, metrics, traces, and signal correlation. Use when designing monitoring systems or implementing the three pillars.
Sets up observability with structured logging, metrics collection, distributed tracing, alerting rules, dashboards, SLOs using ELK, Prometheus, Grafana, Datadog, OpenTelemetry. For monitoring, production debugging, observability architecture.
Guides observability setup across logs, metrics, traces: checklists, maturity assessments (L0-L4), metric/alert design, golden signals verification. Use for new services, SLOs, alerting.
You are building observability for a system. The user struggles to debug production issues or wants better visibility. Read their current monitoring setup.
Based on Google's SRE practices and observability research:
Define Key Metrics: For each critical path, specify the SLIs (success rate, latency, saturation) and a target for each. Example: for order checkout, success rate >99.9% and p99 latency <500ms.
Design Metrics Collection: Instrument code with metrics (request count, latency histogram, error count) using a metrics library such as Prometheus or StatsD. Keep label cardinality low: avoid unbounded values such as user IDs or request IDs as labels.
Configure Logging: Log key events (authentication, errors, deployments). Include a correlation ID in every log line. Aggregate logs centrally (ELK, Datadog).
Implement Distributed Tracing: Assign every request a trace ID at the entry point and pass it to every downstream service. Record a span for each operation (service name, operation, latency, result).
Build Dashboards & Alerts: Dashboards show a health overview (SLI status). Alerts fire on SLI violations. Every alert needs a runbook describing the action to resolve it.
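Two of the steps above can be sketched with the Python standard library alone: checking SLIs against the checkout targets, and stamping a correlation ID on every log line. This is a minimal illustration, not a production setup. The names `evaluate_sli`, `handle_request`, `JsonFormatter`, and `correlation_id` are hypothetical, the percentile uses the simple nearest-rank method, and a real service would more likely rely on a metrics library and OpenTelemetry than on hand-rolled helpers.

```python
import contextvars
import json
import logging
import math
import uuid

# Targets copied from the checkout example; tune them per critical path.
SUCCESS_RATE_TARGET = 0.999
P99_LATENCY_TARGET_MS = 500.0

# Holds the correlation ID of the request currently being handled (hypothetical name).
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that carries the correlation ID."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty sequence; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def evaluate_sli(requests):
    """requests: non-empty list of (succeeded, latency_ms) samples."""
    success_rate = sum(1 for ok, _ in requests if ok) / len(requests)
    p99 = percentile([latency for _, latency in requests], 99)
    return {
        "success_rate": success_rate,
        "p99_latency_ms": p99,
        # Strict comparisons mirror the ">99.9%" and "<500ms" targets.
        "healthy": success_rate > SUCCESS_RATE_TARGET and p99 < P99_LATENCY_TARGET_MS,
    }


def handle_request(logger, incoming_id=None):
    """Reuse the caller's ID if one arrived, otherwise mint one at the entry point."""
    token = correlation_id.set(incoming_id or uuid.uuid4().hex)
    try:
        logger.info("checkout started")
        return correlation_id.get()
    finally:
        # Restore the previous value so IDs never leak between requests.
        correlation_id.reset(token)
```

Attaching `JsonFormatter` to a handler makes every log line emitted inside `handle_request` carry the same ID, which is what lets logs be joined with traces later. Feeding `evaluate_sli` a window of recent request samples yields the health flag a dashboard or alert rule could consume.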