Design observability (metrics, logs, traces) for understanding system behavior in production. Use when debugging distributed systems or building monitoring.
From quality-attributes. Install: npx claudepluginhub sethdford/claude-skills --plugin architect-quality-attributes
This skill uses the workspace's default tool permissions.
Design comprehensive observability across metrics, logs, and traces to understand system behavior and debug issues.
You are building observability for a system. The user struggles to debug production issues or wants better visibility. Read their current monitoring setup first.
Based on Google's SRE practices and observability research:
Define Key Metrics: For each critical path, specify SLI metrics (success rate, latency, saturation). Example: order checkout: success rate >99.9%, p99 latency <500ms.
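Declaring SLI targets as data makes them checkable. A minimal sketch, assuming hypothetical paths and thresholds (only the order-checkout numbers come from the example above):

```python
# Illustrative SLI targets per critical path; names and the "search"
# thresholds are assumptions, not part of the original spec.
SLI_TARGETS = {
    "order_checkout": {"success_rate_min": 0.999, "p99_latency_ms_max": 500},
    "search":         {"success_rate_min": 0.995, "p99_latency_ms_max": 300},
}

def sli_violations(path: str, success_rate: float, p99_latency_ms: float) -> list[str]:
    """Return the SLI targets the measured values violate for this path."""
    target = SLI_TARGETS[path]
    violations = []
    if success_rate < target["success_rate_min"]:
        violations.append("success_rate")
    if p99_latency_ms > target["p99_latency_ms_max"]:
        violations.append("p99_latency")
    return violations

print(sli_violations("order_checkout", 0.9995, 420))  # healthy
print(sli_violations("order_checkout", 0.998, 650))   # both targets violated
```

Keeping targets in one structure also gives dashboards and alerts a single source of truth.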
Design Metrics Collection: Instrument code with metrics (request count, latency histogram, error count). Use a metrics library (Prometheus, StatsD). Keep label cardinality low.
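The instrumentation and cardinality points can be sketched with a hand-rolled counter and histogram (in production you would use a real client such as prometheus_client or a StatsD client; this stdlib-only version just shows the shape):

```python
from collections import defaultdict
import bisect

BUCKETS_MS = [5, 10, 25, 50, 100, 250, 500, 1000]  # fixed histogram buckets

request_count = defaultdict(int)      # (route, status_class) -> count
latency_histogram = defaultdict(int)  # (route, bucket_le) -> count

def observe_request(route: str, status: int, latency_ms: float) -> None:
    # Label by status *class* (2xx/5xx), never by user ID or raw URL:
    # unbounded label values explode time-series cardinality.
    status_class = f"{status // 100}xx"
    request_count[(route, status_class)] += 1
    # Find the smallest bucket with latency_ms <= bucket ("le" semantics).
    i = bisect.bisect_left(BUCKETS_MS, latency_ms)
    bucket = BUCKETS_MS[i] if i < len(BUCKETS_MS) else float("inf")
    latency_histogram[(route, bucket)] += 1

observe_request("/checkout", 200, 42.0)
observe_request("/checkout", 500, 900.0)
print(dict(request_count))
```

Fixed buckets trade precision for bounded storage; percentile queries are then approximations over bucket boundaries.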
Configure Logging: Log key events (authentication, errors, deployments). Include correlation ID in every log. Aggregate logs centrally (ELK, Datadog).
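One way to stamp a correlation ID onto every log line is a contextvar plus a logging filter, a sketch using only the standard library (handler names and format are assumptions):

```python
import contextvars
import logging
import uuid

# Holds the current request's correlation ID; "-" outside a request.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    """Copy the contextvar onto each record so formatters can print it."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True

logging.basicConfig(format="%(asctime)s %(levelname)s [%(correlation_id)s] %(message)s")
logger = logging.getLogger("app")
logger.addFilter(CorrelationFilter())

def handle_request():
    correlation_id.set(uuid.uuid4().hex)        # set once at the entry point
    logger.warning("authentication succeeded")  # every line carries the ID
    logger.error("payment backend timed out")

handle_request()
```

Because the ID lives in a contextvar, it survives across async tasks and helper functions without being passed explicitly; the central aggregator (ELK, Datadog) can then group all lines for one request.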
Implement Distributed Tracing: Every request gets a trace ID at the entry point. Propagate it to every downstream service. Record a span for each operation (service name, operation, latency, result).
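Trace propagation reduces to: mint the ID once at the edge, pass it on every hop, record a span per operation. A minimal in-process sketch (service names are illustrative; real systems carry the ID in a header such as W3C traceparent):

```python
import time
import uuid
from dataclasses import dataclass

@dataclass
class Span:
    trace_id: str
    service: str
    operation: str
    latency_ms: float
    result: str

spans: list[Span] = []  # stand-in for a trace backend (Jaeger, Tempo, ...)

def record_span(trace_id: str, service: str, operation: str, fn):
    """Run fn, timing it and recording a span even on failure."""
    start = time.perf_counter()
    try:
        value = fn()
        result = "ok"
        return value
    except Exception:
        result = "error"
        raise
    finally:
        spans.append(Span(trace_id, service, operation,
                          (time.perf_counter() - start) * 1000, result))

def payment_service(trace_id: str):  # downstream hop reuses the same ID
    return record_span(trace_id, "payments", "charge", lambda: "charged")

def checkout():
    trace_id = uuid.uuid4().hex      # minted once at the entry point
    return record_span(trace_id, "api", "checkout",
                       lambda: payment_service(trace_id))

checkout()
```

Querying the span store by trace ID then reconstructs the whole request path across services.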
Build Dashboards & Alerts: Dashboards show a health overview (SLI status). Alert on SLI violations. Every alert requires a runbook (actions to resolve).
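The alert-requires-runbook rule can be enforced structurally by pairing every condition with a runbook link in the alert definition itself. A sketch with hypothetical alert names, thresholds, and URLs:

```python
# Each alert bundles its firing condition with a runbook; an alert with
# no runbook simply cannot be defined. Names/URLs are illustrative.
ALERTS = [
    {
        "name": "CheckoutSuccessRateLow",
        "condition": lambda m: m["checkout_success_rate"] < 0.999,
        "severity": "page",
        "runbook": "https://wiki.example.com/runbooks/checkout-success-rate",
    },
    {
        "name": "CheckoutLatencyHigh",
        "condition": lambda m: m["checkout_p99_ms"] > 500,
        "severity": "ticket",
        "runbook": "https://wiki.example.com/runbooks/checkout-latency",
    },
]

def evaluate(metrics: dict) -> list[dict]:
    """Return the alerts whose conditions hold for the current metrics."""
    return [a for a in ALERTS if a["condition"](metrics)]

firing = evaluate({"checkout_success_rate": 0.997, "checkout_p99_ms": 430})
for alert in firing:
    print(alert["name"], "->", alert["runbook"])
```

In Prometheus/Alertmanager the same idea is usually expressed as a runbook_url annotation on each alerting rule, so the link arrives with the page.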