This skill should be used when the user asks to "design observability", "set up monitoring", "implement tracing", "configure alerting", or "define SLOs". Also triggers on mentions of OpenTelemetry, Prometheus, Grafana, ELK, correlation IDs, burn rate, or runbooks. Use this skill even if the user only asks about one pillar like logging — the full three-pillar observability context is always relevant for production systems. [EXPLICIT]
Observability architecture enables teams to understand system behavior from external outputs — logs, traces, and metrics. The skill produces comprehensive observability strategies covering the three pillars, alerting frameworks, and incident response integration that transform raw telemetry into actionable operational intelligence. [EXPLICIT]
If it cannot be observed, it cannot be operated. If it cannot be operated, it does not exist. The three pillars (logs, metrics, traces) are the minimum, not the maximum. SLO-based alerting replaces threshold alerting, and incident response begins with observability; it does not end there.
The user provides a system or platform name as $ARGUMENTS. Parse $1 as the system/platform name used throughout all output artifacts. [EXPLICIT]
Parameters:
{MODO}: piloto-auto (default) | desatendido | supervisado | paso-a-paso
{FORMATO}: markdown (default) | html | dual
{VARIANTE}: ejecutiva (~40%: S1 strategy + S4 metrics/dashboards + S5 alerting) | técnica (full 6 sections, default)

Before generating architecture, detect the technology context:
!find . -name "*.yaml" -o -name "*.yml" -o -name "docker-compose*" -o -name "*.tf" -o -name "otel*" | head -20
If reference materials exist, load them:
Read ${CLAUDE_SKILL_DIR}/references/observability-patterns.md
Define the overarching approach to system understanding through the three pillars. [EXPLICIT]
Three-pillar assessment: Evaluate current maturity of logging, tracing, and metrics. Score each 1-5 (ad hoc -> optimized).
OpenTelemetry (OTel) Adoption Plan:
OTel Collector Topology Decision:
| Pattern | Description | When to use |
|---|---|---|
| Agent (per-node) | Sidecar or DaemonSet alongside each app. Lightweight, local processing. | Default starting point. Minimizes network hops. |
| Gateway (centralized) | Standalone service receiving from multiple agents. Heavy processing, routing. | Tail-based sampling, multi-backend routing, data enrichment. |
| Hierarchical (Agent + Gateway) | Agents handle local batching/filtering; Gateways handle aggregation/sampling. | Production recommendation for >10 services. |
Critical topology rule for tail-based sampling: Separate Agent from Gateway. All spans from the same trace must reach the same Gateway instance for correct sampling decisions. Use trace-ID-based routing (consistent hashing) at the load balancer in front of Gateways.
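The routing rule above can be sketched in a few lines (hypothetical gateway names; a plain hash-modulo stands in for the consistent hashing a real load balancer or the OTel load-balancing exporter would use):

```python
import hashlib

def pick_gateway(trace_id: str, gateways: list) -> str:
    """Route by trace ID so every span of one trace reaches the same Gateway.

    Hash-modulo is used here for illustration only; production setups use
    consistent hashing so adding or removing a Gateway remaps few traces.
    """
    digest = hashlib.sha256(trace_id.encode("ascii")).digest()
    return gateways[int.from_bytes(digest[:8], "big") % len(gateways)]

gateways = ["gateway-0", "gateway-1", "gateway-2"]  # hypothetical instances
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
# Every span carrying this trace ID resolves to the same Gateway instance,
# which is the property tail-based sampling depends on.
chosen = pick_gateway(trace_id, gateways)
```

The same hash function must run wherever routing decisions are made, or spans of one trace will scatter across Gateways and sampling decisions will be made on partial traces.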
Sampling strategies:
Retention tiers: Hot 7d (real-time query), warm 30d (aggregated), cold 90d+ (archived, compliance)
Observability-driven development: Instrument first, then code. Define "healthy" in telemetry before writing business logic. New features ship with dashboards and alerts as part of definition of done.
Design structured logging with aggregation, correlation, and retention management. [EXPLICIT]
Mandatory structured log fields (JSON):
timestamp, level, service, traceId, spanId, message, environment, version
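A minimal sketch of one log line carrying these fields (the service name, version, and IDs are hypothetical placeholders; real values come from the runtime environment and the active span context):

```python
import json
from datetime import datetime, timezone

def log_event(level: str, message: str, trace_id: str, span_id: str) -> str:
    """Render one structured log line with the mandatory fields as JSON."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "service": "checkout-api",   # hypothetical service name
        "traceId": trace_id,
        "spanId": span_id,
        "message": message,
        "environment": "production",
        "version": "1.4.2",          # hypothetical build version
    }
    return json.dumps(record)

line = log_event("INFO", "order placed",
                 "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
```

Emitting one JSON object per line keeps the aggregation pipeline's parsing trivial and makes every field queryable downstream.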
Log Level Standards:
| Level | Production use | Definition |
|---|---|---|
| ERROR | Always on | Actionable failures requiring intervention |
| WARN | Always on | Degraded but functioning, may need attention |
| INFO | Always on | Business events, request lifecycle |
| DEBUG | Off (enable per-service temporarily) | Development detail, never in prod by default |
Correlation ID propagation: Request ID generated at entry point, passed through all service calls via W3C Trace Context headers. Every log line includes traceId and spanId.
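The W3C `traceparent` header format can be sketched as follows (hand-rolled for illustration only; in practice the OTel SDK's propagators inject and extract this header automatically):

```python
def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Build a W3C Trace Context `traceparent` header, version 00.

    Layout: version-traceid-spanid-flags, all lowercase hex
    (32-char trace ID, 16-char span ID).
    """
    assert len(trace_id) == 32 and len(span_id) == 16
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its four fields."""
    version, trace_id, span_id, flags = header.split("-")
    return {"version": version, "traceId": trace_id,
            "spanId": span_id, "sampled": flags == "01"}

hdr = make_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
```

The extracted `traceId` and `spanId` are exactly the values every log line must carry, which is what makes log-trace correlation work end to end.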
Aggregation pipeline: Collection agent (Vector, Fluentd) -> Processing (filtering, enrichment) -> Storage
Storage backend decision:
Sensitive data: PII masking at collection layer (Vector transforms, Fluentd filters). Never log credentials, tokens, or full credit card numbers.
Retention: ERROR 90d, WARN 30d, INFO 30d, DEBUG 7d (non-prod only)
Implement trace propagation, span design, and cross-signal correlation. [EXPLICIT]
Trace propagation: W3C Trace Context headers across HTTP, gRPC metadata, message queue headers (Kafka record headers, AMQP properties)
Span design: One span per logical operation:
Span attributes: Operation name, status code, error flag, custom business attributes (order ID, tenant ID)
Sampling strategy:
Exemplars — Metrics-to-Traces Linking: Exemplars attach a trace/span reference to a specific metric data point. Configure both OTel metric and trace SDKs; record metrics within an active span context. This enables clicking from a latency spike on a dashboard directly to the offending trace — the critical bridge between "what is happening" (metrics) and "why" (traces). Enable exemplars in Prometheus (--enable-feature=exemplar-storage) and Grafana (exemplar data source configuration). [EXPLICIT]
Service map: Auto-generate topology from trace data. Review weekly for unexpected dependencies.
Storage: Tempo (Grafana-native, cost-efficient), Jaeger (mature, Elasticsearch/Cassandra backend), X-Ray (AWS native)
Define metric types, naming conventions, collection methods, and dashboard hierarchy. [EXPLICIT]
Metric types: Counters (e.g., `requests_total`), Gauges (e.g., active connections), Histograms (e.g., latency distribution)
Naming convention: <service>_<component>_<metric>_<unit> (e.g., api_http_request_duration_seconds)
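A small validator for the convention can be sketched like this (the unit-suffix allowlist is an illustrative assumption, not an exhaustive standard):

```python
import re

# Prometheus-style name: lowercase, underscore-separated, unit suffix last.
# The unit allowlist below is illustrative only.
_VALID = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*_(seconds|bytes|total|ratio|count)$")

def metric_name(service: str, component: str, metric: str, unit: str) -> str:
    """Compose and validate <service>_<component>_<metric>_<unit>."""
    name = f"{service}_{component}_{metric}_{unit}"
    if not _VALID.match(name):
        raise ValueError(f"non-conforming metric name: {name}")
    return name

name = metric_name("api", "http", "request_duration", "seconds")
# -> "api_http_request_duration_seconds"
```

Running such a check in CI keeps naming drift out of dashboards and alert expressions before it accumulates.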
Framework methods:
Dashboard hierarchy:
Log-Based Metrics: Derive counters and gauges from structured logs without separate instrumentation. Extract ERROR counts per service per minute, parse latency from request logs. Tools: Loki recording rules, Vector transforms, CloudWatch Metric Filters. Useful for legacy systems that only emit logs.
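The extraction step can be sketched as a small reducer over structured log lines (field names follow the mandatory log schema above; the sample records are hypothetical):

```python
import json
from collections import Counter

def error_counts(log_lines: list) -> Counter:
    """Derive a per-(service, minute) ERROR counter from structured log lines."""
    counts = Counter()
    for raw in log_lines:
        rec = json.loads(raw)
        if rec.get("level") == "ERROR":
            minute = rec["timestamp"][:16]  # ISO timestamp truncated to the minute
            counts[(rec["service"], minute)] += 1
    return counts

lines = [
    '{"timestamp": "2026-03-12T10:01:03Z", "level": "ERROR", "service": "api"}',
    '{"timestamp": "2026-03-12T10:01:42Z", "level": "ERROR", "service": "api"}',
    '{"timestamp": "2026-03-12T10:01:50Z", "level": "INFO",  "service": "api"}',
]
```

Loki recording rules, Vector transforms, and CloudWatch Metric Filters perform this same reduction continuously inside the pipeline rather than in application code.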
Cardinality Management:
| Problem | Solution |
|---|---|
| URL paths with IDs (/users/123) | Normalize to /users/{id} |
| User/request IDs as labels | Remove; use trace correlation instead |
| Unbounded enum labels | Allowlist known values, bucket rest as "other" |
| Per-pod metrics in large clusters | Aggregate to service level, drill down on demand |
Default OTel cardinality limit: 2000 unique time series per metric (configurable via View API). Cardinality explosion is the primary driver of unpredictable observability costs. Enforce per-service observability budgets (e.g., max 5000 active series per service). Use delta temporality for high-cardinality counters.
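Two of the mitigations above (allowlisting and path normalization) can be sketched directly; the region allowlist is a hypothetical example:

```python
import re

ALLOWED_REGIONS = {"eu-west-1", "us-east-1"}  # hypothetical allowlist

def bounded_label(value: str, allowlist: set) -> str:
    """Cap label cardinality: unknown values collapse into 'other'."""
    return value if value in allowlist else "other"

def normalize_path(path: str) -> str:
    """Collapse numeric path segments so /users/123 and /users/456
    share one time series instead of one series per user."""
    return re.sub(r"/\d+(?=/|$)", "/{id}", path)

route = normalize_path("/users/123/orders/456")   # "/users/{id}/orders/{id}"
region = bounded_label("ap-south-1", ALLOWED_REGIONS)  # "other"
```

Applying these transforms at the collection layer (Collector processors, Vector transforms) protects every backend at once, rather than patching each service individually.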
Dashboard-as-code: Grafana dashboards in JSON/Jsonnet, version-controlled. Deploy with Terraform or grizzly.
Build SLO-based alerting with burn rate windows, severity levels, and runbook integration. [EXPLICIT]
Alert Fatigue Prevention — Error Budget Burn Rate Model (Google SRE):
| Alert type | Burn rate | Long window | Short window | Action |
|---|---|---|---|---|
| Fast burn | 14.4x | 1 hour | 5 minutes | Page immediately (P1) |
| Medium burn | 6x | 6 hours | 30 minutes | Page during hours (P2) |
| Slow burn | 1x | 3 days | 6 hours | Create ticket (P3) |
This dual-window approach reduces false positives while maintaining sensitivity. Alert only when error budget burn rate is sustained — eliminates transient noise. [EXPLICIT]
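The arithmetic behind the table can be made concrete; this sketch assumes a 99.9% SLO over a 30-day window:

```python
def error_rate_threshold(slo: float, burn_rate: float) -> float:
    """Observed error rate at which a given burn rate is reached.

    Error budget = 1 - SLO; burning it `burn_rate` times faster than
    the sustainable pace means the error rate equals burn_rate * (1 - SLO).
    """
    return burn_rate * (1.0 - slo)

def budget_consumed(burn_rate: float, window_hours: float, slo_days: int = 30) -> float:
    """Fraction of the whole error budget consumed over `window_hours`."""
    return burn_rate * window_hours / (slo_days * 24)

# Fast burn for a 99.9% SLO: the 14.4x alert fires on a sustained ~1.44%
# error rate, which spends 2% of the 30-day budget in a single hour.
fast_threshold = error_rate_threshold(0.999, 14.4)
fast_spend = budget_consumed(14.4, 1)
```

The short window (5 minutes here) simply re-checks that the burn is still happening at page time, which is what suppresses alerts for spikes that have already recovered.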
Severity Levels:
| Severity | Criteria | Response | Notification |
|---|---|---|---|
| P1 | Customer-visible impact, SLO breach imminent | Page immediately, 15min response | PagerDuty/OpsGenie |
| P2 | Degraded performance, no SLO breach yet | Page during business hours | Slack + on-call |
| P3 | Anomaly, potential future issue | Next business day ticket | Email + Jira |
Alert hygiene:
Runbook linkage: Every alert MUST link to a runbook with: diagnostic steps, likely root causes, remediation actions, escalation path. No alert without a runbook.
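The "no alert without a runbook" rule lends itself to a CI gate; this sketch assumes alert rules expose a `runbook_url` annotation, a common convention rather than a universal standard:

```python
def validate_alerts(alerts: list) -> list:
    """Return the names of alert rules that lack a runbook link.

    Intended as a CI check over alert-rule definitions before deploy.
    """
    missing = []
    for alert in alerts:
        if not alert.get("annotations", {}).get("runbook_url"):
            missing.append(alert.get("alert", "<unnamed>"))
    return missing

rules = [  # hypothetical rule definitions
    {"alert": "HighErrorBudgetBurn",
     "annotations": {"runbook_url": "https://runbooks.example.com/burn"}},
    {"alert": "DiskAlmostFull", "annotations": {}},
]
offenders = validate_alerts(rules)  # ["DiskAlmostFull"]
```

Failing the pipeline on a non-empty result enforces the policy mechanically instead of relying on review discipline.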
Connect observability to incident management with on-call, classification, and post-mortem feedback. [EXPLICIT]
On-call: Rotation schedules (weekly), follow-the-sun for distributed teams, primary + secondary
Classification: Severity matrix (impact x urgency), auto-suggest severity from SLO data
Incident timeline: Auto-generated from:
Post-mortem template: Timeline, impact (users affected, duration, error budget consumed), root cause, contributing factors, action items with owners and deadlines
Feedback loops: Post-mortem actions feed into:
Blameless culture: Focus on system failures, not individual mistakes. Mandatory post-mortems for P1, optional for P2.
| Capability | Grafana Stack (OSS) | Datadog | New Relic | Honeycomb |
|---|---|---|---|---|
| Metrics | Prometheus/Mimir | Built-in | Built-in | Limited (traces-first) |
| Logs | Loki | Built-in | Built-in | Limited |
| Traces | Tempo | Built-in | Built-in | Core strength |
| Exemplars | Native | Supported | Supported | Native |
| High-cardinality exploration | Limited | Good | Good | Excellent |
| Cost model | Infra + ops | Per-host + ingestion | Per-GB ingested | Per-event |
| Vendor lock-in | None (OTel native) | Medium | Medium | Low (OTel) |
| Best for | Platform teams, cost-conscious, full control | Full-stack teams, fast setup | Legacy + modern mixed | Debugging complex distributed systems |
Selection criteria: <20 services and no platform team -> managed (Datadog/New Relic). Platform engineering team present -> Grafana stack. Complex distributed debugging priority -> Honeycomb. Always use OTel SDK regardless of backend for portability.
| Decision | Enables | Constrains | When to Use |
|---|---|---|---|
| Full trace sampling | Complete visibility | High storage cost, processing overhead | Debugging, low-traffic, compliance |
| Tail-based sampling | Captures slow/error reliably | Collector buffering, added latency | High-traffic, anomaly focus |
| Managed platform | Fast setup, unified UI | Vendor lock-in, cost at scale | Small-medium teams, rapid value |
| Open-source stack | No lock-in, full control | Ops burden, integration work | Large teams with platform capacity |
| SLO-based alerting | Reduced noise, business-aligned | Requires SLO definition discipline | Mature teams with defined service levels |
Greenfield System: Instrument from day one. Embed OTel SDK in service templates. Define logging and tracing standards before first deployment.
Legacy Monolith: Start with infrastructure metrics and access logs. Add structured logging incrementally. Use APM agents for automatic instrumentation. Trace boundaries at external calls.
Serverless / FaaS: Cold starts complicate tracing. Use OTel Lambda layers or vendor-native tracing (X-Ray, Cloud Trace). Push metrics (no scrape endpoint). Log to stdout with structured format.
Multi-Cloud or Hybrid: Normalize telemetry with OTel. Centralized Gateway Collector aggregating across clouds. Standardize naming conventions regardless of provider.
High-Cardinality Environments: Microservices with many endpoints and tenants. Use label allowlists, drop unused labels at collection, enforce per-service series budgets.
Before finalizing delivery, verify:
| Format | Default | Description |
|---|---|---|
| markdown | ✅ | Rich Markdown + Mermaid diagrams. Token-efficient. |
| html | On demand | Branded HTML (Design System). Visual impact. |
| dual | On demand | Both formats. |
Default output is Markdown with embedded Mermaid diagrams. HTML generation requires explicit {FORMATO}=html parameter. [EXPLICIT]
Primary: A-01_Observability_Architecture.md (or .html when {FORMATO}=html) — Executive summary, three-pillar strategy, OTel Collector topology, logging standards, tracing design, metric taxonomy, alerting framework, incident response integration.
Secondary: OTel Collector configuration (agent + gateway), Grafana dashboard JSON, burn rate alert rules, runbook templates, post-mortem template.
Author: Javier Montaño | Last updated: 2026-03-12