Guides observability implementation across logs, metrics, and traces. Covers golden signals, SLO-based alerting, and maturity assessment.
Slash command: /universal-dev-standards:observability-assistant [service name or observability topic]
> **Language**: English | [繁體中文](../../locales/zh-TW/skills/observability-assistant/SKILL.md)
Version: 1.0.0 | Last Updated: 2026-06-19 | Applicability: Claude Code Skills
Core Standard: This skill implements Observability Standards. For the authoritative methodology (full Metrics/Traces detail, sampling, OTel integration), refer to the core standard.
Guide observability implementation across the three pillars: Logs, Metrics, and Traces.
| Capability | Description |
|---|---|
| Instrumentation Check | Pre-launch observability checklist |
| Maturity Assessment | L0-L4 maturity self-evaluation |
| Metric Design | Help design metrics (type, naming, labels) |
| Alert Design | Design SLO-based alerts with noise reduction |
| Golden Signals | Verify coverage of the 4 golden signals |
/observability # Show observability guide
/observability --checklist # Run instrumentation checklist
/observability --maturity # Maturity assessment (L0-L4)
/observability --alerting # Alert design guide
/observability "payment-service" # Guide for specific service
Each pillar gives a different lens; their power is in correlation.
| Pillar | What It Captures | When to Use | Granularity |
|---|---|---|---|
| Logs | Discrete events with context | Debugging, audit trails, error details | High (per-event) |
| Metrics | Numerical measurements over time | Dashboards, alerting, capacity planning | Low (aggregated) |
| Traces | Request flow across services | Latency analysis, dependency mapping | Medium (per-request) |
Correlation fields: trace_id links Logs ↔ Traces ↔ Metrics (via Exemplars);
service.name filters all three pillars. Workflow: metric anomaly → exemplar →
trace → trace_id in logs.
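To make the correlation fields concrete, here is a minimal sketch of a structured log line that carries both trace_id and service.name. It uses only the Python standard library. The helper names (JsonFormatter, build_logger, new_trace_id, log_event) are hypothetical, and the trace ID is generated locally as a stand-in for whatever a tracing SDK would supply.

```python
import io
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that carries the correlation fields."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            # Both fields arrive through the `extra` argument of the logging call.
            "service.name": getattr(record, "service_name", "unknown"),
            "trace_id": getattr(record, "trace_id", None),
        })


def build_logger(stream) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream` (hypothetical helper)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("observability.demo")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def new_trace_id() -> str:
    """Assumption: stand-in for the ID a tracing SDK would provide (32 hex chars)."""
    return uuid.uuid4().hex


def log_event(logger: logging.Logger, message: str, service_name: str, trace_id: str) -> None:
    """Emit one log line tagged with the fields used to join logs and traces."""
    logger.info(message, extra={"service_name": service_name, "trace_id": trace_id})


if __name__ == "__main__":
    out = io.StringIO()
    log_event(build_logger(out), "charge failed", "payment-service", new_trace_id())
    print(out.getvalue().strip())
```

With every log line shaped this way, a trace_id taken from a trace (or from a metric exemplar) can be used as a plain search key in the log store.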
The four golden signals come from Google SRE practice. Every service SHOULD monitor all four before going to production.
| Signal | Measure | Example Metric | Alert (SLO-based) |
|---|---|---|---|
| Latency | P50/P95/P99 via Histogram, split success/error | http.server.request.duration.seconds | P99 > X ms for 5 min |
| Traffic | Requests/sec, by route/method | http.server.request.total (rate) | drop > 50% or spike > 200% |
| Errors | errors / total requests | ...request.total{status=~"5.."} ÷ total | error rate > X% for 5 min |
| Saturation | CPU/mem/pool/disk utilization | system.cpu.utilization (Gauge) | resource > 80% for 10 min |
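The alert conditions in the table can be sketched as a single check over one evaluation window. This is an illustration under stated assumptions, not the standard's implementation. WindowStats and firing_alerts are hypothetical names. The percentile is computed from raw samples rather than Histogram buckets. "Drop > 50% or spike > 200%" is read as a relative change against a baseline. The "for 5 min" and "for 10 min" durations are left to the caller.

```python
import math
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Measurements for one evaluation window (hypothetical shape)."""
    total_requests: int
    error_requests: int        # responses with a 5xx status
    latencies_ms: list[float]  # per-request latency samples
    current_rps: float
    baseline_rps: float        # e.g. the same window on a normal day
    utilization: float         # busiest resource, 0.0 to 1.0


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile of raw samples."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def firing_alerts(stats: WindowStats, p99_limit_ms: float, error_rate_limit: float) -> list[str]:
    """Return the golden signals whose condition is violated in this window."""
    alerts = []
    # Latency: P99 above the limit chosen for the service.
    if stats.latencies_ms and percentile(stats.latencies_ms, 99) > p99_limit_ms:
        alerts.append("latency")
    # Traffic: assumed to mean relative change versus the baseline.
    if stats.baseline_rps > 0:
        change = (stats.current_rps - stats.baseline_rps) / stats.baseline_rps
        if change < -0.5 or change > 2.0:
            alerts.append("traffic")
    # Errors: errors divided by total requests.
    if stats.total_requests and stats.error_requests / stats.total_requests > error_rate_limit:
        alerts.append("errors")
    # Saturation: resource utilization above 80%.
    if stats.utilization > 0.8:
        alerts.append("saturation")
    return alerts
```

The limits for latency and error rate are parameters because the table leaves them as X, to be derived from each service's SLO.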
| Type | Behavior | Use When |
|---|---|---|
| Counter | Only goes up (resets on restart) | request count, error count, bytes sent |
| Gauge | Point-in-time, up/down | queue depth, active connections, memory |
| Histogram | Distribution across buckets | request duration, response size |
| Summary | Client-computed percentiles | legacy, no server-side aggregation |
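The behavioral differences between the first three types can be shown with a minimal in-memory sketch. These classes are illustrative only and are not the API of any metrics library. The upper-inclusive bucket boundaries and the trailing overflow bucket are assumptions chosen for the example.

```python
import bisect


class Counter:
    """Monotonic total: only increases, and starts from zero again after a restart."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount


class Gauge:
    """Point-in-time value that can move in either direction."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value

    def inc(self, amount: float = 1.0) -> None:
        self.value += amount

    def dec(self, amount: float = 1.0) -> None:
        self.value -= amount


class Histogram:
    """Counts observations per bucket; the last bucket catches everything above the bounds."""

    def __init__(self, bounds: list[float]) -> None:
        self.bounds = sorted(bounds)
        self.bucket_counts = [0] * (len(self.bounds) + 1)
        self.count = 0
        self.sum = 0.0

    def observe(self, value: float) -> None:
        # bisect_left finds the first bound >= value, so each bucket includes its upper bound.
        self.bucket_counts[bisect.bisect_left(self.bounds, value)] += 1
        self.count += 1
        self.sum += value


if __name__ == "__main__":
    durations = Histogram([0.1, 0.5, 1.0])
    for seconds in (0.05, 0.1, 0.3, 2.0):
        durations.observe(seconds)
    print(durations.bucket_counts)
```

Because a Histogram keeps per-bucket counts rather than precomputed percentiles, counts from several instances can be added together before percentiles are estimated, which a Summary does not allow.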
Naming: <domain>.<entity>.<action>.<unit>, dot-separated, with each segment
in snake_case (e.g. db.client.query.duration.seconds). Label cardinality:
keep each label under ~1000 unique values. Never use user_id, request_id,
raw url, or ip as labels; record those in Logs or Traces instead.
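The naming and cardinality rules above can be checked mechanically. The following sketch is one possible lint under stated assumptions. The function names are hypothetical, and the forbidden label keys are guesses at how the identifiers listed above would be spelled as label names. The number of segments is not enforced, because metric names elsewhere in this guide (such as system.cpu.utilization) use fewer than four.

```python
import re

# One dot-separated segment: lowercase snake_case.
SEGMENT = re.compile(r"^[a-z][a-z0-9_]*$")

# Assumption: label keys corresponding to the identifiers the guide forbids.
FORBIDDEN_LABELS = {"user_id", "request_id", "url", "ip"}

# Approximate ceiling on unique values per label, taken from the guide.
MAX_LABEL_VALUES = 1000


def is_valid_metric_name(name: str) -> bool:
    """True when the name is dot-separated and every segment is snake_case."""
    segments = name.split(".")
    return len(segments) >= 2 and all(SEGMENT.match(s) for s in segments)


def label_problems(label_values: dict[str, set[str]]) -> list[str]:
    """Return one message per label that breaks the cardinality guidance."""
    problems = []
    for label, values in sorted(label_values.items()):
        if label in FORBIDDEN_LABELS:
            problems.append(f"{label}: unbounded identifier, record it in logs or traces")
        elif len(values) > MAX_LABEL_VALUES:
            problems.append(f"{label}: {len(values)} unique values exceeds {MAX_LABEL_VALUES}")
    return problems
```

A check like this could run in code review or CI, so that a high-cardinality label is caught before it reaches the metrics backend.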
| Level | Name | Characteristics | Upgrade Action |
|---|---|---|---|
| L0 | No Observability | only stdout/stderr; debug via SSH + tail -f | structured logging; centralize collection |
| L1 | Basic Logging | structured JSON logs, centralized, searchable | add business metrics; first dashboard |
| L2 | Metrics-Driven | Logs + Metrics, dashboards, threshold alerts | enable tracing; SLO-based alerting |
| L3 | Full Observability | three pillars + correlation + SLO alerts + Golden Signals | anomaly detection; auto-remediation |
| L4 | Intelligent | AIOps anomaly detection, predictive alerts, auto-remediation | maintain, optimize, share learnings |
Self-check: find logs for one request across services (L1+) → dashboards of error rate & latency (L2+) → trace request ingress→DB→back (L3+) → auto-detect anomalies before users report (L4).
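The self-check can be read as a cumulative ladder: each question unlocks the next level only if every earlier question also passed. A minimal sketch of that reading follows. The function and parameter names are hypothetical, and the answers are self-reported rather than measured.

```python
def maturity_level(
    can_find_logs_across_services: bool,
    has_error_and_latency_dashboards: bool,
    can_trace_ingress_to_db: bool,
    detects_anomalies_before_users: bool,
) -> str:
    """Map the four self-check answers to a level from L0 to L4."""
    checks = [
        can_find_logs_across_services,     # passing means at least L1
        has_error_and_latency_dashboards,  # passing means at least L2
        can_trace_ingress_to_db,           # passing means at least L3
        detects_anomalies_before_users,    # passing means L4
    ]
    level = 0
    for passed in checks:
        if not passed:
            break  # levels are cumulative: stop at the first gap
        level += 1
    return f"L{level}"


if __name__ == "__main__":
    # Dashboards exist but tracing does not: reported as L2 even with anomaly detection.
    print(maturity_level(True, True, False, True))
```

Stopping at the first failed check reflects the table's upgrade path, where each level builds on the one before it.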
Before deploying a service to production:
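- trace_id correlation

Observability guidance complete. Suggested next steps:
- Run /slo to define SLI/SLO/Error Budget ⭐ Recommended
- Run /incident to set up the incident response process
- Run /checkin to commit changes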
npx claudepluginhub asiaostrich/universal-dev-standards --plugin universal-dev-standards