Skill

observability-design

Designs logging, metrics, and distributed tracing architectures using OpenTelemetry and observability platforms. Trigger: "observability design", "logging architecture", "metrics strategy", "distributed tracing", "OpenTelemetry".

From sovereign-architect

Install

Run in your terminal

npx claudepluginhub javimontano/mao-sovereign-architect

Tool Access

This skill is limited to using the following tools:

Read, Glob, Grep, Bash, Agent

Supporting Assets

evals/evals.json

examples/sample-output.md

prompts/use-case-prompts.md

references/body-of-knowledge.md

Skill Content

Observability Design

Architects comprehensive observability systems covering the three pillars — logs, metrics, and traces — with correlation strategies, alerting rules, and dashboard design for operational excellence.

Guiding Principle

"You cannot improve what you cannot observe. You cannot debug what you cannot trace."

Procedure

Step 1 — Define Observability Requirements

Identify the key user journeys and critical business transactions to monitor.
Define SLIs (Service Level Indicators) for each service: latency, error rate, throughput, saturation.
Establish SLOs (Service Level Objectives) with error budgets.
Map the system topology to identify all instrumentation points.
Determine compliance requirements for log retention and data residency.
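
The relationship between an SLO and its error budget can be sketched in a few lines. This is a minimal illustration, assuming a request-based availability SLO; the `Slo` class, the `checkout-availability` name, and the request counts are hypothetical and not part of the skill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A request-based SLO: the fraction of requests that must succeed."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed

    def __post_init__(self) -> None:
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")


def error_budget(slo: Slo, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates over the window."""
    return (1.0 - slo.target) * total_requests


def budget_remaining(slo: Slo, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    return 1.0 - failed_requests / error_budget(slo, total_requests)


# Hypothetical numbers: one million requests in the SLO window, 250 failures.
checkout = Slo(name="checkout-availability", target=0.999)
print(round(error_budget(checkout, 1_000_000)))               # 1000
print(round(budget_remaining(checkout, 1_000_000, 250), 3))   # 0.75
```

A 99.9% target over one million requests leaves room for about 1,000 failures, so 250 failures consume a quarter of the budget.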

Step 2 — Instrumentation Strategy

Select OpenTelemetry as the vendor-neutral instrumentation standard.
Define automatic instrumentation targets: HTTP frameworks, database clients, message brokers.
Identify custom instrumentation points: business events, critical decision paths, external API calls.
Establish context propagation rules: trace ID, span ID, baggage items across service boundaries.
Define the sampling strategy: head-based sampling to cap volume on high-throughput paths, tail-based sampling to retain traces that contain errors.
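
To make the propagation and head-based sampling rules concrete, here is a standard-library sketch that reads and writes the W3C `traceparent` header. It is not the OpenTelemetry SDK, which handles this automatically; the helper names (`extract`, `start_span`, `inject`, `head_sample`) and the 10% ratio are assumptions for illustration. Tail-based sampling is not shown, since it needs the complete trace and is usually decided in the collector.

```python
import re
import secrets
from typing import NamedTuple, Optional

# W3C Trace Context header: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str  # 32 hex characters
    span_id: str   # 16 hex characters
    sampled: bool


def head_sample(trace_id: str, ratio: float) -> bool:
    """Deterministic head-based decision that keeps roughly `ratio` of traces.

    The verdict depends only on the trace ID, so it can be recomputed anywhere.
    """
    return int(trace_id[-16:], 16) < int(ratio * (1 << 64))


def extract(headers: dict) -> Optional[SpanContext]:
    """Parse the incoming traceparent header; return None if absent or invalid."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def start_span(parent: Optional[SpanContext], ratio: float = 0.1) -> SpanContext:
    """Continue the caller's trace, or start a new one and sample it at the head."""
    if parent is None:
        trace_id = secrets.token_hex(16)
        sampled = head_sample(trace_id, ratio)
    else:
        trace_id, sampled = parent.trace_id, parent.sampled  # honor upstream decision
    return SpanContext(trace_id, secrets.token_hex(8), sampled)


def inject(ctx: SpanContext, headers: dict) -> None:
    """Write the context into outgoing request headers."""
    flags = "01" if ctx.sampled else "00"
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
parent = extract(incoming)
child = start_span(parent)
outgoing: dict = {}
inject(child, outgoing)
print(child.trace_id)  # 4bf92f3577b34da6a3ce929d0e0e4736
```

The child span keeps the caller's trace ID and sampling flag and only mints a new span ID, which is what lets every hop land in the same trace.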

Step 3 — Telemetry Pipeline

Logs: Structured JSON logging with standard fields (timestamp, level, service, trace_id, span_id, message).
Metrics: RED metrics (Rate, Errors, Duration) per service, plus USE metrics (Utilization, Saturation, Errors) per resource.
Traces: Distributed traces with spans for each service hop, database call, and external API call.
Design the collection pipeline: OTel Collector → processing/filtering → backend storage.
Specify the correlation strategy: every log entry includes trace_id and span_id for cross-pillar navigation.
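
The structured log format and the correlation rule above can be sketched with Python's standard `logging` module. This is one possible shape, not a prescribed implementation: the `JsonFormatter` and `build_logger` names, the `checkout` service, and the IDs are hypothetical, and in a real service the trace and span IDs would come from the active span rather than being passed by hand.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying the standard fields."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            # Values passed via `extra=` become attributes on the record.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def build_logger(service: str, stream=None) -> logging.Logger:
    """Create a logger that emits JSON lines (to stderr when stream is None)."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    return logger


# Capture output in a buffer here only so the emitted line can be inspected.
buffer = io.StringIO()
log = build_logger("checkout", stream=buffer)
log.info(
    "payment authorized",
    extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "00f067aa0ba902b7"},
)
entry = json.loads(buffer.getvalue())
print(entry["trace_id"], entry["message"])
```

Because every line carries `trace_id` and `span_id`, a log backend can jump from a log entry to the matching trace without any free-text parsing.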

Step 4 — Dashboards & Alerting

Design the dashboard hierarchy: system overview → service detail → endpoint detail → trace explorer.
Define alert rules based on SLOs: alert when error budget burn rate exceeds threshold.
Establish on-call runbooks linked to each alert.
Specify escalation paths and notification channels (PagerDuty, Slack, email).
Define anomaly detection rules for proactive issue identification.
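
A burn-rate alert rule can be sketched as follows. The two-window check and the 14.4 threshold are assumptions borrowed from common SRE practice (14.4 corresponds to spending 2% of a 30-day budget in one hour: 0.02 × 720 hours); the skill itself does not prescribe specific values, and the function names are hypothetical.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring.

    A burn rate of 1.0 spends the whole error budget exactly at the end of
    the SLO window; higher values exhaust it proportionally sooner.
    """
    return error_ratio / (1.0 - slo_target)


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed value, see the note above
) -> bool:
    """Page only if both the long and the short window burn above threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, which avoids paging for incidents already over.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) > threshold
        and burn_rate(short_window_error_ratio, slo_target) > threshold
    )


# Hypothetical readings for a 99.9% SLO.
print(should_page(0.02, 0.03, slo_target=0.999))   # True: both windows burn fast
print(should_page(0.02, 0.001, slo_target=0.999))  # False: errors already subsided
```

Tying the alert to budget consumption rather than to a raw error count keeps the threshold meaningful as traffic grows or shrinks.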

Quality Criteria

Every service emits correlated logs, metrics, and traces with shared context (trace_id).
SLOs are defined for all critical user journeys with error budget tracking.
Alert rules are precise: at least 80% of fired alerts are actionable (fewer than 20% are false positives).
Dashboards follow the RED/USE methodology and answer "is the system healthy?" in < 30 seconds.

Anti-Patterns

Logging without structure (free-text logs that are impossible to parse or aggregate).
Alert fatigue: too many alerts and overly sensitive thresholds, leading to ignored notifications.
Metrics without baselines — alerting on arbitrary thresholds instead of deviation from normal.
Observability as an afterthought — instrumenting only after incidents, missing the context needed to debug.
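
As a contrast to the "metrics without baselines" anti-pattern, the sketch below flags a reading by its deviation from recent history instead of a fixed threshold. The z-score approach, the threshold of 3, and the latency figures are illustrative assumptions; production anomaly detection usually also has to account for seasonality and trend.

```python
from statistics import mean, stdev
from typing import Sequence


def deviates_from_baseline(
    history: Sequence[float], current: float, z_threshold: float = 3.0
) -> bool:
    """Return True when `current` sits more than `z_threshold` sample
    standard deviations away from the mean of `history`."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples for a baseline")
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is a deviation.
        return current != baseline
    return abs(current - baseline) / spread > z_threshold


# Hypothetical p95 latency samples in milliseconds.
recent_p95_ms = [120, 118, 125, 122, 119, 121, 123]
print(deviates_from_baseline(recent_p95_ms, 124))  # False: within normal variation
print(deviates_from_baseline(recent_p95_ms, 180))  # True: far outside the baseline
```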

Stats

Stars: 0

Forks: 0

Last Commit: Mar 28, 2026
