Skill

monitoring-instrumentation

Metrics, logs, traces (observability); choosing what to measure, dashboards, and incident response.

npx claudepluginhub sethdford/claude-skills --plugin engineer-devops-practices

Invocation

Slash command

/devops-practices:monitoring-instrumentation


Stats

Parent stars: 13

Parent forks: 2

Maintenance: Fair

Last commit: Mar 11, 2026

Monitoring Instrumentation

Observing systems to understand behavior and detect problems.

Context

You are planning observability. Measure what matters; enable debugging.

Domain Context

Metrics: Numbers (latency, errors, CPU); aggregated, queryable

Logs: Text events; detailed, high volume

Traces: Request paths; understand where time is spent

Dashboards: Visualize metrics for on-call engineers

Alerts: Notify when metrics exceed thresholds
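To make the difference between individual events and aggregated metrics concrete, here is a minimal in-memory sketch that turns per-request observations into an error rate and a latency percentile for one endpoint. The names `EndpointStats` and `record`, and the choice of the nearest-rank percentile method, are assumptions made for illustration; a real service would normally rely on a metrics library rather than keeping raw samples in a list.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class EndpointStats:
    """Hypothetical in-memory aggregate for a single service endpoint."""

    latencies_ms: List[float] = field(default_factory=list)
    errors: int = 0

    def record(self, latency_ms: float, ok: bool) -> None:
        # Each call represents one request (an "event"); only the
        # aggregate numbers below are what a dashboard would query.
        self.latencies_ms.append(latency_ms)
        if not ok:
            self.errors += 1

    @property
    def count(self) -> int:
        return len(self.latencies_ms)

    def error_rate(self) -> float:
        # Fraction of requests that failed; 0.0 when nothing was recorded.
        return self.errors / self.count if self.count else 0.0

    def percentile(self, p: float) -> float:
        # Nearest-rank percentile: the smallest sample such that at
        # least p percent of samples are less than or equal to it.
        if not self.latencies_ms:
            raise ValueError("no samples recorded")
        ordered = sorted(self.latencies_ms)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]
```

Calling `record(latency_ms, ok)` once per request and reading `error_rate()` and `percentile(95)` per endpoint gives the latency and error-rate numbers that the metrics, dashboards, and alerts above refer to.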

Instructions

Define SLOs: Service level objectives; decide what matters to users before choosing metrics

Metrics: Latency, error rate, throughput; per service/endpoint

Logs: Structured logs with trace ID for correlation

Traces: End-to-end request tracking for debugging

Dashboards: Visualize SLOs; use during incident response

Alerts: Alert on SLO violations, not absolute thresholds

Runbooks: Document how to respond to alerts
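The "structured logs with trace ID" step could be sketched as follows, using only the Python standard library. This is one possible approach, not the skill's prescribed implementation: `JsonFormatter`, `build_logger`, `handle_request`, and the JSON field names are all hypothetical, and a production system would more likely take the trace ID from its tracing library than generate it by hand.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the trace ID for the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default=None)


class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object including the trace ID."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": trace_id_var.get(),
        }
        return json.dumps(payload)


def build_logger(name: str, stream) -> logging.Logger:
    # Hypothetical helper: a logger that writes JSON lines to `stream`.
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def handle_request(logger: logging.Logger, incoming_trace_id=None) -> str:
    # Reuse the caller's trace ID when one is passed along, so logs
    # from different services can be joined on the same value.
    trace_id = incoming_trace_id or uuid.uuid4().hex
    token = trace_id_var.set(trace_id)
    try:
        logger.info("request received")
        logger.info("request completed")
    finally:
        trace_id_var.reset(token)
    return trace_id
```

Because every line carries the same `trace_id` field, a log search for that value returns all events for one request, which is what makes correlation with traces possible.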

Anti-Patterns

Too many metrics; noisy, hard to find signal

Alerts with no runbook; on-call engineer guesses

Logging everything; expensive and hard to search

No trace IDs; can't correlate across services

Alerting on absolute thresholds instead of SLO violations; SLO-based alerts are better
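As an illustration of SLO-based alerting, the sketch below computes an error-budget burn rate: the observed error rate divided by the error rate the SLO allows. A burn rate of 1 means the budget would be used up exactly over the SLO window; higher values mean it is being spent faster. The function names and the default `threshold` are assumptions for the example, not values taken from this skill.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    slo_target is a fraction such as 0.999 (99.9% of requests succeed).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if total == 0:
        return 0.0  # no traffic, so no budget is being consumed
    allowed_error_rate = 1.0 - slo_target
    return (errors / total) / allowed_error_rate


def should_alert(errors: int, total: int, slo_target: float,
                 threshold: float = 2.0) -> bool:
    # threshold is a hypothetical tuning value: alert when the budget
    # is being consumed more than `threshold` times faster than allowed.
    return burn_rate(errors, total, slo_target) > threshold
```

With a 99.9% target, 5 failures in 1,000 requests gives a burn rate of about 5 and would alert, while the same 5 failures in 100,000 requests gives about 0.05 and would not. An absolute rule such as "more than 3 errors" would fire in both cases.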