From harness-claude
Audits existing observability instrumentation and designs structured logging, metrics, distributed tracing, and alerting for production services. Use it for coverage gaps and SLI/SLO definition.
npx claudepluginhub intense-visions/harness-engineering --plugin harness-claude

This skill uses the workspace's default tool permissions.
> Structured logging, metrics, distributed tracing, and alerting strategy. The three pillars of observability, assessed and designed for production readiness.
Assesses and implements observability: RED/USE metrics, structured logging, OpenTelemetry tracing, SLOs, alerting rules, and dashboards. Use when adding monitoring or Prometheus, or when a service is shipping without instrumentation.
Audits observability posture across services: scans for RED metrics, SLOs, alerts, runbooks, tracing, structured logging. Reports coverage matrix and critical gaps.
Provides observability expertise: structured logging, RED/USE metrics with Prometheus/Grafana, OpenTelemetry tracing, error tracking, alerting, and production debugging. Activates on observability, logging, metrics, tracing mentions.
See also:
harness-pulse is the read-side companion to this skill. Pulse READS observability data (and external analytics/error signals) to produce daily product-pulse reports; this skill DESIGNS the instrumentation that produces those signals in the first place.
Scan for observability libraries. Check package manifests for instrumentation dependencies:
Locate instrumentation code. Search for logger, metrics, and tracing initialization:
Detect collector and exporter configuration. Look for:
- OpenTelemetry Collector config (otel-collector-config.yaml)
- Prometheus scrape config (prometheus.yml)
- Grafana dashboard provisioning (grafana/dashboards/)
- Datadog Agent config (datadog.yaml)

Identify alerting configuration. Search for:

- Alert rule files (alert.rules.yml)

Present detection summary:
Observability Detection:
Logging: pino (structured JSON) -- 12 logger instances found
Metrics: prom-client -- 8 custom metrics defined
Tracing: @opentelemetry/sdk-trace-node -- initialized in src/tracing.ts
Collector: OpenTelemetry Collector -> Grafana Cloud
Alerting: 3 Prometheus alert rules, Slack integration
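For reference, here is a minimal sketch of the kind of bootstrap code this phase surfaces, assuming a Node.js service on the stack named in the summary above (pino plus the OpenTelemetry Node SDK); the service name and collector endpoint are illustrative:

```typescript
// Hedged sketch: typical instrumentation bootstrap the DETECT phase finds.
// Assumes pino + @opentelemetry/sdk-node; names are illustrative.
import pino from "pino";
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

// Structured JSON logger: every line is a queryable event, not free text.
export const logger = pino({
  level: process.env.LOG_LEVEL ?? "info",
  base: { service: "orders-api" }, // hypothetical service name
});

// Tracing bootstrap -- the kind of code a search for "sdk-trace" or
// "NodeSDK" would locate (e.g. src/tracing.ts in the summary above).
export const otel = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT, // collector endpoint
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});
otel.start();
```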
Audit logging quality. Evaluate each logger usage for:
Audit metrics coverage. Check for standard metrics:
Audit tracing implementation. Verify:
Audit alerting effectiveness. Check each alert for:
Score each pillar and identify gaps:
Observability Audit:
Logging: 7/10 -- structured, but missing correlation IDs in 4 services
Metrics: 5/10 -- RED metrics partial, no business metrics
Tracing: 8/10 -- good coverage, sampling needs tuning
Alerting: 3/10 -- only 3 rules, no SLO-based alerts, no runbooks
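The missing-correlation-ID gap flagged above is usually closed by deriving the logger from the active trace context. A hedged sketch with the OpenTelemetry API and pino, reusing the hypothetical logger from the earlier sketch:

```typescript
import { context, trace } from "@opentelemetry/api";
import { logger } from "./telemetry"; // the hypothetical logger sketched above

// Bind the active trace/span IDs to every log line so logs and traces
// can be joined during an incident.
export function logWithTrace() {
  const span = trace.getSpan(context.active());
  const sc = span?.spanContext();
  return sc
    ? logger.child({ trace_id: sc.traceId, span_id: sc.spanId })
    : logger;
}

logWithTrace().info({ order_id: "ord_123" }, "order created"); // ids illustrative
```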
Design logging strategy. Recommend:
Design metrics strategy. Recommend:
- Request duration histograms (http_request_duration_seconds) -- sketched in code after these steps

Design tracing strategy. Recommend:
Define SLIs and SLOs. For each service endpoint:
Design alerting strategy. Recommend:
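As a concrete instance of the metrics step, the duration histogram named above might be registered with prom-client as follows; bucket boundaries are an assumption and should be aligned with the SLO targets defined in the next step:

```typescript
import client from "prom-client";

// RED "duration" signal with low-cardinality labels only
// (method, route, status -- never user IDs; see the table at the end).
export const httpRequestDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request latency in seconds",
  labelNames: ["method", "route", "status_code"],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5], // illustrative; tune to SLOs
});

// Example observation, e.g. from HTTP middleware:
httpRequestDuration.labels("POST", "/api/orders", "201").observe(0.42);
```

Keeping labels to method/route/status bounds cardinality; per-user detail belongs in logs, as the rationalization table below notes.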
Validate log output. Check that logs from a test run:
Validate metric exposition. Verify:
- The metrics endpoint (/metrics) is accessible and returns Prometheus exposition format

Validate trace propagation. Verify end-to-end:
Validate alerting rules. Check:
Generate observability report:
Observability Validation: [PASS/WARN/FAIL]
Logging: PASS (structured, correlated, no PII detected)
Metrics: WARN (RED metrics present, missing business metrics)
Tracing: PASS (propagation verified, sampling at 10%)
Alerting: FAIL (no SLO-based alerts, 2 of 3 rules missing runbooks)
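A minimal sketch of the metric-exposition check, assuming Node 18+ (global fetch); the port and expected series name are illustrative:

```typescript
// Validate metric exposition: the endpoint answers and expected series exist.
async function checkMetricsEndpoint(): Promise<void> {
  const res = await fetch("http://localhost:3000/metrics"); // port illustrative
  if (!res.ok) throw new Error(`/metrics returned ${res.status}`);
  const body = await res.text();
  for (const name of ["http_request_duration_seconds_bucket"]) {
    if (!body.includes(name)) {
      console.error(`WARN: expected metric ${name} missing from /metrics`);
    }
  }
}

checkMetricsEndpoint().catch((err) => {
  console.error(err);
  process.exit(1);
});
```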
Priority actions:
1. Define SLOs for /api/orders and /api/payments endpoints
2. Add multi-burn-rate alerts based on SLO error budget (rule sketch below)
3. Write runbooks for existing alerts
4. Add order_total and payment_success_rate business metrics
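Priority action 2 could take roughly this shape, sketched in Prometheus alert-rule YAML (the format the audit already scans for). The metric name, route label, windows, and runbook URL are assumptions; 14.4 is the standard fast-burn multiplier, i.e. burning about 2% of a 30-day error budget per hour against a 99.9% SLO:

```yaml
# Hedged sketch of a multi-window burn-rate alert for a 99.9% availability
# SLO (error budget 0.1%). Metric and label names are assumptions.
groups:
  - name: slo-burn-rate
    rules:
      - alert: OrdersApiFastBurn
        # Fires when the error ratio exceeds 14.4x the budget over both
        # the long (1h) and short (5m) windows.
        expr: |
          (
            sum(rate(http_requests_total{route="/api/orders",status_code=~"5.."}[1h]))
            / sum(rate(http_requests_total{route="/api/orders"}[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{route="/api/orders",status_code=~"5.."}[5m]))
            / sum(rate(http_requests_total{route="/api/orders"}[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: page
        annotations:
          summary: Orders API burning error budget at >14x the sustainable rate
          runbook_url: https://runbooks.example.com/orders-api-errors  # placeholder
```

Pairing a long and a short window keeps the alert from continuing to fire after a spike ends while still paging quickly when a real burn starts.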
harness skill run harness-observability -- Primary invocation for observability audit.
harness validate -- Run after instrumentation changes to verify project health.
harness check-deps -- Verify observability library dependencies are installed.
emit_interaction -- Present audit results and SLO design recommendations.

Phase 1: DETECT
Logging: pino with pino-http middleware
Metrics: @opentelemetry/sdk-metrics -> Prometheus
Tracing: @opentelemetry/sdk-trace-node -> Jaeger
Collector: OTel Collector (otel-collector-config.yaml)
Alerting: 5 Prometheus rules in monitoring/alerts.yml
Phase 2: AUDIT
Logging: 8/10 -- structured JSON, trace IDs present, missing request body size
Metrics: 6/10 -- http_request_duration_seconds present, missing queue depth
and business metrics (orders_created_total)
Tracing: 9/10 -- auto-instrumented HTTP + pg + Redis, manual spans on
checkout flow
Alerting: 4/10 -- static thresholds, no SLO burn rate, 2 missing runbooks
Phase 3: DESIGN
SLOs recommended:
- POST /api/orders: 99.9% availability, p99 < 800ms
- GET /api/products: 99.95% availability, p99 < 200ms
Alerting: Replace static "error rate > 5%" with multi-window burn rate
Metrics: Add orders_created_total, cart_abandonment_rate gauges
Logging: Add request/response body size for capacity planning
Phase 4: VALIDATE
Log output: PASS (valid JSON, no PII)
Metrics endpoint: PASS (all custom metrics present)
Trace propagation: PASS (end-to-end verified)
Alert rules: WARN (valid PromQL, but thresholds not SLO-based)
Result: WARN -- alerting strategy needs SLO alignment
Phase 1: DETECT
Logging: zap (structured) across 4 services
Metrics: Datadog dogstatsd client
Tracing: dd-trace-go with automatic HTTP/gRPC instrumentation
Collector: Datadog Agent (datadog.yaml in k8s/)
Alerting: 12 monitors in Datadog (Terraform-managed)
Phase 2: AUDIT
Logging: 9/10 -- consistent structured format, correlation IDs, no PII
Metrics: 7/10 -- RED metrics present, custom counters for business events,
but missing histogram for gRPC call duration
Tracing: 8/10 -- HTTP and gRPC instrumented, database spans present,
Redis spans missing
Alerting: 6/10 -- good coverage but static thresholds, no error budgets
Phase 3: DESIGN
1. Add dd-trace-go Redis integration for complete trace picture
2. Add grpc_server_handling_seconds histogram
3. Define SLOs in Datadog for top 5 endpoints
4. Convert 4 highest-priority monitors to SLO burn rate alerts
5. Add Datadog SLO dashboard for team visibility
Phase 4: VALIDATE
Log output: PASS
Metrics: WARN (missing gRPC histogram)
Traces: WARN (Redis spans missing)
Alerts: WARN (no SLO-based alerts)
Result: WARN -- 3 instrumentation gaps, alerting needs SLO alignment
| Rationalization | Reality |
|---|---|
| "We can see what's happening in CloudWatch logs — we don't need structured logging" | Unstructured log lines cannot be queried, aggregated, or correlated across services. When an incident spans three services, searching for a request ID across unstructured logs is manual forensics. Structured logging is not a nicety — it is the foundation for incident response. |
| "We'll add alerting once we've seen a few incidents and know what to alert on" | The first incident is the worst time to define alerting. SLO-based burn rate alerts can be defined from traffic patterns before any incidents occur. Waiting for incidents to define thresholds means every early failure goes undetected. |
| "User ID is a useful label for the latency metric — it helps us debug per-user issues" | User ID as a metric label creates one time series per user, which at 100,000 users means 100,000 label combinations. High-cardinality labels exhaust metric storage, cause query timeouts, and make the entire metrics system unstable. Use logs for per-user debugging; use metrics for aggregate signals. |
| "The tracing library is initialized, so we have distributed tracing" | Initializing the library creates root spans but does not propagate context across HTTP boundaries, instrument database calls, or connect traces to logs. Trace initialization without verified end-to-end propagation produces disconnected, useless traces. |
| "We have alerts — they're just not linked to runbooks yet" | An alert that fires at 3am without a runbook link requires the on-call engineer to start debugging from scratch. The absence of a runbook is not a documentation gap; it is a mean-time-to-recover multiplier. |