Skill

observability-gap-analyzer

Logging, metrics, and tracing completeness analysis across the system. Trigger: "observability gaps", "logging analysis", "metrics coverage", "tracing completeness".

Install

Run in your terminal

npx claudepluginhub javimontano/jm-adk --plugin sovereign-architect

Tool Access

This skill is limited to using the following tools:

Read, Glob, Grep, Bash, Agent

Supporting Assets

evals/evals.json

examples/sample-output.md

prompts/use-case-prompts.md

references/body-of-knowledge.md

Skill Content

Observability Gap Analyzer

Assess the three pillars of observability — logging, metrics, and tracing — for completeness, quality, and actionability.

Guiding Principle

"You cannot debug what you cannot observe. Observability is the difference between fixing in minutes and guessing for hours."

Procedure

Step 1 — Logging Assessment

Identify logging framework and configuration (log levels, format, destination).
Check for structured logging vs. unstructured string messages.
Verify log levels are used correctly: ERROR for failures, WARN for degradation, INFO for business events, DEBUG for troubleshooting.
Detect sensitive data in logs: PII, tokens, passwords.
Assess log coverage on critical paths: auth, payments, errors [FACT].
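As a rough illustration of the structured-logging and sensitive-data checks in this step, the sketch below scans sample log lines. The pattern set, the function names, and the assumption that structured logs are one JSON object per line are all hypothetical; a real audit would tune the patterns to the codebase and will see false positives and misses.

```python
import json
import re

# Hypothetical patterns for illustration only; not an exhaustive PII/secret list.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "bearer_token": re.compile(r"Bearer\s+[A-Za-z0-9._~+/-]+=*"),
    "password_field": re.compile(r"(?i)\bpassword\s*[=:]\s*\S+"),
}


def is_structured(line: str) -> bool:
    """Treat a line as structured if it parses as a JSON object (an assumption)."""
    try:
        return isinstance(json.loads(line), dict)
    except ValueError:
        return False


def find_sensitive(line: str) -> list[str]:
    """Return the names of every pattern that matches the log line."""
    return [name for name, pattern in SENSITIVE_PATTERNS.items() if pattern.search(line)]


def scan_logs(lines):
    """Yield (line_number, matched_pattern_names) for each flagged line."""
    for number, line in enumerate(lines, start=1):
        hits = find_sensitive(line)
        if hits:
            yield number, hits


sample = [
    "INFO user login ok user_id=42",
    "DEBUG auth header: Bearer abc.def.ghi",
    "WARN reset requested for alice@example.com password=hunter2",
]
findings = list(scan_logs(sample))
for number, hits in findings:
    print(f"line {number}: {', '.join(hits)}")
```

Regex matching is a first pass only; it flags candidates for manual review rather than proving a log stream is clean.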

Step 2 — Metrics Analysis

Identify metrics instrumentation: Prometheus, StatsD, CloudWatch, custom.
Check for RED metrics per service: Rate, Errors, Duration.
Check for USE metrics per resource: Utilization, Saturation, Errors.
Assess custom business metrics: conversion rates, queue depths, SLI-based metrics.
Verify alerting rules exist for critical metrics [FACT].
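The RED check above can be sketched as a coverage test over the metric names each service exposes. The name fragments and the `services` inventory below are hypothetical conventions chosen for illustration; many setups record errors as a label on the request counter instead of a separate metric, in which case a name-based check like this would report a false gap.

```python
# Hypothetical naming convention: a signal counts as covered when any
# metric name contains one of its fragments.
RED_SIGNALS = {
    "rate": ("_requests_total",),
    "errors": ("_errors_total", "_failures_total"),
    "duration": ("_duration_seconds", "_latency_seconds"),
}


def missing_red_signals(metric_names):
    """Return the RED signals that no metric name appears to cover."""
    missing = []
    for signal, fragments in RED_SIGNALS.items():
        covered = any(fragment in name for name in metric_names for fragment in fragments)
        if not covered:
            missing.append(signal)
    return missing


# Hypothetical inventory, e.g. collected from each service's metrics endpoint.
services = {
    "checkout": ["checkout_requests_total", "checkout_duration_seconds_bucket"],
    "auth": ["auth_requests_total", "auth_errors_total", "auth_latency_seconds"],
}

gaps = {service: missing_red_signals(names) for service, names in services.items()}
for service, missing in gaps.items():
    status = "complete" if not missing else "missing " + ", ".join(missing)
    print(f"{service}: {status}")
```

The same shape works for USE metrics per resource by swapping in fragments for utilization, saturation, and errors.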

Step 3 — Tracing Completeness

Check for distributed tracing implementation (OpenTelemetry, Jaeger, Datadog APM).
Verify trace context propagation across service boundaries.
Assess span coverage: are all significant operations instrumented?
Check for trace sampling strategy (100% vs. probabilistic vs. tail-based).
Verify correlation between traces, logs, and metrics (trace ID in logs) [FACT].
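One way to spot-check context propagation is to compare the trace ID a service receives with the one it sends downstream. The sketch below assumes the services use the W3C Trace Context `traceparent` header (version `00`), which is an assumption about the system under review; the function names are hypothetical, and other propagation formats would need their own parser.

```python
import re

# Version 00 layout: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header: str):
    """Parse a version-00 traceparent header; return None if it is invalid."""
    match = TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # This sketch only handles version 00; all-zero IDs are invalid per the spec.
    if version != "00" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def propagation_preserved(inbound: str, outbound: str) -> bool:
    """True when both headers are valid and carry the same trace ID."""
    received = parse_traceparent(inbound)
    sent = parse_traceparent(outbound)
    return received is not None and sent is not None and received["trace_id"] == sent["trace_id"]


inbound = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outbound = "00-4bf92f3577b34da6a3ce929d0e0e4736-b7ad6b7169203331-01"
print(propagation_preserved(inbound, outbound))
```

A service that starts a fresh trace ID on each outbound call would fail this check, which is the broken-propagation case this step is looking for.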

Step 4 — Observability Gap Report

Score each pillar with equal weight: logging, metrics, and tracing each count for one third of the overall score.
Identify blind spots: system states that cannot be diagnosed with current instrumentation.
Map gaps to incident scenarios: "If X fails, can we detect and diagnose it?"
Recommend instrumentation additions prioritized by incident impact.
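The scoring and prioritization in this step could be sketched as follows, assuming each pillar is scored from 0 to 100 and weighted equally. The `Gap` record, the 1 to 5 impact scale, and the sample findings are hypothetical and only illustrate the shape of a report.

```python
from dataclasses import dataclass

PILLARS = ("logging", "metrics", "tracing")


def overall_score(pillar_scores: dict) -> float:
    """Average the three pillar scores (0-100), each weighted one third."""
    missing = [pillar for pillar in PILLARS if pillar not in pillar_scores]
    if missing:
        raise ValueError(f"unscored pillars: {', '.join(missing)}")
    return sum(pillar_scores[pillar] for pillar in PILLARS) / len(PILLARS)


@dataclass
class Gap:
    pillar: str
    scenario: str  # "If X fails, can we detect and diagnose it?"
    impact: int    # hypothetical scale: 1 (minor) to 5 (severe)


def prioritize(gaps):
    """Order gaps so the highest incident impact comes first."""
    return sorted(gaps, key=lambda gap: gap.impact, reverse=True)


# Hypothetical findings for illustration.
gaps = [
    Gap("logging", "Payment provider times out", 3),
    Gap("tracing", "Auth service drops trace context", 5),
    Gap("metrics", "Queue depth grows unbounded", 4),
]

score = overall_score({"logging": 80, "metrics": 50, "tracing": 20})
print(f"overall: {score:.1f}")
for gap in prioritize(gaps):
    print(f"[{gap.impact}] {gap.pillar}: {gap.scenario}")
```

Raising an error on an unscored pillar keeps the report from silently averaging over two pillars, which would hide exactly the kind of blind spot the analysis is meant to surface.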

Quality Criteria

All three pillars assessed independently [FACT]
Gaps mapped to specific incident scenarios
Sensitive data in logs identified and flagged
Correlation between pillars checked (trace IDs in logs, metrics linked to traces)

Anti-Patterns

Only checking if logging exists without assessing quality and coverage
Ignoring metrics while focusing only on logs (logs don't scale for dashboards)
Implementing tracing without propagating context across boundaries
Alert fatigue: too many alerts with no prioritization or runbooks

Stats

Parent Repo Stars: 0

Parent Repo Forks: 0

Last Commit: Mar 22, 2026
