From claude-resources
Observability principles: structured logging, metrics (RED/USE), distributed tracing, health checks, alerting philosophy. Use when adding instrumentation to a service, debugging a production incident that exposes blind spots, designing alerting rules, setting up Prometheus / OpenTelemetry / slog, or writing `/healthz` and `/readyz` endpoints. Trigger on any task mentioning "logging", "metrics", "traces", "observability", "dashboards", "alerts", "pagerduty", "SLO", "golden signals", or "why is this service slow?". Pair with a language-specific observability skill for instrumentation code.
`npx claudepluginhub deandum/claude-resources --plugin go-skills`

This skill uses the workspace's default tool permissions.
Log only actionable information. Where logging is expensive, instrumentation is cheap.
Add `/healthz` (liveness) and `/readyz` (readiness) before anything else.

| Pillar | Purpose | Granularity |
|---|---|---|
| Logs | Discrete events, debugging | Per-request |
| Metrics | Aggregated measurements, dashboards | Per-interval |
| Traces | Request flow across services | Per-request |

Logs answer "what happened?" Metrics answer "how much?" Traces answer "where did time go?"

| Level | Use For | Production |
|---|---|---|
| Error | Failures needing attention or alerting | Always on |
| Warn | Unusual situations, potential problems | Always on |
| Info | State changes, request completion, startup/shutdown | Always on |
| Debug | Detailed diagnostics | Off by default |

| Type | Use For | Example |
|---|---|---|
| Counter | Totals that only increase | http_requests_total |
| Histogram | Distributions (latency, size) | http_request_duration_seconds |
| Gauge | Current value that goes up/down | active_connections |
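
To make the three metric types concrete, here is a standard-library-only sketch of their semantics. It is an illustration rather than a replacement for a metrics library: the `Counter`, `Gauge`, and `Histogram` types are hypothetical, and the histogram assumes cumulative upper-bound buckets in the style Prometheus uses.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Counter only ever increases (for example http_requests_total).
type Counter struct{ n atomic.Uint64 }

func (c *Counter) Inc()          { c.n.Add(1) }
func (c *Counter) Value() uint64 { return c.n.Load() }

// Gauge is a current value that moves up and down (for example active_connections).
type Gauge struct{ n atomic.Int64 }

func (g *Gauge) Add(delta int64) { g.n.Add(delta) }
func (g *Gauge) Value() int64    { return g.n.Load() }

// Histogram counts observations into cumulative buckets
// (for example http_request_duration_seconds).
type Histogram struct {
	mu     sync.Mutex
	bounds []float64 // ascending upper bounds, in seconds
	counts []uint64  // counts[i] = observations <= bounds[i]; last slot is +Inf
}

func NewHistogram(bounds ...float64) *Histogram {
	return &Histogram{bounds: bounds, counts: make([]uint64, len(bounds)+1)}
}

// Observe records one measurement in every bucket whose bound it fits under.
func (h *Histogram) Observe(v float64) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for i, b := range h.bounds {
		if v <= b {
			h.counts[i]++
		}
	}
	h.counts[len(h.bounds)]++ // the +Inf bucket sees every observation
}

// Buckets returns a copy of the cumulative bucket counts.
func (h *Histogram) Buckets() []uint64 {
	h.mu.Lock()
	defer h.mu.Unlock()
	return append([]uint64(nil), h.counts...)
}

func main() {
	var requests Counter
	var active Gauge
	latency := NewHistogram(0.1, 0.5, 1)

	// Simulate three requests with made-up durations in seconds.
	for _, seconds := range []float64{0.05, 0.3, 2.0} {
		active.Add(1)
		requests.Inc()
		latency.Observe(seconds)
		active.Add(-1)
	}

	fmt.Println("http_requests_total", requests.Value())
	fmt.Println("active_connections", active.Value())
	fmt.Println("http_request_duration_seconds buckets", latency.Buckets())
}
```

The counter and histogram cover the rate and duration parts of RED; the gauge returns to zero once the simulated requests finish, which is the behavior a counter cannot express.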

- Liveness (`/healthz`): process alive? Always 200 if running. No dependency checks.
- Readiness (`/readyz`): ready for traffic? Check dependencies (DB, cache, downstream services).

- Actionable: `error_rate > 5% for 5 minutes` (users are seeing failures, so act)
- Often not actionable: `instance restart detected` (may be a planned deploy and may not need action)

For the full golden-signals decision framework, severity calibration, and alert-fatigue prevention, see references/alerting.md.

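
The liveness and readiness split described above might look like the following `net/http` sketch. The `Check` type, `staticCheck`, `errDown`, `newMux`, and `statusOf` are hypothetical names introduced for illustration, and the 2-second timeout is an assumed value; a real readiness check would ping the actual database or cache instead of returning a fixed error.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// Check reports whether one dependency is usable.
type Check func(ctx context.Context) error

// errDown simulates an unreachable dependency in this sketch.
var errDown = errors.New("connection refused")

// staticCheck returns a Check with a fixed result (stand-in for a real ping).
func staticCheck(err error) Check {
	return func(context.Context) error { return err }
}

// livenessHandler answers 200 whenever the process can serve HTTP.
// It deliberately performs no dependency checks.
func livenessHandler(w http.ResponseWriter, _ *http.Request) {
	w.WriteHeader(http.StatusOK)
	fmt.Fprintln(w, "ok")
}

// readinessHandler answers 200 only if every dependency check passes,
// otherwise 503 so that traffic is routed elsewhere.
func readinessHandler(checks map[string]Check) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second) // assumed budget
		defer cancel()
		// Map order is random, so the first failure reported is arbitrary.
		for name, check := range checks {
			if err := check(ctx); err != nil {
				http.Error(w, name+": "+err.Error(), http.StatusServiceUnavailable)
				return
			}
		}
		fmt.Fprintln(w, "ready")
	}
}

func newMux(checks map[string]Check) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", livenessHandler)
	mux.Handle("/readyz", readinessHandler(checks))
	return mux
}

// statusOf serves one in-memory GET request and returns the status code.
func statusOf(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	mux := newMux(map[string]Check{"db": staticCheck(errDown)})
	fmt.Println("/healthz", statusOf(mux, "/healthz"))
	fmt.Println("/readyz", statusOf(mux, "/readyz"))
}
```

The demo exercises the handlers in memory through `httptest`; a running service would pass the same mux to `http.ListenAndServe`. Keeping liveness free of dependency checks means a database outage takes the instance out of rotation without causing it to be restarted.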
| Shortcut | Reality |
|---|---|
| "We'll add logging later" | Debugging without logs is guessing. Add structured logging from day one. |
| "Log everything to be safe" | Noise hides signal. Log at boundaries with structured fields. |
| "High-cardinality labels won't hurt" | Cardinality explodes: a label with 10k values yields at least 10k time series per metric, and the value counts of independent labels multiply. Scrape, memory, and storage costs all grow with the series count. Put high-cardinality data in logs or traces, not metric labels. |
| "We'll sample logs to save costs" | Head-based sampling can drop the exact trace you need to debug an incident. Use request-level sampling with biased retention (always keep errors, sample success). |
| "Metrics are overkill for this" | You cannot improve what you cannot measure. The RED method (rate, errors, duration) is cheap to implement. |
| "Averages are good enough" | Averages hide outliers. Use histograms for latency — p50/p95/p99 matter. |
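
The last row of the table can be checked with a little arithmetic. The sketch below uses made-up latencies (95 requests at 20 ms and 5 at 2 s) and a nearest-rank percentile; the function names `mean`, `percentile`, and `sampleLatencies` are hypothetical, and production systems typically estimate percentiles from histogram buckets rather than sorting raw samples.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// mean returns the arithmetic mean of the durations.
func mean(ds []time.Duration) time.Duration {
	if len(ds) == 0 {
		return 0
	}
	var sum time.Duration
	for _, d := range ds {
		sum += d
	}
	return sum / time.Duration(len(ds))
}

// percentile returns the nearest-rank p-th percentile for 1 <= p <= 100.
func percentile(ds []time.Duration, p int) time.Duration {
	if len(ds) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), ds...) // leave the input untouched
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := (p*len(sorted) + 99) / 100 // integer ceil(p/100 * n)
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// sampleLatencies returns illustrative data: mostly fast, a few very slow.
func sampleLatencies() []time.Duration {
	ds := make([]time.Duration, 0, 100)
	for i := 0; i < 95; i++ {
		ds = append(ds, 20*time.Millisecond)
	}
	for i := 0; i < 5; i++ {
		ds = append(ds, 2*time.Second)
	}
	return ds
}

func main() {
	ds := sampleLatencies()
	fmt.Println("mean:", mean(ds))
	for _, p := range []int{50, 95, 99} {
		fmt.Printf("p%d: %v\n", p, percentile(ds, p))
	}
}
```

For this data the mean is 119 ms, a latency no request actually experienced, while p50 is 20 ms and p99 is 2 s. The percentiles show both the typical case and the slow tail that the average blends together.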

Health endpoints are in place (`/healthz` and `/readyz`).