Observability
If you cannot ask arbitrary questions about your system's behavior from the outside, your system is not observable —
it is merely monitored.
Observability is the ability to understand internal system state from its external outputs. Three complementary signals
make this possible: logs (discrete events), metrics (aggregate measurements), and traces (request-scoped
causal chains). Each pillar answers different questions. Using the wrong pillar for a question wastes resources and
hides the answer.
This skill covers high-level observability discipline — purposes, interconnection, and good practices for all three
pillars. It is technology-agnostic: specific tools (Prometheus, StatsD, OpenTelemetry) have their own dedicated skills.
The Three Pillars
Logs — What Happened
Logs are timestamped, discrete event records. They capture what happened at a specific moment: an error thrown, a
user action, a configuration loaded, a connection refused.
Use logs when you need:
- Rich diagnostic context for a specific event
- Debugging information with full error details and stack traces
- Audit trails of who did what and when
- Record of discrete state transitions
Logs are poor at:
- Showing aggregate system health (use metrics)
- Tracing request flow across services (use traces)
- High-frequency numeric trends (too expensive at volume)
Metrics — How Is It Doing
Metrics are numeric measurements aggregated over time. They capture how the system is performing as quantitative
time series: request rates, error percentages, latencies, queue depths, resource utilization.
Use metrics when you need:
- Real-time health signals and alerting
- Trend analysis over hours, days, weeks
- Capacity planning and saturation monitoring
- Pre-aggregated data that scales cheaply regardless of traffic
Metrics are poor at:
- Explaining why something is broken (use logs)
- Showing the path of a single request (use traces)
- Storing per-event detail (cardinality explosion)
Traces — How Did It Flow
Traces record the causal chain of operations that make up a single request as it propagates through distributed
components. A trace is a tree of spans, where each span represents one unit of work (an HTTP call, a database query,
a queue publish).
Use traces when you need:
- End-to-end latency breakdown across services
- Dependency mapping and bottleneck identification
- Understanding the path a failing request took
- Correlating work across process and network boundaries
Traces are poor at:
- Aggregate health monitoring (use metrics)
- Detailed per-event diagnostics on a single node (use logs)
- Cheap, long-term trend storage (traces are expensive at 100% sampling)
Choosing the Right Signal
- "Is the system healthy right now?" — Metrics
- "Why did this specific request fail?" — Traces + Logs
- "What happened at 03:14 on node-7?" — Logs
- "Where is the bottleneck in checkout flow?" — Traces
- "Are error rates increasing over the last hour?" — Metrics
- "What was the full stack trace of that exception?" — Logs
- "Which downstream service is slow?" — Traces
- "How much headroom does the database have?" — Metrics
Structured Logging
Always Structured
Emit logs as structured records (JSON or equivalent key-value format) with a consistent schema. Unstructured string logs
are acceptable only in local development. Structured logs are machine-parseable, indexable, and filterable at scale.
Log Levels
Use levels consistently. Every team member must agree on what each level means.
- FATAL/CRITICAL — Process cannot continue; about to crash. Alerting: Page immediately
- ERROR — Operation failed; requires investigation. Alerting: Alert / ticket
- WARN — Unexpected condition; system compensated. Alerting: Monitor trend
- INFO — Significant business or lifecycle event. Alerting: Dashboard
- DEBUG — Diagnostic detail for developers. Alerting: Never in production by default
- TRACE — Extremely verbose step-by-step flow. Alerting: Never in production
Rules:
- Production defaults to INFO or above. DEBUG/TRACE are off unless explicitly enabled for a bounded investigation
window.
- WARN is not a dumping ground. If it never leads to action, it is noise — downgrade to DEBUG or remove it.
- ERROR means something is broken. Expected conditions (404 for missing resources, validation failures from bad input)
are not errors — they are INFO with a status field.
- Make log level configurable at runtime without restarts.
Structured Fields
Every log record should include these baseline fields:
- `timestamp`: ISO 8601, UTC
- `level`: Severity (ERROR, WARN, INFO, ...)
- `message`: Human-readable summary of the event
- `service`: Name of the service emitting the log
- `version`: Service version / build / commit SHA
- `trace_id`: Distributed trace ID (if in request context)
- `span_id`: Current span ID (if in request context)
Add contextual fields relevant to the event:
- `user_id`: User-initiated actions
- `request_id`: Per-request correlation
- `duration_ms`: Timed operations
- `error.type`: Error class/name
- `error.message`: Error description
- `error.stack`: Stack trace (ERROR level only)
- `http.method`, `http.path`, `http.status`: HTTP request/response
- `db.operation`, `db.duration_ms`: Database calls
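A minimal sketch of emitting such a record, assuming Go's `log/slog` with a JSON handler (the handler supplies the timestamp, level, and message fields; the service name, version, and field values are placeholders, and the trace fields would be added the same way when a span is active):

```go
package main

import (
	"errors"
	"log/slog"
	"os"
	"time"
)

func main() {
	// The JSON handler emits timestamp, level, and message automatically;
	// service and version are attached to every record via With().
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil)).With(
		slog.String("service", "checkout"),      // placeholder service name
		slog.String("version", "1.4.2+abc1234"), // build / commit SHA
	)

	err := errors.New("upstream gateway timed out")
	elapsed := 1340 * time.Millisecond

	// Contextual fields for a timed HTTP operation that failed.
	logger.Error("payment authorization failed",
		slog.String("request_id", "req-7f3a"),
		slog.Int64("duration_ms", elapsed.Milliseconds()),
		slog.String("http.method", "POST"),
		slog.String("http.path", "/payments"),
		slog.Int("http.status", 502),
		slog.String("error.type", "GatewayTimeout"),
		slog.String("error.message", err.Error()),
	)
}
```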
Sensitive Data
Never log:
- Passwords, tokens, API keys, secrets
- Full credit card numbers, SSNs, or equivalent PII
- Session tokens or authentication cookies
- Request/response bodies containing user-submitted personal data
When user identifiers are needed, log opaque IDs (user_id), not email addresses or names. If regulations (GDPR, HIPAA)
apply, verify that logged fields comply. When in doubt, omit the field.
Logging at Boundaries
At application startup:
- INFO: service name, version, loaded configuration (without secrets), listen address
- WARN: degraded mode (e.g., fallback to local cache because Redis is unreachable)
- ERROR/FATAL: unrecoverable startup failures
Per incoming request:
- INFO: method, path (scrubbed of PII), status code, duration, request dimensions (tenant, region)
- WARN/ERROR: only for unexpected exceptions; catch at the top-level handler
Per outgoing dependency call:
- INFO or DEBUG: target service, operation, status, duration
- ERROR: failures in dependent services (Redis, database, queue, etc.)
Log Once, at the Right Level
Log a raised exception once. Do not catch-log-rethrow at every layer. Let exceptions propagate to the top-level
handler, which logs with full context. If you must log and rethrow, do it only when adding context that would otherwise
be lost.
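A hedged sketch of the pattern in Go (assuming Go 1.22+ routing): inner layers wrap errors with context and return them, and only the top-level handler logs. The route, `fetchOrder` stub, and field names are illustrative, not a prescribed API:

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log/slog"
	"net/http"
	"os"
)

type Order struct{ ID string }

var logger = slog.New(slog.NewJSONHandler(os.Stdout, nil))

// fetchOrder stands in for a repository layer: it wraps the error with
// context and returns it without logging.
func fetchOrder(ctx context.Context, id string) (*Order, error) {
	return nil, fmt.Errorf("load order %s: %w", id, context.DeadlineExceeded)
}

// The top-level handler logs the failure exactly once, with full context,
// instead of every layer catch-log-rethrowing the same error.
func handleOrder(w http.ResponseWriter, r *http.Request) {
	o, err := fetchOrder(r.Context(), r.PathValue("id"))
	if err != nil {
		logger.ErrorContext(r.Context(), "get order failed",
			slog.String("http.method", r.Method),
			slog.String("http.path", r.URL.Path),
			slog.String("error.message", err.Error()),
		)
		http.Error(w, "internal error", http.StatusInternalServerError)
		return
	}
	_ = json.NewEncoder(w).Encode(o)
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("GET /orders/{id}", handleOrder)
	_ = http.ListenAndServe(":8080", mux)
}
```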
Metrics
Metric Types
- Counter — Monotonically increasing; resets on restart. Use for totals: requests, errors, bytes sent
- Gauge — Arbitrary value; goes up and down. Use for snapshots: queue depth, memory usage, connections
- Histogram — Client-side aggregation into buckets. Use for distributions: request latency, payload size
- Summary — Client-side quantile calculation. Use for pre-computed percentiles (less flexible than histogram)
Rules:
- Use counters for events that accumulate. Derive rates from counters (`rate()`, `increase()`); never store pre-computed rates.
- Use gauges for current-state snapshots. Never `rate()` a gauge.
- Use histograms for latency and size distributions. Histograms enable percentile calculation across instances;
summaries do not aggregate.
- Export timestamps as Unix epoch seconds, not "time since" values.
- Initialize all metrics with a zero value at startup to avoid missing-metric problems.
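A sketch of the three main types under these rules, assuming the Prometheus Go client (client_golang); the metric names, label values, and port are illustrative:

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Counter: accumulates forever; rates are derived at query time.
	requestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "myapp_http_requests_total",
		Help: "Total HTTP requests handled.",
	}, []string{"method", "status"})

	// Gauge: current-state snapshot that moves up and down; updated from
	// the worker or serving path, not shown here.
	queueDepth = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "myapp_work_queue_depth",
		Help: "Items currently waiting in the work queue.",
	})

	// Histogram: latency distribution with exponential buckets (5ms to ~10s);
	// percentiles are computed at query time and aggregate across instances.
	requestDuration = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "myapp_http_request_duration_seconds",
		Help:    "HTTP request latency in seconds.",
		Buckets: prometheus.ExponentialBuckets(0.005, 2, 12),
	})
)

func main() {
	// Pre-create expected label combinations so the series exist at zero
	// from startup rather than appearing only after the first event.
	requestsTotal.WithLabelValues("GET", "2xx").Add(0)

	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":2112", nil)
}
```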
What to Measure
The Four Golden Signals (Google SRE)
For every user-facing service, measure these four:
- Latency — Time to serve a request. Example: `http_request_duration_seconds` histogram
- Traffic — Demand on the system. Example: `http_requests_total` counter by method/path
- Errors — Rate of failed requests. Example: `http_requests_total{status=~"5.."}`
- Saturation — How "full" the service is. Example: CPU usage, memory, queue depth, thread pool
Distinguish successful latency from error latency. A fast 500 is not good latency. A slow error is worse than a fast
error. Track both.
RED Method (Request-Centric)
For every microservice:
- Rate — requests per second
- Errors — failed requests per second
- Duration — distribution of request latency
RED is a focused subset of the golden signals, optimized for request-driven services.
USE Method (Resource-Centric)
For every resource (CPU, memory, disk, network, thread pool):
- Utilization — percentage of capacity in use
- Saturation — backlog / queue depth
- Errors — resource-level error count
RED tells you what is degraded from the user's perspective. USE tells you why at the infrastructure level. Use both
together.
Service-Type Instrumentation
- Online-serving (HTTP, gRPC) — Request rate, error rate, latency (p50/p90/p99), in-flight requests
- Offline-processing (workers, pipelines) — Items in/out per stage, processing duration, last-processed timestamp,
queue depth
- Batch jobs — Last successful completion time, job duration, records processed, exit status (see the sketch after this list)
- Caches — Hit rate, miss rate, eviction count, latency to backend on miss
- Thread/connection pools — Pool size, active count, queue length, wait time
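For the batch-job row above, a hedged sketch of the last-success-timestamp pattern, again assuming the Prometheus Go client; the metric name and `run` wrapper are illustrative:

```go
package batch

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Last-success timestamp as a gauge: alerting on now() minus this value
// exceeding the expected schedule also catches a job that never ran at all.
var lastSuccess = promauto.NewGauge(prometheus.GaugeOpts{
	Name: "myapp_batch_last_success_timestamp_seconds",
	Help: "Unix time of the last successful batch run.",
})

// run executes the job and records success; on failure the gauge is left
// untouched so growing staleness stays visible.
func run(process func() error) error {
	if err := process(); err != nil {
		return err
	}
	lastSuccess.SetToCurrentTime()
	return nil
}
```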
Metric Naming
Metric names should be self-documenting. Follow these conventions:
- Prefix with namespace: `myapp_http_requests_total`, not `requests_total`.
- Use base units: seconds (not milliseconds), bytes (not megabytes), ratio 0-1 (not percentage 0-100).
- Suffix with unit: `_seconds`, `_bytes`, `_total` (for unit-less counters).
- One metric, one unit, one quantity. Never mix request size with request duration in the same metric.
- snake_case: `http_request_duration_seconds`, not `httpRequestDurationSeconds`.
| Good | Bad |
|---|---|
| `http_request_duration_seconds` | `request_latency` (no unit, ambiguous) |
| `http_requests_total` | `http_responses_500_total` (use labels) |
| `node_memory_usage_bytes` | `memory_mb` (not base unit) |
| `process_cpu_seconds_total` | `cpu_percent` (use ratio 0-1) |
Labels and Cardinality
Labels add dimensions to a metric. Every unique combination of label values creates a separate time series.
Good labels (bounded, low cardinality):
- `method` (GET, POST, PUT, DELETE)
- `status_code` (200, 404, 500 — or class: 2xx, 4xx, 5xx)
- `service`, `region`, `version`
Dangerous labels (unbounded, high cardinality):
- `user_id`, `email`, `session_id`
- `request_path` with dynamic segments (`/users/12345`)
- `error_message` (arbitrary strings)
Rules:
- Keep label cardinality below 10 values per label for most metrics.
- If a label can grow unbounded, it does not belong on a metric. Log it instead.
- Use labels instead of encoding dimensions in the metric name: `http_requests_total{method="GET"}`, not `http_get_requests_total`.
- Ensure `sum()` or `avg()` across all label values is meaningful. If not, split into separate metrics.
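A short sketch of bounded labels in practice, assuming the Prometheus Go client; the metric and helper names are illustrative:

```go
package metrics

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Bounded labels: a handful of methods times a handful of status classes.
var requestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "myapp_http_requests_total",
	Help: "Total HTTP requests.",
}, []string{"method", "status_class"})

// RecordRequest increments the counter with bounded label values only.
func RecordRequest(method string, status int) {
	class := fmt.Sprintf("%dxx", status/100) // 2xx, 4xx, 5xx: bounded
	requestsTotal.WithLabelValues(method, class).Inc()
	// Do NOT label by user_id, session_id, or raw paths like /users/12345;
	// log those instead, and use the route template (/users/{id}) if a
	// path dimension is needed.
}
```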
Percentiles and Tail Latency
Averages hide outliers. A service with 100ms average latency may have 1% of requests taking 5 seconds. That 1% tail can
dominate user experience when users hit multiple services per page load.
- Always track p50, p90, p99 latency at minimum.
- Use histograms with exponentially distributed bucket boundaries (e.g., 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s,
2.5s, 5s, 10s).
- Alert on p99, not mean. Mean latency alerts miss tail degradation.
Distributed Tracing
Core Concepts
- Trace — End-to-end record of a single request across all services
- Span — One unit of work within a trace (HTTP call, DB query, function)
- Root span — First span in a trace; has no parent
- Child span — Span nested under a parent; represents a sub-operation
- Span context — Immutable bag of `trace_id` + `span_id` + flags, propagated across boundaries
- Span attributes — Key-value metadata on a span (http.method, db.statement)
- Span events — Timestamped annotations within a span's lifetime
- Span links — Causal references between spans in different traces
Span Kinds
- Client — Outgoing synchronous call. Example: HTTP request to another service
- Server — Incoming synchronous call. Example: Handling an HTTP request
- Producer — Creates async work. Example: Publishing to a message queue
- Consumer — Processes async work. Example: Consuming from a message queue
- Internal — No network boundary. Example: In-process function instrumentation
Context Propagation
Context propagation is the mechanism that connects spans across process boundaries into a single trace. Without it, you
get disconnected spans, not traces.
Rules:
- Propagate context on every outgoing call. HTTP headers (W3C Trace Context or B3), message metadata, gRPC metadata
— every cross-process boundary must carry trace context.
- Extract context on every incoming call. The receiving service must extract trace context and create a child span
under the propagated parent.
- Use W3C Trace Context (`traceparent`/`tracestate`) as the default propagation format unless the ecosystem requires otherwise (e.g., legacy B3).
- Never generate a new trace ID when you should be continuing an existing trace. A new trace ID means a broken
trace.
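A hedged sketch of manual inject/extract using the OpenTelemetry Go API (in practice, HTTP auto-instrumentation usually handles this); the function names and URL parameter are illustrative:

```go
package propagate

import (
	"context"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

func init() {
	// W3C Trace Context (traceparent/tracestate) as the propagation format.
	otel.SetTextMapPropagator(propagation.TraceContext{})
}

// Outgoing call: inject the current trace context into the request headers.
func callDownstream(ctx context.Context, url string) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
	return http.DefaultClient.Do(req)
}

// Incoming call: extract the propagated context so the span started for this
// request becomes a child of the caller's span, not a new trace.
func extractContext(r *http.Request) context.Context {
	return otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
}
```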
What to Trace
Instrument at meaningful boundaries:
- Incoming HTTP/gRPC requests — Always — auto-instrument
- Outgoing HTTP/gRPC calls — Always — auto-instrument
- Database queries — Always — auto-instrument or manual
- Cache operations — Yes — hit/miss as attribute
- Queue publish/consume — Yes — link producer and consumer spans
- Significant business operations — Yes — manual spans for key logic
- Tight loops / trivial functions — No — noise, performance cost
Span Attributes
Attach attributes that enable filtering and analysis:
- `http.method`, `http.route`, `http.status_code`: HTTP spans
- `db.system`, `db.operation`, `db.statement`: Database spans
- `messaging.system`, `messaging.operation`: Queue spans
- `rpc.system`, `rpc.method`: RPC spans
- `error` (boolean), `error.type`, `error.message`: Error conditions
- `service.name`, `service.version`: All spans (set on resource)
Use established semantic conventions (OpenTelemetry's, for example) for attribute names rather than inventing
custom ones. Consistent naming enables cross-service analysis.
Span Status
- Unset — Completed without error (default). When: most successful operations
- Error — Operation failed. When: server errors, exceptions
- Ok — Explicitly marked successful. When: only when you need to override ambiguity
Leave status as Unset for normal success. Set Error only for actual failures. Do not set Error for client errors
like 404 on a server span — the server operated correctly.
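A hedged sketch tying manual spans, attributes, error recording, and status together with the OpenTelemetry Go API; the tracer name, attribute keys, and payment-provider stub are illustrative:

```go
package checkout

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

var tracer = otel.Tracer("checkout")

// chargeCard wraps a significant business operation in a manual span.
func chargeCard(ctx context.Context, orderID string, cents int64) error {
	ctx, span := tracer.Start(ctx, "charge_card")
	defer span.End()

	span.SetAttributes(
		attribute.String("order.id", orderID),
		attribute.Int64("payment.amount_cents", cents),
	)

	if err := callPaymentProvider(ctx, orderID, cents); err != nil {
		span.RecordError(err) // adds a span event carrying the error details
		span.SetStatus(codes.Error, "payment provider call failed")
		return err
	}
	// Success: leave the status Unset; there is no need to set Ok explicitly.
	return nil
}

// callPaymentProvider stands in for the real provider client.
func callPaymentProvider(ctx context.Context, orderID string, cents int64) error {
	return nil
}
```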
Sampling
At high traffic volumes, tracing 100% of requests is expensive. Sampling reduces cost while preserving signal.
| Strategy | How It Works | Trade-off |
|---|---|---|
| Head-based | Decide at trace start whether to sample | Simple; may miss rare errors |
| Tail-based | Decide after trace completes based on content | Catches errors; needs buffering infrastructure |
| Always-on for errors | Sample 100% of error traces, probabilistic for success | Good default balance |
Rules:
- Never drop error traces. If cost is a concern, sample successful traces at a lower rate but keep 100% of error and
high-latency traces.
- Sample at the entry point (head) and propagate the decision. Do not let each service decide independently — this
creates partial traces.
- Start with a low sampling rate (1-10%) and increase based on need, not the reverse.
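A sketch of a head-based, parent-respecting sampler using the OpenTelemetry Go SDK; the 5% ratio is an arbitrary starting point, and keeping 100% of error and high-latency traces would additionally require tail-based sampling in a collector:

```go
package tracing

import (
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// newProvider samples 5% of new traces at the entry point and honors the
// parent's decision everywhere else, so no trace is partially sampled.
func newProvider() *sdktrace.TracerProvider {
	return sdktrace.NewTracerProvider(
		sdktrace.WithSampler(
			sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.05)),
		),
	)
}
```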
Connecting the Pillars
The three pillars become powerful when correlated. An alert fires on a metric → you find the offending trace → the trace
points to a span → the span's logs reveal the root cause.
Correlation Keys
- `trace_id` — Links logs and spans to the same trace. Where: logs, span context
- `span_id` — Links a log to the exact span that produced it. Where: logs, span context
- `request_id` — Correlates all work for one inbound request. Where: logs, HTTP headers
- `service.name` + `service.version` — Groups telemetry by source. Where: all signals
Rules:
- Embed trace_id and span_id in every log record emitted within a request context. This is the primary bridge
between logs and traces.
- Use a correlation/request ID that is assigned at the edge (API gateway, load balancer) and propagated to all
downstream services.
- Attach exemplars to metrics. An exemplar is a trace_id attached to a specific metric observation, enabling
drill-down from a metric spike to a representative trace.
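A small sketch of the log-to-trace bridge, assuming Go's `log/slog` plus the OpenTelemetry trace API; the helper name `WithTrace` is illustrative:

```go
package logging

import (
	"context"
	"log/slog"

	"go.opentelemetry.io/otel/trace"
)

// WithTrace returns a logger that carries trace_id and span_id whenever the
// context holds an active span, so every log record bridges to its trace.
func WithTrace(ctx context.Context, logger *slog.Logger) *slog.Logger {
	sc := trace.SpanContextFromContext(ctx)
	if !sc.IsValid() {
		return logger
	}
	return logger.With(
		slog.String("trace_id", sc.TraceID().String()),
		slog.String("span_id", sc.SpanID().String()),
	)
}
```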
The Correlation Workflow
Metrics dashboard shows error rate spike
→ Filter by service + time window
→ Find exemplar trace_id on the error counter
→ Open trace in tracing UI
→ Identify the failing span (database timeout)
→ Search logs by trace_id for full error details
→ Root cause: connection pool exhausted
Metric-to-Trace Exemplars
Exemplars attach a trace_id sample to a metric data point. When you see a latency spike on a histogram, the exemplar
gives you a concrete trace to investigate rather than guessing.
- Attach exemplars to histogram observations for latency metrics.
- Attach exemplars to counter increments for error metrics.
- Not every metric point needs an exemplar — one per scrape interval is sufficient.
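If the metrics stack supports exemplars (OpenMetrics exposition), attaching one might look roughly like this with the Prometheus Go client; the metric and helper names are illustrative:

```go
package metrics

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"go.opentelemetry.io/otel/trace"
)

var requestDuration = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "myapp_http_request_duration_seconds",
	Help:    "HTTP request latency in seconds.",
	Buckets: prometheus.DefBuckets,
})

// ObserveLatency records the latency and, when a sampled trace is active,
// attaches its trace_id as an exemplar so a spike links to a concrete trace.
func ObserveLatency(ctx context.Context, d time.Duration) {
	sc := trace.SpanContextFromContext(ctx)
	if eo, ok := requestDuration.(prometheus.ExemplarObserver); ok && sc.IsSampled() {
		eo.ObserveWithExemplar(d.Seconds(), prometheus.Labels{
			"trace_id": sc.TraceID().String(),
		})
		return
	}
	requestDuration.Observe(d.Seconds())
}
```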
Trace-to-Log Linking
When viewing a trace, each span should link to its logs. When viewing a log, the trace_id should link back to the full
trace. This bidirectional linking is the backbone of incident investigation.
Anti-Patterns
- Logging everything at DEBUG in production — Disk/cost explosion, noise buries signal. Fix: default to INFO; enable
DEBUG temporarily per-component
- `catch (err) { log(err); throw err; }` at every layer — Same error logged N times across the call stack. Fix: log
once at the top-level handler
- Metrics with unbounded label cardinality — Time series explosion; monitoring system degrades. Fix: use bounded
labels; move high-cardinality data to logs
- Encoding dimensions in metric names — Cannot aggregate; proliferates metrics. Fix: use labels: `requests_total{method="GET"}`
- Averaging latency for alerting — Hides tail latency; misses degradation for minority of users. Fix: alert on p99
from histograms
- Missing trace context propagation — Broken traces; spans from different services are disconnected. Fix: propagate
context on every cross-process call
- Sampling each service independently — Partial traces — some spans sampled, some dropped. Fix: decide at head,
propagate sampling decision
- Logging PII / secrets — Compliance violations, security risk. Fix: audit log fields; log opaque IDs, never raw PII
- Alert on every metric wiggle — Alert fatigue; team ignores pages. Fix: alert on symptoms (golden signals), not
causes; require actionability
- Treating WARN as a soft ERROR — WARN becomes noise nobody reads. Fix: WARN = system compensated but situation is
unusual; ERROR = broken
- Storing pre-computed rates instead of counters — Cannot re-aggregate over different windows. Fix: store raw
counters; derive rates at query time
- No baseline metrics for new services — Cannot tell if behavior is normal or degraded. Fix: instrument golden
signals from day one, before first deploy
Application
When Writing Code
- Instrument from the start. Add golden signal metrics, structured logging, and trace context propagation before the
first production deploy — not after the first incident.
- Follow the conventions silently. Apply structured logging, metric naming, and tracing patterns without narrating
each rule.
- If the codebase has existing patterns, follow them. Consistency within a codebase beats theoretical correctness.
Flag divergences from this skill's guidance once, then move on.
- Choose the right pillar. Before adding instrumentation, ask: "Is this a metric, a log, or a span?" Use the
decision table above.
- Connect the signals. Every log in a request context must carry `trace_id` and `span_id`. Every error metric should have an exemplar.
When Reviewing Code
- Check that new endpoints/operations have golden signal coverage. Missing metrics on a new endpoint is a review
blocker.
- Verify structured logging. Unstructured `log.Print("something happened")` in production code should be flagged with the fix inline.
- Check log levels. Expected client errors logged as ERROR, or debug noise left on at INFO, are common mistakes.
- Verify trace context propagation. Any new outgoing HTTP/gRPC/queue call must propagate trace context. Missing
propagation breaks traces.
- Check label cardinality. New metric labels must be bounded. Flag unbounded labels (user IDs, free-text)
immediately.
- No sensitive data in logs or span attributes. Passwords, tokens, PII in telemetry is a security and compliance
defect.
Bad review comment:
"According to observability best practices, you should consider
adding structured logging with appropriate fields..."
Good review comment:
"Missing trace_id in log context — requests through this handler
won't correlate to traces. Add ctx.TraceID() to the logger fields."
Integration
This skill provides observability discipline alongside other skills:
- Coding skill — Discovery, planning, verification workflow
- Observability (this skill) — What to log, measure, and trace
- Tool-specific skills (Prometheus, StatsD, OTel) — How to implement with a specific technology
The coding skill governs workflow. This skill governs observability design decisions. Tool-specific skills govern
implementation details for their respective technologies.
Observability is not an afterthought. Instrument from day one. If you cannot observe it, you cannot operate it.