From fabrik
Guides applying observability best practices: wide structured events, OpenTelemetry instrumentation, distributed tracing, SLO-based alerting, and production debugging. Use when adding logging, metrics, traces, or defining SLOs.
npx claudepluginhub maragudk/fabrik --plugin fabrik
Slash command
/fabrik:observability
Observability is the ability to understand any state your system can get into -- including novel, never-before-seen failures -- by asking arbitrary questions of your telemetry, without shipping new code to investigate. The litmus test: can you debug a brand-new problem you never predicted, iteratively, in seconds?
This matters because modern systems (microservices, managed dependencies, ephemeral infrastructure, many network hops per request) fail in genuinely novel ways. The hard question shifted from "why is this code wrong?" to "where in the system is the problem?". Traditional monitoring only catches failures you predicted and set thresholds for ("known-unknowns"); observability is what lets you debug the ones nobody anticipated ("unknown-unknowns").
This skill is about the engineering side: how to instrument code so it's debuggable in production, and how to alert on what actually matters. The cultural and organizational side (rolling it out across a team, build-vs-buy) lives in references/adoption.md.
Use it whenever you are adding logging, metrics, or traces, or defining SLOs.
These are the ideas to apply by default. The reference files give the concrete code patterns and deeper rationale.
The fundamental unit of observability is the arbitrarily wide structured event: one record per unit of work (typically one request), carrying many key-value pairs of context.
The pattern: when a request enters, initialize an empty context map; throughout its life, append anything interesting; when it exits or errors, emit the whole thing as one rich event. Mature instrumentation routinely carries 300+ fields per event. There is no practical limit -- the wider the event, the more questions you can answer later.
Capture data from all three phases: when the request enters, throughout its life, and when it exits or errors.
Think like a debugger that records every variable's value and every (possibly remote) function call's timing -- then ships that snapshot somewhere queryable.
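The enter / append / emit pattern described above can be sketched in a few lines of Go. This is a minimal illustration using only the standard library, not the code from references/instrumentation.md; `Event`, `HandleRequest`, `errDeclined`, and every field name are hypothetical.

```go
package main

import (
	"encoding/json"
	"errors"
	"os"
	"time"
)

// Event is one wide structured event: a flat bag of key-value context
// describing a single unit of work.
type Event map[string]any

// errDeclined is a made-up failure used in the demo below.
var errDeclined = errors.New("payment declined")

// HandleRequest creates the event when the request enters, lets the
// handler append to it while it runs, and finishes it on exit or error.
// All field names here are illustrative, not a required schema.
func HandleRequest(userID string, handler func(ev Event) error) Event {
	start := time.Now()
	ev := Event{"user_id": userID, "endpoint": "/checkout"}

	err := handler(ev)

	ev["status"] = 200
	if err != nil {
		ev["status"] = 500
		ev["error"] = err.Error()
	}
	ev["duration_ms"] = float64(time.Since(start).Microseconds()) / 1000
	return ev
}

func main() {
	ev := HandleRequest("user-42", func(ev Event) error {
		// Anything interesting learned along the way goes on the event.
		ev["shopping_cart_id"] = "cart-9001"
		ev["cart_items"] = 3
		return errDeclined
	})
	// Emit the whole event once, as a single JSON line.
	if err := json.NewEncoder(os.Stdout).Encode(ev); err != nil {
		panic(err)
	}
}
```

Passing the event into the handler is one way to let deep code paths add context; a real service would more likely carry it on a `context.Context` or as span attributes.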
These two properties are what make events useful for finding unknown-unknowns:
High-cardinality fields such as user_id, request_id, trace_id, shopping_cart_id, build_id, and hostname are the most powerful debugging dimensions because a unique ID is how you find one needle in the haystack. Rule of thumb: you can always bucket high cardinality down to low later, but you can never recover cardinality you didn't capture -- so capture it.
This is also why pre-aggregated metrics fall short for debugging: a metric is one number over a time window with a few low-cardinality tags. It discards per-request context, can't slice by user ID, and forces you to decide what to measure before the bug happens. Use events as your primary signal; reach for standalone metrics only when you genuinely need exact, unsampled, process-wide counts.
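To make the bucketing rule of thumb concrete, here is a small sketch that stores raw per-request values and derives a coarse grouping only at query time. The `Event` shape, the `LatencyBucket` thresholds, and `CountBy` are all assumptions made for illustration.

```go
package main

import "fmt"

// Event keeps the raw, high-cardinality values for each request.
type Event struct {
	UserID     string
	DurationMS float64
}

// LatencyBucket collapses a raw duration into a low-cardinality label.
// Because it runs at query time, the thresholds can be changed later
// without losing any data.
func LatencyBucket(e Event) string {
	switch {
	case e.DurationMS < 100:
		return "fast"
	case e.DurationMS < 1000:
		return "ok"
	default:
		return "slow"
	}
}

// CountBy groups events by any key derived from the event.
func CountBy(events []Event, key func(Event) string) map[string]int {
	counts := make(map[string]int)
	for _, e := range events {
		counts[key(e)]++
	}
	return counts
}

func main() {
	events := []Event{
		{UserID: "user-1", DurationMS: 40},
		{UserID: "user-2", DurationMS: 55},
		{UserID: "user-3", DurationMS: 2300},
		{UserID: "user-3", DurationMS: 1900},
		{UserID: "user-1", DurationMS: 450},
	}
	// Low-cardinality view, similar to what a metric would have stored.
	fmt.Println(CountBy(events, LatencyBucket))
	// High-cardinality view, only possible because user IDs were captured.
	fmt.Println(CountBy(events, func(e Event) string { return e.UserID }))
}
```

A pre-aggregated counter could have produced the first view but never the second, which is the asymmetry the rule of thumb describes.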
Use OpenTelemetry (OTel) as the default instrumentation layer. Instrument once against a vendor-neutral API, then send telemetry to any backend. Proprietary agents create lock-in, and re-instrumenting is the most labor-intensive part of switching tools, so portability is worth a lot.
See references/instrumentation.md for OTel concepts, span structure, context propagation, and Go code patterns.
A trace follows one request across process and network boundaries. Each unit of work is a span; spans nest into parent-child relationships and render as a waterfall. Distributed tracing is what makes cascading problems (a slow downstream DB showing up as latency across many upstream services) diagnosable.
The mechanics -- the five required span fields, context propagation via headers, and what custom fields to add -- are in references/instrumentation.md.
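As a rough illustration of header-based context propagation, the sketch below formats and parses a W3C `traceparent` value (`version-traceid-spanid-flags`) by hand. It is deliberately simplified and is not the pattern from references/instrumentation.md: in practice an OpenTelemetry propagator handles this for you. `SpanContext`, `FormatTraceparent`, and `ParseTraceparent` are hypothetical names, and the parser checks only field count and lengths.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// SpanContext holds the identifiers that tie a span to its trace.
type SpanContext struct {
	TraceID string // 32 hex chars, shared by every span in the trace
	SpanID  string // 16 hex chars, unique to this span
}

// FormatTraceparent renders the context as a traceparent header value.
// The trailing "01" flag marks the trace as sampled.
func FormatTraceparent(sc SpanContext) string {
	return fmt.Sprintf("00-%s-%s-01", sc.TraceID, sc.SpanID)
}

// ParseTraceparent recovers the caller's context from a header value.
// The returned SpanID becomes the parent ID of whatever span the
// receiving service starts next.
func ParseTraceparent(value string) (SpanContext, bool) {
	parts := strings.Split(value, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return SpanContext{}, false
	}
	return SpanContext{TraceID: parts[1], SpanID: parts[2]}, true
}

func main() {
	// Caller side: attach the current span's context to the outgoing request.
	outgoing := http.Header{}
	current := SpanContext{
		TraceID: "4bf92f3577b34da6a3ce929d0e0e4736",
		SpanID:  "00f067aa0ba902b7",
	}
	outgoing.Set("traceparent", FormatTraceparent(current))

	// Callee side: read it back and continue the same trace.
	parent, ok := ParseTraceparent(outgoing.Get("traceparent"))
	fmt.Println(parent.TraceID, parent.SpanID, ok)
}
```

Because the trace ID travels unchanged while each hop contributes its own span ID, a backend can reassemble the parent-child waterfall from spans emitted by different processes.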
Threshold alerts ("CPU > 80%") fire on potential causes and produce so many false positives that teams learn to ignore them -- this is normalization of deviance, and it's how real incidents get missed.
Instead, alert on symptoms of degraded user experience using SLOs backed by event-based SLIs and an error budget. An alert earns its place only if it's both a reliable indicator of user pain and actionable. Auto-remediated events (autoscaling, failover) should not page.
SLO alerts deliberately decouple what (users are hurting) from why (the cause) -- they tell you to investigate, and observability is what lets you find the cause of even an unknown failure. Rule of thumb: collect data for everything, but alert only on user-impacting symptoms.
See references/slos.md for defining SLIs, error budgets, and predictive burn alerts.
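For a feel of the arithmetic behind an event-based SLI and its error budget, here is a hedged sketch. The SLI definition (no server error, under 1000 ms), the `Event` fields, and the straight-line projection in `WillExhaust` are assumptions chosen for illustration; references/slos.md remains the source for the actual definitions.

```go
package main

import "fmt"

// Event is the part of a wide event this example SLI needs.
type Event struct {
	Status     int
	DurationMS float64
}

// Good is an example SLI: served without a server error in under 1000 ms.
func Good(e Event) bool { return e.Status < 500 && e.DurationMS < 1000 }

// BudgetRemaining returns the fraction of the error budget left for an
// SLO target such as 0.999. A negative result means the budget is overspent.
func BudgetRemaining(events []Event, target float64) float64 {
	if len(events) == 0 || target >= 1 {
		return 1 // degenerate input; a real implementation would handle this explicitly
	}
	bad := 0
	for _, e := range events {
		if !Good(e) {
			bad++
		}
	}
	allowed := (1 - target) * float64(len(events))
	return 1 - float64(bad)/allowed
}

// WillExhaust projects the budget burned during a baseline window across a
// lookahead window, assuming the burn continues at the same linear rate.
func WillExhaust(remaining, burnedInBaseline, baselineHours, lookaheadHours float64) bool {
	if baselineHours <= 0 {
		return false
	}
	projected := burnedInBaseline * lookaheadHours / baselineHours
	return projected >= remaining
}

func main() {
	events := make([]Event, 0, 1000)
	for i := 0; i < 1000; i++ {
		e := Event{Status: 200, DurationMS: 50}
		if i < 5 {
			e.Status = 503 // five failing requests out of a thousand
		}
		events = append(events, e)
	}
	left := BudgetRemaining(events, 0.99)
	fmt.Printf("budget remaining: %.2f\n", left)
	fmt.Println("page someone:", WillExhaust(left, 0.1, 1, 6))
}
```

The alert condition depends only on how quickly users are being hurt, not on any guess about the cause, which matches the symptom-first approach above.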
When investigating, resist pattern-matching against past incidents (that knowledge doesn't transfer and locks debugging to the most senior person). Instead use the core analysis loop: start from the symptom, verify it's real, find which dimensions distinguish the affected events from the baseline, filter to isolate, and repeat. With wide events this is a methodical, teachable process that works even on failures you've never seen -- the best debugger becomes the most curious engineer, not the longest-tenured.
See references/adoption.md for the core analysis loop in detail.
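One step of that loop, finding which dimensions distinguish the affected events from the baseline, can be approximated by comparing how often each field value appears in the two sets. The sketch below is one assumed way to do it, not the method from references/adoption.md; `Distinguish`, `Diff`, and the sample fields are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Event is a wide event reduced to string fields for this sketch.
type Event map[string]string

// Diff records how common one field=value pair is in each set of events.
type Diff struct {
	Field    string
	Value    string
	Affected float64 // share of affected events carrying this pair
	Baseline float64 // share of baseline events carrying this pair
}

// shares returns, for every field=value pair, the fraction of events carrying it.
func shares(events []Event) map[[2]string]float64 {
	out := make(map[[2]string]float64)
	for _, e := range events {
		for field, value := range e {
			out[[2]string{field, value}]++
		}
	}
	for kv := range out {
		out[kv] /= float64(len(events))
	}
	return out
}

// Distinguish ranks the pairs seen in the affected events by how much more
// common they are there than in the baseline.
func Distinguish(affected, baseline []Event) []Diff {
	a, b := shares(affected), shares(baseline)
	diffs := make([]Diff, 0, len(a))
	for kv, share := range a {
		diffs = append(diffs, Diff{Field: kv[0], Value: kv[1], Affected: share, Baseline: b[kv]})
	}
	sort.Slice(diffs, func(i, j int) bool {
		di := diffs[i].Affected - diffs[i].Baseline
		dj := diffs[j].Affected - diffs[j].Baseline
		if di != dj {
			return di > dj
		}
		// Tie-break on the pair itself so the order is deterministic.
		if diffs[i].Field != diffs[j].Field {
			return diffs[i].Field < diffs[j].Field
		}
		return diffs[i].Value < diffs[j].Value
	})
	return diffs
}

func main() {
	affected := []Event{
		{"build_id": "b2", "region": "eu"},
		{"build_id": "b2", "region": "us"},
		{"build_id": "b2", "region": "eu"},
	}
	baseline := []Event{
		{"build_id": "b1", "region": "eu"},
		{"build_id": "b1", "region": "us"},
		{"build_id": "b2", "region": "eu"},
		{"build_id": "b1", "region": "us"},
	}
	for _, d := range Distinguish(affected, baseline) {
		fmt.Printf("%s=%s affected=%.2f baseline=%.2f\n", d.Field, d.Value, d.Affected, d.Baseline)
	}
}
```

The top-ranked pair is the next filter to apply; repeating the comparison on the filtered events is the "filter to isolate, and repeat" part of the loop.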
At high volume, keeping every event costs more than it's worth, and most events are near-identical successes. Sampling cuts cost while -- unlike aggregation -- preserving full cardinality on the events you keep.
The two things that bite people:
See references/instrumentation.md for sampling strategies and code.
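To show how sampling can keep full per-event detail while still allowing volume estimates, here is a minimal sketch. It makes two assumptions that are not confirmed by the text above: the keep decision is derived from a hash of the trace ID so every span in a trace gets the same verdict, and each kept event records the rate it was sampled at. `Keep`, `Sampled`, and `EstimateTotal` are hypothetical names.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Keep decides deterministically from the trace ID, so every service that
// sees the same trace reaches the same verdict. A rate of N means "keep
// roughly 1 in N"; a rate of 0 or 1 keeps everything.
func Keep(traceID string, rate uint32) bool {
	if rate <= 1 {
		return true
	}
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return h.Sum32()%rate == 0
}

// Sampled is a kept event together with the rate it was sampled at.
type Sampled struct {
	TraceID    string
	SampleRate uint32 // how many original events this one stands for
}

// EstimateTotal reconstructs the original volume by weighting each kept
// event by its sample rate.
func EstimateTotal(kept []Sampled) uint64 {
	var total uint64
	for _, e := range kept {
		total += uint64(e.SampleRate)
	}
	return total
}

func main() {
	var kept []Sampled
	for i := 0; i < 10000; i++ {
		traceID := fmt.Sprintf("trace-%d", i)
		rate := uint32(100) // near-identical successes: keep roughly 1 in 100
		if i%500 == 0 {
			rate = 1 // pretend these are errors: keep every one
		}
		if Keep(traceID, rate) {
			kept = append(kept, Sampled{TraceID: traceID, SampleRate: rate})
		}
	}
	fmt.Printf("kept %d events, estimated original volume %d\n", len(kept), EstimateTotal(kept))
}
```

The estimate is approximate by design; the point is that every kept event still carries all of its fields, unlike an aggregate.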
trace_id.
references/instrumentation.md -- OpenTelemetry concepts, structured event construction, span fields and tracing mechanics, context propagation, and sampling, with Go code patterns.
references/slos.md -- SLOs, SLIs (prefer event-based), error budgets, and predictive burn alerts (lookahead/baseline windows).
references/adoption.md -- the core analysis loop, the monitoring-vs-observability boundary (system vs software), build-vs-buy, and rolling observability out across a team.