From the 97 plugin
Provides rigid checks for structured logs, request IDs, tracing, and metrics in production request handlers, RPCs, and background jobs, to ensure diagnosability.
```
npx claudepluginhub oribarilan/97 --plugin 97
```

This skill uses the workspace's default tool permissions.
Code that runs fine in dev and goes inert in production is the dominant operational failure mode for modern services. **When you add code that will run for users, you also add the diagnosability of that code: structured logs, trace context across process boundaries, metrics with bounded cardinality, signals an operator can read without your help.**
This is a rigid skill. Jump to the sub-section that matches what you're writing and run that sub-section's checks.
These checks matter most when adding a request handler, RPC, or background job that will run in production with users depending on diagnosability. In MVPs, prototypes, internal dev tools, and one-off scripts, structured-logging, tracing, and SLO discipline are premature — prefer the simplest thing that works.
Invoke when you're about to:
- Add `log.info` / `log.warn` / `log.error` calls in code that will run under load

If the change adds an observability call to production code even slightly, invoke anyway — the simple mistakes are cheap to catch in review; the cardinality and trace-context bugs are not.
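As a concrete shape for those calls, here is a minimal structured-logging sketch using only the standard library (structlog or any JSON formatter gives you the same shape with less code; the field names are the ones this skill's checks look for):

```python
import json
import logging
import time

# Minimal structured-log emitter: one JSON object per line, with a stable
# event name and explicit context fields instead of a free-form message.
logger = logging.getLogger("app")
logging.basicConfig(format="%(message)s", level=logging.INFO)

def log_event(event: str, **fields) -> str:
    """Emit (and return) one structured log line."""
    line = json.dumps({
        "timestamp": time.time(),
        "level": "info",
        "event": event,   # short, stable, queryable name
        **fields,         # request_id, route, duration_ms, ...
    })
    logger.info(line)
    return line

# Queryable by field, not by substring:
log_event("user_login", request_id="req-1", user_id=42, provider="google")
```

An aggregator can now filter on `event == "user_login" and user_id == 42` instead of grepping a free-form string.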
- **Structured logs.** Every log line carries timestamp, level, event (a short stable name like `user_login_failed`), plus the relevant context fields (`request_id`, `user_id` when not sensitive, route, `duration_ms`, status). Example: `logger.info(f"user {user.id} logged in via {provider} at {ts}")` is unsearchable; `logger.info("user_login", user_id=user.id, provider=provider)` is queryable. (OTel/StructuredLogs.)
- **Scope.** What gets redacted is decided elsewhere (security-and-trust-boundaries); so is whether log files belong on disk or stdout (build-deploy-and-tooling 12F/XI). This skill decides what fields go on the line and how they are shaped.
- **Trace context.** Propagate trace context across every cross-process call, including calls that bypass the tracer SDK (raw `requests.get`, manual queue producer). Example: a handler that reads from one service and writes to another with no propagation — the trace breaks at the boundary and the operator cannot see the cross-service path. (OTel/TraceContext.)
- **Cardinality.** `failed_logins_total{user_id="...", reason="..."}` produces a new time series per user — millions of series for a system with millions of users, and the metrics backend falls over. Per-user, per-request, per-trace-id data belongs in logs and traces, not metric labels. Metric labels are for low-cardinality, bounded sets: HTTP method, route template, status class, region, downstream name. (OE/CardinalityDiscipline.)
- **Golden signals.** Production service code wires at least latency, traffic, errors, and saturation. (SRE/GoldenSignals.)

These thoughts mean STOP — apply the domain check before committing:
| Thought | Reality |
|---|---|
| "I'll log a single human-readable string — it's easier to grep." | Free-form strings are unsearchable in production aggregators. Log structured key-value with stable event names; the operator queries by field, not by substring. (OTel/StructuredLogs) |
| "I'll add the user id as a metric label so we can see per-user failures." | Per-user labels create a time series per user. Use a metric for the count; put the user id in logs and traces where high cardinality is fine. (OE/CardinalityDiscipline) |
| "I'll add the full URL path as a label." | Same problem — /users/12345 and /users/12346 are different series. Use the route template (/users/:id), not the realized path. (OE/CardinalityDiscipline) |
| "I'll instrument every helper function with a span." | Spans cover meaningful units of work; one per private helper buries the trace in noise. Span per request / transaction / job, not per function. (OTel/TraceContext) |
| "The downstream call uses raw requests.get — no need to thread the trace headers." | The trace breaks at the boundary; the operator cannot see the cross-service path. Propagate W3C Trace Context, even when bypassing the tracer SDK. (OTel/TraceContext) |
| "We don't measure latency on this background job — it'll be fine." | Without latency / traffic / errors / saturation visibility, the only way to know it broke is a user complaint. Wire at least the four signals for production service code. (SRE/GoldenSignals) |
| "The request id is in the trace — we don't need it in the log." | Logs without the request id force the operator to traverse the trace just to correlate one error line. Put the request id on every log line for the request. (OTel/StructuredLogs) |
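Propagation past a raw `requests.get` can be sketched directly — attach the `traceparent` header by hand. The helper names here are illustrative; the header format itself is fixed by the W3C Trace Context spec:

```python
def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    assert len(trace_id) == 32 and len(span_id) == 16  # lowercase hex per spec
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def with_trace_context(headers: dict, trace_id: str, span_id: str) -> dict:
    """Return a copy of headers with trace context attached."""
    return {**headers, "traceparent": make_traceparent(trace_id, span_id)}

# Usage with a raw requests.get (downstream URL is hypothetical; trace_id and
# span_id come from whatever tracer SDK owns the current span):
# requests.get("https://downstream.internal/items",
#              headers=with_trace_context({}, trace_id, span_id))
```

The downstream service's tracer reads `traceparent` and stitches the two halves of the trace back together.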
For every observability surface your change touches, all of the following are true:
- [ ] Every log line is structured, carries a stable event name, and includes the request id.
- [ ] Sensitive fields are redacted per the project's security rules (security-and-trust-boundaries).
- [ ] Trace context is propagated across every cross-process call the change touches.
- [ ] No metric label carries unbounded-cardinality data (user ids, raw paths, trace ids).
- [ ] Latency, traffic, errors, and saturation are visible for the service code the change adds.

If any box that applies to your change is unchecked, you are not done. Either finish, or revert and re-plan.
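A minimal sketch of the cardinality rule, with an in-memory counter standing in for a real metrics client — the `route_template` regex is a hypothetical stand-in for your router's own template lookup:

```python
import re
from collections import Counter

# Bounded-cardinality metric: keyed by route TEMPLATE + status class,
# never by the realized path or a per-user value.
request_count = Counter()

def route_template(path: str) -> str:
    # Collapse numeric path segments into :id so /users/12345 and
    # /users/12346 map onto the same time series.
    return re.sub(r"/\d+", "/:id", path)

def record_request(path: str, status: int) -> None:
    request_count[(route_template(path), f"{status // 100}xx")] += 1

record_request("/users/12345", 200)
record_request("/users/12346", 200)
# Both hits land on the same ("/users/:id", "2xx") series.
```

The per-user detail (which user, which request id) still exists — in the log line and the trace, where high cardinality is cheap.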
| ID | Principle | Source |
|---|---|---|
| OTel/StructuredLogs | Structured key/value logs with stable event names | OpenTelemetry semantic conventions; SRE book |
| OTel/TraceContext | W3C Trace Context propagated across every cross-process call | OpenTelemetry semantic conventions; Observability Engineering |
| SRE/GoldenSignals | The four signals for service code: latency, traffic, errors, saturation | Site Reliability Engineering, ch. 6 |
| OE/CardinalityDiscipline | High-cardinality data belongs in logs and traces, not metric labels | Observability Engineering (Majors et al.) |
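As a concrete reading of SRE/GoldenSignals, here is a minimal in-process sketch of the four signals around one handler — the names and the plain-dict store are illustrative; a real service would export these through a metrics client:

```python
import time

# Hypothetical in-memory signal store; a real service would export these
# via Prometheus, StatsD, or similar.
signals = {
    "latency_ms": [],   # latency: per-request durations
    "traffic": 0,       # traffic: request count
    "errors": 0,        # errors: failed requests
    "in_flight": 0,     # saturation: concurrent work in progress
}

def handle(request_fn):
    signals["traffic"] += 1
    signals["in_flight"] += 1
    start = time.monotonic()
    try:
        return request_fn()
    except Exception:
        signals["errors"] += 1
        raise
    finally:
        signals["latency_ms"].append((time.monotonic() - start) * 1000)
        signals["in_flight"] -= 1

def failing():
    raise RuntimeError("boom")

handle(lambda: "ok")
try:
    handle(failing)
except RuntimeError:
    pass
# signals now shows traffic=2, errors=1, two latency samples, in_flight=0.
```

With these four in place, "it broke" is visible on a dashboard before it arrives as a user complaint.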
See principles.md for the long-form distillations and source citations.