From skills-by-amrit
Use when auditing logging, monitoring, alerting, tracing, or metrics. Covers structured logging, error tracking, health checks, dashboards, distributed tracing, and incident detection capabilities.
npx claudepluginhub boparaiamrit/skills-by-amrit

This skill uses the workspace's default tool permissions.
If you can't observe it, you can't debug it. If you can't debug it, you can't fix it. Observability is the difference between "something is wrong" and "here's exactly what's wrong, where, since when, and for whom."
Core principle: Every production system must answer four questions at any point in time: What happened? When? Why? To whom?
NO PRODUCTION SERVICE WITHOUT STRUCTURED LOGGING, HEALTH CHECKS, AND ERROR TRACKING. NO DEPLOYMENT WITHOUT MONITORING. NO ALERT WITHOUT A RUNBOOK.
… code-review instead)
… performance-audit)

YOU CANNOT:
- Say "logging is fine" — read actual log output and verify it's structured, contextual, and actionable
- Say "we have monitoring" — show the dashboard, verify it shows real-time data, check alerting rules
- Say "errors are tracked" — verify error grouping, notification routing, and triage workflow
- Skip checking what happens when a dependency goes down — simulate failure mentally or actually
- Assume health checks work — call the health endpoint and verify the response checks REAL dependencies
- Trust that alerts work — verify the notification pipeline end-to-end (alert → channel → person)
- Say "we'll add observability later" — it's needed BEFORE production, not after the first incident
| Rationalization | Reality |
|---|---|
| "We'll add logging when something breaks" | You won't know it's broken without logging. Circular dependency. |
| "Console.log is fine for now" | Unstructured text logs are unsearchable, unfilterable, and fill up disk fast. |
| "We don't need distributed tracing yet" | You will need it the moment your first cross-service bug appears. Retrofitting is painful. |
| "Nobody looks at the dashboard" | Then your dashboard shows the wrong things. Fix the dashboard, don't eliminate monitoring. |
| "Alerts are too noisy so we ignore them" | Noisy alerts are worse than no alerts. Tune them — don't disable. |
| "Our error rate is low enough" | Low ≠ zero. One uncaught error in a payment flow is one too many. |
| "We know the system well enough" | You won't in 6 months, and neither will the person who replaces you. |
1. If this service crashed right now, how would we know? How quickly?
2. If response times doubled, which alert fires? In how many minutes?
3. Can I trace a single user request from entry to database and back?
4. If I search for a specific user's actions today, what do I find?
5. When was the last error? What was it? Is it resolved?
6. What's the difference between "service is down" and "service is degraded"? Can your health checks tell?
7. If a downstream dependency becomes slow (not down, just slow), does the system degrade gracefully?
8. Can an on-call engineer understand a 3AM alert without looking at code?
9. Are there any errors happening right now that nobody knows about?
10. What was the MTTR (mean time to resolution) for the last 3 incidents? Could better observability have reduced it?
Structured logging requirements:
```json
{
  "timestamp": "2024-01-15T10:30:00.123Z",
  "level": "error",
  "message": "Payment processing failed",
  "service": "payment-service",
  "trace_id": "abc-123-def-456",
  "span_id": "ghi-789",
  "user_id": "user_456",
  "error": {
    "type": "PaymentGatewayError",
    "message": "Connection timeout after 30s",
    "code": "GATEWAY_TIMEOUT",
    "stack": "..."
  },
  "context": {
    "order_id": "order_789",
    "amount": 99.99,
    "currency": "USD",
    "gateway": "stripe",
    "attempt": 2
  },
  "duration_ms": 30042
}
```
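As a rough illustration of how a service might emit entries in this shape, here is a minimal sketch built on Python's standard `logging` module. The `JsonFormatter` class, the `CONTEXT_FIELDS` tuple, and the level renaming are assumptions made for this example, not part of the skill itself.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (illustrative sketch)."""

    # Assumed set of optional fields, passed by callers through `extra=`.
    CONTEXT_FIELDS = ("trace_id", "span_id", "user_id", "context", "duration_ms")
    # Map stdlib level names onto the names used in this document.
    LEVEL_NAMES = {"WARNING": "warn", "CRITICAL": "fatal"}

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": self.LEVEL_NAMES.get(record.levelname, record.levelname.lower()),
            "message": record.getMessage(),
            "service": self.service,
        }
        # Copy correlation and context fields only when the caller supplied them.
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        # Attach exception type, message, and stack trace when one is present.
        if record.exc_info and record.exc_info[0] is not None:
            exc_type, exc, _ = record.exc_info
            entry["error"] = {
                "type": exc_type.__name__,
                "message": str(exc),
                "stack": self.formatException(record.exc_info),
            }
        return json.dumps(entry, default=str)


def build_logger(service: str) -> logging.Logger:
    """Return a logger that writes one JSON line per event to stderr."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger("payment-service").error("Payment processing failed", exc_info=True, extra={"trace_id": "abc-123-def-456"})` inside an `except` block would then produce one JSON line carrying the trace ID and the stack trace.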
Logging checklist:
| Check | Detection | Status |
|---|---|---|
| Structured format (JSON, not text) | `grep -rnE "console\.log\|print\(" --include="*.ts" --include="*.py" . \| grep -v node_modules` | |
| Consistent log levels (debug, info, warn, error) | `grep -rnE "logger\.\|log\." --include="*.ts" --include="*.py" . \| head -30` | |
| Request correlation IDs (trace_id) | `grep -rnE "trace_id\|correlation_id\|request_id" --include="*.ts" --include="*.py" .` | |
| No PII in logs (passwords, tokens, SSN, emails) | `grep -rnE "password\|secret\|token\|ssn" --include="*.ts" --include="*.py" . \| grep -i log` | |
| Error logs include stack traces | Verify error handler attaches stack | |
| Sufficient context for debugging | Read 5 random log entries: can you debug from them alone? | |
| Log rotation / retention configured | Check logging config for max size / days | |
| Log aggregation (centralized collection) | Check for logging service (CloudWatch, Datadog, ELK, Loki) | |
| Sensitive data redacted | Check for redaction middleware in logging pipeline | |
| Request/response logging for APIs | Check middleware for HTTP request logging | |
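The "sensitive data redacted" check is easier to audit when you know what a redaction step looks like. The sketch below is one possible shape for it. The `SENSITIVE_KEYS` deny-list, the email pattern, and the `redact` helper are hypothetical names chosen for illustration, and a real pipeline would likely need a broader list.

```python
import re
from typing import Any

# Assumed deny-list of key names; extend it to match your own data model.
SENSITIVE_KEYS = {"password", "secret", "token", "ssn", "authorization"}
# Deliberately simple email pattern, good enough for a sketch.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
REDACTED = "[REDACTED]"


def redact(value: Any) -> Any:
    """Return a copy of `value` with sensitive keys and email addresses masked."""
    if isinstance(value, dict):
        return {
            key: REDACTED if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, (list, tuple)):
        return [redact(item) for item in value]
    if isinstance(value, str):
        return EMAIL_RE.sub(REDACTED, value)
    return value
```

Calling `redact(event)` just before serialization leaves the original dictionary untouched, so business logic can still read the real values while the log line cannot.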
Log levels guide (enforce consistency):
| Level | When | Example | Does NOT Include |
|---|---|---|---|
| `debug` | Detailed diagnostic info for developers | "Query took 45ms, returned 12 rows" | Use in production (should be disabled there) |
| `info` | Normal operations worth recording | "User logged in", "Order created", "Deploy started" | Routine internal events (loop iterations) |
| `warn` | Unexpected but handled situations | "Rate limit approaching 80%", "Retry attempt 2/3", "Deprecated API called" | Things that are actually errors |
| `error` | Failures requiring attention | "Payment failed", "Database connection lost", "External API returned 500" | Things the code recovers from gracefully |
| `fatal` | Application cannot continue | "Config file missing", "Database unreachable on startup", "Port already in use" | Recoverable errors |
Common logging anti-patterns:
| Anti-Pattern | Example | Fix |
|---|---|---|
| Log and swallow | `catch (e) { log(e); }`, then nothing | Log AND handle appropriately |
| Generic messages | `logger.error("Error occurred")` | Include WHAT error, WHERE, in WHICH context |
| Missing context | `logger.error(error.message)` | Add user_id, request_id, input data |
| Sensitive data | `logger.info("User logged in", { password })` | Redact sensitive fields |
| Wrong level | `logger.error("User not found")` (a 404 is not an error) | Use warn for expected-but-unusual situations |
| Excessive logging | Logging every loop iteration | Log aggregates or milestones |
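To make the fixes for "log and swallow", "generic messages", and "missing context" concrete, here is a small sketch of a handler that logs with context and then re-raises. `PaymentError`, `charge`, and the `gateway` object are hypothetical stand-ins, not names from any real payment library.

```python
import logging

logger = logging.getLogger("payment-service")


class PaymentError(Exception):
    """Hypothetical domain error surfaced to callers."""


def charge(gateway, order_id: str, amount: float, request_id: str) -> str:
    """Charge an order; on timeout, log WHAT failed with context, then re-raise."""
    try:
        return gateway.charge(order_id, amount)
    except TimeoutError as exc:
        # Specific message, stack trace, and identifiers rather than "Error occurred".
        logger.error(
            "Payment gateway timed out",
            exc_info=True,
            extra={
                "request_id": request_id,
                "context": {"order_id": order_id, "amount": amount},
            },
        )
        # Do not swallow: the caller decides whether to retry or fail the request.
        raise PaymentError(f"charge failed for order {order_id}") from exc
```

Chaining with `from exc` keeps the original timeout attached as `__cause__`, so an error tracker can still show the root failure.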
1. DOES the service have a health endpoint? (/health, /healthz, /ready)
2. DOES it check actual dependencies? (DB connection, cache, external APIs)
3. IS it used by load balancers / orchestrators for routing decisions?
4. DOES it distinguish liveness vs readiness?
5. DOES it include version information?
6. IS the health check itself fast (< 500ms)?
7. DOES it avoid false positives? (e.g., reporting healthy when the service is actually degraded)
Liveness vs Readiness:
| Check | Purpose | What it Tests | Failure Action |
|---|---|---|---|
| Liveness (`/healthz`) | "Is the process running?" | Process alive, not deadlocked | Restart container |
| Readiness (`/ready`) | "Can it serve traffic?" | Dependencies reachable, migrations done | Remove from load balancer |
| Startup (`/startup`) | "Has it finished initializing?" | Warm-up complete, caches loaded | Wait longer before killing |
Health check response (gold standard):
```json
{
  "status": "healthy",
  "checks": {
    "database": { "status": "healthy", "latency_ms": 5, "connection_pool": "8/20" },
    "cache": { "status": "healthy", "latency_ms": 1, "hit_rate": "94%" },
    "external_api": { "status": "degraded", "latency_ms": 2500, "note": "Slow but responding" },
    "disk": { "status": "healthy", "free_gb": 45.2, "usage_percent": 62 }
  },
  "version": "1.2.3",
  "commit": "abc1234",
  "uptime_seconds": 86400,
  "timestamp": "2024-01-15T10:30:00Z"
}
```
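One way to produce a response like this is to run a probe per dependency and roll the results up into an overall status. The sketch below assumes each probe is a zero-argument callable that raises on failure. `run_check`, `readiness`, `http_status`, and the 1000 ms slowness threshold are illustrative choices, not a prescribed API.

```python
import time
from datetime import datetime, timezone
from typing import Callable, Dict

# Ordering used to pick the worst status across all checks.
SEVERITY = {"healthy": 0, "degraded": 1, "unhealthy": 2}


def run_check(probe: Callable[[], None], slow_after_ms: float = 1000.0) -> dict:
    """Run one dependency probe and classify it by outcome and latency."""
    started = time.perf_counter()
    try:
        probe()
    except Exception as exc:  # any failure marks the dependency unhealthy
        return {"status": "unhealthy", "error": str(exc)}
    latency_ms = (time.perf_counter() - started) * 1000
    status = "degraded" if latency_ms > slow_after_ms else "healthy"
    return {"status": status, "latency_ms": round(latency_ms, 1)}


def readiness(probes: Dict[str, Callable[[], None]]) -> dict:
    """Readiness report: the overall status is the worst individual status."""
    checks = {name: run_check(probe) for name, probe in probes.items()}
    worst = max(
        (check["status"] for check in checks.values()),
        key=SEVERITY.__getitem__,
        default="healthy",
    )
    return {
        "status": worst,
        "checks": checks,
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }


def http_status(report: dict) -> int:
    """Serve 503 only when unhealthy, so a degraded instance keeps receiving traffic."""
    return 503 if report["status"] == "unhealthy" else 200
```

The worst-of roll-up is only one policy. The example response above reports an overall `healthy` despite a degraded external API, which suggests treating some dependencies as non-critical; a variant could tag each probe as critical or optional and ignore optional ones in the roll-up. A liveness endpoint, by contrast, should skip dependency probes entirely and only confirm the process can respond.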
1. ARE errors captured and aggregated? (Sentry, Bugsnag, Datadog, Rollbar)
2. ARE errors grouped intelligently? (not 1000 identical alerts)
3. DO errors include user context, request context, and breadcrumbs?
4. ARE error notifications routed correctly? (critical → PagerDuty, low → Slack)
5. IS there error rate monitoring with thresholds?
6. ARE errors triaged and assigned? (not just collected)
7. IS there a distinction between expected errors (404, validation) and unexpected errors (500, null reference)?
8. ARE source maps uploaded for frontend error tracking?
Error tracking maturity model:
| Level | Description | Assessment |
|---|---|---|
| 0 — None | Errors disappear into void | 🔴 Critical |
| 1 — Basic | Errors logged to file, reviewed manually | 🟠 High risk |
| 2 — Collected | Errors sent to tracking service (Sentry) | 🟡 Acceptable |
| 3 — Managed | Errors grouped, prioritized, assigned | 🟢 Good |
| 4 — Proactive | Error budgets, anomaly detection, auto-remediation | 🟢 Excellent |
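Moving from level 2 to level 3 depends on grouping: many occurrences of the same failure should collapse into one issue. Hosted trackers use their own grouping rules; the sketch below only illustrates the general idea with a hypothetical `fingerprint` helper that keys on exception type and the function where it was raised.

```python
import hashlib
import traceback


def fingerprint(exc: BaseException) -> str:
    """Derive a stable grouping key from exception type and raising function.

    The message is deliberately excluded, because it often contains
    per-request values (user IDs, amounts) that would split one bug
    into many groups.
    """
    frames = traceback.extract_tb(exc.__traceback__)
    if frames:
        innermost = frames[-1]  # frame where the exception was raised
        location = f"{innermost.filename}:{innermost.name}"
    else:
        location = "unknown"  # exception object that was never raised
    key = f"{type(exc).__name__}|{location}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```

Using the function name rather than the line number is a design choice in this sketch: it keeps the group stable when unrelated edits shift lines in the file.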
1. ARE key business metrics tracked? (signups, orders, revenue, churn)
2. ARE key technical metrics tracked? (response time, error rate, throughput, saturation)
3. IS there a dashboard? (Grafana, Datadog, CloudWatch)
4. ARE alerts configured for anomalies?
5. ARE metrics labeled with dimensions? (by endpoint, by user tier, by region)
6. ARE custom metrics defined for domain-specific concerns?
7. CAN you compare current metrics to historical baselines?
Essential metrics frameworks:
RED Method (for request-driven services):
| Metric | What It Measures | Alert Threshold Example |
|---|---|---|
| Rate | Requests per second | < 50% of baseline for 5 min |
| Errors | Error rate (% or count) | > 1% error rate for 2 min |
| Duration | Response time (p50, p95, p99) | p99 > 2s for 5 min |
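To show what these three numbers mean in practice, here is a small sketch that derives them from a window of request samples. In a real system a metrics library or backend would compute them; the `Request` record, the nearest-rank `percentile` helper, and the rule that status 500 and above counts as an error are assumptions for illustration.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    duration_ms: float
    status: int


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of empty sample set")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def red_summary(requests: Sequence[Request], window_seconds: float) -> dict:
    """Rate, Errors, Duration for one observation window."""
    if not requests:
        return {"rate_per_s": 0.0, "error_rate": 0.0, "p50_ms": None, "p95_ms": None, "p99_ms": None}
    durations = [r.duration_ms for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)  # assumed error definition
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }
```

Reporting p95 and p99 alongside p50 matters because averages and medians hide the slow tail that users actually complain about.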
USE Method (for infrastructure/resources):
| Metric | What It Measures | Alert Threshold Example |
|---|---|---|
| Utilization | CPU, memory, disk, connections | > 80% for 10 min |
| Saturation | Queue depth, thread pool | > 90% capacity |
| Errors | Hardware errors, connection errors | Any non-zero |
The Four Golden Signals (Google SRE):
| Signal | Measures |
|---|---|
| Latency | How long requests take (distinguish success vs error latency) |
| Traffic | How much demand is placed on the system |
| Errors | Rate of failed requests |
| Saturation | How "full" the system is (most constrained resource) |
1. ARE requests traceable across services? (OpenTelemetry, Jaeger, Zipkin)
2. ARE trace IDs propagated through the ENTIRE call chain? (HTTP headers, message queues, async jobs)
3. CAN you reconstruct a full request path from a single trace ID?
4. ARE slow operations visible as spans with duration?
5. ARE database queries captured as spans?
6. ARE external API calls captured as spans?
7. IS sampling configured appropriately? (100% for errors, 1-10% for normal traffic)
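The sampling rule in item 7 can be sketched as a keep-or-drop decision made once a request's outcome is known. The `should_keep` function and the 5% default are hypothetical, and the sketch assumes hexadecimal trace IDs (as in W3C Trace Context) so that the ID itself can serve as a stable source of randomness.

```python
def should_keep(trace_id: str, is_error: bool, sample_rate: float = 0.05) -> bool:
    """Keep every error trace and a fixed fraction of normal traces.

    The decision is derived from the trace ID instead of a random draw,
    so every service that sees the same ID reaches the same answer.
    """
    if is_error:
        return True
    # Map the last 8 hex digits of the ID onto the range [0, 1).
    bucket = int(trace_id[-8:], 16) / 0x1_0000_0000
    return bucket < sample_rate
```

Because the decision needs to know whether the request failed, this is a tail-style decision: spans have to be buffered until the request completes, which is a cost worth checking during the audit.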
Trace propagation checklist:
| Boundary | Method | Propagated? |
|---|---|---|
| HTTP → HTTP | W3C Trace Context header or B3 header | |
| HTTP → Queue | Message metadata | |
| Queue → Worker | Read from message metadata | |
| Sync → Async | Explicit context passing | |
| Service → Database | Span wrapping DB client | |
| Service → External API | Injected in outgoing headers | |
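For the HTTP → HTTP row, the W3C Trace Context `traceparent` header has the form `00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>`. In practice a library such as OpenTelemetry handles this; the sketch below only illustrates what propagation means, handles version `00` only, and uses hypothetical helper names (`parse_traceparent`, `outgoing_headers`). It also assumes header names have already been lowercased.

```python
import re
import secrets
from typing import Mapping, Optional, Tuple

# Version 00 only: trace-id (32 hex), parent-id (16 hex), trace-flags (2 hex).
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> Optional[Tuple[str, str, bool]]:
    """Return (trace_id, parent_id, sampled), or None if the header is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def outgoing_headers(incoming: Mapping[str, str]) -> dict:
    """Continue the caller's trace if present, otherwise start a new one."""
    parsed = parse_traceparent(incoming.get("traceparent", ""))
    if parsed:
        trace_id, _parent_id, sampled = parsed
    else:
        trace_id, sampled = secrets.token_hex(16), True
    span_id = secrets.token_hex(8)  # this service's own span becomes the new parent
    flags = "01" if sampled else "00"
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}
```

The key property to verify in an audit is visible here: the trace ID survives every hop unchanged, while the parent ID is replaced by each service's own span ID.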
1. ARE alerts defined for critical metrics? (error rate, latency, availability)
2. DO alerts have clear, actionable titles? (not "Alert #472")
3. DO alerts include runbooks or links? (what to do when this fires)
4. IS there alert severity tiering? (page vs warn vs info)
5. IS there alert fatigue? (> 50 alerts/week per team = too many)
6. DO alerts have proper thresholds? (not too sensitive, not too lax)
7. IS there an on-call rotation? (who gets paged?)
8. ARE alerts tested periodically? (fire drills for alerting)
Alert quality rubric:
| Quality | Bad Alert | Good Alert |
|---|---|---|
| Title | "Error detected" | "Payment API error rate > 5% for 3 min" |
| Context | None | Current rate: 8.2%, baseline: 0.3%, affected endpoints: /checkout |
| Action | "Investigate" | "Check payment gateway status, see runbook: link" |
| Routing | Everyone | On-call for payment-team |
| Threshold | Any error | > 5% error rate sustained for 3 minutes |
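The "sustained for 3 minutes" condition is what separates a good threshold from "any error". Alerting systems normally express this in their own rule language; the sketch below shows the underlying logic in plain Python. `sustained_breach`, `format_alert`, the `min_samples` guard, and the example runbook URL are all hypothetical.

```python
from typing import Sequence, Tuple


def sustained_breach(
    samples: Sequence[Tuple[float, float]],
    threshold: float,
    window_s: float,
    now: float,
    min_samples: int = 3,
) -> bool:
    """True if every (timestamp, error_rate) sample in the window exceeds the threshold.

    `min_samples` guards against firing on a single data point after a gap.
    """
    recent = [rate for ts, rate in samples if now - window_s <= ts <= now]
    return len(recent) >= min_samples and all(rate > threshold for rate in recent)


def format_alert(
    service: str,
    current: float,
    baseline: float,
    threshold: float,
    minutes: int,
    runbook_url: str,
) -> dict:
    """Build an alert payload with a specific title, context, and a next action."""
    return {
        "title": f"{service} error rate > {threshold:.0%} for {minutes} min",
        "context": f"Current rate: {current:.1%}, baseline: {baseline:.1%}",
        "action": f"See runbook: {runbook_url}",
    }
```

For example, `format_alert("Payment API", 0.082, 0.003, 0.05, 3, "https://example.com/runbooks/payment-errors")` yields the title "Payment API error rate > 5% for 3 min", matching the "Good Alert" column above.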
# Observability Audit: [Project Name]
## Summary
| Capability | Status | Tool | Maturity Level | Assessment |
|-----------|--------|------|---------------|------------|
| Structured Logging | ✅/⚠️/❌ | [Tool] | [0-4] | [Notes] |
| Health Checks | ✅/⚠️/❌ | — | [0-4] | [Notes] |
| Error Tracking | ✅/⚠️/❌ | [Tool] | [0-4] | [Notes] |
| Metrics & Dashboards | ✅/⚠️/❌ | [Tool] | [0-4] | [Notes] |
| Distributed Tracing | ✅/⚠️/❌ | [Tool] | [0-4] | [Notes] |
| Alerting | ✅/⚠️/❌ | [Tool] | [0-4] | [Notes] |
## Can Answer The Four Questions?
| Question | Answer | Confidence |
|----------|--------|------------|
| What happened? | ✅/❌ | [How quickly, how precisely] |
| When? | ✅/❌ | [Timestamp precision] |
| Why? | ✅/❌ | [Root cause discoverability] |
| To whom? | ✅/❌ | [User attribution capability] |
## Findings
[Standard severity format — sorted by severity]
## Summary
| Severity | Count |
|----------|-------|
| 🔴 Critical | N |
| 🟠 High | N |
| 🟡 Medium | N |
| 🟢 Low | N |
## Verdict: [PASS / CONDITIONAL PASS / FAIL]
- `console.log` / `print()` as the primary logging strategy
- architecture-audit
- incident-response capabilities (what you can observe, you can respond to)
- performance-audit metrics analysis
- security-audit for audit logging and intrusion detection
- ci-cd-audit for deployment marker integration