Application monitoring, logging, and observability for production applications. Use when: setting up error tracking, adding logging, creating health endpoints, configuring alerting, preparing for production, or when debugging production issues.
Monitoring turns invisible failures into visible signals. Without it, you learn about problems when users complain — or worse, when they leave silently. The goal is not dashboards; it is knowing when something breaks before users do.
Use JSON format with consistent fields across all log entries.
Required fields: timestamp, level, message, request_id
Useful fields: user_id (not PII), service_name, duration_ms, status_code
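A minimal sketch of such a logger — one JSON object per line, using the field names above (anything beyond those, like the example values, is illustrative):

```javascript
// Emit one JSON log entry per line with a consistent shape.
function logEvent(level, message, fields = {}) {
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    message,
    ...fields, // e.g. request_id, user_id, duration_ms, status_code
  };
  console.log(JSON.stringify(entry)); // one line per entry, trivially parseable
  return entry;
}

// Usage: logEvent("info", "payment processed", { request_id: "abc-123", duration_ms: 84 });
```

Returning the entry (not just printing it) keeps the function easy to test and to route to other sinks later.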
Log levels — use them correctly:

- error — Something failed and needs attention (broken request, unhandled exception)
- warn — Degraded state but still functioning (slow query, retry succeeded, cache miss)
- info — Important business events (user signup, payment processed, deploy complete)
- debug — Development detail (request payload, query plan) — never in production

Log: Auth failures, permission denials, input validation failures, external API calls (with duration), slow queries (> threshold), startup/shutdown events, deploy markers.
Never log: Passwords, API keys, tokens, session secrets, full credit card numbers, PII (emails, phone numbers, addresses). Sanitize or redact before logging.
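One way to enforce this is a sanitizer applied to every object before it is logged. A sketch, assuming a shallow key-name blocklist (the key list here is illustrative, not exhaustive):

```javascript
// Keys whose values must never reach the logs.
const SENSITIVE_KEYS = ["password", "token", "api_key", "secret", "email", "phone"];

// Recursively copy a value, replacing sensitive fields with a marker.
function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === "object") {
    const out = {};
    for (const [key, val] of Object.entries(value)) {
      out[key] = SENSITIVE_KEYS.includes(key.toLowerCase())
        ? "[REDACTED]"
        : redact(val);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}
```

Running every log payload through `redact` makes the safe path the default instead of relying on each call site to remember.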
Configure Sentry, LogRocket, Bugsnag, or similar. Capture stack traces, group by root cause, set up alerts on new error types. Tag errors with release version for regression detection.
/health — Basic liveness check. Returns 200 if the process is running. Used by load balancers and container orchestrators.
/ready — Readiness check. Verifies dependencies: database connection, cache availability, external API reachability. Returns 503 if any dependency is down.
External ping service (UptimeRobot, Pingdom, Better Stack) hits /health every 1-5 minutes from multiple regions. Alert on 2+ consecutive failures to avoid false positives.
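The same "2+ consecutive failures" debounce is easy to apply to any in-process check too. A minimal sketch (the threshold is a tunable assumption):

```javascript
// Track consecutive failures; only signal an alert once the threshold is hit.
function makeFailureTracker(threshold = 2) {
  let consecutive = 0;
  return function record(success) {
    consecutive = success ? 0 : consecutive + 1;
    return consecutive >= threshold; // true => fire the alert
  };
}
```

A single transient failure resets on the next success, so one blip never pages anyone.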
Page (immediate action required): 5xx error rate spike, health check failure, database unreachable.
Notify (review soon): Slow response times (p95 > threshold), high memory usage, elevated error rate.
Avoid alert fatigue: Fewer meaningful alerts are better than many noisy ones. Every alert should have a clear action. If you ignore an alert regularly, fix it or remove it.
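The page/notify split above can be encoded as a small classifier. The thresholds here are illustrative assumptions — tune them to your own traffic and SLOs:

```javascript
// Map current signals to a severity; every branch implies a clear action.
function classifyAlert({ errorRate, healthCheckFailed, p95Ms }) {
  if (healthCheckFailed || errorRate > 0.05) return "page";  // immediate action
  if (errorRate > 0.001 || p95Ms > 1000) return "notify";    // review soon
  return "none";
}
```

Centralizing thresholds in one function makes it obvious where to raise or remove a noisy rule.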
| Metric | What It Tells You |
|---|---|
| Response time (p50/p95/p99) | User experience and backend health |
| Error rate (5xx / total) | Reliability — target < 0.1% |
| Uptime percentage | SLA compliance — target 99.9%+ |
| Active users | Business health and load baseline |
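The first two metrics in the table can be derived directly from raw request samples. A sketch using the nearest-rank percentile method (one common choice among several):

```javascript
// p-th percentile by nearest rank: sort, then index ceil(p/100 * n) - 1.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

// Error rate: share of responses with a 5xx status code.
function errorRate(statusCodes) {
  const errors = statusCodes.filter((c) => c >= 500).length;
  return errors / statusCodes.length;
}
```

Usage: `percentile(durationsMs, 95)` gives the p95 latency; `errorRate(codes) < 0.001` checks the < 0.1% target from the table.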
| Rationalization | Reality |
|---|---|
| "We'll add monitoring when we need it" | You need it the moment a real user touches your app. Flying blind is gambling. |
| "Console.log is fine" | Console.log in production is noise. Structured logging with levels is signal. |
| "We don't have enough traffic to monitor" | Monitoring catches bugs before users report them. Traffic volume is irrelevant. |
| "Dashboards are enough" | Nobody watches dashboards at 3am. Alerts catch what dashboards miss. |
| "We'll just check the logs" | Unstructured logs at scale are unsearchable. Structure them from the start. |