Evaluate production readiness with checklists for monitoring, alerting, observability, incident response, and quality gates.
npx claudepluginhub nwave-ai/nwave --plugin nw

This skill uses the workspace's default tool permissions.
- **Performance**: response time | throughput | latency percentiles (P50, P95, P99)
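
As an illustration of the latency percentiles named above, here is a minimal sketch that computes P50, P95, and P99 from raw response-time samples. It assumes the nearest-rank definition of a percentile. The names `percentile` and `latency_summary` are hypothetical and not part of this skill. Production systems usually derive percentiles from histograms in the metrics backend rather than from raw samples.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    # Multiply before dividing so integer p values avoid float rounding surprises.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def latency_summary(samples_ms):
    """Return the three percentiles listed in the checklist (hypothetical helper)."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}


if __name__ == "__main__":
    # Illustrative response times in milliseconds, not real measurements.
    print(latency_summary([120, 80, 95, 300, 110]))
```

With only five samples, P95 and P99 both collapse to the maximum, which is why tail percentiles need a large sample window to be meaningful.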
Server/container health and resource utilization | Network performance and connectivity | Storage capacity and I/O performance | Security event detection.
| Tier | Condition | Response |
|---|---|---|
| Page | Service down, data loss risk, security breach | Immediate response |
| Urgent | Error rate >2x baseline, latency SLA breach | Response within 15 min |
| Warning | Capacity >80%, error rate trending up | Response within 1 hour |
| Info | Deployment complete, metric threshold crossed | Review next business day |
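
The tier table above can be read as an ordered set of rules, checked from most to least severe. The sketch below encodes that reading. The `Signal` fields and the `classify` function are hypothetical names chosen for illustration, and treating a zero baseline as "no 2x breach" is an assumption rather than something the table specifies.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    """Hypothetical snapshot of the conditions the tier table refers to."""
    service_down: bool = False
    data_loss_risk: bool = False
    security_breach: bool = False
    error_rate: float = 0.0            # current errors per request
    baseline_error_rate: float = 0.0   # normal errors per request
    latency_sla_breached: bool = False
    capacity_used: float = 0.0         # fraction of capacity, 0.0 to 1.0
    error_rate_trending_up: bool = False


# Response expectations copied from the table.
RESPONSE = {
    "page": "Immediate response",
    "urgent": "Response within 15 min",
    "warning": "Response within 1 hour",
    "info": "Review next business day",
}


def classify(s: Signal) -> str:
    """Return the most severe tier whose condition matches."""
    if s.service_down or s.data_loss_risk or s.security_breach:
        return "page"
    # Assumption: with no baseline (0), the 2x comparison is skipped.
    error_spike = s.baseline_error_rate > 0 and s.error_rate > 2 * s.baseline_error_rate
    if error_spike or s.latency_sla_breached:
        return "urgent"
    if s.capacity_used > 0.80 or s.error_rate_trending_up:
        return "warning"
    # Everything else (deployment complete, threshold crossed) is informational.
    return "info"


if __name__ == "__main__":
    tier = classify(Signal(error_rate=0.05, baseline_error_rate=0.02))
    print(tier, "->", RESPONSE[tier])
```

Checking tiers in severity order means a signal that matches several rows is routed by its worst condition, so a security breach during a capacity warning still pages.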
Regular update and patching schedule | Backup verification (test restores quarterly) | Security vulnerability scanning (automated, weekly) | Performance baseline recalibration (after major changes).
Operational runbooks for common procedures | Architecture documentation with system diagrams | Deployment procedures and configuration management | Troubleshooting guides for known failure modes.
Before declaring production-ready, all must pass:
For CI/CD architecture lessons and measurement coupling pitfalls, see the cicd-and-deployment skill.