By tonone-ai
Observability & reliability engineer — monitoring, alerting, SRE, incident response, SLOs
npx claudepluginhub tonone-ai/tonone --plugin vigil
Write SLO-based alert rules with burn rate thresholds and paired runbooks. Outputs actual alert configs, not a strategy doc. Use when asked to "set up alerts", "create runbooks", "define SLOs", or "alerting strategy".
Verify observability posture — audit monitoring coverage, find blind spots, prioritize gaps. Use when asked "is monitoring sufficient", "observability review", "are we covered", or "pre-launch monitoring check".
Incident response — diagnose production issues, find root cause, propose fix with rollback. Use when asked "something is broken", "why is this down", or to handle a "production issue", "incident", or "debug production" request.
Instrument a service with OpenTelemetry — RED metrics, structured logs, distributed tracing, and health checks. Outputs actual code and config, not a plan. Use when asked to "add monitoring", "instrument this", "add logging", "set up tracing", or "observability".
Observability reconnaissance — inventory what monitoring exists, map coverage, highlight blind spots. Use when asked "what monitoring exists", "observability assessment", or "what can we see".
Uses power tools: Bash, Write, or Edit.
DevsForge site reliability engineering specialist for building resilient and scalable systems
Production reliability and observability across all environments. Master Datadog, CloudWatch, monitoring, incident response, SRE practices, and audit logging for enterprise compliance.
Editorial "Observability & Monitoring" bundle for Claude Code from Antigravity Awesome Skills.
Site Reliability Engineering discipline agent for reliability, monitoring, and incident response
Use this agent when you need to implement comprehensive monitoring, observability, and alerting systems for enterprise B2B applications. This agent specializes in APM, logging, metrics, distributed tracing, SLA monitoring, and proactive incident management for business-critical systems. Examples: