Skill

monitoring-setup

Designs monitoring dashboards, alerting rules, SLOs, and error budgets for production systems. Trigger: "monitoring setup", "dashboards", "alerts", "SLOs", "error budgets", "observability".

From sovereign-architect

Install

Run in your terminal

npx claudepluginhub javimontano/mao-sovereign-architect

Tool Access

This skill is limited to using the following tools:

Read, Glob, Grep, Bash, Agent

Supporting Assets

evals/evals.json

examples/sample-output.md

prompts/use-case-prompts.md

references/body-of-knowledge.md

Skill Content

Stats

Stars: 0

Forks: 0

Last Commit: Mar 28, 2026

Procedure

Step 1 — Define Service Level Objectives

Identify critical user journeys and the services that support them.

Define SLIs (Service Level Indicators): latency, error rate, throughput, availability.

Set SLO targets: e.g., "99.9% of requests complete in under 300ms over a 30-day window."

Calculate error budgets: allowable failures = (1 - SLO) * total requests in the window. For example, a 99.9% SLO over 1,000,000 requests allows 1,000 failures.

Define error budget policies: what happens when the budget is exhausted (feature freeze, reliability sprint).
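The error budget arithmetic in this step can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill itself: the names Slo, error_budget, and budget_remaining are hypothetical, and a "failed" request is assumed to be any request that violates the SLI (an error, or a response slower than the latency threshold).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A request-based SLO (hypothetical helper type for this sketch)."""
    target: float          # e.g. 0.999 for "99.9% of requests are good"
    window_days: int = 30  # rolling window, as in the example target above


def error_budget(slo: Slo, total_requests: int) -> float:
    """Allowable failures = (1 - SLO) * total requests in the window."""
    return (1.0 - slo.target) * total_requests


def budget_remaining(slo: Slo, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget(slo, total_requests)
    if budget == 0:
        return 0.0  # a 100% target leaves no budget to spend
    return 1.0 - failed_requests / budget


slo = Slo(target=0.999)
print(round(error_budget(slo, 1_000_000)))               # allowable failures
print(round(budget_remaining(slo, 1_000_000, 250), 2))   # share of budget left
```

A policy check (for example, triggering a feature freeze) could then compare the remaining fraction against zero or against an agreed warning level.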

Step 2 — Design Dashboards

System Overview Dashboard: Service health, request rate, error rate, latency percentiles (p50, p95, p99).

Business Metrics Dashboard: Revenue-impacting metrics, conversion rates, feature adoption.

Infrastructure Dashboard: CPU, memory, disk, network, pod count, node health.

Deployment Dashboard: Deploy frequency, rollback count, change failure rate, lead time.

Follow the USE method (Utilization, Saturation, Errors) for infrastructure and RED method (Rate, Errors, Duration) for services.
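As a rough illustration of the RED method, the sketch below derives rate, error ratio, and latency percentiles from a batch of request samples. It assumes an in-memory list of samples; Request, percentile, and red_summary are hypothetical names, and a real dashboard would normally query these values from a metrics backend instead of computing them by hand.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    """One observed request (hypothetical sample type for this sketch)."""
    duration_ms: float
    ok: bool


def percentile(sorted_values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of pre-sorted values; pct is an integer 1..100."""
    if not sorted_values:
        raise ValueError("no samples")
    # Integer ceiling of pct * n / 100, to avoid floating-point rounding.
    rank = (pct * len(sorted_values) + 99) // 100
    return sorted_values[max(rank, 1) - 1]


def red_summary(requests: Sequence[Request], window_seconds: float) -> dict:
    """Rate, Errors, Duration for one service over one time window."""
    if not requests:
        raise ValueError("no requests in window")
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if not r.ok)
    return {
        "rate_rps": len(requests) / window_seconds,   # Rate
        "error_ratio": errors / len(requests),        # Errors
        "p50_ms": percentile(durations, 50),          # Duration
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }


# 100 synthetic requests with durations 1..100 ms; every 20th one fails.
samples = [Request(duration_ms=float(i), ok=(i % 20 != 0)) for i in range(1, 101)]
print(red_summary(samples, window_seconds=50))
```

Percentiles are used here instead of averages because a mean can hide a slow tail that the p95 and p99 panels are meant to expose.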

Step 3 — Configure Alerting Rules

Alert on symptoms (error rate, latency), not causes (CPU, memory), unless a cause reliably predicts imminent failure.

Define severity levels: PAGE (wake someone up), TICKET (fix in business hours), LOG (informational).

Set thresholds with burn-rate alerts: a fast-burn alert fires when the error budget is being consumed quickly, and a slow-burn alert fires when it is being consumed steadily over a longer period.

Configure alert routing: on-call rotation, escalation paths, notification channels.

Include runbook links in every alert so responders know what to do.
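One way to picture burn-rate alerting with the severity levels above is the sketch below. Burn rate is the observed error ratio divided by the ratio the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO window. The window names, the thresholds (14.4 and 1.0), and the mapping to PAGE and TICKET are illustrative assumptions for a 30-day window, not values taken from this skill; tune them for your own service.

```python
from typing import Mapping

# Illustrative rules (assumptions): (severity, long window, short window, threshold).
# Requiring both windows to exceed the threshold avoids alerting on a brief
# spike (short window only) or on an incident that has already recovered
# (long window only).
RULES = (
    ("PAGE", "1h", "5m", 14.4),   # fast burn: wake someone up
    ("TICKET", "3d", "6h", 1.0),  # slow burn: fix in business hours
)


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1.0 - slo_target)


def evaluate(error_ratios: Mapping[str, float], slo_target: float) -> str:
    """Return the highest severity whose long and short windows both exceed its threshold."""
    for severity, long_window, short_window, threshold in RULES:
        long_burn = burn_rate(error_ratios[long_window], slo_target)
        short_burn = burn_rate(error_ratios[short_window], slo_target)
        if long_burn >= threshold and short_burn >= threshold:
            return severity
    return "LOG"


# Error ratios per window, as a metrics backend might report them (made-up numbers).
observed = {"1h": 0.02, "5m": 0.03, "3d": 0.005, "6h": 0.01}
print(evaluate(observed, slo_target=0.999))
```

The returned severity could then drive routing, with PAGE going to the on-call rotation and TICKET to a work queue, each carrying its runbook link.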

Step 4 — Operationalize and Iterate

Establish on-call rotation with clear responsibilities and escalation.

Schedule quarterly SLO reviews to adjust targets based on data.

Track alert noise: if an alert fires and requires no action, it should be tuned or removed.

Conduct post-incident reviews that feed back into monitoring improvements.

Document all dashboards, alerts, and SLOs in an operational catalog.
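Alert noise tracking can start from a simple log of firings and whether each one led to action. The sketch below is one possible approach under stated assumptions: AlertFiring and noisy_alerts are hypothetical names, the alert names are made up, and the 50% ratio and minimum of three firings are arbitrary defaults chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class AlertFiring:
    """One alert firing and whether a responder had to act (hypothetical record)."""
    alert_name: str
    action_taken: bool


def noisy_alerts(
    firings: Iterable[AlertFiring],
    max_no_action_ratio: float = 0.5,  # assumed cutoff, tune per team
    min_firings: int = 3,              # ignore alerts with too little history
) -> List[str]:
    """Names of alerts that mostly fire without anyone needing to act."""
    firings = list(firings)
    total = Counter(f.alert_name for f in firings)
    no_action = Counter(f.alert_name for f in firings if not f.action_taken)
    return sorted(
        name
        for name, count in total.items()
        if count >= min_firings and no_action[name] / count > max_no_action_ratio
    )


history = (
    [AlertFiring("HighCPU", action_taken=False)] * 4
    + [AlertFiring("HighErrorRate", action_taken=True)] * 3
    + [AlertFiring("DiskFilling", action_taken=False)] * 2
)
print(noisy_alerts(history))  # candidates to tune or remove
```

Reviewing this list alongside the quarterly SLO review gives a regular point at which to tune or retire alerts.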

Monitoring Setup

Guiding Principle

Procedure

Step 1 — Define Service Level Objectives

Step 2 — Design Dashboards

Step 3 — Configure Alerting Rules

Step 4 — Operationalize and Iterate

Quality Criteria

Anti-Patterns
