Manages SLO/SLI definition, baseline-monitoring Helm chart configuration, and Datadog SLO validation. Use when defining service level objectives, configuring monitoring, validating SLOs, or analyzing service reliability with Datadog metrics.
```
npx claudepluginhub andercore-labs/claudes-kitchen --plugin operational-excellence
```

This skill uses the workspace's default tool permissions.
**SCOPE:** SLO/SLI design and baseline-monitoring integration for Andercore services.
MANDATORY: All Andercore services MUST use baseline-monitoring
| Component | Requirement |
|---|---|
| Helm Chart | baseline-monitoring v1.1.0+ |
| Repository | https://github.com/andercore/helm-charts/tree/main/charts/baseline-monitoring |
| Values | global.mandatory.{app, owner, tier}, global.framework |
| Datadog | MCP tools for analysis (when available) |
SLI = Measurable indicator (latency, error rate, throughput)
SLO = SLI + Target (99.9% < 200ms)
SLA = SLO + Consequences (refund if breach)
Tier 1: 99.9% uptime (critical services)
Tier 2: 99.5% uptime (important services)
Tier 3: 99.0% uptime (non-critical services)
New service SLO definition | SLO adjustment | baseline-monitoring setup | Datadog analysis | Error budget monitoring | Framework migration
| Category | SLI | Measurement |
|---|---|---|
| Availability | Success rate | (successful_requests / total_requests) × 100 |
| Latency | Response time | p50, p95, p99 percentiles |
| Throughput | Request rate | requests/second |
| Quality | Error rate | (failed_requests / total_requests) × 100 |
| Durability | Data retention | (retained_data / stored_data) × 100 |
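As a minimal sketch, the ratio-based SLIs in the table reduce to simple arithmetic over raw counters; the input numbers below are hypothetical stand-ins for values that would normally come from Datadog queries:

```python
# Minimal sketch of the ratio-based SLIs from the table above.
# The input numbers are hypothetical; real values come from Datadog queries.

def availability(successful_requests: int, total_requests: int) -> float:
    """Success rate as a percentage: (successful / total) * 100."""
    return successful_requests / total_requests * 100

def error_rate(failed_requests: int, total_requests: int) -> float:
    """Error rate as a percentage: (failed / total) * 100."""
    return failed_requests / total_requests * 100

def throughput(total_requests: int, window_seconds: int) -> float:
    """Average request rate over the window, in requests/second."""
    return total_requests / window_seconds

total, failed = 1_000_000, 800
print(f"availability: {availability(total - failed, total):.3f}%")   # 99.920%
print(f"error rate:   {error_rate(failed, total):.3f}%")             # 0.080%
print(f"throughput:   {throughput(total, 30 * 24 * 3600):.1f} req/s")
```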
Pattern: SLI + Target + Time Window
✓ 99.9% of requests < 200ms (28-day rolling)
✓ 99.95% availability (monthly)
✓ p99 latency < 500ms (weekly)
✗ "Fast response times" (not measurable)
✗ "High availability" (no target)
| Tier | Services | Uptime SLO | Use Case |
|---|---|---|---|
| 1 | Critical | 99.9% | Payment, auth, core API |
| 2 | Important | 99.5% | Notifications, reporting |
| 3 | Non-critical | 99.0% | Admin tools, analytics |
Error budget calculation:
SLO = 99.9% → Error budget = 0.1%
1M requests/month → 1000 errors allowed
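A short sketch of this arithmetic (the numbers match the example above; the downtime conversion assumes a 30-day month):

```python
# Error budget arithmetic for the example above (illustrative only).

slo_target = 99.9                      # percent
error_budget_pct = 100 - slo_target    # 0.1%

monthly_requests = 1_000_000
allowed_errors = monthly_requests * error_budget_pct / 100
print(f"allowed errors/month: {allowed_errors:.0f}")            # 1000

# The same budget expressed as downtime for an uptime SLO:
minutes_per_month = 30 * 24 * 60
allowed_downtime = minutes_per_month * error_budget_pct / 100
print(f"allowed downtime/month: {allowed_downtime:.1f} min")    # 43.2 min
```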
```yaml
# Chart.yaml
dependencies:
  - name: baseline-monitoring
    version: ">=1.1.0"
    repository: "oci://ghcr.io/andercore"
```
```yaml
# values.yaml
global:
  mandatory:
    app: "service-name"     # REQUIRED
    owner: "team-name"      # REQUIRED
    tier: 1                 # REQUIRED: 1, 2, or 3
  framework: "express"      # REQUIRED: express, fastify, fastapi, koa, http
```
```yaml
# values-prod.yaml
global:
  environment: "prod-weu"
```

```yaml
# values-stg.yaml
global:
  environment: "stg-weu"
```
Default uptime SLO (enabled automatically):
```yaml
baseline-monitoring:
  slos:
    uptime:
      enabled: true
      target: ""        # Empty = tier-based default
      timeframe: "7d"
```
Composite Availability SLO (Recommended):
Track both application-level (HTTP) and infrastructure-level (Kubernetes) availability separately:
```yaml
baseline-monitoring:
  slos:
    # Application availability (HTTP success rate via OpenTelemetry)
    application_availability:
      enabled: true
      name: "Application Availability"
      description: "99.9% of requests succeed (HTTP 2xx/3xx)"
      target: "99.9"
      timeframe: "7d"
      query:
        numerator: |
          sum:http.server.requests{service:{{ .serviceName }}, env:{{ .serviceEnv }}, http.response.status_code:2*}.as_count() +
          sum:http.server.requests{service:{{ .serviceName }}, env:{{ .serviceEnv }}, http.response.status_code:3*}.as_count()
        denominator: |
          sum:http.server.requests{service:{{ .serviceName }}, env:{{ .serviceEnv }}}.as_count()

    # Infrastructure availability (pod health via Kubernetes)
    infrastructure_availability:
      enabled: true
      name: "Infrastructure Availability"
      description: "Healthy pods available 99.9% of time"
      target: "99.9"
      timeframe: "7d"
      query:
        numerator: |
          sum:kubernetes_state.deployment.replicas_available{deployment:{{ .serviceName }}, env:{{ .serviceEnv }}}.rollup(avg, 60)
        denominator: |
          sum:kubernetes_state.deployment.replicas_desired{deployment:{{ .serviceName }}, env:{{ .serviceEnv }}}.rollup(avg, 60)

    # Latency SLO (OpenTelemetry preferred, APM fallback)
    latency:
      enabled: true
      name: "Latency SLO"
      description: "99% of requests under 200ms"
      target: "99.0"
      timeframe: "30d"
      query:
        # Option 1: OpenTelemetry (preferred)
        numerator: |
          sum:http.server.duration{service:{{ .serviceName }}, env:{{ .serviceEnv }}, p99:http.server.duration:<200}.as_count()
        denominator: |
          sum:http.server.requests{service:{{ .serviceName }}, env:{{ .serviceEnv }}}.as_count()
        # Option 2: APM tracing (fallback if OTel not available)
        # numerator: |
        #   sum:trace.express.request.hits{service:{{ .serviceName }}, env:{{ .serviceEnv }}, p99:trace.express.request.duration:<200}.as_count()
        # denominator: |
        #   sum:trace.express.request.hits{service:{{ .serviceName }}, env:{{ .serviceEnv }}}.as_count()

    # Error rate SLO (4xx + 5xx combined)
    error_rate:
      enabled: true
      name: "Error Rate SLO"
      description: "< 1% error rate (4xx + 5xx)"
      target: "99.0"
      timeframe: "30d"
      query:
        numerator: |
          sum:http.server.requests{service:{{ .serviceName }}, env:{{ .serviceEnv }}, !http.response.status_code:4*, !http.response.status_code:5*}.as_count()
        denominator: |
          sum:http.server.requests{service:{{ .serviceName }}, env:{{ .serviceEnv }}}.as_count()
```
Why separate availability SLOs?
| Scenario | Application SLO | Infrastructure SLO | Root Cause |
|---|---|---|---|
| All pods healthy, 5% 5xx errors | 95% ✗ | 100% ✓ | Application code issue |
| 1 of 3 pods crashes, 0% errors | 100% ✓ | 66% ✗ | Infrastructure/K8s issue |
| All pods healthy, 0% errors | 100% ✓ | 100% ✓ | Healthy |
| Rolling deploy (brief downtime) | 100% ✓ | 95% ~ | Expected deployment |
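To make the divergence concrete, here is a small sketch computing both SLIs for the second scenario above; all numbers are hypothetical:

```python
# Why the two availability SLOs can disagree: scenario "1 of 3 pods crashes,
# 0% HTTP errors". Numbers are hypothetical.

# Application availability: successful HTTP requests over all requests.
ok_requests, total_requests = 50_000, 50_000
app_slo = ok_requests / total_requests * 100             # 100.0%

# Infrastructure availability: available replicas over desired replicas,
# averaged over the window (here a single sample for simplicity).
replicas_available, replicas_desired = 2, 3
infra_slo = replicas_available / replicas_desired * 100  # 66.7%

print(f"application:    {app_slo:.1f}%  (meets 99.9% target)")
print(f"infrastructure: {infra_slo:.1f}%   (breaches 99.9% target)")
```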
Three metric sources are available: OpenTelemetry, Datadog APM tracing, and Kubernetes state metrics.

OpenTelemetry (preferred) — framework-agnostic, standards-based:
| Metric | Purpose | Tags |
|---|---|---|
| http.server.requests | Request count | service, env, http.response.status_code, http.method, http.route |
| http.server.duration | Request latency | service, env, http.route |
Advantages: queries are framework-agnostic (no coupling to global.framework), and metric names stay stable across framework migrations.
Status code filtering:

```
http.response.status_code:2*   # 2xx success
http.response.status_code:4*   # 4xx client errors
http.response.status_code:5*   # 5xx server errors
```
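For illustration, a hypothetical helper (not part of baseline-monitoring) that assembles these OpenTelemetry count queries with the status-code filters above:

```python
# Hypothetical helper for building the OTel-based Datadog count queries shown
# above. The metric and tag names follow the table; the helper itself is an
# illustration, not part of baseline-monitoring.

def otel_count_query(service: str, env: str, status_prefix: str = "") -> str:
    filters = [f"service:{service}", f"env:{env}"]
    if status_prefix:  # e.g. "2" for 2xx, "5" for 5xx
        filters.append(f"http.response.status_code:{status_prefix}*")
    return "sum:http.server.requests{" + ", ".join(filters) + "}.as_count()"

print(otel_count_query("my-service", "prod", "2"))
# sum:http.server.requests{service:my-service, env:prod, http.response.status_code:2*}.as_count()
```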
Datadog APM tracing (framework-specific):
| Framework | Inbound Metric | Query Pattern |
|---|---|---|
| Express | trace.express.request | trace.express.request.hits |
| Fastify | trace.fastify.request | trace.fastify.request.hits |
| Koa | trace.koa.request | trace.koa.request.hits |
| FastAPI | trace.fastapi.request | trace.fastapi.request.hits |
| HTTP | trace.http.request | trace.http.request.hits |
Status code filtering:

```
http.status_class:2xx   # 2xx success
http.status_class:4xx   # 4xx client errors
http.status_class:5xx   # 5xx server errors
```
Note: global.framework must be set to match the framework actually in use.
Infrastructure availability:
| Metric | Purpose |
|---|---|
| kubernetes_state.deployment.replicas_available | Healthy running pods |
| kubernetes_state.deployment.replicas_desired | Target pod count |
| kubernetes.containers.restarts | Container restart count |
| kubernetes.pods.running | Running pod count |
Use for infrastructure SLOs (separate from application SLOs)
When `mcp__datadog__*` tools are available:
1. Query Datadog metrics (30-day window)
2. Calculate current RED metrics
3. Analyze performance vs tier targets
4. Suggest SLO configuration
5. Generate baseline-monitoring YAML
6. Validate queries in Datadog
7. Deploy via ArgoCD
Uptime SLO:

```
Numerator:   sum:trace.express.request.hits{service:api,env:prod,!http.status_class:5xx}.as_count()
Denominator: sum:trace.express.request.hits{service:api,env:prod}.as_count()
```

Latency SLO:

```
Numerator:   sum:trace.express.request.hits{service:api,p99:trace.express.request.duration:<500}.as_count()
Denominator: sum:trace.express.request.hits{service:api}.as_count()
```
Query validation:
1. Test in Datadog Metrics Explorer
2. Verify time series returns data
3. Calculate ratio manually
4. Confirm the result matches the expected SLO
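As a sketch of steps 2-4 in script form, the ratio can be checked against Datadog's v1 timeseries query endpoint. This assumes `DD_API_KEY` / `DD_APP_KEY` environment variables are set and the default `datadoghq.com` site; adjust both for your org:

```python
# Sketch: query numerator and denominator via Datadog's v1 timeseries API
# and compare the ratio against the SLO target. Assumes DD_API_KEY and
# DD_APP_KEY are set; the API host may differ per Datadog site.
import os
import time
import requests

def query_sum(query: str, days: int = 7) -> float:
    now = int(time.time())
    resp = requests.get(
        "https://api.datadoghq.com/api/v1/query",
        params={"from": now - days * 86400, "to": now, "query": query},
        headers={
            "DD-API-KEY": os.environ["DD_API_KEY"],
            "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
        },
    )
    resp.raise_for_status()
    # Sum every datapoint across all returned series (approximate total count).
    series = resp.json().get("series", [])
    return sum(p[1] for s in series for p in s["pointlist"] if p[1] is not None)

num = query_sum("sum:trace.express.request.hits{service:api,env:prod,!http.status_class:5xx}.as_count()")
den = query_sum("sum:trace.express.request.hits{service:api,env:prod}.as_count()")
print(f"measured uptime: {num / den * 100:.3f}% (target: 99.9%)")
```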
Default monitors (enabled automatically):
| Monitor | Purpose | Tier-Based Threshold |
|---|---|---|
| High latency | p95 latency threshold | Yes |
| High error rate | 5xx error rate | Yes |
| Pod restarts | Restart detection | Yes |
| High CPU | CPU threshold | Yes |
| High memory | Memory threshold | Yes |
| Low pods | Minimum pods | Yes (disabled in staging) |
Customize:
```yaml
baseline-monitoring:
  monitors:
    enabled: true
    notificationTag: "@slack-my-team"
    latency:
      threshold: 500    # Override (ms)
      priority: 5       # 1-5 (1 = highest)
      timeWindow: "5m"
    errorRate:
      threshold: 5      # Override (%)
```
MANDATORY before deployment:
| Pattern | Severity | Fix |
|---|---|---|
| No baseline-monitoring dependency | CRITICAL | Add to Chart.yaml |
| Missing global.mandatory.app | CRITICAL | Add service name |
| Missing global.mandatory.owner | CRITICAL | Add team name |
| Missing global.mandatory.tier | CRITICAL | Add tier (1, 2, or 3) |
| Missing global.framework | CRITICAL | Add framework |
| Wrong framework in query | CRITICAL | Use correct trace.{framework}.request or http.server.requests |
| No SLOs defined | CRITICAL | Enable uptime SLO or define custom |
| HTTP-only uptime SLO | ERROR | Add infrastructure_availability SLO with kubernetes_state metrics |
| OpenTelemetry metrics missing | ERROR | Verify http.server.requests exists in Datadog |
| Kubernetes metrics missing | ERROR | Verify kubernetes_state.deployment.replicas_* exists |
| Target > tier default | ERROR | Lower target or justify |
| Missing template variables | ERROR | Use {{ .serviceName }}, {{ .serviceEnv }} |
| Hardcoded service name | ERROR | Use {{ .serviceName }} |
| Invalid timeframe | ERROR | Use 7d, 30d, or 90d |
| Datadog MCP available but not used | WARNING | Use mcp__datadog__* tools |
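A hedged pre-deployment check for the CRITICAL rows above might look like the sketch below: it verifies the mandatory values before the chart is rendered. This is an illustration, not part of baseline-monitoring; it assumes PyYAML is installed:

```python
# Illustrative pre-deployment check for the mandatory values listed above.
# Usage: python check_values.py values.yaml
import sys
import yaml  # pip install pyyaml

REQUIRED = ["app", "owner", "tier"]
FRAMEWORKS = {"express", "fastify", "fastapi", "koa", "http"}

with open(sys.argv[1]) as f:
    values = yaml.safe_load(f) or {}

g = values.get("global", {})
mandatory = g.get("mandatory", {})

problems = [f"missing global.mandatory.{k}" for k in REQUIRED if k not in mandatory]
if mandatory.get("tier") not in (1, 2, 3):
    problems.append("global.mandatory.tier must be 1, 2, or 3")
if g.get("framework") not in FRAMEWORKS:
    problems.append(f"global.framework must be one of {sorted(FRAMEWORKS)}")

for p in problems:
    print(f"CRITICAL: {p}")
sys.exit(1 if problems else 0)
```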
If the tools are available:

```js
// 1. Fetch metrics
mcp__datadog__get_service_metrics({
  service: "my-service",
  env: "prod",
  timeframe: "30d"
})

// 2. Analyze performance
//    → Calculate uptime, p99 latency
//    → Compare vs tier targets
//    → Suggest SLO config

// 3. Validate queries
mcp__datadog__query_metrics({
  query: "sum:trace.express.request.hits{service:my-service}.as_count()",
  from: "7d",
  to: "now"
})

// 4. Check existing SLOs
mcp__datadog__list_slos({
  service: "my-service",
  env: "prod"
})
```
Multi-window approach:
| Window | Purpose | Alert Condition |
|---|---|---|
| 1-hour | Incident detection | Burn rate > 14.4× |
| 7-day | Tactical response | Burn rate > 2× |
| 28-day | Strategic planning | Track long-term trends |
Calculation:
28-day SLO = 99.9% (0.1% error budget)
1-hour budget = 0.1% / (28 × 24) = 0.0001488%
14.4× burn = 0.002143% errors in 1 hour → page
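The same arithmetic as a short sketch (numbers match the calculation above):

```python
# Burn-rate arithmetic from the calculation above (illustrative).

slo = 99.9
window_days = 28
budget = (100 - slo) / 100                    # 0.001 = fraction of requests

hourly_budget = budget / (window_days * 24)   # share of budget per hour
print(f"1-hour budget: {hourly_budget * 100:.7f}%")           # 0.0001488%

burn_rate = 14.4
page_threshold = burn_rate * hourly_budget
print(f"page if >{page_threshold * 100:.6f}% errors in 1h")   # 0.002143%

# Sanity check: at a sustained 14.4x burn, the 28-day budget is gone in ~2 days.
print(f"budget exhausted in {window_days / burn_rate:.1f} days")
```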
Monitors not created:
1. `kubectl get datadogmonitor -n <namespace>`
2. `argocd app get <app-name>`
3. `kubectl logs -n datadog -l app=datadog-operator`
SLO queries return no data:
1. Verify the framework matches global.framework
2. Check the metric exists in Datadog Metrics Explorer
3. Confirm the service/env tags are correct
4. Test the query in Datadog first
Migrating a manual monitor:
1. Export the monitor from Datadog
2. Convert it to baseline-monitoring YAML
3. Test in staging
4. Deploy via ArgoCD
5. Delete the manual monitor
Framework migration (e.g. Express → Fastify):
1. Identify the current metric: trace.express.request
2. Verify the new metric: trace.fastify.request
3. Update global.framework: "fastify"
4. Update the SLO queries
5. Deploy to staging, then prod
- baseline-monitoring: https://github.com/andercore/helm-charts/tree/main/charts/baseline-monitoring
- SLO theory: https://sre.google/workbook/implementing-slos/
- Datadog queries: https://docs.datadoghq.com/dashboards/querying/
- Complete example: see slo-complete-example.md
- Observability: see skill:operational-excellence:observability-recipe