From argos
New-service observability bootstrap: structured logs + RED/USE metrics + OTel traces + SLO/SLI definitions + actionable alerts + 3-tier dashboard. Stack-agnostic (Prometheus/Loki/Tempo or Datadog/Honeycomb).
npx claudepluginhub resultakak/argos --plugin argos
This skill uses the workspace's default tool permissions.
`agents/shared/severity-rubric.md` and `agents/shared/escalation-matrix.md` are treated as loaded by default (agents/coordination.md §11). This skill's output must follow the Critical / High / Medium / Low + evidence format; a speculative Critical is forbidden. Findings outside this skill's ownership are delegated to the relevant agent; if the decision-authority threshold is exceeded, user approval is mandatory.
| Signal | SLI |
|---|---|
| Availability | 2xx_3xx_4xx_count / total_count (5xx counts as failure) |
| Latency | requests < 500ms / total, or p99 < 500ms |
| Throughput/Saturation | current_qps < capacity * 0.7 |
| Quality | Cache hit %, freshness, accuracy (product-specific) |
Few but meaningful: 3-5 SLIs per service.
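The availability and latency rows of the table can be sketched as ratio SLIs over a batch of request samples. This is a minimal illustration, not the skill's own implementation: `RequestSample`, `availability_sli`, and `latency_sli` are hypothetical names, and treating an empty batch as fully healthy is an assumption.

```python
from dataclasses import dataclass


@dataclass
class RequestSample:
    status: int        # HTTP status code
    duration_s: float  # request duration in seconds


def availability_sli(samples: list[RequestSample]) -> float:
    """Share of requests that did not fail with a 5xx status."""
    if not samples:
        return 1.0  # assumption: no traffic counts as fully available
    good = sum(1 for s in samples if s.status < 500)
    return good / len(samples)


def latency_sli(samples: list[RequestSample], threshold_s: float = 0.5) -> float:
    """Share of requests that finished faster than the threshold."""
    if not samples:
        return 1.0  # assumption: no traffic counts as fully fast
    fast = sum(1 for s in samples if s.duration_s < threshold_s)
    return fast / len(samples)


samples = [
    RequestSample(200, 0.12),
    RequestSample(404, 0.03),  # 4xx is a client error, still "available"
    RequestSample(503, 0.90),  # 5xx counts against availability
    RequestSample(201, 0.61),  # successful but slower than 500ms
]
print(availability_sli(samples), latency_sli(samples))
```

Both SLIs are "good events / total events" ratios, which keeps them directly comparable with an SLO target such as 0.999.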
- service: api-svc
  slos:
    - name: availability
      sli: "rate(http_requests_total{code!~'5..'}[1m]) / rate(http_requests_total[1m])"
      target: 0.999
      window: 30d_rolling
    - name: latency_p99
      sli: "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))"
      target: 0.5 # 500ms
      window: 30d_rolling
Error budget calculation:
The target must be realistic for the team: maintained rather than aspirational.
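The error budget arithmetic can be sketched as follows. This is an illustrative calculation only: `error_budget` and `budget_remaining` are hypothetical helper names, and the request counts in the example are made up.

```python
from datetime import timedelta


def error_budget(target: float, window: timedelta) -> timedelta:
    """Allowed 'bad' time within the window for a given SLO target."""
    return window * (1.0 - target)


def budget_remaining(target: float, bad: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if total == 0:
        return 1.0  # assumption: no traffic spends no budget
    allowed_ratio = 1.0 - target
    return 1.0 - (bad / total) / allowed_ratio


# 99.9% over a 30-day rolling window leaves 0.1% of 43,200 minutes.
budget = error_budget(0.999, timedelta(days=30))
print(budget.total_seconds() / 60)  # budget in minutes

# Hypothetical counts: 400 failed requests out of 1,000,000.
print(budget_remaining(0.999, bad=400, total=1_000_000))
```

For a 0.999 target over 30 days the budget is about 43.2 minutes, which is a useful sanity check on whether the target is one the team can actually maintain.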
groups:
  - name: api-svc-slo-burn
    rules:
      - alert: APIHighBurnFast
        expr: |
          (
            sum(rate(http_requests_total{service="api",code=~"5.."}[1h]))
            / sum(rate(http_requests_total{service="api"}[1h]))
          ) > 14.4 * 0.001  # 14.4x burn (= 2% of budget per 1h)
          and
          (
            sum(rate(http_requests_total{service="api",code=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="api"}[5m]))
          ) > 14.4 * 0.001
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "api-svc burning 2% of monthly budget per hour"
          runbook_url: "https://runbooks.example/api-svc/high-error-rate"
The AND condition over both windows (1h and 5m) reduces false positives.
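The logic of the two-window rule can be sketched in plain Python to show why the AND matters. This is an illustration of the idea, not how Prometheus evaluates the rule; `burn_rate`, `budget_consumed`, and `should_page` are hypothetical names, and the error ratios passed in are made-up inputs.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo_target)


def budget_consumed(rate: float, hours: float, window_hours: float = 720.0) -> float:
    """Fraction of the window's budget spent at a constant burn rate."""
    return rate * hours / window_hours


def should_page(err_long: float, err_short: float,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Fire only if both the long (1h) and short (5m) windows exceed the threshold."""
    return (burn_rate(err_long, slo_target) > threshold
            and burn_rate(err_short, slo_target) > threshold)


print(budget_consumed(14.4, hours=1))   # share of a 30d budget spent in 1h
print(should_page(0.02, 0.03))          # sustained and still ongoing
print(should_page(0.02, 0.001))         # 1h window still high, incident already over
print(should_page(0.005, 0.05))         # brief spike, long window not yet affected
```

The long window shows that the burn is significant, while the short window shows that it is still happening, so an alert stops firing soon after recovery.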
# structured logger + correlation
# (order, customer and ctx come from the surrounding request handler)
import structlog

log = structlog.get_logger()
log.info(
    "order_created",
    order_id=order.id,
    customer_id=customer.id,  # must not be PII (mask it if it is)
    amount=order.total,
    trace_id=ctx.trace_id,  # OTel context
)
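The masking hinted at by the PII comment in the logging call above could look like the sketch below. It uses only the standard library and follows the `(logger, method_name, event_dict)` call shape that structlog processors use, so a function like this could be listed in the `processors` passed to `structlog.configure`. The key names in `PII_KEYS` and the `[REDACTED]` placeholder are assumptions for illustration.

```python
# Hypothetical set of field names considered PII for this service.
PII_KEYS = {"email", "phone", "card_number"}


def mask_pii(logger, method_name, event_dict):
    """Replace the values of known PII keys before the event is rendered."""
    for key in PII_KEYS.intersection(event_dict):
        event_dict[key] = "[REDACTED]"
    return event_dict


event = mask_pii(None, "info", {
    "event": "order_created",
    "order_id": 7,
    "email": "jane@example.com",
})
print(event)
```

Redacting by key name is a deliberately simple choice; it only catches fields that are logged under known names, so free-text values still need care.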
1%) or error-only full.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint=OTLP_URL)))
trace.set_tracer_provider(provider)
# auto-instrument: FastAPI, requests, sqlalchemy
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
FastAPIInstrumentor.instrument_app(app)
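Auto-instrumentation carries trace context between services in the W3C `traceparent` header, so the header normally never needs to be handled by hand. The sketch below only illustrates what that header contains (`version-traceid-parentid-flags`, lowercase hex); `make_traceparent` and `parse_traceparent` are hypothetical helpers, not OpenTelemetry APIs.

```python
import re
import secrets

# version "00", 32-hex trace id, 16-hex parent (span) id, 2-hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def make_traceparent(trace_id=None, sampled=True):
    """Build a header value, generating random ids where none are given."""
    trace_id = trace_id or secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header):
    """Return the header's fields, or None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero ids are invalid
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),  # lowest flag bit = sampled
    }


parsed = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
print(parsed)
```

The `trace_id` extracted here is the same value that the structured log line carries, which is what makes log-to-trace correlation possible.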
traceparent header propagation.
from prometheus_client import Counter, Histogram, Gauge
http_requests_total = Counter(
    "http_requests_total",
    "Total HTTP requests",
    labelnames=["service", "method", "endpoint", "code"],
    # do NOT add a user_id label (cardinality explodes)
)
http_request_duration_seconds = Histogram(
    "http_request_duration_seconds",
    "HTTP request duration",
    labelnames=["service", "method", "endpoint"],
    buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10),
)
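The cardinality warning applies to the `endpoint` label as well: raw paths such as `/orders/123` create one time series per id. One possible way to keep the label bounded is to collapse id-like path segments before recording the metric. The sketch below is an assumption about how that could be done; `endpoint_label` and the `{id}` placeholder are hypothetical, and a web framework's route template is usually a better source when it is available.

```python
import re

_NUMERIC = re.compile(r"^\d+$")
_UUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def endpoint_label(path: str) -> str:
    """Collapse numeric and UUID path segments into a fixed placeholder."""
    segments = path.split("?", 1)[0].split("/")  # drop the query string
    normalized = [
        "{id}" if _NUMERIC.match(s) or _UUID.match(s) else s
        for s in segments
    ]
    return "/".join(normalized) or "/"


print(endpoint_label("/orders/123/items?page=2"))
print(endpoint_label("/users/3f2b8c1e-9d4a-4e7b-8a6f-1c2d3e4f5a6b"))
```

With this normalization the number of distinct `endpoint` values tracks the number of routes rather than the number of resources.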
| Tier | Contents | First panel |
|---|---|---|
| Overview | SLO + golden signals + deploy timeline | Error budget burn-down |
| Drill-down | Per-endpoint RED + slow query + cache miss | Top-5 slow endpoint |
| Infra | CPU/Memory/Network/Disk + DB pool + Queue depth | Saturation heatmap |
service, env, version Grafana variable.
userId): metric storage explodes.
for: 0s rule: flapping.
User: /observe-bootstrap api-svc
Agent (observability-engineer):
1. Service inventory: 12 endpoints, main journey is checkout (POST /orders).
2. Dependencies: postgres-rw, redis, payment-provider, kafka.
3. SLI selection:
   - availability: 5xx ratio < 0.1% (30d)
   - latency_p99: < 500ms (30d)
   - cache_hit_rate: > 85% (30d)
4. SLO error budget: 0.1% / 30d = 43m 12s downtime/month.
5. Burn rate alert: 14.4x (critical), 6x (high), 3x (medium).
6. Stack detection: Prometheus + Loki + Tempo (OTel SDK compatible).
7. Diff:
   - `app/observability.py`: structlog + OTel init
   - `app/middleware.py`: PII redact, trace propagation
   - `kubernetes/api-svc-rules.yaml`: burn rate Prom rule
   - `grafana/api-svc-overview.json`: overview dashboard
8. Alert drill plan: trigger 5xx errors via a feature flag, test the runbook.
9. Output: 4 file diffs + dashboard JSON + runbook link checklist.
# Observability Bootstrap: <service>
## SLI/SLO
| SLI | Target | Window |
## Critical / High / Medium / Low (existing gaps)
## Diff (summary)
```yaml
# config / SDK init / alert rule
```
| Severity | Burn rate | Runbook |