Expert in implementing comprehensive monitoring, observability, and reliability engineering practices with **production-ready observability stacks** built on Prometheus, OpenTelemetry, and the ELK stack. Designs SLO-based alerting, distributed tracing, and monitoring dashboards with an emphasis on proactive incident prevention, performance optimization, and reliable alerting.
```yaml
# Prometheus configuration example
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'us-east-1'

rule_files:
  - "alerts/*.yml"
  - "recording_rules/*.yml"

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
```
```python
# OpenTelemetry Python example
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Initialize tracing
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Add OTLP exporter (Jaeger accepts OTLP over gRPC on port 4317)
otlp_exporter = OTLPSpanExporter(endpoint="http://jaeger:4317", insecure=True)
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

@tracer.start_as_current_span("process_user_request")
def process_request(user_id):
    with tracer.start_as_current_span("database_query") as span:
        span.set_attribute("user.id", user_id)
        result = run_query(user_id)  # hypothetical database call
    return result
```
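The exporter above assumes an OTLP-capable backend. In many deployments an OpenTelemetry Collector sits between the application and the tracing backend; a minimal Collector configuration for that topology might look like the following sketch (pipeline layout and the `otlp/jaeger` exporter name are illustrative):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
```

With this in place, the application's OTLP endpoint points at the Collector rather than at Jaeger directly, which decouples instrumentation from the backend choice.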
```yaml
# Prometheus alerting rules (loaded via rule_files, routed through Alertmanager)
groups:
  - name: golden_signals
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "99th percentile latency is {{ $value }}s"

      - alert: HighErrorRate
        # Threshold is in errors per second
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          runbook_url: "https://wiki.company.com/runbooks/high-error-rate"

      - alert: SLOErrorBudgetBurn
        # 99.9% SLO, 1h fast-burn window (14.4x burn rate)
        expr: |
          (
            rate(http_requests_total{status=~"5.."}[1h])
            /
            rate(http_requests_total[1h])
          ) > (14.4 * (1 - 0.999))
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning too fast"
          description: "Error ratio is {{ $value | humanizePercentage }}; at this burn rate the monthly error budget will be exhausted within days"
```
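The 14.4 multiplier in the burn-rate rule comes from the multiwindow burn-rate approach popularized by the Google SRE Workbook: a burn rate of 14.4 sustained over a 1-hour window consumes 2% of a 30-day error budget. A quick check of the arithmetic:

```python
# Burn-rate arithmetic for a 99.9% SLO over a 30-day window
slo_target = 0.999
error_budget = 1 - slo_target           # 0.001 -> 0.1% of requests may fail

burn_rate = 14.4                        # alert threshold multiplier
alert_threshold = burn_rate * error_budget
print(round(alert_threshold, 4))        # 0.0144 -> alert when >1.44% of requests fail

# Fraction of the 30-day budget consumed per hour at this burn rate
window_hours = 30 * 24
budget_per_hour = burn_rate / window_hours
print(round(budget_per_hour, 3))        # 0.02 -> 2% of the budget per hour
```

Firing at this threshold means roughly the whole monthly budget would be gone in about two days if the error ratio held, which justifies the `critical` severity.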
```json
{
  "availability_sli": {
    "target": "99.9%",
    "measurement": "successful_requests / total_requests",
    "window": "30d"
  },
  "latency_sli": {
    "target": "95% < 200ms",
    "measurement": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
    "window": "1h"
  },
  "throughput_sli": {
    "target": "> 1000 RPS",
    "measurement": "rate(http_requests_total[5m])",
    "window": "5m"
  }
}
```
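An availability target translates into a concrete, and usually small, error budget. The helper below (an illustrative sketch, not part of any library) converts an SLO target into allowed downtime per window:

```python
# Convert an availability SLO into an allowed-downtime error budget
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full downtime allowed per window at the given SLO."""
    return (1 - slo_target) * window_days * 24 * 60

print(round(error_budget_minutes(0.999), 1))   # 43.2 minutes per 30 days
print(round(error_budget_minutes(0.9999), 2))  # 4.32 minutes per 30 days
```

The jump from three to four nines cuts the budget tenfold, which is why SLO targets should be chosen deliberately rather than defaulting to the highest number.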
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
spec:
  replicas: 2
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana   # must match spec.selector.matchLabels
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:10.0.0
          env:
            - name: GF_SECURITY_ADMIN_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: grafana-secret
                  key: admin-password
          ports:
            - containerPort: 3000
          volumeMounts:
            - name: grafana-storage
              mountPath: /var/lib/grafana
      volumes:
        - name: grafana-storage
          persistentVolumeClaim:
            claimName: grafana-pvc   # assumes a pre-created PVC
```
```python
# DataDog custom metrics example
import os
import time

from datadog import initialize, statsd

# Initialize DataDog
options = {
    'api_key': os.getenv('DD_API_KEY'),
    'app_key': os.getenv('DD_APP_KEY'),
}
initialize(**options)

def track_business_metrics():
    # Track user registrations
    statsd.increment('user.registration', tags=['source:web'])

    # Track response time
    start_time = time.time()
    process_request()  # application-specific request handler
    duration = time.time() - start_time
    statsd.histogram('api.request.duration', duration, tags=['endpoint:users'])

    # Track a custom business metric
    statsd.gauge('inventory.stock_level', get_stock_level(), tags=['product:premium'])
```
```bash
# Quick start with Docker
curl -O https://raw.githubusercontent.com/prometheus/prometheus/main/documentation/examples/prometheus.yml
docker run -d --name prometheus -p 9090:9090 \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus
docker run -d --name grafana -p 3000:3000 grafana/grafana
```
```yaml
# docker-compose.yml for the ELK stack
version: '3.8'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.8.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    ports:
      - "9200:9200"
  logstash:
    image: docker.elastic.co/logstash/logstash:8.8.0
    depends_on:
      - elasticsearch
    ports:
      - "5044:5044"
  kibana:
    image: docker.elastic.co/kibana/kibana:8.8.0
    depends_on:
      - elasticsearch
    ports:
      - "5601:5601"
```
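The Logstash service above exposes port 5044 (the Beats input) but ships no pipeline. A minimal `logstash.conf` sketch that forwards Beats events into Elasticsearch might look like this (the index name and filter are illustrative assumptions):

```
input {
  beats {
    port => 5044
  }
}

filter {
  # Parse JSON application logs when the message field contains JSON
  json {
    source => "message"
    skip_on_invalid_json => true
  }
}

output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
  }
}
```

Mount such a file into the Logstash container (e.g. at `/usr/share/logstash/pipeline/`) so the compose stack ingests logs end to end.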
Specialized in production-grade observability solutions that prevent incidents before they impact users.