Master Kubernetes observability, monitoring with Prometheus, logging, metrics, and distributed tracing. Learn to implement comprehensive monitoring strategies.
Implements comprehensive Kubernetes observability using Prometheus, Loki, and OpenTelemetry. Claude uses this when you need to set up monitoring stacks, write PromQL/LogQL queries, configure SLO-based alerting, or troubleshoot missing metrics and logs.
/plugin marketplace add pluginagentmarketplace/custom-plugin-kubernetes
/plugin install kubernetes-assistant@pluginagentmarketplace-kubernetes
This skill inherits all available tools. When active, it can use any tool Claude has access to.
assets/config.yaml
references/GUIDE.md
scripts/helper.py
Production-grade Kubernetes observability covering the complete stack from infrastructure metrics to application tracing. This skill provides deep expertise in implementing SLO-based monitoring, multi-signal observability, and proactive alerting for enterprise environments.
Prometheus Stack Installation
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace \
  --set grafana.adminPassword=secure-password \
  --set prometheus.prometheusSpec.retention=30d
Essential PromQL Queries
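# Pod CPU usage (container!="" skips the pod-level cgroup series so usage is not double-counted)
sum(rate(container_cpu_usage_seconds_total{namespace="production", container!=""}[5m])) by (pod)
# Memory utilization as a percentage of the memory limit
# (containers without a limit report 0 and skew the result)
sum(container_memory_working_set_bytes{namespace="production", container!=""}) by (pod)
  / sum(container_spec_memory_limit_bytes{namespace="production", container!=""}) by (pod) * 100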
# Request rate
sum(rate(http_requests_total[5m])) by (service)
# Error rate (5xx)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100
# P99 latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
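The P99 query is an estimate, not an exact measurement: `histogram_quantile` interpolates linearly inside the bucket that contains the requested rank. The Python sketch below illustrates that idea under simplifying assumptions. The function name mirrors the PromQL function only for readability, the bucket values are invented, and edge cases that Prometheus handles (NaN inputs, non-monotonic buckets) are left out.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    ending with the +Inf bucket (the `le` label in Prometheus).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1] for this sketch")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound

# Hypothetical request-duration buckets in seconds (cumulative counts).
sample_buckets = [(0.1, 900), (0.2, 985), (0.5, 998), (math.inf, 1000)]
p99 = histogram_quantile(0.99, sample_buckets)  # roughly 0.315 seconds
```

Because the result depends on bucket boundaries, a 200 ms latency target is easiest to evaluate when one of the histogram's `le` bounds sits at 0.2.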
ServiceMonitor
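apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-server
  namespace: monitoring
  labels:
    # Assumption: matches the Helm release name used at install time. The chart's
    # default serviceMonitorSelector only picks up monitors carrying this label.
    release: prometheus
spec:
  selector:
    matchLabels:
      app: api-server
  namespaceSelector:
    matchNames:
      - production
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics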
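A ServiceMonitor only produces scrape targets when three things line up: the Service carries the labels in `matchLabels`, the Service lives in a selected namespace, and the Service exposes a port with the name given under `endpoints`. The Python sketch below checks those three conditions on plain dictionaries shaped like the manifests. The `diagnose` helper and the sample Service are hypothetical and only illustrate the matching rules; they do not talk to a cluster.

```python
def diagnose(monitor, service):
    """Return reasons why `monitor` would not select `service` (empty list = match)."""
    problems = []
    spec = monitor["spec"]

    # 1. Every matchLabels pair must be present on the Service.
    labels = service["metadata"].get("labels", {})
    for key, value in spec["selector"].get("matchLabels", {}).items():
        if labels.get(key) != value:
            problems.append(f"label {key}={value} not on Service")

    # 2. Namespace must be selected; with no namespaceSelector the
    #    monitor's own namespace is used.
    ns_selector = spec.get("namespaceSelector", {})
    if not ns_selector.get("any"):
        allowed = ns_selector.get("matchNames") or [monitor["metadata"]["namespace"]]
        if service["metadata"]["namespace"] not in allowed:
            problems.append("Service namespace is not selected")

    # 3. Endpoint ports refer to *named* Service ports.
    port_names = {p.get("name") for p in service["spec"].get("ports", [])}
    for endpoint in spec.get("endpoints", []):
        if endpoint.get("port") not in port_names:
            problems.append(f"no Service port named {endpoint.get('port')!r}")
    return problems

monitor = {
    "metadata": {"name": "api-server", "namespace": "monitoring"},
    "spec": {
        "selector": {"matchLabels": {"app": "api-server"}},
        "namespaceSelector": {"matchNames": ["production"]},
        "endpoints": [{"port": "metrics", "interval": "15s", "path": "/metrics"}],
    },
}
# Hypothetical Service whose port is named differently from the endpoint.
service = {
    "metadata": {"name": "api-server", "namespace": "production",
                 "labels": {"app": "api-server"}},
    "spec": {"ports": [{"name": "http-metrics", "port": 8080}]},
}
issues = diagnose(monitor, service)
```

In this sample the labels and namespace match, so the only reported issue is the port name, which is a common cause of an empty target list.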
Loki Stack
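apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
data:
  promtail.yaml: |
    server:
      http_listen_port: 3101
    positions:
      filename: /tmp/positions.yaml
    clients:
      - url: http://loki:3100/loki/api/v1/push
    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace]
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod
          # Tell Promtail which files to tail; without a __path__ label no logs are read.
          - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
            separator: /
            target_label: __path__
            replacement: /var/log/pods/*$1/*.log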
LogQL Queries
# Errors in production
{namespace="production"} |= "error"
# JSON log parsing
{app="api-server"} | json | status >= 500
# Rate of errors
rate({namespace="production"} |= "error" [5m])
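Two details of these queries are easy to miss: the `|=` line filter is a case-sensitive substring match, and `rate(... [5m])` over a log selector yields matching entries per second, not a percentage. The Python sketch below models both on an in-memory list of `(timestamp, line)` pairs. The helper names and sample entries are invented for illustration, and the window boundary handling is a simplification.

```python
def line_filter(entries, needle):
    """Mimic `|= "needle"`: keep entries whose line contains the substring."""
    return [(ts, line) for ts, line in entries if needle in line]

def log_rate(entries, now, window_seconds):
    """Mimic rate(... [window]): entries per second inside the look-back window."""
    in_window = [ts for ts, _ in entries if now - window_seconds < ts <= now]
    return len(in_window) / window_seconds

# Hypothetical log entries: (unix seconds, log line).
entries = [
    (690, "error: too old for the window"),
    (710, "error: db timeout"),
    (800, "ERROR: upper-case, not matched by the filter"),
    (900, "request ok"),
    (950, "upstream error"),
]
matched = line_filter(entries, "error")
errors_per_second = log_rate(matched, now=1000, window_seconds=300)
```

Here three lines pass the filter and two of them fall inside the five-minute window, so the rate is 2/300 per second. If upper-case `ERROR` lines should count as well, a case-insensitive regex filter such as `|~ "(?i)error"` is one option.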
OpenTelemetry Collector
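apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
spec:
  mode: deployment
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 10s
    exporters:
      # The dedicated jaeger exporter has been removed from recent Collector
      # releases. Jaeger accepts OTLP directly, so export OTLP over gRPC instead
      # (assumes the Jaeger collector has its OTLP receiver enabled on 4317).
      otlp/jaeger:
        endpoint: jaeger-collector:4317
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp/jaeger]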
SLO Definition
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-server-slo
  labels:
    # Assumption: matches the Helm release name; the chart's default
    # ruleSelector only loads rules carrying this label.
    release: prometheus
spec:
  groups:
    - name: slo.rules
      rules:
        # Availability SLO: 99.9%
        - record: slo:availability:ratio
          expr: |
            sum(rate(http_requests_total{status!~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
        # Latency SLO: P99 < 200ms
        - record: slo:latency:p99
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
    - name: slo.alerts
      rules:
        - alert: HighErrorRate
          expr: (1 - slo:availability:ratio) > 0.001
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Error rate exceeds SLO (>0.1%)"
        - alert: HighLatency
          expr: slo:latency:p99 > 0.2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "P99 latency exceeds 200ms"
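The `0.001` threshold in `HighErrorRate` is the error budget implied by a 99.9% availability target. The short Python sketch below works through that arithmetic and the related burn rate, which expresses how fast the budget is being consumed. The 30-day window and the 0.5% observed error ratio are assumptions chosen for illustration and do not appear in the rule above.

```python
def error_budget(slo_target):
    """Fraction of requests allowed to fail, e.g. 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target

def allowed_bad_minutes(slo_target, window_days):
    """Minutes of full outage the budget covers over the window."""
    return error_budget(slo_target) * window_days * 24 * 60

def burn_rate(observed_error_ratio, slo_target):
    """How many times faster than sustainable the budget is being spent.

    A burn rate of 1 uses exactly the whole budget over the SLO window.
    """
    return observed_error_ratio / error_budget(slo_target)

budget = error_budget(0.999)                  # about 0.001
outage_minutes = allowed_bad_minutes(0.999, 30)  # about 43.2 minutes in 30 days
rate = burn_rate(0.005, 0.999)                # about 5x the sustainable pace
```

Read this way, the alert expression `(1 - slo:availability:ratio) > 0.001` fires whenever the five-minute burn rate exceeds 1, which can be noisy for short error spikes.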
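Alertmanager Configuration
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-config
stringData:
  alertmanager.yaml: |
    # Alertmanager does not expand environment variables. Substitute the ${...}
    # placeholders (for example with envsubst or a templating step) before applying.
    global:
      resolve_timeout: 5m
    route:
      receiver: 'default'
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      routes:
        - match:
            severity: critical
          receiver: 'pagerduty'
        - match:
            severity: warning
          receiver: 'slack'
    receivers:
      - name: 'default'
        slack_configs:
          - channel: '#alerts'
            api_url: '${SLACK_WEBHOOK}'
      - name: 'pagerduty'
        pagerduty_configs:
          - service_key: '${PD_SERVICE_KEY}'
      - name: 'slack'
        slack_configs:
          - channel: '#alerts'
            # Each Slack receiver needs a webhook unless global.slack_api_url is set.
            api_url: '${SLACK_WEBHOOK}'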
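The routing tree in this configuration is evaluated top down: the first child route whose `match` labels all equal the alert's labels wins, and an alert that matches no child falls back to the parent's receiver. The Python sketch below models that behavior on a dictionary copy of the route block. It is a simplified illustration with a hypothetical `pick_receiver` helper, and it ignores `continue`, regex matchers, and grouping.

```python
def pick_receiver(route, labels, inherited=None):
    """Return the receiver name an alert with `labels` would be routed to."""
    receiver = route.get("receiver", inherited)
    for child in route.get("routes", []):
        wanted = child.get("match", {})
        if all(labels.get(key) == value for key, value in wanted.items()):
            # First matching child wins; descend into its own sub-routes.
            return pick_receiver(child, labels, receiver)
    return receiver

route = {
    "receiver": "default",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "pagerduty"},
        {"match": {"severity": "warning"}, "receiver": "slack"},
    ],
}

critical = pick_receiver(route, {"alertname": "HighErrorRate", "severity": "critical"})
warning = pick_receiver(route, {"alertname": "HighLatency", "severity": "warning"})
unlabeled = pick_receiver(route, {"alertname": "Watchdog"})
```

With the rules defined earlier, `HighErrorRate` pages through PagerDuty, `HighLatency` goes to Slack, and any alert without a matching `severity` label lands on the default receiver.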
Monitoring Problem?
│
├── No metrics
│ ├── Check ServiceMonitor selector
│ ├── Verify /metrics endpoint
│ └── Check Prometheus targets
│
├── Missing logs
│ ├── Check Promtail/Fluentbit pods
│ ├── Verify log format
│ └── Check Loki ingestion
│
└── Alert not firing
├── Check PromQL expression
├── Verify thresholds
└── Check Alertmanager routes
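# Service names depend on the Helm release; list them with: kubectl get svc -n monitoring
# Prometheus targets (prometheus-operated is created by the Prometheus Operator)
kubectl port-forward -n monitoring svc/prometheus-operated 9090
# Visit http://localhost:9090/targets
# Grafana access (name assumes the release is called "prometheus"; the Service listens on port 80)
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Check ServiceMonitors
kubectl get servicemonitors -A
# Alertmanager status (alertmanager-operated is created by the Prometheus Operator)
kubectl port-forward -n monitoring svc/alertmanager-operated 9093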
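Instead of reading the targets page by eye, the same information is available from the Prometheus HTTP API at `/api/v1/targets`. The Python sketch below lists targets that are not healthy. It assumes a port-forward to `localhost:9090` is already running, and the sample payload is a trimmed, invented example of the response shape rather than real output.

```python
import json
import urllib.request

def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the targets payload from the Prometheus HTTP API."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=10) as resp:
        return json.load(resp)

def unhealthy_targets(payload):
    """Return (job, scrape URL, last error) for each active target that is not up."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (t.get("labels", {}).get("job", "?"), t.get("scrapeUrl", "?"), t.get("lastError", ""))
        for t in targets
        if t.get("health") != "up"
    ]

# Trimmed, hypothetical response used to exercise the parser offline.
sample_payload = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "kubelet"}, "scrapeUrl": "https://10.0.0.5:10250/metrics",
             "health": "up", "lastError": ""},
            {"labels": {"job": "api-server"}, "scrapeUrl": "http://10.0.0.7:8080/metrics",
             "health": "down", "lastError": "connection refused"},
        ]
    },
}
failing = unhealthy_targets(sample_payload)
```

Against a live cluster, `unhealthy_targets(fetch_targets())` returns the same tuples for real targets. A service that is missing from the payload entirely usually points back to the ServiceMonitor selector rather than to a scrape failure.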
| Challenge | Solution |
|---|---|
| High cardinality | Reduce labels, aggregation |
| Retention costs | Tiered storage, downsampling |
| Alert fatigue | SLO-based alerting |
| Missing traces | Auto-instrumentation |
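The high-cardinality row is easier to reason about with numbers: in the worst case, the number of time series for one metric is the product of the distinct values of each label. The Python sketch below shows that arithmetic. The label names and counts are invented for illustration, and real series counts are usually lower because not every label combination occurs.

```python
from math import prod

def max_series(label_cardinalities):
    """Upper bound on time series for one metric: product of label value counts."""
    return prod(label_cardinalities.values())

# Hypothetical label value counts for an HTTP request counter.
with_pod_label = {"service": 20, "status": 5, "method": 4, "pod": 50}
without_pod_label = {k: v for k, v in with_pod_label.items() if k != "pod"}

before = max_series(with_pod_label)     # 20 * 5 * 4 * 50
after = max_series(without_pod_label)   # 20 * 5 * 4
```

Dropping or aggregating away a single high-churn label such as a pod name reduces the bound by that label's full cardinality, which is why label reduction is listed first.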
| Metric | Target |
|---|---|
| Metric collection | 100% of services |
| Log retention | 30 days |
| Alert response | <5 minutes |
| Dashboard coverage | All critical services |