Queries Prometheus metrics and alert rules via HTTP API for CPU/memory/disk utilization, service health, capacity trends, and target checks using PromQL.
Install via the plugin hub:

`npx claudepluginhub add xai/enterprise-harness-engineering --plugin enterprise-harness-engineering`

This skill uses the workspace's default tool permissions.
Query monitoring metrics, check alerts, and verify target health via the Prometheus HTTP API. API and PromQL syntax are referenced through Context7 MCP; only environment-specific rules are documented here.
Configure your Prometheus endpoint before using this skill:
| Variable | Description | Required |
|---|---|---|
| PROMETHEUS_URL | Your Prometheus server URL (e.g. http://prometheus.internal:9090) | Yes |
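A minimal setup sketch. The URL below is the table's example value, not a real endpoint; `/-/healthy` is Prometheus's standard health endpoint:

```shell
# Point the skill at your Prometheus server (example value -- substitute your own)
export PROMETHEUS_URL="http://prometheus.internal:9090"

# Sanity check: a healthy server answers "Prometheus Server is Healthy."
curl -s "$PROMETHEUS_URL/-/healthy"
```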
Common metric prefixes to monitor:
- `node_*` — Node Exporter (host metrics: CPU, memory, disk, network)
- `kube_*` — kube-state-metrics (K8s object state: deployments, pods, nodes)
- `container_*` — cAdvisor (container resource usage)
- `apiserver_*` — K8s API Server metrics
- `kubelet_*` — Kubelet metrics
- `prometheus_*` — Prometheus self-monitoring

If you have additional exporters (Kafka, Redis, custom applications), add their metric prefixes here:
| Prefix | Source | Description |
|---|---|---|
| kafka_* | Kafka Exporter | Broker and consumer group metrics |
| fluentbit_* | Fluent Bit | Log pipeline metrics |
| (add your own) | | |
Authentication: Configure as needed for your environment (none, basic auth, or bearer token).
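As a sketch of the two authenticated variants (the credentials and the `PROM_TOKEN` variable are placeholders, not values from this environment):

```shell
# Basic auth (placeholder credentials)
curl -s -u "user:password" "$PROMETHEUS_URL/api/v1/query?query=up"

# Bearer token (placeholder token variable)
curl -s -H "Authorization: Bearer $PROM_TOKEN" "$PROMETHEUS_URL/api/v1/query?query=up"
```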
API endpoints and PromQL syntax can be found in the official Prometheus documentation.
Environment-specific rules:

- Set `PROMETHEUS_URL` accordingly for the environment you are querying
- The `step` of a range query should not be smaller than the scrape interval (typically 15s-60s) to avoid invalid interpolation
- Use `rate()` / `sum by()` aggregations
- On macOS, use `date -v-1H +%s` instead of the Linux `date -d '1 hour ago' +%s`

Job labels are the key to locating services. Common naming patterns:
| Pattern | Example | Description |
|---|---|---|
| {env}-{region}-{service} | prod-gateway | Service by environment and region |
| kubernetes-{resource} | kubernetes-pods | Standard K8s metrics |
| {component}-exporter | kafka-exporter | Dedicated exporters |
Configure your own job naming convention here to help the agent locate services correctly.
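Rather than guessing at patterns, the job values that actually exist can be listed via Prometheus's standard label-values endpoint; a sketch (requires a reachable server, and "gateway" is the example service from the table above):

```shell
# List all values of the job label, then filter for a service of interest
curl -s "$PROMETHEUS_URL/api/v1/label/job/values" | jq -r '.data[]' | grep -i gateway
```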
If you run Kafka with a Kafka Exporter, this is a common pattern:
```
# Aggregate consumer lag by consumergroup and topic
sum by (consumergroup, topic) (kafka_consumergroup_lag)
```
Normal lag range depends on your workload. Sustained growth indicates consumer processing capacity issues.
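One way to turn "sustained growth" into a query is `deriv()` over the lag gauge; a sketch, where the 15m window and the zero threshold are illustrative rather than tuned values:

```shell
# Consumer groups whose lag is trending upward over the last 15 minutes
curl -s "$PROMETHEUS_URL/api/v1/query" --get \
  --data-urlencode 'query=sum by (consumergroup) (deriv(kafka_consumergroup_lag[15m])) > 0' \
  | jq '.data.result[] | {group: .metric.consumergroup, lag_growth_per_sec: .value[1]}'
```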
Common drilldown paths:

- Nodes: `node_cpu_seconds_total` -> `node_memory_MemAvailable_bytes` -> `node_filesystem_avail_bytes` -> locate high-load nodes
- Kafka: `kafka_brokers` (broker count) -> `kafka_consumergroup_lag` (consumer lag) -> `kafka_topic_partition_under_replicated_partition` (under-replicated partitions)
- Containers: `container_cpu_usage_seconds_total` -> `container_memory_working_set_bytes` -> aggregate by pod/namespace
- K8s objects: `kube_node_status_condition` -> `kube_pod_status_phase` -> `kube_deployment_status_replicas_unavailable`

Anti-pattern:

```shell
# High-cardinality label aggregation -- will cause Prometheus OOM
curl "$PROMETHEUS_URL/api/v1/query?query=sum%20by(pod)(rate(container_cpu_usage_seconds_total[5m]))"
# pod label cardinality is too high (hundreds of pods); aggregate by namespace or deployment instead
```
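A lower-cardinality alternative, as the comment suggests; a sketch that aggregates at the namespace level (requires a live server):

```shell
# Aggregate at namespace level instead of per-pod to keep result cardinality low
curl -s "$PROMETHEUS_URL/api/v1/query" --get \
  --data-urlencode 'query=sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))' \
  | jq '.data.result[] | {namespace: .metric.namespace, cores: .value[1]}'
```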
```shell
# Check Kafka consumer lag
curl -s "$PROMETHEUS_URL/api/v1/query?query=sum%20by%20(consumergroup,topic)(kafka_consumergroup_lag)" | jq '.data.result[] | {group: .metric.consumergroup, topic: .metric.topic, lag: .value[1]}'
```
```shell
# Check node CPU usage top 10 (averaged across cores per instance)
curl -s "$PROMETHEUS_URL/api/v1/query?query=topk(10,100*(1-avg%20by%20(instance)(rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))))" | jq '.data.result[] | {node: .metric.instance, cpu_pct: .value[1]}'
```
```shell
# Disk space prediction: free bytes on / projected 24h ahead
# (a negative or near-zero result means the disk will fill within 24h)
curl -s "$PROMETHEUS_URL/api/v1/query?query=predict_linear(node_filesystem_avail_bytes{mountpoint=\"/\"}[24h],86400)" | jq '.data.result[] | {instance: .metric.instance, predicted_bytes: .value[1]}'
```
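All of the queries above return the same JSON envelope, so the jq extraction patterns can be prototyped offline; a sketch against a hand-made sample response (the instance name and value are fabricated, not live data):

```shell
# Minimal shape of a Prometheus instant-query response (fabricated sample)
RESPONSE='{"status":"success","data":{"resultType":"vector","result":[{"metric":{"instance":"node-a:9100"},"value":[1700000000,"42.5"]}]}}'

# The same jq pattern used above works on the sample
echo "$RESPONSE" | jq -r '.data.result[] | "\(.metric.instance) \(.value[1])"'
```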