Skill

monitoring-expert

Configures monitoring systems with Prometheus/Grafana, implements structured logging/metrics/tracing/alerting, load testing with k6/Artillery, profiling, and capacity planning. For observability and production debugging.

Prometheus

Node

monitoring

devops

npx claudepluginhub jeffallan/claude-skills --plugin fullstack-dev-skills

Tool Access

This skill uses the workspace's default tool permissions.

Preview

Observability and performance specialist implementing comprehensive monitoring, alerting, tracing, and performance testing systems.

Supporting Assets

references/alerting-rules.mdreferences/application-profiling.mdreferences/capacity-planning.mdreferences/dashboards.mdreferences/opentelemetry.mdreferences/performance-testing.mdreferences/prometheus-metrics.mdreferences/structured-logging.md

SKILL.md

Similar Skills

springboot-security

172.9k

Applies Spring Security best practices for authn/authz, input validation, CSRF, secrets, headers, rate limiting, and dependency security in Java Spring Boot services.

everything-claude-code

agent-introspection-debugging

172.9k

Implements structured self-debugging workflow for AI agent failures: capture errors, diagnose patterns like loops or context overflow, apply contained recoveries, and generate introspection reports.

everything-claude-code

energy-procurement

172.9k

Provides expertise on electricity/gas procurement, tariff optimization, demand charge management, renewable PPA evaluation, hedging, load profiling, and multi-facility energy strategies.

everything-claude-code

Stats

Stars8718

Forks718

Last CommitApr 29, 2026

Actions

View Source View Plugin View on GitHub View README

Help us improve

Share bugs, ideas, or general feedback.

Monitoring Expert

Observability and performance specialist implementing comprehensive monitoring, alerting, tracing, and performance testing systems.

Core Workflow

Assess — Identify what needs monitoring (SLIs, critical paths, business metrics)
Instrument — Add logging, metrics, and traces to the application (see examples below)
Collect — Configure aggregation and storage (Prometheus scrape, log shipper, OTLP endpoint); verify data arrives before proceeding
Visualize — Build dashboards using RED (Rate/Errors/Duration) or USE (Utilization/Saturation/Errors) methods
Alert — Define threshold and anomaly alerts on critical paths; validate no false-positive flood before shipping

Quick-Start Examples

Structured Logging (Node.js / Pino)

import pino from 'pino';

const logger = pino({ level: 'info' });

// Good — structured fields, includes correlation ID
logger.info({ requestId: req.id, userId: req.user.id, durationMs: elapsed }, 'order.created');

// Bad — string interpolation, no correlation
console.log(`Order created for user ${userId}`);

Prometheus Metrics (Node.js)

import { Counter, Histogram, register } from 'prom-client';

const httpRequests = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
});

const httpDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request latency',
  labelNames: ['method', 'route'],
  buckets: [0.05, 0.1, 0.3, 0.5, 1, 2, 5],
});

// Instrument a route
app.use((req, res, next) => {
  const end = httpDuration.startTimer({ method: req.method, route: req.path });
  res.on('finish', () => {
    httpRequests.inc({ method: req.method, route: req.path, status: res.statusCode });
    end();
  });
  next();
});

// Expose scrape endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

OpenTelemetry Tracing (Node.js)

import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { trace } from '@opentelemetry/api';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: 'http://jaeger:4318/v1/traces' }),
});
sdk.start();

// Manual span around a critical operation
const tracer = trace.getTracer('order-service');
async function processOrder(orderId) {
  const span = tracer.startSpan('order.process');
  span.setAttribute('order.id', orderId);
  try {
    const result = await db.saveOrder(orderId);
    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (err) {
    span.recordException(err);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw err;
  } finally {
    span.end();
  }
}

Prometheus Alerting Rule

groups:
  - name: api.rules
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
          / rate(http_requests_total[5m]) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% on {{ $labels.route }}"

k6 Load Test

import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '1m', target: 50 },   // ramp up
    { duration: '5m', target: 50 },   // sustained load
    { duration: '1m', target: 0 },    // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'],  // 95th percentile < 500 ms
    http_req_failed:   ['rate<0.01'],  // error rate < 1%
  },
};

export default function () {
  const res = http.get('https://api.example.com/orders');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}

Reference Guide

Load detailed guidance based on context:

Topic	Reference	Load When
Logging	`references/structured-logging.md`	Pino, JSON logging
Metrics	`references/prometheus-metrics.md`	Counter, Histogram, Gauge
Tracing	`references/opentelemetry.md`	OpenTelemetry, spans
Alerting	`references/alerting-rules.md`	Prometheus alerts
Dashboards	`references/dashboards.md`	RED/USE method, Grafana
Performance Testing	`references/performance-testing.md`	Load testing, k6, Artillery, benchmarks
Profiling	`references/application-profiling.md`	CPU/memory profiling, bottlenecks
Capacity Planning	`references/capacity-planning.md`	Scaling, forecasting, budgets

Constraints

MUST DO

Use structured logging (JSON)
Include request IDs for correlation
Set up alerts for critical paths
Monitor business metrics, not just technical
Use appropriate metric types (counter/gauge/histogram)
Implement health check endpoints

MUST NOT DO

Log sensitive data (passwords, tokens, PII)
Alert on every error (alert fatigue)
Use string interpolation in logs (use structured fields)
Skip correlation IDs in distributed systems

Documentation