From harness-claude
Identifies performance bottlenecks using OpenTelemetry traces, histogram metrics, and span timing patterns. Debugs slow API responses, profiles DB queries, detects N+1 issues, and monitors SLOs like p99 latency.
npx claudepluginhub intense-visions/harness-engineering --plugin harness-claude

This skill uses the workspace's default tool permissions.
> Identify performance bottlenecks using trace analysis, histogram metrics, and span timing patterns
```typescript
// Performance-instrumented service
import { trace, metrics } from '@opentelemetry/api';
import type { Request, Response, NextFunction } from 'express';
import type { ReadableSpan } from '@opentelemetry/sdk-trace-base';
import { db } from './db'; // assumed: the app's database client exposing query()
const tracer = trace.getTracer('order-service');
const meter = metrics.getMeter('order-service');
const requestDuration = meter.createHistogram('http.server.request.duration', {
description: 'HTTP request duration',
unit: 'ms',
});
const dbQueryDuration = meter.createHistogram('db.query.duration', {
description: 'Database query duration',
unit: 'ms',
});
const dbQueryCounter = meter.createCounter('db.query.count', {
description: 'Number of database queries per request',
unit: '1',
});
// Middleware that tracks request performance
export async function performanceMiddleware(req: Request, res: Response, next: NextFunction) {
const start = performance.now();
const parentSpan = trace.getActiveSpan();
// Track DB query count for N+1 detection by monkey-patching db.query for this request
// (illustrative only: patching a shared client is not safe under concurrent requests)
let queryCount = 0;
const originalQuery = db.query;
db.query = async (...args: any[]) => {
queryCount++;
const qStart = performance.now();
try {
return await originalQuery.apply(db, args);
} finally {
dbQueryDuration.record(performance.now() - qStart, {
'db.operation': args[0]?.split(' ')[0] || 'unknown',
});
}
};
res.on('finish', () => {
const duration = performance.now() - start;
const route = req.route?.path || req.path;
requestDuration.record(duration, {
'http.method': req.method,
'http.route': route,
'http.status_code': res.statusCode,
});
dbQueryCounter.add(queryCount, { 'http.route': route });
// Flag potential N+1 queries
if (queryCount > 10 && parentSpan) {
parentSpan.addEvent('performance.warning', {
'warning.type': 'n_plus_one',
'db.query_count': queryCount,
'http.route': route,
});
}
db.query = originalQuery;
});
next();
}
// SLO monitoring with histograms
const sloLatencyTarget = 500; // 500ms target
// In dashboard queries (PromQL example):
// Latency headroom vs. target: 1 - (histogram_quantile(0.99, rate(http_server_request_duration_bucket[5m])) / 500)
// SLO compliance: sum(rate(http_server_request_duration_bucket{le="500"}[5m])) / sum(rate(http_server_request_duration_count[5m]))
// Waterfall analysis helper — log span timing breakdown
function analyzeTrace(spans: ReadableSpan[]): void {
const root = spans.find((s) => !s.parentSpanId);
if (!root) return;
const totalMs = root.duration[0] * 1000 + root.duration[1] / 1e6;
const breakdown = spans
.filter((s) => s !== root)
.map((s) => ({
name: s.name,
duration: s.duration[0] * 1000 + s.duration[1] / 1e6,
percentage: (((s.duration[0] * 1000 + s.duration[1] / 1e6) / totalMs) * 100).toFixed(1),
}))
.sort((a, b) => b.duration - a.duration);
console.table(breakdown);
// Shows which spans consume the most time in the request
}
```
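One way to feed analyzeTrace during local debugging is an in-memory exporter; a sketch, assuming the 1.x Node tracing SDK (in production you would ship spans to Jaeger/Tempo and read the waterfall there instead):

```typescript
// Dev-only wiring: collect finished spans in memory, group them by trace ID,
// and print a waterfall breakdown for each trace with analyzeTrace().
import {
  NodeTracerProvider,
  SimpleSpanProcessor,
  InMemorySpanExporter,
} from '@opentelemetry/sdk-trace-node';
import type { ReadableSpan } from '@opentelemetry/sdk-trace-base';

const exporter = new InMemorySpanExporter();
const provider = new NodeTracerProvider();
provider.addSpanProcessor(new SimpleSpanProcessor(exporter)); // 1.x API; 2.x passes spanProcessors to the constructor
provider.register();

// After exercising a few requests:
const byTrace = new Map<string, ReadableSpan[]>();
for (const span of exporter.getFinishedSpans()) {
  const group = byTrace.get(span.spanContext().traceId) ?? [];
  group.push(span);
  byTrace.set(span.spanContext().traceId, group);
}
for (const spans of byTrace.values()) analyzeTrace(spans);
```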
Key performance metrics to track:
| Metric | Type | Purpose |
|---|---|---|
| http.server.request.duration | Histogram | Overall API latency |
| db.query.duration | Histogram | Database performance |
| http.client.request.duration | Histogram | Outgoing call latency |
| db.query.count | Counter | N+1 query detection |
| http.server.active_requests | UpDownCounter | Concurrency tracking |
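The two instruments in the table that the middleware above does not create can be registered the same way; a sketch using the same meter (the Express app variable and the downstream payments URL are assumptions):

```typescript
// Concurrency tracking: +1 when a request starts, -1 when the response finishes.
const activeRequests = meter.createUpDownCounter('http.server.active_requests', {
  description: 'In-flight HTTP requests',
  unit: '1',
});

app.use((req: Request, res: Response, next: NextFunction) => {
  activeRequests.add(1, { 'http.route': req.path });
  res.on('finish', () => activeRequests.add(-1, { 'http.route': req.path }));
  next();
});

// Outgoing call latency: record around each downstream HTTP call.
const clientDuration = meter.createHistogram('http.client.request.duration', {
  description: 'Outgoing HTTP request duration',
  unit: 'ms',
});

async function chargePayment(payload: unknown) {
  const start = performance.now();
  try {
    return await fetch('https://payments.internal/charge', { method: 'POST', body: JSON.stringify(payload) }); // hypothetical downstream service
  } finally {
    clientDuration.record(performance.now() - start, { 'http.method': 'POST', 'server.address': 'payments.internal' });
  }
}
```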
Percentile analysis: Average latency hides outliers. Always track p50 (median), p95 (most users), and p99 (worst case):
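Dashboards get these percentiles from the exported histogram (e.g. PromQL histogram_quantile); the sketch below only illustrates the idea on an in-memory sample with made-up values:

```typescript
// Nearest-rank percentile over raw duration samples (ms) - illustration only.
// In production, query the histogram instead, e.g.:
//   histogram_quantile(0.99, sum by (le) (rate(http_server_request_duration_bucket[5m])))
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, Math.min(sorted.length - 1, rank - 1))];
}

const durations = [12, 18, 22, 25, 31, 44, 58, 120, 480, 2100]; // hypothetical sample
console.log({
  p50: percentile(durations, 50), // 31 ms - the typical request
  p95: percentile(durations, 95), // 2100 ms - what the unluckiest users see
  p99: percentile(durations, 99), // 2100 ms - tail latency
});
```

On this sample the mean is 291 ms, nowhere near either the median (31 ms) or the tail (2100 ms), which is exactly how averages hide outliers.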
Trace-based analysis pattern:
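A sketch of one such pattern with the @opentelemetry/api tracer: wrap each suspect stage in its own child span so the Jaeger/Tempo waterfall (or analyzeTrace above) shows which stage dominates; db.query and the SQL here are stand-ins:

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

async function getOrder(orderId: string) {
  return tracer.startActiveSpan('get-order', async (span) => {
    try {
      // Child span: the database load shows up as its own bar in the waterfall.
      const order = await tracer.startActiveSpan('db.load-order', async (dbSpan) => {
        try {
          return await db.query('SELECT * FROM orders WHERE id = $1', [orderId]); // stand-in query
        } finally {
          dbSpan.end();
        }
      });
      span.setAttribute('order.id', orderId);
      return order;
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```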
Common bottleneck patterns: N+1 database queries (flagged above by the per-request query counter) and sequential awaits on independent calls that could run concurrently with Promise.all (see the sketch below).

Deployment comparison: Tag spans with the deployment version (for example a service.version resource attribute) and compare p99 latency between versions to detect regressions immediately.
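A sketch of the sequential-await version of that pattern and its Promise.all fix (fetchUser, fetchOrders, and fetchRecommendations are hypothetical independent calls):

```typescript
async function loadDashboardSequential(id: string) {
  // Anti-pattern: three independent calls awaited one after another, so the
  // handler's span duration is roughly the SUM of the three latencies and the
  // trace waterfall shows the child spans laid end to end.
  const user = await fetchUser(id);
  const orders = await fetchOrders(id);
  const recs = await fetchRecommendations(id);
  return { user, orders, recs };
}

async function loadDashboardConcurrent(id: string) {
  // Fix: run the independent calls concurrently; the handler span shrinks to
  // roughly the SLOWEST call, and the child spans overlap in the waterfall.
  const [user, orders, recs] = await Promise.all([
    fetchUser(id),
    fetchOrders(id),
    fetchRecommendations(id),
  ]);
  return { user, orders, recs };
}
```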
Span events reference: https://opentelemetry.io/docs/concepts/signals/traces/#span-events