**Purpose**: Set up monitoring, alerting, and observability to detect incidents early.
Sets up monitoring, alerting, and observability for applications and infrastructure. Helps implement metrics, logging, and tracing with Prometheus, Grafana, and ELK.
/plugin marketplace add anton-abyzov/specweave
/plugin install sw-infra@specweave
Tools: Prometheus (collection, storage, alerting), Grafana (dashboards); see the comparison table at the end of this guide for alternatives.
What to Monitor:
Application Metrics:
http_requests_total # Total requests
http_request_duration_seconds # Response time (histogram)
http_requests_errors_total # Error count
http_requests_in_flight # Concurrent requests
Infrastructure Metrics:
node_cpu_seconds_total # CPU usage
node_memory_MemAvailable_bytes # Memory available (usage = total - available)
node_filesystem_avail_bytes # Disk space available
node_network_receive_bytes_total # Network in
Database Metrics:
pg_stat_database_tup_returned # Rows returned
pg_stat_database_tup_fetched # Rows fetched
pg_stat_database_deadlocks # Deadlock count
pg_stat_activity_count # Active connections
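These names map onto Prometheus metric types: running totals are counters, concurrent requests are a gauge, and durations are histograms. A minimal prom-client sketch for the counter and gauge shapes (label names here are illustrative; the histogram is shown in the setup section later):

```js
const client = require('prom-client');

// Counter: monotonically increasing total (http_requests_total, http_requests_errors_total)
const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'status'],
});

// Gauge: a value that can go up and down (http_requests_in_flight)
const httpRequestsInFlight = new client.Gauge({
  name: 'http_requests_in_flight',
  help: 'Concurrent HTTP requests',
});

// Typical usage around a request:
httpRequestsInFlight.inc();                            // request started
httpRequestsTotal.inc({ method: 'GET', status: 200 }); // request finished
httpRequestsInFlight.dec();
```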
What to Log: errors with stack traces, authentication events (logins, failures), key business events, and calls to external services
Log Levels: ERROR (failures needing attention), WARN (unexpected but recoverable), INFO (normal operations), DEBUG (diagnostic detail, disabled in production)
Tools: ELK stack (Elasticsearch, Logstash, Kibana) for aggregation and search; Winston (Node.js) for emitting structured logs
BAD (unstructured):
console.log("User logged in: " + userId);
GOOD (structured JSON):
logger.info("User logged in", {
  userId: 123,
  ip: "192.168.1.1",
  timestamp: "2025-10-26T12:00:00Z",
  userAgent: "Mozilla/5.0...",
});
// Output:
// {"level":"info","message":"User logged in","userId":123,"ip":"192.168.1.1",...}
Benefits: machine-parsable, searchable by field (e.g. all events for userId=123), easy to aggregate, graph, and alert on
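To produce that structured output in Node.js, one option is winston with its JSON format (winston also appears in the log-shipping setup later in this guide); a minimal sketch:

```js
const winston = require('winston');

// JSON format emits one machine-parsable object per line, as in the GOOD example above
const logger = winston.createLogger({
  level: 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json(),
  ),
  transports: [new winston.transports.Console()],
});

logger.info('User logged in', { userId: 123, ip: '192.168.1.1' });
// {"level":"info","message":"User logged in","userId":123,"ip":"192.168.1.1","timestamp":"..."}
```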
Purpose: Track request flow through distributed systems
Example:
User Request → API Gateway → Auth Service → Payment Service → Database
     1ms          2ms           50ms            100ms          30ms
                                                  ↑ SLOW SPAN
Tools: Jaeger (collection and UI; see the comparison table), typically instrumented via OpenTelemetry
When to Use: distributed systems (microservices), where a single request crosses several services and no single service's logs explain end-to-end latency; a setup sketch follows below.
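A minimal instrumentation sketch using the OpenTelemetry Node SDK; the Jaeger hostname, port 4318 (Jaeger's OTLP/HTTP receiver), and the service name are assumptions for illustration:

```js
// tracing.js - load this before the rest of the app so http/express get patched
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  serviceName: 'api-gateway', // assumption: name of this service in traces
  traceExporter: new OTLPTraceExporter({
    url: 'http://jaeger:4318/v1/traces', // assumption: Jaeger with OTLP receiver enabled
  }),
  instrumentations: [getNodeAutoInstrumentations()], // auto-traces http, express, pg, redis, ...
});

sdk.start();
```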
BAD (cause-based): "CPU usage > 90%" (a busy CPU does not necessarily affect users)
GOOD (symptom-based): "p95 latency > 1s" or "error rate > 1%" (users are measurably affected)
P1 (SEV1) - Page On-Call: service down, data loss, security incident
P2 (SEV2) - Notify During Business Hours: degraded performance, elevated error rate
P3 (SEV3) - Email/Slack: minor issues, capacity warnings, non-urgent anomalies
Rules: make every alert actionable; include severity, impact, and a runbook link (see the example below); alert on symptoms, not causes; tune thresholds to avoid alert fatigue.
Example Bad Alert:
Subject: Alert
Body: Server is down
Example Good Alert:
Subject: [P1] API Server Down - Production
Body:
- Service: api.example.com
- Issue: Health check failing for 5 minutes
- Impact: All users affected (100%)
- Runbook: https://wiki.example.com/runbook/api-down
- Dashboard: https://grafana.example.com/d/api
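One way to keep alerts in that shape is to build them through a helper that refuses to send anything without the required fields; formatAlert below is a hypothetical function for illustration, not part of any alerting tool:

```js
// Hypothetical helper that enforces the alert fields shown above.
function formatAlert({ severity, service, issue, impact, runbook, dashboard }) {
  // Every alert must be actionable: refuse to send one without a runbook.
  if (!runbook) throw new Error('Alert rejected: runbook link is required');
  return {
    subject: `[${severity}] ${issue} - ${service}`,
    body: [
      `- Service: ${service}`,
      `- Issue: ${issue}`,
      `- Impact: ${impact}`,
      `- Runbook: ${runbook}`,
      `- Dashboard: ${dashboard}`,
    ].join('\n'),
  };
}

const alert = formatAlert({
  severity: 'P1',
  service: 'api.example.com',
  issue: 'Health check failing for 5 minutes',
  impact: 'All users affected (100%)',
  runbook: 'https://wiki.example.com/runbook/api-down',
  dashboard: 'https://grafana.example.com/d/api',
});
```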
Install Prometheus Client (Node.js):
const express = require('express');
const client = require('prom-client');

const app = express();

// Enable default process metrics (CPU, memory, event loop lag, etc.)
client.collectDefaultMetrics();

// Custom histogram for request duration
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status'],
});

// Instrument every request
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    // req.route is only set for matched routes; fall back to the raw path for 404s
    const route = req.route ? req.route.path : req.path;
    end({ method: req.method, route, status: res.statusCode });
  });
  next();
});

// Expose metrics endpoint (register.metrics() returns a Promise in prom-client v13+)
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
Prometheus Config (prometheus.yml):
scrape_configs:
  - job_name: 'api-server'
    static_configs:
      - targets: ['localhost:3000']
    scrape_interval: 15s
Application (send logs to Logstash):
const winston = require('winston');
const LogstashTransport = require('winston-logstash-transport').LogstashTransport;
const logger = winston.createLogger({
  transports: [
    new LogstashTransport({
      host: 'logstash.example.com',
      port: 5000,
    }),
  ],
});
logger.info('User logged in', { userId: 123, ip: '192.168.1.1' });
Logstash Config:
input {
  tcp {
    port => 5000
    codec => json
  }
}
output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "application-logs-%{+YYYY.MM.dd}"
  }
}
Purpose: Check whether a service is healthy and ready to serve traffic
Types: liveness probe (is the process alive? the orchestrator restarts it if not) and readiness probe (can it serve traffic? it is removed from load balancing until ready)
Example (Express.js):
// Liveness probe: is the process alive and responding?
app.get('/healthz', (req, res) => {
  res.status(200).send('OK');
});

// Readiness probe: can this instance serve traffic? Check its dependencies.
app.get('/ready', async (req, res) => {
  try {
    // Check database
    await db.query('SELECT 1');
    // Check Redis
    await redis.ping();
    // Check external API (fetch only rejects on network errors, so verify the status)
    const response = await fetch('https://api.external.com/health');
    if (!response.ok) throw new Error(`External API unhealthy: ${response.status}`);
    res.status(200).send('Ready');
  } catch (error) {
    res.status(503).send('Not ready');
  }
});
Kubernetes:
livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 5
SLI (Service Level Indicator): the metric you measure, e.g. availability or latency
SLO (Service Level Objective): the internal target the SLI must meet
SLA (Service Level Agreement): the external, contractual commitment, typically with penalties for missing it
Example:
SLI: Availability = (successful requests / total requests) * 100
SLO: Availability must be ≥99.9% per month
SLA: If availability <99.9%, users get 10% refund
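As a quick sanity check on what 99.9% means in practice, the error budget works out to roughly 43 minutes of downtime per month (the sketch assumes a 30-day month):

```js
// Error budget implied by a 99.9% availability SLO over a 30-day month
const slo = 0.999;
const monthMinutes = 30 * 24 * 60;              // 43,200 minutes in the month
const errorBudgetMin = (1 - slo) * monthMinutes;
console.log(`Allowed downtime: ${errorBudgetMin.toFixed(1)} minutes/month`); // ≈ 43.2
```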
Application: instrument rate, errors, and duration; expose a /metrics endpoint
Infrastructure: collect CPU, memory, disk, and network metrics (node_exporter)
Database: track connections, deadlocks, and row throughput
Alerts: symptom-based, severity-tagged (P1-P3), each with a runbook link
Dashboards: RED method for services, USE method for resources (both below)
RED Method (for request-driven services):
Rate: requests per second
Errors: error rate (%)
Duration: response time (p50, p95, p99)
Dashboard:
+-----------------+  +-----------------+  +-----------------+
|      Rate       |  |     Errors      |  |    Duration     |
|   1000 req/s    |  |      0.5%       |  |   p95: 250ms    |
+-----------------+  +-----------------+  +-----------------+
USE Method (for resources):
Utilization: % of resource used (CPU, memory, disk)
Saturation: queue depth, backlog
Errors: error count
Dashboard:
CPU: 70% utilization, 0.5 load average, 0 errors
Memory: 80% utilization, 0 swap, 0 OOM kills
Disk: 60% utilization, 5ms latency, 0 I/O errors
| Tool | Type | Best For | Cost |
|---|---|---|---|
| Prometheus + Grafana | Metrics | Self-hosted, cost-effective | Free |
| DataDog | Metrics, Logs, APM | All-in-one, easy setup | $15/host/month |
| New Relic | APM | Application performance | $99/user/month |
| ELK Stack | Logs | Log aggregation | Free (self-hosted) |
| Splunk | Logs | Enterprise log analysis | $1800/GB/year |
| Jaeger | Traces | Distributed tracing | Free |
| CloudWatch | Metrics, Logs | AWS-native | $0.30/metric/month |
| Azure Monitor | Metrics, Logs | Azure-native | $0.25/metric/month |