- Service not responding
Diagnose and resolve production service outages using systematic troubleshooting. Follow guided steps to check service status, analyze logs, identify root causes like OOM crashes or configuration errors, and apply immediate fixes to restore service.
/plugin marketplace add anton-abyzov/specweave
/plugin install sw-infra@specweave
# Check if service is running (systemd)
systemctl status nginx
systemctl status application
systemctl status postgresql
# Check process
ps aux | grep nginx
pidof nginx
# Example output:
# nginx.service - nginx web server
# Active: inactive (dead) ← SERVICE IS DOWN
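When several units need checking at once, the status commands above can be wrapped in a small loop. This is a minimal sketch, assuming a systemd host; `report_down_services` is a hypothetical helper name, not part of systemd.

```bash
# Sketch: print every unit that systemd does not report as "active".
# report_down_services is a hypothetical helper name.
report_down_services() {
  local svc state down=0
  for svc in "$@"; do
    # is-active prints the unit state (active, inactive, failed, ...)
    # and exits non-zero when the unit is not active.
    state=$(systemctl is-active "$svc" 2>/dev/null) || true
    if [ "$state" != "active" ]; then
      echo "DOWN: $svc (${state:-unknown})"
      down=$((down + 1))
    fi
  done
  # Succeed only when every unit was active.
  [ "$down" -eq 0 ]
}

# Usage: report_down_services nginx application postgresql
```

The function returns non-zero when at least one unit is down, so it can gate later steps in a script.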
Check Service Logs (systemd):
# Last 50 lines of service logs
journalctl -u nginx -n 50
# Tail logs in real-time
journalctl -u nginx -f
# Look for:
# - Exit code (0 = normal, non-zero = error)
# - Error messages
# - Crash reason
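Exit codes above 128 usually mean the process died from a signal (code minus 128) when a shell or container runtime reports the status; systemd itself reports direct signal deaths as `code=killed, signal=...`. The sketch below decodes that convention. `explain_exit_code` is a hypothetical helper, and only a few common signals are named.

```bash
# explain_exit_code is a hypothetical helper. It applies the shell convention
# that a status above 128 means "terminated by signal (status - 128)".
explain_exit_code() {
  local code="$1" sig
  if [ "$code" -eq 0 ]; then
    echo "clean exit"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      6)  echo "terminated by SIGABRT" ;;
      9)  echo "terminated by SIGKILL" ;;
      11) echo "terminated by SIGSEGV" ;;
      15) echo "terminated by SIGTERM" ;;
      *)  echo "terminated by signal $sig" ;;
    esac
  else
    echo "application error (exit $code)"
  fi
}

# Usage: explain_exit_code 137
```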
Check Application Logs:
# Check application error log
tail -100 /var/log/application/error.log
# Look for:
# - Exception/error before crash
# - Stack trace
# - "Fatal error", "Segmentation fault"
Check System Logs:
# Check for OOM (Out of Memory) killer
dmesg | grep -i "out of memory\|oom\|killed process"
# Example:
# Out of memory: Killed process 1234 (node) total-vm:8GB
# ↑ OOM Killer terminated application
# Check kernel errors
dmesg | tail -50
# Check syslog
grep -i "error\|segfault" /var/log/syslog # /var/log/messages on RHEL-family systems
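To see which processes the OOM killer terminated, the kernel log can be filtered down to PID and process name. A minimal sketch, assuming the `Killed process <pid> (<name>)` message format shown in the example above; `oom_victims` is a hypothetical helper name.

```bash
# oom_victims is a hypothetical helper. It reads kernel log text on stdin
# and prints "PID NAME" for each process the OOM killer terminated.
oom_victims() {
  sed -n 's/.*[Kk]illed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}

# Usage: dmesg | oom_victims
```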
Common causes:
| Symptom | Root Cause |
|---|---|
| "Out of memory" in dmesg | OOM Killer (memory leak, insufficient memory) |
| "Segmentation fault" | Application bug (crash) |
| "Address already in use" | Port already bound |
| "Connection refused" to database | Database down |
| "No such file or directory" | Missing config file |
| "Permission denied" | Wrong file permissions |
| Exit code 137 | Killed by SIGKILL (128 + 9), typically the OOM Killer |
| Exit code 139 | Segmentation fault (SIGSEGV, 128 + 11) |
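The table above can be turned into a quick lookup for a single log line. This is a sketch, not a complete classifier: `classify_symptom` is a hypothetical helper, it treats every "connection refused" as a database problem (as the table does), and anything it does not recognize is reported as unknown.

```bash
# classify_symptom is a hypothetical helper that maps one log line to the
# likely root cause from the table. Matching is case-insensitive.
classify_symptom() {
  local line
  line=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$line" in
    *"out of memory"*)                 echo "OOM Killer (memory leak, insufficient memory)" ;;
    *"segmentation fault"*|*segfault*) echo "Application bug (crash)" ;;
    *"address already in use"*)        echo "Port already bound" ;;
    *"connection refused"*)            echo "Database down" ;;
    *"no such file or directory"*)     echo "Missing config file" ;;
    *"permission denied"*)             echo "Wrong file permissions" ;;
    *)                                 echo "Unknown"; return 1 ;;
  esac
}

# Usage: classify_symptom "$(journalctl -u nginx -n 1 --no-pager)"
```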
Option A: Restart Service
# Restart service
systemctl restart nginx
# Check if started successfully
systemctl status nginx
# Test endpoint
curl http://localhost
# Impact: Service restored
# Risk: Low (if root cause not addressed, may crash again)
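A service can take a few seconds to accept connections after a restart, so a single `curl` may fail even though the restart worked. The sketch below polls the endpoint a few times. `wait_for_http` is a hypothetical helper, and the attempt count, delay, and timeout are assumed defaults.

```bash
# wait_for_http is a hypothetical helper: poll a URL until it answers with a
# successful HTTP status or the attempts run out.
wait_for_http() {
  local url="$1" attempts="${2:-5}" delay="${3:-2}" i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: treat HTTP >= 400 as failure, -s: quiet, --max-time: cap each request
    if curl -fs -o /dev/null --max-time 3 "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "still unhealthy after $attempts attempts" >&2
  return 1
}

# Usage: systemctl restart nginx && wait_for_http http://localhost 5 2
```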
Option B: Fix Configuration Error (if config issue)
# Test configuration
nginx -t # nginx
postgres -C config_file -D <data-dir> # postgres: parses postgresql.conf and fails on errors (run as the postgres user)
# If config error, check recent changes (assumes /etc/nginx is tracked in git, e.g. via etckeeper)
git -C /etc/nginx diff HEAD~1 -- nginx.conf
# Revert to working config
git -C /etc/nginx checkout HEAD~1 -- nginx.conf
# Restart
systemctl restart nginx
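Restarting with a broken config takes the service down again, so it helps to gate the restart on the config test. A minimal sketch for nginx; `restart_if_config_ok` is a hypothetical helper name.

```bash
# Sketch: restart nginx only when the configuration test passes.
# restart_if_config_ok is a hypothetical helper name.
restart_if_config_ok() {
  # nginx -t exits non-zero and prints the offending file and line on errors.
  if nginx -t; then
    systemctl restart nginx && echo "restarted"
  else
    echo "config test failed; not restarting" >&2
    return 1
  fi
}

# Usage: restart_if_config_ok
```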
Option C: Free Up Resources (if OOM)
# Check memory usage
free -h
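# Kill memory-heavy processes (non-critical); try SIGTERM first
kill <PID>
# Force-kill only if the process ignores SIGTERM
kill -9 <PID>
# Drop page cache, dentries and inodes (as root; cache is reclaimable, so this rarely prevents an OOM kill)
sync && echo 3 > /proc/sys/vm/drop_caches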
# Restart service
systemctl restart application
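Before killing anything, it is worth quantifying how little memory is actually left. The sketch below reads the `MemTotal` and `MemAvailable` fields of `/proc/meminfo` and prints the available share as a percentage. `mem_available_pct` is a hypothetical helper name.

```bash
# mem_available_pct is a hypothetical helper. It reads a meminfo-format file
# (default /proc/meminfo) and prints MemAvailable as a whole-number
# percentage of MemTotal.
mem_available_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", avail * 100 / total; else exit 1 }' \
      "${1:-/proc/meminfo}"
}

# Usage:
#   mem_available_pct
#   ps -eo pid,rss,comm --sort=-rss | head -n 6   # largest resident processes (procps ps)
```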
Option D: Free the Port (if port conflict)
# Check what's using port
lsof -i :80
# Example:
# apache2 1234 root 4u IPv4 12345 0t0 TCP *:80 (LISTEN)
# ↑ Apache using port 80
# Stop conflicting service
systemctl stop apache2
# Start intended service
systemctl start nginx
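On hosts without `lsof`, the same check can be done with `ss`. This sketch filters `ss -tlnp` output for listeners on one exact port, assuming the usual column layout (state first, local address fourth). `listeners_on_port` is a hypothetical helper name.

```bash
# listeners_on_port is a hypothetical helper. It filters `ss -tlnp` style
# output on stdin and prints the LISTEN lines whose local address ends in :PORT.
listeners_on_port() {
  awk -v port="$1" '$1 == "LISTEN" && $4 ~ (":" port "$")'
}

# Usage: ss -tlnp | listeners_on_port 80
```

Anchoring the match at the end of the address keeps port 80 from also matching 8080.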
Option A: Fix Crash Bug (if application bug)
# Check stack trace in logs
tail -100 /var/log/application/error.log
# Identify line causing crash
# Example: TypeError at paymentService.js:42
# Deploy hotfix OR revert to previous version
git checkout <previous-working-commit>
npm run build && pm2 restart all
# Impact: Bug fixed, service stable
# Risk: Medium (need proper testing)
Option B: Increase Memory (if OOM)
# Short-term: Increase swap (2 GB swap file)
dd if=/dev/zero of=/swapfile bs=1M count=2048
chmod 600 /swapfile # swap files must not be world-readable
mkswap /swapfile
swapon /swapfile
# Long-term: Resize instance
# AWS: Change instance type (t3.medium → t3.large)
# Azure: Resize VM
# Impact: More memory available
# Risk: Medium (swap is slow, instance resize has downtime)
Option C: Enable Auto-Restart (systemd)
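# Edit service file
# /etc/systemd/system/application.service
# (systemd does not allow trailing comments, so each comment sits on its own line)
[Unit]
# Max 5 starts
StartLimitBurst=5
# In 60 seconds
StartLimitIntervalSec=60
[Service]
# Auto-restart whenever the service exits (use on-failure to skip clean exits)
Restart=always
# Wait 10s before restart
RestartSec=10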
# Reload systemd
systemctl daemon-reload
# Impact: Service auto-restarts on crash
# Risk: Low (but doesn't fix root cause)
Option D: Route Traffic to Backup (if multi-instance)
# If using load balancer:
# 1. Remove failed instance from LB
# 2. Traffic goes to healthy instances
# AWS:
aws elbv2 deregister-targets \
--target-group-arn <arn> \
--targets Id=i-1234567890abcdef0
# Impact: Users see working instance
# Risk: Low (other instances handle load)
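Deregistration is not instant: the load balancer drains existing connections first. The sketch below deregisters a target and then blocks until the AWS CLI waiter reports it as gone. `drain_instance` is a hypothetical wrapper, and the ARN and instance ID are placeholders to supply.

```bash
# drain_instance is a hypothetical wrapper: deregister a target, then block
# until the load balancer reports it as fully deregistered.
drain_instance() {
  local tg_arn="$1" instance_id="$2"
  aws elbv2 deregister-targets \
    --target-group-arn "$tg_arn" --targets "Id=$instance_id" || return 1
  # The waiter polls target health until the target is no longer registered;
  # draining can take as long as the target group's deregistration delay.
  aws elbv2 wait target-deregistered \
    --target-group-arn "$tg_arn" --targets "Id=$instance_id"
}

# Usage: drain_instance <arn> i-1234567890abcdef0
```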
For each incident, determine:
Escalate to developer if:
Escalate to platform team if:
Escalate to on-call manager if:
After resolving:
# Service status
systemctl status <service>
systemctl restart <service>
journalctl -u <service> -n 50
# Process check
ps aux | grep <process>
pidof <process>
# Check OOM
dmesg | grep -i "out of memory\|oom"
# Check port usage
lsof -i :<port>
netstat -tlnp | grep <port>
# Test config
nginx -t
postgres -C config_file -D <data-dir> # parses postgresql.conf and fails on errors
# Health check
curl http://localhost/health