Analyze system and application log data from sosreport archives, extracting error patterns, kernel panics, OOM events, service failures, and application crashes from journald logs and traditional log files within the sosreport directory structure, in order to identify the root causes of system failures and issues
/plugin marketplace add openshift-eng/ai-helpers
/plugin install sosreport@ai-helpers
This skill inherits all available tools. When active, it can use any tool Claude has access to.
This skill provides detailed guidance for analyzing logs from sosreport archives, including journald logs, system logs, kernel messages, and application logs.
Use this skill when:
- Running the /sosreport:analyze command's log analysis phase
Sosreports contain logs in several locations:
Journald logs: sos_commands/logs/journalctl_*
- journalctl_--no-pager_--boot - Current boot logs
- journalctl_--no-pager - All available logs
- journalctl_--no-pager_--priority_err - Error priority logs
Traditional system logs: var/log/
- messages - System-level messages
- dmesg - Kernel ring buffer
- secure - Authentication and security logs
- cron - Cron job logs
Application logs: var/log/ (varies by application)
- httpd/ - Apache logs
- nginx/ - Nginx logs
- audit/audit.log - SELinux audit logs
Check for journald logs:
ls -la sos_commands/logs/journalctl_* 2>/dev/null || echo "No journald logs found"
Check for traditional system logs:
ls -la var/log/{messages,dmesg,secure} 2>/dev/null || echo "No traditional logs found"
Identify application-specific logs:
find var/log/ -type f -name "*.log" 2>/dev/null | head -20
Parse journalctl output for error patterns:
# Look for common error indicators
grep -iE "(error|failed|failure|critical|panic|segfault|oom)" sos_commands/logs/journalctl_--no-pager 2>/dev/null | head -100
Identify OOM (Out of Memory) killer events:
grep -i "out of memory\|oom.*kill" sos_commands/logs/journalctl_--no-pager 2>/dev/null
Find kernel panics:
grep -i "kernel panic\|bug:\|oops:" sos_commands/logs/journalctl_--no-pager 2>/dev/null
Check for segmentation faults:
grep -i "segfault\|sigsegv\|core dump" sos_commands/logs/journalctl_--no-pager 2>/dev/null
Extract service failures:
grep -i "failed to start\|failed with result" sos_commands/logs/journalctl_--no-pager 2>/dev/null
Check messages for errors:
# If file exists and is readable
if [ -f var/log/messages ]; then
grep -iE "(error|failed|failure|critical)" var/log/messages | tail -100
fi
Check dmesg for hardware issues:
if [ -f var/log/dmesg ]; then
grep -iE "(error|fail|warning|i/o error|bad sector)" var/log/dmesg
fi
Analyze authentication logs:
if [ -f var/log/secure ]; then
grep -iE "(failed|failure|invalid|denied)" var/log/secure | tail -50
fi
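If brute-force SSH activity is suspected, failed logins can be aggregated by source address; this sketch assumes sshd's standard "Failed password ... from <ip>" wording:
# Top sources of failed SSH logins (sshd message format is an assumption)
if [ -f var/log/secure ]; then
  grep "Failed password" var/log/secure | grep -o "from [0-9.]*" | \
    sort | uniq -c | sort -rn | head -10
fi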
Count errors by severity:
# Critical errors
grep -ic "critical\|panic\|fatal" sos_commands/logs/journalctl_--no-pager 2>/dev/null || echo "0"
# Errors
grep -ic "error" sos_commands/logs/journalctl_--no-pager 2>/dev/null || echo "0"
# Warnings
grep -ic "warning\|warn" sos_commands/logs/journalctl_--no-pager 2>/dev/null || echo "0"
Find most frequent error messages:
grep -iE "(error|failed)" sos_commands/logs/journalctl_--no-pager 2>/dev/null | \
sed 's/^.*\]: //' | \
sort | uniq -c | sort -rn | head -10
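To attribute errors to components for the summary, the syslog identifier can be counted as well; this sketch assumes journalctl's default short output, where the identifier is field 5:
# Count errors per component, e.g. "kernel" or "systemd" (field position is an assumption)
grep -iE "(error|failed)" sos_commands/logs/journalctl_--no-pager 2>/dev/null | \
  awk '{print $5}' | sed 's/\[.*//; s/:$//' | sort | uniq -c | sort -rn | head -10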
Extract timestamps for error timeline:
# Get first and last error timestamps
grep -i "error" sos_commands/logs/journalctl_--no-pager 2>/dev/null | \
head -1 | awk '{print $1, $2, $3}'
grep -i "error" sos_commands/logs/journalctl_--no-pager 2>/dev/null | \
tail -1 | awk '{print $1, $2, $3}'
Identify application logs:
find var/log/ -type f \( -name "*.log" -o -name "*_log" \) 2>/dev/null
Check for stack traces and exceptions:
# Python tracebacks
grep -A 10 "Traceback (most recent call last)" var/log/*.log 2>/dev/null | head -50
# Java exceptions
grep -B 2 -A 10 "Exception\|Error:" var/log/*.log 2>/dev/null | head -50
Look for common application errors:
# Database connection errors
grep -i "connection.*refused\|connection.*timeout\|database.*error" var/log/*.log 2>/dev/null
# HTTP/API errors
grep -E "HTTP [45][0-9]{2}|status.*[45][0-9]{2}" var/log/*.log 2>/dev/null | head -20
Create a structured summary with the following information (a minimal assembly sketch follows this list):
- Error Statistics
- Critical Findings
- Top Error Messages (sorted by frequency)
- Application-Specific Issues
- Log File Locations
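A minimal sketch of how the severity counts could be gathered for the summary header (the variable names below are illustrative, not part of the skill):
# Illustrative only: collect severity counts for the summary header
LOG=sos_commands/logs/journalctl_--no-pager
CRIT=$(grep -ic "critical\|panic\|fatal" "$LOG" 2>/dev/null); CRIT=${CRIT:-0}
ERRS=$(grep -ic "error" "$LOG" 2>/dev/null); ERRS=${ERRS:-0}
WARN=$(grep -ic "warning\|warn" "$LOG" 2>/dev/null); WARN=${WARN:-0}
printf 'ERROR STATISTICS\nCritical: %s\nErrors: %s\nWarnings: %s\n' "$CRIT" "$ERRS" "$WARN"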
Handle these edge cases during analysis (see the sketch after this list):
- Missing log files: not every log is present in every sosreport; report which expected files are absent rather than treating a missing file as an error.
- Large log files: use head -n 10000 and tail -n 10000 to avoid memory issues.
- Compressed logs: check for .gz files in var/log/ and use zgrep instead of grep for compressed files, e.g. zgrep -i "error" var/log/messages*.gz.
- Binary log formats: prefer the sos_commands/logs/journalctl_* text outputs over binary journal files.
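As a sketch of how rotated copies can be covered in one pass (zgrep reads both plain and gzip-compressed files):
# Count error lines across messages and its rotated/compressed copies
for f in var/log/messages*; do
  [ -f "$f" ] || continue
  printf '%s: %s error lines\n' "$f" "$(zgrep -ic error "$f" 2>/dev/null)"
done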
The log analysis should produce:
LOG ANALYSIS SUMMARY
====================
Time Range: {first_log_entry} to {last_log_entry}
ERROR STATISTICS
----------------
Critical: {count}
Errors: {count}
Warnings: {count}
CRITICAL FINDINGS
-----------------
Kernel Panics: {count}
- {timestamp}: {panic_message}
OOM Killer Events: {count}
- {timestamp}: Killed {process_name} (PID: {pid})
Segmentation Faults: {count}
- {timestamp}: {process_name} segfaulted
Service Failures: {count}
- {service_name}: {failure_reason}
TOP ERROR MESSAGES
------------------
1. [{count}x] {error_message}
First seen: {timestamp}
Component: {component}
2. [{count}x] {error_message}
First seen: {timestamp}
Component: {component}
APPLICATION ERRORS
------------------
Stack Traces: {count} found in {log_files}
Database Errors: {count}
Network Errors: {count}
Auth Failures: {count}
LOG FILES FOR INVESTIGATION
---------------------------
- Primary: {sosreport_path}/sos_commands/logs/journalctl_--no-pager
- System: {sosreport_path}/var/log/messages
- Kernel: {sosreport_path}/var/log/dmesg
- Security: {sosreport_path}/var/log/secure
- Application: {sosreport_path}/var/log/{app_specific}
RECOMMENDATIONS
---------------
1. {actionable_recommendation_based_on_findings}
2. {actionable_recommendation_based_on_findings}
# Detect OOM events
grep -B 5 -A 15 "Out of memory" sos_commands/logs/journalctl_--no-pager
# Output interpretation:
# - Which process was killed
# - Memory state at the time
# - What triggered the OOM
# Find failed services
grep "failed to start\|Failed with result" sos_commands/logs/journalctl_--no-pager | \
awk -F'[][]' '{print $2}' | sort | uniq -c | sort -rn
# This shows which services failed most frequently
# Create error timeline
grep -i "error\|fail" sos_commands/logs/journalctl_--no-pager | \
awk '{print $1, $2, $3}' | sort | uniq -c
# Shows error frequency over time
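Because journalctl timestamps are second-granular, most counts from the command above will be 1; bucketing by hour gives a more readable distribution. This variant assumes journalctl's default "Mon DD HH:MM:SS" short format:
# Bucket errors by hour instead of by second (timestamp format is an assumption)
grep -i "error\|fail" sos_commands/logs/journalctl_--no-pager | \
  awk '{split($3, t, ":"); print $1, $2, t[1]":00"}' | sort | uniq -c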