Analyze system and application log data from sosreport archives, extracting error patterns, kernel panics, OOM events, service failures, and application crashes from journald logs and traditional log files within the sosreport directory structure, in order to identify the root causes of system failures and issues
/plugin marketplace add openshift-eng/ai-helpers
/plugin install sosreport@ai-helpers
This skill inherits all available tools. When active, it can use any tool Claude has access to.
This skill provides detailed guidance for analyzing logs from sosreport archives, including journald logs, system logs, kernel messages, and application logs.
Use this skill when:
- Running the /sosreport:analyze command's log analysis phase
Sosreports contain logs in several locations:
Journald logs: sos_commands/logs/journalctl_*
- journalctl_--no-pager_--boot - Current boot logs
- journalctl_--no-pager - All available logs
- journalctl_--no-pager_--priority_err - Error priority logs
Traditional system logs: var/log/
- messages - System-level messages
- dmesg - Kernel ring buffer
- secure - Authentication and security logs
- cron - Cron job logs
Application logs: var/log/ (varies by application)
- httpd/ - Apache logs
- nginx/ - Nginx logs
- audit/audit.log - SELinux audit logs
Check for journald logs:
ls -la sos_commands/logs/journalctl_* 2>/dev/null || echo "No journald logs found"
Check for traditional system logs:
ls -la var/log/{messages,dmesg,secure} 2>/dev/null || echo "No traditional logs found"
Identify application-specific logs:
find var/log/ -type f -name "*.log" 2>/dev/null | head -20
Parse journalctl output for error patterns:
# Look for common error indicators
grep -iE "(error|failed|failure|critical|panic|segfault|oom)" sos_commands/logs/journalctl_--no-pager 2>/dev/null | head -100
Identify OOM (Out of Memory) killer events:
grep -i "out of memory\|oom.*kill" sos_commands/logs/journalctl_--no-pager 2>/dev/null
Find kernel panics:
grep -i "kernel panic\|bug:\|oops:" sos_commands/logs/journalctl_--no-pager 2>/dev/null
Check for segmentation faults:
grep -i "segfault\|sigsegv\|core dump" sos_commands/logs/journalctl_--no-pager 2>/dev/null
Extract service failures:
grep -i "failed to start\|failed with result" sos_commands/logs/journalctl_--no-pager 2>/dev/null
Check messages for errors:
# If file exists and is readable
if [ -f var/log/messages ]; then
grep -iE "(error|failed|failure|critical)" var/log/messages | tail -100
fi
Check dmesg for hardware issues:
if [ -f var/log/dmesg ]; then
grep -iE "(error|fail|warning|i/o error|bad sector)" var/log/dmesg
fi
Analyze authentication logs:
if [ -f var/log/secure ]; then
grep -iE "(failed|failure|invalid|denied)" var/log/secure | tail -50
fi
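If brute-force SSH activity is suspected, failed logins can be aggregated by source address; this sketch assumes sshd's standard "Failed password ... from <ip>" wording:
# Top sources of failed SSH logins (sshd message format is an assumption)
if [ -f var/log/secure ]; then
  grep "Failed password" var/log/secure | grep -o "from [0-9.]*" | \
    sort | uniq -c | sort -rn | head -10
fi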
Count errors by severity:
# Critical errors
grep -ic "critical\|panic\|fatal" sos_commands/logs/journalctl_--no-pager 2>/dev/null || echo "0"
# Errors
grep -ic "error" sos_commands/logs/journalctl_--no-pager 2>/dev/null || echo "0"
# Warnings
grep -ic "warning\|warn" sos_commands/logs/journalctl_--no-pager 2>/dev/null || echo "0"
Find most frequent error messages:
grep -iE "(error|failed)" sos_commands/logs/journalctl_--no-pager 2>/dev/null | \
sed 's/^.*\]: //' | \
sort | uniq -c | sort -rn | head -10
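To attribute errors to components for the summary, the syslog identifier can be counted as well; this sketch assumes journalctl's default short output, where the identifier is field 5:
# Count errors per component, e.g. "kernel" or "systemd" (field position is an assumption)
grep -iE "(error|failed)" sos_commands/logs/journalctl_--no-pager 2>/dev/null | \
  awk '{print $5}' | sed 's/\[.*//; s/:$//' | sort | uniq -c | sort -rn | head -10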
Extract timestamps for error timeline:
# Get first and last error timestamps
grep -i "error" sos_commands/logs/journalctl_--no-pager 2>/dev/null | \
head -1 | awk '{print $1, $2, $3}'
grep -i "error" sos_commands/logs/journalctl_--no-pager 2>/dev/null | \
tail -1 | awk '{print $1, $2, $3}'
Identify application logs:
find var/log/ -type f \( -name "*.log" -o -name "*_log" \) 2>/dev/null
Check for stack traces and exceptions:
# Python tracebacks
grep -A 10 "Traceback (most recent call last)" var/log/*.log 2>/dev/null | head -50
# Java exceptions
grep -B 2 -A 10 "Exception\|Error:" var/log/*.log 2>/dev/null | head -50
Look for common application errors:
# Database connection errors
grep -i "connection.*refused\|connection.*timeout\|database.*error" var/log/*.log 2>/dev/null
# HTTP/API errors
grep -E "HTTP [45][0-9]{2}|status.*[45][0-9]{2}" var/log/*.log 2>/dev/null | head -20
Create a structured summary with the following information (a minimal assembly sketch follows this list):
- Error Statistics
- Critical Findings
- Top Error Messages (sorted by frequency)
- Application-Specific Issues
- Log File Locations
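A minimal sketch of how the severity counts could be gathered for the summary header (the variable names below are illustrative, not part of the skill):
# Illustrative only: collect severity counts for the summary header
LOG=sos_commands/logs/journalctl_--no-pager
CRIT=$(grep -ic "critical\|panic\|fatal" "$LOG" 2>/dev/null); CRIT=${CRIT:-0}
ERRS=$(grep -ic "error" "$LOG" 2>/dev/null); ERRS=${ERRS:-0}
WARN=$(grep -ic "warning\|warn" "$LOG" 2>/dev/null); WARN=${WARN:-0}
printf 'ERROR STATISTICS\nCritical: %s\nErrors: %s\nWarnings: %s\n' "$CRIT" "$ERRS" "$WARN"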
Handle these edge cases during analysis (see the sketch after this list):
- Missing log files: not every log is present in every sosreport; report which expected files are absent rather than treating a missing file as an error.
- Large log files: use head -n 10000 and tail -n 10000 to avoid memory issues.
- Compressed logs: check for .gz files in var/log/ and use zgrep instead of grep for compressed files, e.g. zgrep -i "error" var/log/messages*.gz.
- Binary log formats: prefer the sos_commands/logs/journalctl_* text outputs over binary journal files.
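As a sketch of how rotated copies can be covered in one pass (zgrep reads both plain and gzip-compressed files):
# Count error lines across messages and its rotated/compressed copies
for f in var/log/messages*; do
  [ -f "$f" ] || continue
  printf '%s: %s error lines\n' "$f" "$(zgrep -ic error "$f" 2>/dev/null)"
done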
The log analysis should produce:
LOG ANALYSIS SUMMARY
====================
Time Range: {first_log_entry} to {last_log_entry}
ERROR STATISTICS
----------------
Critical: {count}
Errors: {count}
Warnings: {count}
CRITICAL FINDINGS
-----------------
Kernel Panics: {count}
- {timestamp}: {panic_message}
OOM Killer Events: {count}
- {timestamp}: Killed {process_name} (PID: {pid})
Segmentation Faults: {count}
- {timestamp}: {process_name} segfaulted
Service Failures: {count}
- {service_name}: {failure_reason}
TOP ERROR MESSAGES
------------------
1. [{count}x] {error_message}
First seen: {timestamp}
Component: {component}
2. [{count}x] {error_message}
First seen: {timestamp}
Component: {component}
APPLICATION ERRORS
------------------
Stack Traces: {count} found in {log_files}
Database Errors: {count}
Network Errors: {count}
Auth Failures: {count}
LOG FILES FOR INVESTIGATION
---------------------------
- Primary: {sosreport_path}/sos_commands/logs/journalctl_--no-pager
- System: {sosreport_path}/var/log/messages
- Kernel: {sosreport_path}/var/log/dmesg
- Security: {sosreport_path}/var/log/secure
- Application: {sosreport_path}/var/log/{app_specific}
RECOMMENDATIONS
---------------
1. {actionable_recommendation_based_on_findings}
2. {actionable_recommendation_based_on_findings}
# Detect OOM events
grep -B 5 -A 15 "Out of memory" sos_commands/logs/journalctl_--no-pager
# Output interpretation:
# - Which process was killed
# - Memory state at the time
# - What triggered the OOM
# Find failed services
grep "failed to start\|Failed with result" sos_commands/logs/journalctl_--no-pager | \
awk -F'[][]' '{print $2}' | sort | uniq -c | sort -rn
# This shows which services failed most frequently
# Create error timeline
grep -i "error\|fail" sos_commands/logs/journalctl_--no-pager | \
awk '{print $1, $2, $3}' | sort | uniq -c
# Shows error frequency over time
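Because journalctl timestamps are second-granular, most counts from the command above will be 1; bucketing by hour gives a more readable distribution. This variant assumes journalctl's default "Mon DD HH:MM:SS" short format:
# Bucket errors by hour instead of by second (timestamp format is an assumption)
grep -i "error\|fail" sos_commands/logs/journalctl_--no-pager | \
  awk '{split($3, t, ":"); print $1, $2, t[1]":00"}' | sort | uniq -c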