Analyze system resource usage from sosreport archives: extract memory statistics, CPU load averages, disk space utilization, and process information from the sosreport directory structure to diagnose resource exhaustion, performance bottlenecks, and capacity issues.
/plugin marketplace add openshift-eng/ai-helpers
/plugin install sosreport@ai-helpers

This skill inherits all available tools. When active, it can use any tool Claude has access to.
This skill provides detailed guidance for analyzing system resource usage from sosreport archives, including memory, CPU, disk space, and process information.
Use this skill when:
- Running the /sosreport:analyze command's resource analysis phase

The following files in the sosreport archive are the primary data sources.

Memory Information:
- sos_commands/memory/free - Memory usage snapshot
- proc/meminfo - Detailed memory statistics
- sos_commands/memory/swapon_-s - Swap usage
- proc/buddyinfo - Memory fragmentation

CPU Information:
- sos_commands/processor/lscpu - CPU architecture and features
- proc/cpuinfo - Detailed CPU information
- sos_commands/processor/turbostat - CPU frequency and power states (if available)
- uptime - Load averages

Disk Information:
- sos_commands/filesys/df_-al - Filesystem usage
- sos_commands/block/lsblk - Block device information
- sos_commands/filesys/mount - Mounted filesystems
- proc/diskstats - Disk I/O statistics

Process Information:
- sos_commands/process/ps_auxwww - Process list with details
- sos_commands/process/top - Process snapshot (if available)
- proc/[pid]/ - Per-process information

Parse free command output:
# Check if free output exists
if [ -f sos_commands/memory/free ]; then
cat sos_commands/memory/free
fi
Extract memory metrics:
# Parse /proc/meminfo for detailed stats
if [ -f proc/meminfo ]; then
grep -E "^(MemTotal|MemFree|MemAvailable|Buffers|Cached|SwapTotal|SwapFree|Dirty|Slab):" proc/meminfo
fi
Calculate memory usage percentage:
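A minimal sketch, assuming proc/meminfo is present, that derives the used-memory percentage from MemTotal and MemAvailable:
# Used memory percentage = (MemTotal - MemAvailable) / MemTotal
# (MemAvailable is reported by modern kernels; fall back to the free output if it is absent)
if [ -f proc/meminfo ]; then
  awk '/^MemTotal:/ {total=$2} /^MemAvailable:/ {avail=$2}
       END {if (total > 0 && avail > 0) printf "Memory used: %.1f%%\n", (total - avail) * 100 / total}' proc/meminfo
fi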
Alternatively, read the used and available figures directly from the free output.

Check for memory pressure indicators:
# Look for OOM events in logs
grep -i "out of memory\|oom killer" sos_commands/logs/journalctl_--no-pager 2>/dev/null
# Check swap usage
if [ -f sos_commands/memory/swapon_-s ]; then
cat sos_commands/memory/swapon_-s
fi
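Swap usage can also be quantified from proc/meminfo; a minimal sketch, assuming SwapTotal and SwapFree are reported there:
# Swap used percentage = (SwapTotal - SwapFree) / SwapTotal
if [ -f proc/meminfo ]; then
  awk '/^SwapTotal:/ {total=$2} /^SwapFree:/ {free=$2}
       END {if (total > 0) printf "Swap used: %.1f%%\n", (total - free) * 100 / total; else print "No swap configured"}' proc/meminfo
fi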
Identify memory issues:
Extract CPU information:
# Get CPU count and model
if [ -f sos_commands/processor/lscpu ]; then
grep -E "^(CPU\(s\)|Model name|Thread|Core|Socket|CPU MHz):" sos_commands/processor/lscpu
fi
Check load averages:
# Parse uptime for load averages
if [ -f uptime ]; then
cat uptime
fi
# Or from proc/loadavg
if [ -f proc/loadavg ]; then
cat proc/loadavg
fi
Interpret load averages:
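Load averages are only meaningful relative to the CPU count. A minimal sketch, assuming proc/loadavg and the lscpu output are both present, that computes the 1-minute load per CPU and applies the per-CPU thresholds from the table at the end of this skill (warning above 1.0, critical above 2.0):
# Divide the 1-minute load average by the CPU count to get load per CPU
if [ -f proc/loadavg ] && [ -f sos_commands/processor/lscpu ]; then
  cpus=$(awk -F: '/^CPU\(s\):/ {gsub(/ /, "", $2); print $2}' sos_commands/processor/lscpu)
  awk -v cpus="$cpus" 'cpus > 0 {
    per_cpu = $1 / cpus
    status = per_cpu < 1.0 ? "OK" : (per_cpu <= 2.0 ? "WARNING" : "CRITICAL")
    printf "Load per CPU (1m): %.2f (%s)\n", per_cpu, status
  }' proc/loadavg
fi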
Check for CPU throttling:
# Look for thermal throttling in logs
grep -i "throttl\|temperature\|thermal" sos_commands/logs/journalctl_--no-pager 2>/dev/null | head -20
Identify CPU issues:
Parse df output for filesystem usage:
if [ -f sos_commands/filesys/df_-al ]; then
# Skip header and special filesystems, show only regular filesystems
grep -v "^Filesystem\|tmpfs\|devtmpfs\|overlay" sos_commands/filesys/df_-al | grep -v "^$"
fi
Identify full or nearly-full filesystems:
# Extract filesystems with usage > 85%
if [ -f sos_commands/filesys/df_-al ]; then
awk 'NR>1 && $5+0 >= 85 {print $5, $6, $1}' sos_commands/filesys/df_-al | grep -v "tmpfs\|devtmpfs"
fi
Check disk I/O errors:
# Look for I/O errors in logs
grep -i "i/o error\|read error\|write error\|bad sector" var/log/dmesg 2>/dev/null
grep -i "i/o error\|read error\|write error" sos_commands/logs/journalctl_--no-pager 2>/dev/null | head -20
Analyze block devices:
if [ -f sos_commands/block/lsblk ]; then
cat sos_commands/block/lsblk
fi
Identify disk issues:
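For example, a sketch that flags filesystems at or above the 95% critical threshold and counts I/O errors in the journal, assuming the df and journal files referenced above are present:
# Flag filesystems at or above the 95% critical threshold
if [ -f sos_commands/filesys/df_-al ]; then
  awk 'NR>1 && $5+0 >= 95 {printf "CRITICAL: %s at %s (%s)\n", $6, $5, $1}' sos_commands/filesys/df_-al | grep -v "tmpfs\|devtmpfs"
fi

# Count I/O error lines reported in the journal
io_errors=$(grep -ci "i/o error\|read error\|write error" sos_commands/logs/journalctl_--no-pager 2>/dev/null)
echo "I/O errors found in logs: ${io_errors:-0}"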
Parse ps output:
if [ -f sos_commands/process/ps_auxwww ]; then
# Show header
head -1 sos_commands/process/ps_auxwww
fi
Find top CPU consumers:
# Sort by CPU usage (column 3), show top 10
if [ -f sos_commands/process/ps_auxwww ]; then
tail -n +2 sos_commands/process/ps_auxwww | sort -k3 -rn | head -10
fi
Find top memory consumers:
# Sort by memory usage (column 4), show top 10
if [ -f sos_commands/process/ps_auxwww ]; then
tail -n +2 sos_commands/process/ps_auxwww | sort -k4 -rn | head -10
fi
Check for zombie processes:
# Look for processes in Z state
if [ -f sos_commands/process/ps_auxwww ]; then
grep " Z " sos_commands/process/ps_auxwww || echo "No zombie processes found"
fi
Count processes by state:
# Count processes by state (R=running, S=sleeping, D=uninterruptible, Z=zombie, T=stopped)
if [ -f sos_commands/process/ps_auxwww ]; then
tail -n +2 sos_commands/process/ps_auxwww | awk '{print $8}' | cut -c1 | sort | uniq -c
fi
Identify process issues:
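A minimal sketch that lists uninterruptible (D-state) processes from the ps output above; a persistently non-zero D-state count usually points to processes blocked on I/O:
# List uninterruptible (D-state) processes: user, PID, state, command
if [ -f sos_commands/process/ps_auxwww ]; then
  awk 'NR>1 && $8 ~ /^D/ {print $1, $2, $8, $11}' sos_commands/process/ps_auxwww
fi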
Cross-reference with logs:
Identify resource exhaustion patterns:
Build timeline:
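A minimal sketch, assuming the captured journal uses the default timestamp-prefixed format, that pulls out OOM and I/O error events so they can be correlated with resource peaks:
# List OOM and I/O error events with their timestamps (timestamps lead each journal line)
if [ -f sos_commands/logs/journalctl_--no-pager ]; then
  grep -i "out of memory\|oom killer\|i/o error" sos_commands/logs/journalctl_--no-pager | cut -c1-160 | head -20
fi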
Create a structured summary with the following sections:
Memory Summary:
CPU Summary:
Disk Summary:
Process Summary:
Critical Resource Issues:
Missing resource files:
- If free is missing, parse proc/meminfo directly
- If ps is missing, check proc/ for process information

Parsing errors:
Incomplete data:
The resource analysis should produce:
RESOURCE USAGE SUMMARY
======================
MEMORY
------
Total: {total_gb} GB
Used: {used_gb} GB ({used_pct}%)
Available: {available_gb} GB ({available_pct}%)
Buffers: {buffers_gb} GB
Cached: {cached_gb} GB
Swap Total: {swap_total_gb} GB
Swap Used: {swap_used_gb} GB ({swap_used_pct}%)
Status: {OK|WARNING|CRITICAL}
Issues:
- {memory_issue_description}
CPU
---
Model: {cpu_model}
CPU Count: {cpu_count}
Threads/Core: {threads_per_core}
Load Averages: {load_1m}, {load_5m}, {load_15m}
Load per CPU: {load_1m_per_cpu}, {load_5m_per_cpu}, {load_15m_per_cpu}
Status: {OK|WARNING|CRITICAL}
Issues:
- {cpu_issue_description}
DISK USAGE
----------
Filesystem Size Used Avail Use% Mounted on
{filesystem} {size} {used} {avail} {pct}% {mount}
Nearly Full Filesystems (>85%):
- {mount}: {pct}% full ({available} available)
I/O Errors: {count} errors found in logs
Status: {OK|WARNING|CRITICAL}
Issues:
- {disk_issue_description}
PROCESSES
---------
Total Processes: {total}
Running: {running}
Sleeping: {sleeping}
Zombie: {zombie}
Uninterruptible: {uninterruptible}
Top CPU Consumers:
1. {process_name} (PID {pid}): {cpu}% CPU, {mem}% MEM
2. {process_name} (PID {pid}): {cpu}% CPU, {mem}% MEM
3. {process_name} (PID {pid}): {cpu}% CPU, {mem}% MEM
Top Memory Consumers:
1. {process_name} (PID {pid}): {mem}% MEM, {cpu}% CPU
2. {process_name} (PID {pid}): {mem}% MEM, {cpu}% CPU
3. {process_name} (PID {pid}): {mem}% MEM, {cpu}% CPU
Status: {OK|WARNING|CRITICAL}
Issues:
- {process_issue_description}
CRITICAL RESOURCE ISSUES
------------------------
{severity}: {issue_description}
Evidence: {file_path}
Impact: {impact_description}
Recommendation: {remediation_action}
RECOMMENDATIONS
---------------
1. {actionable_recommendation}
2. {actionable_recommendation}
DATA SOURCES
------------
- Memory: {sosreport_path}/sos_commands/memory/free
- Memory: {sosreport_path}/proc/meminfo
- CPU: {sosreport_path}/sos_commands/processor/lscpu
- Load: {sosreport_path}/uptime
- Disk: {sosreport_path}/sos_commands/filesys/df_-al
- Processes: {sosreport_path}/sos_commands/process/ps_auxwww
# Parse free command output
$ cat sos_commands/memory/free
total used free shared buff/cache available
Mem: 16277396 8123456 2145678 123456 6008262 7654321
Swap: 8388604 512000 7876604
# Interpretation:
# - Total RAM: ~16 GB
# - Used: ~8 GB (50%)
# - Available: ~7.6 GB (47%)
# - Swap used: ~500 MB (6%)
# Status: OK - healthy memory usage
# Find filesystems > 85% full
$ awk 'NR>1 && $5+0 >= 85' sos_commands/filesys/df_-al
/dev/sda1 50G 45G 5G 90% /
/dev/sdb1 100G 96G 4G 96% /var/log
# Critical: Root filesystem at 90%, /var/log at 96%
# Action required: Clean up disk space
# Check load averages
$ cat uptime
14:23:45 up 10 days, 3:42, 2 users, load average: 8.45, 7.23, 6.12
# With lscpu showing 4 CPUs:
# Load per CPU: 2.1, 1.8, 1.5
# System is overloaded (load > 2x CPU count)
| Metric | OK | Warning | Critical |
|---|---|---|---|
| Memory Usage | < 80% | 80-90% | > 90% |
| Swap Usage | < 20% | 20-50% | > 50% |
| Disk Usage | < 85% | 85-95% | > 95% |
| Load (per CPU) | < 1.0 | 1.0-2.0 | > 2.0 |
| Root FS Usage | < 80% | 80-90% | > 90% |
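A small helper sketch (hypothetical; classify is not part of sosreport or any standard tool) that maps a usage percentage onto these OK/WARNING/CRITICAL bands, shown here with the memory thresholds:
# Classify a usage percentage against warning/critical thresholds
classify() {
  value=$1; warn=$2; crit=$3
  awk -v v="$value" -v w="$warn" -v c="$crit" 'BEGIN {
    print (v < w) ? "OK" : (v <= c ? "WARNING" : "CRITICAL")
  }'
}

# Example: memory usage (warning at 80%, critical above 90%)
classify 87 80 90   # -> WARNING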