Analyzes LVMS must-gather data to diagnose storage issues
/plugin marketplace add openshift-eng/ai-helpers
/plugin install lvms@ai-helpers

This skill inherits all available tools. When active, it can use any tool Claude has access to.
scripts/analyze_lvms.py

This skill provides detailed guidance for analyzing LVMS (Logical Volume Manager Storage) must-gather data to identify and troubleshoot storage issues.
Use this skill when:
Analyzing LVMS must-gather data collected from a cluster
Diagnosing pending PVCs, degraded volume groups, or LVMCluster failures
Investigating LVMS operator, vg-manager, or TopoLVM pod issues
This skill is automatically invoked by the /lvms:analyze command when working with must-gather data.
Required:
namespaces/openshift-lvm-storage/ (newer versions) or namespaces/openshift-storage/ (older versions)
PyYAML (pip install pyyaml)

Namespace Compatibility:
The LVMS namespace changed from openshift-storage to openshift-lvm-storage in recent versions, so both locations must be considered.

Must-Gather Structure:
must-gather/
└── registry-{image-registry}-lvms-must-gather-{version}-sha256-{hash}/
    ├── cluster-scoped-resources/
    │   ├── core/
    │   │   └── persistentvolumes/
    │   │       └── pvc-*.yaml                 # Individual PV files
    │   ├── storage.k8s.io/
    │   │   └── storageclasses/
    │   │       ├── lvms-vg1.yaml
    │   │       └── lvms-vg1-immediate.yaml
    │   └── security.openshift.io/
    │       └── securitycontextconstraints/
    │           └── lvms-vgmanager.yaml
    ├── namespaces/
    │   └── openshift-lvm-storage/             # or openshift-storage for older versions
    │       ├── oc_output/                     # IMPORTANT: Primary location for LVMS resources
    │       │   ├── lvmcluster.yaml            # Full LVMCluster resource with status
    │       │   ├── lvmcluster                 # Text output (oc describe)
    │       │   ├── lvmvolumegroup             # Text output
    │       │   ├── lvmvolumegroupnodestatus   # Text output
    │       │   ├── logicalvolume              # Text output
    │       │   ├── pods                       # Text output (oc get pods)
    │       │   └── events                     # Text output
    │       ├── pods/
    │       │   ├── lvms-operator-{hash}/
    │       │   │   └── lvms-operator-{hash}.yaml
    │       │   └── vg-manager-{hash}/
    │       │       └── vg-manager-{hash}.yaml
    │       └── apps/                          # May contain deployments/daemonsets
    └── ...
Key Note: LVMS resources are primarily in the oc_output/ directory, with lvmcluster.yaml being the most important file containing full cluster and node status.
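For a quick manual check, the top-level status fields can be read straight from that file. This is a minimal sketch only; it assumes the newer openshift-lvm-storage namespace, so substitute openshift-storage on older clusters:
# Quick look at LVMCluster state and readiness from the must-gather copy
grep -nE 'state:|ready:' {must-gather-path}/namespaces/openshift-lvm-storage/oc_output/lvmcluster.yaml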
Before running analysis, verify the must-gather directory structure:
# Check if LVMS namespace directory exists (try both namespaces)
ls {must-gather-path}/namespaces/openshift-lvm-storage 2>/dev/null || \
ls {must-gather-path}/namespaces/openshift-storage
# Verify required resource directories
ls {must-gather-path}/cluster-scoped-resources/core/persistentvolumes
Namespace Detection: The analysis script automatically detects which namespace is present:
openshift-lvm-storage is checked first, then openshift-storage as a fallback.

Common Issue: The user provides the parent directory (e.g. must-gather.local.12345/) instead of the must-gather subdirectory must-gather.local.12345/registry-ci-openshift-org-origin-4-18.../

Handling:
# If user provides parent directory, try to find the correct subdirectory
if [ ! -d "{path}/namespaces/openshift-lvm-storage" ] && \
[ ! -d "{path}/namespaces/openshift-storage" ]; then
# Try to find either namespace
find {path} -type d \( -name "openshift-lvm-storage" -o -name "openshift-storage" \) -path "*/namespaces/*"
# Suggest the correct path to user
fi
Use the Python analysis script for structured analysis:
python3 plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py \
{must-gather-path}
Script Location:
plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py

Component-Specific Analysis:
For focused analysis on specific components:
# Analyze only storage/PVC issues
python3 plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py \
{must-gather-path} --component storage
# Analyze only operator health
python3 plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py \
{must-gather-path} --component operator
# Analyze only volume groups
python3 plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py \
{must-gather-path} --component volumes
# Analyze only pod logs
python3 plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py \
{must-gather-path} --component logs
The script provides structured output across several sections:
1. LVMCluster Status
Key fields to check:
state: Should be "Ready"
ready: Should be true
conditions: All should have status "True"
Example healthy output:
LVMCluster: lvmcluster-sample
✓ State: Ready
✓ Ready: true
Conditions:
✓ ResourcesAvailable: True
✓ VolumeGroupsReady: True
Example unhealthy output (real case from must-gather):
LVMCluster: my-lvmcluster
❌ State: Degraded
❌ Ready: false
Conditions:
✓ ResourcesAvailable: True
Reason: ResourcesAvailable
Message: Reconciliation is complete and all the resources are available
❌ VolumeGroupsReady: False
Reason: VGsDegraded
Message: One or more VGs are degraded
2. Volume Group Status
Checks volume group creation per node and device availability:
Example output (real case from must-gather):
Volume Group/Device Class: vg1
Nodes: 3
Node: ocpnode1.ocpiopex.growipx.com
⚠ Status: Progressing
Devices: /dev/mapper/3600a098038315048302b586c38397562, /dev/mapper/mpatha
Excluded devices: 24 device(s)
- /dev/sdb: /dev/sdb has children block devices and could not be considered
- /dev/sdb4: /dev/sdb4 has an invalid filesystem signature (xfs) and cannot be used
- /dev/mapper/3600a098038315047433f586c53477272: has an invalid filesystem signature (xfs)
... and 21 more excluded devices
Node: ocpnode2.ocpiopex.growipx.com
❌ Status: Degraded
Reason:
failed to create/extend volume group vg1: failed to extend volume group vg1:
WARNING: VG name vg0 is used by VGs VVnkhP-khYQ-blyc-2TNo-d3cv-b6di-4RbSyY and EUV3xv-ft6q-39xK-J3ki-rglf-9H44-rVIHIq.
Fix duplicate VG names with vgrename uuid, a device filter, or system IDs.
Physical volume '/dev/mapper/3600a098038315048302b586c38397578p3' is already in volume group 'vg0'
Unable to add physical volume '/dev/mapper/3600a098038315048302b586c38397578p3' to volume group 'vg0'
... (truncated, see LVMCluster status for full details)
Devices: /dev/mapper/mpatha
This real example shows a common LVMS issue: duplicate volume group names preventing VG extension.
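To confirm a duplicate-VG condition on the affected node, the volume group names and UUIDs can be listed directly. A hedged sketch using standard LVM reporting options; substitute the affected node name:
# List volume group names, UUIDs, and PV counts on the affected node
oc debug node/{node-name} -- chroot /host vgs -o vg_name,vg_uuid,pv_count
The error message itself points at the available remedies: vgrename by UUID, a device filter, or system IDs.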
3. Storage (PVC/PV) Status
Lists pending or failed PVCs:
Example output:
Pending PVCs:
database/postgres-data
❌ Status: Pending (10m)
Storage Class: lvms-vg1
Requested: 100Gi
Recent Events:
⚠ ProvisioningFailed: no node has enough free space
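If the cluster is still reachable, the same provisioning failures can be confirmed live. The PVC name and namespace below are taken from the example above and are illustrative only:
# Inspect events for the pending PVC from the example
oc describe pvc postgres-data -n database
oc get events -n database --sort-by=.lastTimestamp | grep -i provision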
4. Operator Health
Checks LVMS operator pods, deployments, and daemonsets:
Example issues:
❌ vg-manager-abc123 (worker-0)
Status: CrashLoopBackOff
Restarts: 15
Error: volume group "vg1" not found
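To confirm a crash-looping component on a live cluster, a hedged sketch (the pod name below is from the example and will differ in practice):
# Show LVMS pod status and logs from the previous container instance
oc get pods -n openshift-lvm-storage -o wide
oc logs -n openshift-lvm-storage vg-manager-abc123 --previous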
5. Pod Logs
Extracts and analyzes error/warning messages from pod logs:
Example output (from real must-gather):
═══════════════════════════════════════════════════════════
POD LOGS ANALYSIS
═══════════════════════════════════════════════════════════
Pod: vg-manager-nz4pc
Unique errors/warnings: 1
❌ 2025-10-28T10:47:28Z: Reconciler error
Controller: lvmvolumegroup
Error Details:
failed to create/extend volume group vg1: failed to extend volume group vg1:
WARNING: VG name vg0 is used by VGs WsNJwk-DK3q-tSHg-zvQJ-imF1-SdRv-8oh4e0 ...
Cannot use /dev/dm-10: device is too small (pv_min_size)
Command requires all devices to be found.
Pod: lvms-operator-65df9f4dbb-92jwl
Unique errors/warnings: 1
❌ 2025-10-28T10:52:48Z: failed to validate device class setup
Controller: lvmcluster
Error: VG vg1 on node Degraded is not in ready state (ocpnode1.ocpiopex.growipx.com)
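When working purely from the must-gather, similar messages can be grepped out of the collected log files. A minimal sketch; the exact layout under pods/ can vary between must-gather images:
# Pull error and reconciler messages from collected LVMS pod logs
grep -riE '"error"|reconciler error' {must-gather-path}/namespaces/openshift-lvm-storage/pods/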
Key Points:
vg-manager logs surface per-node volume group create/extend failures, while lvms-operator logs surface cluster-level validation errors; the same root cause usually appears in both.
Log timestamps help establish which component failed first.
Connect related issues to identify root causes:
Common Pattern 1: Device Filesystem Conflict
Chain of failures:
1. Device /dev/sdb has existing ext4 filesystem
2. vg-manager cannot create volume group
3. Volume group missing on node
4. PVCs stuck in Pending
Root cause: Device not properly wiped before LVMS use
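Before wiping anything, the existing signatures can be inspected read-only. A hedged sketch; substitute the actual node and device:
# Read-only inspection of filesystem signatures on the device
oc debug node/{node-name} -- chroot /host lsblk -f /dev/{device}
oc debug node/{node-name} -- chroot /host wipefs -n /dev/{device}   # -n prints signatures without erasing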
Common Pattern 2: Insufficient Capacity
Chain of failures:
1. Thin pool at 95% capacity
2. No free space for new volumes
3. PVCs stuck in Pending
Root cause: Insufficient storage capacity or old volumes not cleaned up
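Thin pool utilization can be checked directly on the node. A hedged sketch using standard LVM report fields:
# Show data and metadata usage for each logical volume, including the thin pool
oc debug node/{node-name} -- chroot /host lvs -o lv_name,vg_name,data_percent,metadata_percent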
Common Pattern 3: Node-Specific Failures
Chain of failures:
1. Volume group missing on specific node
2. TopoLVM CSI driver not functional on that node
3. PVCs with node affinity to that node stuck Pending
Root cause: Node-specific device configuration issue
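Per-node volume group state and PV node affinity can both be checked to confirm this pattern. A hedged sketch; the grep context depth is approximate:
# Compare per-node volume group status with PV node affinity from the must-gather
oc get lvmvolumegroupnodestatus -n openshift-lvm-storage -o yaml
grep -A8 'nodeAffinity' {must-gather-path}/cluster-scoped-resources/core/persistentvolumes/pvc-*.yaml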
Based on analysis results, provide prioritized recommendations:
CRITICAL Issues (Fix Immediately):
Device Conflicts:
# Clean device on affected node
oc debug node/{node-name}
chroot /host wipefs -a /dev/{device}
# Restart vg-manager to recreate VG
oc delete pod -n openshift-lvm-storage -l app.kubernetes.io/component=vg-manager
Pod Crashes:
# After fixing underlying issue, restart failed pods
oc delete pod -n openshift-lvm-storage {pod-name}
LVMCluster Not Ready:
# Review and fix device configuration
oc edit lvmcluster -n openshift-lvm-storage
# Ensure devices match actual available devices
WARNING Issues (Address Soon):
Capacity Issues:
# Check logical volume usage
oc debug node/{node} -- chroot /host lvs --units g
# Remove unused volumes or expand thin pool
Partial Node Coverage:
# Investigate why daemonsets not on all nodes
oc get nodes --show-labels
oc describe daemonset -n openshift-lvm-storage
Always provide clear next steps:
Review logs (if available in must-gather):
namespaces/openshift-lvm-storage/pods/lvms-operator-*/logs/
namespaces/openshift-lvm-storage/pods/vg-manager-*/logs/
namespaces/openshift-lvm-storage/pods/topolvm-*/logs/

Verify fixes (if cluster is accessible):
# After implementing fixes, verify:
oc get lvmcluster -n openshift-lvm-storage
oc get lvmvolumegroup -A
oc get pvc -A | grep Pending
Re-collect must-gather (if making changes):
oc adm must-gather --image=quay.io/lvms_dev/lvms-must-gather:latest
Script not found:
# Verify script exists
ls plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py
# Ensure it's executable
chmod +x plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py
Python dependencies missing:
# Install PyYAML
pip install pyyaml
# Or use pip3
pip3 install pyyaml
Invalid YAML in must-gather:
Files may be truncated or malformed if collection was interrupted; inspect the reported file manually and re-collect if needed.
Missing directories:
Re-check that the path points at the must-gather subdirectory and that both namespace locations (openshift-lvm-storage and openshift-storage) were tried.
Incomplete must-gather:
Re-collect using the lvms-must-gather image shown above so that oc_output/ and pod logs are included.
# Run comprehensive analysis
python3 plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py \
./must-gather/registry-ci-openshift-org-origin-4-18.../
Output:
═══════════════════════════════════════════════════════════
LVMCLUSTER STATUS
═══════════════════════════════════════════════════════════
LVMCluster: lvmcluster-sample
❌ State: Failed
❌ Ready: false
...
═══════════════════════════════════════════════════════════
LVMS ANALYSIS SUMMARY
═══════════════════════════════════════════════════════════
❌ CRITICAL ISSUES: 3
- LVMCluster not Ready (state: Failed)
- Volume group vg1 not created on worker-0
- 3 PVCs stuck in Pending state
# Focus on PVC issues
python3 plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py \
./must-gather/... --component storage
Analyzes only: PVC and PV status, storage classes, and related provisioning events.
# Check operator components
python3 plugins/lvms/skills/lvms-analyzer/scripts/analyze_lvms.py \
./must-gather/... --component operator
Analyzes only: LVMS operator pods, deployments, and daemonsets.
Always validate path first: confirm the must-gather contains a namespaces/openshift-lvm-storage/ (or openshift-storage) directory before analyzing.
Run full analysis first: run the script without --component to get the complete picture before drilling into individual areas.
Correlate issues: connect LVMCluster, volume group, PVC, and pod-log findings into a single root cause, as in the common patterns above.
Check timestamps: compare event and log timestamps to establish the order in which components failed.
Provide actionable output: include concrete remediation commands rather than generic advice.
Reference documentation: point users at the relevant LVMS documentation when recommending configuration changes.