Skill

openshift-operator-troubleshooting

Deep troubleshooting guide for OpenShift cluster operators and Operator Lifecycle Manager (OLM) including diagnosing degraded operators, failed installations, subscription issues, and operator conflicts. Use when operators are degraded, failing to install/upgrade, or causing cluster issues.

Stats

Actions

OpenShift Operator Troubleshooting

This skill provides comprehensive guidance for troubleshooting OpenShift cluster operators and OLM-managed operators, covering installation, upgrade, and operational issues.

When to Use This Skill

Cluster operators showing DEGRADED or PROGRESSING status
Operator installations or upgrades failing
OLM subscription or install plan issues
Operator conflicts or dependency problems
CSV (ClusterServiceVersion) errors
CRD (Custom Resource Definition) issues
Operator resource constraints or crashes

Understanding OpenShift Operators

Cluster Operators: Core platform operators managed by Cluster Version Operator (CVO) OLM Operators: Add-on operators managed by Operator Lifecycle Manager

# View cluster operators
oc get clusteroperators

# View OLM-managed operators
oc get csv -A
oc get operators -A

Troubleshooting Cluster Operators

Step 1: Identify Degraded Operators

# List all cluster operators with status
oc get clusteroperators

# Look for:
# AVAILABLE=False: Operator not functioning
# PROGRESSING=True: Operator still reconciling
# DEGRADED=True: Operator has issues

# Filter for degraded operators
oc get co -o json | jq -r '.items[] | select(.status.conditions[] | select(.type=="Degraded" and .status=="True")) | .metadata.name'

# Get operator details
oc describe clusteroperator <operator-name>

Step 2: Analyze Operator Status

# Get detailed status
oc get clusteroperator <operator-name> -o yaml

# Check status conditions
oc get co <operator-name> -o jsonpath='{.status.conditions}' | jq

# View operator versions
oc get co <operator-name> -o jsonpath='{.status.versions}'

# Check related objects
oc get co <operator-name> -o jsonpath='{.status.relatedObjects}' | jq

Step 3: Check Operator Pods and Resources

# Find operator namespace
oc get co <operator-name> -o jsonpath='{.status.relatedObjects[].resource}' | grep namespace

# Common operator namespaces:
# openshift-*: Platform operators
# openshift-kube-*: Kubernetes control plane
# openshift-cluster-*: Cluster-wide services

# List pods in operator namespace
oc get pods -n <operator-namespace>

# Check pod status
oc describe pod <pod-name> -n <operator-namespace>

# View operator logs
oc logs -n <operator-namespace> <pod-name>
oc logs -n <operator-namespace> <pod-name> --previous

# For multiple replicas
oc logs -n <operator-namespace> -l app=<operator-app> --tail=100

# Follow logs in real-time
oc logs -n <operator-namespace> -l app=<operator-app> -f

Step 4: Common Cluster Operator Issues

Authentication Operator

# Check OAuth configuration
oc get oauth cluster -o yaml
oc get pods -n openshift-authentication
oc logs -n openshift-authentication -l app=oauth-openshift

Console Operator

# Check console deployment
oc get pods -n openshift-console
oc get route console -n openshift-console
oc logs -n openshift-console -l app=console

DNS Operator

# Check DNS pods
oc get pods -n openshift-dns
oc logs -n openshift-dns -l dns.operator.openshift.io/daemonset-dns=default

Ingress Operator

# Check router pods
oc get pods -n openshift-ingress
oc get ingresscontroller -n openshift-ingress-operator
oc logs -n openshift-ingress-operator -l name=ingress-operator

Monitoring Operator

# Check monitoring stack
oc get pods -n openshift-monitoring
oc get prometheuses -n openshift-monitoring
oc logs -n openshift-monitoring-operator -l app=cluster-monitoring-operator

Network Operator

# Check network operator
oc get network.operator cluster -o yaml
oc get pods -n openshift-network-operator
oc get pods -n openshift-sdn  # or openshift-ovn-kubernetes
oc logs -n openshift-network-operator -l name=network-operator

Storage Operator

# Check storage classes
oc get storageclass
oc get csidriver
oc get pods -n openshift-cluster-storage-operator
oc logs -n openshift-cluster-storage-operator -l name=cluster-storage-operator

Troubleshooting OLM Operators

Step 1: Check Operator Subscriptions

# List all subscriptions
oc get subscription -A

# Get subscription details
oc describe subscription <subscription-name> -n <namespace>

# Check subscription status
oc get subscription <subscription-name> -n <namespace> -o yaml

# View available channels
oc get packagemanifest <operator-package-name> -o yaml

Step 2: Check Install Plans

# List install plans
oc get installplan -A

# Get install plan details
oc describe installplan <installplan-name> -n <namespace>

# Check if approval is required
oc get installplan -n <namespace> -o json | jq '.items[] | select(.spec.approved==false)'

# Approve install plan
oc patch installplan <installplan-name> -n <namespace> --type merge -p '{"spec":{"approved":true}}'

Step 3: Check ClusterServiceVersion (CSV)

# List all CSVs
oc get csv -A

# Get CSV details
oc describe csv <csv-name> -n <namespace>

# Check CSV status
oc get csv <csv-name> -n <namespace> -o jsonpath='{.status.phase}'
# Phases: Pending, InstallReady, Installing, Succeeded, Failed

# Check CSV conditions
oc get csv <csv-name> -n <namespace> -o yaml | grep -A 10 conditions

# View CSV requirements
oc get csv <csv-name> -n <namespace> -o jsonpath='{.spec.install.spec.deployments}'

Step 4: Check Operator Pods

# Find operator pods from CSV
oc get csv <csv-name> -n <namespace> -o jsonpath='{.spec.install.spec.deployments[*].name}'

# Check deployment
oc get deployment -n <namespace>
oc describe deployment <operator-deployment> -n <namespace>

# Check pods
oc get pods -n <namespace>
oc logs -n <namespace> <operator-pod>

# Check events
oc get events -n <namespace> --sort-by='.lastTimestamp'

Step 5: Check CRDs and Custom Resources

# List CRDs installed by operator
oc get crd | grep <operator-domain>

# Get CRD details
oc describe crd <crd-name>

# Check CRD version
oc get crd <crd-name> -o jsonpath='{.spec.versions}'

# List custom resources
oc get <crd-plural> -A

# Validate custom resource
oc get <crd-plural> <resource-name> -n <namespace> -o yaml
oc describe <crd-plural> <resource-name> -n <namespace>

Common OLM Issues and Solutions

Issue: Operator Stuck in "Installing"

# Check install plan
oc get installplan -n <namespace>
oc describe installplan <installplan-name> -n <namespace>

# Check if approval needed
oc patch installplan <installplan-name> -n <namespace> --type merge -p '{"spec":{"approved":true}}'

# Check for resource conflicts
oc get events -n <namespace>

# Check operator deployment
oc get deployment -n <namespace>
oc describe deployment <operator-deployment> -n <namespace>

Issue: CSV in "Failed" State

# Check CSV conditions
oc get csv <csv-name> -n <namespace> -o yaml

# Common causes:
# - Missing CRDs
# - Insufficient permissions
# - Resource conflicts
# - Image pull errors

# Check for missing requirements
oc get csv <csv-name> -n <namespace> -o jsonpath='{.status.requirementStatus}'

# Delete and recreate if necessary
oc delete csv <csv-name> -n <namespace>
# OLM will recreate from subscription

Issue: Operator Image Pull Errors

# Check pod events
oc describe pod <operator-pod> -n <namespace>

# Verify image pull secrets
oc get secrets -n <namespace>
oc describe secret <pull-secret> -n <namespace>

# Link secret to operator service account
oc secrets link <service-account> <pull-secret> --for=pull -n <namespace>

# Check catalog source
oc get catalogsource -n openshift-marketplace
oc describe catalogsource <catalog-name> -n openshift-marketplace

Issue: Multiple Operators Conflicting

# Check for CRD conflicts
oc get crd -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp

# Check which CSV owns CRD
oc get crd <crd-name> -o jsonpath='{.metadata.ownerReferences}'

# List all CSVs that reference the CRD
oc get csv -A -o json | jq -r '.items[] | select(.spec.customresourcedefinitions.owned[]?.name=="<crd-name>") | .metadata.name'

# Remove conflicting operator
oc delete subscription <subscription-name> -n <namespace>
oc delete csv <csv-name> -n <namespace>

Issue: Operator Upgrade Failed

# Check subscription channel
oc get subscription <subscription-name> -n <namespace> -o yaml

# Check install plan for upgrade
oc get installplan -n <namespace>

# Rollback to previous version (if needed)
oc delete installplan <failed-installplan> -n <namespace>

# Update subscription to stable channel
oc patch subscription <subscription-name> -n <namespace> --type='merge' -p '{"spec":{"channel":"stable"}}'

Operator Lifecycle Manager (OLM) Troubleshooting

Check OLM Components

# Check OLM operators
oc get pods -n openshift-operator-lifecycle-manager

# OLM operator logs
oc logs -n openshift-operator-lifecycle-manager -l app=olm-operator

# Catalog operator logs
oc logs -n openshift-operator-lifecycle-manager -l app=catalog-operator

# Package server logs
oc logs -n openshift-operator-lifecycle-manager -l app=packageserver

Check Catalog Sources

# List catalog sources
oc get catalogsource -A

# Check catalog source health
oc get catalogsource -n openshift-marketplace
oc describe catalogsource <catalog-name> -n openshift-marketplace

# Check catalog pod
oc get pods -n openshift-marketplace | grep <catalog-name>
oc logs -n openshift-marketplace <catalog-pod>

# Refresh catalog
oc delete pod -n openshift-marketplace -l olm.catalogSource=<catalog-name>

Check OperatorGroup

# List operator groups
oc get operatorgroup -A

# Check operator group configuration
oc describe operatorgroup <operatorgroup-name> -n <namespace>

# Verify target namespaces
oc get operatorgroup <operatorgroup-name> -n <namespace> -o jsonpath='{.spec.targetNamespaces}'

Advanced Troubleshooting

Enable Operator Debug Logging

# For cluster operators (example: ingress)
oc patch ingresscontroller default -n openshift-ingress-operator --type='merge' -p '{"spec":{"logging":{"access":{"destination":{"type":"Container"}}}}}'

# For OLM operators, check operator-specific documentation
# Many operators support log level configuration via env vars or CR

Collect Must-Gather for Operators

# General must-gather
oc adm must-gather

# Operator-specific must-gather (if available)
oc adm must-gather --image=<operator-must-gather-image>

# For cluster operators
oc adm inspect clusteroperator/<operator-name>
oc adm inspect namespace/<operator-namespace>

Check Operator Metrics

# Port-forward to Prometheus
oc port-forward -n openshift-monitoring prometheus-k8s-0 9090:9090

# Query operator metrics
# Access http://localhost:9090

# Check operator-specific metrics
# Example: up{job="<operator-name>"}

Verify RBAC Permissions

# Check service account
oc get sa -n <namespace>
oc describe sa <operator-sa> -n <namespace>

# Check roles and role bindings
oc get role,rolebinding -n <namespace>
oc get clusterrole,clusterrolebinding | grep <operator-name>

# Check if operator can perform actions
oc auth can-i create pods --as=system:serviceaccount:<namespace>:<operator-sa>

Best Practices

Check Logs First: Operator logs usually contain the root cause
Verify Resource Requirements: Ensure sufficient CPU/memory
Check Dependencies: Operators may depend on other operators or services
Review Operator Documentation: Each operator has specific requirements
Validate CRs: Ensure custom resources meet the CRD schema
Monitor During Changes: Watch operator status during upgrades or config changes
Use Must-Gather: Collect comprehensive diagnostics for complex issues
Check Red Hat Knowledgebase: Many issues have documented solutions
Keep Operators Updated: Use stable channels and approved versions
Test in Non-Production: Validate operator changes before production

Operator Health Checklist

# Quick health check script
echo "=== Cluster Operators ==="
oc get co
echo ""
echo "=== Degraded Operators ==="
oc get co | grep -v "True.*False.*False"
echo ""
echo "=== OLM Operators ==="
oc get csv -A
echo ""
echo "=== Failed CSVs ==="
oc get csv -A | grep -i failed
echo ""
echo "=== Pending Install Plans ==="
oc get installplan -A | grep -i false
echo ""
echo "=== Catalog Sources ==="
oc get catalogsource -A

Useful Commands

# One-liner to check all operator health
oc get co && oc get csv -A && oc get subscription -A

# Find all operator pods
oc get pods -A | grep operator

# Get all operator logs
for ns in $(oc get namespaces -o name | cut -d/ -f2 | grep openshift); do
  echo "=== Namespace: $ns ==="
  oc logs -n $ns -l app=operator --tail=20
done

# Check operator resource usage
oc adm top pods -A | grep operator

# Export operator configuration
oc get co <operator-name> -o yaml > operator-config.yaml
oc get csv <csv-name> -n <namespace> -o yaml > operator-csv.yaml

Related Skills

openshift-debugging - General cluster troubleshooting
openshift-cluster-upgrade - Operators are updated during upgrades
openshift-node-operations - Some operators manage node resources

openshift-operator-troubleshooting

openshift-operator-troubleshooting

Popularity

Invocation

Context Preview

SKILL.md

OpenShift Operator Troubleshooting

When to Use This Skill

Understanding OpenShift Operators

Troubleshooting Cluster Operators

Step 1: Identify Degraded Operators

Step 2: Analyze Operator Status

Step 3: Check Operator Pods and Resources

Step 4: Common Cluster Operator Issues

Troubleshooting OLM Operators

Step 1: Check Operator Subscriptions

Step 2: Check Install Plans

Step 3: Check ClusterServiceVersion (CSV)

Step 4: Check Operator Pods

Step 5: Check CRDs and Custom Resources

Common OLM Issues and Solutions

Issue: Operator Stuck in "Installing"

Issue: CSV in "Failed" State

Issue: Operator Image Pull Errors

Issue: Multiple Operators Conflicting

Issue: Operator Upgrade Failed

Operator Lifecycle Manager (OLM) Troubleshooting

Check OLM Components

Check Catalog Sources

Check OperatorGroup

Advanced Troubleshooting

Enable Operator Debug Logging

Collect Must-Gather for Operators

Check Operator Metrics

Verify RBAC Permissions

Best Practices

Operator Health Checklist

Useful Commands

Related Skills

Similar Skills

Similar Skills