Debug failed GPU CLI runs. Analyze error messages, diagnose OOM errors, fix sync issues, troubleshoot connectivity, and resolve common problems. Turn cryptic errors into actionable fixes.
Diagnoses GPU CLI failures by analyzing error messages, logs, and exit codes. Use when encountering OOM errors, connection issues, sync failures, or hangs to get specific configuration fixes and code solutions.
/plugin marketplace add gpu-cli/gpu
/plugin install gpu-cli@gpu-cli

This skill inherits all available tools. When active, it can use any tool Claude has access to.
Turn errors into solutions.
This skill helps debug failed GPU CLI runs: OOM errors, sync failures, connectivity issues, model loading problems, and more.
| Problem | This Skill Helps With |
|---|---|
| "CUDA out of memory" | OOM diagnosis and fixes |
| "Connection refused" | Connectivity troubleshooting |
| "Sync failed" | File sync debugging |
| "Pod won't start" | Provisioning issues |
| "Model won't load" | Model loading errors |
| "Command exited with error" | Exit code analysis |
| "My run is hanging" | Stuck process diagnosis |
Error occurs
│
▼
┌─────────────────────────┐
│ 1. Collect information │
│ - Error message │
│ - Daemon logs │
│ - Exit code │
│ - VRAM usage │
└───────────┬─────────────┘
│
▼
┌─────────────────────────┐
│ 2. Identify error type │
│ - OOM │
│ - Network │
│ - Model │
│ - Sync │
│ - Permission │
└───────────┬─────────────┘
│
▼
┌─────────────────────────┐
│ 3. Apply fix │
│ - Config change │
│ - Code change │
│ - Retry │
└─────────────────────────┘
# Last 50 log lines
gpu daemon logs --tail 50
# Full logs since last restart
gpu daemon logs
# Follow logs in real-time
gpu daemon logs --follow
# Current pod status
gpu status
# Pod details
gpu pods list
# Recent jobs
gpu jobs list
# Specific job details
gpu jobs show <job-id>
Error messages:
CUDA out of memory. Tried to allocate X GiB
RuntimeError: CUDA error: out of memory
torch.cuda.OutOfMemoryError
Diagnosis:
# In your script, check VRAM usage
import torch
print(f"VRAM allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"VRAM reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")
Solutions by severity:
| Solution | VRAM Savings | Effort |
|---|---|---|
| Reduce batch size | ~Linear | Easy |
| Enable gradient checkpointing | ~40% | Easy |
| Use FP16/BF16 | ~50% | Easy |
| Use INT8 quantization | ~50% | Medium |
| Use INT4 quantization | ~75% | Medium |
| Enable CPU offloading | Variable | Easy |
| Use larger GPU | Solves it | $$ |
Quick fixes:
# Reduce batch size
BATCH_SIZE = 1 # Start small, increase until OOM
# Enable gradient checkpointing
model.gradient_checkpointing_enable()
# Use FP16
model = model.half()
# Use INT4 quantization
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
# CPU offloading (for diffusers)
pipe.enable_model_cpu_offload()
# Clear cache between batches
torch.cuda.empty_cache()
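If you want the run to recover on its own, a retry loop can catch the OOM and halve the batch size. This is a minimal sketch: it assumes a recent PyTorch (which exposes `torch.cuda.OutOfMemoryError`) and a `run_batch` function standing in for your own training or inference step:

```python
import torch

batch_size = 32
while batch_size >= 1:
    try:
        run_batch(batch_size)      # hypothetical: your training/inference step
        break
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()   # release cached blocks before retrying
        batch_size //= 2
        print(f"OOM, retrying with batch_size={batch_size}")
```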
Config fix:
{
// Upgrade to larger GPU
"gpu_type": "A100 PCIe 80GB", // Instead of RTX 4090
"min_vram": 80
}
Error messages:
Connection refused
Connection timed out
SSH connection failed
Failed to connect to daemon
Diagnosis:
# Check daemon status
gpu daemon status
# Check if daemon is running
ps aux | grep gpud
# Check daemon logs
gpu daemon logs --tail 20
Solutions:
| Cause | Solution |
|---|---|
| Daemon not running | gpu daemon start |
| Daemon crashed | gpu daemon restart |
| Wrong socket | Check GPU_DAEMON_SOCKET env var |
| Port conflict | Kill conflicting process |
Restart daemon:
gpu daemon stop
gpu daemon start
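If the daemon looks healthy but connections still fail, a quick check of the socket path can rule out a misconfigured environment (a sketch; GPU_DAEMON_SOCKET is the variable from the table above and may be unset if the CLI uses its default path):

```python
import os
import pathlib

sock = os.environ.get("GPU_DAEMON_SOCKET")
print("GPU_DAEMON_SOCKET =", sock)
if sock:
    print("socket exists:", pathlib.Path(sock).exists())
```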
Error messages:
Failed to create pod
No GPUs available
Insufficient resources
Provisioning timeout
Diagnosis:
# Check available GPUs
gpu machines list
# Check specific GPU availability
gpu machines list --gpu "RTX 4090"
Solutions:
| Cause | Solution |
|---|---|
| GPU type unavailable | Try different GPU type |
| Region full | Remove region constraint |
| Price too low | Increase max_price |
| Volume region mismatch | Use volume's region |
Config fixes:
{
// Use min_vram instead of exact GPU
"gpu_type": null,
"min_vram": 24, // Any GPU with 24GB+
// Or try different GPU
"gpu_type": "RTX A6000", // Alternative to RTX 4090
// Or relax region constraint
"region": null, // Any region
// Or increase price tolerance
"max_price": 2.0 // Allow up to $2/hr
}
Error messages:
rsync error
Sync failed
File not found
Permission denied during sync
Diagnosis:
# Check sync status
gpu sync status
# Check .gitignore
cat .gitignore
# Check outputs config
cat gpu.jsonc | grep outputs
Solutions:
| Cause | Solution |
|---|---|
| File too large | Add to .gitignore |
| Permission issue | Check file permissions |
| Path not in outputs | Add to outputs config |
| Disk full on pod | Increase workspace size |
Config fixes:
{
// Ensure outputs are configured
"outputs": ["output/", "results/", "models/"],
// Exclude large files
"exclude_outputs": ["*.tmp", "*.log", "checkpoints/"],
// Increase storage
"workspace_size_gb": 100
}
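Before re-running a large sync, it can help to measure what you are about to transfer (a sketch; adjust the directory names to match your own `outputs` config):

```python
import os

# Report the total size of each output directory so oversized artifacts
# (checkpoints, temp files) can be added to exclude_outputs first.
for root in ("output/", "results/", "models/"):
    total = 0
    for dirpath, _, files in os.walk(root):
        for name in files:
            total += os.path.getsize(os.path.join(dirpath, name))
    print(f"{root}: {total / 1e9:.2f} GB")
```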
Error messages:
Model not found
Could not load model
Safetensors error
HuggingFace rate limit
Diagnosis:
# Check if model is downloading
# Look for download progress in job output
# Check HuggingFace cache on pod
gpu run ls -la ~/.cache/huggingface/hub/
Solutions:
| Cause | Solution |
|---|---|
| Model not downloaded | Add to download spec |
| Wrong model path | Fix path in code |
| HF rate limit | Set HF_TOKEN |
| Network issue | Retry with timeout |
| Gated model | Accept license on HF |
Config fixes:
{
// Pre-download models
"download": [
{ "strategy": "hf", "source": "meta-llama/Llama-3.1-8B-Instruct", "timeout": 7200 }
],
// Set HF token in environment
"environment": {
"shell": {
"steps": [
{ "run": "echo 'HF_TOKEN=your_token' >> ~/.bashrc" }
]
}
}
}
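In code, the token can be read from the environment rather than hard-coded. A minimal sketch, assuming HF_TOKEN was exported as shown above and a recent transformers version that accepts the `token` argument:

```python
import os
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    token=os.environ.get("HF_TOKEN"),  # required for gated models once the license is accepted
)
```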
Symptoms:
gpu status shows job running forever
Diagnosis:
# Check if process is actually running
gpu run ps aux | grep python
# Check for infinite loops in logs
gpu jobs logs <job-id> --tail 100
# Check VRAM (might be swapping)
gpu run nvidia-smi
Solutions:
| Cause | Solution |
|---|---|
| Infinite loop | Fix code logic |
| Waiting for input | Make script non-interactive |
| VRAM thrashing | Reduce memory usage |
| Deadlock | Add timeout |
| Network wait | Add timeout to requests |
Code fixes:
# Use robust loading options (a CPU-side OOM during load can look like a hang)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    low_cpu_mem_usage=True  # Prevent CPU OOM while loading weights
)
# Add timeout to HTTP requests
import requests
response = requests.get(url, timeout=30)
# Add progress bars to see activity
from tqdm import tqdm
for item in tqdm(items):
    process(item)
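For jobs that should never run open-ended, a coarse watchdog can force an exit instead of hanging forever. This sketch uses the standard library `signal` module (Unix-only); the 2-hour limit and the `main()` entry point are placeholders for your own values:

```python
import signal
import sys

def _timeout_handler(signum, frame):
    print("Watchdog fired: job exceeded time limit, exiting")
    sys.exit(1)

signal.signal(signal.SIGALRM, _timeout_handler)
signal.alarm(2 * 60 * 60)   # abort after 2 hours

main()                      # hypothetical: your actual entry point
signal.alarm(0)             # cancel the alarm on normal completion
```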
Common exit codes:
| Code | Meaning | Common Cause |
|---|---|---|
| 0 | Success | - |
| 1 | General error | Script exception |
| 2 | Misuse of command | Bad arguments |
| 126 | Permission denied | Script not executable |
| 127 | Command not found | Missing binary |
| 137 | Killed (OOM) | Out of memory |
| 139 | Segfault | Bad memory access |
| 143 | Terminated | Killed by signal |
Diagnosis:
# Check last job exit code
gpu jobs list --limit 1
# Get full output including error
gpu jobs logs <job-id>
Solutions:
| Exit Code | Solution |
|---|---|
| 1 | Check Python traceback, fix exception |
| 126 | chmod +x script.sh |
| 127 | Install missing package |
| 137 | Reduce memory usage, bigger GPU |
| 139 | Update PyTorch/CUDA versions |
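Exit codes above 128 encode the terminating signal (128 + signal number), which is how 137 maps to SIGKILL. A small decoding sketch:

```python
import signal

def describe_exit_code(code: int) -> str:
    # Codes above 128 mean the process was killed by signal (code - 128);
    # raises ValueError if the number is not a known signal.
    if code > 128:
        sig = signal.Signals(code - 128)
        return f"killed by {sig.name} (SIGKILL usually means the OOM killer)"
    return f"exited with status {code}"

print(describe_exit_code(137))  # killed by SIGKILL ...
```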
Error messages:
CUDA error: no kernel image is available
CUDA version mismatch
Torch not compiled with CUDA enabled
Diagnosis:
# Check CUDA version on pod
gpu run nvcc --version
gpu run nvidia-smi | head -3
# Check PyTorch CUDA version
gpu run python -c "import torch; print(torch.version.cuda)"
Solutions:
{
// Use a known-good base image
"environment": {
"base_image": "runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04"
}
}
Or in requirements.txt:
# Install PyTorch with specific CUDA version
--extra-index-url https://download.pytorch.org/whl/cu124
torch==2.4.0
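After switching images or wheels, a quick sanity check on the pod (sketch) confirms the installed torch build actually matches the driver and GPU:

```python
import torch

print("torch:", torch.__version__, "built for CUDA", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # "no kernel image is available" usually means the wheel was not built
    # for this GPU's compute capability
    print("compute capability:", torch.cuda.get_device_capability(0))
```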
Add this to your projects for better error info:
#!/usr/bin/env python3
"""Wrapper script with debugging info."""
import sys
import traceback

import torch


def print_system_info():
    """Print system info for debugging."""
    print("=" * 50)
    print("SYSTEM INFO")
    print("=" * 50)
    print(f"Python: {sys.version}")
    print(f"PyTorch: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"CUDA version: {torch.version.cuda}")
        print(f"GPU: {torch.cuda.get_device_name(0)}")
        print(f"VRAM total: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
        print(f"VRAM allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
    print("=" * 50)


def main():
    # Your actual code here
    pass


if __name__ == "__main__":
    print_system_info()
    try:
        main()
    except Exception as e:
        print("\n" + "=" * 50)
        print("ERROR OCCURRED")
        print("=" * 50)
        print(f"Error type: {type(e).__name__}")
        print(f"Error message: {str(e)}")
        print("\nFull traceback:")
        traceback.print_exc()
        # Print memory info for OOM debugging
        if torch.cuda.is_available():
            print(f"\nVRAM at error: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
        sys.exit(1)
| Error Contains | Likely Cause | Quick Fix |
|---|---|---|
| CUDA out of memory | OOM | Reduce batch size, use quantization |
| Connection refused | Daemon down | gpu daemon restart |
| No GPUs available | Supply shortage | Try different GPU type |
| Model not found | Not downloaded | Add to download spec |
| Permission denied | File permissions | chmod +x or check path |
| Killed | OOM (exit 137) | Bigger GPU |
| Timeout | Network/hanging | Add timeout, check code |
| CUDA version | Version mismatch | Use compatible base image |
| rsync error | Sync issue | Check .gitignore, outputs |
| rate limit | HuggingFace limit | Set HF_TOKEN |
When debugging, report your findings in this format:
## Debug Analysis
### Error Identified
**Type**: [OOM/Network/Model/Sync/etc.]
**Message**: `[exact error message]`
### Root Cause
[Explanation of why this happened]
### Solution
**Option 1** (Recommended): [solution]
```[code/config change]```
**Option 2** (Alternative): [solution]
```[code/config change]```
### Prevention
To avoid this in the future:
1. [Prevention tip 1]
2. [Prevention tip 2]