Skill

proxmox

Proxmox VE administration: VM/LXC/OCI container provisioning, storage backends, networking/SDN, clustering, high availability, API automation, cloud-init templates, backups/PBS, PCIe passthrough, and vGPU. Invoke whenever a task involves any interaction with Proxmox VE: configuring hosts, managing guests, designing storage or networking, writing automation scripts, planning clusters, troubleshooting, or reviewing PVE configurations.

npx claudepluginhub xobotyi/cc-foundry --plugin infrastructure

Slash command

/infrastructure:proxmox

Supporting Files

references/api-automation.md
references/backup-strategies.md
references/clustering-and-ha.md
references/networking.md
references/storage-backends.md
references/vm-and-lxc.md

SKILL.md

470 lines · ~5.8k tokens (exceeds 5k compaction limit)

Stats

Language: JavaScript
Parent stars: 11
Maintenance: Good
Last commit: Mar 14, 2026

Proxmox VE

Production infrastructure demands production discipline. Every Proxmox configuration must be secure by default, redundant where it matters, and automated where possible.

Route to Reference

VM, LXC, and OCI container management — [${CLAUDE_SKILL_DIR}/references/vm-and-lxc.md]: VM vs LXC vs OCI comparison, OCI container support (PVE 9.1 tech preview), Docker-on-Proxmox decision guidance, configuration options, template workflows, linked vs full clone trade-offs
Storage backends — [${CLAUDE_SKILL_DIR}/references/storage-backends.md]: Backend capability matrix, ZFS tuning (ARC/L2ARC/SLOG/volblocksize), Ceph configuration, LVM-Thin monitoring, storage selection decision tree
Networking — [${CLAUDE_SKILL_DIR}/references/networking.md]: Bridge configuration, VLAN layout, bonding modes, SDN zones (VXLAN/EVPN), MTU considerations, OVS vs Linux bridge, firewall lockout prevention
Clustering and HA — [${CLAUDE_SKILL_DIR}/references/clustering-and-ha.md]: Corosync configuration, quorum math, QDevice setup, split-brain prevention, fencing methods, HA groups, migration, quorum loss recovery
API and automation — [${CLAUDE_SKILL_DIR}/references/api-automation.md]: REST API architecture, pvesh/qm/pct reference, Terraform patterns, cloud-init customization (cicustom, network v2), hookscript lifecycle, CI/CD pipelines
Backup strategies — [${CLAUDE_SKILL_DIR}/references/backup-strategies.md]: vzdump modes, PBS architecture, encryption key management, garbage collection safety, verification jobs, retention policies, off-site sync patterns

Guest Management

VM vs LXC vs OCI Decision

Default to LXC for trusted Linux workloads — near-zero overhead, high density
Use VMs when the workload requires: non-Linux OS, full kernel isolation, PCIe passthrough, or live migration
Use OCI containers (PVE 9.1+, tech preview) for single-purpose microservices from Docker Hub/GHCR — lightweight deployment without a Docker VM. Not suitable for multi-container stacks or workloads requiring Docker Compose
Use a Docker VM for full Docker/Compose/Kubernetes workflows, multi-container stacks, or advanced networking (macvlan, overlay). Still the most flexible and compatible option for container-heavy workloads
Use unprivileged containers (default): they map container UID 0 to an unprivileged host UID, so a process that escapes the container holds no root privileges on the host
Only use privileged containers when unprivileged mode is incompatible (specific device access, certain NFS mounts)
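
The decision rules above can be sketched as a small function. This is only an illustration: the `Workload` fields and the `choose_guest_type` name are hypothetical, and treating an untrusted workload as a VM case is an assumption derived from the "full kernel isolation" rule rather than something stated explicitly.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical workload description; field names are illustrative only."""
    linux: bool = True
    trusted: bool = True
    needs_kernel_isolation: bool = False
    needs_pcie_passthrough: bool = False
    needs_live_migration: bool = False
    needs_compose_or_k8s: bool = False
    single_oci_image: bool = False


def choose_guest_type(w: Workload) -> str:
    """Return 'docker-vm', 'vm', 'oci', or 'lxc' following the rules above."""
    # Full Docker/Compose/Kubernetes workflows go into a dedicated Docker VM.
    if w.needs_compose_or_k8s:
        return "docker-vm"
    # Any hard VM requirement wins over density considerations.
    # Assumption: untrusted workloads are treated as needing kernel isolation.
    if (not w.linux or not w.trusted or w.needs_kernel_isolation
            or w.needs_pcie_passthrough or w.needs_live_migration):
        return "vm"
    # Single-purpose images can run as OCI containers (PVE 9.1+, tech preview).
    if w.single_oci_image:
        return "oci"
    # Default: unprivileged LXC for trusted Linux workloads.
    return "lxc"
```

The order of the checks encodes the priority: a Compose stack that also needs passthrough still ends up in a VM, just one that runs Docker.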

VM Configuration

Use VirtIO drivers for disk and network — mandatory for production performance
Enable QEMU Guest Agent inside every VM for graceful shutdown, snapshot consistency, and IP address reporting
Use OVMF (UEFI) + Q35 machine type for PCIe passthrough and Secure Boot
Set CPU type to host for maximum performance in homogeneous clusters; use x86-64-v2-AES or the lowest common CPU generation when live migration across different CPU generations is required
Enable memory ballooning for dynamic RAM management
Enable NUMA topology for VMs with many cores on multi-socket hosts

Container Configuration

Set explicit memory limits — containers without limits can exhaust host RAM
Enable nesting (features: nesting=1) only when required (Docker inside LXC)
For Docker workloads in LXC: unprivileged + nesting=1 + keyctl=1 — but note this is unsupported and can break on host updates (CVE-2025-52881 broke Docker-in-LXC setups; workaround: lxc.apparmor.profile: unconfined)
Use bind mounts to share host directories, not NFS/CIFS mounts inside the container
PVE 9.0 removed cgroup v1 entirely — containers requiring cgroup v1 must move to VMs

OCI Application Containers (PVE 9.1 — Tech Preview)

Pull OCI images from Docker Hub/GHCR/Quay and run as LXC containers — no Docker engine required
Limitations: no in-place updates, no Docker Compose, no orchestration, no shell in most containers
Use for: single-purpose lightweight services; use a Docker VM for multi-container stacks
See [${CLAUDE_SKILL_DIR}/references/vm-and-lxc.md] for full OCI details, Docker-on-Proxmox decision guide, and Docker VM best practices

Templates and Cloning

Build templates with the guest agent and cloud-init pre-installed
Convert to template: qm template <vmid> for VMs, pct template <vmid> for containers
Use linked clones for development/ephemeral workloads (fast, shared base disk)
Use full clones for production (independent, no template dependency)

Storage

Backend Selection

Local redundancy, data integrity, snapshots → ZFS
Local snapshots/clones without ZFS overhead → LVM-Thin
Shared storage for HA clusters (3+ nodes) → Ceph
Simple shared storage (existing NAS/SAN) → NFS or iSCSI
Deduplicated backups → PBS

ZFS Rules

Use HBA in IT mode — never hardware RAID controllers with write-back cache
ARC sizing: cap explicitly in /etc/modprobe.d/zfs.conf; 25-30% of host RAM for mixed workloads. PVE 8.1+ defaults to 10% (max 16GB) for new installs. Rule of thumb: 2GB base + 1GB per TB of storage
Set ashift=12 for 4K sector disks (most modern drives); an ashift that is too small forces read-modify-write cycles on sub-sector writes and badly hurts write performance. ashift is fixed per vdev at creation
Volblocksize (VMs on zvols): 16k for mirrors/RAID10; increase for wide RAIDZ to reduce write amplification. Can only be set at zvol creation
Recordsize (containers on datasets): tune per workload — 8k for Postgres, 16k for MariaDB, 1M for large sequential files
Enable compression=lz4 — can actually increase I/O performance by writing less data to disk
SLOG: 8-32GB enterprise NVMe with power-loss protection; only holds ~5s of write data. Benefits sync-heavy workloads (databases, NFS)
L2ARC: only if ARC hit ratio is low and adding RAM is not possible; size 5-20x RAM; budget 1GB RAM per 50GB of L2ARC for metadata
Schedule regular scrubs (weekly or monthly)
Plan capacity upfront — ZFS pools cannot be shrunk
Never create swap on a ZFS zvol — causes blocking I/O during backups
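
As a rough illustration of the ARC sizing rule above, the sketch below combines the "2GB base + 1GB per TB" rule of thumb with a cap expressed as a fraction of host RAM, then renders the matching line for /etc/modprobe.d/zfs.conf. The function names and the choice to take the smaller of the two values are assumptions made for this example; `zfs_arc_max` is the ZFS module parameter that takes the limit in bytes.

```python
GIB = 1024 ** 3


def arc_max_bytes(pool_tb: float, host_ram_gib: float,
                  ram_fraction_cap: float = 0.30) -> int:
    """Suggest an ARC limit in bytes.

    Rule of thumb from the text: 2 GiB base + 1 GiB per TB of pool storage.
    Assumption: never exceed ram_fraction_cap of host RAM (25-30% for mixed
    workloads), so the smaller of the two values is used.
    """
    rule_of_thumb_gib = 2 + pool_tb
    cap_gib = host_ram_gib * ram_fraction_cap
    return int(min(rule_of_thumb_gib, cap_gib) * GIB)


def modprobe_line(arc_bytes: int) -> str:
    """Render the line to place in /etc/modprobe.d/zfs.conf."""
    return f"options zfs zfs_arc_max={arc_bytes}"


if __name__ == "__main__":
    # 8 TB pool on a 64 GiB host: the rule of thumb (10 GiB) is below the cap.
    print(modprobe_line(arc_max_bytes(pool_tb=8, host_ram_gib=64)))
```

A new limit in zfs.conf takes effect when the module is next loaded; on hosts that boot from ZFS this usually means refreshing the initramfs and rebooting.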

Ceph Rules

Minimum 3 nodes for proper quorum and data distribution
Dedicated 10GbE+ network for Ceph traffic (25GbE+ recommended for NVMe OSDs), separate from Corosync — Ceph rebalance traffic saturates links and causes Corosync instability
Use SSD/NVMe for OSD WAL/DB
Dedicated disks for OSDs — never share with the host OS
PVE 9.0 defaults to Ceph Squid (v19.2)

Storage Anti-Patterns

Running ZFS or Ceph on top of hardware RAID with write-back cache — defeats data integrity guarantees
Overprovisioning LVM-Thin without monitoring — a full thin pool causes I/O errors for all guests on that pool
Storing backups on the same physical disks as production data
Letting ZFS ARC consume unbounded RAM — causes VM crashes when the hypervisor and VMs compete for memory

Networking

Core Model

Every guest connects to a Linux bridge. Use VLAN-aware bridges (single bridge with 802.1Q tagging) instead of per-VLAN bridges. The VLAN-aware checkbox must be explicitly enabled — it is off by default.

VLAN Best Practices

Place the management interface on a dedicated VLAN — never share with guest traffic
Configure trunk ports on physical switches for the Proxmox host — frames with VLAN IDs not allowed by the switch are silently dropped
Assign VLAN tags per guest in the network configuration
Verify no double-tagging mismatch between Proxmox and switch native VLAN

Traffic Separation

Management / Corosync — 1GbE (dedicated)
Ceph cluster + public — 10GbE (dedicated), 25GbE recommended
Migration — 10GbE (recommended)
Guest — depends on workload

Critical rule: Never combine Corosync traffic with high-bandwidth Ceph or migration traffic on a single 1GbE link. Corosync is latency-sensitive — network contention causes cluster instability and false fencing.

SDN

Use SDN (VXLAN zones) for overlay networking across nodes without physical switch changes. Use EVPN for advanced multi-tenant setups with BGP routing. SDN is fully supported and installed by default since PVE 8.1.

SDN gotchas:

VXLAN adds 50-byte header — set VNet MTU to 1450 (or 1370 with IPSEC)
SDN changes are staged, not live — click Apply at Datacenter level
DHCP requires a gateway configured on the subnet
Multiple EVPN exit nodes require disabling rp_filter in sysctl
OVS vs Linux bridge: OVS is automatically VLAN-aware and may resolve 10GbE throughput bottlenecks seen with native Linux bridge
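
The MTU figures in the first gotcha follow from simple subtraction. A minimal sketch, assuming a 1500-byte physical MTU by default; the 80-byte IPsec allowance is inferred from the two values quoted above (1450 - 1370), and the function name is hypothetical:

```python
VXLAN_OVERHEAD = 50   # bytes of VXLAN encapsulation, per the text above
IPSEC_OVERHEAD = 80   # inferred from 1450 - 1370; treat as an allowance


def vnet_mtu(physical_mtu: int = 1500, ipsec: bool = False) -> int:
    """Return the largest VNet MTU that fits inside the underlay MTU."""
    mtu = physical_mtu - VXLAN_OVERHEAD
    if ipsec:
        mtu -= IPSEC_OVERHEAD
    return mtu


if __name__ == "__main__":
    print(vnet_mtu())              # plain VXLAN on a 1500-byte underlay
    print(vnet_mtu(ipsec=True))    # VXLAN over IPsec
    print(vnet_mtu(9000))          # jumbo-frame underlay
```

The same arithmetic shows why raising the underlay to jumbo frames lets guests keep a full 1500-byte MTU inside the overlay.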

SDN-Firewall integration (PVE 8.3+): SDN automatically generates IPSets for VNets and IPAM-managed guests — use these in firewall rules for simplified maintenance. The nftables firewall can filter forwarded traffic at host and VNet levels (e.g., restrict SNAT or inter-zone traffic).

Fabrics (PVE 9.0+): Automated routing between cluster nodes using FRRouting with OpenFabric (IS-IS-based) or OSPF. Fabrics simplify underlay network configuration for Ceph full-mesh and EVPN/VXLAN deployments.

Clustering and High Availability

Cluster Rules

Use a dedicated network for Corosync — latency under 5ms required
Configure redundant Corosync links (up to 8 supported via Kronosnet) on separate physical networks
For 2-node clusters, deploy a QDevice on a third machine for quorum — a 2-node cluster without QDevice is a split-brain generator
QDevice is discouraged for odd-numbered clusters — it becomes a single point of failure due to (N-1) vote allocation
If using LACP bonds for Corosync, set bond-lacp-rate fast on both node and switch: the default slow rate can take up to 90s to fail over, longer than the ~60s after which the node is fenced
Avoid balance-rr, balance-xor, balance-tlb, balance-alb bond modes for Corosync — they cause asymmetric connectivity and mass fencing
Never join a node with existing VMs/containers to a cluster — start fresh
Update nodes one at a time — the LRM requests a service freeze from the CRM during updates; if both are updating, the watchdog fences the node
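
The quorum rules above come down to majority arithmetic. The sketch below is a simplified model, not Corosync's implementation: it assumes one vote per node, a QDevice that contributes 1 vote in even-sized clusters and N-1 votes in odd-sized ones (the allocation mentioned above), and hypothetical function names.

```python
def votes_needed(total_votes: int) -> int:
    """Strict majority of the expected votes."""
    return total_votes // 2 + 1


def qdevice_votes(nodes: int) -> int:
    """Votes a QDevice contributes: 1 for even clusters, N-1 for odd ones."""
    return 1 if nodes % 2 == 0 else nodes - 1


def tolerated_node_failures(nodes: int, qdevice: bool = False,
                            qdevice_up: bool = True) -> int:
    """How many nodes may fail while the remainder keeps quorum."""
    q = qdevice_votes(nodes) if qdevice else 0
    needed = votes_needed(nodes + q)
    # Votes from the QDevice only count while it is reachable.
    extra = q if qdevice_up else 0
    return max(0, nodes + extra - needed)


if __name__ == "__main__":
    for n, qd, up in [(2, False, True), (2, True, True),
                      (3, False, True), (3, True, False)]:
        print(n, qd, up, tolerated_node_failures(n, qd, up))
```

In this model a 3-node cluster with a QDevice that goes offline tolerates no node failure at all, while the same cluster without a QDevice tolerates one, which is the reason for the odd-cluster warning.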

HA Requirements

Shared or replicated storage accessible from all HA nodes
Working fencing mechanism — test before relying on HA
Minimum 3 quorum votes (3 nodes, or 2 nodes + QDevice)
Configure HA groups with node priorities for controlled failover

Fencing

Fencing guarantees a failed node is offline before its services restart elsewhere. HA without fencing is a data corruption risk — two nodes writing to shared storage simultaneously causes irrecoverable damage.

Verify watchdog status: ha-manager status
Test fencing by simulating node failure before going to production
Use the hardware watchdog when available by setting WATCHDOG_MODULE=iTCO_wdt in /etc/default/pve-ha-manager; the software watchdog (softdog) is the fallback

Quorum Loss Recovery

If the cluster loses quorum, pmxcfs becomes read-only — no VM operations are possible. Emergency recovery:

pvecm expected 1 forces single-node quorum — use only to restore vital guests or fix the quorum issue itself
Never make cluster changes (add/remove nodes, storage, guests) while expected votes are overridden

Migration

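Live migration (VMs only): fastest with shared/replicated storage; VMs on local disks can still be migrated online with qm migrate <vmid> <target> --online --with-local-disks, at the cost of copying the disks over the network. Expect a brief pause at cutover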
vGPU live migration (PVE 8.4+): VMs using NVIDIA vGPU (mediated devices) can now be live-migrated between nodes with compatible GPU hardware — previously required shutdown
VMs with full PCIe passthrough devices still cannot be live-migrated — use cluster-wide resource mappings for HA
Offline migration (VMs and containers): guest stops, data transfers, guest starts on target
Use a dedicated high-bandwidth network for migration traffic

API and Automation

Authentication

Use API tokens with privilege separation for all automation — never root credentials
Store token secrets in vault or environment variables — never in code
Use ticket authentication only for interactive or short-lived tools

CLI Tools

pvesh — direct REST API access from CLI
qm — VM lifecycle management
pct — container lifecycle management
ha-manager — HA resource management
pvecm — cluster management
pvesm — storage management

Terraform

Use API tokens with privilege separation — never root credentials
Use cloud-init templates as the base for Terraform-managed VMs — templates must have qemu-guest-agent installed or Terraform hangs on "still creating"
Store Terraform state remotely (GitLab HTTP backend, S3, Consul) — never local state for shared infrastructure
Use lifecycle { ignore_changes } for fields Proxmox modifies outside Terraform (e.g., disk size after manual resize)
CI/CD pipeline: validate -> plan (save artifact) -> apply (manual trigger)
Never commit API secrets to git — use CI/CD variables (TF_VAR_*)

Cloud-Init

Prepare a base VM with qemu-guest-agent and cloud-init installed
Add a Cloud-Init drive (IDE or SCSI CD-ROM)
Configure network, SSH keys, and user data via the Cloud-Init panel or API
Convert to template, deploy via linked clone
Use SSH key authentication — cloud-init password storage is less secure
cicustom for advanced needs: reference custom YAML snippets for user, network, and meta data from a snippets-capable storage
Store cicustom snippets on shared storage (CephFS) in clusters for HA
Windows templates: use Cloudbase-Init with configdrive2 format + Sysprep

Backups

Backup Rules

Use PBS for production — deduplication, incremental backups, verification, encryption
Use snapshot mode for VMs (crash-consistent, no downtime); zstd compression
Follow the 3-2-1 rule: 3 copies, 2 media types, 1 off-site
Retention: keep-daily=7,keep-weekly=4,keep-monthly=6,keep-yearly=1

PBS Security

Restrict PVE backup user/token to create-only access (no delete) on PBS
Separate PBS admin credentials from PVE access
Store encryption keys separately from the backed-up system — password manager + offline backup
Never disable gc-atime-safety-check; use dedicated remote users per sync job

Non-Negotiable

Test restores regularly. A backup that cannot be restored is worthless.
Monitor backup jobs. A silently failing backup is worse than no backup.
Document the restore procedure.

PCIe Passthrough

Requirements

CPU: VT-d (Intel) or AMD-Vi enabled in BIOS/UEFI
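IOMMU enabled in kernel: add intel_iommu=on to the kernel command line on Intel hosts running kernels older than 6.8; newer kernels and AMD hosts enable the IOMMU by default once it is switched on in firmware
VFIO modules loaded: vfio, vfio_iommu_type1, vfio_pci (vfio_virqfd is only needed on kernels older than 6.2)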
Dedicated IOMMU group for the passthrough device

Configuration

Use OVMF (UEFI) + Q35 machine type; SeaBIOS if GPU lacks UEFI ROM
Blacklist host driver or bind via vfio-pci IDs in /etc/modprobe.d/
Pass through all device functions — GPU requires video + audio; USB-C controllers must also be bound to vfio-pci
For GPU: x-vga=1 for primary, vga: none; output via physical monitor, dummy plug, or Looking Glass
VMs with full passthrough cannot be live-migrated — use resource mappings (/cluster/mapping/pci) for HA

GPU-Specific Issues

NVIDIA Error 43 (Windows): set CPU type to host, add options kvm ignore_msrs=1 in /etc/modprobe.d/kvm.conf
AMD reset bug (Vega, Polaris, some Navi): GPU fails to reset after VM shutdown, preventing reuse without host reboot. Fix: install vendor-reset kernel module for vendor-specific reset quirks. RDNA2+ generally unaffected
NVIDIA vGPU: officially supported since vGPU Software 18 on PVE. Requires valid NVIDIA entitlement. Ampere+ GPUs need SR-IOV enabled first via pve-nvidia-vgpu-helper. PVE 8.4+ supports live migration of vGPU VMs

Virtiofs Directory Passthrough (PVE 8.4+)

Host-to-guest file sharing via virtiofs — bypasses network filesystems, provides near-native performance. Linux guests support virtiofs natively; Windows guests require a guest driver. Use for workloads requiring frequent host-guest file exchange without the overhead of NFS/SMB.

LXC Device Passthrough

Since PVE 8.2, device passthrough for containers is configurable via the UI. Limited compared to VM passthrough — no full PCIe passthrough, but supports specific device access (GPU rendering, USB devices).

Troubleshooting

dmesg | grep -e DMAR -e IOMMU — verify IOMMU is enabled
pvesh get /nodes/{node}/hardware/pci --pci-class-blacklist "" — list groups
Multi-device IOMMU groups (B550/X570): pcie_acs_override=downstream,multifunction as workaround (not for production)
IOMMU groups can change between kernel major versions — verify after updates
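
To check whether a device sits alone in its IOMMU group, the groups can also be read directly from sysfs. A small sketch, assuming the usual Linux layout /sys/kernel/iommu_groups/<group>/devices/<pci-address>; the function names are hypothetical and the root directory is a parameter so the logic can be exercised against a test directory.

```python
from pathlib import Path
from typing import Dict, List


def iommu_groups(root: str = "/sys/kernel/iommu_groups") -> Dict[int, List[str]]:
    """Map each IOMMU group number to the sorted PCI addresses it contains."""
    groups: Dict[int, List[str]] = {}
    base = Path(root)
    if not base.is_dir():
        # IOMMU disabled or not a Linux host: report no groups.
        return groups
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        if group_dir.name.isdigit() and devices_dir.is_dir():
            groups[int(group_dir.name)] = sorted(
                p.name for p in devices_dir.iterdir())
    return groups


def group_peers(groups: Dict[int, List[str]], pci_addr: str) -> List[str]:
    """Return the other devices sharing an IOMMU group with pci_addr."""
    for devices in groups.values():
        if pci_addr in devices:
            return [d for d in devices if d != pci_addr]
    raise KeyError(pci_addr)


if __name__ == "__main__":
    for number, devices in sorted(iommu_groups().items()):
        print(number, " ".join(devices))
```

An empty result from `group_peers` means the device is isolated. Any peers listed (other than functions of the same card) would have to be passed through together or separated first.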

Security

Certificates

Replace self-signed certs with ACME (Let's Encrypt) — DNS-01 for nodes behind firewalls, HTTP-01 for internet-reachable
Trusted certificates required for reliable WebAuthn (FIDO2)

Firewall

Proxmox VE includes a distributed firewall (iptables-based, nftables opt-in since PVE 8.2) at datacenter, host, and guest levels. The nftables backend (PVE 8.3+) supports filtering forwarded traffic at host and VNet levels.

Enable selectively — datacenter first, then host, then per-interface. Disabled by default at all levels
Before enabling, create rules for management access (8006 web UI/API, 22 SSH, 3128 SPICE proxy) and keep an SSH session open as a safety net
Create an IPSet named management containing trusted admin IPs; the firewall auto-generates management access rules for it
Use security groups for reusable rule sets across VMs
Use ipfilter-net* IPSets per VM interface to prevent IP spoofing

User Management and RBAC

Grant permissions to groups, not individuals
Use resource pools to group related resources; assign permissions at pool level
Use realms for auth: PAM, LDAP, AD, OpenID Connect (Keycloak, Authentik)
Use privilege-separated API tokens for automation
Enforce 2FA (TOTP, YubiKey, WebAuthn) at realm level

Hardening

Use Enterprise repository for production
Restrict management interface access (dedicated VLAN, firewall rules)
Disable root SSH; use non-root with sudo
Do not run Docker or other services directly on the PVE host

Monitoring

Export metrics to InfluxDB/Graphite (Datacenter > Metric Server); visualize with Grafana (dashboard 10048)
Configure notification matchers for: backup failures (vzdump), fencing/HA events, replication failures
Enable ZED for ZFS errors, smartmontools for disk health, Ceph health monitoring
Notification targets: Sendmail, SMTP, Gotify, Webhooks (PVE 8.3+ — any HTTP endpoint with custom headers/body)

Anti-Patterns

Non-obvious traps where the "right" approach is counterintuitive:

Swap on ZFS zvol → partition a physical disk for swap; zvol swap causes blocking I/O during backups
LACP bonds for Corosync with default rate → set bond-lacp-rate fast (default 90s failover > 60s fence timeout)
Load-balancing bond modes for Corosync → use active-backup or LACP with fast rate; load-balancing modes cause asymmetric connectivity and mass fencing
PBS encryption key on backed-up system → store in password manager + offline backup; compromised host means compromised backups
Shared remote user for PBS sync jobs → dedicated remote user per sync job; shared users bypass gc-atime-safety-check

Application

When Configuring Proxmox

Apply conventions without narrating each rule; follow existing environment patterns
Security practices are defaults, not optional add-ons
Test changes in non-production when possible

When Reviewing Configuration

Check guest type matches workload (VM vs LXC vs OCI decision)
Verify storage backend matches workload characteristics (ZFS tuning, Ceph sizing, LVM-Thin monitoring)
Verify network separation: management, Corosync, Ceph, guest traffic on appropriate interfaces
Verify HA prerequisites: shared storage, fencing tested, quorum math correct, HA groups configured
Verify backup coverage: PBS for production, retention policy set, restores tested, encryption keys stored separately
Check security posture: API tokens with privsep, 2FA enforced, firewall enabled, management VLAN isolated
Check monitoring: metric export configured, notification matchers for backup/fencing/replication failures

When Writing Automation

Use pvesh or REST API — never screen-scrape the GUI
Use API tokens with privilege separation; handle task status polling
Use cloud-init templates; write idempotent scripts
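
Task status polling can be sketched as below. Most PVE API write operations return a task ID (UPID) and finish asynchronously; the task status endpoint reports `status` ("running" or "stopped") and, once stopped, an `exitstatus`. The `wait_for_task` name and the injected `get_status` callable are assumptions of this example, chosen so the HTTP call stays outside the function, and any exit status other than "OK" (including warnings) is treated as a failure here.

```python
import time
from typing import Callable, Dict


def wait_for_task(get_status: Callable[[], Dict[str, str]],
                  timeout_s: float = 300.0,
                  interval_s: float = 2.0,
                  sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll a task until it stops; raise if it fails or times out.

    get_status is expected to return the 'data' object of
    GET /nodes/{node}/tasks/{upid}/status.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status.get("status") == "stopped":
            exit_status = status.get("exitstatus", "")
            if exit_status != "OK":
                raise RuntimeError(f"task failed: {exit_status}")
            return
        if time.monotonic() >= deadline:
            raise TimeoutError("task did not finish in time")
        sleep(interval_s)
```

Injecting `get_status` and `sleep` keeps the loop testable without a cluster and lets the same function wrap pvesh, a REST client, or a mock.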

Integration

This skill provides Proxmox VE discipline alongside sibling skills in the infrastructure plugin:

devops — foundational discipline (IaC principles, change management, observability) that applies to all Proxmox work
networking — general network infrastructure (VLANs, firewalls, DNS) beyond PVE-specific networking covered here
ansible — configuration management for automating Proxmox host setup and post-provisioning
containers — Docker/Podman management inside PVE guests (VMs or LXC)

Proxmox Datacenter Manager (PDM 1.0): Centralized management for multiple independent PVE/PBS environments — aggregated views, cross-cluster live migration, EVPN configuration between clusters, centralized update overview. Written in Rust. Requires PVE 8.4+ / PBS 3.4+. Relevant for multi-site or large-scale deployments.

Critical Rules

HA without fencing is a data corruption risk — never enable HA without a tested fencing mechanism
Cap ZFS ARC explicitly — unbounded ARC causes VM OOM crashes
Dedicated Corosync network — never share with Ceph or migration traffic on 1GbE
Test restores regularly — a backup that cannot be restored is worthless
API tokens with privilege separation for all automation — never root credentials
Use VirtIO drivers for all production VM disks and network interfaces

Proxmox is infrastructure. Treat it with the same rigor as production code: version-controlled configuration, tested changes, monitored operations, documented decisions.
