From nexus-agents
Manages physical server infrastructure and bare metal systems with iDRAC/iLO/IPMI access, SSH connectivity checks, boot time estimation, and hardware health monitoring.
npx claudepluginhub williamzujkowski/nexus-agents
Manages physical and SBC infrastructure with awareness of hardware boot times, access hierarchies, and out-of-band management capabilities.
Always maintain at least two working access methods per host.
For each managed host, check access:
# SSH connectivity check (2s timeout)
ssh -o ConnectTimeout=2 -o BatchMode=yes USER@HOST "echo ok" 2>&1
# Check SSH via password (if key fails)
# NOTE: sshpass usage requires explicit user approval
# Check if OOB/iDRAC is reachable
curl -sk --connect-timeout 5 https://IDRAC_IP/data?get=pwState 2>&1 || echo "iDRAC unreachable"
# IPMI ping check
ipmitool -I lanplus -H IPMI_IP -U root -P PASSWORD power status 2>&1
Report format:
Host: hostname (IP)
SSH Key: OK | FAIL (reason)
SSH Pass: OK | FAIL | NOT_TESTED
OOB: OK (iDRAC6/iLO4/IPMI) | UNREACHABLE
Boot Time: ~30s (SBC) | ~3min (desktop) | ~10min (enterprise)
Status: HEALTHY | DEGRADED | UNREACHABLE
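The status field can be derived mechanically from the individual check results. A minimal sketch (the function name and OK/FAIL convention are illustrative, not part of any existing tooling):

```shell
# Classify a host from its SSH and OOB check results.
# ssh_ok / oob_ok are "OK" or "FAIL", taken from the checks above.
host_status() {
  ssh_ok=$1; oob_ok=$2
  if [ "$ssh_ok" = "OK" ] && [ "$oob_ok" = "OK" ]; then
    echo HEALTHY
  elif [ "$ssh_ok" = "OK" ] || [ "$oob_ok" = "OK" ]; then
    echo DEGRADED      # only one access path works: two-method rule violated
  else
    echo UNREACHABLE
  fi
}
```

Note that DEGRADED covers both directions: SSH up with OOB down still violates the two-access-methods rule and warrants a finding.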
Query available health data from each host:
# Temperature (via SSH; sysfs thermal_zone values are in millidegrees C)
ssh HOST "cat /sys/class/thermal/thermal_zone*/temp 2>/dev/null || sensors 2>/dev/null"
# Disk health
ssh HOST "df -h && smartctl -a /dev/sda 2>/dev/null | grep -E 'Health|Temperature|Reallocated'"
# Memory
ssh HOST "free -h"
# Uptime and load
ssh HOST "uptime"
# Docker status (if applicable)
ssh HOST "docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}' 2>/dev/null"
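The raw sysfs temperature values need converting before they are report-ready. A small helper sketch (the 70 C alert threshold is an illustrative default, not something the skill mandates):

```shell
# Convert a raw thermal_zone reading (millidegrees C) to degrees
# and flag it against a threshold. Default threshold: 70000 (70 C).
temp_check() {
  raw=$1; limit=${2:-70000}
  c=$((raw / 1000))
  if [ "$raw" -ge "$limit" ]; then
    echo "${c}C HOT"
  else
    echo "${c}C OK"
  fi
}
```

Usage: `temp_check "$(ssh HOST cat /sys/class/thermal/thermal_zone0/temp)"`.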
For iDRAC-equipped servers:
# Sensor readings via REST API
curl -sk --cookie "session_cookie" "https://IDRAC_IP/data?get=tempprobes"
curl -sk --cookie "session_cookie" "https://IDRAC_IP/data?get=fanstatus"
# System Event Log
ssh IDRAC_IP "racadm getsel" 2>/dev/null
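A raw SEL dump is noisy; filtering it down to actionable entries keeps the report readable. A sketch (severity keywords vary by BMC generation; these are common ones, verify against your firmware's output):

```shell
# Keep only actionable System Event Log entries from a racadm getsel dump.
# Reads the dump on stdin, prints matching lines.
sel_alerts() {
  grep -iE 'critical|warning|non-recoverable'
}
```

Usage: `ssh IDRAC_IP "racadm getsel" 2>/dev/null | sel_alerts`.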
When a host is unreachable, power cycle it via ipmitool power cycle (same -I lanplus options as the power status check above) or the iDRAC web/API, then allow for the expected boot time:

| Hardware Type | Expected Boot Time |
|---|---|
| Raspberry Pi / SBC | 30-60 seconds |
| Desktop / small server | 1-3 minutes |
| 1U/2U rack server (≤64GB) | 3-5 minutes |
| Enterprise server (128GB+) | 8-15 minutes |
| High-memory (512GB+) | 12-20 minutes |
Do NOT declare a server failed until at least 2x the expected boot time has passed.
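The table and the 2x rule can be encoded directly, so a monitoring loop never flags a server early. A sketch (class names and the choice of each class's upper bound are illustrative):

```shell
# Expected boot time per hardware class, in seconds
# (upper bound of each row in the table above).
expected_boot() {
  case $1 in
    sbc)        echo 60 ;;     # Raspberry Pi / SBC
    desktop)    echo 180 ;;    # desktop / small server
    rack)       echo 300 ;;    # 1U/2U rack server (<=64GB)
    enterprise) echo 900 ;;    # enterprise server (128GB+)
    highmem)    echo 1200 ;;   # high-memory (512GB+)
    *)          echo 300 ;;    # unknown: assume a mid-size server
  esac
}

# Do not declare failure before 2x the expected boot time.
fail_threshold() { echo $(( $(expected_boot "$1") * 2 )); }
```

So an enterprise server gets a 1800-second grace period before being declared failed.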
For SBC hosts (Raspberry Pi):
# SD card / filesystem errors
sudo dmesg | grep -i "mmc\|error\|read-only"
# Undervoltage and temperature
vcgencmd measure_volts
vcgencmd measure_temp
For enterprise servers:
# Software or hardware RAID status
ssh HOST "cat /proc/mdstat 2>/dev/null || megacli -LDInfo -Lall -aALL 2>/dev/null"
Produce a summary with:
## Infrastructure Status Report
### Hosts Summary
| Host | IP | SSH | OOB | Health | Boot Est. |
|------|-----|-----|-----|--------|-----------|
| ... | ... | ... | ... | ... | ... |
### Findings
- [CRITICAL] Host X unreachable via all methods
- [WARNING] Host Y disk SMART warning
- [INFO] Host Z uptime 45 days, consider updates
### Recommended Actions
1. ...
2. ...
For BOSH-managed infrastructure:
# Verify director health
source ~/deployments/bosh/env.sh
bosh env # Director reachable?
# Check all VMs running
bosh vms # All instances "running"?
# Check director processes
ssh -i <key> jumpbox@DIRECTOR_IP "sudo monit summary"
# Expected: nats, postgres, blobstore_nginx, director, workers, health_monitor, lxd_cpi
# Verify CredHub on director
curl -sk https://DIRECTOR_IP:8844/info # Should return JSON with app name "CredHub"
credhub find # Should list credentials
# BBR readiness
bbr director --host DIRECTOR_IP --username bbr --private-key-path bbr.pem pre-backup-check
After any bosh create-env or bosh deploy:
- monit summary on affected VMs shows all processes "running"
- bosh vms matches the expected instance count
- bosh run-errand smoke-tests passes, if available
- bbr pre-backup-check still passes

| Ops File | Depends On | Provides |
|---|---|---|
| credhub.yml | uaa.yml | CredHub on director (:8844) |
| uaa.yml | (base) | UAA on director (:8443) |
| bbr.yml | (base) | backup-and-restore-sdk |
| CPI ops (e.g., Incus) | (base) | VM lifecycle management |
CRITICAL: Including credhub.yml without uaa.yml causes CredHub to silently fail to start. Always check monit summary after ops file changes.
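Because the failure is silent, the monit check is worth automating. A sketch that diffs the monit summary against an expected process list (the list below is a subset of the director processes named earlier; extend it to match your ops files):

```shell
# Print expected director processes that monit does NOT show as running.
# $1 is the output of "monit summary"; an empty result means all good.
missing_procs() {
  summary=$1
  for p in nats postgres blobstore_nginx director credhub; do
    printf '%s\n' "$summary" | grep -q "$p.*running" || echo "$p"
  done
}
```

Usage: `missing_procs "$(ssh -i <key> jumpbox@DIRECTOR_IP 'sudo monit summary')"` — any process it prints is a deploy that needs investigating.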
Verify documentation against live system:
# VM count
bosh vms 2>/dev/null | grep -c running # Compare against README
# Service inventory
systemctl list-units --state=running --type=service | grep -E "podman|grafana|loki"
# Tool availability (verify before referencing in docs)
which terraform terragrunt make 2>/dev/null
# Network topology
ip -br addr show | grep -E "bond|vlan"
# Storage
zpool list; df -h /srv/nfs
Flag any discrepancies between docs and live output. Live system is always authoritative.
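The "live system is authoritative" rule reduces to a simple comparison once both counts are in hand. A sketch (how the documented number is extracted from the README is up to the doc format, so it is passed in as an argument here):

```shell
# Compare a documented VM count against the live count; live wins.
drift_check() {
  documented=$1; live=$2
  if [ "$documented" -eq "$live" ]; then
    echo "OK ($live VMs)"
  else
    echo "DRIFT: docs say $documented, live shows $live -- update docs"
  fi
}
```

Usage: `drift_check "$readme_count" "$(bosh vms 2>/dev/null | grep -c running)"`.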
| Excuse | Counter |
|---|---|
| "Skip the OOB check, just SSH" | OOB is the source of truth — SSH state can disagree with hardware reality (BMC firmware, fan curves, thermal). Verify both. |
| "It's just the homelab" | Per the UFW + Podman incident (March 2026) — homelab outages cost real time and cascade across services. Apply production discipline. |
| "I'll restart and see" | Restart-and-see destroys the diagnostic window. Capture state first (logs, dmesg, OOB sensor data), then restart if the issue allows. |
Rule: dmesg and journal output must be captured before any restart.
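The capture step can be a single helper run on the affected host before any restart. A sketch (the /tmp destination and file names are illustrative; redirect to persistent storage if the host may not survive the restart):

```shell
# Capture diagnostic state into a timestamped directory, then print its path.
# Run on the affected host BEFORE restarting; failures of individual
# collectors are tolerated so a partial capture still succeeds.
capture_state() {
  dir="/tmp/diag-$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$dir"
  dmesg > "$dir/dmesg.txt" 2>/dev/null || true
  journalctl -b --no-pager > "$dir/journal.txt" 2>/dev/null || true
  uptime > "$dir/uptime.txt" 2>/dev/null || true
  echo "$dir"
}
```

Only after the printed directory exists should restart-and-see be attempted.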