From sjh-skills

Manages remote GPU clusters via the rca CLI: run commands, batch jobs, GPU/node inspection, file sync with mutagen. Use for training runs, remote exec, and status checks.

Install:

npx claudepluginhub jiahao-shao1/sjh-skills --plugin sjh-skills

This skill uses the workspace's default tool permissions.
Local control plane for remote GPU clusters: a Go daemon + rca CLI. All commands go through the daemon's Unix socket to a cluster-side Python agent.py over persistent SSH, ~0.1s/command latency.
When triggered, work through this checklist in order (a combined shell sketch follows the list):

1. Check rca is installed: run which rca. If not found → run the first-time install flow (below), then continue to step 2.
2. Run rca daemon status. If it reports daemon request failed → run rca daemon start.
3. Run rca nodes. If a node is dead → rca connect <node>. Still dead → re-establish whatever SSH tunneling / VPN / jump host you depend on, then retry. Check rca daemon logs -f for error detail.
4. Never write raw ssh commands — always use rca exec / rca batch.
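A minimal sketch of that checklist as a script — it assumes rca commands exit non-zero on failure and uses train as an example node name:

# Startup checklist, condensed (exit codes and node name are assumptions)
command -v rca >/dev/null || { echo "rca not installed — run the first-time install flow"; exit 1; }
rca daemon status || rca daemon start    # start the daemon if the status probe fails
rca nodes                                # inspect connection state; look for dead nodes
rca connect train                        # reconnect a dead node by name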
When which rca returns not found, execute in order:
cd <skill_dir>
make install # builds rca into ~/go/bin (requires Go 1.21+)
rca config init # migrates legacy ~/.config/remote-cluster-agent/*.md, or generates blank template
Then ask the user for each node's SSH command, and write them into ~/.config/rca/config.toml under [nodes.*]. Example:
[nodes.train]
ssh = "ssh gpu-train"
[nodes.eval]
ssh = "ssh -p 2222 gpu-eval"
dir = "/home/user/project"
agent_path = "/shared/.agent/agent.py"
Optional per-node overrides: dir (default working directory) and agent_path (where agent.py lives on that node). Both fall back to globals if unset.
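The global fallbacks presumably live at the top level of config.toml; the key names below are an assumption — confirm against the template rca config init generates:

# Assumed top-level defaults (verify key names in your generated template)
dir = "/home/user/project"
agent_path = "/shared/.agent/agent.py"

[nodes.train]
ssh = "ssh gpu-train"    # inherits dir and agent_path from the defaults above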
Finally start and verify:
rca daemon start
rca daemon status # confirm running
rca nodes # confirm nodes connected
rca agent check # if agent missing → rca agent deploy
rca exec -n train "nvidia-smi"
rca exec -n train -d /home/user/project "git pull"    # -d: working directory for this command
rca exec -t 600 -n train "python train.py"            # -t: per-command timeout (seconds)
# heredoc (bypasses shell escaping, recommended for multi-line or special chars)
rca exec --stdin -n train <<'EOF'
cd /home/user/project
python -c "import json; print(json.dumps({'k':'v \"q\"'}))"
EOF
rca batch "nvidia-smi | head -20"
rca batch -n train,eval "df -h /home"
rca batch --json "hostname" | jq -r '.results[] | "\(.node): \(.output)"'
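The jq filter above implies a payload shaped roughly like this — the envelope is inferred from the filter, not documented:

{"results": [
  {"node": "train", "output": "gpu-train\n"},
  {"node": "eval",  "output": "gpu-eval\n"}
]}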
rca nodes # current connection state
rca nodes --check # deep ping (with latency)
rca nodes --health # latency history (monitor-tracked)
rca connect train # manually reconnect a dead node
rca disconnect train # actively close a node's connection
rca cp moves files through the agent's JSON-Lines channel (base64-encoded, 50 MB/file limit). Works on any SSH setup — no separate SCP/rsync path needed.
rca cp train:/home/user/logs/train.log ./
rca cp ./config.yaml train:/home/user/project/config.yaml
rca cp -r train:/home/user/checkpoints/exp07 ./local-copy/
For very large files (multi-GB checkpoints), use shared filesystem or object storage — rca cp peaks around 2–3 MB/s.
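For example, staging a multi-GB checkpoint through the shared mount or object storage instead — the /shared path and bucket name are placeholders:

# Shared filesystem route (path illustrative)
rca exec -n train "cp /home/user/checkpoints/exp07/model.ckpt /shared/transfer/"
# Object storage route (bucket is a placeholder), then pull locally
rca exec -n train "aws s3 cp /home/user/checkpoints/exp07/model.ckpt s3://my-bucket/ckpts/"
aws s3 cp s3://my-bucket/ckpts/model.ckpt ./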
rca exec -s -n train "tail -f /var/log/train.log"    # -s: stream output live
rca exec -s -n train "python train.py"
Triggered when the user says "cluster status", "which node is free", "cluster health", "GPU usage". Use rca batch for parallel sampling:
rca nodes --check
rca batch "nvidia-smi --query-gpu=name,memory.used,memory.total,utilization.gpu --format=csv"
rca batch "df -h /home | tail -1"
rca batch "uptime && tmux ls 2>/dev/null | wc -l"
Summarize into a table and recommend the most idle node. See reference/cluster-health.md for detailed probe commands, parse rules, and report format.
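One way to mechanize "most idle" — rank nodes by mean GPU utilization from the --json output. This assumes the results[]/node/output shape inferred earlier:

# Lowest mean utilization first (JSON envelope is an assumption)
rca batch --json "nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits" \
  | jq -r '.results[] | [.node, (.output | split("\n") | map(select(length > 0) | tonumber) | add / length)] | @tsv' \
  | sort -t$'\t' -k2 -n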
rca exec is for executing commands (starting training, killing processes, checking process state, installing deps) — not reading file contents.
If mutagen real-time sync is configured, the cluster's outputs/ appears locally automatically. For logs, JSON results, CSV — use the native Read tool on the local path. That's ~20x faster than piping through the SSH connection.
Decision rule:
- Reading file contents (logs, JSON results, CSV under outputs/) → Read("outputs/...") locally.
- Running or killing processes, mutating cluster state → rca exec.
- Unsure where a file landed → ls outputs/ or Glob("outputs/**/<pattern>") first.

Typical mistake: rca exec -n train "cat /home/user/outputs/.../log.txt" to read a log. That file is already synced — just Read it (contrast below).
When rca nodes --check shows dead, or rca exec reports connect failed:
1. Run rca connect <node> once — most transient issues resolve here.
2. rca itself doesn't manage tunnels / VPN / jump hosts — re-establish those through your usual workflow (ssh config, tunnel script, corporate VPN, etc.), then retry step 1.
3. Still failing → check rca daemon logs -f for detail.
4. Report failures grouped by cause — don't just say "all nodes failed". Which nodes recovered? Which need tunnel re-establishment? Which look like SSH config issues? (A triage loop sketch follows.)
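When many nodes drop at once, a loop like this can triage — it assumes rca nodes prints one node per line with a dead status token:

# Reconnect everything reported dead (output format of `rca nodes` is assumed)
rca nodes | awk '/dead/ {print $1}' | while read -r n; do
  rca connect "$n" || echo "still dead: $n — check tunnel/VPN"
done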
When the user mentions "skill updated" or "redeploy", check mutagen session config:
mutagen sync list 2>/dev/null
For each session, verify:
- Sync mode should be one-way-replica. If it's one-way-safe or two-way-resolved, prompt to rebuild.
- .git should NOT be ignored (v0.4.0+ syncs .git to keep cluster-side git status clean).

Rebuild (only after user confirms):
mutagen sync list <name> --long # record current params first
mutagen sync terminate <name>
bash <skill_dir>/mutagen-setup.sh <ssh_host> <local_dir> <remote_dir> <session_name>
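For instance, rebuilding a session for the train node — the host alias, directories, and session name are placeholders drawn from the config example above:

# Illustrative arguments — reuse the params recorded via `mutagen sync list <name> --long`
bash <skill_dir>/mutagen-setup.sh gpu-train ./project /home/user/project project-sync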
Symptom: mutagen sync list shows Beta Connected: No; mutagen sync resume reports server magic number incorrect.
Root cause: the cluster container restarted and lost ~/.mutagen/ (container home is ephemeral). Mutagen's agent binary and staging data survive in a persistent path (typically mounted storage like /shared/.mutagen/), but the restart wiped the symlink that pointed ~/.mutagen at it.
Fix:
# Recreate the symlink on the cluster node
rca exec -n <node> "ln -sf /shared/.mutagen ~/.mutagen"
# Then resume locally
mutagen sync pause <session-name>
mutagen sync resume <session-name>
Prevention: have the container init script create this symlink on startup.
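For example, in the container entrypoint — the /shared/.mutagen location mirrors the fix above:

# container init: restore the persistent mutagen home before anything touches ~/.mutagen
mkdir -p /shared/.mutagen
ln -sfn /shared/.mutagen "$HOME/.mutagen"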
See reference/mutagen-troubleshooting.md for more failure modes.
Main config: ~/.config/rca/config.toml (edit via rca config edit).
First-time install flow is in "What to do when triggered → First-time install". Day-to-day, just keep the daemon running (rca daemon status).
rca agent check # show agent version per node
rca agent deploy # copy local agent.py (only where missing)
rca agent deploy --force # force overwrite
Note:
rca agent deploy uses direct SSH (bypasses the daemon) and iterates all nodes in config.toml serially. Disconnected nodes will block for SSH's default connect timeout (minutes). Workaround — use a temp config containing only live nodes:

cp ~/.config/rca/config.toml /tmp/rca_deploy.toml
# manually delete the [nodes.*] sections for dead nodes
rca --config /tmp/rca_deploy.toml agent deploy --force
rca daemon register # register launchd agent (requires App Management permission)
Without this, start the daemon manually with rca daemon start whenever you reboot.
| Symptom | Fix |
|---|---|
| daemon request failed | rca daemon start; if still failing, rca daemon logs -f |
| Node status=dead | rca connect <node>; if still dead, re-establish SSH tunneling |
| tunnel down | Reconnect through your usual workflow (ssh config, VPN, tunnel script) |
| agent missing | rca agent deploy |
| Command special chars | Use rca exec --stdin <<'EOF' ... EOF |
| daemon crash | rca daemon start (launchd auto-restarts if registered) |
Safety rules for commands on the cluster:
- Don't push unreviewed work to master/main — avoid shipping unreviewed code to teammates.
- pkill -f must use the bracket trick: pkill -f "[s]glang.launch_server", not pkill -f "sglang.launch_server" — because the SSH process command line contains the kill pattern, pkill -f would match SSH itself and tear the connection down. The same applies to pgrep -f, grep over process lists, etc.
- Background long-running commands with nohup ... & or tmux new-session -d, placing echo after the background command using ;:
# Correct: nohup background + echo outside
nohup python -m sglang.launch_server ... > /tmp/log 2>&1 & echo "PID=$!"
# Correct: tmux detach + echo outside
tmux new-session -d -s sglang "python -m sglang.launch_server ..."; echo "started"
# Wrong: echo inside tmux's && chain, never executes
tmux new-session -d -s sglang "python ... && echo started"
The daemon runs independently of Claude Code (Unix socket), so every session shares one connection pool. The CLI is standalone too — usable in shells, scripts, or cron. Cluster-side agent.py speaks JSON-Lines v2.1.0 (streaming, cancel, batch, file transfer).
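Because the CLI works without Claude Code, it also slots into cron — the schedule, binary path, and log file below are illustrative:

# Sample GPU utilization across all nodes every 10 minutes
*/10 * * * * $HOME/go/bin/rca batch "nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader" >> $HOME/gpu-usage.log 2>&1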
Upgrading from MCP v0.3.x: remote_bash(node, cmd) → rca exec -n node "cmd", and remote_bash_batch → rca batch. Legacy ~/.config/remote-cluster-agent/*.md markdown config auto-migrates on rca config init.