From agentic-ai-skills
Diagnoses cloud instance access issues and performs SSH/CLI operations on Linux servers, AWS EC2, and Aliyun ECS, including file transfers, service checks, logs, and blockers like bastions or MFA.
```shell
npx claudepluginhub agenticaiplan/agenticaiskills --plugin agentic-ai-skills
```

This skill uses the workspace's default tool permissions.
Use this skill to manage cloud servers and instances like a careful human operator **within the current authorization boundary**. Prefer SSH and official cloud CLIs. Do **not** try to bypass MFA, SSO, approval workflows, captchas, IP allowlists, bastions, or least-privilege controls.
Bundled files:

- `agents/openai.yaml`
- `references/aliyun.md`
- `references/aws.md`
- `references/generic-linux.md`
- `references/install-and-examples.md`
- `references/interrupted-run-recovery.md`
- `references/model-deployment.md`
- `references/restricted-access.md`
- `references/shared-runtime-hygiene.md`
- `scripts/artifact_check.py`
- `scripts/deployment_final_check.py`
- `scripts/normalize_result.py`
- `scripts/password_ssh.py`
- `scripts/preflight.py`
- `scripts/remote_port_owner.py`
- `scripts/ssh_probe.py`
This skill is optimized for a shared core workflow that can be reused in Codex, OpenClaw, and Claude Code. Codex-specific UI metadata lives in agents/openai.yaml; the operational logic stays in this SKILL.md plus the bundled references and scripts.
Connection paths may be direct SSH, SSH via ProxyJump, or a bastion.

Normalize each request into these five inputs before acting: cloud, action, target, credential profile, and region.
Treat deploy, single-run, serve, accept, and cloud-op as explicit-action modes. Do not enter them from a vague “take a look at my machine” style request without confirming intent from the user prompt or environment.
If an input is missing, infer it from the environment first. Ask the user only for the minimum missing detail.
Run a lightweight local check before attempting privileged or remote work.
Typical commands:
```shell
python3 scripts/preflight.py --cloud generic --action inspect --host 10.0.0.8 --check-port
python3 scripts/preflight.py --cloud aws --action instance-check --profile prod --region ap-southeast-1
python3 scripts/preflight.py --cloud aliyun --action cloud-op --profile default --region cn-hangzhou
```
Preflight should verify only what is safe to verify locally, and classify the initial result as `ok`, `blocked`, `denied`, or `needs_user`.

Use the narrowest command that answers the request.
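As one illustration, the port-reachability part of a local preflight can be sketched in Python. This is a minimal sketch; the bundled `scripts/preflight.py` may implement the check differently.

```python
import socket

def probe_tcp(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify TCP reachability using the skill's status vocabulary.

    Returns "ok" when a connection succeeds and "blocked" when the
    port is closed, filtered, or the host is unreachable.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError:  # covers refused, timed out, and unreachable
        return "blocked"
```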
Useful wrappers:
```shell
python3 scripts/ssh_probe.py --host 10.0.0.8 --user ec2-user --identity-file ~/.ssh/prod.pem
python3 scripts/remote_port_owner.py --port 8080
```
If the environment requires human involvement, stop and return a structured handoff instead of guessing.
Common stop conditions:
Return a stable contract for both success and blocked cases so the result can be reused in later handoff, approval, or postmortem steps.
Minimum success shape:
```json
{
  "status": "ok",
  "action": "inspect",
  "target": "10.0.0.8",
  "evidence": {
    "network": {"10.0.0.8:22": "open"}
  },
  "next_step": "Proceed with the requested read-only inspection."
}
```
When blocked, return a concrete next step instead of guessing.
Expected shape:
```json
{
  "status": "blocked",
  "reason": "Target port 22 is unreachable from the current machine.",
  "next_step": "Connect through the approved bastion or ask for the current IP to be allowlisted.",
  "evidence": {
    "network": {"10.0.0.8:22": "closed"}
  }
}
```
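The contract can be enforced with a small normalizer. This is a sketch under the assumption that these are the only required keys; the bundled `scripts/normalize_result.py` is the authoritative implementation.

```python
ALLOWED_STATUSES = {"ok", "blocked", "denied", "needs_user", "unsupported"}

def normalize_result(status, action, target, evidence=None,
                     reason=None, next_step=None):
    """Build a result dict that matches the shared status contract."""
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    result = {"status": status, "action": action, "target": target,
              "evidence": evidence or {}}
    if status != "ok":
        # Non-ok results must explain themselves so the handoff is actionable.
        if not (reason and next_step):
            raise ValueError("non-ok results need a reason and a next_step")
        result["reason"] = reason
    if next_step:
        result["next_step"] = next_step
    return result
```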
Allowed status values: `ok`, `blocked`, `denied`, `needs_user`, `unsupported`.

Use this workflow whenever the task is “deploy a model / bring up an API / run acceptance checks against the task prompt / verify output files” instead of plain server ops.
Always verify the underlying model or command once before wrapping it in an API. Do not start with service orchestration unless the user explicitly asks for service-first diagnosis.
Verify the `curl` command or request shape from the task prompt.

Do not claim “deployed successfully” until all applicable checks pass:
Use the bundled scripts when helpful:
```shell
curl -fsS http://127.0.0.1:8080/docs
python3 scripts/artifact_check.py --path /data/exam/output.wav --kind audio
python3 scripts/deployment_final_check.py --url http://127.0.0.1:8080/health --repeat 2
```
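The repeat-check idea behind `scripts/deployment_final_check.py` can be sketched like this. It is a sketch, not the bundled implementation; the injectable `fetch` parameter exists only for illustration and testing.

```python
import time
import urllib.request

def health_ok(url, repeat=2, delay=1.0, fetch=None):
    """Return True only if every one of `repeat` probes answers HTTP 200."""
    if fetch is None:
        fetch = lambda u: urllib.request.urlopen(u, timeout=5).status
    for attempt in range(repeat):
        try:
            if fetch(url) != 200:
                return False
        except OSError:  # URLError/HTTPError are OSError subclasses
            return False
        if attempt < repeat - 1:
            time.sleep(delay)  # spaced probes catch flapping services
    return True
```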
In exam, benchmark, and shared-machine environments, assume the host may already have leftovers.
Before reusing a port, GPU, or task directory:
Never assume “service started” means “new service is handling requests”. Always verify the response path.
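Before binding a port on a shared host, a quick local check avoids fighting a leftover service. This is a simplified sketch; `scripts/remote_port_owner.py` additionally reports the owning process, which this version does not.

```python
import socket

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is currently bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))  # a successful bind means the port is free
            return True
        except OSError:           # EADDRINUSE: something is already there
            return False
```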
Prefer one environment per task (venv or conda). If isolation is impossible:
Password SSH is common in exam or public-IP environments. Use it carefully:
- Use secure prompt input (`--prompt-password`) for the current task.
- Reuse sessions keyed by `thread_id` + `user@host:port` so later commands in the same thread do not re-prompt for the password.
- If the user exports a password variable, run a tiny visibility check first; if the current tool process still cannot read it, do not ask the user to repeat the export. Switch to secure prompt input or a temp file instead.
- If `expect` or a similar wrapper is still required, keep the wrapper as small as possible and keep logs password-free.

Recommended visibility check before relying on an env var:
```shell
if [ -n "$CLOUD_INSTANCE_PASSWORD" ]; then echo set; else echo missing; fi
```
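The same check can be done from Python before deciding between the env var and a secure prompt. A sketch; `CLOUD_INSTANCE_PASSWORD` is the variable name used in the shell check above.

```python
import os
import getpass

def get_instance_password(var: str = "CLOUD_INSTANCE_PASSWORD") -> str:
    """Prefer an already-exported variable; fall back to hidden prompt input."""
    password = os.environ.get(var)
    if password:
        return password
    # Do not ask the user to repeat the export; prompt securely instead.
    return getpass.getpass(f"{var} is not visible to this process; password: ")
```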
Recommended helper usage:
```shell
python3 scripts/password_ssh.py --host 10.0.0.8 --user root --port 22 --prompt-password --ensure-session
python3 scripts/password_ssh.py --host 10.0.0.8 --user root --port 22 --remote-command "hostname && whoami"
python3 scripts/password_ssh.py --host 10.0.0.8 --user root --port 22 --close-session
```
Fallback modes:
- `--password-file` or `--password-env`
- `--prompt-password` and `--ensure-session`
- `--no-reuse-session`
- cross-task session sharing only when `--session-namespace` is explicitly provided

When the task was interrupted, never restart blindly. First check the existing state: logs, PID files, and partial artifacts.
If artifacts and downloads are already present, prefer resume. If state is ambiguous or contaminated, clean and restart only the affected slice.
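The resume-or-restart decision can be reduced to a small state check. This sketch only inspects leftover files in the task directory; real recovery should also verify whether the recorded processes are still running.

```python
from pathlib import Path

def recovery_action(task_dir: str) -> str:
    """Decide how to recover from leftover state in the task directory."""
    d = Path(task_dir)
    if any(d.glob("*.pid")):
        return "inspect"   # a PID file exists: check the process before anything
    if any(d.glob("*.log")):
        return "resume"    # partial outputs exist: prefer resume
    return "restart"       # clean directory: safe to start fresh
```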
For longer remote tasks, prefer a predictable work layout such as:
- `01_single_run.py`
- `02_api.py`
- `03_start.sh`
- `04_selftest.sh`
- `*.log`
- `*.pid`

Keep logs, PID files, and outputs in one task directory so interrupted runs are easy to inspect.
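Scaffolding that layout up front keeps interrupted runs inspectable. A sketch, assuming empty placeholders are acceptable until each step is written:

```python
from pathlib import Path

SCAFFOLD = ["01_single_run.py", "02_api.py", "03_start.sh", "04_selftest.sh"]

def init_task_dir(root: str) -> Path:
    """Create the predictable per-task layout; logs and PID files land here too."""
    task = Path(root)
    task.mkdir(parents=True, exist_ok=True)
    for name in SCAFFOLD:
        path = task / name
        if not path.exists():
            path.touch()   # placeholder only; contents are task-specific
    return task
```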
Treat these as confirmation-required even if technically possible:
Load only the reference relevant to the current task:
- `references/generic-linux.md`
- `references/aws.md`
- `references/aliyun.md`
- `references/restricted-access.md`
- `references/install-and-examples.md`
- `references/model-deployment.md`
- `references/shared-runtime-hygiene.md`
- `references/interrupted-run-recovery.md`

Bundled scripts:

- `scripts/preflight.py`: local preflight checks and initial status classification
- `scripts/ssh_probe.py`: non-destructive SSH connectivity/auth probe with optional bastion
- `scripts/password_ssh.py`: password-based SSH helper with secure prompt input, optional thread-scoped session reuse, and safer defaults for host key checking
- `scripts/normalize_result.py`: normalize command outcomes into the shared status contract
- `scripts/remote_port_owner.py`: inspect which local process is listening on a port
- `scripts/artifact_check.py`: verify output artifacts exist and look reasonable
- `scripts/deployment_final_check.py`: verify health/example endpoints and optional artifacts with repeat checks