From sagemaker-ai
Generates issue reports for HyperPod (EKS/Slurm) clusters by collecting node diagnostics via SSM, storing in S3 for troubleshooting and AWS Support.
npx claudepluginhub awslabs/agent-plugins --plugin sagemaker-ai
This skill uses the workspace's default tool permissions.
Collect diagnostic logs from HyperPod cluster nodes via SSM, store results in S3. Supports both EKS and Slurm clusters with auto-detection. Uses the bundled `scripts/hyperpod_issue_report.py` for reliable parallel collection.
sagemaker:DescribeCluster, sagemaker:ListClusterNodes, ssm:StartSession, s3:PutObject, s3:GetObject, eks:DescribeCluster
s3:GetObject/s3:PutObject on the report bucket
Collect from the user:
Cluster name or ARN (e.g., arn:aws:sagemaker:us-west-2:123456789012:cluster/abc123)
S3 path (e.g., s3://bucket/prefix). If the user doesn't have a bucket, create one (e.g., s3://hyperpod-diagnostics-<account-id>-<region>)
aws sts get-caller-identity
aws sagemaker describe-cluster --cluster-name <name-or-arn> --region <region>
If the S3 bucket doesn't exist, create it:
aws s3 mb s3://<bucket-name> --region <region>
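When the user supplies a cluster ARN but no bucket, the suggested name s3://hyperpod-diagnostics-<account-id>-<region> can be derived from the ARN itself. The following is a minimal sketch, assuming the ARN follows the layout of the example above; `default_report_bucket` is a hypothetical helper and is not part of the bundled script.

```python
import re

# Assumed ARN layout, taken from the example ARN in this document:
# arn:aws:sagemaker:<region>:<account-id>:cluster/<cluster-id>
_CLUSTER_ARN = re.compile(
    r"^arn:aws[a-z-]*:sagemaker:(?P<region>[a-z0-9-]+):"
    r"(?P<account>\d{12}):cluster/[A-Za-z0-9-]+$"
)


def default_report_bucket(cluster_arn: str) -> str:
    """Hypothetical helper: build s3://hyperpod-diagnostics-<account-id>-<region>."""
    match = _CLUSTER_ARN.match(cluster_arn)
    if match is None:
        # A plain cluster name carries no account or region, so fall back to
        # `aws sts get-caller-identity` and the user's chosen region instead.
        raise ValueError(f"not a SageMaker cluster ARN: {cluster_arn!r}")
    return f"s3://hyperpod-diagnostics-{match['account']}-{match['region']}"
```

The returned value can be passed to `aws s3 mb` and later to `--s3-path`.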
For EKS clusters (check Orchestrator.Eks in describe-cluster output):
Ensure kubectl is installed (which kubectl). If missing, install it for the current platform.
Configure kubeconfig using the EKS cluster name from the describe-cluster response:
aws eks update-kubeconfig --name <eks-cluster-name> --region <region>
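The orchestrator check and the EKS cluster name lookup can be sketched against the parsed JSON from `aws sagemaker describe-cluster`. This assumes the response nests the EKS cluster ARN under `Orchestrator.Eks.ClusterArn` and that a missing `Orchestrator.Eks` key means Slurm; both are assumptions to verify against real output, and `detect_orchestrator` is a hypothetical name.

```python
from typing import Optional, Tuple


def detect_orchestrator(describe_cluster: dict) -> Tuple[str, Optional[str]]:
    """Return ("eks", <eks-cluster-name>) or ("slurm", None).

    Assumption: absence of Orchestrator.Eks is treated as a Slurm cluster.
    """
    eks = (describe_cluster.get("Orchestrator") or {}).get("Eks")
    if not eks:
        return ("slurm", None)
    # Assumed field name; EKS ARNs end in "cluster/<eks-cluster-name>".
    cluster_arn = eks.get("ClusterArn", "")
    name = cluster_arn.rsplit("/", 1)[-1] if "/" in cluster_arn else None
    return ("eks", name)
```

For an EKS result, the second element is the value to pass as `--name` to `aws eks update-kubeconfig`.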
uv run scripts/hyperpod_issue_report.py \
--cluster <cluster-name-or-arn> \
--region <region> \
--s3-path s3://<bucket>[/prefix]
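If the command is launched from Python rather than typed by hand, building the argument list explicitly avoids shell quoting problems and lets the S3 path be checked first. This is a minimal sketch using only the three flags shown above; `build_report_command` is a hypothetical helper.

```python
from urllib.parse import urlparse


def build_report_command(cluster: str, region: str, s3_path: str) -> list:
    """Hypothetical helper: argv for the bundled collector script."""
    parsed = urlparse(s3_path)
    # Expect s3://<bucket>[/prefix]; reject anything without scheme and bucket.
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"expected s3://<bucket>[/prefix], got {s3_path!r}")
    return [
        "uv", "run", "scripts/hyperpod_issue_report.py",
        "--cluster", cluster,
        "--region", region,
        "--s3-path", s3_path,
    ]
```

The resulting list can be handed to `subprocess.run(..., check=True)` from the skill's working directory.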
Use --help for all options including --instance-groups, --nodes, --command, --max-workers, and --debug. Note: --instance-groups and --nodes are mutually exclusive. Node identifiers accept instance IDs (i-*), EKS names (hyperpod-i-*), or Slurm names (ip-*).
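The three node identifier forms can be told apart by prefix alone. The sketch below is an illustration of that rule, not the script's actual parsing logic; `classify_node` and the returned labels are hypothetical.

```python
def classify_node(identifier: str) -> str:
    """Classify a node identifier by the prefixes listed above."""
    # Check the longer EKS prefix first for clarity.
    if identifier.startswith("hyperpod-i-"):
        return "eks_name"
    if identifier.startswith("i-"):
        return "instance_id"
    if identifier.startswith("ip-"):
        return "slurm_name"
    raise ValueError(f"unrecognized node identifier: {identifier!r}")
```

Running a check like this before invoking the script catches a mistyped identifier early, before any SSM sessions are opened.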
After collection, the script shows statistics and offers interactive download. Report the S3 location and offer to:
See references/troubleshooting.md for error handling, large cluster tuning, and known limitations.