Use when building ML training/serving pipelines on AWS SageMaker, implementing MLOps with SageMaker Pipelines and Model Registry, monitoring models in production, or optimizing training costs with Spot instances. Covers AWS MLA-C01 exam domains.
```shell
npx claudepluginhub kienbui1995/magic-powers --plugin magic-powers
```

This skill uses the workspace's default tool permissions.
- Designing ML training and serving infrastructure on AWS SageMaker
| Option | Cost | Best For |
|---|---|---|
| On-Demand instances | Full price | Short jobs, time-critical, no interruption risk |
| Spot training | Up to 90% savings | Long batch jobs; must use checkpointing |
| SageMaker Training Warm Pools | Pay for keep-alive period between runs | Iterative development (reduces startup time) |
Spot training requirements:
- Enable checkpointing so an interrupted job can resume (checkpoints synced to an S3 URI such as `s3://bucket/checkpoints/job-name/`)
- Set `max_wait` >= `max_run` on the estimator

Managed Spot Training code:
```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    ...,
    use_spot_instances=True,
    max_run=3600,    # max total training time (seconds)
    max_wait=7200,   # max wait including Spot interruptions (must be >= max_run)
    checkpoint_s3_uri="s3://bucket/checkpoints/",
    checkpoint_local_path="/opt/ml/checkpoints",
)
```
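Inside the training container, the script itself has to write checkpoints to `checkpoint_local_path` and resume from them after a Spot interruption restarts the job. A minimal sketch of that logic (the `state.json` filename and the epoch-counter state are illustrative, not a SageMaker convention):

```python
import json
import os

def load_latest_checkpoint(checkpoint_dir):
    """Resume state if a prior Spot interruption left a checkpoint behind."""
    path = os.path.join(checkpoint_dir, "state.json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0}  # fresh start: no checkpoint found

def save_checkpoint(checkpoint_dir, state):
    """Write state locally; SageMaker syncs files in checkpoint_local_path
    to checkpoint_s3_uri in the background."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    with open(os.path.join(checkpoint_dir, "state.json"), "w") as f:
        json.dump(state, f)
```

In the training loop, call `save_checkpoint` at the end of each epoch and start from `load_latest_checkpoint(...)["epoch"]` on (re)start, so an interrupted job repeats at most one epoch.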
Built-in algorithms vs custom containers:
| Approach | Use Case | Example |
|---|---|---|
| Built-in algorithms | Common ML tasks, fast start | XGBoost, Linear Learner, K-Means, BlazingText |
| Script mode | Familiar framework (TF/PyTorch/sklearn), custom code | Bring your own training script |
| Custom container | Exotic runtime, custom dependencies | Custom C++ inference, specialized research |
| Pre-trained model (JumpStart) | Fine-tune foundation models | LLMs, BERT, ResNet |
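In script mode, the entry-point script receives hyperparameters as CLI arguments and data/model paths through `SM_*` environment variables that SageMaker sets in the container. A sketch of the usual skeleton (the specific hyperparameters are illustrative):

```python
import argparse
import os

def parse_args(argv=None):
    """Script-mode entry point: hyperparameters arrive as CLI args,
    I/O locations via SageMaker-injected environment variables."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--learning-rate", type=float, default=0.1)
    # SageMaker sets these inside the training container; the fallbacks
    # are the standard in-container paths
    parser.add_argument("--model-dir",
                        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--train",
                        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    return parser.parse_args(argv)
```

The same script then trains on data read from `args.train` and must save the final model under `args.model_dir`, which SageMaker uploads to S3 as the training artifact.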
| Endpoint Type | Latency | Payload Size | Use Case |
|---|---|---|---|
| Real-time endpoint | Synchronous, milliseconds | < 6MB | Interactive APIs, recommendations, fraud detection |
| Serverless endpoint | Cold start possible | < 4MB (request), < 20MB (model) | Infrequent traffic (cost savings, no idle cost) |
| Async endpoint | Minutes (result to S3) | Up to 1GB | Large payloads, long processing (NLP, video) |
| Batch Transform | Offline, hours | Entire dataset | Offline scoring, pre-computation, bulk inference |
Async endpoint: request queued in SQS; processing result written to S3; notification via SNS/EventBridge.
Batch Transform: no endpoint needed; input from S3; output to S3; best for periodic bulk scoring.
Multi-model endpoint (MME): host thousands of models on a single endpoint; SageMaker loads/unloads models from S3 to GPU/CPU memory dynamically. Cost-effective for many similar models.
Multi-container endpoint: run different models/containers on one endpoint; invoke a specific container. Use for A/B testing or ensemble inference.
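The decision rules in the endpoint table can be sketched as a small helper. This is purely illustrative (the 6 MB cutoff mirrors the real-time payload limit above; always check current SageMaker quotas rather than hard-coding them):

```python
def choose_endpoint_type(payload_mb, latency_sensitive, offline_bulk):
    """Toy decision helper mirroring the endpoint-type table."""
    if offline_bulk:
        # Whole-dataset scoring: no endpoint needed at all
        return "batch-transform"
    if payload_mb > 6:
        # Beyond the real-time payload limit: async handles up to ~1 GB via S3
        return "async"
    if latency_sensitive:
        return "real-time"
    # Infrequent traffic, latency-tolerant: serverless avoids idle-instance cost
    return "serverless"
```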
Supported step types:
| Step Type | Purpose |
|---|---|
| ProcessingStep | Data preprocessing, feature engineering, evaluation |
| TrainingStep | Model training job |
| TuningStep | Hyperparameter optimization (HPO) |
| TransformStep | Batch inference |
| RegisterModel | Register model version in Model Registry |
| ConditionStep | Branch pipeline based on evaluation metrics |
| CreateModelStep | Create SageMaker model from training artifacts |
| LambdaStep | Invoke Lambda function (custom logic) |
| ClarifyCheckStep | Bias/explainability analysis |
Example pipeline flow:
```
ProcessingStep (feature engineering)
        ↓
TrainingStep (train XGBoost)
        ↓
ProcessingStep (evaluate on test set)
        ↓
ConditionStep (accuracy > 0.9?)
 ├── Yes → RegisterModel (Approved)
 └── No  → RegisterModel (Rejected)
```
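In a real pipeline the branch is expressed with a `ConditionGreaterThan` over a `JsonGet` into the evaluation report; the gating logic itself reduces to something like this sketch (the report structure and threshold are illustrative):

```python
def registration_status(evaluation_report, threshold=0.9):
    """Mirrors the ConditionStep branch: models clearing the accuracy bar
    are registered as Approved, others as Rejected (kept for audit trail)."""
    accuracy = evaluation_report["metrics"]["accuracy"]
    return "Approved" if accuracy > threshold else "Rejected"
```

Registering rejected models (rather than dropping them) keeps every trained version traceable in the Model Registry.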
SageMaker Pipelines vs Step Functions:
- SageMaker Pipelines: purpose-built for ML workflows; native SageMaker step types and Model Registry integration; no separate orchestration charge.
- Step Functions: general-purpose orchestration across AWS services; use when the workflow spans services beyond SageMaker.
SageMaker Model Monitor continuously monitors deployed endpoint data for:
| Monitor Type | What It Detects | Baseline |
|---|---|---|
| Data quality | Feature distribution drift (input data statistics change) | Baseline from training data |
| Model quality | Accuracy/precision drift (compare predictions vs ground truth) | Baseline from training evaluation |
| Bias drift | Fairness metric changes (demographic parity, etc.) | Baseline from Clarify bias analysis |
| Feature attribution drift | SHAP value changes (important features changing) | Baseline from Clarify explainability analysis |
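The data-quality monitor's core idea is comparing live statistics against a baseline computed from training data. A toy version of that comparison (the real monitor computes far richer statistics and emits constraint violations; the relative-mean-shift rule and tolerance here are illustrative only):

```python
def data_quality_violations(baseline_means, current_means, tolerance=0.2):
    """Flag features whose mean moved more than `tolerance` (relative)
    from the training-data baseline."""
    violations = []
    for feature, base in baseline_means.items():
        current = current_means.get(feature, base)
        denom = abs(base) or 1.0  # avoid division by zero for zero-mean features
        if abs(current - base) / denom > tolerance:
            violations.append(feature)
    return violations
```

In production the equivalent violation report lands in S3 on the monitoring schedule, where it can trigger alarms or retraining.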
Setup requirements:
- Enable data capture on the endpoint (requests and responses logged to S3)
- Create a baseline job from the training dataset
- Create a monitoring schedule (runs as periodic processing jobs)
| Store Type | Latency | Backed By | Best For |
|---|---|---|---|
| Online store | Milliseconds | In-memory cache | Real-time inference (serving) |
| Offline store | Seconds-minutes | S3 (Parquet, Iceberg) | Model training, batch queries |
Feature reuse: compute features once, store them in Feature Store, and reuse them across multiple models and teams.

Point-in-time queries: the offline store supports time-travel queries (get feature values as of a specific timestamp), which prevents training/serving skew.
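The essence of a point-in-time query is: for each training label, take the latest feature value whose event time is at or before the label's timestamp, and never a later one. A minimal sketch of that lookup (the `(event_time, value)` history format is illustrative, not the Feature Store API):

```python
def point_in_time_value(history, as_of):
    """Return the latest feature value with event_time <= as_of,
    or None if no value existed yet. Using only past values is what
    prevents label leakage and training/serving skew."""
    eligible = [(t, v) for t, v in history if t <= as_of]
    if not eligible:
        return None
    return max(eligible)[1]  # latest eligible event_time wins
```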