Deploys and operates containerized workloads on AWS ECS, Fargate, and ECR. Covers task definitions, services, debugging with ECS Exec, scaling, load balancers, and image management for AWS container optimization.
npx claudepluginhub aws/agent-toolkit-for-aws --plugin aws-core
- references/app-runner-guide.md
- references/ecr-repository-management.md
- references/ecs-exec-debugging.md
- references/ecs-infrastructure-patterns.md
- references/ecs-logging-and-firelens.md
- references/ecs-troubleshooting-guide.md
- references/fargate-service-deployment.md
- references/fargate-spot.md
- references/service-scaling-and-updates.md
- references/task-definition-authoring.md
| Developer Need | Recommend | Key CLI / CDK |
|---|---|---|
| Simplest container deploy (HTTP app/API, new customers) | ECS Express Mode | aws ecs create-express-gateway-service |
| Web app, worker, batch, scheduled task | ECS on Fargate | aws ecs create-service / CDK ecsPatterns.ApplicationLoadBalancedFargateService |
| GPU workloads or >16 vCPU | ECS on EC2 | CDK ecs.Ec2Service |
| Store container images | ECR | aws ecr create-repository |
| Web app behind a load balancer | ECS Fargate + ALB | CDK ecsPatterns.ApplicationLoadBalancedFargateService |
| SQS worker scaling on queue depth | ECS Fargate + SQS | CDK ecsPatterns.QueueProcessingFargateService |
| Cron job / scheduled task | ECS Fargate + EventBridge | CDK ecsPatterns.ScheduledFargateTask |
| Service mesh / service-to-service | ECS Service Connect | Configure on ECS service with Cloud Map namespace |
| Debug a running container | ECS Exec | aws ecs execute-command --interactive --command "/bin/sh" |
When a developer says "deploy my container" without naming a service: recommend ECS Express Mode for simple HTTP apps (replaces App Runner for new customers). Recommend ECS Fargate for everything else. Never recommend EKS unless they explicitly ask for Kubernetes.
Provides expertise for building, deploying, and operating containerized workloads using Amazon ECS, AWS Fargate, Amazon ECR, and AWS App Runner.
Recommended setup: Install the AWS MCP server for sandboxed execution, audit logging, and enterprise controls. See: aws.amazon.com/mcp
Without AWS MCP: This skill works with any agent that has AWS CLI access. All commands use standard AWS CLI syntax.
When NOT to use this skill:
Before executing any commands:
Apply these every time. Each corrects a mistake agents make without explicit instruction.
Fargate CPU/memory must be valid combinations. Arbitrary values cause the error: Invalid 'cpu' setting for task.
If the user requests an invalid combination, tell them and recommend the nearest valid option. You MUST NOT silently produce an invalid task definition.
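A minimal sketch of the check described above. It assumes the Fargate task-size table as published at the time of writing, so verify the values against current AWS documentation. The names `FARGATE_MEMORY_MIB`, `isValidFargateSize`, and `smallestCoveringSize` are hypothetical helpers, not part of any AWS SDK, and "nearest" is interpreted here as the smallest valid size that covers the request.

```typescript
// Build an inclusive list of memory values in MiB.
function memorySteps(min: number, max: number, step: number): number[] {
  const values: number[] = [];
  for (let v = min; v <= max; v += step) values.push(v);
  return values;
}

// Assumption: allowed memory (MiB) per CPU value (CPU units), copied from the
// published Fargate task-size table. Confirm before relying on it.
const FARGATE_MEMORY_MIB: { [cpu: number]: number[] } = {
  256: [512, 1024, 2048],
  512: memorySteps(1024, 4096, 1024),
  1024: memorySteps(2048, 8192, 1024),
  2048: memorySteps(4096, 16384, 1024),
  4096: memorySteps(8192, 30720, 1024),
  8192: memorySteps(16384, 61440, 4096),
  16384: memorySteps(32768, 122880, 8192),
};

interface FargateSize {
  cpu: number;
  memoryMiB: number;
}

// True only when the pair appears in the table above.
function isValidFargateSize(cpu: number, memoryMiB: number): boolean {
  const allowed = FARGATE_MEMORY_MIB[cpu];
  return allowed !== undefined && allowed.indexOf(memoryMiB) !== -1;
}

// Smallest valid size with at least the requested CPU and memory,
// or null when the request exceeds the largest Fargate size.
function smallestCoveringSize(cpu: number, memoryMiB: number): FargateSize | null {
  const cpus = Object.keys(FARGATE_MEMORY_MIB).map(Number).sort((a, b) => a - b);
  for (const c of cpus) {
    if (c < cpu) continue;
    const options = FARGATE_MEMORY_MIB[c] || [];
    for (const m of options) {
      if (m >= memoryMiB) return { cpu: c, memoryMiB: m };
    }
  }
  return null;
}
```

For a request of 256 CPU units with 4096 MiB, `isValidFargateSize` returns false and `smallestCoveringSize` suggests 512 CPU units with 4096 MiB, which is the kind of recommendation to surface to the user instead of silently adjusting.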
Fargate requires awsvpc networking mode — no exceptions. Agents frequently suggest bridge or host mode for Fargate tasks, which causes immediate registration failure. You MUST set networkMode to awsvpc for all Fargate task definitions. On EC2, awsvpc is recommended; bridge is legacy only.
Execution role vs task role — never confuse them. executionRoleArn: ECS agent uses it to pull images, fetch secrets, write logs. taskRoleArn: application code uses it to call AWS APIs. ECS Exec permissions (ssmmessages:*) go on the task role. ECR pull permissions go on the execution role. ecr:GetAuthorizationToken MUST use Resource: "*" (registry-level action).
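The split between the two roles can be sketched as two separate IAM policy documents. This is an illustrative sketch only: `buildExecutionRolePolicy` and `buildEcsExecTaskRolePolicy` are hypothetical helper names, the repository ARN is a placeholder supplied by the caller, and `ecr:BatchCheckLayerAvailability` is included on the assumption that image pulls need it.

```typescript
interface PolicyStatement {
  Effect: "Allow";
  Action: string[];
  Resource: string;
}

interface PolicyDocument {
  Version: "2012-10-17";
  Statement: PolicyStatement[];
}

// Execution role: used by the ECS agent to pull the image.
function buildExecutionRolePolicy(repositoryArn: string): PolicyDocument {
  return {
    Version: "2012-10-17",
    Statement: [
      // Registry-level action: must use Resource "*".
      { Effect: "Allow", Action: ["ecr:GetAuthorizationToken"], Resource: "*" },
      // Repository-level pull actions can be scoped to one repository.
      {
        Effect: "Allow",
        Action: [
          "ecr:BatchCheckLayerAvailability",
          "ecr:GetDownloadUrlForLayer",
          "ecr:BatchGetImage",
        ],
        Resource: repositoryArn,
      },
    ],
  };
}

// Task role: used by code inside the container, including the ECS Exec agent.
function buildEcsExecTaskRolePolicy(): PolicyDocument {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Action: [
          "ssmmessages:CreateControlChannel",
          "ssmmessages:CreateDataChannel",
          "ssmmessages:OpenControlChannel",
          "ssmmessages:OpenDataChannel",
        ],
        Resource: "*",
      },
    ],
  };
}
```

Serializing either result with `JSON.stringify` gives a document that could be attached to the corresponding role. Keeping the builders separate makes it harder to attach ECS Exec permissions to the execution role by accident.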
Secrets are injected at task launch only — no hot-reload. Changed secrets require aws ecs update-service --force-new-deployment. To reference a specific JSON key in Secrets Manager: arn:aws:secretsmanager:region:account:secret:name-hash:json-key:: — the trailing colons are required (they represent empty version-stage and version-id fields). You can also use SSM Parameter Store with valueFrom pointing to the parameter ARN — the execution role needs ssm:GetParameters permission.
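The ARN format above is easy to get wrong by a colon. The following sketch builds the `valueFrom` string from its parts; `secretValueFrom` is a hypothetical helper, and the ARN in the usage example is a made-up placeholder.

```typescript
// Build the valueFrom string for a task definition secret.
// Layout: <secret-arn>:<json-key>:<version-stage>:<version-id>
// Empty fields still need their colon, which is why a JSON-key reference
// ends in "::".
function secretValueFrom(
  secretArn: string,
  jsonKey: string = "",
  versionStage: string = "",
  versionId: string = ""
): string {
  if (jsonKey === "" && versionStage === "" && versionId === "") {
    return secretArn; // whole secret, no suffix needed
  }
  return [secretArn, jsonKey, versionStage, versionId].join(":");
}

// Hypothetical secret ARN, for illustration only.
const exampleSecretArn =
  "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-AbCdEf";

// Entry for the container definition's secrets array.
const dbPasswordSecret = {
  name: "DB_PASSWORD",
  valueFrom: secretValueFrom(exampleSecretArn, "password"),
};
```

Here `dbPasswordSecret.valueFrom` ends in `:password::`, with the two trailing colons standing in for the empty version-stage and version-id fields.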
ALB deregistration delay defaults to 300s — reduce to 30–60s. This is the #1 cause of slow deployments. Set it on the target group. It SHOULD exceed your longest request duration.
Set healthCheckGracePeriodSeconds on every ECS service behind an ALB. Without it, the ALB marks tasks unhealthy before they're ready, the circuit breaker counts failures, and the deployment rolls back. JVM/Spring Boot apps need 60–120s.
Always enable deployment circuit breaker with rollback. Without it, bad deployments stay "in progress" for 30+ minutes. In CDK: circuitBreaker: { rollback: true } (specifying the property implicitly enables it; enable defaults to true).
Private subnet Fargate tasks need NAT or all four VPC endpoints. Required endpoints: ecr.dkr (interface), ecr.api (interface), s3 (gateway — ECR stores layers in S3), logs (interface — for CloudWatch). The S3 gateway endpoint is the most commonly missed. For ECS Exec, also add ssmmessages.
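As a checklist, the endpoint set above can be generated per region. This sketch assumes the standard `com.amazonaws.<region>.<service>` naming for endpoint service names; `requiredVpcEndpoints` is a hypothetical helper.

```typescript
interface VpcEndpointSpec {
  serviceName: string;
  endpointType: "Interface" | "Gateway";
  purpose: string;
}

// Endpoints a private-subnet Fargate task needs when there is no NAT gateway.
function requiredVpcEndpoints(region: string, withEcsExec: boolean): VpcEndpointSpec[] {
  const prefix = "com.amazonaws." + region + ".";
  const endpoints: VpcEndpointSpec[] = [
    { serviceName: prefix + "ecr.dkr", endpointType: "Interface", purpose: "image pulls" },
    { serviceName: prefix + "ecr.api", endpointType: "Interface", purpose: "ECR API calls" },
    // Gateway endpoint: ECR stores image layers in S3. Most commonly missed.
    { serviceName: prefix + "s3", endpointType: "Gateway", purpose: "image layers" },
    { serviceName: prefix + "logs", endpointType: "Interface", purpose: "CloudWatch Logs" },
  ];
  if (withEcsExec) {
    endpoints.push({
      serviceName: prefix + "ssmmessages",
      endpointType: "Interface",
      purpose: "ECS Exec sessions",
    });
  }
  return endpoints;
}
```

Comparing this list against the endpoints that actually exist in the VPC is a quick way to spot the missing S3 gateway endpoint.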
ECR lifecycle policies evaluate within 24 hours — not immediately. Multi-architecture images referenced by a manifest list cannot be expired until the manifest list is deleted first. Preview before applying: first aws ecr start-lifecycle-policy-preview --repository-name $REPO, then aws ecr get-lifecycle-policy-preview --repository-name $REPO --output json to see which images would be affected.
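For context, a lifecycle policy is a JSON document with prioritized rules. The sketch below builds one with two common rules; the thresholds and the helper name `buildLifecyclePolicy` are illustrative assumptions, not recommendations from this skill.

```typescript
interface LifecycleRule {
  rulePriority: number;
  description: string;
  selection: {
    tagStatus: "tagged" | "untagged" | "any";
    countType: "imageCountMoreThan" | "sinceImagePushed";
    countUnit?: "days";
    countNumber: number;
  };
  action: { type: "expire" };
}

// Expire untagged images after a number of days, then cap the total image count.
function buildLifecyclePolicy(
  untaggedMaxAgeDays: number,
  keepLastImages: number
): { rules: LifecycleRule[] } {
  return {
    rules: [
      {
        rulePriority: 1,
        description: "Expire untagged images after " + untaggedMaxAgeDays + " days",
        selection: {
          tagStatus: "untagged",
          countType: "sinceImagePushed",
          countUnit: "days",
          countNumber: untaggedMaxAgeDays,
        },
        action: { type: "expire" },
      },
      {
        // A rule with tagStatus "any" is given the highest priority number
        // so that it is evaluated last.
        rulePriority: 2,
        description: "Keep only the most recent " + keepLastImages + " images",
        selection: {
          tagStatus: "any",
          countType: "imageCountMoreThan",
          countNumber: keepLastImages,
        },
        action: { type: "expire" },
      },
    ],
  };
}

// JSON text suitable for a lifecycle policy file.
const lifecyclePolicyText = JSON.stringify(buildLifecyclePolicy(14, 50), null, 2);
```

The resulting JSON can be saved to a file and previewed before it is applied, so that the images a rule would expire are known in advance.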
ECS Exec requires task role permissions, NOT execution role. The task role needs ssmmessages:CreateControlChannel, CreateDataChannel, OpenControlChannel, OpenDataChannel. Tasks launched before enabling enableExecuteCommand do NOT support ECS Exec — force a new deployment. The container image must include the binary specified in --command (e.g., /bin/sh for interactive sessions). For command logging to S3 or CloudWatch Logs, script and cat must also be installed. Fargate platform version MUST be 1.4.0+.
awslogs log driver mode — check your account's default. Per ECS docs, the ECS service defaults to non-blocking mode, which drops logs when the buffer fills. The defaultLogDriverMode account setting can override this per account. For guaranteed log delivery (audit/compliance), explicitly set "mode": "blocking" in logConfiguration.options. Check your effective default: aws ecs list-account-settings --name defaultLogDriverMode --effective-settings --output json.
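A sketch of the `logConfiguration` block with the delivery mode set explicitly rather than inherited from the account default. `buildAwsLogsConfiguration` is a hypothetical helper, and the `25m` buffer size for non-blocking mode is an illustrative value, not a recommendation from this skill.

```typescript
interface AwsLogsConfiguration {
  logDriver: "awslogs";
  options: { [key: string]: string };
}

// Build a container logConfiguration with an explicit delivery mode.
function buildAwsLogsConfiguration(
  logGroup: string,
  region: string,
  streamPrefix: string,
  guaranteedDelivery: boolean
): AwsLogsConfiguration {
  const options: { [key: string]: string } = {
    "awslogs-group": logGroup,
    "awslogs-region": region,
    "awslogs-stream-prefix": streamPrefix,
    // blocking: never drops logs, but can stall the app if delivery stalls.
    // non-blocking: keeps the app running, but drops logs when the buffer fills.
    mode: guaranteedDelivery ? "blocking" : "non-blocking",
  };
  if (!guaranteedDelivery) {
    // Assumption: a larger buffer reduces, but does not eliminate, log loss.
    options["max-buffer-size"] = "25m";
  }
  return { logDriver: "awslogs", options: options };
}
```

Calling `buildAwsLogsConfiguration("/ecs/web", "us-east-1", "web", true)` yields a configuration suited to audit or compliance logging, where dropped log lines are not acceptable.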
App Runner VPC connector routes ALL application-initiated outbound traffic through the VPC. (App Runner is sunset — new customers should use ECS Express Mode instead.) Without a NAT gateway, external API calls and AWS service calls from your application code break. App Runner's own managed traffic (pulling images, pushing logs, retrieving secrets) is NOT routed through the VPC and is unaffected. Implement retry logic with backoff for database connections at startup.
For desiredCount=1 zero-downtime deploys: minimumHealthyPercent=100, maximumPercent=200. This requires capacity for 2 tasks during deployment. You MUST NOT set minimumHealthyPercent=0 if zero downtime is required.
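The arithmetic behind that rule can be sketched as follows, assuming ECS rounds the minimum healthy count up and the maximum count down. `deploymentBounds` is a hypothetical helper for reasoning about settings, not an AWS API.

```typescript
interface DeploymentBounds {
  minRunning: number; // tasks ECS keeps running during the deployment
  maxRunning: number; // upper limit on old plus new tasks
  zeroDowntime: boolean; // at least one task always serving
  canProgress: boolean; // room to start a new task or stop an old one
}

function deploymentBounds(
  desiredCount: number,
  minimumHealthyPercent: number,
  maximumPercent: number
): DeploymentBounds {
  // Assumption: minimum is rounded up, maximum is rounded down.
  const minRunning = Math.ceil((desiredCount * minimumHealthyPercent) / 100);
  const maxRunning = Math.floor((desiredCount * maximumPercent) / 100);
  return {
    minRunning: minRunning,
    maxRunning: maxRunning,
    zeroDowntime: minRunning >= 1,
    canProgress: maxRunning > minRunning,
  };
}
```

With `desiredCount` of 1, the 100/200 setting gives a minimum of 1 and a maximum of 2, so the new task starts before the old one stops. The 0/100 setting gives a minimum of 0, so the old task can be stopped first and the service is briefly down. A 100/100 setting leaves no room to move and the deployment cannot progress.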
502 Bad Gateway from ALB: check in this order. (a) Container not listening on the port in the target group. (b) Container crashing before responding. (c) Task security group doesn't allow inbound from ALB security group on the container port. (d) Health check path returns non-200. (e) Response time exceeds the health check timeout.
Fargate platform version: always use LATEST or 1.4.0. Version 1.3.0 is being retired June 15, 2026 and terminated June 30, 2026.
SQS worker scaling: use a custom backlog-per-task metric. Raw ApproximateNumberOfMessagesVisible with target tracking doesn't work because adding tasks doesn't reduce queue depth proportionally. Use custom metric (ApproximateNumberOfMessagesVisible / RunningTaskCount) with target tracking, or use step scaling. CDK QueueProcessingFargateService handles this automatically via scalingSteps. Workers MUST handle SIGTERM gracefully within stopTimeout (default 30s, max 120s on Fargate).
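The backlog-per-task calculation can be sketched as plain arithmetic. The helper names and the way the target is derived (acceptable latency divided by average processing time) are assumptions for illustration; in practice the metric would be published to CloudWatch and consumed by a target tracking policy.

```typescript
// Custom metric value: messages waiting per running task.
function backlogPerTask(visibleMessages: number, runningTasks: number): number {
  // Treat zero running tasks as one to avoid dividing by zero.
  return visibleMessages / Math.max(runningTasks, 1);
}

// Target value: how many messages one task may have queued while still
// meeting the latency goal.
function targetBacklogPerTask(acceptableLatencySec: number, avgProcessingSec: number): number {
  return acceptableLatencySec / avgProcessingSec;
}

// Task count that brings the backlog per task down to the target,
// clamped to the service's scaling limits.
function desiredTaskCount(
  visibleMessages: number,
  targetBacklog: number,
  minTasks: number,
  maxTasks: number
): number {
  const needed = Math.ceil(visibleMessages / targetBacklog);
  return Math.min(maxTasks, Math.max(minTasks, needed));
}
```

For example, 1500 visible messages across 10 tasks is a backlog of 150 per task. If each task can tolerate 100 queued messages, `desiredTaskCount(1500, 100, 1, 20)` asks for 15 tasks. The per-task ratio falls as tasks are added, which is what makes it usable with target tracking where raw queue depth is not.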
Blue/green deployments: use native ECS blue/green (July 2025+) for new services. Supports all-at-once, canary, and linear traffic shifting (canary/linear added October 2025), plus Service Connect, headless services, EBS volumes, and lifecycle hooks. CodeDeploy blue/green is now legacy — native ECS blue/green has full feature parity.
Container dependency HEALTHY condition requires a health check on the dependency container. Without a configured health check, the dependent container never starts — ECS does not progress it to its next state. If startTimeout is set (max 120s), the dependency times out and the task fails; if not set, the dependent container blocks indefinitely. For init containers, use SUCCESS condition instead.
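A small lint can catch this before the task definition is registered. The sketch below models only the fields it needs; `findBrokenHealthyDependencies`, the container names, and the health check command are hypothetical.

```typescript
type DependencyCondition = "START" | "COMPLETE" | "SUCCESS" | "HEALTHY";

interface ContainerDef {
  name: string;
  essential: boolean;
  healthCheck?: { command: string[]; interval?: number; timeout?: number; retries?: number };
  dependsOn?: { containerName: string; condition: DependencyCondition }[];
}

// Report every HEALTHY dependency whose target has no health check.
function findBrokenHealthyDependencies(containers: ContainerDef[]): string[] {
  const problems: string[] = [];
  for (const container of containers) {
    for (const dep of container.dependsOn || []) {
      if (dep.condition !== "HEALTHY") continue;
      const target = containers.filter(function (c) {
        return c.name === dep.containerName;
      })[0];
      if (target === undefined) {
        problems.push(container.name + " depends on unknown container " + dep.containerName);
      } else if (target.healthCheck === undefined) {
        problems.push(dep.containerName + " needs a healthCheck for " + container.name + " to start");
      }
    }
  }
  return problems;
}

// Example layout: an init container gated with SUCCESS, a sidecar gated with HEALTHY.
const exampleContainers: ContainerDef[] = [
  { name: "migrate", essential: false }, // init container: runs once and exits
  {
    name: "proxy",
    essential: true,
    healthCheck: {
      command: ["CMD-SHELL", "curl -f http://localhost:9901/ready || exit 1"],
      interval: 10,
      timeout: 5,
      retries: 3,
    },
  },
  {
    name: "app",
    essential: true,
    dependsOn: [
      { containerName: "migrate", condition: "SUCCESS" },
      { containerName: "proxy", condition: "HEALTHY" },
    ],
  },
];
```

`findBrokenHealthyDependencies(exampleContainers)` returns an empty list. Removing the `healthCheck` from `proxy` produces one problem, which corresponds to the case where `app` would never start.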
```typescript
import * as cdk from 'aws-cdk-lib';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecsPatterns from 'aws-cdk-lib/aws-ecs-patterns';

// Runs inside a Stack constructor. `repo` (ECR repository) and `dbSecret`
// (Secrets Manager secret) are assumed to be defined elsewhere in the stack.
const service = new ecsPatterns.ApplicationLoadBalancedFargateService(this, 'WebApp', {
  taskImageOptions: {
    image: ecs.ContainerImage.fromEcrRepository(repo, 'latest'),
    containerPort: 8080,
    secrets: { DB_PASSWORD: ecs.Secret.fromSecretsManager(dbSecret) },
  },
  cpu: 512,
  memoryLimitMiB: 1024,
  desiredCount: 2,
  publicLoadBalancer: true,
  circuitBreaker: { rollback: true },
  minHealthyPercent: 100,
});

// Shorten the ALB deregistration delay from the 300s default.
service.targetGroup.setAttribute('deregistration_delay.timeout_seconds', '30');

// Scale between 2 and 10 tasks on average CPU utilization.
const scaling = service.service.autoScaleTaskCount({ minCapacity: 2, maxCapacity: 10 });
scaling.scaleOnCpuUtilization('CpuScaling', { targetUtilizationPercent: 70 });
```
CDK L3 patterns auto-create VPC, cluster, ALB, target group, and security groups. For production, create these separately and pass them in. ApplicationLoadBalancedFargateService defaults to assignPublicIp: false — tasks in public subnets need assignPublicIp: true for internet access, or use private subnets with NAT.
```bash
# 1. Enable on the service (existing tasks won't support it, so force a new deployment)
aws ecs update-service --cluster $CLUSTER --service $SERVICE \
  --enable-execute-command --force-new-deployment --output json

# 2. Connect (task role must have ssmmessages:* permissions)
aws ecs execute-command --cluster $CLUSTER --task $TASK_ID \
  --container $CONTAINER --interactive --command "/bin/sh"
```
If TargetNotConnectedException: wait 30–60s for SSM agent startup, check NAT/VPC endpoint for ssmmessages, verify task role (not execution role) has permissions.
Use the best available tool for AWS operations (MCP server, AWS CLI, or SDK). The commands below show the AWS CLI form.
Read reference files only when the conversation requires deeper detail.
App Runner: Sunset April 30, 2026 — no new customers, no new features. Existing customers should migrate to ECS Express Mode. See App Runner Availability Change.
| Factor | ECS Express Mode | ECS Fargate |
|---|---|---|
| Setup complexity | Minimal (single API call) | Moderate — task def, service, cluster, ALB |
| Networking control | Managed (ALB in default VPC) | Full — awsvpc, security groups, subnets |
| Scaling | Auto (CPU-based) | Configurable target/step scaling |
| Use when | New simple HTTP app/API, zero infra management | Production services needing VPC, ALB, fine-grained IAM |
| Limitations | New service, evolving feature set | Most setup required |
Default recommendation: Use ECS Fargate for production workloads. Use ECS Express Mode for the simplest path (new customers).
Cause: Task cannot reach ECR. In private subnets, tasks need NAT gateway or VPC endpoints (ecr.api, ecr.dkr, s3 gateway, logs).
Fix: Verify route table has a route to NAT gateway or create the required VPC endpoints. Verify the execution role has ecr:GetDownloadUrlForLayer, ecr:BatchGetImage, ecr:GetAuthorizationToken (Resource: "*"). Check security group allows outbound HTTPS (443).
Cause: Health check path returns non-200, container not listening on the configured port, or health check grace period too short.
Fix: Verify the container responds on the health check path and port. Set healthCheckGracePeriodSeconds to at least 60s (longer for JVM apps). Ensure the security group allows traffic from the ALB security group on the container port.
Cause: Container exceeded its memory hard limit (SIGKILL). On Fargate, task-level memory is the hard limit.
Fix: Increase task-level memory. For JVM apps, use -XX:MaxRAMPercentage=75 instead of fixed -Xmx — this automatically adapts to the container's memory allocation. Check container-level memory (hard limit) vs memoryReservation (soft limit).
Cause: Permissions are on the execution role instead of the task role, or the task role is missing.
Fix: Verify the task definition has taskRoleArn set (not just executionRoleArn). Add the required permissions to the task role.
Cause: Deployment circuit breaker not enabled, or health check failing on new tasks.
Fix: Enable circuit breaker with rollback. Check service events: aws ecs describe-services --cluster $CLUSTER --services $SERVICE --output json. Check stopped task reasons: aws ecs describe-tasks --cluster $CLUSTER --tasks $TASK_ID --output json.
Cause: SSM agent not running, missing task role permissions, or missing VPC endpoint.
Fix: Verify enableExecuteCommand is true on the service. Check the task role has SSM permissions. For private subnets, create the ssmmessages VPC endpoint. Verify with aws ecs describe-tasks that ExecuteCommandAgent status is RUNNING.
| Retry | Do NOT retry |
|---|---|
| ThrottlingException | InvalidParameterException |
| ServiceUnavailableException | ClientException |
| ServerException | AccessDeniedException |
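The table above can be turned into a small retry decision helper. This is a sketch under assumptions: the backoff parameters (200 ms base, 10 s cap, 5 attempts) and the names `isRetryableError`, `backoffDelayMs`, and `planRetry` are illustrative, and the random source is injectable so the behavior can be tested deterministically.

```typescript
const RETRYABLE_ERRORS: string[] = [
  "ThrottlingException",
  "ServiceUnavailableException",
  "ServerException",
];

// Errors not in the list (InvalidParameterException, ClientException,
// AccessDeniedException, ...) will not succeed on retry.
function isRetryableError(errorName: string): boolean {
  return RETRYABLE_ERRORS.indexOf(errorName) !== -1;
}

// Exponential backoff with full jitter: a random delay between 0 and
// min(cap, base * 2^attempt). `attempt` starts at 0.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 200,
  capMs: number = 10000,
  random: () => number = Math.random
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// Decide whether to retry after a failed attempt and how long to wait.
function planRetry(
  errorName: string,
  attempt: number,
  maxAttempts: number = 5,
  random: () => number = Math.random
): { retry: boolean; delayMs: number } {
  if (!isRetryableError(errorName) || attempt + 1 >= maxAttempts) {
    return { retry: false, delayMs: 0 };
  }
  return { retry: true, delayMs: backoffDelayMs(attempt, 200, 10000, random) };
}
```

A caller would catch the error from an AWS call, pass its name and the current attempt number to `planRetry`, then either wait `delayMs` and try again or surface the error to the user.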
- Use the secrets field in the task definition for sensitive values
- Set readonlyRootFilesystem: true in container definitions where possible (note: incompatible with ECS Exec)
- Avoid * wildcards and *FullAccess policies
- Confirm before running destructive operations: --force-new-deployment (replaces all running tasks), delete-service, deregister-task-definition. ECS does not support --dry-run, so use the plan-validate-execute pattern: explain what will happen, get confirmation, then execute
- Use aws:SourceArn and aws:SourceAccount condition keys in ECR repository policies for cross-account access to prevent confused deputy attacks