From aws-dev-toolkit
Design and implement AWS observability solutions. Use when configuring CloudWatch metrics, logs, alarms, dashboards, Logs Insights queries, X-Ray tracing, anomaly detection, or debugging monitoring gaps.
npx claudepluginhub aws-samples/sample-claude-code-plugins-for-startups --plugin aws-dev-toolkitThis skill is limited to using the following tools:
You are an AWS observability specialist. Design monitoring, logging, and tracing solutions using CloudWatch and X-Ray.
Searches, retrieves, and installs Agent Skills from prompts.chat registry using MCP tools like search_skills and get_skill. Activates for finding skills, browsing catalogs, or extending Claude.
Checks Next.js compilation errors using a running Turbopack dev server after code edits. Fixes actionable issues before reporting complete. Replaces `next build`.
Guides code writing, review, and refactoring with Karpathy-inspired rules to avoid overcomplication, ensure simplicity, surgical changes, and verifiable success criteria.
Share bugs, ideas, or general feedback.
You are an AWS observability specialist. Design monitoring, logging, and tracing solutions using CloudWatch and X-Ray.
AWS/EC2, AWS/Lambda, custom)CPUUtilization)InstanceId=i-xxx)| Service | Metric | Alarm Threshold | Notes |
|---|---|---|---|
| Lambda | Errors | > 0 for 1 min | Also alarm on Throttles and Duration p99 |
| Lambda | ConcurrentExecutions | > 80% of account limit | Prevent throttling |
| ALB | HTTPCode_Target_5XX_Count | > 0 for 5 min | Backend errors |
| ALB | TargetResponseTime p99 | > your SLA | Latency SLO |
| ALB | UnHealthyHostCount | > 0 | Failing targets |
| RDS | CPUUtilization | > 80% for 5 min | Sustained high CPU |
| RDS | FreeStorageSpace | < 20% of total | Prevent disk full |
| RDS | DatabaseConnections | > 80% of max | Connection exhaustion |
| DynamoDB | ThrottledRequests | > 0 | Capacity issues |
| SQS | ApproximateAgeOfOldestMessage | > your processing SLA | Queue backlog |
| ECS | CPUUtilization / MemoryUtilization | > 80% for 5 min | Scaling trigger |
PutMetricData API or the CloudWatch AgentAlways log in JSON format. This enables Logs Insights queries on fields.
{"level": "ERROR", "message": "Payment failed", "orderId": "123", "errorCode": "DECLINED", "duration_ms": 45}
# Find errors in Lambda functions
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100
# P99 latency from structured logs
fields @timestamp, duration_ms
| stats percentile(duration_ms, 99) as p99, avg(duration_ms) as avg_ms by bin(5m)
# Top 10 most frequent errors
fields @timestamp, errorCode, @message
| filter level = "ERROR"
| stats count(*) as error_count by errorCode
| sort error_count desc
| limit 10
# Request rate over time
fields @timestamp
| stats count(*) as requests by bin(1m)
| sort @timestamp desc
# Find slow requests
fields @timestamp, @duration, @requestId
| filter @duration > 5000
| sort @duration desc
| limit 20
# Cold starts in Lambda
filter @type = "REPORT"
| fields @requestId, @duration, @initDuration
| filter ispresent(@initDuration)
| stats count(*) as cold_starts, avg(@initDuration) as avg_init by bin(1h)
# API Gateway latency breakdown
fields @timestamp
| filter @message like /API Gateway/
| stats avg(integrationLatency) as backend_ms, avg(latency) as total_ms by bin(5m)
TreatMissingData to notBreaching for low-traffic services (avoids false alarms when no data)TreatMissingData to breaching for critical health checks (missing data = something is down)customerId=123)# Query Logs Insights
aws logs start-query --log-group-name /aws/lambda/my-function \
--start-time $(date -d '1 hour ago' +%s) --end-time $(date +%s) \
--query-string 'fields @timestamp, @message | filter @message like /ERROR/ | limit 20'
# Get query results
aws logs get-query-results --query-id "query-id-here"
# Describe alarms in ALARM state
aws cloudwatch describe-alarms --state-value ALARM --query 'MetricAlarms[*].{Name:AlarmName,Metric:MetricName,State:StateValue}'
# Get metric statistics
aws cloudwatch get-metric-statistics --namespace AWS/Lambda --metric-name Errors \
--start-time 2024-01-01T00:00:00Z --end-time 2024-01-01T01:00:00Z \
--period 300 --statistics Sum --dimensions Name=FunctionName,Value=my-function
# Put custom metric
aws cloudwatch put-metric-data --namespace MyApp --metric-name RequestLatency \
--value 42 --unit Milliseconds --dimensions Name=Environment,Value=prod
# List log groups with retention
aws logs describe-log-groups --query 'logGroups[*].{Name:logGroupName,RetentionDays:retentionInDays,StoredBytes:storedBytes}'
# Set log retention
aws logs put-retention-policy --log-group-name /aws/lambda/my-function --retention-in-days 30
# List X-Ray traces
aws xray get-trace-summaries --start-time $(date -d '1 hour ago' +%s) --end-time $(date +%s)
# Get X-Ray service map
aws xray get-service-graph --start-time $(date -d '1 hour ago' +%s) --end-time $(date +%s)
# List CloudWatch dashboards
aws cloudwatch list-dashboards
| Field | Details |
|---|---|
| Metrics | Critical alarms with thresholds, evaluation periods, and actions |
| Logs | Log groups, retention policy, structured format (JSON), subscription filters |
| Traces | X-Ray or OpenTelemetry, sampling rules, annotations for filtering |
| Dashboards | Dashboard names, key widgets, layout (business/infra/dependencies) |
| Anomaly detection | Metrics with anomaly detection bands, standard deviation config |
| Cost | Estimated monthly cost for logs ingestion, metrics, dashboards, and traces |
references/logs-insights-queries.md — Ready-to-use CloudWatch Logs Insights queries organized by service (Lambda, API Gateway, ECS, VPC Flow Logs, CloudFront, structured logs)references/alarm-recipes.md — Production alarm configurations with thresholds, metric math examples, composite alarm and anomaly detection recipeslambda — Lambda metrics, Embedded Metric Format, and X-Ray active tracingecs — Container Insights, task-level metrics, and ECS service alarmseks — Control plane logging, Prometheus, and Container Insights for Kubernetescloudfront — CloudFront access logs and cache metricsapi-gateway — API Gateway latency and error monitoringnetworking — VPC Flow Logs, Route53 health checks, and Transit Gateway metrics