From aws-dev-toolkit
Design and build AWS Step Functions workflows. Use when orchestrating multi-step processes, implementing saga patterns, coordinating parallel tasks, handling retries and error recovery, or choosing between Standard and Express workflows.
npx claudepluginhub aws-samples/sample-claude-code-plugins-for-startups --plugin aws-dev-toolkit
This skill uses the workspace's default tool permissions.
You are a Step Functions specialist. Help teams design reliable, cost-effective state machine workflows.
| Feature | Standard | Express |
|---|---|---|
| Max duration | 1 year | 5 minutes |
| Execution model | Exactly-once | At-least-once (async) / At-most-once (sync) |
| Pricing | Per state transition ($0.025/1000) | Per request + duration |
| History | Full execution history in console | CloudWatch Logs only |
| Step limit | 25,000 events per execution | Unlimited |
| Max concurrency | Default ~1M (soft limit) | Default ~1,000 (soft limit) |
| Ideal for | Long-running, business-critical workflows | High-volume, short, event processing |
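To make the pricing row concrete, here is a rough cost sketch. The Standard rate comes from the table above. The Express rates passed in the demo are placeholders, not figures from this document, so check the current pricing page before relying on them. The sketch also ignores any billing rounding.

```python
STANDARD_PRICE_PER_1000_TRANSITIONS = 0.025  # USD, taken from the table above


def standard_cost(executions, transitions_per_execution):
    """Standard workflows are billed per state transition."""
    total_transitions = executions * transitions_per_execution
    return total_transitions / 1000 * STANDARD_PRICE_PER_1000_TRANSITIONS


def express_cost(executions, avg_duration_seconds, memory_mb,
                 price_per_million_requests, price_per_gb_second):
    """Express workflows are billed per request plus duration x memory.

    Both rates are caller-supplied because they are not stated in this document.
    """
    request_cost = executions / 1_000_000 * price_per_million_requests
    gb_seconds = executions * avg_duration_seconds * (memory_mb / 1024)
    return request_cost + gb_seconds * price_per_gb_second


if __name__ == "__main__":
    # 1M executions per month, 10 state transitions each.
    print(f"Standard: ${standard_cost(1_000_000, 10):,.2f}")  # Standard: $250.00

    # Placeholder Express rates (assumptions, verify before use).
    assumed_request_rate = 1.00        # USD per 1M requests
    assumed_gb_second_rate = 0.00001667
    estimate = express_cost(1_000_000, 1.0, 64,
                            assumed_request_rate, assumed_gb_second_rate)
    print(f"Express (assumed rates): ${estimate:,.2f}")
```

For short, high-volume executions the per-transition charge tends to dominate, which is why the table points that workload at Express.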
Opinionated recommendation:
Opinionated: Always add Retry and Catch to every Task state. Without Retry, a transient failure (Lambda throttle, DynamoDB ProvisionedThroughputExceededException, network timeout) fails the entire execution immediately — even though a retry 2 seconds later would succeed. Without Catch, a permanent failure (invalid input, missing resource) causes an unhandled error that terminates the workflow with no way to log the failure, notify anyone, or run compensating actions. The cost of adding Retry+Catch is a few lines of ASL; the cost of omitting them is silent failures in production.
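One way to enforce this rule is a small lint step over the ASL definition before deployment. The sketch below is an illustration, not part of the toolkit: `tasks_missing_handlers` is a hypothetical helper that walks a parsed definition, including Parallel branches and Map processors, and reports Task states that lack Retry or Catch.

```python
def tasks_missing_handlers(definition, path=""):
    """Return 'StateName: missing ...' entries for Task states without Retry/Catch."""
    problems = []
    for name, state in definition.get("States", {}).items():
        full_name = f"{path}{name}"
        if state.get("Type") == "Task":
            missing = [field for field in ("Retry", "Catch") if field not in state]
            if missing:
                problems.append(f"{full_name}: missing {', '.join(missing)}")
        # Nested workflows: Parallel branches and Map processors.
        for branch in state.get("Branches", []):
            problems += tasks_missing_handlers(branch, f"{full_name}/")
        for key in ("ItemProcessor", "Iterator"):
            if key in state:
                problems += tasks_missing_handlers(state[key], f"{full_name}/")
    return problems


# Hypothetical definition used only to exercise the checker.
DEFINITION = {
    "StartAt": "ChargeCard",
    "States": {
        "ChargeCard": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Retry": [{"ErrorEquals": ["States.ALL"], "MaxAttempts": 0}],
            "Next": "Notify",
        },
        "Notify": {
            "Type": "Parallel",
            "Branches": [{
                "StartAt": "SendEmail",
                "States": {
                    "SendEmail": {
                        "Type": "Task",
                        "Resource": "arn:aws:states:::sns:publish",
                        "End": True,
                    }
                },
            }],
            "End": True,
        },
    },
}

if __name__ == "__main__":
    for problem in tasks_missing_handlers(DEFINITION):
        print(problem)
    # ChargeCard: missing Catch
    # Notify/SendEmail: missing Retry, Catch
```

Running a check like this in CI against `definition.json` catches missing handlers before they reach production.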
Step Functions can call 200+ AWS services directly. Do NOT wrap simple API calls in Lambda. Common direct integrations to use instead of Lambda:
See references/integrations.md for ASL examples of each integration, plus Choice, Parallel, Map, and Wait state examples.
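As a rough illustration of a direct integration, the sketch below builds a Task state that writes to DynamoDB without a Lambda in between. The `status` attribute and the `HandleSaveFailure` and `NextStep` state names are hypothetical. The state is built as a Python dict and printed as ASL JSON.

```python
import json

# Task state that calls DynamoDB PutItem directly from Step Functions.
save_order_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::dynamodb:putItem",
    "Parameters": {
        "TableName": "Orders",
        "Item": {
            # The ".$" suffix tells Step Functions to resolve the value as a path.
            "orderId": {"S.$": "$.orderId"},
            "status": {"S": "RECEIVED"},  # hypothetical attribute
        },
    },
    # A null ResultPath discards the API response and passes the input through.
    "ResultPath": None,
    "Retry": [{
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 1,
        "MaxAttempts": 3,
        "BackoffRate": 2.0,
        "JitterStrategy": "FULL",
    }],
    "Catch": [{
        "ErrorEquals": ["States.ALL"],
        "Next": "HandleSaveFailure",  # hypothetical state name
        "ResultPath": "$.error",
    }],
    "Next": "NextStep",  # hypothetical state name
}

if __name__ == "__main__":
    print(json.dumps({"SaveOrder": save_order_state}, indent=2))
```

Compared with a Lambda wrapper, this removes a function to deploy, a cold start, and a second set of retries to reason about.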
"Retry": [
  {
    "ErrorEquals": ["States.Timeout"],
    "IntervalSeconds": 5,
    "MaxAttempts": 2,
    "BackoffRate": 2.0
  },
  {
    "ErrorEquals": ["TransientError", "Lambda.ServiceException"],
    "IntervalSeconds": 1,
    "MaxAttempts": 5,
    "BackoffRate": 2.0,
    "JitterStrategy": "FULL"
  },
  {
    "ErrorEquals": ["States.ALL"],
    "MaxAttempts": 0
  }
]
Opinionated: Order retries from specific to general, because retriers are evaluated in array order and the first match wins. Use JitterStrategy: FULL to prevent thundering herd. Put States.ALL with MaxAttempts: 0 last so that unexpected errors are never retried and go straight to Catch, or fail the execution if no Catch matches.
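The ordering rule and the backoff arithmetic can be sketched in a few lines of Python. This is a simplified model, not the service's implementation: it treats `States.ALL` as a plain wildcard, ignores the handful of errors that `States.ALL` cannot match, and assumes FULL jitter draws uniformly between 0 and the computed delay.

```python
import random

RETRIERS = [
    {"ErrorEquals": ["States.Timeout"], "IntervalSeconds": 5,
     "MaxAttempts": 2, "BackoffRate": 2.0},
    {"ErrorEquals": ["TransientError", "Lambda.ServiceException"],
     "IntervalSeconds": 1, "MaxAttempts": 5, "BackoffRate": 2.0,
     "JitterStrategy": "FULL"},
    {"ErrorEquals": ["States.ALL"], "MaxAttempts": 0},
]


def first_matching_retrier(error_name, retriers):
    """Scan retriers in order and return the first whose ErrorEquals matches."""
    for retrier in retriers:
        names = retrier["ErrorEquals"]
        if error_name in names or "States.ALL" in names:
            return retrier
    return None


def nominal_delays(retrier):
    """Delay before each retry: IntervalSeconds * BackoffRate ** attempt_index."""
    interval = retrier.get("IntervalSeconds", 1)   # ASL default
    attempts = retrier.get("MaxAttempts", 3)       # ASL default
    rate = retrier.get("BackoffRate", 2.0)         # ASL default
    return [interval * rate ** n for n in range(attempts)]


def jittered_delays(retrier, rng):
    """With FULL jitter (assumed model), each delay is uniform in [0, nominal]."""
    delays = nominal_delays(retrier)
    if retrier.get("JitterStrategy") == "FULL":
        return [rng.uniform(0, delay) for delay in delays]
    return delays


if __name__ == "__main__":
    transient = first_matching_retrier("Lambda.ServiceException", RETRIERS)
    print(nominal_delays(transient))   # [1.0, 2.0, 4.0, 8.0, 16.0]
    print(jittered_delays(transient, random.Random()))

    unexpected = first_matching_retrier("ValidationError", RETRIERS)
    print(nominal_delays(unexpected))  # [] because MaxAttempts is 0
```

If the `States.ALL` entry came first, it would match every error and the more specific retriers after it would never be reached.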
"Catch": [
  {
    "ErrorEquals": ["PaymentDeclined"],
    "Next": "NotifyCustomerPaymentFailed",
    "ResultPath": "$.error"
  },
  {
    "ErrorEquals": ["States.ALL"],
    "Next": "GenericErrorHandler",
    "ResultPath": "$.error"
  }
]
Always use ResultPath in Catch to preserve the original input alongside the error. Without it, the error replaces your entire state input.
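The difference is easy to see with a small model of what the Catch target receives. The sketch below handles only `$` and top-level `$.key` paths, and the sample input and error values are made up for illustration.

```python
def catch_output(state_input, error_output, result_path="$"):
    """Model the input handed to the Catch target state.

    Only '$' (the default) and top-level '$.key' paths are modeled here.
    """
    if result_path == "$":
        # Default: the error object replaces the whole state input.
        return error_output
    key = result_path.removeprefix("$.")
    return {**state_input, key: error_output}


STATE_INPUT = {"orderId": "12345", "amount": 42}
ERROR_OUTPUT = {"Error": "PaymentDeclined", "Cause": "Card expired"}

if __name__ == "__main__":
    print(catch_output(STATE_INPUT, ERROR_OUTPUT))
    # {'Error': 'PaymentDeclined', 'Cause': 'Card expired'}
    print(catch_output(STATE_INPUT, ERROR_OUTPUT, "$.error"))
    # {'orderId': '12345', 'amount': 42,
    #  'error': {'Error': 'PaymentDeclined', 'Cause': 'Card expired'}}
```

With `$.error`, the handler still knows which order failed. Without it, `orderId` is gone by the time the handler runs.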
Use the saga pattern for distributed transactions across services where you need to undo completed steps on failure. Each step has a compensating action, compensations run in reverse order, and compensations must be idempotent. See references/patterns.md for the full ASL example with compensating transaction flow.
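The control flow of compensating transactions can be sketched outside ASL. The Python below is a plain in-process illustration of the ordering rules, not the state machine from references/patterns.md. The step names and the `run_saga` helper are hypothetical.

```python
def run_saga(steps, state):
    """Run (name, action, compensation) steps; on failure undo in reverse order."""
    completed = []
    log = []
    for name, action, compensate in steps:
        try:
            action(state)
        except Exception as exc:
            log.append(f"failed:{name}:{exc}")
            # Only steps that finished are compensated, most recent first.
            for done_name, undo in reversed(completed):
                undo(state)
                log.append(f"undone:{done_name}")
            return False, log
        completed.append((name, compensate))
        log.append(f"done:{name}")
    return True, log


def ship_order(state):
    raise RuntimeError("carrier unavailable")


ORDER_ID = "12345"
# set.discard is idempotent: running a compensation twice is harmless.
STEPS = [
    ("ReserveInventory",
     lambda s: s["reserved"].add(ORDER_ID),
     lambda s: s["reserved"].discard(ORDER_ID)),
    ("ChargePayment",
     lambda s: s["charged"].add(ORDER_ID),
     lambda s: s["charged"].discard(ORDER_ID)),
    ("ShipOrder", ship_order, lambda s: None),
]

if __name__ == "__main__":
    state = {"reserved": set(), "charged": set()}
    ok, log = run_saga(STEPS, state)
    print(ok)   # False
    print(log)
    # ['done:ReserveInventory', 'done:ChargePayment',
    #  'failed:ShipOrder:carrier unavailable',
    #  'undone:ChargePayment', 'undone:ReserveInventory']
```

In a state machine, each `except` branch corresponds to a Catch on the forward Task that routes into the chain of compensation states.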
Use .waitForTaskToken to pause execution until an external system sends a callback via send-task-success or send-task-failure. Always set TimeoutSeconds on callback tasks. Without it, the execution waits forever (up to 1 year for Standard). See references/patterns.md for the full ASL and CLI examples.
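A minimal sketch of a callback task with a timeout follows, built as a Python dict and printed as ASL JSON. The queue URL, the state names, and the 24-hour timeout are assumptions for illustration. The task token comes from the context object via `$$.Task.Token`.

```python
import json

wait_for_approval = {
    "Type": "Task",
    # The .waitForTaskToken suffix pauses the state until a callback arrives.
    "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
    "Parameters": {
        # Hypothetical queue URL.
        "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/approvals",
        "MessageBody": {
            "orderId.$": "$.orderId",
            # The external system must echo this token back in its callback.
            "taskToken.$": "$$.Task.Token",
        },
    },
    # Without this, the task can wait until the execution itself expires.
    "TimeoutSeconds": 86400,
    "Catch": [{
        "ErrorEquals": ["States.Timeout"],
        "Next": "ApprovalTimedOut",   # hypothetical state name
        "ResultPath": "$.error",
    }],
    "Next": "ProcessApproval",        # hypothetical state name
}

if __name__ == "__main__":
    print(json.dumps({"WaitForApproval": wait_for_approval}, indent=2))
```

The consumer of the queue message completes the task by passing the received token to `send-task-success` or `send-task-failure`.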
Use a Distributed Map state to process millions of items from S3, with Express child executions for massive parallelism. See references/patterns.md for the ASL example with S3 CSV reader configuration.
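For orientation, here is a hedged sketch of what such a Map state can look like, built as a Python dict and printed as ASL JSON. It is not the example from references/patterns.md. The bucket, key, Lambda function name, concurrency, and failure threshold are all hypothetical.

```python
import json

process_csv = {
    "Type": "Map",
    # Read items straight from a CSV object in S3.
    "ItemReader": {
        "Resource": "arn:aws:states:::s3:getObject",
        "ReaderConfig": {"InputType": "CSV", "CSVHeaderLocation": "FIRST_ROW"},
        "Parameters": {"Bucket": "my-input-bucket", "Key": "orders.csv"},
    },
    "ItemProcessor": {
        # DISTRIBUTED mode runs each item (or batch) as a child execution.
        "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS"},
        "StartAt": "ProcessRow",
        "States": {
            "ProcessRow": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": "process-row", "Payload.$": "$"},
                "Retry": [{
                    "ErrorEquals": ["States.TaskFailed"],
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                    "JitterStrategy": "FULL",
                }],
                "End": True,
            }
        },
    },
    "MaxConcurrency": 1000,
    # Let the run succeed if at most 5% of items fail.
    "ToleratedFailurePercentage": 5,
    "End": True,
}

if __name__ == "__main__":
    print(json.dumps({"ProcessCsv": process_csv}, indent=2))
```

Express child executions keep per-item cost low, but each child is still bound by the 5-minute Express limit, so keep the per-row work short.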
# Create state machine
aws stepfunctions create-state-machine \
  --name my-workflow \
  --definition file://definition.json \
  --role-arn arn:aws:iam::123456789012:role/step-functions-role

# Start execution
aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:us-east-1:123456789012:stateMachine:my-workflow \
  --input '{"orderId": "12345"}'

# List failed executions
aws stepfunctions list-executions \
  --state-machine-arn arn:aws:states:us-east-1:123456789012:stateMachine:my-workflow \
  --status-filter FAILED

# Get execution details
aws stepfunctions describe-execution \
  --execution-arn arn:aws:states:us-east-1:123456789012:execution:my-workflow:exec-123

# Get execution history (debug step-by-step), filtered to failure events
aws stepfunctions get-execution-history \
  --execution-arn arn:aws:states:us-east-1:123456789012:execution:my-workflow:exec-123 \
  --query 'events[?type==`TaskFailed` || type==`ExecutionFailed`]'

# Update state machine
aws stepfunctions update-state-machine \
  --state-machine-arn arn:aws:states:us-east-1:123456789012:stateMachine:my-workflow \
  --definition file://definition.json

# Test a single state in isolation (calls the TestState API in your account; this is not a local run)
aws stepfunctions test-state \
  --definition '{"Type":"Task","Resource":"arn:aws:states:::dynamodb:getItem","Parameters":{"TableName":"Orders","Key":{"orderId":{"S":"123"}}}}' \
  --role-arn arn:aws:iam::123456789012:role/step-functions-role \
  --input '{"orderId": "123"}'
Use Workflow Studio in the AWS Console for:
Opinionated: Start in Workflow Studio for prototyping, then export to ASL (Amazon States Language) JSON and manage in version control. Never rely solely on the console for production workflows.
Data flows through each state as: InputPath -> Parameters -> Task -> ResultSelector -> ResultPath -> OutputPath
Opinionated: Use ResultPath generously to accumulate data through states. Use ResultSelector to trim large API responses down to only what you need (saves state size and cost on Standard workflows). See references/integrations.md for detailed examples of each processing stage.
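The processing order can be traced with a toy interpreter. The sketch below is a simplified model, not the service's implementation: it supports only `$` and dotted `$.a.b` paths (no wildcards, array indexes, intrinsic functions, or `$$` context paths), and `fake_get_item` is a stand-in for a real DynamoDB response.

```python
def get_path(doc, path):
    """Resolve '$' or '$.a.b' against nested dicts."""
    value = doc
    for key in path.split(".")[1:]:
        value = value[key]
    return value


def apply_template(template, doc):
    """Parameters / ResultSelector: keys ending in '.$' are resolved as paths."""
    out = {}
    for key, value in template.items():
        if key.endswith(".$"):
            out[key[:-2]] = get_path(doc, value)
        elif isinstance(value, dict):
            out[key] = apply_template(value, doc)
        else:
            out[key] = value
    return out


def set_path(doc, path, value):
    """ResultPath: '$' replaces the document; '$.a.b' inserts value at that key."""
    if path == "$":
        return value
    keys = path.split(".")[1:]
    merged = dict(doc)
    cursor = merged
    for key in keys[:-1]:
        cursor[key] = dict(cursor.get(key, {}))
        cursor = cursor[key]
    cursor[keys[-1]] = value
    return merged


def run_task_state(state, raw_input, task):
    """InputPath -> Parameters -> Task -> ResultSelector -> ResultPath -> OutputPath."""
    selected = get_path(raw_input, state.get("InputPath", "$"))
    task_input = apply_template(state["Parameters"], selected) if "Parameters" in state else selected
    result = task(task_input)
    if "ResultSelector" in state:
        result = apply_template(state["ResultSelector"], result)
    # ResultPath places the result relative to the state's raw input.
    combined = set_path(raw_input, state.get("ResultPath", "$"), result)
    return get_path(combined, state.get("OutputPath", "$"))


def fake_get_item(params):
    """Stand-in for dynamodb:getItem, including metadata nobody needs downstream."""
    order_id = params["Key"]["orderId"]["S"]
    return {"Item": {"orderId": {"S": order_id}, "status": {"S": "SHIPPED"}},
            "SdkHttpMetadata": {"HttpStatusCode": 200}}


STATE = {
    "Parameters": {"TableName": "Orders", "Key": {"orderId": {"S.$": "$.orderId"}}},
    "ResultSelector": {"status.$": "$.Item.status.S"},
    "ResultPath": "$.order",
}

if __name__ == "__main__":
    print(run_task_state(STATE, {"orderId": "12345"}, fake_get_item))
    # {'orderId': '12345', 'order': {'status': 'SHIPPED'}}
```

ResultSelector drops the response metadata before ResultPath merges the trimmed result into the original input, so the next state sees both `orderId` and `order.status` and nothing else.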
- .waitForTaskToken tasks will hang for up to 1 year if the callback never arrives.
- arn:aws:states:::states:startExecution.sync:2.
- JitterStrategy on retries: Without jitter, retried tasks create thundering herd effects that amplify the original failure.
- ResultSelector to trim response payloads -- smaller payloads mean faster processing.

- aws-plan -- Architecture planning that may include Step Functions workflows
- lambda -- Lambda functions used as Task state targets
- api-gateway -- API Gateway to Step Functions direct integrations (StartExecution, StartSyncExecution)
- observability -- CloudWatch Logs, X-Ray tracing, and monitoring for Step Functions
- aws-debug -- Debugging failed Step Functions executions