Step Functions is a serverless workflow orchestrator. You define a state machine (a series of steps), and Step Functions handles executing them, passing data between steps, handling retries, and managing state.
Real-World: An e-commerce order flow:
- Validate order
- Check inventory
- Charge customer (with retry if payment fails)
- Update inventory
- Send confirmation email
- If any step fails → refund, notify ops team
Without Step Functions, you would build this with SQS queues, Lambda functions tracking state in DynamoDB, and custom retry logic. With Step Functions, you define the flow in JSON (Amazon States Language) and the service tracks state, runs retries, and passes data between steps.
{
"Comment": "Order Processing Workflow",
"StartAt": "ValidateOrder",
"States": {
"ValidateOrder": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123:function:ValidateOrder",
"Next": "CheckInventory",
"Catch": [{
"ErrorEquals": ["InvalidOrderError"],
"Next": "OrderFailed"
}]
},
"CheckInventory": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123:function:CheckInventory",
"Next": "ChargeCustomer",
"Retry": [{
"ErrorEquals": ["States.TaskFailed"],
"IntervalSeconds": 2,
"MaxAttempts": 3,
"BackoffRate": 2.0
}],
"Catch": [{
"ErrorEquals": ["OutOfStockError"],
"Next": "NotifyOutOfStock"
}]
},
"ChargeCustomer": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
"Parameters": {
"FunctionName": "ChargeCustomer",
"Payload": {
"taskToken.$": "$$.Task.Token",
"orderId.$": "$.orderId"
}
},
"Next": "ProcessOrder",
"TimeoutSeconds": 300
},
"ProcessOrder": {
"Type": "Parallel",
"Branches": [
{
"StartAt": "UpdateInventory",
"States": {
"UpdateInventory": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123:function:UpdateInventory",
"End": true
}
}
},
{
"StartAt": "SendConfirmation",
"States": {
"SendConfirmation": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123:function:SendEmail",
"End": true
}
}
}
],
"Next": "OrderComplete"
},
"OrderComplete": {
"Type": "Succeed"
},
"OrderFailed": {
"Type": "Fail",
"Error": "OrderValidationFailed",
"Cause": "Order failed validation checks"
},
"NotifyOutOfStock": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123:function:NotifyOutOfStock",
"End": true
}
}
}

| State Type | What it does |
|---|---|
| Task | Invoke Lambda, call an AWS service, or wait for a job or callback |
| Pass | Pass input to output (data transform, no work) |
| Wait | Pause for N seconds or until timestamp |
| Choice | Conditional branching (if/else logic) |
| Parallel | Execute branches simultaneously |
| Map | Iterate over array items |
| Succeed | Successful end |
| Fail | Error end with cause |
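Because every state hands off to another state by name, a typo in a `Next` or `Catch` target is an easy mistake to make in a definition like the order workflow above. The sketch below is a minimal check, not an official validator: `find_missing_targets` is a hypothetical helper, and it only inspects top-level states (it does not descend into Parallel branches or Map iterators).

```python
def find_missing_targets(definition):
    """Return transition targets that are not defined as top-level states."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    for state in states.values():
        if "Next" in state:
            targets.add(state["Next"])
        if "Default" in state:  # fallback branch of a Choice state
            targets.add(state["Default"])
        for rule in state.get("Choices", []):
            targets.add(rule["Next"])
        for catcher in state.get("Catch", []):
            targets.add(catcher["Next"])
    return sorted(targets - set(states))


# Cut-down version of the order workflow; OrderFailed is deliberately left out.
workflow = {
    "StartAt": "ValidateOrder",
    "States": {
        "ValidateOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123:function:ValidateOrder",
            "Next": "CheckInventory",
            "Catch": [{"ErrorEquals": ["InvalidOrderError"], "Next": "OrderFailed"}],
        },
        "CheckInventory": {"Type": "Succeed"},
    },
}

print(find_missing_targets(workflow))  # ['OrderFailed']
```

Running a check like this before deploying catches dangling transitions early; Step Functions itself also rejects such definitions at creation time.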
{
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123:function:myFunction",
"Parameters": {
"orderId.$": "$.orderId",
"userId.$": "$.userId"
},
"ResultPath": "$.lambdaResult"
}

Call AWS services directly without Lambda.

Write to DynamoDB directly (no Lambda needed):
{
"Type": "Task",
"Resource": "arn:aws:states:::dynamodb:putItem",
"Parameters": {
"TableName": "orders",
"Item": {
"orderId": {"S.$": "$.orderId"},
"status": {"S": "PROCESSING"}
}
}
}

Send an SQS message:
{
"Type": "Task",
"Resource": "arn:aws:states:::sqs:sendMessage",
"Parameters": {
"QueueUrl": "https://sqs.us-east-1.amazonaws.com/123/orders",
"MessageBody.$": "States.JsonToString($.order)"
}
}

Start another Step Functions workflow and wait for it to finish:
{
"Type": "Task",
"Resource": "arn:aws:states:::states:startExecution.sync:2",
"Parameters": {
"StateMachineArn": "arn:aws:states:us-east-1:123:stateMachine:SubWorkflow",
"Input.$": "$"
}
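}

Request Response (default): call the service and move to the next state as soon as it responds. For a Lambda task, that response is the function's return value.
{"Resource": "arn:aws:lambda:...:function:MyFunc"}

Run a Job (.sync): wait for the job to complete before moving to the next state. Lambda has no .sync variant; the pattern applies to job-style services such as ECS, Batch, and Glue.
{"Resource": "arn:aws:states:::ecs:runTask.sync"}

Wait for Callback (.waitForTaskToken): pause the workflow and wait for an external system to call back with the task token.
{"Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken"}

Real-World: Human approval step.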
{
"Type": "Task",
"Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
"Parameters": {
"QueueUrl": "https://sqs.us-east-1.amazonaws.com/123/approvals",
"MessageBody": {
"taskToken.$": "$$.Task.Token",
"orderId.$": "$.orderId",
"amount.$": "$.amount"
}
},
"HeartbeatSeconds": 3600
}

With HeartbeatSeconds set to 3600, the task fails if no heartbeat or result arrives within 1 hour.

Human reviews in UI → clicks approve → their system calls:
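import json
import boto3

sfn_client = boto3.client("stepfunctions")
# token is the taskToken value read from the SQS message body

sfn_client.send_task_success(
    taskToken=token,
    output=json.dumps({"approved": True})
)
# OR
sfn_client.send_task_failure(
    taskToken=token,
    error="RejectedByHuman",
    cause="Amount too high"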
)

Retry settings: wait 2s before the first retry (IntervalSeconds), retry up to 3 times after the original attempt (MaxAttempts), double the wait on each retry for 2s, 4s, 8s (BackoffRate), and add randomness to the waits to prevent a thundering herd (JitterStrategy FULL).
{
"Retry": [{
"ErrorEquals": ["Lambda.ServiceException", "Lambda.AWSLambdaException"],
"IntervalSeconds": 2,
"MaxAttempts": 3,
"BackoffRate": 2.0,
"JitterStrategy": "FULL"
}]
}

Catchers are checked in order. ResultPath "$.error" puts the error in $.error and keeps the original input; the States.ALL catcher handles anything the first one did not match.
{
"Catch": [
{
"ErrorEquals": ["PaymentDeclinedError"],
"Next": "HandleDeclinedPayment",
"ResultPath": "$.error"
},
{
"ErrorEquals": ["States.ALL"],
"Next": "HandleGenericError"
}
]
}

Built-in error names:
- States.ALL: catch any error
- States.Timeout: task timed out
- States.TaskFailed: task threw an error
- States.Permissions: insufficient permissions
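The Retry and Catch settings above can be sketched in plain Python to see what they produce. This is a simplified model, not how the service is implemented: it ignores jitter and any maximum-delay cap, and it treats only `States.ALL` as a wildcard. Both function names are hypothetical.

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Wait before each retry: interval * backoff_rate**n for n = 0..max_attempts-1."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


def first_matching_catcher(catchers, error_name):
    """Return the Next target of the first catcher that matches, or None."""
    for catcher in catchers:
        names = catcher["ErrorEquals"]
        if error_name in names or "States.ALL" in names:
            return catcher["Next"]
    return None


catchers = [
    {"ErrorEquals": ["PaymentDeclinedError"], "Next": "HandleDeclinedPayment"},
    {"ErrorEquals": ["States.ALL"], "Next": "HandleGenericError"},
]

print(retry_delays(2, 3, 2.0))  # [2.0, 4.0, 8.0]
print(first_matching_catcher(catchers, "PaymentDeclinedError"))  # HandleDeclinedPayment
print(first_matching_catcher(catchers, "SomeOtherError"))  # HandleGenericError
```

Order matters in the Catch list: if the `States.ALL` entry came first, the specific `PaymentDeclinedError` path would never be reached.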
{
"Type": "Choice",
"Choices": [
{
"Variable": "$.orderTotal",
"NumericGreaterThan": 1000,
"Next": "RequireManagerApproval"
},
{
"Variable": "$.customerType",
"StringEquals": "premium",
"Next": "PremiumProcessing"
},
{
"And": [
{"Variable": "$.country", "StringEquals": "US"},
{"Variable": "$.expressShipping", "BooleanEquals": true}
],
"Next": "ExpressUSShipping"
}
],
"Default": "StandardProcessing"
}

Process each item in an array. ItemsPath selects the array to iterate over; MaxConcurrency caps how many iterations run in parallel:
{
"Type": "Map",
"ItemsPath": "$.orders",
"MaxConcurrency": 5,
"Iterator": {
"StartAt": "ProcessSingleOrder",
"States": {
"ProcessSingleOrder": {
"Type": "Task",
"Resource": "arn:aws:lambda:...:function:ProcessOrder",
"End": true
}
}
}
}

Real-World: Process 100 orders from a batch file. Map iterates over each one and processes up to 5 at a time.
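As a rough local analogy for the Map state above, a thread pool with five workers behaves similarly: every item is processed, at most five run at once, and the results come back in input order. This is only an illustration; `run_map_state` and `process_single_order` are hypothetical stand-ins, not Step Functions APIs.

```python
from concurrent.futures import ThreadPoolExecutor


def run_map_state(items, process_item, max_concurrency):
    """Apply process_item to every item, at most max_concurrency at a time."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # pool.map yields results in the same order as the input items
        return list(pool.map(process_item, items))


def process_single_order(order):
    """Stand-in for the ProcessOrder Lambda function."""
    return {"orderId": order["orderId"], "status": "PROCESSED"}


orders = [{"orderId": f"order-{n}"} for n in range(100)]
results = run_map_state(orders, process_single_order, max_concurrency=5)

print(len(results))  # 100
print(results[0])    # {'orderId': 'order-0', 'status': 'PROCESSED'}
```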
| Feature | Standard | Express |
|---|---|---|
| Duration | Up to 1 year | Up to 5 minutes |
| Execution model | Exactly-once | At-least-once (async) / At-most-once (sync) |
| Pricing | Per state transition | Per duration + invocations |
| Execution history | Full history in console | CloudWatch Logs |
| Use for | Long-running, human approval | High-volume, short workflows |
Standard: Order processing, data pipelines, ETL
Express: IoT data processing, streaming ETL, microservice orchestration
| Practice | Reason |
|---|---|
| Use optimized SDK integrations over Lambda | Lower cost, lower latency, less code |
| Add Retry with exponential backoff | Handle transient failures automatically |
| Use Catch for specific error types | Different error paths for different failures |
| Use .waitForTaskToken for human approval | Clean async callback pattern |
| Use Parallel for independent steps | Reduce total execution time |
| Use Express Workflow for high-volume short tasks | Usually much cheaper than Standard at high volume |
| Anti-Pattern | Impact | Fix |
|---|---|---|
| Using Lambda for everything that can use SDK integration | Extra cost + cold starts | Use optimized integrations (DynamoDB, SQS, SNS, etc.) |
| No retry on Task states | Transient failures cause workflow failure | Add Retry blocks |
| No timeout on waitForTaskToken | Workflow stays paused until the execution itself times out (up to 1 year on Standard) | Set TimeoutSeconds or HeartbeatSeconds |
| Standard workflow for high-volume IoT (>1M/day) | Very expensive | Use Express workflow |
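Two of the anti-patterns in the table (no Retry on Task states, no timeout on `waitForTaskToken`) can be checked mechanically before deployment. The sketch below assumes the state machine has been loaded into a Python dict; `lint_task_states` is a hypothetical helper, and it only looks at top-level states and the plain `TimeoutSeconds` / `HeartbeatSeconds` fields.

```python
def lint_task_states(states):
    """Return warnings for Task states missing Retry or a callback timeout."""
    warnings = []
    for name, state in states.items():
        if state.get("Type") != "Task":
            continue
        if "Retry" not in state:
            warnings.append(f"{name}: no Retry block")
        waits_for_callback = state.get("Resource", "").endswith(".waitForTaskToken")
        has_timeout = "TimeoutSeconds" in state or "HeartbeatSeconds" in state
        if waits_for_callback and not has_timeout:
            warnings.append(f"{name}: waitForTaskToken without a timeout")
    return warnings


states = {
    "CheckInventory": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:us-east-1:123:function:CheckInventory",
        "Retry": [{"ErrorEquals": ["States.TaskFailed"], "MaxAttempts": 3}],
    },
    "WaitForApproval": {
        "Type": "Task",
        "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
    },
    "OrderComplete": {"Type": "Succeed"},
}

for warning in lint_task_states(states):
    print(warning)
```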
- Step Functions solves the "Lambda can only run 15 min" problem: Standard workflows can run for up to 1 year.
- waitForTaskToken is the pattern for human approval and async integrations.
- Parallel state = branches run simultaneously. Map state = iterate over array.
- Choice state = if/else branching — no Lambda needed.
- Express vs Standard: Express = high-volume, short. Standard = long-running, full history.
- $$.Task.Token = the task token in the waitForTaskToken pattern.
- $$ = context object.
- ResultPath = where to put the task output in the state data.
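To make the ResultPath takeaway concrete, here is a small simulation of how a task result is merged into the state input. It is a simplified sketch: `apply_result_path` is a hypothetical helper that handles only `"$"`, `None` (JSON `null`), and dotted paths such as `"$.lambdaResult"`, not the full path syntax.

```python
import copy


def apply_result_path(state_input, result, result_path):
    """Combine a task result with the state input the way ResultPath does."""
    if result_path is None:  # JSON null: discard the result, keep the input
        return copy.deepcopy(state_input)
    if result_path == "$":  # default: the result replaces the input
        return result
    output = copy.deepcopy(state_input)
    node = output
    keys = result_path[2:].split(".")  # strip the leading "$."
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = result
    return output


state_input = {"orderId": "o-1", "userId": "u-9"}
task_result = {"valid": True}

print(apply_result_path(state_input, task_result, "$.lambdaResult"))
# {'orderId': 'o-1', 'userId': 'u-9', 'lambdaResult': {'valid': True}}
print(apply_result_path(state_input, task_result, "$"))
# {'valid': True}
```

Using a path like `$.lambdaResult` is what lets later states still see `orderId` and `userId` alongside the task output.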
Q: Order processing needs a 15+ step workflow with human approval? → Step Functions Standard workflow with waitForTaskToken state for approval.
Q: Process each item in a DynamoDB scan result with Lambda? → Map state with Lambda integration — iterate over array, up to MaxConcurrency parallel.
Q: Handle payment API being flaky (occasional 500 errors)? → Add Retry to the payment Task state with exponential backoff.
Q: IoT device sends 1 million events/day, each needs 3 processing steps? → Step Functions Express Workflow (cheaper than Standard for high volume).
Q: Lambda takes too long — what's the alternative for long-running orchestration? → Use Step Functions to orchestrate multiple Lambda calls instead of one long Lambda.
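For the human-approval question above, the piece that is easy to overlook is the code that resumes the paused execution. The sketch below assumes the approval message body is the JSON object sent by the approval state (with `taskToken` and `orderId` fields) and that `sfn_client` is a boto3 Step Functions client. `complete_approval` and `RecordingClient` are hypothetical names; the recording client is a test double so the example runs without AWS access.

```python
import json


def complete_approval(sfn_client, message_body, approved, reason="Rejected"):
    """Resume the paused execution using the task token from the message."""
    message = json.loads(message_body)
    token = message["taskToken"]
    if approved:
        sfn_client.send_task_success(
            taskToken=token,
            output=json.dumps({"approved": True, "orderId": message["orderId"]}),
        )
    else:
        sfn_client.send_task_failure(
            taskToken=token, error="RejectedByHuman", cause=reason
        )


class RecordingClient:
    """Test double standing in for boto3.client("stepfunctions")."""

    def __init__(self):
        self.calls = []

    def send_task_success(self, **kwargs):
        self.calls.append(("success", kwargs))

    def send_task_failure(self, **kwargs):
        self.calls.append(("failure", kwargs))


client = RecordingClient()
body = json.dumps({"taskToken": "token-abc", "orderId": "o-1", "amount": 1200})
complete_approval(client, body, approved=False, reason="Amount too high")
print(client.calls)
```

In a real deployment the same function would be called with `boto3.client("stepfunctions")` from whatever service backs the approval UI.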