Step Functions — Serverless Workflows

What Is It?

Step Functions is a serverless workflow orchestrator. You define a state machine (a series of steps), and Step Functions handles executing them, passing data between steps, handling retries, and managing state.

Real-World: An e-commerce order flow:

  1. Validate order
  2. Check inventory
  3. Charge customer (with retry if payment fails)
  4. Update inventory
  5. Send confirmation email
  6. If any step fails → refund, notify ops team

Without Step Functions, you would build this from SQS queues, Lambda functions that track state in DynamoDB, and custom retry logic. With Step Functions, you define the flow in JSON and the service handles sequencing, state tracking, and retries.


State Machine Example

{
  "Comment": "Order Processing Workflow",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123:function:ValidateOrder",
      "Next": "CheckInventory",
      "Catch": [{
        "ErrorEquals": ["InvalidOrderError"],
        "Next": "OrderFailed"
      }]
    },
    
    "CheckInventory": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123:function:CheckInventory",
      "Next": "ChargeCustomer",
      "Retry": [{
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 2,
        "MaxAttempts": 3,
        "BackoffRate": 2.0
      }],
      "Catch": [{
        "ErrorEquals": ["OutOfStockError"],
        "Next": "NotifyOutOfStock"
      }]
    },
    
    "ChargeCustomer": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
      "Parameters": {
        "FunctionName": "ChargeCustomer",
        "Payload": {
          "taskToken.$": "$$.Task.Token",
          "orderId.$": "$.orderId"
        }
      },
      "Next": "ProcessOrder",
      "TimeoutSeconds": 300
    },
    
    "ProcessOrder": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "UpdateInventory",
          "States": {
            "UpdateInventory": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-east-1:123:function:UpdateInventory",
              "End": true
            }
          }
        },
        {
          "StartAt": "SendConfirmation",
          "States": {
            "SendConfirmation": {
              "Type": "Task", 
              "Resource": "arn:aws:lambda:us-east-1:123:function:SendEmail",
              "End": true
            }
          }
        }
      ],
      "Next": "OrderComplete"
    },
    
    "OrderComplete": {
      "Type": "Succeed"
    },
    
    "OrderFailed": {
      "Type": "Fail",
      "Error": "OrderValidationFailed",
      "Cause": "Order failed validation checks"
    },
    
    "NotifyOutOfStock": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123:function:NotifyOutOfStock",
      "End": true
    }
  }
}
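
A definition like the one above is easy to break with a misspelled state name. The following is a minimal sketch of a pre-deployment check, not part of Step Functions itself: `find_broken_transitions` is a hypothetical helper that only inspects top-level `StartAt`, `Next`, `Default`, `Catch`, and `Choices` targets, and it does not descend into Parallel or Map branches.

```python
import json

def find_broken_transitions(definition):
    """Return sorted (state, target) pairs whose target state is not defined."""
    states = definition["States"]
    broken = []
    if definition["StartAt"] not in states:
        broken.append(("StartAt", definition["StartAt"]))
    for name, state in states.items():
        targets = []
        if "Next" in state:
            targets.append(state["Next"])
        if "Default" in state:
            targets.append(state["Default"])
        # Catch blocks and Choice rules each carry their own "Next"
        for rule in state.get("Catch", []) + state.get("Choices", []):
            targets.append(rule["Next"])
        for target in targets:
            if target not in states:
                broken.append((name, target))
    return sorted(broken)

# A deliberately incomplete excerpt: CheckInventory is referenced but not defined
definition = json.loads("""
{
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123:function:ValidateOrder",
      "Next": "CheckInventory",
      "Catch": [{"ErrorEquals": ["InvalidOrderError"], "Next": "OrderFailed"}]
    },
    "OrderFailed": {"Type": "Fail", "Error": "OrderValidationFailed"}
  }
}
""")

print(find_broken_transitions(definition))
```

Running it on the excerpt reports the missing `CheckInventory` target; an empty list means every top-level transition resolves.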

State Types

| State Type | What it does |
|---|---|
| Task | Invoke Lambda, call an AWS service, or wait for a callback |
| Pass | Pass input to output (data transform, no work) |
| Wait | Pause for N seconds or until a timestamp |
| Choice | Conditional branching (if/else logic) |
| Parallel | Execute branches simultaneously |
| Map | Iterate over array items |
| Succeed | Successful end |
| Fail | Error end with cause |

Task State — Calling AWS Services

Calling Lambda

{
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:123:function:myFunction",
  "Parameters": {
    "orderId.$": "$.orderId",
    "userId.$": "$.userId"
  },
  "ResultPath": "$.lambdaResult"
}
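
To make the data flow concrete, here is a simplified Python model of what `Parameters` and `ResultPath` do in the state above. It is a sketch under stated assumptions, not how Step Functions is implemented: both helper functions are hypothetical, only top-level paths such as `$.orderId` are handled, and the Lambda result is a made-up stand-in.

```python
import copy

def resolve_parameters(parameters, state_input):
    """Build the task payload: keys ending in '.$' are read from the input by path."""
    payload = {}
    for key, value in parameters.items():
        if key.endswith(".$"):
            # Simplification: only top-level paths such as "$.orderId" are supported
            payload[key[:-2]] = state_input[value[2:]]
        else:
            payload[key] = value
    return payload

def apply_result_path(state_input, result, result_path):
    """Merge the task result into a copy of the input at a top-level ResultPath."""
    output = copy.deepcopy(state_input)
    output[result_path[2:]] = result
    return output

state_input = {"orderId": "o-1", "userId": "u-7", "note": "gift"}
payload = resolve_parameters(
    {"orderId.$": "$.orderId", "userId.$": "$.userId"}, state_input
)
result = {"valid": True}  # hypothetical return value of the Lambda function
output = apply_result_path(state_input, result, "$.lambdaResult")

print(payload)
print(output)
```

The payload contains only the two selected fields, while the state output keeps the whole original input and adds the result under `lambdaResult`.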

Optimized Service Integrations

Call AWS services directly without Lambda:

// Call DynamoDB directly (no Lambda needed!)
{
  "Type": "Task",
  "Resource": "arn:aws:states:::dynamodb:putItem",
  "Parameters": {
    "TableName": "orders",
    "Item": {
      "orderId": {"S.$": "$.orderId"},
      "status": {"S": "PROCESSING"}
    }
  }
}

// Send SQS message
{
  "Type": "Task",
  "Resource": "arn:aws:states:::sqs:sendMessage",
  "Parameters": {
    "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123/orders",
    "MessageBody.$": "States.JsonToString($.order)"
  }
}

// Invoke another Step Functions workflow
{
  "Type": "Task",
  "Resource": "arn:aws:states:::states:startExecution.sync:2",
  "Parameters": {
    "StateMachineArn": "arn:aws:states:us-east-1:123:stateMachine:SubWorkflow",
    "Input.$": "$"
  }
}

Integration Patterns

Request/Response (default)

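Call the service and move to the next state as soon as it returns a response. Step Functions does not wait for any job the call starts to finish; for a Lambda function, the response is the function's return value.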

{"Resource": "arn:aws:lambda:...:function:MyFunc"}

Synchronous (.sync)

Wait for the job to complete before moving to the next state. This applies to services that run jobs, such as AWS Batch, ECS, or a nested Step Functions execution.

{"Resource": "arn:aws:states:::batch:submitJob.sync"}
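
The suffix of the `Resource` ARN is what selects the integration pattern. As a rough illustration, the hypothetical helper below classifies an ARN by that suffix; the labels it returns are names chosen for this sketch, not values from any AWS API.

```python
import re

def integration_pattern(resource):
    """Classify a Task state's Resource ARN by its integration-pattern suffix."""
    if resource.endswith(".waitForTaskToken"):
        return "callback"
    # ".sync" and versioned forms such as ".sync:2" both wait for the job to finish
    if re.search(r"\.sync(:\d+)?$", resource):
        return "sync"
    return "request-response"

examples = [
    "arn:aws:lambda:us-east-1:123:function:myFunction",
    "arn:aws:states:::states:startExecution.sync:2",
    "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
]
for arn in examples:
    print(arn, "->", integration_pattern(arn))
```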

Wait for Task Token (callback pattern)

Pause workflow and wait for external system to call back.

Real-World: Human approval step.

{
  "Type": "Task",
  "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
  "Parameters": {
    "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123/approvals",
    "MessageBody": {
      "taskToken.$": "$$.Task.Token",
      "orderId.$": "$.orderId",
      "amount.$": "$.amount"
    }
  },
  "TimeoutSeconds": 3600  // Fail with States.Timeout if no callback arrives within 1 hour
}

Human reviews in UI → clicks approve → their system calls:

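import json
import boto3

sfn_client = boto3.client("stepfunctions")
# token: the taskToken value read from the approval message

sfn_client.send_task_success(
    taskToken=token,
    output=json.dumps({"approved": True})
)
# OR
sfn_client.send_task_failure(
    taskToken=token,
    error="RejectedByHuman",
    cause="Amount too high"
)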
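
A possible shape for the approval system's side of the callback is sketched below. It assumes the SQS message body is the JSON object built in the state's `MessageBody` (so it carries `taskToken` and `orderId`). `complete_approval` and `RecordingClient` are hypothetical names: the latter is only an offline stand-in so the sketch runs without AWS credentials, and in real use you would pass `boto3.client("stepfunctions")` instead.

```python
import json

def complete_approval(sfn_client, message_body, approved, reason=""):
    """Resume the paused execution using the task token carried in the message."""
    message = json.loads(message_body)
    token = message["taskToken"]
    if approved:
        sfn_client.send_task_success(
            taskToken=token,
            output=json.dumps({"approved": True, "orderId": message["orderId"]}),
        )
    else:
        sfn_client.send_task_failure(
            taskToken=token,
            error="RejectedByHuman",
            cause=reason,
        )
    return token

class RecordingClient:
    """Offline stand-in for the Step Functions client; records calls only."""
    def __init__(self):
        self.calls = []

    def send_task_success(self, **kwargs):
        self.calls.append(("success", kwargs))

    def send_task_failure(self, **kwargs):
        self.calls.append(("failure", kwargs))

client = RecordingClient()
body = json.dumps({"taskToken": "token-abc", "orderId": "o-1", "amount": 1500})
complete_approval(client, body, approved=False, reason="Amount too high")
print(client.calls)
```

Passing the client in as an argument keeps the decision logic testable without touching AWS.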

Error Handling

Retry

{
  "Retry": [{
    "ErrorEquals": ["Lambda.ServiceException", "Lambda.AWSLambdaException"],
    "IntervalSeconds": 2,     // Wait 2s before the first retry
    "MaxAttempts": 3,         // Retry up to 3 times after the initial attempt
    "BackoffRate": 2.0,       // Double the wait each retry: 2s, 4s, 8s (before jitter)
    "JitterStrategy": "FULL"  // Randomize each wait to prevent thundering herd
  }]
}
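
The wait before each retry follows from `IntervalSeconds` and `BackoffRate`. The sketch below works through that arithmetic. `retry_delays` is a hypothetical helper, and the jitter branch is an assumption: it models `FULL` jitter as a uniform random wait between 0 and the computed delay.

```python
import random

def retry_delays(interval_seconds, max_attempts, backoff_rate, jitter=False, rng=random):
    """Return the wait (in seconds) before each retry attempt."""
    delays = []
    for attempt in range(max_attempts):
        # The first retry waits IntervalSeconds; each later one multiplies by BackoffRate
        delay = interval_seconds * backoff_rate ** attempt
        if jitter:
            # Assumed model of FULL jitter: uniform between 0 and the computed delay
            delay = rng.uniform(0, delay)
        delays.append(delay)
    return delays

delays = retry_delays(2, 3, 2.0)
print(delays)       # [2.0, 4.0, 8.0]
print(sum(delays))  # 14.0 seconds of waiting in the worst case without jitter
```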

Catch

{
  "Catch": [
    {
      "ErrorEquals": ["PaymentDeclinedError"],
      "Next": "HandleDeclinedPayment",
      "ResultPath": "$.error"  // Put error in $.error, keep original input
    },
    {
      "ErrorEquals": ["States.ALL"],  // Catch anything else
      "Next": "HandleGenericError"
    }
  ]
}
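
Catchers are checked in order and the first match wins, which is why the `States.ALL` entry goes last. The following sketch models that routing in plain Python. `route_error` is a hypothetical helper that handles only exact error names, `States.ALL`, and top-level `ResultPath` values such as `$.error`.

```python
def route_error(catchers, error_name, cause, state_input):
    """Return (next_state, next_input) for the first matching catcher, else None."""
    error_output = {"Error": error_name, "Cause": cause}
    for catcher in catchers:
        names = catcher["ErrorEquals"]
        if error_name in names or "States.ALL" in names:
            result_path = catcher.get("ResultPath")
            if result_path is None:
                # Without ResultPath, the error object replaces the state input
                return catcher["Next"], error_output
            merged = dict(state_input)
            merged[result_path[2:]] = error_output
            return catcher["Next"], merged
    return None

catchers = [
    {"ErrorEquals": ["PaymentDeclinedError"], "Next": "HandleDeclinedPayment",
     "ResultPath": "$.error"},
    {"ErrorEquals": ["States.ALL"], "Next": "HandleGenericError"},
]
order = {"orderId": "o-1"}

print(route_error(catchers, "PaymentDeclinedError", "Card declined", order))
print(route_error(catchers, "SomethingElse", "Unexpected", order))
```

The first call keeps `orderId` and adds the error under `error`; the second falls through to the generic handler, which receives only the error object.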

Built-in Error Types

  • States.ALL — catch any error
  • States.Timeout — task timed out
  • States.TaskFailed — task threw an error
  • States.Permissions — insufficient permissions

Choice State (Branching Logic)

{
  "Type": "Choice",
  "Choices": [
    {
      "Variable": "$.orderTotal",
      "NumericGreaterThan": 1000,
      "Next": "RequireManagerApproval"
    },
    {
      "Variable": "$.customerType",
      "StringEquals": "premium",
      "Next": "PremiumProcessing"
    },
    {
      "And": [
        {"Variable": "$.country", "StringEquals": "US"},
        {"Variable": "$.expressShipping", "BooleanEquals": true}
      ],
      "Next": "ExpressUSShipping"
    }
  ],
  "Default": "StandardProcessing"
}
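
The rules above are evaluated top to bottom, and the first rule that matches decides the next state. Here is a small Python model of that evaluation, covering only the operators used in this example. `matches` and `choose_next` are hypothetical helpers, and a missing field is treated as "no match" for simplicity, which is an assumption rather than the service's behavior.

```python
def matches(rule, data):
    """Evaluate one Choice rule against the state input (top-level paths only)."""
    if "And" in rule:
        return all(matches(sub, data) for sub in rule["And"])
    value = data.get(rule["Variable"][2:])
    if "NumericGreaterThan" in rule:
        return isinstance(value, (int, float)) and value > rule["NumericGreaterThan"]
    if "StringEquals" in rule:
        return value == rule["StringEquals"]
    if "BooleanEquals" in rule:
        return value is rule["BooleanEquals"]
    raise ValueError("operator not covered by this sketch")

def choose_next(choice_state, data):
    """Return the Next of the first matching rule, or the Default state."""
    for rule in choice_state["Choices"]:
        if matches(rule, data):
            return rule["Next"]
    return choice_state["Default"]

choice_state = {
    "Type": "Choice",
    "Choices": [
        {"Variable": "$.orderTotal", "NumericGreaterThan": 1000,
         "Next": "RequireManagerApproval"},
        {"Variable": "$.customerType", "StringEquals": "premium",
         "Next": "PremiumProcessing"},
        {"And": [
            {"Variable": "$.country", "StringEquals": "US"},
            {"Variable": "$.expressShipping", "BooleanEquals": True},
        ], "Next": "ExpressUSShipping"},
    ],
    "Default": "StandardProcessing",
}

# A premium customer with a large order still hits the first rule
print(choose_next(choice_state, {"orderTotal": 1500, "customerType": "premium"}))
```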

Map State (Parallel Processing of Arrays)

Process each item in an array:

{
  "Type": "Map",
  "ItemsPath": "$.orders",           // Array to iterate over
  "MaxConcurrency": 5,               // Max parallel executions
  "Iterator": {
    "StartAt": "ProcessSingleOrder",
    "States": {
      "ProcessSingleOrder": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:...:function:ProcessOrder",
        "End": true
      }
    }
  }
}

Real-World: Process 100 orders from a batch file — Map iterates over each, processes up to 5 simultaneously.
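
For intuition, the same batch can be modeled locally with a thread pool capped at five workers. This is only an analogy for `MaxConcurrency`, not how the service schedules iterations, and `process_order` is a hypothetical stand-in for the ProcessOrder Lambda function.

```python
from concurrent.futures import ThreadPoolExecutor

def process_order(order):
    """Hypothetical stand-in for the ProcessOrder Lambda function."""
    return {"orderId": order["orderId"], "status": "PROCESSED"}

def run_map(items, worker, max_concurrency):
    """Apply worker to every item, running at most max_concurrency at a time."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # pool.map returns results in input order, like the Map state's output array
        return list(pool.map(worker, items))

orders = [{"orderId": f"o-{i}"} for i in range(100)]
results = run_map(orders, process_order, max_concurrency=5)
print(len(results), results[0])
```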


Workflow Types

| Feature | Standard | Express |
|---|---|---|
| Duration | Up to 1 year | Up to 5 minutes |
| Execution model | Exactly-once | At-least-once (async) / At-most-once (sync) |
| Pricing | Per state transition | Per invocation + duration |
| Execution history | Full history in console | CloudWatch Logs |
| Use for | Long-running, human approval | High-volume, short workflows |

Standard: Order processing, data pipelines, ETL
Express: IoT data processing, streaming ETL, microservice orchestration
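
The choice between the two types can be reduced to a few checks. The sketch below is one way to encode that decision: `pick_workflow_type` is a hypothetical helper, and it assumes that anything needing exactly-once execution or a callback or `.sync` step should use Standard. The returned strings match the `type` values accepted when creating a state machine.

```python
EXPRESS_MAX_SECONDS = 5 * 60  # Express executions are limited to 5 minutes

def pick_workflow_type(max_duration_seconds, needs_exactly_once=False, needs_callback=False):
    """Suggest a workflow type from duration and execution requirements."""
    if max_duration_seconds > EXPRESS_MAX_SECONDS:
        return "STANDARD"
    if needs_exactly_once or needs_callback:
        # Assumption: callback (.waitForTaskToken) and .sync steps need Standard
        return "STANDARD"
    return "EXPRESS"

print(pick_workflow_type(30))                        # short IoT event pipeline
print(pick_workflow_type(30, needs_callback=True))   # short, but waits for approval
print(pick_workflow_type(3600))                      # hour-long ETL job
```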


Good Practices

| Practice | Reason |
|---|---|
| Use direct service integrations over Lambda | Lower cost, lower latency, less code |
| Add Retry with exponential backoff | Handle transient failures automatically |
| Use Catch for specific error types | Different error paths for different failures |
| Use .waitForTaskToken for human approval | Clean async callback pattern |
| Use Parallel for independent steps | Reduce total execution time |
| Use Express Workflow for high-volume short tasks | Usually much cheaper than Standard at high volume |

Bad Practices

| Anti-Pattern | Impact | Fix |
|---|---|---|
| Using Lambda for everything that a service integration can do | Extra cost + cold starts | Use direct service integrations (DynamoDB, SQS, SNS, etc.) |
| No retry on Task states | Transient failures cause workflow failure | Add Retry blocks |
| No timeout on waitForTaskToken | Workflow waits until the execution limit (up to 1 year for Standard) | Set TimeoutSeconds (and HeartbeatSeconds if the worker sends heartbeats) |
| Standard workflow for high-volume IoT (>1M/day) | Very expensive | Use Express workflow |

Exam Tips

  1. Step Functions solves the "Lambda can only run 15 min" problem: Standard workflows can run for up to 1 year.
  2. waitForTaskToken is the pattern for human approval and async integrations.
  3. Parallel state = branches run simultaneously. Map state = iterate over array.
  4. Choice state = if/else branching — no Lambda needed.
  5. Express vs Standard: Express = high-volume, short. Standard = long-running, full history.
  6. $$.Task.Token = the task token in waitForTaskToken pattern. $$ = context object.
  7. ResultPath = where to put the task output in the state data.

Common Exam Scenarios

Q: Order processing needs 15+ step workflow with human approval? → Step Functions Standard workflow with a waitForTaskToken Task state for approval.

Q: Process each item in a DynamoDB scan result with Lambda? → Map state with Lambda integration: iterate over the array, with up to MaxConcurrency items in parallel.

Q: Handle payment API being flaky (occasional 500 errors)? → Add Retry to the payment Task state with exponential backoff.

Q: IoT device sends 1 million events/day, each needs 3 processing steps? → Step Functions Express Workflow (cheaper than Standard for high volume).

Q: Lambda takes too long — what's the alternative for long-running orchestration? → Use Step Functions to orchestrate multiple Lambda calls instead of one long Lambda.