[BUG] cancel_node in BeforeNodeCallEvent raises RuntimeError that kills the entire graph on resume #2240

@yananym

Description

Checks

  • I have updated to the latest minor and patch version of Strands
  • I have checked the documentation and this is not expected behavior
  • I have searched ./issues and there are no duplicates of my issue

Strands Version

1.32.0

Python Version

3.13

Operating System

15.6.1

Installation Method

pip

Steps to Reproduce

  1. Create a linear graph with 3 nodes: step_a → step_b → step_c
  2. step_a is an INPUT agent that calls interrupt() to pause for user input
  3. step_b has a BeforeNodeCallEvent hook that sets cancel_node = True based on runtime state
  4. step_c is a normal agent that should execute after step_b is skipped
  5. Add a FileSessionManager or S3SessionManager for persistence
  6. Turn 1: Call graph("task"). step_a interrupts and the graph pauses. This works fine.
  7. Turn 2: Resume with graph(responses, invocation_state={"extracted": {"skip_step_b": True}}). step_a completes, the graph reaches step_b, and the hook sets cancel_node = True.
  8. Result: RuntimeError("node cancelled by user") is raised at graph.py:~896, propagates through _execute_nodes_parallel, and kills the entire graph. step_c never executes. The graph status becomes FAILED instead of continuing.

The failure is only destructive on resume (Turn 2). On a fresh start without interrupts, cancel_node also raises, but the graph has not persisted any state yet, so there is nothing to corrupt. On resume, the crash leaves the workflow in a FAILED state with no recovery path.

Expected Behavior

When BeforeNodeCallEvent.cancel_node = True is set:

  1. The node should be treated as successfully completed (or a new SKIPPED status) for dependency resolution purposes
  2. Downstream nodes (step_c) should execute normally — the cancelled node should not block the graph
  3. The graph should continue to completion or the next interrupt point
  4. execution_order should either omit the skipped node or include it with a distinguishable status
  5. No exception should propagate — cancel_node is an intentional control flow decision, not an error

Actual Behavior

Setting cancel_node = True raises RuntimeError that terminates the entire graph:

# graph.py, _execute_node(), line ~896
if before_event.cancel_node:
    cancel_message = (
        before_event.cancel_node if isinstance(before_event.cancel_node, str) 
        else "node cancelled by user"
    )
    yield MultiAgentNodeCancelEvent(node.node_id, cancel_message)
    raise RuntimeError(cancel_message)  # ← kills the graph

The RuntimeError propagates:

  • _execute_node → _stream_node_to_queue (line ~790) → _execute_nodes_parallel (line ~752) → raise event
  • The graph catches this as an unrecoverable failure
  • record.status becomes FAILED
  • All downstream nodes are abandoned
  • The workflow cannot be resumed — the next user message starts a brand new workflow, losing all accumulated state
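
The propagation path above can be illustrated with a small, self-contained asyncio model. This is a toy sketch, not Strands code: run_node, stream_to_queue, and run_linear are hypothetical stand-ins for the node executor, the queue forwarder, and the graph loop, and the events are plain tuples. It only shows why a raise inside the node generator stops every downstream node.

```python
import asyncio


async def run_node(name, cancelled):
    """Toy node executor (hypothetical): yields events, raises on cancellation."""
    if name in cancelled:
        yield ("cancel", name)
        # Mirrors the raise described in this issue.
        raise RuntimeError("node cancelled by user")
    yield ("complete", name)


async def stream_to_queue(name, cancelled, queue):
    """Forward a node's events to a queue; exceptions are forwarded as items."""
    try:
        async for event in run_node(name, cancelled):
            await queue.put(event)
    except Exception as exc:
        await queue.put(exc)
    finally:
        await queue.put(None)  # sentinel: this node has finished


async def run_linear(nodes, cancelled, events):
    """Run nodes in order and re-raise the first exception taken off the queue."""
    for name in nodes:
        queue = asyncio.Queue()
        task = asyncio.create_task(stream_to_queue(name, cancelled, queue))
        while (item := await queue.get()) is not None:
            if isinstance(item, Exception):
                await task  # let the forwarder finish before re-raising
                raise item
            events.append(item)
        await task


def demo():
    """Return the final status and the events seen before the graph stopped."""
    events = []
    try:
        asyncio.run(run_linear(["step_a", "step_b", "step_c"], {"step_b"}, events))
    except RuntimeError as exc:
        return f"FAILED: {exc}", events
    return "COMPLETED", events
```

In this model, demo() ends with a failure status and step_c never produces an event, which matches the behavior reported here. Replacing the raise in run_node with a plain return lets the loop move on to step_c.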

Additional Context

  • The cancel_node feature was introduced to support the BeforeNodeCallEvent hook, but its current implementation treats cancellation as a fatal error rather than a control flow mechanism.
  • This behavior is consistent across versions 1.32.0 through 1.38.0.
  • The related feature request #1346 ([FEATURE] Pass invocation_state to edge condition call) would provide an alternative path for conditional routing, but cancel_node should still work as a valid skip mechanism since it's exposed as a public API on the event object.
  • Our production workaround wraps skippable nodes in a no-op AgentBase implementation that checks the condition at call time and returns an empty AgentResult. This avoids cancel_node entirely but adds complexity and prevents proper skip tracking in execution_order.
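
The shape of that workaround, roughly, is a wrapper that decides at call time whether to delegate. The sketch below is a simplified illustration only: FakeResult and SkippableNode are hypothetical names, and the wrapper is a plain callable rather than a real AgentBase subclass, so the actual interface in production differs.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class FakeResult:
    """Hypothetical stand-in for an agent result; not the Strands AgentResult."""
    text: str = ""


class SkippableNode:
    """Wrap an agent-like callable and return an empty result when skipped."""

    def __init__(
        self,
        inner: Callable[[str], FakeResult],
        should_skip: Callable[[dict], bool],
    ) -> None:
        self.inner = inner
        self.should_skip = should_skip

    def __call__(
        self, task: str, invocation_state: Optional[dict[str, Any]] = None
    ) -> FakeResult:
        state = invocation_state or {}
        if self.should_skip(state):
            # No-op path: the graph sees an ordinary (empty) completion.
            return FakeResult()
        return self.inner(task)


# Usage: skip step_b when the runtime state asks for it.
step_b = SkippableNode(
    inner=lambda task: FakeResult(text=f"step_b handled {task!r}"),
    should_skip=lambda state: state.get("extracted", {}).get("skip_step_b", False),
)
```

Because the skipped call returns a normal-looking result, nothing downstream can tell it apart from a real execution, which is the tracking limitation mentioned above.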

Possible Solution

Replace the RuntimeError in _execute_node() with graceful completion. In graph.py line ~896:

Current:

if before_event.cancel_node:
    cancel_message = (
        before_event.cancel_node if isinstance(before_event.cancel_node, str) 
        else "node cancelled by user"
    )
    yield MultiAgentNodeCancelEvent(node.node_id, cancel_message)
    raise RuntimeError(cancel_message)

Proposed:

if before_event.cancel_node:
    cancel_message = (
        before_event.cancel_node if isinstance(before_event.cancel_node, str) 
        else "node cancelled by user"
    )
    logger.debug("reason=<%s> | skipping node execution", cancel_message)
    yield MultiAgentNodeCancelEvent(node.node_id, cancel_message)
    
    # Mark as completed so downstream nodes can proceed
    node.execution_status = Status.COMPLETED
    
    # Yield a minimal result so the graph can continue
    yield MultiAgentNodeCompleteEvent(
        node_id=node.node_id,
        result=AgentResult(
            stop_reason="end_turn",
            message={"role": "assistant", "content": [{"text": cancel_message}]},
            metrics=EventLoopMetrics(),
            state={},
        ),
    )
    return  # Exit cleanly instead of raising

This ensures:

  • The cancelled node is treated as completed for dependency resolution
  • Downstream nodes execute normally
  • execution_order includes the node (consumers can check MultiAgentNodeCancelEvent to distinguish skipped from executed)
  • No RuntimeError propagation — the graph continues

An alternative would be adding a Status.SKIPPED enum value that the graph treats identically to COMPLETED for edge traversal but is distinguishable in execution_order for observability.
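
A minimal sketch of that alternative, using a toy enum rather than the real Strands Status: the SKIPPED member, the SATISFIED set, and the ready_nodes helper are all assumptions made for illustration. The idea is that dependency resolution accepts either status, while consumers can still tell them apart.

```python
from enum import Enum


class Status(Enum):
    """Toy status enum; SKIPPED is the hypothetical addition."""
    PENDING = "pending"
    COMPLETED = "completed"
    FAILED = "failed"
    SKIPPED = "skipped"


# Statuses that satisfy a downstream dependency during edge traversal.
SATISFIED = frozenset({Status.COMPLETED, Status.SKIPPED})


def ready_nodes(deps, status):
    """Return pending nodes whose dependencies are all COMPLETED or SKIPPED."""
    return [
        node
        for node, parents in deps.items()
        if status[node] is Status.PENDING
        and all(status[parent] in SATISFIED for parent in parents)
    ]


# Linear graph from this report: step_a -> step_b -> step_c.
deps = {"step_a": [], "step_b": ["step_a"], "step_c": ["step_b"]}
status = {
    "step_a": Status.COMPLETED,
    "step_b": Status.SKIPPED,
    "step_c": Status.PENDING,
}
```

With step_b marked SKIPPED, ready_nodes(deps, status) selects step_c. If step_b were FAILED instead, nothing would be ready, so genuine failures still block downstream nodes.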

Related Issues

No response
