Fix pdb / breakpoint() hang in workflow code (temporalio#1104)

elidlocke · elidlocke · commit 67fe256895da · 2026-05-22T15:15:26.000-04:00
Closes temporalio#1104.

breakpoint() and pdb.set_trace() inside workflow code silently hang even with debug_mode=True and an unsandboxed runner. Three orthogonal issues contribute; this PR addresses all three behind the existing debug_mode flag so production behavior is unchanged.

1. Thread placement. Activations run on a ThreadPoolExecutor worker thread, so pdb's cmdloop() calls input() from a thread that doesn't own the controlling TTY. Fixed by scheduling the activation as a loop.call_soon callback and awaiting a future the callback completes. The dispatch task suspends at the await, so it's no longer mid-__step() when the workflow's internal task stepping happens. (A direct synchronous call ran afoul of Python 3.14's tightened asyncio task-entry validation: "Cannot enter into task while another task is being executed.")

2. Sandbox restriction. The sandbox flags `breakpoint` and `input` as non-deterministic builtins. With debug_mode=True the user has explicitly accepted non-determinism for the debugging session, so we relax those two specific restrictions when the runner is a SandboxedWorkflowRunner. Other sandbox checks remain intact.

3. Silent-hang failure mode. Installs a process-wide sys.breakpointhook at worker startup that raises a clear RuntimeError when breakpoint() is called from a workflow worker thread without debug_mode, replacing the silent hang.

Adds a "Debugging Workflows with breakpoint() / pdb" subsection to the README under Workflow Sandbox, including a runnable example and the caveats around workflow task timeouts.

Tests at tests/worker/test_breakpoint_hang.py cover thread placement in both modes, the sandboxed-workflow path, and the defensive hook. Verified on Python 3.13 and 3.14 locally; CI matrix green on fork.

The load-bearing observation for the dispatch fix: `await future` suspends the dispatch task such that asyncio no longer considers it "currently executing," even though it's still in a pending state. That's what lets workflow.activate(act) step the workflow's internal task without 3.14's task-entry error.
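
For reviewers who want to see the dispatch pattern from item 1 in isolation, here is a minimal standalone sketch. It is not the SDK code: `run_on_loop_thread`, `current_thread_id`, and `compare_thread_placement` are hypothetical names used only for illustration. It shows only the thread-placement half of the fix: a callable scheduled with `loop.call_soon` runs on the event loop's own thread while the awaiting coroutine is suspended on a future, whereas `run_in_executor` runs it on a pool thread. It does not attempt to reproduce the Python 3.14 task-entry error.

```python
import asyncio
import threading
from typing import Any, Callable


async def run_on_loop_thread(fn: Callable[..., Any], *args: Any) -> Any:
    """Run a blocking callable as a loop callback and await its result.

    Hypothetical helper for illustration; it mirrors the call_soon + future
    indirection described above, not the SDK's actual method.
    """
    loop = asyncio.get_running_loop()
    future: asyncio.Future = loop.create_future()

    def callback() -> None:
        # Invoked from the loop's callback queue, on the loop's own thread.
        try:
            future.set_result(fn(*args))
        except Exception as exc:
            future.set_exception(exc)

    loop.call_soon(callback)
    # Awaiting suspends this coroutine until the callback resolves the future.
    return await future


def current_thread_id() -> int:
    # Stand-in for the blocking activation call.
    return threading.get_ident()


async def compare_thread_placement() -> tuple[int, int, int]:
    loop = asyncio.get_running_loop()
    loop_thread = threading.get_ident()
    inline = await run_on_loop_thread(current_thread_id)
    pooled = await loop.run_in_executor(None, current_thread_id)
    return loop_thread, inline, pooled


if __name__ == "__main__":
    loop_thread, inline, pooled = asyncio.run(compare_thread_placement())
    print("inline ran on loop thread:", inline == loop_thread)
    print("executor ran on loop thread:", pooled == loop_thread)
```

The trade-off is the one the PR accepts for debug mode: while the callback runs, the event loop is blocked, so nothing else on the loop makes progress until the callable returns.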
diff --git a/README.md b/README.md
@@ -82,6 +82,7 @@ informal introduction to the features and their implementation.
         - [Customizing the Sandbox](#customizing-the-sandbox)
           - [Passthrough Modules](#passthrough-modules)
           - [Invalid Module Members](#invalid-module-members)
+        - [Debugging Workflows with `breakpoint()` / `pdb`](#debugging-workflows-with-breakpoint--pdb)
         - [Known Sandbox Issues](#known-sandbox-issues)
           - [Global Import/Builtins](#global-importbuiltins)
           - [Sandbox is not Secure](#sandbox-is-not-secure)
@@ -1241,6 +1242,75 @@ my_worker = Worker(..., workflow_runner=SandboxedWorkflowRunner(restrictions=my_
 
 See the API for more details on exact fields and their meaning.
 
+##### Debugging Workflows with `breakpoint()` / `pdb`
+
+Setting `debug_mode=True` on the `Worker` (or `TEMPORAL_DEBUG=1` in the environment) runs workflow activations
+on the thread that runs the worker's asyncio event loop (normally the main thread) instead of on the workflow
+thread pool. This lets `breakpoint()` and `pdb.set_trace()` inside workflow code open an interactive prompt.
+Without it, pdb hangs because its `input()` call runs on a thread that does not own the controlling TTY.
+
+A minimal runnable example:
+
+```python
+import asyncio
+from datetime import timedelta
+
+from temporalio import workflow
+from temporalio.client import Client
+from temporalio.worker import Worker
+
+
+@workflow.defn(sandboxed=False)
+class DebugMeWorkflow:
+    @workflow.run
+    async def run(self) -> str:
+        x = 42
+        breakpoint()  # interactive pdb prompt opens here
+        return f"x was {x}"
+
+
+async def main() -> None:
+    client = await Client.connect("localhost:7233")
+    async with Worker(
+        client,
+        task_queue="debug-me",
+        workflows=[DebugMeWorkflow],
+        debug_mode=True,
+    ):
+        result = await client.execute_workflow(
+            DebugMeWorkflow.run,
+            id="debug-me-wf",
+            task_queue="debug-me",
+            task_timeout=timedelta(minutes=10),  # see caveat below
+        )
+        print(result)
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+Run with `python debug_me.py` (not under pytest, which captures stdin and breaks the REPL). At the `(Pdb)`
+prompt try `p x`, `n`, `c`, `q`.
+
+Two caveats when pausing at a breakpoint inside a workflow:
+
+1. **Workflow task timeout.** Temporal expires a workflow task after ~10 seconds by default. If you sit at the
+   `(Pdb)` prompt longer than that, the server reassigns the task and your workflow replays from the start when
+   you continue — re-hitting the breakpoint. Pass `task_timeout=timedelta(minutes=N)` to `execute_workflow` /
+   `start_workflow` to give yourself debugging headroom:
+
+   ```python
+   await client.execute_workflow(MyWorkflow.run, ..., task_timeout=timedelta(minutes=10))
+   ```
+
+2. **Deterministic replay.** Workflows must be deterministic because they replay from history. Anything you change
+   at the `(Pdb)` prompt is not recorded there and can diverge on replay. For post-mortem debugging without these
+   caveats, use the [Replayer](#replayer) on a recorded history instead of live debugging.
+
+A `breakpoint()` call from workflow code without `debug_mode` enabled raises a `RuntimeError` that names
+`debug_mode=True` and `TEMPORAL_DEBUG` as the fix, so the failure mode is loud rather than a silent hang.
+
 ##### Known Sandbox Issues
 
 Below are known sandbox issues. As the sandbox is developed and matures, some may be resolved.
diff --git a/temporalio/worker/_workflow.py b/temporalio/worker/_workflow.py
@@ -48,6 +48,59 @@
 # Set to true to log all activations and completions
 LOG_PROTOS = False
 
+# Prefix used to detect threads in the workflow task ThreadPoolExecutor.
+_WORKFLOW_THREAD_NAME_PREFIX = "temporal_workflow_"
+
+_ORIGINAL_BREAKPOINTHOOK = sys.breakpointhook
+
+
+def _temporal_workflow_breakpoint_hook(*args: object, **kwargs: object) -> object:
+    if threading.current_thread().name.startswith(_WORKFLOW_THREAD_NAME_PREFIX):
+        raise RuntimeError(
+            "breakpoint() / pdb.set_trace() inside workflow code requires "
+            "debug_mode=True (or the TEMPORAL_DEBUG environment variable) on "
+            "the Worker. Without it the workflow runs on a thread pool and "
+            "pdb's interactive REPL cannot read stdin."
+        )
+    return _ORIGINAL_BREAKPOINTHOOK(*args, **kwargs)
+
+
+def _install_workflow_breakpoint_hook() -> None:
+    if sys.breakpointhook is not _temporal_workflow_breakpoint_hook:
+        sys.breakpointhook = _temporal_workflow_breakpoint_hook
+
+
+def _relax_sandbox_for_debugger(workflow_runner: WorkflowRunner) -> WorkflowRunner:
+    """Lift sandbox restrictions on `breakpoint` and `input` for debug_mode.
+
+    Both are flagged as non-deterministic by default. Users opting into
+    debug_mode have accepted non-determinism for the session, so a targeted
+    relaxation beats forcing them to swap to UnsandboxedWorkflowRunner.
+    """
+    from temporalio.worker.workflow_sandbox._runner import SandboxedWorkflowRunner
+
+    if not isinstance(workflow_runner, SandboxedWorkflowRunner):
+        return workflow_runner
+
+    restrictions = workflow_runner.restrictions
+    invalid = restrictions.invalid_module_members
+    builtins_matcher = invalid.children.get("__builtins__")
+    if builtins_matcher is None or not (
+        "breakpoint" in builtins_matcher.use or "input" in builtins_matcher.use
+    ):
+        return workflow_runner
+
+    new_use = set(builtins_matcher.use) - {"breakpoint", "input"}
+    new_builtins = dataclasses.replace(builtins_matcher, use=new_use)
+    new_invalid = dataclasses.replace(
+        invalid, children={**invalid.children, "__builtins__": new_builtins}
+    )
+    new_restrictions = dataclasses.replace(
+        restrictions, invalid_module_members=new_invalid
+    )
+    return dataclasses.replace(workflow_runner, restrictions=new_restrictions)
+
+
 # Value was chosen arbitrarily as a small number that allows some concurrency and prevents
 # large numbers of concurrent external storage operations causing resource contention.
 # This default limit is per workflow task activation and does not limit the total number
@@ -96,6 +149,13 @@ def __init__(
             )
         )
         self._workflow_task_executor_user_provided = workflow_task_executor is not None
+
+        # Debug mode (also enabled by TEMPORAL_DEBUG) disables deadlock
+        # detection, runs activations inline on the main thread, and lifts
+        # the sandbox restriction on breakpoint()/input() so pdb works.
+        self._debug_mode = bool(debug_mode or os.environ.get("TEMPORAL_DEBUG"))
+        if self._debug_mode:
+            workflow_runner = _relax_sandbox_for_debugger(workflow_runner)
         self._workflow_runner = workflow_runner
         self._unsandboxed_workflow_runner = unsandboxed_workflow_runner
         self._data_converter = data_converter
@@ -127,11 +187,9 @@ def __init__(
         )
         self._throw_after_activation: Exception | None = None
 
-        # If there's a debug mode or a truthy TEMPORAL_DEBUG env var, disable
-        # deadlock detection, otherwise set to 2 seconds
-        self._deadlock_timeout_seconds = (
-            None if debug_mode or os.environ.get("TEMPORAL_DEBUG") else 2
-        )
+        self._deadlock_timeout_seconds = None if self._debug_mode else 2
+
+        _install_workflow_breakpoint_hook()
 
         # Keep track of workflows that could not be evicted
         self._could_not_evict_count = 0
@@ -241,6 +299,34 @@ async def drain_poll_queue(self) -> None:
             except PollShutdownError:
                 return
 
+    async def _activate_inline_for_debug(
+        self,
+        loop: asyncio.AbstractEventLoop,
+        workflow: _RunningWorkflow,
+        act: temporalio.bridge.proto.workflow_activation.WorkflowActivation,
+    ) -> temporalio.bridge.proto.workflow_completion.WorkflowActivationCompletion:
+        # Indirect through call_soon + a future so the activation runs outside
+        # the dispatch task's __step() context. Python 3.14 refuses to enter a
+        # task while another on the same thread is mid-step; suspending at the
+        # await below clears that state so workflow.activate can step its own
+        # task without collision.
+        future: asyncio.Future = loop.create_future()
+
+        def run_inline() -> None:
+            # _run_once clears the running-loop registration on exit; restore
+            # the main loop so later code sees the right one.
+            main_loop = asyncio._get_running_loop()
+            try:
+                completion = workflow.activate(act)
+                future.set_result(completion)
+            except BaseException as e:
+                future.set_exception(e)
+            finally:
+                asyncio._set_running_loop(main_loop)
+
+        loop.call_soon(run_inline)
+        return await future
+
     async def _handle_activation(
         self, act: temporalio.bridge.proto.workflow_activation.WorkflowActivation
     ) -> None:
@@ -330,35 +416,43 @@ async def _handle_activation(
                 )
                 self._running_workflows[act.run_id] = workflow
 
-            # Run activation in separate thread so we can check if it's
-            # deadlocked
-            activate_task = asyncio.get_running_loop().run_in_executor(
-                self._workflow_task_executor,
-                workflow.activate,
-                act,
-            )
-
-            # Run activation task with deadlock timeout
-            try:
-                completion = await asyncio.wait_for(
-                    activate_task, self._deadlock_timeout_seconds
+            if self._debug_mode:
+                # Inline on the main thread so pdb / breakpoint() can read
+                # stdin. The event loop blocks for the whole activation,
+                # which is the intended single-stepping behavior.
+                completion = await self._activate_inline_for_debug(
+                    asyncio.get_running_loop(), workflow, act
                 )
-            except asyncio.TimeoutError:
-                # Need to create the deadlock exception up here so it
-                # captures the trace now instead of later after we may have
-                # interrupted it
-                deadlock_exc = _DeadlockError.from_deadlocked_workflow(
-                    workflow.instance, self._deadlock_timeout_seconds
+            else:
+                # Run activation in separate thread so we can check if it's
+                # deadlocked
+                activate_task = asyncio.get_running_loop().run_in_executor(
+                    self._workflow_task_executor,
+                    workflow.activate,
+                    act,
                 )
-                # When we deadlock, we will raise an exception to fail
-                # the task. But before we do that, we want to try to
-                # interrupt the thread and put this activation task on
-                # the workflow so that the successive eviction can wait
-                # on it before trying to evict.
-                workflow.attempt_deadlock_interruption()
-                # Set the task and raise
-                workflow.deadlocked_activation_task = activate_task
-                raise deadlock_exc from None
+
+                # Run activation task with deadlock timeout
+                try:
+                    completion = await asyncio.wait_for(
+                        activate_task, self._deadlock_timeout_seconds
+                    )
+                except asyncio.TimeoutError:
+                    # Need to create the deadlock exception up here so it
+                    # captures the trace now instead of later after we may have
+                    # interrupted it
+                    deadlock_exc = _DeadlockError.from_deadlocked_workflow(
+                        workflow.instance, self._deadlock_timeout_seconds
+                    )
+                    # When we deadlock, we will raise an exception to fail
+                    # the task. But before we do that, we want to try to
+                    # interrupt the thread and put this activation task on
+                    # the workflow so that the successive eviction can wait
+                    # on it before trying to evict.
+                    workflow.attempt_deadlock_interruption()
+                    # Set the task and raise
+                    workflow.deadlocked_activation_task = activate_task
+                    raise deadlock_exc from None
 
         except Exception as err:
             if isinstance(err, _DeadlockError):
@@ -576,22 +670,27 @@ async def _handle_cache_eviction(
             handle_eviction_task: asyncio.Future | None = None
             while True:
                 try:
-                    # We only create the eviction task if we haven't already or
-                    # it is done. This is because if it already is running and
-                    # timed out, it's still running (and holding on to a
-                    # thread). But if did complete running but failed with
-                    # another error, we want to re-create the task.
-                    if not handle_eviction_task or handle_eviction_task.done():
-                        handle_eviction_task = (
-                            asyncio.get_running_loop().run_in_executor(
-                                self._workflow_task_executor,
-                                workflow.activate,
-                                act,
+                    if self._debug_mode:
+                        await self._activate_inline_for_debug(
+                            asyncio.get_running_loop(), workflow, act
+                        )
+                    else:
+                        # We only create the eviction task if we haven't already or
+                        # it is done. This is because if it already is running and
+                        # timed out, it's still running (and holding on to a
+                        # thread). But if it did complete running but failed with
+                        # another error, we want to re-create the task.
+                        if not handle_eviction_task or handle_eviction_task.done():
+                            handle_eviction_task = (
+                                asyncio.get_running_loop().run_in_executor(
+                                    self._workflow_task_executor,
+                                    workflow.activate,
+                                    act,
+                                )
                             )
+                        await asyncio.wait_for(
+                            handle_eviction_task, self._deadlock_timeout_seconds
                         )
-                    await asyncio.wait_for(
-                        handle_eviction_task, self._deadlock_timeout_seconds
-                    )
                     # Break if it succeeds
                     break
                 except BaseException as err:
diff --git a/tests/worker/test_breakpoint_hang.py b/tests/worker/test_breakpoint_hang.py