Merge pull request #408 from conductor-oss/docs/execute-running-behavior-and-debug

nthmost-orkes · web-flow · commit 0d2a62df3b30 · 2026-04-28T14:12:59.000-07:00
docs: explain execute() RUNNING-after-timeout behavior and how to debug stuck workflows
diff --git a/README.md b/README.md
@@ -490,6 +490,37 @@ Yes. Conductor ensures workflows complete reliably even in the face of infrastru
 
 No. While Conductor excels at asynchronous orchestration, it also supports synchronous workflow execution when immediate results are required.
 
+**Why did `execute()` return `status: RUNNING` with no output?**
+
+`execute()` blocks until the workflow finishes **or** `wait_for_seconds` elapses (default: 10 s),
+whichever comes first. If the wait expires first, you get `status='RUNNING'` and empty output.
+That is expected behavior, not a bug: the workflow is still running on the server.
+
+The most common cause: your worker raised an exception. Conductor marks the task FAILED and
+schedules a retry after `retryDelaySeconds` (default: **60 s**). The default 10 s wait expires
+while the retry is pending, so `execute()` returns before the workflow completes.
+
+**To fix**: increase `wait_for_seconds` so it outlasts the retry delay plus the task's run time:
+
+```python
+# default retryDelaySeconds is 60 — wait long enough to cover one retry
+run = executor.execute(name='my_workflow', version=1, workflow_input={...}, wait_for_seconds=70)
+```
+
+**To debug** when a workflow is stuck:
+
+```python
+# Inspect task statuses and failure reasons
+wf = executor.get_workflow(run.workflow_id, include_tasks=True)
+for task in wf.tasks:
+    if task.status in ('FAILED', 'FAILED_WITH_TERMINAL_ERROR'):
+        print(task.reference_task_name, task.reason_for_incompletion)
+```
+
+You can also open the Conductor UI at `<server>/execution/<workflow_id>`, which shows each task's
+status, retry count, and worker exception message. In addition, the SDK logs worker tracebacks
+at ERROR level in the `TaskHandler` process.
+
 **Do I need to use a Conductor-specific framework?**
 
 No. Conductor is language and framework agnostic. Use your preferred language and framework — the [SDKs](https://github.com/conductor-oss/conductor#conductor-sdks) provide native integration for Python, Java, JavaScript, Go, C#, and more.
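
A note on the `wait_for_seconds=70` suggestion in the `execute()` FAQ entry added by this hunk: the value follows from simple arithmetic over the retry delay and the time each attempt takes. The sketch below shows that calculation under the assumption of a fixed delay between attempts and a roughly constant attempt duration. `min_wait_seconds` and its parameters are hypothetical names for illustration, not part of the Conductor SDK.

```python
import math


def min_wait_seconds(retry_delay_s: float, attempt_s: float, retries: int = 1,
                     margin_s: float = 5.0) -> int:
    """Smallest whole-second wait that covers `retries` retries.

    Assumes every attempt takes about `attempt_s` seconds and that a fixed
    `retry_delay_s` passes between a failure and the next attempt.
    """
    if retries < 0:
        raise ValueError("retries must be >= 0")
    attempts = retries + 1  # the first attempt plus each retry
    total = attempts * attempt_s + retries * retry_delay_s + margin_s
    return math.ceil(total)


# One retry at the default 60 s delay, with attempts that take about 2 s each
print(min_wait_seconds(retry_delay_s=60, attempt_s=2))
```

The result could be passed as `wait_for_seconds` to `execute()`. If the task definition uses a non-default `retryDelaySeconds` or allows several retries, the inputs need to change accordingly.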
diff --git a/docs/WORKFLOW.md b/docs/WORKFLOW.md
@@ -61,6 +61,55 @@ workflow_id = workflow_client.execute_workflow(
 )
 ```
 
+> **`wait_for_seconds` and task retries**
+>
+> `execute()` / `execute_workflow()` block for at most `wait_for_seconds` (default: **10 s**).
+> If the workflow is still running when the timer fires, the call returns with
+> `status='RUNNING'` and empty output — this is expected behavior, not an error.
+>
+> The most common trigger is a worker exception. Conductor marks the task FAILED and waits
+> `retryDelaySeconds` (default: **60 s**) before retrying. The default 10 s wait expires
+> during that delay, so you see `RUNNING`. Set `wait_for_seconds` to a value larger than
+> `retryDelaySeconds` plus the task's run time so the call waits through at least one retry:
+>
+> ```python
+> run = executor.execute(
+>     name='my_workflow', version=1, workflow_input={...},
+>     wait_for_seconds=70  # covers one retry at the default 60 s delay
+> )
+> ```
+
+#### Debugging a stuck workflow
+
+When a workflow returns `RUNNING` or never completes, use these steps to find out why.
+
+**1. Check the Conductor UI**
+
+Open `<server>/execution/<workflow_id>`. The timeline view shows each task's status, retry
+count, and the worker exception message — usually the fastest way to diagnose a failure.
+
+**2. Inspect task statuses programmatically**
+
+`get_workflow` with `include_tasks=True` returns the full task list. Check failed tasks for
+their `reason_for_incompletion`:
+
+```python
+wf = executor.get_workflow(workflow_id, include_tasks=True)
+for task in wf.tasks:
+    print(task.reference_task_name, task.status, task.reason_for_incompletion)
+```
+
+**3. Read the worker logs**
+
+When a worker function raises an exception, the SDK catches it, logs the traceback at ERROR
+level, and reports the task as FAILED. Worker logs come from the `TaskHandler` process — check
+the terminal output or your process manager's log stream.
+
+**Note on `WorkflowRun.reason_for_incompletion`**
+
+`WorkflowRun.reason_for_incompletion` is deprecated. Use `get_workflow(id, include_tasks=True)`
+and read `task.reason_for_incompletion` on the specific failed task instead (see step 2 above).
+
 ### Fetch a workflow execution
 
 #### Exclude tasks
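
A note on step 2 of the debugging section added by this hunk: when a task has been retried, the task list contains one entry per attempt, so it can help to collapse it into one reason per task reference name. The sketch below does that using only the attribute names shown in the docs (`tasks`, `status`, `reference_task_name`, `reason_for_incompletion`). `failed_task_reasons` is a hypothetical helper, the `SimpleNamespace` objects are stand-ins for the SDK's workflow and task models, and the sketch assumes tasks are listed in execution order.

```python
from types import SimpleNamespace

FAILED_STATUSES = {"FAILED", "FAILED_WITH_TERMINAL_ERROR"}


def failed_task_reasons(workflow) -> dict:
    """Map each failed task's reference name to its failure reason.

    When a task failed more than once, the reason from the last failed
    entry in the list replaces earlier ones.
    """
    reasons = {}
    for task in workflow.tasks or []:
        if task.status in FAILED_STATUSES:
            reasons[task.reference_task_name] = task.reason_for_incompletion
    return reasons


# Stand-in for the object returned by get_workflow(..., include_tasks=True)
wf = SimpleNamespace(tasks=[
    SimpleNamespace(reference_task_name="fetch_ref", status="COMPLETED",
                    reason_for_incompletion=None),
    SimpleNamespace(reference_task_name="parse_ref", status="FAILED",
                    reason_for_incompletion="ValueError: bad payload"),
    SimpleNamespace(reference_task_name="parse_ref", status="SCHEDULED",
                    reason_for_incompletion=None),
])
print(failed_task_reasons(wf))
```

With a real executor, the `wf` stand-in would be replaced by the result of `executor.get_workflow(workflow_id, include_tasks=True)`.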
diff --git a/src/conductor/client/workflow/executor/workflow_executor.py b/src/conductor/client/workflow/executor/workflow_executor.py
@@ -91,8 +91,21 @@ def execute_workflow_with_return_strategy(self, request: StartWorkflowRequest, w
     def execute(self, name: str, version: Optional[int] = None, workflow_input: Any = None,
                 wait_until_task_ref: Optional[str] = None, wait_for_seconds: int = 10,
                 request_id: Optional[str] = None, correlation_id: Optional[str] = None, domain: Optional[str] = None) -> WorkflowRun:
-        """Executes a workflow with StartWorkflowRequest and waits for the completion of the workflow or until a
-        specific task in the workflow """
+        """Execute a workflow synchronously and wait for it to complete.
+
+        Returns when the workflow reaches a terminal state, when the task named by
+        ``wait_until_task_ref`` is reached, or when ``wait_for_seconds`` elapses. If the wait
+        expires first, the returned ``WorkflowRun`` has ``status='RUNNING'`` and empty output;
+        this is not an error.
+
+        **Getting RUNNING with no output after a worker exception?** The default
+        ``wait_for_seconds=10`` is shorter than the default task ``retryDelaySeconds=60``.
+        A failing worker triggers a 60 s retry wait, so the 10 s timeout fires while the
+        retry is pending. Raise ``wait_for_seconds`` (e.g. 70) or inspect failed tasks::
+
+            wf = executor.get_workflow(run.workflow_id, include_tasks=True)
+            for task in wf.tasks:
+                if task.status in ('FAILED', 'FAILED_WITH_TERMINAL_ERROR'):
+                    print(task.reference_task_name, task.reason_for_incompletion)
+        """
         workflow_input = workflow_input or {}
         if request_id is None:
             request_id = str(uuid.uuid4())
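
A note on the docstring above: because `execute()` can return while the workflow is still `RUNNING`, a caller that needs the final result may want to keep polling afterwards. The sketch below shows one way to do that with a deadline. `wait_until_terminal` and `get_status` are hypothetical names, the set of terminal statuses is an assumption of this sketch, and the status source is faked so the example runs without a Conductor server.

```python
import time
from typing import Callable

# Workflow statuses treated as terminal in this sketch (an assumption)
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "TIMED_OUT", "TERMINATED"}


def wait_until_terminal(get_status: Callable[[], str], timeout_s: float,
                        interval_s: float = 1.0) -> str:
    """Poll `get_status` until it reports a terminal status or `timeout_s` passes.

    Returns the last status seen, which is still non-terminal on timeout.
    """
    deadline = time.monotonic() + timeout_s
    status = get_status()
    while status not in TERMINAL_STATUSES and time.monotonic() < deadline:
        time.sleep(interval_s)
        status = get_status()
    return status


# Fake status source; with a real executor this might instead read the status
# from executor.get_workflow(run.workflow_id, include_tasks=False) (assumed).
_statuses = iter(["RUNNING", "RUNNING", "COMPLETED"])
print(wait_until_terminal(lambda: next(_statuses), timeout_s=5, interval_s=0.01))
```

Polling with a deadline keeps the caller from blocking indefinitely when a task keeps failing and retrying; if the status is still `RUNNING` at the deadline, inspecting the failed tasks is the next step.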