|
| 1 | +# Lease Extension (Automatic Heartbeat) |
| 2 | + |
| 3 | +When a worker picks up a task, the Conductor server starts a `responseTimeoutSeconds` timer. If the worker doesn't send an update before the timer expires, the server marks the task as timed out and re-queues it for retry. |
| 4 | + |
| 5 | +For long-running tasks (agent tool calls, LLM inference, data processing, batch jobs), the worker may still be actively executing when the timer expires, so the server wrongly treats it as dead. **Lease extension** solves this by automatically sending heartbeats that reset the timeout timer. |
| 6 | + |
| 7 | +## How It Works |
| 8 | + |
| 9 | +When `lease_extend_enabled=True`: |
| 10 | + |
| 11 | +1. Worker picks up a task with `responseTimeoutSeconds > 0` |
| 12 | +2. SDK starts tracking the task for heartbeats |
| 13 | +3. At **80% of `responseTimeoutSeconds`**, SDK sends a heartbeat (`TaskResult.extend_lease=True`) |
| 14 | +4. Server resets the task's `updateTime` to now, giving a fresh `responseTimeoutSeconds` window |
| 15 | +5. Heartbeats continue until the task completes, fails, or the worker shuts down |
| 16 | + |
| 17 | +``` |
| 18 | +Timeline (responseTimeoutSeconds=120s): |
| 19 | +  0s          96s         192s        288s |
| 20 | +  |-----------|-----------|-----------|--→ task completes |
| 21 | +  poll        heartbeat   heartbeat   heartbeat |
| 22 | +              (80%)       (80%)       (80%) |
| 23 | +``` |
| 24 | + |
| 25 | +The heartbeat fires at 80% of `responseTimeoutSeconds` (matching the Java SDK). This leaves a 20% safety margin (24 seconds in the timeline above): if a heartbeat is slightly delayed, the task still has time before the server times it out. |
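The timeline arithmetic can be reproduced in a few lines of Python. This is an illustrative sketch, not SDK code: `heartbeat_times` is a hypothetical helper, and it assumes every heartbeat lands exactly on schedule and that the server resets the timer immediately.

```python
def heartbeat_times(response_timeout_seconds: float, count: int,
                    factor: float = 0.8) -> list[float]:
    """Return the offsets (seconds since poll) of the first `count` heartbeats.

    Hypothetical helper for illustration only. Each heartbeat resets the
    server-side timer, so the next one is due one full interval later.
    """
    interval = response_timeout_seconds * factor
    return [round(interval * n, 3) for n in range(1, count + 1)]


print(heartbeat_times(120, 3))  # [96.0, 192.0, 288.0]
```

With `responseTimeoutSeconds=120`, each window leaves `120 - 96 = 24` seconds of slack before the server would time the task out.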
| 26 | + |
| 27 | +## Quick Start |
| 28 | + |
| 29 | +```python |
| 30 | +from conductor.client.worker.worker_task import worker_task |
| 31 | + |
| 32 | +@worker_task( |
| 33 | +    task_definition_name='long_running_analysis', |
| 34 | +    lease_extend_enabled=True,  # Enable automatic heartbeat |
| 35 | +) |
| 36 | +def analyze_dataset(dataset_id: str) -> dict: |
| 37 | +    """This task takes 5 minutes but responseTimeoutSeconds is 60s. |
| 38 | +    Heartbeats keep it alive automatically.""" |
| 39 | +    # run_expensive_analysis is a placeholder for your own long-running function |
| 40 | +    results = run_expensive_analysis(dataset_id) |
| 41 | +    return {'results': results} |
| 41 | +``` |
| 42 | + |
| 43 | +That's it. The SDK handles heartbeats automatically in the background. |
| 44 | + |
| 45 | +## Enabling Lease Extension |
| 46 | + |
| 47 | +Lease extension is **disabled by default** (matching the Java SDK). Enable it per-worker or globally: |
| 48 | + |
| 49 | +### Per-Worker (Decorator) |
| 50 | + |
| 51 | +```python |
| 52 | +@worker_task( |
| 53 | +    task_definition_name='my_task', |
| 54 | +    lease_extend_enabled=True, |
| 55 | +) |
| 56 | +def my_task(data: str) -> dict: |
| 57 | +    ... |
| 58 | +``` |
| 59 | + |
| 60 | +### Per-Worker (Class) |
| 61 | + |
| 62 | +```python |
| 63 | +from conductor.client.worker.worker import Worker |
| 64 | + |
| 65 | +worker = Worker( |
| 66 | +    task_definition_name='my_task', |
| 67 | +    execute_function=my_function, |
| 68 | +    lease_extend_enabled=True, |
| 69 | +) |
| 70 | +``` |
| 71 | + |
| 72 | +### Per-Worker (Environment Variable) |
| 73 | + |
| 74 | +```shell |
| 75 | +export conductor_worker_my_task_lease_extend_enabled=true |
| 76 | +``` |
| 77 | + |
| 78 | +### Global (All Workers) |
| 79 | + |
| 80 | +```shell |
| 81 | +export conductor_worker_all_lease_extend_enabled=true |
| 82 | +``` |
| 83 | + |
| 84 | +### Precedence |
| 85 | + |
| 86 | +Environment variables override decorator/constructor arguments. The setting is resolved in this order, highest precedence first: |
| 87 | + |
| 88 | +1. Task-specific env var (`conductor_worker_<task>_lease_extend_enabled`) |
| 89 | +2. Global env var (`conductor_worker_all_lease_extend_enabled`) |
| 90 | +3. Worker constructor / decorator argument |
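The precedence order above can be sketched as a small resolver. This is not the SDK's implementation: `resolve_lease_extend` is a hypothetical function, and treating only the string `true` (case-insensitive) as enabled is an assumption about how the values are parsed.

```python
import os
from typing import Mapping, Optional


def resolve_lease_extend(task_name: str, code_value: bool,
                         env: Optional[Mapping[str, str]] = None) -> bool:
    """Resolve the effective setting: task env var > global env var > code."""
    env = os.environ if env is None else env
    candidates = (
        f"conductor_worker_{task_name}_lease_extend_enabled",  # 1. task-specific
        "conductor_worker_all_lease_extend_enabled",           # 2. global
    )
    for key in candidates:
        raw = env.get(key)
        if raw is not None:
            # Assumption: only "true" (any case) enables the feature.
            return raw.strip().lower() == "true"
    return code_value  # 3. decorator / constructor argument


# A task-specific "false" wins over a global "true".
env = {
    "conductor_worker_my_task_lease_extend_enabled": "false",
    "conductor_worker_all_lease_extend_enabled": "true",
}
print(resolve_lease_extend("my_task", True, env))  # False
```

One consequence of this ordering is that an operator can switch heartbeats on or off for a single task at deploy time without touching worker code.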
| 91 | + |
| 92 | +## When to Use |
| 93 | + |
| 94 | +**Enable lease extension when:** |
| 95 | +- Task execution time may exceed `responseTimeoutSeconds` |
| 96 | +- Tasks involve external calls with unpredictable latency (LLM APIs, data pipelines) |
| 97 | +- You want the worker to hold the task continuously (not yield and re-poll) |
| 98 | + |
| 99 | +**You don't need lease extension when:** |
| 100 | +- Tasks always complete within `responseTimeoutSeconds` |
| 101 | +- You're using `TaskInProgress` with `callbackAfterSeconds` (the task is yielded back to the queue) |
| 102 | +- `responseTimeoutSeconds` is 0 (no timeout configured) |
| 103 | + |
| 104 | +## Lease Extension vs TaskInProgress |
| 105 | + |
| 106 | +These are two different strategies for long-running tasks: |
| 107 | + |
| 108 | +| | Lease Extension | TaskInProgress | |
| 109 | +|---|---|---| |
| 110 | +| **How it works** | Worker holds the task, heartbeats keep it alive | Worker yields the task, re-polls later | |
| 111 | +| **Task state** | IN_PROGRESS the whole time | Returned to queue between polls | |
| 112 | +| **When to use** | Continuous execution (LLM calls, streaming) | Incremental processing (batch chunks, polling external status) | |
| 113 | +| **Enable with** | `lease_extend_enabled=True` | Return `TaskInProgress(callback_after_seconds=N)` | |
| 114 | +| **Worker memory** | Task stays in worker memory | Task is released, re-polled with fresh context | |
| 115 | + |
| 116 | +You can combine the two: enable `lease_extend_enabled` for safety while also using `TaskInProgress` for incremental polling. |
| 117 | + |
| 118 | +## Important Constraints |
| 119 | + |
| 120 | +- **`responseTimeoutSeconds`** is the maximum time the server allows between worker updates. This is the timer that heartbeats reset. |
| 121 | +- **`timeoutSeconds`** is the overall SLA wall-clock ceiling. **Cannot be extended by heartbeat.** Once exceeded, the task is TIMED_OUT regardless of heartbeats. |
| 122 | +- Heartbeats only fire when `responseTimeoutSeconds > 0` and `lease_extend_enabled = True`. |
| 123 | +- If the heartbeat interval would be less than 1 second (i.e., `responseTimeoutSeconds < 1.25`), heartbeats are skipped. |
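The conditions in the last two bullets can be expressed as a single check. The function below is a hypothetical sketch of that rule, not the SDK's actual code; it returns the heartbeat interval in seconds, or `None` when no heartbeats would be sent.

```python
from typing import Optional


def heartbeat_interval(response_timeout_seconds: float,
                       lease_extend_enabled: bool) -> Optional[float]:
    """Return the heartbeat interval, or None if heartbeats are skipped."""
    if not lease_extend_enabled or response_timeout_seconds <= 0:
        return None  # feature disabled, or no response timeout configured
    interval = response_timeout_seconds * 0.8
    if interval < 1:
        return None  # sub-second interval: heartbeats are skipped
    return interval


print(heartbeat_interval(120, True))   # 96.0
print(heartbeat_interval(1.0, True))   # None
print(heartbeat_interval(120, False))  # None
```

Note that nothing here involves `timeoutSeconds`: heartbeats only reset the response timer, so the overall ceiling still applies.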
| 124 | + |
| 125 | +## Retry on Failure |
| 126 | + |
| 127 | +If a heartbeat API call fails, the SDK retries up to 3 times with backoff (`1s`, `1.5s`, `2s`). If all retries fail, the error is logged and the SDK tries again on the next poll loop iteration. If the network is truly partitioned, the server will eventually time out the task — this is correct behavior. |
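The retry behavior can be sketched as follows. This is an illustration under stated assumptions, not the SDK's code: `send_with_retry` is a hypothetical name, and it assumes one initial attempt followed by up to three retries, sleeping `1s`, `1.5s`, and `2s` before each retry. The `sleep` parameter is injectable so the demo runs without waiting.

```python
import time
from typing import Callable, Sequence


def send_with_retry(send: Callable[[], None],
                    delays: Sequence[float] = (1.0, 1.5, 2.0),
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `send`; on failure, retry once per entry in `delays`.

    Returns True on success, False once every retry has failed. The caller
    is expected to log the failure and try again on its next loop iteration.
    """
    for attempt in range(len(delays) + 1):
        try:
            send()
            return True
        except Exception:
            if attempt == len(delays):
                return False  # retries exhausted
            sleep(delays[attempt])  # back off before the next attempt
    return False


# Demo: a heartbeat that fails twice, then succeeds. Sleeps are recorded
# instead of executed.
attempts = {"n": 0}

def flaky_heartbeat() -> None:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("simulated network failure")

recorded = []
print(send_with_retry(flaky_heartbeat, sleep=recorded.append), recorded)
# True [1.0, 1.5]
```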
| 128 | + |
| 129 | +## Example |
| 130 | + |
| 131 | +See [examples/lease_extension_example.py](examples/lease_extension_example.py) for a complete runnable example that: |
| 132 | +- Defines a long-running worker with `lease_extend_enabled=True` |
| 133 | +- Creates a workflow with a short `responseTimeoutSeconds` |
| 134 | +- Runs the workflow and verifies that the task completes despite sleeping longer than the timeout |