Commit ac611c5
fix(deriver): break worker-pool deadlock on hung LLM calls
Two compounding bugs caused the deriver worker pool to wedge after a
single CF-Gateway-streamed Gemini response failed to terminate:
1. process_work_unit holds `async with self.semaphore` across the
inner LLM call (process_representation_batch / process_item). With
no asyncio-level timeout, a hung HTTP read held the slot forever.
Eight workers x one hung call each = pool fully locked.
2. polling_loop gated cleanup_stale_work_units behind
`if self.semaphore.locked(): continue`, so once the pool was full
the cleanup of stale active_queue_sessions (AQS) rows never ran and
STALE_SESSION_TIMEOUT_MINUTES was never enforced. Pod restarts
didn't help: new pods reclaimed the same poisoned work_unit_keys
and re-wedged within minutes.
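The failure mode described in the two points above can be reproduced in
isolation. The sketch below is not the deriver code: hung_llm_call,
worker, and demo are hypothetical stand-ins, and the pool size of 8 is
taken from the description above. It only shows that a semaphore held
across a never-returning await stays locked, and that a cleanup gated
behind `semaphore.locked()` is then skipped on every tick.

```python
import asyncio


async def hung_llm_call() -> None:
    # Hypothetical stand-in for an HTTP read that never terminates.
    await asyncio.Event().wait()


async def worker(semaphore: asyncio.Semaphore) -> None:
    # The slot is held for the whole duration of the inner call.
    async with semaphore:
        await hung_llm_call()


async def demo(pool_size: int = 8):
    semaphore = asyncio.Semaphore(pool_size)
    tasks = [asyncio.create_task(worker(semaphore)) for _ in range(pool_size)]
    await asyncio.sleep(0.05)  # let every worker take a slot and hang

    locked = semaphore.locked()
    cleanups = 0
    for _ in range(3):  # three poll ticks using the pre-fix ordering
        if semaphore.locked():
            continue  # cleanup below is never reached while the pool is full
        cleanups += 1

    # Tear down the hung workers so the demo itself can exit.
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return locked, cleanups


if __name__ == "__main__":
    print(asyncio.run(demo()))
```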
Fixes:
- Add DERIVER_WORK_UNIT_TIMEOUT_SECONDS (default 600s) and wrap both
process_representation_batch and process_item in asyncio.wait_for.
TimeoutError propagates to _handle_processing_error, the
`async with` unwinds, the semaphore slot releases.
- Move cleanup_stale_work_units above the semaphore-locked check so
stale AQS rows get reaped on every poll tick, even with a full
pool. Cleanup is cheap; running it unconditionally costs one
index scan per poll.
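A minimal sketch of both fixes, under stated assumptions: the function
bodies, the `errors` list standing in for _handle_processing_error, and
the poll_once helper are hypothetical, and the timeout is shortened
from the 600s default so the example finishes quickly. The real
process_work_unit and polling_loop may be structured differently.

```python
import asyncio

# Hypothetical short value for the demo; the commit's default is 600 seconds.
WORK_UNIT_TIMEOUT_SECONDS = 0.05


async def hung_llm_call() -> None:
    # Stand-in for a streamed response that never terminates.
    await asyncio.Event().wait()


async def process_work_unit(semaphore: asyncio.Semaphore, errors: list) -> None:
    async with semaphore:
        try:
            # wait_for cancels the inner call once the timeout elapses.
            await asyncio.wait_for(hung_llm_call(), timeout=WORK_UNIT_TIMEOUT_SECONDS)
        except asyncio.TimeoutError as exc:
            errors.append(exc)  # stand-in for _handle_processing_error
    # Leaving the `async with` block releases the slot.


async def poll_once(semaphore: asyncio.Semaphore, cleanup) -> bool:
    await cleanup()  # runs unconditionally, before the capacity check
    if semaphore.locked():
        return False  # pool full: skip claiming new work this tick
    return True  # new work would be claimed here


async def demo():
    semaphore = asyncio.Semaphore(1)
    errors: list = []
    await process_work_unit(semaphore, errors)
    released = not semaphore.locked()

    cleanups = 0

    async def cleanup_stale_work_units() -> None:
        nonlocal cleanups
        cleanups += 1

    await semaphore.acquire()  # simulate a full pool
    claimed = await poll_once(semaphore, cleanup_stale_work_units)
    semaphore.release()
    return errors, released, cleanups, claimed


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With this ordering the cleanup still runs on a tick where the pool is
full, and a timed-out call gives its slot back instead of holding it.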
Symptoms before fix: active_queue_sessions rows aging past
STALE_SESSION_TIMEOUT_MINUTES, queue.processed=false count climbing
into thousands across all task types, deriver pod alive (PID 1 ok)
but log output silent for hours.
1 parent 7fec670 commit ac611c5
2 files changed
Lines changed: 48 additions & 9 deletions