Commit 9e903b3
Add a 60-minute active-compute timeout per fragment
run_job now takes a `timeout` and checks it between tiles (paused time
excluded), aborting a fragment that never converges so a runaway or
mis-sized job can't pin a GPU indefinitely. A timed-out fragment flows
through the existing run-loop error path: it is logged and dropped from
in-flight (not submitted), then the worker backs off and claims the next
fragment.
The bound is enforced at tile granularity: a tile already in flight runs
to completion (a blocking cuCtxSynchronize can't be interrupted), so the
worst case is "60 min + one tile".
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 728019e commit 9e903b3
2 files changed
Lines changed: 29 additions & 0 deletions
| File | Additions | Added at diff lines |
|---|---|---|
| 1 (name not captured) | 22 | 19, 300-305, 317, 342-345, 347, 349-357 |
| 2 (name not captured) | 7 | 825-830, 851 |
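A minimal sketch of the behavior the commit message describes, not the code from this commit. Only `run_job` and `timeout` are named in the message; `ActiveClock`, `FragmentTimeout`, `compute_tile`, and the shape of the tile loop are hypothetical, chosen to illustrate an active-compute budget that excludes paused time and is checked only between tiles.

```python
import time


class FragmentTimeout(Exception):
    """Hypothetical error raised when a fragment exceeds its budget."""


class ActiveClock:
    """Hypothetical helper: accumulates elapsed time only while not paused."""

    def __init__(self, now=time.monotonic):
        self._now = now
        self._active = 0.0          # seconds accumulated before the last pause
        self._resumed_at = now()    # None while paused

    def pause(self):
        if self._resumed_at is not None:
            self._active += self._now() - self._resumed_at
            self._resumed_at = None

    def resume(self):
        if self._resumed_at is None:
            self._resumed_at = self._now()

    def elapsed(self):
        if self._resumed_at is None:
            return self._active
        return self._active + (self._now() - self._resumed_at)


def run_job(tiles, compute_tile, timeout, clock=None):
    """Process tiles in order, checking the budget between tiles only.

    compute_tile stands in for a blocking GPU call that cannot be
    interrupted, so a tile that has started always runs to completion.
    """
    if clock is None:
        clock = ActiveClock()
    results = []
    for tile in tiles:
        if clock.elapsed() > timeout:
            raise FragmentTimeout(f"active compute exceeded {timeout:.0f}s")
        results.append(compute_tile(tile))
    return results
```

Because the check happens before each tile rather than during it, the last tile to pass the check can start just under the budget and still run to completion. In this sketch that is where the "60 min + one tile" worst case comes from. A caller would catch `FragmentTimeout` in the same place it handles other fragment errors.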