[scheduler] Gate pending-jobs fetch on live SUM(proc), not stale PG

The fetch query filtered jobs on
job_resource/folder_resource/subscription.int_cores, but those PG columns are
only materialized by the ~120s recompute
loop. The lag caused false-exclude starvation: when a frame completes and frees
burst, the column stays high for up to a cycle, so the show/folder/job is
dropped from the fetch and its (especially low-priority) jobs aren't queried
until the next recompute. The query also over-fetched jobs whose caps were
already full, so unbookable rows crowded the priority LIMIT.
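
The false-exclude case above can be illustrated with a minimal sketch. It is not the scheduler's code: the `Job` class, its fields, and both gate functions are hypothetical stand-ins, with `cached_cores` playing the role of the materialized int_cores column and `procs` playing the role of rows in the proc table.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical stand-in for a job with a core cap."""
    max_cores: int
    procs: list = field(default_factory=list)  # cores held by each live proc
    cached_cores: int = 0  # stand-in for the periodically materialized column

    def recompute(self):
        # What the periodic recompute loop would do: refresh the cached column.
        self.cached_cores = sum(self.procs)

    def live_cores(self):
        # What a live SUM over proc would return right now.
        return sum(self.procs)


def passes_stale_gate(job):
    # Old behaviour: gate on the cached column.
    return job.cached_cores < job.max_cores


def passes_live_gate(job):
    # New behaviour: gate on the live sum.
    return job.live_cores() < job.max_cores


def frame_completes(job):
    # A completed frame removes its proc, but nothing refreshes the cache.
    job.procs.pop()
```

With a job capped at 100 cores holding one 100-core proc, a recompute sets the cached value to 100. After `frame_completes`, the stale gate still rejects the job until the next `recompute()`, while the live gate admits it immediately.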
Replace the gates with three show-scoped CTEs that sum the live `proc` table
(transactionally accurate: scheduler inserts on book, Cuebot deletes on
completion, compensation deletes on failed launch), mirroring the recompute
joins:
- job_live (proc -> pk_job) -> job cap
- folder_live (proc -> job -> pk_folder) -> folder cap (cores+gpus)
- sub_live (proc -> host -> pk_alloc) -> bookable_shows burst
All three are LEFT JOINed (no row means full headroom) and scoped to the show
via i_proc_pkshow. This also folds the prior job-cap correlated subquery into
job_live, so all three gates follow one uniform pattern.
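
A rough sketch of the job_live gate, using Python's sqlite3 as a stand-in for PostgreSQL. The table layout and the column names int_max_cores, int_priority, and int_cores_reserved are simplified assumptions for illustration, not the real schema, and the function names are hypothetical. Only the job cap is shown; folder_live and sub_live would follow the same LEFT JOIN plus COALESCE shape with their own joins.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a simplified, assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE job (
            pk_job TEXT PRIMARY KEY,
            pk_show TEXT,
            int_max_cores INTEGER,
            int_priority INTEGER
        );
        CREATE TABLE proc (
            pk_proc TEXT PRIMARY KEY,
            pk_job TEXT,
            pk_show TEXT,
            int_cores_reserved INTEGER
        );
        INSERT INTO job VALUES
            ('job_a', 'show1', 200, 10),
            ('job_b', 'show1', 200, 50),
            ('job_c', 'show1', 100, 90);
        INSERT INTO proc VALUES
            ('p1', 'job_b', 'show1', 100),
            ('p2', 'job_c', 'show1', 100);
    """)
    return conn


def fetch_pending_jobs(conn, show_id, limit):
    """Return job ids for one show that still have core headroom."""
    query = """
        WITH job_live AS (
            -- Live usage per job, summed from proc and scoped to the show.
            SELECT pk_job, SUM(int_cores_reserved) AS live_cores
            FROM proc
            WHERE pk_show = :show
            GROUP BY pk_job
        )
        SELECT j.pk_job
        FROM job AS j
        LEFT JOIN job_live AS jl ON jl.pk_job = j.pk_job
        WHERE j.pk_show = :show
          -- No job_live row means no procs, i.e. full headroom.
          AND COALESCE(jl.live_cores, 0) < j.int_max_cores
        ORDER BY j.int_priority DESC, j.pk_job
        LIMIT :limit
    """
    rows = conn.execute(query, {"show": show_id, "limit": limit})
    return [row[0] for row in rows]
```

In this demo data job_c is at its cap, so the fetch returns job_b and job_a. Deleting proc p2 (a completed frame) makes job_c eligible on the very next query, with no recompute step in between, and the cap filter runs before the LIMIT is applied.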
proc is preferred over the live Redis acct:* counters here: it's the DB ground
truth with no publish-failure stale-high window, and the job cap must be
filtered in-query before the LIMIT regardless. Watch
scheduler_job_query_duration_seconds; a Redis-backed subscription pre-check is
the fallback if the aggregation proves DB-heavy.
Validated: stress_booking_and_accounting passes (drain 99.7% dispatched, 0
rejections, audit OK; saturation audit OK with correct burst rejections).