Skip to content

Commit 1decf38

Browse files
committed
[scheduler] Gate pending-jobs fetch on live SUM(proc), not stale PG
The fetch query filtered jobs on job_resource/folder_resource/subscription .int_cores, but those PG columns are only materialized by the ~120s recompute loop. The lag caused false-exclude starvation: when a frame completes and frees burst, the column stays high for up to a cycle, so the show/folder/job is dropped from the fetch and its (especially low-priority) jobs aren't queried until the next recompute. It also over-fetched jobs for caps already full, crowding the priority LIMIT. Replace the gates with three show-scoped CTEs that sum the live `proc` table (transactionally accurate: scheduler inserts on book, Cuebot deletes on completion, compensation deletes on failed launch), mirroring the recompute joins: - job_live (proc -> pk_job) -> job cap - folder_live (proc -> job -> pk_folder) -> folder cap (cores+gpus) - sub_live (proc -> host -> pk_alloc) -> bookable_shows burst All LEFT JOINed (no row = full headroom), scoped to the show via i_proc_pkshow. Folds the prior job-cap correlated subquery into job_live for one uniform pattern. proc is preferred over the live Redis acct:* counters here: it's the DB ground truth with no publish-failure stale-high window, and the job cap must be filtered in-query before the LIMIT regardless. Watch scheduler_job_query_duration_seconds; a Redis-backed subscription pre-check is the fallback if the aggregation proves DB-heavy. Validated: stress_booking_and_accounting passes (drain 99.7% dispatched, 0 rejections, audit OK; saturation audit OK with correct burst rejections).
1 parent 44c1670 commit 1decf38

1 file changed

Lines changed: 57 additions & 25 deletions

File tree

rust/crates/scheduler/src/dao/job_dao.rs

Lines changed: 57 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -65,18 +65,58 @@ impl DispatchJob {
6565
}
6666

6767
static QUERY_PENDING_BY_SHOW_FACILITY_TAG: &str = r#"
68-
-- bookable_shows: shows that still have room in at least one subscription.
69-
WITH bookable_shows AS (
68+
-- LIVE booked-core CTEs. The job/folder/subscription caps must be gated on CURRENT
69+
-- usage, but the PG columns job_resource.int_cores / folder_resource.int_cores /
70+
-- subscription.int_cores are only materialized by the ~120s recompute loop, so they
71+
-- lag reality. `proc` is the transactionally-accurate source (the scheduler inserts on
72+
-- booking, Cuebot deletes on frame completion, compensation deletes on a failed launch),
73+
-- so we sum it directly. Each CTE is scoped to this show ($1) via i_proc_pkshow and
74+
-- mirrors the recompute aggregation (so the value equals a fresher copy of the PG
75+
-- column), and joins on indexed pk_host / pk_job. Gating on stale PG would otherwise
76+
-- (a) over-fetch jobs for caps that are full in Redis -> wasted rejections, and worse
77+
-- (b) FALSE-EXCLUDE: a frame completes and frees burst live, but the lagged PG column
78+
-- stays high for up to a cycle, dropping the show/folder/job from the fetch and starving
79+
-- its (esp. low-priority) jobs until the next recompute.
80+
WITH job_live AS (
81+
-- per-job booked cores (mirrors RECOMPUTE_JOB_RESOURCE_FROM_PROC)
82+
SELECT p.pk_job, COALESCE(SUM(p.int_cores_reserved), 0)::int AS cores
83+
FROM proc p
84+
WHERE p.pk_show = $1
85+
GROUP BY p.pk_job
86+
),
87+
folder_live AS (
88+
-- per-folder booked cores/gpus (mirrors RECOMPUTE_FOLDER_RESOURCE_FROM_PROC)
89+
SELECT j2.pk_folder,
90+
COALESCE(SUM(p.int_cores_reserved), 0)::int AS cores,
91+
COALESCE(SUM(p.int_gpus_reserved), 0)::int AS gpus
92+
FROM proc p
93+
JOIN job j2 ON j2.pk_job = p.pk_job AND j2.str_state <> 'FINISHED'
94+
WHERE p.pk_show = $1
95+
GROUP BY j2.pk_folder
96+
),
97+
sub_live AS (
98+
-- per-alloc booked cores for this show (mirrors RECOMPUTE_SUBSCRIPTION_FROM_PROC:
99+
-- proc -> host -> alloc, excluding local/desktop bookings)
100+
SELECT h.pk_alloc, COALESCE(SUM(p.int_cores_reserved), 0)::int AS cores
101+
FROM proc p
102+
JOIN host h ON h.pk_host = p.pk_host
103+
WHERE p.pk_show = $1 AND p.b_local = false
104+
GROUP BY h.pk_alloc
105+
),
106+
-- bookable_shows: shows that still have room in at least one subscription (LIVE).
107+
bookable_shows AS (
70108
SELECT DISTINCT w.pk_show, sh.str_name AS show_name
71109
FROM subscription s
110+
-- LEFT JOIN: an alloc with no booked procs has full headroom.
111+
LEFT JOIN sub_live sl ON sl.pk_alloc = s.pk_alloc
72112
INNER JOIN vs_waiting w ON s.pk_show = w.pk_show
73113
INNER JOIN show sh ON sh.pk_show = w.pk_show
74114
WHERE s.pk_show = $1
75115
-- Burst == 0 is used to freeze a subscription.
76116
AND s.int_burst > 0
77-
-- At least one core unit available.
78-
AND s.int_burst - s.int_cores >= $2
79-
AND s.int_cores < s.int_burst
117+
-- At least one core unit available, measured against LIVE usage.
118+
AND s.int_burst - COALESCE(sl.cores, 0) >= $2
119+
AND COALESCE(sl.cores, 0) < s.int_burst
80120
)
81121
SELECT
82122
j.pk_job,
@@ -87,15 +127,19 @@ INNER JOIN bookable_shows bs ON j.pk_show = bs.pk_show
87127
INNER JOIN job_resource jr ON j.pk_job = jr.pk_job
88128
INNER JOIN folder f ON j.pk_folder = f.pk_folder
89129
INNER JOIN folder_resource fr ON f.pk_folder = fr.pk_folder
130+
-- LEFT JOIN the live CTEs: no row => 0 booked => full headroom.
131+
LEFT JOIN job_live jl ON jl.pk_job = j.pk_job
132+
LEFT JOIN folder_live fl ON fl.pk_folder = f.pk_folder
90133
WHERE j.str_state = 'PENDING'
91134
AND j.b_paused = false
92135
AND j.pk_facility = $4
93-
-- Folder must have any room at all; per-layer fit is checked below.
94-
AND (fr.int_max_cores = -1 OR fr.int_cores < fr.int_max_cores)
95-
AND (fr.int_max_gpus = -1 OR fr.int_gpus < fr.int_max_gpus)
136+
-- Folder must have any room at all (LIVE); per-layer fit is checked below.
137+
AND (fr.int_max_cores = -1 OR COALESCE(fl.cores, 0) < fr.int_max_cores)
138+
AND (fr.int_max_gpus = -1 OR COALESCE(fl.gpus, 0) < fr.int_max_gpus)
96139
-- The job must have at least one layer that matches the tag set, has waiting
97-
-- frames, and fits within the folder cap. EXISTS short-circuits per job and
98-
-- avoids the cardinality blowup of joining layer + layer_stat at the outer level.
140+
-- frames, and fits within the folder AND job caps (both LIVE). EXISTS short-circuits
141+
-- per job and avoids the cardinality blowup of joining layer + layer_stat at the
142+
-- outer level.
99143
AND EXISTS (
100144
SELECT 1
101145
FROM layer l
@@ -104,21 +148,9 @@ WHERE j.str_state = 'PENDING'
104148
AND ls.int_waiting_count > 0
105149
AND string_to_array(REPLACE($3, ' ', ''), '|')
106150
&& string_to_array(REPLACE(l.str_tags, ' ', ''), '|')
107-
AND (fr.int_max_cores = -1 OR fr.int_cores + l.int_cores_min <= fr.int_max_cores)
108-
AND (fr.int_max_gpus = -1 OR fr.int_gpus + l.int_gpus_min <= fr.int_max_gpus)
109-
-- Job-level core cap, checked against LIVE proc usage. `job_resource.int_cores`
110-
-- lags reality (only the ~120s recompute loop writes it), so we sum the proc
111-
-- rows directly: `proc` is transactionally accurate (the scheduler inserts on
112-
-- booking, Cuebot deletes on frame completion, compensation deletes on a failed
113-
-- launch). This excludes jobs already at their cap from the fetch so they don't
114-
-- crowd out bookable jobs within the priority LIMIT below — mirroring Cuebot's
115-
-- DispatchQuery. `-1` = unlimited. The correlated SUM is an index seek on
116-
-- proc(pk_job) over the job's handful of running rows.
117-
AND (jr.int_max_cores = -1
118-
OR COALESCE(
119-
(SELECT SUM(p.int_cores_reserved) FROM proc p WHERE p.pk_job = j.pk_job),
120-
0
121-
) + l.int_cores_min <= jr.int_max_cores)
151+
AND (fr.int_max_cores = -1 OR COALESCE(fl.cores, 0) + l.int_cores_min <= fr.int_max_cores)
152+
AND (fr.int_max_gpus = -1 OR COALESCE(fl.gpus, 0) + l.int_gpus_min <= fr.int_max_gpus)
153+
AND (jr.int_max_cores = -1 OR COALESCE(jl.cores, 0) + l.int_cores_min <= jr.int_max_cores)
122154
)
123155
ORDER BY jr.int_priority DESC
124156
LIMIT $5

0 commit comments

Comments
 (0)