fix: stop funneling all jobs to one worker (JobsProgress shared state) (#517)
* fix: revert JobsProgress to in-memory set, add PingJobMirror
JobsProgress persisted in-progress jobs to .runpod_jobs.pkl under
os.getcwd(); on endpoints with a network volume every worker shared one
file, so occupancy accounting cross-contaminated and jobs funneled onto
a single worker (#432). Restore the 1.7.10 in-memory set and feed the
separate ping process via a per-worker shared-memory mirror instead.
Refs SLS-314, fixes #432
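
The failure mode described above can be reproduced in miniature. The sketch below is illustrative only: `FileBackedJobs`, `InMemoryJobs`, and the `jobs_state.pkl` file name are hypothetical stand-ins, not the SDK's actual classes or file. A temporary directory plays the role of the shared network volume.

```python
import os
import pickle
import tempfile


class FileBackedJobs:
    """Hypothetical tracker that persists job ids to a file in `directory`."""

    def __init__(self, directory):
        self.path = os.path.join(directory, "jobs_state.pkl")

    def _load(self):
        if not os.path.exists(self.path):
            return set()
        with open(self.path, "rb") as f:
            return pickle.load(f)

    def add(self, job_id):
        jobs = self._load()
        jobs.add(job_id)
        with open(self.path, "wb") as f:
            pickle.dump(jobs, f)

    def count(self):
        return len(self._load())


class InMemoryJobs:
    """Hypothetical tracker that keeps job ids in a per-instance set."""

    def __init__(self):
        self._jobs = set()

    def add(self, job_id):
        self._jobs.add(job_id)

    def count(self):
        return len(self._jobs)


def demo():
    # Two "workers" whose working directory is the same shared volume.
    with tempfile.TemporaryDirectory() as shared_volume:
        worker_a = FileBackedJobs(shared_volume)
        worker_b = FileBackedJobs(shared_volume)
        worker_a.add("job-1")
        worker_a.add("job-2")
        # worker_b never took a job, yet it reads worker_a's jobs.
        file_counts = (worker_a.count(), worker_b.count())

    mem_a, mem_b = InMemoryJobs(), InMemoryJobs()
    mem_a.add("job-1")
    mem_a.add("job-2")
    memory_counts = (mem_a.count(), mem_b.count())
    return file_counts, memory_counts
```

With the file-backed tracker, the idle worker reports the busy worker's jobs as its own; with the in-memory tracker, each worker's count reflects only its own jobs.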
* fix: ping reads in-progress job ids from injected mirror
The ping process no longer touches JobsProgress; it reads the job-id
snapshot from the per-worker PingJobMirror passed in at process start.
Refs SLS-314
* fix: JobScaler pushes job-id snapshot to ping mirror
JobScaler updates the per-worker PingJobMirror after each job is
acquired or finished, so the separate ping process always sees the
current in-progress job ids without shared-file state.
Refs SLS-314
* fix: create and share one PingJobMirror per worker
run_worker constructs a single mirror in the main process and passes it
to both the ping process and the JobScaler, completing the #432 fix.
Refs SLS-314
* chore: drop removed job-state-file reference from local_sim
The .runpod_jobs.pkl state file no longer exists; remove its cleanup
from the local_sim Makefile.
Refs SLS-314
* fix: sync ping mirror inside JobsProgress, cover API mode
PR #517 review (capy-ai): the API/realtime path (rp_fastapi WorkerAPI)
started the ping without a mirror while tracking jobs in JobsProgress,
so heartbeats sent job_id=None there. Move mirror propagation into
JobsProgress.add/remove/clear via an attached mirror, so every writer
path (JobScaler and rp_fastapi) stays in sync from a single place.
Attach the mirror in run_worker and WorkerAPI; drop JobScaler's
now-redundant job_mirror plumbing.
Refs SLS-314
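
A minimal sketch of the "attached mirror" arrangement described above, under stated assumptions: `SketchMirror`, `SketchJobsProgress`, `attach_mirror`, `publish`, and `snapshot` are hypothetical names, and the comma-joined byte buffer is one possible encoding, not necessarily how PingJobMirror stores its snapshot. It assumes job ids contain no commas. The buffer is a `multiprocessing.Array`, which can be handed to a child process at start.

```python
import multiprocessing


class SketchMirror:
    """Hypothetical per-worker mirror; not the real PingJobMirror."""

    def __init__(self, size=4096):
        self._size = size
        # Shared char buffer guarded by its own lock.
        self._buf = multiprocessing.Array("c", size)

    def publish(self, job_ids):
        data = ",".join(sorted(job_ids)).encode("utf-8")
        if len(data) > self._size:
            raise ValueError("snapshot does not fit in the mirror buffer")
        with self._buf.get_lock():
            self._buf.value = data

    def snapshot(self):
        with self._buf.get_lock():
            data = self._buf.value
        return data.decode("utf-8").split(",") if data else []


class SketchJobsProgress:
    """Hypothetical in-memory job set that syncs an attached mirror."""

    def __init__(self):
        self._jobs = set()
        self._mirror = None

    def attach_mirror(self, mirror):
        self._mirror = mirror
        self._sync()

    def _sync(self):
        # Single place where every writer path updates the mirror.
        if self._mirror is not None:
            self._mirror.publish(self._jobs)

    def add(self, job_id):
        self._jobs.add(job_id)
        self._sync()

    def remove(self, job_id):
        self._jobs.discard(job_id)
        self._sync()

    def clear(self):
        self._jobs.clear()
        self._sync()
```

Because propagation lives in `add`, `remove`, and `clear`, any caller that mutates the set keeps the mirror current without its own plumbing; a reader such as a ping loop would only call `snapshot()`.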