fix: prevent zombie jobs from starting after deletion#465
Open
Zakaria-Kofiro wants to merge 3 commits into
Open
fix: prevent zombie jobs from starting after deletion#465Zakaria-Kofiro wants to merge 3 commits into
Zakaria-Kofiro wants to merge 3 commits into
Conversation
During 03/02 load test, a deleted UC7 job in Created status was started
as an unknown clone and kicked off agents. Root causes: no Deleted enum
value, broken String-based synchronized lock (TOCTOU), no null checks on
start path, no REST DELETE endpoint, and VMTracker status pipeline could
overwrite any status.
Changes:
- Add Deleted to JobQueueStatus enum (soft-delete instead of hard-delete)
- Fix TOCTOU race in startJob/startAgents: replace synchronized(jobId)
with shared per-job lock via JobLockManager (ConcurrentHashMap-backed,
stable identity, no eviction). deleteJob uses same lock.
- Add null + status guards in JobEventSender.startJob/startAgents,
WorkLoadFactory.getModelRunner, and JobQueueAction.run/startAgents
- Add DELETE /v2/jobs/{jobId} endpoint with 409 Conflict for active jobs
(GenericServiceConflictException + handler)
- Guard VMTrackerImpl against overwriting Deleted status from stale
agent reports (both container-init and status-calc paths)
- Update UI deletion to soft-delete (isDeletable policy enforcement)
- Filter Deleted jobs from API list endpoints and UI tree
- Align delete policy: UI and API both allow delete for non-active
statuses (Created, Queued, Stopped, Completed, Aborted)
Known follow-ups:
- VMTracker check-then-save is non-atomic (needs CAS query)
- JobManager.jobInfoMapLocalCache has no cleanup on delete
kevin-mcgoldrick
approved these changes
Mar 23, 2026
added 2 commits
May 26, 2026 11:30
…k, watchdog guard - Move JobLockManager to api module (com.intuit.tank.vm.common) so all modules share the same per-job lock: JobEventSender, JobServiceV2Impl, JobTreeTableBean, and VMTrackerImpl - VMTrackerImpl.addStatusToJobContainer: wrap read/calculate/save in JobLockManager.getLock(jobId) — prevents concurrent delete from being overwritten by stale status update - VMTrackerImpl.setStatusThread: remove first-time DB write from the cloudVmStatusContainer==null path — only initialize in-memory container, let locked addStatusToJobContainer be the single DB writer - VMTrackerImpl.setStatusThread: replace early return with else block so AWSXRay.endSegment() always runs (fixes segment leak for deleted jobs) - JobTreeTableBean.deleteJobInstance: use JobLockManager.getLock instead of local ConcurrentHashMap, add active-status re-check inside lock, remove queue.getJobs() removal that caused unknown zombie entries - AgentWatchdog.killJobDirectly: skip DB status overwrite for Deleted jobs - JobEventSenderTest: fix stale assertion — non-Starting jobs no longer fire JOB_STARTED event
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
title: -
Please make sure these check boxes are checked before submitting
mvn clean test -P default** PR review process **