Skip to content

fix: prevent zombie jobs from starting after deletion#465

Open
Zakaria-Kofiro wants to merge 3 commits into
masterfrom
zkofiro/fix-unknown-job-bug
Open

fix: prevent zombie jobs from starting after deletion#465
Zakaria-Kofiro wants to merge 3 commits into
masterfrom
zkofiro/fix-unknown-job-bug

Conversation

@Zakaria-Kofiro
Copy link
Copy Markdown
Collaborator

title: -

Please make sure these check boxes are checked before submitting

  • ** Squashed Commits **
  • ** All Tests Passed ** - mvn clean test -P default

** PR review process **

  • Requires one +1 from a reviewer
  • Repository owners will merge your PR once it is approved.

During 03/02 load test, a deleted UC7 job in Created status was started
as an unknown clone and kicked off agents. Root causes: no Deleted enum
value, broken String-based synchronized lock (TOCTOU), no null checks on
start path, no REST DELETE endpoint, and VMTracker status pipeline could
overwrite any status.

Changes:
- Add Deleted to JobQueueStatus enum (soft-delete instead of hard-delete)
- Fix TOCTOU race in startJob/startAgents: replace synchronized(jobId)
  with shared per-job lock via JobLockManager (ConcurrentHashMap-backed,
  stable identity, no eviction). deleteJob uses same lock.
- Add null + status guards in JobEventSender.startJob/startAgents,
  WorkLoadFactory.getModelRunner, and JobQueueAction.run/startAgents
- Add DELETE /v2/jobs/{jobId} endpoint with 409 Conflict for active jobs
  (GenericServiceConflictException + handler)
- Guard VMTrackerImpl against overwriting Deleted status from stale
  agent reports (both container-init and status-calc paths)
- Update UI deletion to soft-delete (isDeletable policy enforcement)
- Filter Deleted jobs from API list endpoints and UI tree
- Align delete policy: UI and API both allow delete for non-active
  statuses (Created, Queued, Stopped, Completed, Aborted)

Known follow-ups:
- VMTracker check-then-save is non-atomic (needs CAS query)
- JobManager.jobInfoMapLocalCache has no cleanup on delete
@Zakaria-Kofiro Zakaria-Kofiro added AI Generated bug java Pull requests that update java code labels Mar 23, 2026
@Zakaria-Kofiro Zakaria-Kofiro changed the title fix: Prevent zombie jobs from starting after deletion fix: prevent zombie jobs from starting after deletion Mar 24, 2026
zkofiro added 2 commits May 26, 2026 11:30
…k, watchdog guard

- Move JobLockManager to api module (com.intuit.tank.vm.common) so all
  modules share the same per-job lock: JobEventSender, JobServiceV2Impl,
  JobTreeTableBean, and VMTrackerImpl
- VMTrackerImpl.addStatusToJobContainer: wrap read/calculate/save in
  JobLockManager.getLock(jobId) — prevents concurrent delete from being
  overwritten by stale status update
- VMTrackerImpl.setStatusThread: remove first-time DB write from the
  cloudVmStatusContainer==null path — only initialize in-memory container,
  let locked addStatusToJobContainer be the single DB writer
- VMTrackerImpl.setStatusThread: replace early return with else block so
  AWSXRay.endSegment() always runs (fixes segment leak for deleted jobs)
- JobTreeTableBean.deleteJobInstance: use JobLockManager.getLock instead
  of local ConcurrentHashMap, add active-status re-check inside lock,
  remove queue.getJobs() removal that caused unknown zombie entries
- AgentWatchdog.killJobDirectly: skip DB status overwrite for Deleted jobs
- JobEventSenderTest: fix stale assertion — non-Starting jobs no longer
  fire JOB_STARTED event
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

AI Generated bug java Pull requests that update java code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants