refactor data mover: switch to BatchJob with auto cleanup and sleep after every run #265
…fter every run Signed-off-by: prekshivyas <prekhsivyas@gmail.com>
roclark left a comment
Thanks for putting this together, @prekshivyas! This looks good to me - just some small comments on extra wait states.
raise TimeoutError(f"Job {job_id} did not complete within {timeout} seconds.")
current_job = client.job.get(job_id)
current_job_status = current_job.status.state
if count > 0 and current_job_status in [LeptonJobState.Completed, LeptonJobState.Failed, LeptonJobState.Unknown]:
Do we need the count > 0 check? If the job is immediately in one of the acceptable states, figured we could break out right away, but I suppose it could be in Unknown prior to running?
So I had put in some logs to check which states come up while the job is just scheduling or starting. Sometimes it would be Unknown, as in the initial tests, where I saw it come up more than 2 times, but in my later tests it would just come up as Starting and not Unknown anymore. Just to be sure and avoid this randomness, I added the count check.
That makes sense, thanks! We can keep the count check then. If we want to be super efficient, we could do another branch for if current_job_status == LeptonJobState.Unknown and count == 0 and sleep/retry in that state and only check for Completed and Failed without the count here, but I'm not too particular about it.
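As a rough sketch of the alternative branch described above (not the code in this PR): the loop below retries when the very first poll reports Unknown, and otherwise returns or raises as soon as the job is Completed or Failed, with no count guard and no trailing sleep. `JobState` is a stand-in for `LeptonJobState`, and `wait_for_job`, `get_state`, and `poll_interval` are hypothetical names; `get_state` is assumed to wrap `client.job.get(job_id).status.state`. How an Unknown seen after the first poll should be handled is not settled in this thread, so the sketch simply keeps polling until the timeout.

```python
import time
from enum import Enum


class JobState(Enum):
    """Stand-in for LeptonJobState; only the states mentioned in this thread."""
    Unknown = "Unknown"
    Starting = "Starting"
    Completed = "Completed"
    Failed = "Failed"


def wait_for_job(get_state, job_id, timeout=60.0, poll_interval=1.0):
    """Poll a job until it completes; raise on failure or timeout.

    get_state is a hypothetical callable, assumed to wrap
    client.job.get(job_id).status.state.
    """
    deadline = time.monotonic() + timeout
    count = 0
    while True:
        if time.monotonic() > deadline:
            raise TimeoutError(f"Job {job_id} did not complete within {timeout} seconds.")
        state = get_state(job_id)
        if state == JobState.Unknown and count == 0:
            # The first poll may report Unknown before the job is scheduled:
            # sleep and retry instead of treating it as a final state.
            count += 1
            time.sleep(poll_interval)
            continue
        if state == JobState.Completed:
            # Return right away so the next stage starts without extra delay.
            return state
        if state == JobState.Failed:
            raise RuntimeError(f"Job {job_id} failed with status: {state}")
        # Any other state (including a later Unknown, by assumption): keep polling.
        count += 1
        time.sleep(poll_interval)
```

With this shape, a job that is already Completed on the first poll returns without sleeping at all, which also covers the request to drop the wait after completion.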
if current_job_status != LeptonJobState.Completed:
    raise RuntimeError(f"Job {job_id} failed with status: {current_job_status}")

time.sleep(sleep)
I lean towards not putting a sleep here. If the data has been uploaded to the remote FS and the data-mover is marked as Completed, I'd say we continue immediately to the next stages to reduce downtime. Theoretically, we should only make it to this line if everything is ready to continue.
Hmm, I see! Yeah, we can remove the sleep, so I will do that now.
Signed-off-by: prekshivyas <prekhsivyas@gmail.com>