Conversation
Pull request overview
Note: Copilot was unable to run its full agentic suite in this review.
Adjusts the Scheduler's conformer-troubleshooting control flow to avoid prematurely breaking when troubleshooting is exhausted, preventing species from being incorrectly marked as having no converged TS due to a race-to-completion bug.
Changes:
- Only break after conformer troubleshooting if conformer jobs are actually running for the species.
- Explicitly allow fall-through to the “all conformers done” logic when troubleshooting doesn’t launch new work.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main     #862      +/-   ##
==========================================
- Coverage   60.13%   60.10%   -0.03%
==========================================
  Files         102      102
  Lines       31043    31041       -2
  Branches     8082     8082
==========================================
- Hits        18667    18657      -10
- Misses      10068    10071       +3
- Partials     2308     2313       +5
alongd left a comment:
Thanks! Please see the couple of comments below.
# mistakenly concludes no TS guess converged.
if any(is_conformer_job(j)
       for j in self.running_jobs.get(label, [])):
    break
Does the unconditional break in the loop prevent the "all done" check from running?
The break is conditional: it is guarded by if any(is_conformer_job(j) for j in self.running_jobs.get(label, [])):
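To see how the guarded break restores the fall-through, here is a minimal, self-contained sketch. The guarded break and the is_conformer_job / running_jobs names come from the diff above; the loop scaffolding, all_conformers_done, and the sample data are illustrative assumptions, not ARC's actual code:

# Stand-in state: troubleshooting exhausted, no new conformer jobs launched.
running_jobs = {'TS0': []}

def is_conformer_job(job) -> bool:
    # conf_opt / conf_sp are the conformer job types named in the PR description.
    return getattr(job, 'job_type', '') in ('conf_opt', 'conf_sp')

def all_conformers_done(label: str) -> bool:
    return True  # stand-in for the real "all conformers done" check

for label in list(running_jobs):
    # Old behavior: an unconditional break here skipped the completion check
    # whenever troubleshooting was exhausted.
    if any(is_conformer_job(j) for j in running_jobs.get(label, [])):
        break
    # New behavior: with no conformer jobs in flight, execution falls through,
    # so previously converged conformers are still evaluated.
    if all_conformers_done(label):
        print(f'{label}: evaluating previously converged conformers')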
active_after_retry = counts[TaskState.CLAIMED.value] + counts[TaskState.RUNNING.value]
resubmit_grace = 120  # seconds
resubmit_grace = 120  # seconds — minimum wait after any submission
fresh_stale_timeout = 10800  # seconds (3 hours) — max time to trust fresh tasks' scheduler job
This is hardcoded. On long queues it might be too aggressive, and on fast queues it might be too conservative.
Yes, it is hardcoded because right now ARC has trouble telling whether a job in a batch is queued or is stuck due to a worker issue (i.e., why it could be stuck in pending). I think this is a limitation of Pipe: it only reads the JSON file on the hard disk and doesn't actually check the queue itself.
I am going to have to change the logic altogether here because of this limitation. We will have to assume that PENDING in the task JSON indicates that the job in the batch is queued. There appears to be no other way unless we poll the actual scheduler itself. Otherwise, ARC will assume a PENDING job (that is actually queued) needs resubmission, and it will do this forever, or at least until the jobs are running on the queue.
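Under that assumption, a minimal runnable sketch of the intended policy could look like this. TaskState.CLAIMED and TaskState.RUNNING mirror the diff above; TaskState.PENDING, needs_resubmission, and the counts layout are assumed names for illustration only:

from enum import Enum

class TaskState(Enum):
    PENDING = 'pending'  # assumed name: task JSON says the batch job is queued
    CLAIMED = 'claimed'
    RUNNING = 'running'

def needs_resubmission(counts: dict) -> bool:
    # Tasks claimed by or running on a worker are clearly alive.
    active = counts.get(TaskState.CLAIMED.value, 0) + counts.get(TaskState.RUNNING.value, 0)
    # A PENDING task is assumed to be queued in the batch system. Since Pipe
    # only reads the task JSON and cannot poll the scheduler, PENDING is never
    # treated as a lost worker, and thus never auto-resubmitted.
    queued = counts.get(TaskState.PENDING.value, 0)
    return active == 0 and queued == 0

print(needs_resubmission({'running': 10, 'pending': 10}))  # False: leave queued tasks alone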
Force-pushed from c704f1a to 014c276.
Workers remaining in the scheduler queue will eventually claim retried pending tasks when they start. Removing automatic resubmission prevents duplicate job submissions and ensures that manual intervention is required if a scheduler job is prematurely terminated or killed.
…oting

Prevents ARC from incorrectly concluding that no TS guess converged when the final running conformer fails troubleshooting. The scheduler now checks whether other conformer jobs are still in flight before breaking, ensuring that the completion check, and thus the evaluation of previously successful conformers, is triggered even if the last job fails to converge. Includes a refactor of the conformer job identification logic in arc/checks/common.py.
Updates the scheduler pipe tests to verify that Pipe runs no longer automatically resubmit jobs when retried tasks are present. This aligns the test suite with the policy that workers remaining in the scheduler queue handle retries, preventing duplicate job submissions.
Two changes:
Change 1:
In the Scheduler, only break after conformer troubleshooting if new jobs (conf_opt or conf_sp) are actually running for that species. This ensures the scheduler correctly falls through to the "all conformers done" check if troubleshooting was attempted but failed to launch new tasks.
Essentially, this was a race-to-completion bug. If, for example, 3 TS guesses were being troubleshot and 2 of the 3 finished troubleshooting, then the last one, if it failed and exhausted all attempts, would trigger a bug where ARC declared that no TS had converged.
Change 2:
There is a bug in ARC when Pipe is active. When a batch job is submitted with, say, 20 jobs in it (so 20 workers are needed) and only 10 of those are picked up by workers while the other 10 are still queued, ARC misinterprets this and attempts to resubmit those 10, thinking they were not assigned workers properly. And ARC will continue to do this ad infinitum.
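To make the accounting error concrete, here is a small runnable illustration using the numbers from the description above. The state names mirror the earlier diff, 'pending' is an assumed state name, and none of this is ARC's actual code:

total = 20
counts = {'claimed': 0, 'running': 10, 'pending': 10}

# Buggy accounting: only claimed/running tasks count as alive, so the 10
# queued tasks look like lost workers and get resubmitted every cycle.
active_buggy = counts['claimed'] + counts['running']
print(f'buggy: resubmits {total - active_buggy} tasks, ad infinitum')

# Intended accounting: pending tasks are assumed to be queued in the batch
# system and are left for workers that will eventually start and claim them.
active_fixed = active_buggy + counts['pending']
print(f'fixed: {total - active_fixed} tasks actually need attention')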