Conversation
Pull request overview
Note: Copilot was unable to run its full agentic suite in this review.
Adjusts the Scheduler's conformer-troubleshooting control flow to avoid prematurely breaking when troubleshooting is exhausted, preventing species from being incorrectly marked as having no converged TS due to a race-to-completion bug.
Changes:
- Only break after conformer troubleshooting if conformer jobs are actually running for the species.
- Explicitly allow fall-through to the “all conformers done” logic when troubleshooting doesn’t launch new work.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main     #862      +/-   ##
==========================================
- Coverage   60.13%   60.10%   -0.03%
==========================================
  Files         102      102
  Lines       31043    31041       -2
  Branches     8082     8082
==========================================
- Hits        18667    18657      -10
- Misses      10068    10071       +3
- Partials     2308     2313       +5
alongd left a comment:
Thanks! Please see the couple of comments below.
# mistakenly concludes no TS guess converged.
if any(is_conformer_job(j)
       for j in self.running_jobs.get(label, [])):
    break
Does the unconditional break in the loop prevent the "all done" check from running?
The break is conditional: it is guarded by if any(is_conformer_job(j) for j in self.running_jobs.get(label, [])):
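To see how the guarded break restores the fall-through, here is a minimal, self-contained sketch. The guarded break and the is_conformer_job / running_jobs names come from the diff above; the loop scaffolding, all_conformers_done, and the sample data are illustrative assumptions, not ARC's actual code:

# Stand-in state: troubleshooting exhausted, no new conformer jobs launched.
running_jobs = {'TS0': []}

def is_conformer_job(job) -> bool:
    # conf_opt / conf_sp are the conformer job types named in the PR description.
    return getattr(job, 'job_type', '') in ('conf_opt', 'conf_sp')

def all_conformers_done(label: str) -> bool:
    return True  # stand-in for the real "all conformers done" check

for label in list(running_jobs):
    # Old behavior: an unconditional break here skipped the completion check
    # whenever troubleshooting was exhausted.
    if any(is_conformer_job(j) for j in running_jobs.get(label, [])):
        break
    # New behavior: with no conformer jobs in flight, execution falls through,
    # so previously converged conformers are still evaluated.
    if all_conformers_done(label):
        print(f'{label}: evaluating previously converged conformers')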
active_after_retry = counts[TaskState.CLAIMED.value] + counts[TaskState.RUNNING.value]
resubmit_grace = 120  # seconds
resubmit_grace = 120  # seconds — minimum wait after any submission
fresh_stale_timeout = 10800  # seconds (3 hours) — max time to trust fresh tasks' scheduler job
This is hardcoded. On long queues it might be too aggressive, and on fast queues it might be too conservative.
Yes, it is hardcoded because right now ARC has trouble telling whether a job in a batch is queued or is stuck due to a worker issue (i.e., why it could be stuck in pending). I think this is a limitation of Pipe: it only reads the JSON file on the hard disk and doesn't actually check the queue itself.
I am going to have to change the logic altogether here because of this limitation. We will have to assume that PENDING in the task JSON indicates that the job in the batch is queued. There appears to be no other way unless we poll the actual scheduler itself. Otherwise, ARC will assume a PENDING job (that is actually queued) needs resubmission, and it will do this forever, or at least until the jobs are running on the queue.
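Under that assumption, a minimal runnable sketch of the intended policy could look like this. TaskState.CLAIMED and TaskState.RUNNING mirror the diff above; TaskState.PENDING, needs_resubmission, and the counts layout are assumed names for illustration only:

from enum import Enum

class TaskState(Enum):
    PENDING = 'pending'  # assumed name: task JSON says the batch job is queued
    CLAIMED = 'claimed'
    RUNNING = 'running'

def needs_resubmission(counts: dict) -> bool:
    # Tasks claimed by or running on a worker are clearly alive.
    active = counts.get(TaskState.CLAIMED.value, 0) + counts.get(TaskState.RUNNING.value, 0)
    # A PENDING task is assumed to be queued in the batch system. Since Pipe
    # only reads the task JSON and cannot poll the scheduler, PENDING is never
    # treated as a lost worker, and thus never auto-resubmitted.
    queued = counts.get(TaskState.PENDING.value, 0)
    return active == 0 and queued == 0

print(needs_resubmission({'running': 10, 'pending': 10}))  # False: leave queued tasks alone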
Force-pushed from c704f1a to 014c276.
Workers remaining in the scheduler queue will eventually claim retried pending tasks when they start. Removing automatic resubmission prevents duplicate job submissions and ensures that manual intervention is required if a scheduler job is prematurely terminated or killed.
…oting

Prevents ARC from incorrectly concluding that no TS guess converged when the final running conformer fails troubleshooting. The scheduler now checks whether other conformer jobs are still in flight before breaking, ensuring that the completion check, and thus the evaluation of previously successful conformers, is triggered even if the last job fails to converge. Includes a refactor of the conformer job identification logic in arc/checks/common.py.
Updates the scheduler pipe tests to verify that Pipe runs no longer automatically resubmit jobs when retried tasks are present. This aligns the test suite with the policy that workers remaining in the scheduler queue handle retries, preventing duplicate job submissions.
Two changes:
Change 1:
In the Scheduler, only break after conformer troubleshooting if new jobs (conf_opt or conf_sp) are actually running for that species. This ensures the scheduler correctly falls through to the "all conformers done" check if troubleshooting was attempted but failed to launch new tasks.
Essentially, this was a race-to-completion bug. If, for example, 3 TS guesses were being troubleshot and 2 of the 3 finished troubleshooting, then the last one, if it failed and exhausted all attempts, would trigger a bug where ARC declared that no TS had converged.
Change 2:
There is a bug in ARC when Pipe is active. When a batch job is submitted with, say, 20 jobs in it (so 20 workers are needed) and only 10 of those are picked up by workers while the other 10 are still queued, ARC misinterprets this and attempts to resubmit those 10, thinking they were not assigned workers properly. And ARC will continue to do this ad infinitum.
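To make the accounting error concrete, here is a small runnable illustration using the numbers from the description above. The state names mirror the earlier diff, 'pending' is an assumed state name, and none of this is ARC's actual code:

total = 20
counts = {'claimed': 0, 'running': 10, 'pending': 10}

# Buggy accounting: only claimed/running tasks count as alive, so the 10
# queued tasks look like lost workers and get resubmitted every cycle.
active_buggy = counts['claimed'] + counts['running']
print(f'buggy: resubmits {total - active_buggy} tasks, ad infinitum')

# Intended accounting: pending tasks are assumed to be queued in the batch
# system and are left for workers that will eventually start and claim them.
active_fixed = active_buggy + counts['pending']
print(f'fixed: {total - active_fixed} tasks actually need attention')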