TS Edge Case & Pipe QUEUE bug #862

Merged
alongd merged 3 commits into main from edge_case_ts on Apr 11, 2026

Conversation

@calvinp0 calvinp0 (Member) commented Apr 9, 2026

Two changes:


Change 1:

In the Scheduler, only break after conformer troubleshooting if new jobs (conf_opt or conf_sp) are actually running for that species. This ensures the scheduler correctly falls through to the "all conformers done" check if troubleshooting was attempted but failed to launch new tasks.

Essentially, this was a race-to-completion bug. If, for example, 3 TS guesses were being troubleshot and 2 of them finished troubleshooting, then the last one, if it failed and exhausted all attempts, would trigger a bug where ARC declared that no TS converged, even though earlier guesses may have.
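
Below is a minimal, self-contained sketch of the fixed control flow. Only is_conformer_job and running_jobs mirror names from the PR's diff; post_troubleshooting_step and the example data are hypothetical scaffolding, not ARC's actual code.

def is_conformer_job(job_name: str) -> bool:
    # In ARC this inspects the job object; a name prefix suffices here.
    return job_name.startswith(('conf_opt', 'conf_sp'))

def post_troubleshooting_step(label: str, running_jobs: dict) -> str:
    """Decide what to do after conformer troubleshooting was attempted."""
    if any(is_conformer_job(j) for j in running_jobs.get(label, [])):
        # Troubleshooting launched new conf_opt/conf_sp jobs: keep waiting.
        return 'wait'
    # Fall through to the "all conformers done" check, so guesses that
    # converged earlier are still evaluated instead of declaring that no
    # TS converged.
    return 'check_all_conformers_done'

# The last TS guess exhausted troubleshooting without launching new jobs:
print(post_troubleshooting_step('TS0', {'TS0': ['opt_a1025']}))  # -> 'check_all_conformers_done'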

Change 2:

There is a bug in ARC when Pipe is active. When a batch job is submitted with, say, 20 tasks in it (so 20 workers are needed) and only 10 of those tasks are picked up by workers while the other 10 remain queued, ARC misunderstands this and attempts to resubmit those 10 tasks, thinking they were never assigned workers properly. And ARC will continue to do this ad infinitum.
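
A self-contained sketch of the misclassification. The TaskState names CLAIMED, RUNNING, and PENDING and the counts arithmetic mirror snippets quoted later in this thread; tasks_to_resubmit and the batch layout are hypothetical.

from collections import Counter
from enum import Enum

class TaskState(Enum):
    PENDING = 'pending'   # submitted with the batch, still waiting in the queue
    CLAIMED = 'claimed'
    RUNNING = 'running'

def tasks_to_resubmit(states: list, fixed: bool) -> int:
    counts = Counter(s.value for s in states)
    active = counts[TaskState.CLAIMED.value] + counts[TaskState.RUNNING.value]
    if fixed:
        # Fixed view: PENDING tasks are queued, not lost; the workers still
        # waiting in the scheduler queue will claim them once they start.
        return 0
    # Buggy view: any task that is not claimed/running looks like it never
    # got a worker, so it is resubmitted over and over.
    return len(states) - active

# A 20-task batch in which 10 tasks were picked up and 10 are still queued:
batch = [TaskState.RUNNING] * 10 + [TaskState.PENDING] * 10
print(tasks_to_resubmit(batch, fixed=False))  # 10 -> resubmitted ad infinitum
print(tasks_to_resubmit(batch, fixed=True))   # 0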

@calvinp0 calvinp0 requested review from Lilachn91, alongd and Copilot April 9, 2026 21:06

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adjusts Scheduler conformer-troubleshooting control flow to avoid prematurely breaking when troubleshooting is exhausted, preventing species from being incorrectly marked as having no converged TS due to a race-to-completion.

Changes:

  • Only break after conformer troubleshooting if conformer jobs are actually running for the species.
  • Explicitly allows fall-through to the “all conformers done” logic when troubleshooting doesn’t launch new work.

Comment thread arc/scheduler.py Outdated
@calvinp0 calvinp0 requested a review from Copilot April 9, 2026 21:54
@calvinp0 calvinp0 changed the title from "Fix stranding species when conformer troubleshooting is exhausted" to "TS Edge Case & Pipe QUEUE bug" on Apr 9, 2026

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.


Comment thread arc/scheduler.py
Comment thread arc/job/pipe/pipe_run.py Outdated
Comment thread arc/job/pipe/pipe_run_test.py Outdated
@codecov codecov bot commented Apr 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.10%. Comparing base (d96a9d2) to head (61a711a).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #862      +/-   ##
==========================================
- Coverage   60.13%   60.10%   -0.03%     
==========================================
  Files         102      102              
  Lines       31043    31041       -2     
  Branches     8082     8082              
==========================================
- Hits        18667    18657      -10     
- Misses      10068    10071       +3     
- Partials     2308     2313       +5     
Flag              Coverage    Δ
functionaltests   60.10% <ø>  (-0.03%) ⬇️
unittests         60.10% <ø>  (-0.03%) ⬇️

Flags with carried forward coverage won't be shown.


@alongd alongd (Member) left a comment

Thanks! Please see the couple of comments below

Comment thread arc/scheduler.py
    # mistakenly concludes no TS guess converged.
    if any(is_conformer_job(j)
           for j in self.running_jobs.get(label, [])):
        break
alongd (Member)

Does the unconditional break in the loop prevent the "all done" check from running?

calvinp0 (Member Author)

The break is conditional: it is guarded by if any(is_conformer_job(j) for j in self.running_jobs.get(label, [])):

Comment thread arc/job/pipe/pipe_run.py Outdated
active_after_retry = counts[TaskState.CLAIMED.value] + counts[TaskState.RUNNING.value]
resubmit_grace = 120 # seconds
resubmit_grace = 120 # seconds — minimum wait after any submission
fresh_stale_timeout = 10800 # seconds (3 hours) — max time to trust fresh tasks' scheduler job
alongd (Member)

This is hard-coded. On long queues it might be too aggressive, and on fast queues it might be too conservative.

calvinp0 (Member Author)

Yes, it is hard-coded because right now ARC has trouble determining whether a job in a batch is queued or is stuck because of a worker issue (which is why it could be stuck in pending). I think this is a limitation of Pipe: it only reads the JSON file on the hard disk and does not actually check the queue itself.
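
As a hypothetical illustration of that limitation: Pipe's view of a batch is whatever the task JSON on disk says, and nothing below polls the scheduler. The file schema and count_task_states are assumed for the example.

import json
from collections import Counter

def count_task_states(json_path: str) -> Counter:
    """Summarize task states from the on-disk pipe JSON (schema assumed)."""
    with open(json_path) as f:
        tasks = json.load(f)  # e.g., [{'id': 3, 'state': 'pending'}, ...]
    return Counter(task['state'] for task in tasks)

# A task whose state is 'pending' may be queued on the scheduler or stuck
# behind a worker problem; from the JSON alone the two are indistinguishable.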

calvinp0 (Member Author)

I am going to have to change the logic here altogether because of this limitation. We will have to assume that PENDING in the task JSON indicates that the job in the batch is queued; there appears to be no other way unless we poll the actual scheduler itself. Otherwise, ARC will assume a PENDING job (that is actually queued) needs resubmission, and it will do this forever, or at least until the jobs are running on the queue.

@calvinp0 calvinp0 force-pushed the edge_case_ts branch 2 times, most recently from c704f1a to 014c276 on April 10, 2026 14:27
Workers remaining in the scheduler queue will eventually claim retried pending tasks when they start. Removing automatic resubmission prevents duplicate job submissions and ensures that manual intervention is required if a scheduler job is prematurely terminated or killed.
…oting

Prevents ARC from incorrectly concluding that no TS guess converged when the final running conformer fails troubleshooting. The scheduler now checks if other conformer jobs are still in flight before breaking, ensuring that the completion check—and thus the evaluation of previously successful conformers—is triggered even if the last job fails to converge. Includes a refactor of conformer job identification logic in arc/checks/common.py.
Updates the scheduler pipe tests to verify that Pipe runs no longer automatically resubmit jobs when retried tasks are present. This aligns the test suite with the policy that workers remaining in the scheduler queue handle retries, preventing duplicate job submissions.
@alongd alongd (Member) left a comment

Thanks!

@alongd alongd merged commit 5c50f88 into main Apr 11, 2026
8 checks passed
@alongd alongd deleted the edge_case_ts branch April 11, 2026 06:58