[9.0] fix: disable watchdog wallclock check for remote executions #8180

Merged
fstagni merged 1 commit into DIRACGrid:integration from aldbr:main_FIX_disable-watchdog-wallclock-check
Apr 25, 2025

aldbr (Contributor) commented Apr 25, 2025

For some reason, payloads executed through the PushJobAgent are quickly killed by the Watchdogs because their CPU consumption is far below the wallclock time. This low ratio is expected, as the real work is done on a remote worker node.

I don't really understand why we didn't have such issues in v8.0, but the following patch should solve the problem. I am going to test it.

BEGINRELEASENOTES
*WorkloadManagement
FIX: disable watchdog wallclock check for remote executions
ENDRELEASENOTES

@aldbr aldbr marked this pull request as ready for review April 25, 2025 10:08
@aldbr aldbr requested review from atsareg and fstagni as code owners April 25, 2025 10:08
aldbr (Contributor, Author) commented Apr 25, 2025

It seems to work. I can see the following log in the hotfixed PushJobAgent:

N.B. job would be declared as stalled but CPU / WallClock check is disabled by payload
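
For readers unfamiliar with this kind of check, here is a minimal sketch of the idea, not the actual DIRAC Watchdog code. The function name `evaluate_cpu_wallclock`, its parameters, and the 5% default threshold are all hypothetical and chosen for illustration. Only the "disabled by payload" message is taken from the log above.

```python
def evaluate_cpu_wallclock(cpu_seconds, wallclock_seconds,
                           min_ratio_percent=5.0, check_enabled=True):
    """Illustrative stall check; all names and defaults here are assumptions.

    Returns (ok, message). ok is False only when the CPU / wallclock ratio
    is below the threshold AND the check is enabled.
    """
    if wallclock_seconds <= 0:
        # Nothing to compare against yet.
        return True, "no wallclock time accumulated yet"

    ratio = 100.0 * cpu_seconds / wallclock_seconds
    if ratio >= min_ratio_percent:
        return True, f"CPU / WallClock ratio is {ratio:.1f}%"

    if not check_enabled:
        # Remote execution: local CPU usage is near zero by design,
        # so the low ratio is reported but does not kill the payload.
        return True, ("N.B. job would be declared as stalled but "
                      "CPU / WallClock check is disabled by payload")

    return False, (f"job stalled: CPU / WallClock ratio {ratio:.1f}% "
                   f"is below {min_ratio_percent:.1f}%")


# A payload that only waits on a remote worker node uses almost no local CPU.
print(evaluate_cpu_wallclock(1, 1000, check_enabled=True))
print(evaluate_cpu_wallclock(1, 1000, check_enabled=False))
```

With the check enabled, the first call reports the payload as stalled. With it disabled, the second call lets the payload continue and only emits the informational message.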

@fstagni fstagni merged commit 10ef53a into DIRACGrid:integration Apr 25, 2025
23 checks passed
@DIRACGridBot DIRACGridBot added the sweep:ignore Prevent sweeping from being ran for this PR label Apr 25, 2025