Support memory boost and prmon for build task#103
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #103 +/- ##
==========================================
- Coverage 35.01% 34.88% -0.14%
==========================================
Files 13 13
Lines 1308 1313 +5
Branches 214 214
==========================================
Hits 458 458
- Misses 830 835 +5
Partials 20 20 ☔ View full report in Codecov by Sentry. |
29a9ffe to
f732a05
Compare
MichelleGower
left a comment
There was a problem hiding this comment.
Looks good. One question. Merge approved.
| code = e.errno | ||
| if abs(code) == 9: | ||
| print(f"BPS prepare caught exception: {e.strerror}") | ||
| sys.exit(137) |
There was a problem hiding this comment.
Do you want to do this more generically for all signals instead of just signal 9? For example something like:
if code < 0:
print(f"BPS prepare caught exception: {e.strerror}")
sys.exit(128 + abs(code))
There was a problem hiding this comment.
Yesterday I saw there is build task failed at LANCS because the pipetask graph command got wrong option. Then the exit code is 2. For all failed build task for OOM, I've only see -9. Not sure if we should generalize all the exit codes to be retried with boosted memory. Waiting for both reviewers's suggestions.
There was a problem hiding this comment.
maybe it's good to do it more generally for all signals, instead of -9.
There was a problem hiding this comment.
Should all exit codes become 137 or as Michelle suggested 128+abs(exit_code)?
Rubin workflow is getting bigger and bigger, so the build task is usually start with very high initial memory. When the job exit with code 1 or 2 for payload issue other than OOM, the retry memory will be bumped too high.
| code = e.errno | ||
| if abs(code) == 9: | ||
| print(f"BPS prepare caught exception: {e.strerror}") | ||
| sys.exit(137) |
There was a problem hiding this comment.
maybe it's good to do it more generally for all signals, instead of -9.
| task_rss = get_task_parameter(config, remote_build, "requestMemory") | ||
| memory_multiplier = get_task_parameter(config, remote_build, "memoryMultiplier") | ||
| task_rss_retry_step = task_rss * memory_multiplier if memory_multiplier else 0 | ||
| task_rss_retry_offset = 0 if task_rss_retry_step else task_rss |
There was a problem hiding this comment.
add task_rss_max, to limit the top memory. It will be supported.
There was a problem hiding this comment.
I'll add this. I asked in the jira ticket whether this parameter will this break production jobs if it's not in iDDS yet?
c4b4ed7 to
7f74328
Compare
7f74328 to
395fc2f
Compare
Checklist
doc/changes