Possible k8s OOM kill prevention pill 1 (--max-old-space-size) #1368
taylordowns2000 wants to merge 1 commit into main
Interesting - let me look into this. Thanks!

What I need to look into here: the engine relies on the inner worker thread blowing up so that the child process can catch the OOM error and feed it back to the parent. If we're now constraining the child process itself, who catches the OOM error that it throws? We might just need a bit of extra logic to support that.

I also want to re-validate the resourceLimits option we're using. If it's not robust (and the AI seems to think it's not), should we be using it at all? Maybe I should move all the OOM handling into the child process. I don't really like that we have two different strategies for trapping memory issues.

I'm also a bit worried that capping the old space size might not actually help with the kind of synchronous recursive functions that break us. We may need to look more at the young generation of the heap.

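For reference, here is a minimal sketch of the thread-level strategy described in the comment above, assuming the inner thread is a plain Node `Worker` created with `resourceLimits`. The names `runWithLimit`, `isWorkerOom`, and `RunResult` are hypothetical and are not taken from the engine; the only Node-specific facts relied on are the `resourceLimits` option names and the `ERR_WORKER_OUT_OF_MEMORY` error code. Whether `maxYoungGenerationSizeMb` helps with the recursive cases is an open question, not something this sketch demonstrates.

```typescript
import { Worker } from 'node:worker_threads';

// Hypothetical result shape for illustration only; not the engine's real types.
type RunResult =
  | { ok: true }
  | { ok: false; reason: 'oom' | 'error'; message: string };

// Error code Node attaches when a worker thread hits its resourceLimits.
const WORKER_OOM_CODE = 'ERR_WORKER_OUT_OF_MEMORY';

// True when the value looks like the error emitted for a worker-thread OOM.
function isWorkerOom(err: unknown): boolean {
  return (
    typeof err === 'object' &&
    err !== null &&
    (err as { code?: unknown }).code === WORKER_OOM_CODE
  );
}

// Run a string of plain JavaScript in a worker thread with heap caps and
// report whether it finished, hit the memory limit, or failed another way.
function runWithLimit(
  source: string,
  maxOldMb: number,
  maxYoungMb: number = 16
): Promise<RunResult> {
  return new Promise((resolve) => {
    const worker = new Worker(source, {
      eval: true,
      resourceLimits: {
        maxOldGenerationSizeMb: maxOldMb,
        maxYoungGenerationSizeMb: maxYoungMb,
      },
    });
    // 'error' fires before 'exit'; later resolve() calls are no-ops.
    worker.once('error', (err: unknown) => {
      resolve({
        ok: false,
        reason: isWorkerOom(err) ? 'oom' : 'error',
        message: err instanceof Error ? err.message : String(err),
      });
    });
    worker.once('exit', (code: number) => {
      resolve(
        code === 0
          ? { ok: true }
          : { ok: false, reason: 'error', message: `exit code ${code}` }
      );
    });
  });
}
```

The point of the sketch is that the host of the thread sees a catchable `error` event. Once the limit moves to the child process itself, there is no equivalent event inside that process, so the parent would have to infer the failure from how the child exits.
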
My recommendation is to get a build up on staging and be data-driven about it. (I'm relying on observation because I don't understand the internals.)
Right now I can crash the worker reliably, at the click of a button.
Let's put this on staging, click that button, and see if it crashes.
Taylor Downs
(Sent by email on Tue, Apr 14, 2026 at 07:54, in reply to the comment above by Joe Clark, josephjclark, on OpenFn/kit#1368.)

Adds `--max-old-space-size` to child processes forked by the worker pool.

Currently, memory limits only apply at the Worker Thread level via V8's `resourceLimits`. The child process itself runs with V8's default heap limit (~1.5-4GB). During rapid allocation spikes (e.g., an infinite loop pushing to an array), this means the child process can grow far beyond the intended run memory limit before V8 kills the thread.

This change passes `--max-old-space-size` to the child process, reducing the ceiling from V8's default down to `WORKER_MAX_RUN_MEMORY_MB` (default 500MB). This makes V8 garbage-collect more aggressively at the process level and trigger OOM earlier.

This does not guarantee protection against k8s OOM kills: RSS can still transiently exceed the V8 heap limit during rapid allocation. But it significantly reduces the headroom from gigabytes to something much closer to the configured limit. A hard guarantee would require kernel-level enforcement via cgroups (future work).

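As a rough illustration of the change described above (not the actual diff), the sketch below shows one way to pass the flag when forking. It assumes the limit is read from the `WORKER_MAX_RUN_MEMORY_MB` environment variable with the 500MB default; the helper names `runMemoryMb`, `heapLimitArgs`, and `forkWithHeapLimit`, and the `childScript` parameter, are hypothetical. How the child actually terminates when it exceeds the limit should be confirmed on staging; the sketch only logs the exit code and signal.

```typescript
import { fork } from 'node:child_process';

// Default taken from the PR description.
const DEFAULT_RUN_MEMORY_MB = 500;

// Read WORKER_MAX_RUN_MEMORY_MB, falling back to the default when it is
// unset, not a number, or not positive.
function runMemoryMb(env: Record<string, string | undefined>): number {
  const parsed = Number.parseInt(env.WORKER_MAX_RUN_MEMORY_MB ?? '', 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_RUN_MEMORY_MB;
}

// Build the V8 flag that caps the old-generation heap of a Node process.
function heapLimitArgs(limitMb: number): string[] {
  return [`--max-old-space-size=${limitMb}`];
}

// Fork a child with the heap cap applied. `childScript` is a hypothetical path.
function forkWithHeapLimit(
  childScript: string,
  env: Record<string, string | undefined> = process.env
) {
  const child = fork(childScript, [], {
    // Setting execArgv replaces the inherited flags, so keep the parent's.
    execArgv: [...process.execArgv, ...heapLimitArgs(runMemoryMb(env))],
  });
  // With the limit on the process there is no in-process error to catch,
  // so the parent can only observe how the child exited.
  child.once('exit', (code, signal) => {
    if (code !== 0) {
      console.error(`child exited abnormally (code=${code}, signal=${signal})`);
    }
  });
  return child;
}
```

Inside the child, `v8.getHeapStatistics().heap_size_limit` can be logged to check that the flag took effect before running the crash test on staging.
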
AI Usage
Please disclose whether you've used AI anywhere in this PR (it's cool, we just
want to know!):
You can read more details in our
Responsible AI Policy
Release branch checklist
Delete this section if this is not a release PR.
If this IS a release branch:
- `pnpm changeset version` from root to bump versions
- `pnpm install`
- `pnpm changeset tag` to generate tags
- `git push --tags`

Tags may need updating if commits come in after the tags are first generated.