Problem
If you set WORKER_REPO_DIR to a path on a shared volume (e.g. an NFS mount, k8s PVC,
or a shared Docker volume), the intent is obvious: persist the adaptor
cache so workers don't re-download everything on every restart, and share it
across multiple workers so we're not duplicating gigabytes per pod.
This almost works today. A single worker pointed at a persistent volume
behaves correctly — restarts are fast, no redundant installs. But the moment
you point more than one worker at the same volume, things go sideways:
The in-memory install queue in engine-multi only serialises installs within
one process. Two workers seeing the same uninstalled adaptor will both shell
out to npm install against the same node_modules/ and package.json.
npm has no awareness of other npm processes writing the same tree. The result:
a corrupt node_modules, a half-written package.json, and runs failing with
module resolution errors that are a pain to diagnose (see Autoinstall is breaking ? #503).
What we want
Multiple workers should be able to share one repo directory safely. The common
case (adaptor already installed) should stay fast — no coordination overhead
when there's nothing to do. The other common case, which should only happen once
(something needs installing), should end up with a single install across all the
workers sharing the volume, with the others waiting.
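As a rough illustration of the shape this could take, here is a minimal sketch of a filesystem-only "install once" guard. Everything in it is an assumption made for illustration: the `installOnce` name, the `.installed/` marker files, the `.locks/` directory, and the `install` callback standing in for whatever actually shells out to npm. It relies on a non-recursive `mkdir` failing with `EEXIST` when the directory already exists, which is atomic on local POSIX filesystems and is commonly relied on over NFS, though that would need verifying on the volumes we care about.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical sketch: none of these names come from the worker codebase.
type InstallResult = "already-installed" | "installed" | "busy";

function installOnce(
  repoDir: string,
  adaptor: string,
  install: () => void, // stand-in for the real npm install step
): InstallResult {
  const key = encodeURIComponent(adaptor);
  const marker = path.join(repoDir, ".installed", key); // assumed "done" marker
  const lock = path.join(repoDir, ".locks", key); // assumed lock directory

  // Fast path: already installed, so no lock is ever touched.
  if (fs.existsSync(marker)) return "already-installed";

  fs.mkdirSync(path.dirname(lock), { recursive: true });
  try {
    // Non-recursive mkdir throws EEXIST if another worker holds the lock.
    fs.mkdirSync(lock);
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "EEXIST") return "busy";
    throw err;
  }

  try {
    // Re-check under the lock: another worker may have finished in between.
    if (fs.existsSync(marker)) return "already-installed";
    install();
    fs.mkdirSync(path.dirname(marker), { recursive: true });
    fs.writeFileSync(marker, new Date().toISOString());
    return "installed";
  } finally {
    // Always release, even if install() throws (marker is not written then).
    fs.rmdirSync(lock);
  }
}
```

A caller that gets "busy" back would sleep briefly and call again until it sees "already-installed"; that retry loop is the "others waiting" part. The sketch deliberately does nothing about locks left behind by a worker that died while holding one.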
Crash safety matters: if a worker dies mid-install, the next one shouldn't be
stuck forever waiting on a ghost.
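One possible way to avoid waiting on a ghost, sketched under assumptions: treat the lock as a directory whose mtime acts as a heartbeat, and let a waiting worker reclaim it once the heartbeat is older than some threshold. The helper names (`heartbeat`, `isStale`, `reclaimIfStale`), the mtime scheme, and the tombstone naming are all hypothetical and not taken from the existing code.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";

// Hypothetical helpers; the mtime-as-heartbeat scheme is an assumption.

/** The lock holder calls this periodically while the install is running. */
function heartbeat(lockDir: string): void {
  const now = new Date();
  fs.utimesSync(lockDir, now, now);
}

/** True if the lock exists and was last touched more than staleMs ago. */
function isStale(lockDir: string, staleMs: number, nowMs: number = Date.now()): boolean {
  try {
    return nowMs - fs.statSync(lockDir).mtimeMs > staleMs;
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "ENOENT") return false; // no lock
    throw err;
  }
}

/** Removes a stale lock. Returns true only for the worker that removed it. */
function reclaimIfStale(lockDir: string, staleMs: number): boolean {
  if (!isStale(lockDir, staleMs)) return false;
  // Rename to a name unique to this host and process, so that when several
  // waiters race to reclaim, only one rename succeeds.
  const tomb = `${lockDir}.stale-${os.hostname()}-${process.pid}-${Date.now()}`;
  try {
    fs.renameSync(lockDir, tomb);
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "ENOENT") return false; // lost the race
    throw err;
  }
  fs.rmSync(tomb, { recursive: true, force: true });
  return true;
}
```

Two caveats with this sketch. There is still a small window between the staleness check and the rename in which a different worker could have taken a fresh lock at the same path, so a real implementation would need something stronger (for example an owner token verified after the rename). And on a volume shared across hosts, clock skew affects mtime comparisons, so the threshold would need to be generous relative to the heartbeat interval.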
Out of scope
Anything that requires external coordination (Redis, etcd, etc.). Keep it
filesystem-only so it works wherever a shared volume works.
Changing the CLI's behaviour.
Garbage-collecting old adaptor versions from the cache.
Related
one way to take pressure off pod-local disk, so this work helps there too.
multi-process / multi-host version of that.