fix(volumes): broadcast volume directory creation to all cluster nodes #2595
AdaAibaby wants to merge 2 commits into e2b-dev:main from
Conversation
We require contributors to sign our Contributor License Agreement, and we don't have @adababys on file. You can sign our CLA at https://e2b.dev/docs/cla . Once you've signed, post a comment here that says '@cla-bot check'
ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one.
We require contributors to sign our Contributor License Agreement, and we don't have @adababys on file. You can sign our CLA at https://e2b.dev/docs/cla . Once you've signed, post a comment here that says '@cla-bot check'
The cla-bot has been summoned, and re-checked this pull request!
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bce5e5094d
```go
	return ErrClusterNotFound
}

wg, wgCtx := errgroup.WithContext(ctx)
```
Do not cancel remaining node RPCs on first failure
Using errgroup.WithContext here cancels wgCtx as soon as one node returns an error, so other in-flight or not-yet-started node RPCs can be aborted before they execute. That breaks the intended “run on all nodes” behavior and can leave partial state (for example, deleteVolume may skip cleanup on healthy nodes after one early failure). Run goroutines under a non-canceling group and aggregate errors after all node attempts complete.
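A minimal sketch of the pattern the review comment asks for, with placeholder names (broadcastAllNodes, node, fn are illustrative, not the repository's identifiers): every node RPC runs to completion against the parent context, and failures are aggregated only after all attempts finish, so one failing node never cancels the others.

```go
package handlers

import (
	"context"
	"errors"
	"sync"
)

// node stands in for the real cluster node client type.
type node struct {
	ID string
}

// broadcastAllNodes runs fn against every node and waits for all attempts
// to finish before reporting. Errors are aggregated with errors.Join, and
// the parent ctx is passed through unchanged, so a failure on one node does
// not cancel the RPCs still running on the others.
func broadcastAllNodes(ctx context.Context, nodes []node, fn func(context.Context, node) error) error {
	var (
		mu   sync.Mutex
		errs []error
		wg   sync.WaitGroup
	)
	for _, n := range nodes {
		wg.Add(1)
		go func(n node) {
			defer wg.Done()
			if err := fn(ctx, n); err != nil {
				mu.Lock()
				errs = append(errs, err)
				mu.Unlock()
			}
		}(n)
	}
	wg.Wait()
	return errors.Join(errs...)
}
```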
We require contributors to sign our Contributor License Agreement, and we don't have @adababys on file. You can sign our CLA at https://e2b.dev/docs/cla . Once you've signed, post a comment here that says '@cla-bot check'
@cla-bot check
The cla-bot has been summoned, and re-checked this pull request!
Problem
When a volume is created via POST /volumes, the directory is only created on a single randomly-selected orchestrator node (executeOnOrchestratorByClusterID uses rand.Shuffle and returns on the first success). Sandbox placement (PlaceSandbox) is independent and selects nodes based on CPU/RAM availability, with no awareness of which node holds the volume directory.
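For illustration only, a rough sketch of the single-node behavior described above, using hypothetical names (executeOnOneRandomNode, call); the real helper differs, but the shape is the same: shuffle the ready nodes and stop at the first success, so only one node ever receives the directory-creation call.

```go
package handlers

import (
	"context"
	"math/rand"
)

// executeOnOneRandomNode sketches the pre-fix behavior: try nodes in random
// order and return as soon as one call succeeds, leaving every other node
// without the volume directory.
func executeOnOneRandomNode(ctx context.Context, nodeIDs []string, call func(ctx context.Context, nodeID string) error) error {
	rand.Shuffle(len(nodeIDs), func(i, j int) { nodeIDs[i], nodeIDs[j] = nodeIDs[j], nodeIDs[i] })
	var lastErr error
	for _, id := range nodeIDs {
		err := call(ctx, id)
		if err == nil {
			return nil // first success wins; the remaining nodes are never asked
		}
		lastErr = err
	}
	return lastErr
}
```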
This causes an intermittent failure: sandboxes that land on the same
node as the volume get a working NFS mount; sandboxes that land on any
other node fail to mount because the NFS proxy calls
syscall.Mount(path, path, MS_BIND) on a path that does not exist on
that node.
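A minimal sketch of the bind mount described above, wrapped in a hypothetical helper name (bindMount); on a node where the volume directory was never created, the call fails with ENOENT and the sandbox cannot attach the volume.

```go
package nfsproxy

import (
	"fmt"
	"syscall"
)

// bindMount re-mounts path onto itself with MS_BIND, mirroring the call the
// NFS proxy makes. It can only succeed if path already exists on this node.
func bindMount(path string) error {
	if err := syscall.Mount(path, path, "", syscall.MS_BIND, ""); err != nil {
		return fmt.Errorf("bind mount %q: %w", path, err)
	}
	return nil
}
```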
A second, related bug exists in envd/internal/api/init.go: when two concurrent /init requests race on isMountingNFS, the second request silently returns nil, causing the sandbox to start successfully with no NFS mount.
Changes
volume_util.go
executeOnAllClusterNodes: fans out a gRPC call concurrently to every ready node in the cluster using errgroup, returning a combined error if any node fails.
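A hedged sketch of what such a helper can look like, assuming a hypothetical nodeClient interface and method name (CreateVolumeDir); the real signature in volume_util.go may differ. Note that errgroup.Wait returns the first non-nil error and, as the review comment above points out, the derived wgCtx is cancelled on that first failure.

```go
package handlers

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// nodeClient is a stand-in for the per-node orchestrator gRPC client.
type nodeClient interface {
	CreateVolumeDir(ctx context.Context, volumeID string) error
}

// executeOnAllClusterNodes fans the same call out to every ready node and
// returns a combined error if any node fails (illustrative sketch of the
// PR's helper, not its exact code).
func executeOnAllClusterNodes(ctx context.Context, nodes map[string]nodeClient, volumeID string) error {
	wg, wgCtx := errgroup.WithContext(ctx)
	for nodeID, client := range nodes {
		nodeID, client := nodeID, client
		wg.Go(func() error {
			if err := client.CreateVolumeDir(wgCtx, volumeID); err != nil {
				return fmt.Errorf("node %s: %w", nodeID, err)
			}
			return nil
		})
	}
	// Wait reports the first error; wgCtx has been cancelled by then,
	// which is the behavior the review comment above flags.
	return wg.Wait()
}
```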
volume_create.go
createVolume now calls executeOnAllClusterNodes instead of executeOnOrchestratorByClusterID, so the volume directory is created on every orchestrator node at creation time.
volume_delete.go
deleteVolume likewise broadcasts to all nodes to clean up the directory everywhere it was created.
init.go
setupNFS: when the isMountingNFS CAS fails (concurrent request), return an explicit error instead of nil so the caller receives a 400 and can retry, rather than silently succeeding with no mount.
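A minimal sketch of the fixed guard, assuming isMountingNFS is an atomic.Bool on the server struct (field and function names are illustrative): a failed CAS now surfaces an explicit error instead of nil, so the handler can answer with a 400 and the client can retry once the first mount attempt finishes.

```go
package api

import (
	"errors"
	"sync/atomic"
)

var errNFSMountInProgress = errors.New("another /init request is already mounting NFS")

type server struct {
	isMountingNFS atomic.Bool
}

// setupNFS returns an explicit error when a concurrent request already holds
// the mount flag, instead of silently returning nil with no mount in place.
func (s *server) setupNFS(mount func() error) error {
	if !s.isMountingNFS.CompareAndSwap(false, true) {
		return errNFSMountInProgress
	}
	defer s.isMountingNFS.Store(false)
	return mount()
}
```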
Testing
- go build ./packages/api/... ./packages/envd/... ✅
- go vet ./packages/api/internal/handlers/... ✅
- go test ✅
- Manual check: creating a volume and then creating 10 sandboxes with that volume mount all succeed with a valid NFS mount (mount | grep nfs is non-empty in every sandbox).