Skip to content

fix(volumes): broadcast volume directory creation to all cluster nodes#2595

Open
AdaAibaby wants to merge 2 commits intoe2b-dev:mainfrom
AdaAibaby:fix-volumes
Open

fix(volumes): broadcast volume directory creation to all cluster nodes#2595
AdaAibaby wants to merge 2 commits intoe2b-dev:mainfrom
AdaAibaby:fix-volumes

Conversation

@AdaAibaby
Copy link
Copy Markdown

Problem

When a volume is created via POST /volumes, the directory is only
created on a single randomly-selected orchestrator node
(executeOnOrchestratorByClusterID uses rand.Shuffle and returns on
the first success). Sandbox placement (PlaceSandbox) is independent
and selects nodes based on CPU/RAM availability, with no awareness of
which node holds the volume directory.

This causes an intermittent failure: sandboxes that land on the same
node as the volume get a working NFS mount; sandboxes that land on any
other node fail to mount because the NFS proxy calls
syscall.Mount(path, path, MS_BIND) on a path that does not exist on
that node.

A second, related bug exists in envd/internal/api/init.go: when two
concurrent /init requests race on isMountingNFS, the second request
silently returns nil, causing the sandbox to start successfully with
no NFS mount.

Changes

volume_util.go

  • Add executeOnAllClusterNodes: fans out a gRPC call concurrently to
    every ready node in the cluster using errgroup, returning a combined
    error if any node fails.

volume_create.go

  • createVolume now calls executeOnAllClusterNodes instead of
    executeOnOrchestratorByClusterID, so the volume directory is
    created on every orchestrator node at creation time.

volume_delete.go

  • deleteVolume likewise broadcasts to all nodes to clean up the
    directory everywhere it was created.

init.go

  • setupNFS: when isMountingNFS CAS fails (concurrent request),
    return an explicit error instead of nil so the caller receives a
    400 and can retry, rather than silently succeeding with no mount.

Testing

  • go build ./packages/api/... ./packages/envd/... packages.
  • go vet ./packages/api/internal/handlers/... internal.
  • go test internal.
  • Manual: on a staging cluster with ≥2 orchestrator nodes, creating a
    volume and then creating 10 sandboxes with that volume mount all
    succeed with a valid NFS mount (mount | grep nfs non-empty in every
    sandbox).

@cla-bot
Copy link
Copy Markdown

cla-bot Bot commented May 8, 2026

We require contributors to sign our Contributor License Agreement, and we don't have @adababys on file. You can sign our CLA at https://e2b.dev/docs/cla . Once you've signed, post a comment here that says '@cla-bot check'

@qodo-code-review
Copy link
Copy Markdown

ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one.

@AdaAibaby
Copy link
Copy Markdown
Author

@cla-bot check

@cla-bot check

@cla-bot
Copy link
Copy Markdown

cla-bot Bot commented May 8, 2026

We require contributors to sign our Contributor License Agreement, and we don't have @adababys on file. You can sign our CLA at https://e2b.dev/docs/cla . Once you've signed, post a comment here that says '@cla-bot check'

@cla-bot
Copy link
Copy Markdown

cla-bot Bot commented May 8, 2026

The cla-bot has been summoned, and re-checked this pull request!

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bce5e5094d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread packages/api/internal/handlers/volume_util.go
return ErrClusterNotFound
}

wg, wgCtx := errgroup.WithContext(ctx)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Do not cancel remaining node RPCs on first failure

Using errgroup.WithContext here cancels wgCtx as soon as one node returns an error, so other in-flight or not-yet-started node RPCs can be aborted before they execute. That breaks the intended “run on all nodes” behavior and can leave partial state (for example, deleteVolume may skip cleanup on healthy nodes after one early failure). Run goroutines under a non-canceling group and aggregate errors after all node attempts complete.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The executeOnAllClusterNodes function incorrectly uses errgroup.WithContext, which cancels all operations on the first error, preventing full cleanup during deletions, and returns success when no nodes are ready.

Comment thread packages/api/internal/handlers/volume_util.go Outdated
@cla-bot
Copy link
Copy Markdown

cla-bot Bot commented May 8, 2026

We require contributors to sign our Contributor License Agreement, and we don't have @adababys on file. You can sign our CLA at https://e2b.dev/docs/cla . Once you've signed, post a comment here that says '@cla-bot check'

@adababys
Copy link
Copy Markdown

adababys commented May 8, 2026

We require contributors to sign our Contributor License Agreement, and we don't have @adababys on file. You can sign our CLA at https://e2b.dev/docs/cla . Once you've signed, post a comment here that says '@cla-bot check'

@cla-bot check

@cla-bot cla-bot Bot added the cla-signed label May 8, 2026
@cla-bot
Copy link
Copy Markdown

cla-bot Bot commented May 8, 2026

The cla-bot has been summoned, and re-checked this pull request!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants