Skip to content

fix(orch): P2P reader race on peer→storage transition#2627

Closed
levb wants to merge 3 commits into
mainfrom
lev-fix-peer-transition
Closed

fix(orch): P2P reader race on peer→storage transition#2627
levb wants to merge 3 commits into
mainfrom
lev-fix-peer-transition

Conversation

@levb

@levb levb commented May 11, 2026

Copy link
Copy Markdown
Contributor

When a peer signaled UseStorage to indicate its upload finished, the current reader saw uploaded=true and routed to base storage, but concurrent sibling reads on the same File could fall through with no FrameTable, attempting to decode the wrong bytes.

The peer now ships the final V4 header inline on its UseStorage response (new header_bytes field on PeerAvailability).
Client side, the resolver's per-buildID state holds the parsed header. build.File.installPendingHeader polls the state at the top of every Slice/ReadAt iteration and CAS-installs the new header atomically. Concurrent readers either install the same value or observe it already installed — no coordination primitive needed, no GCS poll.

When a peer signaled UseStorage to indicate its upload finished, the
current reader saw `uploaded=true` and routed to base storage, but
concurrent sibling reads on the same File could fall through with a
stale FrameTable — V4 chunk offsets shift between the pre-finalize
in-memory header and the post-finalize GCS header, so reads decoded
the wrong bytes.

The peer now ships the final V4 header inline on its UseStorage
response (new `header_bytes` field on `PeerAvailability`). The bytes
are the same `[]byte` produced once by `header.StoreHeader` during
upload — captured on `*Upload` and forwarded into the server's
`uploadedBuilds` cache, no second serialization, no failure path.

Client side, the resolver's per-buildID state holds the parsed header.
`build.File.installPendingHeader` polls the state at the top of every
Slice/ReadAt iteration and CAS-installs the new header atomically.
Concurrent readers either install the same value or observe it
already installed — no coordination primitive needed, no GCS poll.

Also: mid-stream peer abort (NotAvailable from cache eviction) now
surfaces as `storage.ErrPeerAborted` so the chunker reopens via base
on a single retry, instead of silently truncating the stream.
@codecov

codecov Bot commented May 11, 2026

Copy link
Copy Markdown

❌ 9 Tests Failed:

Tests completed Failed Passed Skipped
2612 9 2603 5
View the full list of 10 ❄️ flaky test(s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/metrics::TestTeamMetrics

Flake rate in main: 69.70% (Passed 123 times, Failed 283 times)

Stack Traces | 0.75s run time
=== RUN   TestTeamMetrics
=== PAUSE TestTeamMetrics
=== CONT  TestTeamMetrics
    team_metrics_test.go:61: 
        	Error Trace:	.../api/metrics/team_metrics_test.go:61
        	Error:      	Should be true
        	Test:       	TestTeamMetrics
        	Messages:   	MaxConcurrentSandboxes should be >= 0
--- FAIL: TestTeamMetrics (0.75s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig

Flake rate in main: 76.79% (Passed 130 times, Failed 430 times)

Stack Traces | 53s run time
=== RUN   TestUpdateNetworkConfig
=== PAUSE TestUpdateNetworkConfig
=== CONT  TestUpdateNetworkConfig
--- FAIL: TestUpdateNetworkConfig (53.04s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false

Flake rate in main: 77.23% (Passed 125 times, Failed 424 times)

Stack Traces | 5.66s run time
=== RUN   TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
Executing command curl in sandbox il4lyce61as6tz1outj6p
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1367}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35  exited:true  status:"exit status 35"  error:"exit status 35"}}
Executing command curl in sandbox i888gxc9so1tr9yra4mvh
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1368}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35  exited:true  status:"exit status 35"  error:"exit status 35"}}
Executing command curl in sandbox iw2syy88blkw2g9kch75m
    sandbox_network_update_test.go:391: Command [curl] output: event:{start:{pid:1369}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{data:{stdout:"HTTP/2 302 \r\nx-content-type-options: nosniff\r\nlocation: https://dns.google/\r\ndate: Mon, 11 May 2026 19:43:49 GMT\r\ncontent-type: text/html; charset=UTF-8\r\nserver: HTTP server (unknown)\r\ncontent-length: 216\r\nx-xss-protection: 0\r\nx-frame-options: SAMEORIGIN\r\nalt-svc: h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000\r\n\r\n"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{end:{exited:true  status:"exit status 0"}}
    sandbox_network_update_test.go:391: Command [curl] completed successfully in sandbox il4lyce61as6tz1outj6p
    sandbox_network_update_test.go:391: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:74
        	            				.../api/sandboxes/sandbox_network_update_test.go:60
        	            				.../api/sandboxes/sandbox_network_update_test.go:391
        	Error:      	An error is expected but got nil.
        	Test:       	TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
        	Messages:   	https://8.8.8.8 should be blocked
--- FAIL: TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false (5.66s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost

Flake rate in main: 57.84% (Passed 223 times, Failed 306 times)

Stack Traces | 0s run time
=== RUN   TestBindLocalhost
=== PAUSE TestBindLocalhost
=== CONT  TestBindLocalhost
--- FAIL: TestBindLocalhost (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_0_0_0_0

Flake rate in main: 63.99% (Passed 121 times, Failed 215 times)

Stack Traces | 8.81s run time
=== RUN   TestBindLocalhost/bind_0_0_0_0
=== PAUSE TestBindLocalhost/bind_0_0_0_0
=== CONT  TestBindLocalhost/bind_0_0_0_0
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1253}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_0_0_0_0
        	Messages:   	Unexpected status code 502 for bind address 0.0.0.0
Executing command python in sandbox iyrpxke3jme01x736x4tz
--- FAIL: TestBindLocalhost/bind_0_0_0_0 (8.81s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_127_0_0_1

Flake rate in main: 58.72% (Passed 123 times, Failed 175 times)

Stack Traces | 7.33s run time
=== RUN   TestBindLocalhost/bind_127_0_0_1
=== PAUSE TestBindLocalhost/bind_127_0_0_1
=== CONT  TestBindLocalhost/bind_127_0_0_1
Executing command python in sandbox ibgjszbjoqdnq29mc9zkd
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1269}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_127_0_0_1
        	Messages:   	Unexpected status code 502 for bind address 127.0.0.1
--- FAIL: TestBindLocalhost/bind_127_0_0_1 (7.33s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_::1

Flake rate in main: 65.72% (Passed 121 times, Failed 232 times)

Stack Traces | 7.83s run time
=== RUN   TestBindLocalhost/bind_::1
=== PAUSE TestBindLocalhost/bind_::1
=== CONT  TestBindLocalhost/bind_::1
Executing command python in sandbox iklmja4gr5416op9m0lfz
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1251}}
Executing command python in sandbox i37fbz9acag8b81gln3kj
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_::1
        	Messages:   	Unexpected status code 502 for bind address ::1
Executing command python in sandbox iwomog3mpvzs5wptjh4j4
--- FAIL: TestBindLocalhost/bind_::1 (7.83s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_localhost

Flake rate in main: 65.82% (Passed 121 times, Failed 233 times)

Stack Traces | 7.66s run time
=== RUN   TestBindLocalhost/bind_localhost
=== PAUSE TestBindLocalhost/bind_localhost
=== CONT  TestBindLocalhost/bind_localhost
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1251}}
Executing command python in sandbox i8ksbywqgdw1btahr2v1p
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_localhost
        	Messages:   	Unexpected status code 502 for bind address localhost
--- FAIL: TestBindLocalhost/bind_localhost (7.66s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity

Flake rate in main: 66.41% (Passed 131 times, Failed 259 times)

Stack Traces | 75.1s run time
=== RUN   TestSandboxMemoryIntegrity
=== PAUSE TestSandboxMemoryIntegrity
=== CONT  TestSandboxMemoryIntegrity
    sandbox_memory_integrity_test.go:26: Build completed successfully
--- FAIL: TestSandboxMemoryIntegrity (75.14s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity/tmpfs_hash

Flake rate in main: 67.65% (Passed 121 times, Failed 253 times)

Stack Traces | 47.2s run time
=== RUN   TestSandboxMemoryIntegrity/tmpfs_hash
=== PAUSE TestSandboxMemoryIntegrity/tmpfs_hash
=== CONT  TestSandboxMemoryIntegrity/tmpfs_hash
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{start:{pid:1264}}
Executing command bash in sandbox i24nfc8adrsgef10d4ntk (user: root)
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Total memory: 985 MB\nUsed memory before tmpfs mount: 184 MB\nFree memory before tmpfs mount: 799 MB\nMemory to use in integrity test (80% of free, min 64MB): 639 MB\n"}}
Executing command bash in sandbox i24nfc8adrsgef10d4ntk (user: root)
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"639+0 records in\n639+0 records out\n670040064 bytes (670 MB, 639 MiB) copied, 16.643 s, 40.3 MB/s\n\tCommand being timed: \"dd if=/dev/urandom of=/mnt/testfile bs=1M count=639\"\n\tUser time (seconds): 0.00\n\tSystem time (seconds): 16.28\n\tPercent of C"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"PU this job got: 97%\n\tElapsed (wall clock) time (h:mm:ss or m:ss): 0:16.65\n\tAverage "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"shared text size ("}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"kbytes): 0\n\tAverage unshared data size (kbytes): 0\n\tAverage stack size (kbytes): 0\n\tAvera"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ge total size (kb"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ytes): 0\n\tMaxi"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"mum resident set"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" size (kbytes):"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" 2720\n\tAverage r"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"esident set siz"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"e (kbytes): 0\n\tMajor (requiring I/O) page faults: 3\n\tMinor (reclaiming a frame) pa"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ge faults: 345\n\tVol"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"untary context"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" switches: 4\n\tI"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"nvoluntary cont"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ext switches: "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"120\n\tSwaps: 0\n\tF"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ile system inputs: 176\n\tFile system outputs: 0\n\tSocket messages sent: 0\n\tSocket m"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"essages received"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:": 0\n\tSignals "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"delivered: 0\n\t"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"Page size (byt"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"es): 4096\n\tExit"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" status: 0\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory after tmpfs mount and file fill: 833 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:70: Command [bash] completed successfully in sandbox icd3oknbmtw1pogo8aa44
Executing command bash in sandbox icd3oknbmtw1pogo8aa44 (user: root)
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{start:{pid:1280}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{data:{stdout:"fdb0801b5bb1576348ab36546986e131e87de6d67ee428ee530cfbe49fa6d475\n"}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:74: Command [bash] completed successfully in sandbox icd3oknbmtw1pogo8aa44
Executing command bash in sandbox icd3oknbmtw1pogo8aa44 (user: root)
    sandbox_memory_integrity_test.go:99: Command [bash] output: event:{start:{pid:1283}}
    sandbox_memory_integrity_test.go:100: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:100
        	Error:      	Received unexpected error:
        	            	failed to execute command bash in sandbox icd3oknbmtw1pogo8aa44: invalid_argument: protocol error: incomplete envelope: unexpected EOF
        	Test:       	TestSandboxMemoryIntegrity/tmpfs_hash
--- FAIL: TestSandboxMemoryIntegrity/tmpfs_hash (47.20s)

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The checkPeerAvailability function silently ignores errors during header deserialization while still marking the build as uploaded, which can leave the reader in an unrecoverable state where it attempts to read from storage without the necessary V4 frame table. Additionally, a name mismatch exists in the peerState.header lookup where the code expects storage.RootfsName but receives the rootfs diff type string, preventing the pending header from being correctly retrieved and installed for rootfs files.

Comment thread packages/orchestrator/pkg/sandbox/build/build.go
Comment thread packages/orchestrator/pkg/sandbox/template/peerclient/storage.go
Comment thread packages/orchestrator/pkg/sandbox/template/peerclient/resolver.go
levb added 2 commits May 11, 2026 12:07
checkPeerAvailability silently dropped errors from header.DeserializeBytes
on peer-delivered V4 header bytes. Now logs loudly via logger.L().Error
with file name + error. The transition to storage still proceeds — File
keeps its existing header rather than installing a corrupted one, matching
the V3 fallback path. Gating uploaded.Store(true) on parse success (as
the bot suggested) would re-route the next read through the same peer
and loop on the same bad bytes; logging surfaces the corruption without
that risk.
@levb levb marked this pull request as ready for review May 11, 2026 19:32

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Code review skipped — your organization has reached its monthly code review spending cap.

An organization admin can view or raise the cap at claude.ai/admin-settings/claude-code. The cap resets at the start of the next billing period.

Once the cap resets or is raised, reopen this pull request to trigger a review.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ea5233a8cc

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +111 to +114
if outcome != served {
cancel()

return peerAttempt[io.ReadCloser]{}, nil
return peerAttempt[io.ReadCloser]{result: outcome}, nil

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Refresh header before falling back on UseStorage

When the first UseStorage signal arrives, this path treats it like a normal miss and immediately opens base storage using the caller’s existing frameTable. If the caller still has the pre-upload header (e.g. CompressionNone before V4 finalization), the fallback opens the uncompressed object path and the first post-transition read fails, even though the peer already supplied the new header bytes. The previous PeerTransitionedError flow retried after header swap; this change drops that safeguard for the transition call.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a real issue, I moved the retry deeper down the stack, but it lost re-resolving the header.

Comment thread packages/orchestrator/pkg/sandbox/template/peerclient/storage.go
@levb levb marked this pull request as draft May 12, 2026 17:07
@levb

levb commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

#2919 supersedes

@levb levb closed this Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants