fix(orch): P2P reader race on peer→storage transition#2627
Conversation
When a peer signaled UseStorage to indicate its upload finished, the current reader saw `uploaded=true` and routed to base storage, but concurrent sibling reads on the same File could fall through with a stale FrameTable — V4 chunk offsets shift between the pre-finalize in-memory header and the post-finalize GCS header, so reads decoded the wrong bytes. The peer now ships the final V4 header inline on its UseStorage response (new `header_bytes` field on `PeerAvailability`). The bytes are the same `[]byte` produced once by `header.StoreHeader` during upload — captured on `*Upload` and forwarded into the server's `uploadedBuilds` cache, no second serialization, no failure path. Client side, the resolver's per-buildID state holds the parsed header. `build.File.installPendingHeader` polls the state at the top of every Slice/ReadAt iteration and CAS-installs the new header atomically. Concurrent readers either install the same value or observe it already installed — no coordination primitive needed, no GCS poll. Also: mid-stream peer abort (NotAvailable from cache eviction) now surfaces as `storage.ErrPeerAborted` so the chunker reopens via base on a single retry, instead of silently truncating the stream.
❌ 9 Tests Failed:
View the full list of 10 ❄️ flaky test(s)
To view more test analytics, go to the Test Analytics Dashboard |
There was a problem hiding this comment.
Code Review
The checkPeerAvailability function silently ignores errors during header deserialization while still marking the build as uploaded, which can leave the reader in an unrecoverable state where it attempts to read from storage without the necessary V4 frame table. Additionally, a name mismatch exists in the peerState.header lookup where the code expects storage.RootfsName but receives the rootfs diff type string, preventing the pending header from being correctly retrieved and installed for rootfs files.
checkPeerAvailability silently dropped errors from header.DeserializeBytes on peer-delivered V4 header bytes. Now logs loudly via logger.L().Error with file name + error. The transition to storage still proceeds — File keeps its existing header rather than installing a corrupted one, matching the V3 fallback path. Gating uploaded.Store(true) on parse success (as the bot suggested) would re-route the next read through the same peer and loop on the same bad bytes; logging surfaces the corruption without that risk.
There was a problem hiding this comment.
An organization admin can view or raise the cap at claude.ai/admin-settings/claude-code. The cap resets at the start of the next billing period.
Once the cap resets or is raised, reopen this pull request to trigger a review.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ea5233a8cc
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if outcome != served { | ||
| cancel() | ||
|
|
||
| return peerAttempt[io.ReadCloser]{}, nil | ||
| return peerAttempt[io.ReadCloser]{result: outcome}, nil |
There was a problem hiding this comment.
Refresh header before falling back on UseStorage
When the first UseStorage signal arrives, this path treats it like a normal miss and immediately opens base storage using the caller’s existing frameTable. If the caller still has the pre-upload header (e.g. CompressionNone before V4 finalization), the fallback opens the uncompressed object path and the first post-transition read fails, even though the peer already supplied the new header bytes. The previous PeerTransitionedError flow retried after header swap; this change drops that safeguard for the transition call.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
This is a real issue, I moved the retry deeper down the stack, but it lost re-resolving the header.
|
#2919 supersedes |
When a peer signaled UseStorage to indicate its upload finished, the current reader saw
uploaded=trueand routed to base storage, but concurrent sibling reads on the same File could fall through with no FrameTable, attempting to decode the wrong bytes.The peer now ships the final V4 header inline on its UseStorage response (new
header_bytesfield onPeerAvailability).Client side, the resolver's per-buildID state holds the parsed header.
build.File.installPendingHeaderpolls the state at the top of every Slice/ReadAt iteration and CAS-installs the new header atomically. Concurrent readers either install the same value or observe it already installed — no coordination primitive needed, no GCS poll.