Skip to content

feature: Compression for memfile/rootfs assets#2034

Merged
levb merged 227 commits into
mainfrom
lev-compression-final
May 5, 2026
Merged

feature: Compression for memfile/rootfs assets#2034
levb merged 227 commits into
mainfrom
lev-compression-final

Conversation

@levb
Copy link
Copy Markdown
Contributor

@levb levb commented Mar 2, 2026

Summary

Compression for data files (memfile, rootfs). Files are broken into independently decompressible frames (2 MiB, zstd), stored in GCS alongside V4 headers with per-mapping frame tables. Fully backward-compatible: the read path auto-detects V3/V4 headers and routes compressed vs uncompressed reads per-mapping. Gated by compress-config LaunchDarkly flag (per-team/cluster/template targeting).

What changed

  • FramedFile interface replaces Seekable — unified GetFrame(ctx, offset, frameTable, decompress, buf, readSize, onRead) handles both compressed and uncompressed data
  • V4 header with FrameTable per mapping + BuildFileInfo (uncompressed size, SHA-256 checksum) per build; LZ4-block-compressed header blob
  • NFS cache extended for compressed frames (.frm files keyed by compressed offset+size); progressive streaming decompression on cache miss; write-through on upload
  • P2P resume integration — peers read uncompressed from origin during upload, then atomically swap to V4 header (CAS) when origin signals use_storage with serialized headers
  • compress-build CLI for background compression of existing uncompressed builds (supports --recursive for dependency chains)
  • New Chunker with mmap cache, and fetch sessions dedupe replacing streaming_chunk.go

Read path

  NBD/UFFD/Prefetch
    → header.GetShiftedMapping(offset) → BuildMap + FrameTable
    → DiffStore.Get(ctx, diff)         → cached Chunker
    → Chunker.GetBlock(offset, len, ft)
        → mmap hit? return reference
        → miss: fetchSession (dedup) → GetFrame
            → NFS hit? decompress from disk → mmap
            → NFS miss? GCS range read → decompress → mmap + NFS write-back

P2P header switchover

  Origin (pause):
    snapshot → register buildID in Redis → serve mmap cache via gRPC
    background: upload compressed data + V4 headers to GCS
    on completion: uploadedBuilds.Set(buildID, serialized V4 headers)
                → peerRegistry.Unregister(buildID)

  Peer (resume, upload in progress):
    GetFrame(ft=nil) → gRPC stream → origin serves from mmap (uncompressed)

  Peer (origin signals use_storage):
    checkPeerAvailability() → transitionHeaders.Store({memH, rootH})
                            → uploaded.Store(true)
    next GetFrame(ft=nil): ft==nil + transitionHeaders != nil
      → return PeerTransitionedError{headers}
      → build.File.swapHeader(): Deserialize(bytes) → CompareAndSwap(old, new)
        first goroutine wins CAS; others see swapped header on retry
      → retry: GetFrame(ft!=nil) → NFS/GCS compressed (mmap mostly warm)

Benchmark results

End-to-end pause/resume

(BenchmarkBaseImage, 50 iterations, local disk):

  ┌──────────────┬─────────┬────────────┐
  │     Mode     │ Latency │ Build time │
  ├──────────────┼─────────┼────────────┤
  │ Uncompressed │ 97 ms   │ 61.0s      │
  ├──────────────┼─────────┼────────────┤
  │ LZ4:0        │ 100 ms  │ 61.4s      │
  ├──────────────┼─────────┼────────────┤
  │ Zstd:1       │ 100 ms  │ 60.9s      │
  ├──────────────┼─────────┼────────────┤
  │ Zstd:2       │ 102 ms  │ 62.4s      │
  ├──────────────┼─────────┼────────────┤
  │ Zstd:3       │ 98 ms   │ 61.7s      │
  └──────────────┴─────────┴────────────┘

Full architecture doc: docs/compression-architecture.md

levb and others added 18 commits February 27, 2026 05:52
…ning

- Use header.HugepageSize for uncompressed fetch alignment (semantically correct)
- Stream NFS cache hits directly into ReadFrame instead of buffering in memory
- Fix timer placement to cover full GetFrame (read + decompression)
- Fix onRead callback: nil for compressed inner calls (prevents double-invoke),
  pass through for uncompressed (bytes are final)
- Remove panic recovery from runFetch (never in main)
- Remove low-value chunker tests subsumed by ConcurrentStress
- Remove 4MB frame configs from benchmarks (targeting 2MB only)
- Remove unused readCacheFile function

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ble NFS cache

- Remove dead flagsClient chain through chunker/build/template layers (~15 files)
- Delete ChunkerConfigFlag (unused after flagsClient removal)
- Delete mock_flagsclient_test.go
- Simplify GetUploadOptions: remove redundant intOr/strOr fallbacks (flags have defaults)
- Add GetCompressionType helper to frame_table.go, deduplicate compression type extraction
- Replace [16]byte{} with uuid.Nil and "rootfs.ext4" with storage.RootfsName in inspect-build
- Simplify UploadV4Header return pattern
- Remove onRead callback from legacy fullFetchChunker (FullFetch should not use progressive reads)
- Re-enable NFS cache in template cache.go
- Remove all fmt.Printf debug instrumentation from orchestrator

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…sionType threading

Add per-build file size and SHA-256 checksum to V4 headers, eliminating
the redundant Size() network call when opening upstream data files on
the read path. Checksums are computed for free by piggybacking on
CompressStream's existing frame iteration.

Remove the separate compressionType parameter threaded through
getBuild → newStorageDiff → NewChunker; the read path now derives
compression state from the per-mapping FrameTable directly.

V4 binary format change (not yet deployed):
  [Metadata] [LZ4: numBuilds, builds(uuid+size+checksum),
              numMappings, mappings...]

V3 path unchanged — falls back to Size() call when size is unknown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread packages/shared/pkg/storage/header/serialization_test.go Outdated
Comment thread packages/shared/pkg/storage/storage_cache_seekable.go Outdated
Comment thread packages/shared/pkg/storage/storage_cache_seekable.go Outdated
Comment thread packages/shared/pkg/storage/storage_fs.go
levb and others added 6 commits March 2, 2026 10:58
- Merge writeFrameToCache and writeChunkToCache into unified writeToCache
  with lock + atomic rename, used by all three cache write paths
- Fix file descriptor leak in cache hit paths: defer f.Close() and wrap
  in NopCloser so ReadFrame's close doesn't double-close the fd
- Add defer uploader.Close() in CompressStream so PartUploader file
  handles are released on error paths between Start() and Complete()
- Make Close() idempotent via sync.Once on fsPartUploader and filePartWriter

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread docs/compression-architecture.md Outdated
Comment thread packages/orchestrator/cmd/benchmark-compress/main.go Outdated
Comment thread packages/orchestrator/cmd/copy-build/main.go Outdated
Comment thread packages/orchestrator/cmd/inspect-build/main.go
Comment thread packages/orchestrator/cmd/internal/cmdutil/cmdutil.go Outdated
Comment thread packages/shared/pkg/storage/storage_aws.go Outdated
Comment thread packages/shared/pkg/storage/storage_cache_seekable.go Outdated
Comment thread packages/shared/pkg/storage/storage_cache_seekable.go Outdated
Comment thread packages/shared/pkg/storage/storage_cache_seekable.go Outdated
Comment thread packages/shared/pkg/storage/storage_fs.go
levb and others added 2 commits March 3, 2026 06:09
The SHA-256 checksum in BuildFileInfo now covers uncompressed data,
making it useful for end-to-end integrity verification of the original
content. Updated inspect-build to use SHA-256 (replacing MD5) and
verify checksums against the header. Fixed early-return lint warnings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
GetUploadOptions now accepts fileType and useCase parameters, enriching
the LD evaluation context so dashboard targeting rules can differentiate
(e.g. compress memfile but not rootfs, or builds but not pauses).
TemplateBuild accepts per-file opts directly instead of holding an ff
reference.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread packages/orchestrator/pkg/sandbox/build/storage_diff.go
Comment thread packages/orchestrator/pkg/sandbox/template/peerserver/seekable.go Outdated
Comment thread packages/orchestrator/pkg/sandbox/build/storage_diff.go
Comment thread packages/shared/pkg/storage/storage_google.go
levb added a commit that referenced this pull request May 5, 2026
- peerserver/seekable + localDiff: use logical cache.Size(), not
  on-disk FileSize() (8x-inflated formula broke V3/V4 P2P resume).
- storage_google.StoreFile: one outer WriteFromFileSystem timer
  covering all branches; drop the shadowed OneShot; tag with
  compression.type/level.
- Privatize ResolveCompressConfig into sandbox; validate frame size
  against the real per-file block size from the snapshot diff header.
- CI: add lz4 matrix; propagate COMPRESS_* to the base-template
  build via .env.test so the fixture matches the services.
- peerserver/seekable + localDiff: use logical cache.Size(), not
  on-disk FileSize() (8x-inflated formula broke V3/V4 P2P resume).
- storage_google.StoreFile: one outer WriteFromFileSystem timer
  covering all branches; drop the shadowed OneShot; tag with
  compression.type/level.
- Privatize ResolveCompressConfig into sandbox; validate frame size
  against the real per-file block size from the snapshot diff header.
- CI: add lz4 matrix; propagate COMPRESS_* to the base-template
  build via .env.test so the fixture matches the services.
@levb levb force-pushed the lev-compression-final branch from 048ae83 to 7e65a1f Compare May 5, 2026 16:15
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional findings (outside current diff — PR may have been updated during review):

  • 🔴 packages/orchestrator/pkg/sandbox/template/peerserver/seekable_test.go:17-23 — TestSeekableSource_Size mocks the wrong method, breaking CI: the test expects diff.FileSize() but seekableSource.Size (seekable.go:22) actually calls f.diff.Size(ctx). go test -run TestSeekableSource_Size ./packages/orchestrator/pkg/sandbox/template/peerserver/ fails with unexpected method call: Size(*context.cancelCtx) and FAIL: 0 out of 1 expectation(s) were met. FileSize(). Fix is one-character: change the mock to diff.EXPECT().Size(mock.Anything).Return(int64(1234), nil) (or change the production call to FileSize(), but Size(ctx) is the SeekableSource interface contract and is what other tests in the file already expect for similar code paths).

    Extended reasoning...

    What's broken

    In packages/orchestrator/pkg/sandbox/template/peerserver/seekable_test.go:20, the test sets up the mock as:

    diff := buildmocks.NewMockDiff(t)
    diff.EXPECT().FileSize().Return(int64(1234), nil)

    But the production code in packages/orchestrator/pkg/sandbox/template/peerserver/seekable.go:21-23 calls Size(ctx), not FileSize():

    func (f *seekableSource) Size(ctx context.Context) (int64, error) {
        return f.diff.Size(ctx)
    }

    build.Diff is an interface that has both methods (Size(ctx) (int64, error) from the embedded storage.SeekableReader, plus a separate FileSize() (int64, error) method). The test mocks the wrong one of the two.

    Step-by-step proof

    Reproduced directly on the PR branch:

    $ go test -run TestSeekableSource_Size ./packages/orchestrator/pkg/sandbox/template/peerserver/
    --- FAIL: TestSeekableSource_Size (0.00s)
        mock.go:361:
            assert: mock: I don't know what to return because the method call was unexpected.
                Either do Mock.On("Size").Return(...) first, or remove the Size() call.
                This method was unexpected:
                    Size(*context.cancelCtx)
                at: [...mockdiff.go:411 ...seekable.go:22 ...seekable_test.go:28]
        mockdiff.go:24: FAIL:	FileSize()
                at: [...mockdiff.go:260 ...seekable_test.go:20]
        mockdiff.go:24: FAIL: 0 out of 1 expectation(s) were met.
                The code you are testing needs to make 1 more call(s).
    FAIL
    FAIL	github.com/e2b-dev/infra/packages/orchestrator/pkg/sandbox/template/peerserver	0.017s
    

    Walking through the call:

    1. src.Size(t.Context()) is called (seekable_test.go:28).
    2. That dispatches to seekableSource.Size, which calls f.diff.Size(ctx) (seekable.go:22).
    3. The mock has no expectation for Size — only for FileSize() — so testify panics: Size(*context.cancelCtx) is unexpected.
    4. After panic recovery, mock cleanup reports the unmet FileSize() expectation as a second failure.

    Why this matters / impact

    CI-breaking. The PR's own diff (visible in this review) replaced the previously-correct Size(mock.Anything) expectation with FileSize(), but the production code at seekable.go:22 was not changed to match. This appears to be an in-flight refactor that was committed half-done.

    Fix

    The simpler/safer fix is to update the test to match the production interface contract. Size(ctx context.Context) (int64, error) is part of the storage.SeekableReader interface that build.Diff embeds and is what the rest of the codebase uses for context-cancellable size lookups. Change line 20 to:

    diff.EXPECT().Size(mock.Anything).Return(int64(1234), nil)

    Alternatively, if the intent is to use the no-context FileSize() here (it returns from the local cache without an RPC), seekable.go:22 needs to be changed to return f.diff.FileSize() and the function signature kept as Size(ctx context.Context) to satisfy the SeekableSource interface (with _ context.Context to silence unused-param). Either way, the test and the production code must agree.

Comment thread packages/orchestrator/pkg/sandbox/build/build.go
Comment thread packages/orchestrator/pkg/sandbox/build/storage_diff.go
Comment thread packages/shared/pkg/storage/gcp_multipart.go
@levb levb enabled auto-merge (squash) May 5, 2026 19:05
@levb levb requested a review from dobrac May 5, 2026 19:05
@levb
Copy link
Copy Markdown
Contributor Author

levb commented May 5, 2026

@cla-bot check

@cla-bot
Copy link
Copy Markdown

cla-bot Bot commented May 5, 2026

We require contributors to sign our Contributor License Agreement, and we don't have @github-actions[bot] on file. You can sign our CLA at https://e2b.dev/docs/cla . Once you've signed, post a comment here that says '@cla-bot check'

@cla-bot
Copy link
Copy Markdown

cla-bot Bot commented May 5, 2026

The cla-bot has been summoned, and re-checked this pull request!

@tvi
Copy link
Copy Markdown
Contributor

tvi commented May 5, 2026

@cla-bot check

@cla-bot cla-bot Bot added the cla-signed label May 5, 2026
@cla-bot
Copy link
Copy Markdown

cla-bot Bot commented May 5, 2026

The cla-bot has been summoned, and re-checked this pull request!

@dobrac
Copy link
Copy Markdown
Contributor

dobrac commented May 5, 2026

@cla-bot check

@cla-bot
Copy link
Copy Markdown

cla-bot Bot commented May 5, 2026

The cla-bot has been summoned, and re-checked this pull request!

@dobrac dobrac marked this pull request as draft May 5, 2026 20:06
auto-merge was automatically disabled May 5, 2026 20:06

Pull request was converted to draft

@dobrac dobrac marked this pull request as ready for review May 5, 2026 20:06
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7e65a1f6d5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread .github/actions/build-sandbox-template/action.yml
Comment thread packages/orchestrator/pkg/sandbox/build/storage_diff.go
Comment on lines +39 to +44
// IncompletePendingUpload is set on diff headers produced by ToDiffHeader and
// cleared on the finalized headers swapped in by the upload pipeline. It
// is in-memory only (never serialized), and signals that the build's data
// has not yet reached object storage — readers must serve from the local
// cache and skip FrameTable lookups for the still-missing self entry.
IncompletePendingUpload bool
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The doc comment for IncompletePendingUpload claims it "is in-memory only (never serialized)", but the V4 wire format reserves bit 0 of a dedicated flags byte for exactly this field — serialization_v4.go:111-118 writes it and serialization_v4.go:198 reads it back. peerserver/header.go:46 explicitly forces this bit ON for V4 headers shipped over P2P, which is the wire mechanism by which peers detect an in-flight build. The flag is never persisted to long-term storage (StoreHeader refuses headers carrying it), but it IS serialized in the V4 wire format. Consider rewording to something like "never persisted to long-term storage by StoreHeader, but is serialized in the V4 wire format for in-flight P2P signaling". Documentation only — runtime behavior is correct.

Extended reasoning...

What the bug is

The comment block at packages/shared/pkg/storage/header/header.go:39-44 reads:

// IncompletePendingUpload is set on diff headers produced by ToDiffHeader and
// cleared on the finalized headers swapped in by the upload pipeline. It
// is in-memory only (never serialized), and signals that the builds data
// has not yet reached object storage — readers must serve from the local
// cache and skip FrameTable lookups for the still-missing self entry.
IncompletePendingUpload bool

The phrase "in-memory only (never serialized)" is factually wrong for V4 headers. This is documentation only — runtime behavior is correct — but it will mislead future maintainers about the V4 wire format and the P2P contract.

Step-by-step proof

  1. SerializeHeader (serialization.go:8-14) dispatches to serializeV4 for any header with Metadata.Version >= 4, passing h.IncompletePendingUpload through.

  2. serializeV4 (serialization_v4.go:111-119) packs the flag into bit 0 of a dedicated flags byte that becomes part of the on-the-wire bytes:

    var flags uint8
    if incomplete {
        flags |= v4FlagIncomplete
    }
    result := make([]byte, metadataSize+v4FlagsLen+v4SizePrefixLen+len(compressed))
    copy(result, metaBuf.Bytes())
    result[metadataSize] = flags

    v4FlagIncomplete = 1 << 0 (serialization_v4.go:24).

  3. deserializeV4 (serialization_v4.go:131, 198) reads the flag back:

    flags := blockData[0]
    ...
    h.IncompletePendingUpload = flags&v4FlagIncomplete != 0
  4. peerserver/header.go:38-47 is the wire mechanism — it explicitly forces this bit ON for every V4 header it serves to a peer, regardless of the in-memory state:

    wire := *h
    if wire.Metadata.Version >= header.MetadataVersionV4 {
        wire.IncompletePendingUpload = true
    }
    data, err := header.SerializeHeader(&wire)

    This is the entire mechanism by which a peer learns the origin upload is still in flight.

Why the comment is technically true only for the storage path

StoreHeader (serialization.go:62-67) refuses to persist a header with the flag set:

if h.IncompletePendingUpload {
    return fmt.Errorf("refusing to persist incomplete header for %s", path)
}

So the flag never lands in long-term object storage — that is the only sense in which it is "not serialized". But the V4 wire path absolutely encodes it.

Impact

Documentation-only — the runtime is correct. But the comment will mislead anyone trying to understand:

  • the V4 wire format,
  • the P2P → "switch to storage" handshake,
  • why peerserver/header.go mutates the field before serialization,
  • why V3 doesn't need this and V4 does.

A future maintainer trusting the "never serialized" claim might remove the flags byte, the peerserver mutation, or the StoreHeader guard, breaking the P2P in-flight signal in subtle ways.

Suggested fix

Reword the comment to reflect the actual contract, e.g.:

// IncompletePendingUpload is set on diff headers produced by ToDiffHeader and
// cleared on the finalized headers swapped in by the upload pipeline. It is
// never persisted to long-term storage (StoreHeader refuses headers carrying
// it), but it IS serialized in the V4 wire format (bit 0 of the flags byte)
// to signal in-flight builds over P2P. Readers seeing this bit set must serve
// from the local cache and skip FrameTable lookups for the still-missing self
// entry.
IncompletePendingUpload bool

Comment thread packages/shared/pkg/storage/compress_upload.go
@qodo-code-review
Copy link
Copy Markdown

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (2) 📎 Requirement gaps (0)

Grey Divider

Issue findings (3)

Findings have been published as inline review comments

Grey Divider

 ⓘ Findings that repeat in the code will be published as multiple comments

Grey Divider

Qodo Logo

Comment thread .mockery.yaml
Comment thread packages/orchestrator/pkg/sandbox/uploads_test.go
Comment thread packages/shared/pkg/storage/storage_cache_seekable.go
@levb levb merged commit 08810db into main May 5, 2026
91 of 93 checks passed
@levb levb deleted the lev-compression-final branch May 5, 2026 20:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants