perf(block): sync.Map -> atomic bitmap for cache dirty tracking #2235
Replace the `sync.Map`-based dirty block tracking in `block.Cache` with a
pre-allocated `[]atomic.Uint64` bitset. Each bit represents one block;
atomic OR (`atomic.Uint64.Or`) is used for concurrent-safe marking.
## Problem
The block cache tracks which blocks have been written (are "dirty") so that
`Slice()` can decide whether to serve from mmap or return `BytesNotAvailable`.
The previous implementation used `sync.Map` with one entry per dirty block
offset. For a 64 MiB file at 4K block size, that's up to 16,384 map entries —
each requiring a heap-allocated key and value, plus the internal hash map
overhead of sync.Map.
Every call to `setIsCached` allocated a `[]int64` via `header.BlocksOffsets()`
and then called `sync.Map.Store()` in a loop. Every call to `isCached` did
the same allocation and called `sync.Map.Load()` in a loop. These are the
hottest paths in the block device — called on every NBD read and every
chunk fetch.
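For reference, a minimal sketch of the previous `sync.Map` pattern described above (names like `cache` and `blocksOffsets` are illustrative, not the exact source): each call allocates an offsets slice and performs one map operation per block.

```go
package main

import (
	"fmt"
	"sync"
)

const blockSize = 4096

type cache struct {
	dirty sync.Map // block offset (int64) -> struct{}
}

// blocksOffsets mirrors the role of header.BlocksOffsets: it allocates
// a []int64 holding every block offset in [off, off+length).
func blocksOffsets(off, length int64) []int64 {
	offsets := make([]int64, 0, length/blockSize)
	for o := off; o < off+length; o += blockSize {
		offsets = append(offsets, o)
	}
	return offsets
}

func (c *cache) setIsCached(off, length int64) {
	for _, o := range blocksOffsets(off, length) { // allocation per call
		c.dirty.Store(o, struct{}{}) // one map store per block
	}
}

func (c *cache) isCached(off, length int64) bool {
	for _, o := range blocksOffsets(off, length) { // allocation per call
		if _, ok := c.dirty.Load(o); !ok {
			return false
		}
	}
	return true
}

func main() {
	c := &cache{}
	c.setIsCached(0, 4*1024*1024) // one 4 MiB chunk -> 1024 map stores
	fmt.Println(c.isCached(0, blockSize), c.isCached(4*1024*1024, blockSize))
}
```

Marking a single 4 MiB chunk dirty costs 1024 `Store` calls plus the offsets allocation, which is what the benchmarks below measure as the baseline.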
## Solution
- `dirty` field changed from `sync.Map` to `[]atomic.Uint64`, sized at
`ceil(numBlocks / 64)` and allocated once in `NewCache`.
- `setIsCached` computes a bitmask per 64-bit word and applies it with
`atomic.Uint64.Or` — one atomic op per word instead of one map store per
block.
- `isCached` / `isBlockCached` read bits with `atomic.Uint64.Load` — no
allocation, no map lookup.
- `dirtySortedKeys` iterates words and uses `bits.TrailingZeros64` to extract
set bits — produces a naturally sorted result without `slices.Sort`.
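The solution points above can be sketched as follows. This is an illustrative reconstruction, not the exact source: names like `dirtyBitset`, `setRange`, and `sortedKeys` are assumptions, and `atomic.Uint64.Or` requires Go 1.23+.

```go
package main

import (
	"fmt"
	"math/bits"
	"sync/atomic"
)

type dirtyBitset struct {
	words []atomic.Uint64 // bit i set == block i is dirty
}

func newDirtyBitset(numBlocks int64) *dirtyBitset {
	return &dirtyBitset{words: make([]atomic.Uint64, (numBlocks+63)/64)} // ceil(numBlocks/64)
}

// setRange marks blocks [start, end) dirty with one atomic OR per
// 64-bit word instead of one map store per block.
func (b *dirtyBitset) setRange(start, end int64) {
	for i := start; i < end; {
		word := i / 64
		mask := ^uint64(0) << uint(i%64)
		next := (word + 1) * 64
		if end < next { // range ends inside this word (end%64 != 0 here)
			mask &= ^uint64(0) >> uint(64-end%64)
		}
		b.words[word].Or(mask)
		i = next
	}
}

// isSet reads one bit with a single atomic load — no allocation.
func (b *dirtyBitset) isSet(block int64) bool {
	return b.words[block/64].Load()&(1<<uint(block%64)) != 0
}

// sortedKeys extracts set bits in ascending order via
// bits.TrailingZeros64; word order makes the result naturally sorted.
func (b *dirtyBitset) sortedKeys() []int64 {
	var out []int64
	for w := range b.words {
		v := b.words[w].Load()
		for v != 0 {
			bit := bits.TrailingZeros64(v)
			out = append(out, int64(w*64+bit))
			v &^= 1 << uint(bit)
		}
	}
	return out
}

func main() {
	b := newDirtyBitset(256)
	b.setRange(3, 70) // crosses a word boundary: one Or on word 0, one on word 1
	fmt.Println(b.isSet(3), b.isSet(69), b.isSet(70))
}
```

A 4 MiB chunk (1024 blocks) collapses to 16 atomic ORs, and a hit check to 16 atomic loads.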
## Benchmarks
64 MiB cache, 4K blocks, 4 MiB chunks (1024 blocks per chunk):
```
│ sync.Map │ bitmap │
│ sec/op │ sec/op vs base │
MarkRangeCached-16 63857.50n ± 3% 25.04n ± 6% -99.96% (p=0.002 n=6)
IsCached_Hit-16 19406.0n ± 1% 908.6n ± 0% -95.32% (p=0.002 n=6)
IsCached_Miss-16 1168.500n ± 1% 2.813n ± 1% -99.76% (p=0.002 n=6)
Slice_Hit-16 19296.0n ± 1% 913.7n ± 1% -95.27% (p=0.002 n=6)
Slice_Miss-16 1161.000n ± 3% 3.765n ± 0% -99.68% (p=0.002 n=6)
│ sync.Map │ bitmap │
│ B/op │ B/op vs base │
MarkRangeCached-16 64.05Ki ± 0% 0.00Ki ± 0% -100.00% (p=0.002 n=6)
IsCached_Hit-16 8.000Ki ± 0% 0.000Ki ± 0% -100.00% (p=0.002 n=6)
IsCached_Miss-16 8.000Ki ± 0% 0.000Ki ± 0% -100.00% (p=0.002 n=6)
Slice_Hit-16 8.000Ki ± 0% 0.000Ki ± 0% -100.00% (p=0.002 n=6)
Slice_Miss-16 8.000Ki ± 0% 0.000Ki ± 0% -100.00% (p=0.002 n=6)
│ sync.Map │ bitmap │
│ allocs/op │ allocs/op vs base │
MarkRangeCached-16 2.049k ± 0% 0.000k ± 0% -100.00% (p=0.002 n=6)
IsCached_Hit-16 1.000 ± 0% 0.000 ± 0% -100.00% (p=0.002 n=6)
IsCached_Miss-16 1.000 ± 0% 0.000 ± 0% -100.00% (p=0.002 n=6)
Slice_Hit-16 1.000 ± 0% 0.000 ± 0% -100.00% (p=0.002 n=6)
Slice_Miss-16 1.000 ± 0% 0.000 ± 0% -100.00% (p=0.002 n=6)
```
Zero allocations across all operations. `setIsCached` goes from 64µs to 25ns
(2550x), `Slice` hit from 19µs to 914ns (21x).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix OOB panic in `setIsCached` when the range extends past the cache size. The old `sync.Map.Store` silently accepted out-of-bounds keys, but the bitmap would panic on an out-of-range index. Cap `n` to `len(dirty)*64`.
- Add `TestSetIsCached_PastCacheSize` to verify the fix.
- Add `TestSetIsCached_ConcurrentOverlapping`: 8 goroutines with overlapping ranges under `-race` to prove atomic OR correctness.
- Remove redundant `sort.SliceIsSorted` (covered by `require.Equal`).
- Use a 2 MiB block size in the BoundaryCrossing test (the real hugepage size) to exercise a second block size and drop `nolint:unparam`.
- Make benchmark consts consistently `int64`.
dobrac left a comment:
Could we either use a library (like bits-and-blooms/bitset) for this functionality (preferred), or refactor the `[]atomic.Uint64` functionality into a separate file (module)?
dobrac left a comment:
Let's move it to its own utility right away, then; there is almost no added cost in doing it now.
Move the `[]atomic.Uint64` bitmap from `block.Cache` into a standalone `atomicbitset.Bitset` in `packages/shared/pkg/atomicbitset`. Word-level masking in `HasRange` reduces atomic loads on the chunker hot path.
@dobrac For compressed bitmaps, https://github.com/RoaringBitmap/roaring seems to be widely used, and for the filesystem it could really improve the default sizes (even beyond this PR's optimization). The only problem is that it is also not atomic, so we would need a mutex there. This PR is better in that it effectively shards the mutexes by using atomics.
We've agreed with @levb to put this on hold for now unless there is something important that needs it.
- Remove `newCache`/`newFullFetchChunker`/`newStreamingChunker` indirection.
- `Cache` creates its own bitset internally; no external passing.
- Remove `AddDirtyOffset`; pass the dirty `BitSet` directly to `DiffMetadataBuilder`.
- Remove debug logging.
CardinalityInRange correlated with integration test bus errors in VMs. Revert to the proven Contains loop until the library method is vetted.
Remove the Flat and Sharded impls and the `Bitset` interface: there is only one consumer, and it always uses roaring. Rename to a concrete `Bitset` struct with a `New()` constructor. Switch `HasRange` from a per-bit `Contains` loop to a `Rank()`-based cardinality check: O(containers) instead of O(range_size), ~19ns for 1024-block chunk queries. `CardinalityInRange` is buggy in roaring v2.16.0 (off-by-one in `runContainer16.getCardinalityInRange` when the query falls in a gap between run intervals); a test documents the bug for a future library upgrade.
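The rank-based cardinality check works because a range [start, end) is fully set iff the count of set bits below `end` minus the count below `start` equals the range length. The sketch below illustrates this on a plain `[]uint64` with popcounts (roaring is a third-party dependency, and its `Rank(x)` counts bits ≤ x from container cardinalities rather than scanning words); `rank` and `hasRange` here are illustrative names.

```go
package main

import (
	"fmt"
	"math/bits"
)

// rank returns the number of set bits in [0, n).
func rank(words []uint64, n int64) int64 {
	var r int64
	for i := int64(0); i < n/64; i++ {
		r += int64(bits.OnesCount64(words[i])) // whole words below n
	}
	if rem := n % 64; rem != 0 {
		// partial word: mask off bits at positions >= rem
		r += int64(bits.OnesCount64(words[n/64] & (^uint64(0) >> uint(64-rem))))
	}
	return r
}

// hasRange reports whether every bit in [start, end) is set:
// the cardinality of the range must equal its length.
func hasRange(words []uint64, start, end int64) bool {
	return rank(words, end)-rank(words, start) == end-start
}

func main() {
	words := []uint64{^uint64(0), 0b101} // bits 0..63, 64, and 66 set
	fmt.Println(hasRange(words, 0, 65), hasRange(words, 0, 67))
}
```

With roaring, the same difference-of-ranks query is answered from per-container cardinalities, so the cost scales with containers touched rather than bits in the range.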
💡 Codex Review: automated review suggestions for this pull request. Reviewed commit: 6c0f1dd8fe.
For now, using a roaring fork with the fix. After RoaringBitmap/roaring#521 is merged, we can switch back to the original lib.
Switch `HasRange` from `Rank()` to `CardinalityInRange()`, which is O(k) in the containers spanned rather than O(C) in the total number of containers. Use the e2b-dev/roaring fork, which fixes an off-by-one in `runContainer16.getCardinalityInRange` for gaps between runs within a single container (upstream PR RoaringBitmap/roaring#521). Also: initialize the dirty bitset in the zero-size `Cache` path, and fix a stale comment in `streaming_chunk.go`.
Outdated after the latest changes.
Precursor to PR #2034, separated to reduce the scope there.