Skip to content

perf(lib/buffer_readwriter): make buffered read writer lock-free to reduce mutexes#616

Draft
sambhav-jain-16 wants to merge 3 commits into
uber:masterfrom
sambhav-jain-16:lock-less-buffer
Draft

perf(lib/buffer_readwriter): make buffered read writer lock-free to reduce mutexes#616
sambhav-jain-16 wants to merge 3 commits into
uber:masterfrom
sambhav-jain-16:lock-less-buffer

Conversation

@sambhav-jain-16
Copy link
Copy Markdown
Collaborator

@sambhav-jain-16 sambhav-jain-16 commented May 8, 2026

What?

Mutexes on the buffer are the third-highest mutex contention for kraken-origin.
The aws.WriteAtBuffer acquires a lock at every lock in the whole buffer even when there's no overlap

--- mutex:
cycles/second=2445350745
sampling period=1
65383536215716 29312064 @ 0x5154cb3 0x5154b6a 0x708ee8a 0x708f322 0x7188429 0x49db190 0x718fdbd 0x718fd7d 0x49db09d 0x71a1508 0x71a14da 0x719fff5 0x71a1137 0x499fb81
#	0x5154cb2	sync.(*Mutex).Unlock+0x12								GOROOT/src/sync/mutex.go:65
#	0x5154b69	golang.org/x/net/http2.transportResponseBody.Read+0x409					external/org_golang_x_net/http2/transport.go:2598
#	0x708ee89	go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*wrappedBody).Read+0x29	external/io_opentelemetry_go_contrib_instrumentation_net_http_otelhttp/transport.go:229
#	0x708f321	go:(*struct { io.ReadCloser }).Read+0x21						<autogenerated>:1
#	0x7188428	cloud.google.com/go/storage.(*httpReader).Read+0xa8					external/com_google_cloud_go_storage/http_client.go:1341
#	0x49db18f	io.copyBuffer+0x18f									GOROOT/src/io/io.go:429
#	0x718fdbc	io.Copy+0x5c										GOROOT/src/io/io.go:388
#	0x718fd7c	cloud.google.com/go/storage.(*Reader).WriteTo+0x1c					external/com_google_cloud_go_storage/reader.go:302
#	0x49db09c	io.copyBuffer+0x9c									GOROOT/src/io/io.go:411
#	0x71a1507	io.Copy+0x367										GOROOT/src/io/io.go:388
#	0x71a14d9	cloud.google.com/go/storage/transfermanager.(*DownloadObjectInput).downloadShard+0x339	external/com_google_cloud_go_storage/transfermanager/downloader.go:660
#	0x719fff4	cloud.google.com/go/storage/transfermanager.(*Downloader).downloadWorker+0x74		external/com_google_cloud_go_storage/transfermanager/downloader.go:363
#	0x71a1136	cloud.google.com/go/storage/transfermanager.NewDownloader.gowrap2+0x16			external/com_google_cloud_go_storage/transfermanager/downloader.go:555

2942885073034 1384198 @ 0x5154cb3 0x5154b6a 0x708ee8a 0x708f322 0x7188429 0x718fd0c 0x49db423 0x49db190 0x49dae4c 0x49dae2b 0x71a1bc8 0x71a00cf 0x719ffd5 0x71a1137 0x499fb81
#	0x5154cb2	sync.(*Mutex).Unlock+0x12									GOROOT/src/sync/mutex.go:65
#	0x5154b69	golang.org/x/net/http2.transportResponseBody.Read+0x409						external/org_golang_x_net/http2/transport.go:2598
#	0x708ee89	go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*wrappedBody).Read+0x29		external/io_opentelemetry_go_contrib_instrumentation_net_http_otelhttp/transport.go:229
#	0x708f321	go:(*struct { io.ReadCloser }).Read+0x21							<autogenerated>:1
#	0x7188428	cloud.google.com/go/storage.(*httpReader).Read+0xa8						external/com_google_cloud_go_storage/http_client.go:1341
#	0x718fd0b	cloud.google.com/go/storage.(*Reader).Read+0x2b							external/com_google_cloud_go_storage/reader.go:290
#	0x49db422	io.(*LimitedReader).Read+0x42									GOROOT/src/io/io.go:479
#	0x49db18f	io.copyBuffer+0x18f										GOROOT/src/io/io.go:429
#	0x49dae4b	io.Copy+0x8b											GOROOT/src/io/io.go:388
#	0x49dae2a	io.CopyN+0x6a											GOROOT/src/io/io.go:364
#	0x71a1bc7	cloud.google.com/go/storage/transfermanager.(*DownloadObjectInput).downloadFirstShard+0x2a7	external/com_google_cloud_go_storage/transfermanager/downloader.go:706
#	0x71a00ce	cloud.google.com/go/storage/transfermanager.(*Downloader).startDownload+0x6e			external/com_google_cloud_go_storage/transfermanager/downloader.go:382
#	0x719ffd4	cloud.google.com/go/storage/transfermanager.(*Downloader).downloadWorker+0x54			external/com_google_cloud_go_storage/transfermanager/downloader.go:361
#	0x71a1136	cloud.google.com/go/storage/transfermanager.NewDownloader.gowrap2+0x16				external/com_google_cloud_go_storage/transfermanager/downloader.go:555

652185813531 2034435 @ 0x6fe1333 0x6fe1287 0x6fe3617 0x49db70c 0x49dc8c2 0x49db1d4 0x718fdbd 0x718fd7d 0x49db09d 0x71a1508 0x71a14da 0x719fff5 0x71a1137 0x499fb81
#	0x6fe1332	sync.(*Mutex).Unlock+0x12								GOROOT/src/sync/mutex.go:65
#	0x6fe1286	github.com/aws/aws-sdk-go/aws.(*WriteAtBuffer).WriteAt+0x206				external/com_github_aws_aws_sdk_go/aws/types.go:200
#	0x6fe3616	github.com/uber/kraken/lib/store/base.(*BufferReadWriter).WriteAt+0x76			external/com_github_uber_kraken/lib/store/base/buffer_readwriter.go:58
#	0x49db70b	io.(*OffsetWriter).Write+0x2b								GOROOT/src/io/io.go:583
#	0x49dc8c1	io.(*multiWriter).Write+0x61								GOROOT/src/io/multi.go:85
#	0x49db1d3	io.copyBuffer+0x1d3									GOROOT/src/io/io.go:431
#	0x718fdbc	io.Copy+0x5c										GOROOT/src/io/io.go:388
#	0x718fd7c	cloud.google.com/go/storage.(*Reader).WriteTo+0x1c					external/com_google_cloud_go_storage/reader.go:302
#	0x49db09c	io.copyBuffer+0x9c									GOROOT/src/io/io.go:411
#	0x71a1507	io.Copy+0x367										GOROOT/src/io/io.go:388
#	0x71a14d9	cloud.google.com/go/storage/transfermanager.(*DownloadObjectInput).downloadShard+0x339	external/com_google_cloud_go_storage/transfermanager/downloader.go:660
#	0x719fff4	cloud.google.com/go/storage/transfermanager.(*Downloader).downloadWorker+0x74		external/com_google_cloud_go_storage/transfermanager/downloader.go:363
#	0x71a1136	cloud.google.com/go/storage/transfermanager.NewDownloader.gowrap2+0x16			external/com_google_cloud_go_storage/transfermanager/downloader.go:555

Some observations:

  1. Mismatch between the actual blob download size and the Stat call is rare; growth of the buffer and the need to acquire a lock at that time are rare but required.
  2. The blobs that are supposed to be downloaded are chunked, and ideally, there shouldn't be any overlap b/w chunks. If it is there, the caller should handle it rather than the buffer getting blocked.

This change removes aws.WriteAtBuffer and adds a normal byte array that can be edited by simultaneous writers. Also, to handle the rare scenario where the buffer is smaller than expected, a lock is acquired, and a new buffer is allocated with the new size.

Testing

Benchmarks

goos: linux
goarch: amd64
pkg: github.com/uber/kraken/lib/store/base
cpu: AMD EPYC 7B13
                                                    │ /home/user/kraken/bench-results/run-a2-20260508-142356-n100/before.txt │ /home/user/kraken/bench-results/run-a2-20260508-142356-n100/after.txt │
                                                    │                                 sec/op                                 │                    sec/op                     vs base                 │
BufferReadWriter_WriteAt/presized_1_shards-96                                                                   868.2µ ±  2%                                   908.0µ ±  3%   +4.58% (p=0.000 n=100)
BufferReadWriter_WriteAt/presized_4_shards-96                                                                   2.544m ±  2%                                   1.898m ±  1%  -25.40% (n=100)
BufferReadWriter_WriteAt/presized_10_shards-96                                                                  7.009m ±  1%                                   6.082m ± 12%  -13.22% (p=0.002 n=100)
BufferReadWriter_WriteAt/half_presized_1_shards-96                                                              1.195m ±  2%                                   1.337m ±  2%  +11.93% (p=0.000 n=100)
BufferReadWriter_WriteAt/half_presized_4_shards-96                                                              4.499m ± 80%                                   3.818m ±  1%  -15.13% (p=0.000 n=100)
BufferReadWriter_WriteAt/half_presized_10_shards-96                                                             9.041m ±  1%                                  10.472m ±  7%  +15.83% (p=0.000 n=100)
BufferReadWriter_WriteAt/dynamic_1_shards-96                                                                    872.6µ ±  2%                                   855.7µ ±  1%   -1.93% (p=0.002 n=100)
BufferReadWriter_WriteAt/dynamic_4_shards-96                                                                    2.551m ±  1%                                   2.151m ±  1%  -15.69% (n=100)
BufferReadWriter_WriteAt/dynamic_10_shards-96                                                                   7.335m ±  2%                                   8.950m ±  7%  +22.01% (p=0.000 n=100)
geomean                                                                                                         2.847m                                         2.758m         -3.13%

                                                    │ /home/user/kraken/bench-results/run-a2-20260508-142356-n100/before.txt │ /home/user/kraken/bench-results/run-a2-20260508-142356-n100/after.txt │
                                                    │                                  B/s                                   │                     B/s                       vs base                 │
BufferReadWriter_WriteAt/presized_1_shards-96                                                                  4.499Gi ±  2%                                  4.302Gi ±  3%   -4.38% (p=0.000 n=100)
BufferReadWriter_WriteAt/presized_4_shards-96                                                                  6.141Gi ±  2%                                  8.233Gi ±  1%  +34.05% (p=0.000 n=100)
BufferReadWriter_WriteAt/presized_10_shards-96                                                                 5.573Gi ±  1%                                  6.422Gi ± 11%  +15.24% (p=0.002 n=100)
BufferReadWriter_WriteAt/half_presized_1_shards-96                                                             3.270Gi ±  2%                                  2.921Gi ±  2%  -10.66% (p=0.000 n=100)
BufferReadWriter_WriteAt/half_presized_4_shards-96                                                             3.473Gi ± 45%                                  4.092Gi ±  1%  +17.82% (p=0.000 n=100)
BufferReadWriter_WriteAt/half_presized_10_shards-96                                                            4.321Gi ±  2%                                  3.730Gi ±  8%  -13.66% (p=0.000 n=100)
BufferReadWriter_WriteAt/dynamic_1_shards-96                                                                   4.477Gi ±  2%                                  4.565Gi ±  1%   +1.97% (p=0.002 n=100)
BufferReadWriter_WriteAt/dynamic_4_shards-96                                                                   6.125Gi ±  1%                                  7.265Gi ±  1%  +18.61% (p=0.000 n=100)
BufferReadWriter_WriteAt/dynamic_10_shards-96                                                                  5.326Gi ±  2%                                  4.365Gi ±  7%  -18.04% (p=0.000 n=100)
geomean                                                                                                        4.693Gi                                        4.844Gi         +3.23%

                                                    │ /home/user/kraken/bench-results/run-a2-20260508-142356-n100/before.txt │ /home/user/kraken/bench-results/run-a2-20260508-142356-n100/after.txt │
                                                    │                          mutex-contentions/op                          │             mutex-contentions/op              vs base                 │
BufferReadWriter_WriteAt/presized_1_shards-96                                                                    6.043 ±  2%                                    4.080 ±  5%  -32.48% (n=100)
BufferReadWriter_WriteAt/presized_4_shards-96                                                                    8.942 ±  1%                                    6.302 ±  5%  -29.53% (n=100)
BufferReadWriter_WriteAt/presized_10_shards-96                                                                  18.395 ±  2%                                    5.376 ± 31%  -70.78% (n=100)
BufferReadWriter_WriteAt/half_presized_1_shards-96                                                               8.229 ±  2%                                    6.889 ±  2%  -16.28% (n=100)
BufferReadWriter_WriteAt/half_presized_4_shards-96                                                               13.32 ± 22%                                    12.29 ±  2%   -7.73% (p=0.020 n=100)
BufferReadWriter_WriteAt/half_presized_10_shards-96                                                              28.55 ±  3%                                    23.97 ±  6%  -16.07% (p=0.000 n=100)
BufferReadWriter_WriteAt/dynamic_1_shards-96                                                                     7.012 ±  2%                                    6.286 ±  3%  -10.36% (n=100)
BufferReadWriter_WriteAt/dynamic_4_shards-96                                                                    10.680 ±  2%                                    9.564 ±  2%  -10.44% (n=100)
BufferReadWriter_WriteAt/dynamic_10_shards-96                                                                    20.36 ±  1%                                    17.07 ±  7%  -16.18% (n=100)
geomean                                                                                                          11.89                                          8.708        -26.76%

                                                    │ /home/user/kraken/bench-results/run-a2-20260508-142356-n100/before.txt │ /home/user/kraken/bench-results/run-a2-20260508-142356-n100/after.txt │
                                                    │                                  B/op                                  │                     B/op                       vs base                │
BufferReadWriter_WriteAt/presized_1_shards-96                                                                   4.002Mi ± 0%                                    4.002Mi ± 0%       ~ (p=0.381 n=100)
BufferReadWriter_WriteAt/presized_4_shards-96                                                                   16.01Mi ± 0%                                    16.01Mi ± 0%  -0.01% (n=100)
BufferReadWriter_WriteAt/presized_10_shards-96                                                                  40.02Mi ± 0%                                    40.02Mi ± 0%  -0.01% (p=0.000 n=100)
BufferReadWriter_WriteAt/half_presized_1_shards-96                                                              6.003Mi ± 0%                                    6.003Mi ± 0%  +0.01% (p=0.000 n=100)
BufferReadWriter_WriteAt/half_presized_4_shards-96                                                              24.02Mi ± 0%                                    24.01Mi ± 0%  -0.04% (p=0.000 n=100)
BufferReadWriter_WriteAt/half_presized_10_shards-96                                                             60.02Mi ± 0%                                    60.03Mi ± 0%  +0.01% (p=0.007 n=100)
BufferReadWriter_WriteAt/dynamic_1_shards-96                                                                    4.002Mi ± 0%                                    4.002Mi ± 0%  -0.00% (p=0.002 n=100)
BufferReadWriter_WriteAt/dynamic_4_shards-96                                                                    16.01Mi ± 0%                                    16.08Mi ± 0%  +0.43% (p=0.000 n=100)
BufferReadWriter_WriteAt/dynamic_10_shards-96                                                                   40.07Mi ± 0%                                    43.03Mi ± 1%  +7.41% (p=0.000 n=100)
geomean                                                                                                         15.67Mi                                         15.80Mi       +0.84%

                                                    │ /home/user/kraken/bench-results/run-a2-20260508-142356-n100/before.txt │ /home/user/kraken/bench-results/run-a2-20260508-142356-n100/after.txt │
                                                    │                               allocs/op                                │                      allocs/op                        vs base         │
BufferReadWriter_WriteAt/presized_1_shards-96                                                                     6.000 ± 0%                                             5.000 ± 0%  -16.67% (n=100)
BufferReadWriter_WriteAt/presized_4_shards-96                                                                     14.00 ± 7%                                             11.00 ± 0%  -21.43% (n=100)
BufferReadWriter_WriteAt/presized_10_shards-96                                                                    29.00 ± 3%                                             24.00 ± 0%  -17.24% (n=100)
BufferReadWriter_WriteAt/half_presized_1_shards-96                                                                7.000 ± 0%                                             6.000 ± 0%  -14.29% (n=100)
BufferReadWriter_WriteAt/half_presized_4_shards-96                                                                15.00 ± 0%                                             13.00 ± 0%  -13.33% (n=100)
BufferReadWriter_WriteAt/half_presized_10_shards-96                                                               34.00 ± 3%                                             27.00 ± 4%  -20.59% (n=100)
BufferReadWriter_WriteAt/dynamic_1_shards-96                                                                      6.000 ± 0%                                             5.000 ± 0%  -16.67% (n=100)
BufferReadWriter_WriteAt/dynamic_4_shards-96                                                                      13.00 ± 0%                                             11.00 ± 0%  -15.38% (n=100)
BufferReadWriter_WriteAt/dynamic_10_shards-96                                                                     27.00 ± 0%                                             24.00 ± 0%  -11.11% (n=100)
geomean                                                                                                           13.81                                                  11.55       -16.36%

Copilot AI review requested due to automatic review settings May 8, 2026 17:02
@sambhav-jain-16 sambhav-jain-16 force-pushed the lock-less-buffer branch 2 times, most recently from f682f0d to 8a56bcc Compare May 8, 2026 17:06
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR replaces the in-memory BufferReadWriter implementation to reduce lock contention during concurrent WriteAt workloads (e.g., sharded downloads/uploads), and adds tests/benchmarks to validate correctness and quantify performance.

Changes:

  • Replaced the AWS WriteAtBuffer-backed implementation with a custom []byte buffer guarded by an RWMutex plus an atomic “written extent”.
  • Updated Size(), Bytes(), Read(), ReadAt(), and SeekEnd semantics to reflect only the maximum written extent (not the preallocated capacity).
  • Added a concurrent WriteAt correctness test and a benchmark suite that reports mutex-contentions/op.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
lib/store/base/buffer_readwriter.go New RWMutex/atomic-based buffer implementation to reduce serialization for non-overlapping concurrent WriteAt calls.
lib/store/base/buffer_readwriter_test.go Added concurrency test and benchmarks to validate/measure parallel WriteAt behavior and contention.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lib/store/base/buffer_readwriter.go
Comment thread lib/store/base/buffer_readwriter_test.go
Comment thread lib/store/base/buffer_readwriter_test.go Outdated
Comment thread lib/store/base/buffer_readwriter_test.go
@sambhav-jain-16 sambhav-jain-16 changed the title Lock less buffer perf(lib/buffer_read_writer): make buffered read writer lock-free to reduce mutexes May 8, 2026
@sambhav-jain-16 sambhav-jain-16 requested a review from Copilot May 8, 2026 17:16
@sambhav-jain-16 sambhav-jain-16 changed the title perf(lib/buffer_read_writer): make buffered read writer lock-free to reduce mutexes perf(lib/buffer_readwriter): make buffered read writer lock-free to reduce mutexes May 8, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.

Comment thread lib/store/base/buffer_readwriter.go
Comment thread lib/store/base/buffer_readwriter.go
Comment thread lib/store/base/buffer_readwriter.go
Comment thread lib/store/base/buffer_readwriter.go
Comment thread lib/store/base/buffer_readwriter.go
Comment thread lib/store/base/buffer_readwriter.go
Comment thread lib/store/base/buffer_readwriter_test.go
Comment thread lib/store/base/buffer_readwriter_test.go Outdated
Copilot AI review requested due to automatic review settings May 11, 2026 13:22
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

Comment on lines +47 to +48
// Pass the exact blob size when known to enable lock-free
// concurrent WriteAt calls for non-overlapping shard ranges.
Comment on lines +70 to +76
end := off + int64(len(p))
if end < off {
return 0, fmt.Errorf("write at offset %d length %d overflows int64", off, len(p))
}

b.mu.RLock()
if end <= int64(len(b.buf)) {
Comment thread lib/store/base/buffer_readwriter.go
Comment thread lib/store/base/buffer_readwriter.go
Comment thread lib/store/base/buffer_readwriter.go
Comment thread lib/store/base/buffer_readwriter.go
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants