Commit e7c1b4d

benchmarks: Align buf to cache line for consistency (ghostty-org#8569)
This aligns the `buf` of `4096` bytes in the benchmarks to the cache line, to ensure a consistent number of cache lines are used, and also to avoid any sub-`usize` alignment issues as seen in ghostty-org#8548.

This has less of an effect than ghostty-org#8548, and looking at the before and after of the current benchmarks in the repo doesn't show any noticeable difference. In my case, I've been comparing the `table` option with [uucode in this branch](https://github.com/ghostty-org/ghostty/compare/main...jacobsandlund:jacob/uucode?expand=1), and I did see a difference.

### Before

I ran the before code several times (6 with the exact same binary, but several more with essentially the same code), always getting something like this, with `table` edging out `uucode` by something like 3-4ms:

```
Benchmark 1: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table
  Time (mean ± σ):     927.8 ms ±   1.3 ms    [User: 883.7 ms, System: 42.5 ms]
  Range (min … max):   926.0 ms … 929.8 ms    10 runs

Benchmark 2: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode
  Time (mean ± σ):     930.9 ms ±   1.4 ms    [User: 886.8 ms, System: 42.5 ms]
  Range (min … max):   928.5 ms … 933.4 ms    10 runs
```

### After

After this change, it shows `uucode` coming in at 10-11ms (~1%) faster:

```
Benchmark 1: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table
  Time (mean ± σ):     930.6 ms ±   1.3 ms    [User: 886.5 ms, System: 42.4 ms]
  Range (min … max):   928.9 ms … 932.4 ms    10 runs

Benchmark 2: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode
  Time (mean ± σ):     920.1 ms ±   1.4 ms    [User: 876.3 ms, System: 42.1 ms]
  Range (min … max):   918.4 ms … 923.3 ms    10 runs

Summary
  zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode ran
    1.01 ± 0.00 times faster than zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table
```

This ~1% faster time checks out, since from looking at the assembly, it's an exact match minus this small place
where the compiler can optimize `uucode` a little better:

```
# both table.asm/uucode.asm:
   140        const high = cp >> 8;
   141        const low = cp & 0xFF;
** 142        return self.stage3[self.stage2[self.stage1[high] + low]];
   <+464>:  ubfx  x12, x11, #8, #13
   <+468>:  ldrh  w12, [x27, x12, lsl #1]
   <+472>:  add   x11, x28, w11, uxtb #1
   <+476>:  ldrh  w11, [x11, x12, lsl #1]

# table.asm:
   <+480>:  lsl   x11, x11, #1
** 158                    table.get(@intCast(cp)).width);
   159            }
   160        }
   <+484>:  ldrb  w11, [x22, x11]

# uucode.asm:
** 148            return @field(data(stages, cp), name);
   <+480>:  ldrh  w11, [x22, x11, lsl #1]
```

### More confusion with showing addresses

Confusingly, when I added `std.debug.print("buf addr={}\n", .{@intFromPtr(&buf)})` to show the addresses, this somehow made the `before` case show `uucode` as being faster. Then, when I added alignment, `uucode` and `table` were taking about the same time (**edit:** _uucode was only ~4 ms faster, but see more in "Edit: more investigation"_).

If I run without the `std.debug.print` and with `--show-output`, the times are different, so just making a note of this.

```
Benchmark 1: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table
  Time (mean ± σ):     904.2 ms ±   1.2 ms    [User: 884.6 ms, System: 40.3 ms]
  Range (min … max):   902.8 ms … 906.1 ms    10 runs

Benchmark 2: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode
  Time (mean ± σ):     892.7 ms ±   2.0 ms    [User: 873.2 ms, System: 40.1 ms]
  Range (min … max):   887.9 ms … 895.6 ms    10 runs

Summary
  zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode ran
    1.01 ± 0.00 times faster than zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table
```

I think, even with this confusing case, aligning is going to be more consistent than not.
### Edit: more investigation

I wasn't satisfied with the discovery that adding `std.debug.print` made this difference and I wanted to dig in and figure out exactly what's going on, but I didn't get a satisfactory answer. Here's what I tried:

* I compared the un-aligned addresses from `stepTable` and `stepUucode`, but both seemed similar (not aligned to 128, different each run, but aligned to 8). Note though that `uucode` was running ~1% faster still, similar to the aligned case even though here it was un-aligned.
* Instead of doing `std.debug.print` in the step function, I printed in teardown, just in case. This had no difference in the unaligned case, but with alignment it brought the ~4 ms faster `uucode` (as noted above) back closer to the original "after" at around 11-12 ms faster (~1%).
* I forced the `buf` in `stepUucode` to not be aligned (e.g. by making it `= other_aligned_buf[3..4096 + 3]`). Still it was ~1% faster.
* I compared the assembly of `stepTable` and `stepUucode` for both aligned and not aligned cases, including doing a diff of the diff of these two across aligned and not aligned. The only difference between `stepTable` and `stepUucode` is what's noted above, and nothing stood out in the double diff.
* I tried going back to the original un-aligned non-printing code, but then swapped the lines that get from `table` or `uucode`, so that `stepTable` and `stepUucode` were actually doing the opposite. And the result is `stepTable` (actually `uucode`) was 10-11 ms (~1%) faster, just like the aligned case!

In summary, I wasn't able to replicate the original benchmark behavior _and print out buffer addresses that pointed to alignment being the issue_. I still feel like in theory aligning the buffer ought to make the benchmark more reliable, and indeed the original un-aligned version gives the result that is more of an outlier, but the evidence here is weak, so I'm alright if we stick with the status quo and close.
I think a lesson here is that benchmarks are hard to get precise.
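
For anyone wanting to repeat the address checks from the investigation above, here is a minimal sketch of the two buffer setups: one aligned with `align(std.atomic.cache_line)` as in this change, and one deliberately offset by 3 bytes in the style of `other_aligned_buf[3..4096 + 3]`. The helper `isCacheLineAligned` and the standalone `main` are assumptions made for illustration and are not part of the benchmark code.

```zig
const std = @import("std");

// Cache line size for the compilation target (target-dependent).
const line: usize = std.atomic.cache_line;

/// Hypothetical helper: true when `addr` sits on a cache-line boundary.
fn isCacheLineAligned(addr: usize) bool {
    return std.mem.isAligned(addr, line);
}

pub fn main() void {
    // Aligned the same way as the benchmark buffers after this change.
    var aligned_buf: [4096]u8 align(std.atomic.cache_line) = undefined;
    @memset(&aligned_buf, 0);

    // Backing storage with 3 spare bytes so a 4096-byte window can start
    // at an offset that is guaranteed to be off the cache-line boundary.
    var other_aligned_buf: [4096 + 3]u8 align(std.atomic.cache_line) = undefined;
    @memset(&other_aligned_buf, 0);
    const buf: []u8 = other_aligned_buf[3 .. 4096 + 3];

    std.debug.print("aligned_buf addr=0x{x} on cache line={}\n", .{
        @intFromPtr(&aligned_buf),
        isCacheLineAligned(@intFromPtr(&aligned_buf)),
    });
    std.debug.print("buf addr=0x{x} on cache line={}\n", .{
        @intFromPtr(buf.ptr),
        isCacheLineAligned(@intFromPtr(buf.ptr)),
    });
}
```

As noted above, adding a print like this to the step function changed the timings by itself, so a check of this kind is probably best kept out of the measured path.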
2 parents cd6820e + 77b4c52 commit e7c1b4d

3 files changed

Lines changed: 7 additions & 7 deletions

src/benchmark/CodepointWidth.zig

Lines changed: 3 additions & 3 deletions
@@ -109,7 +109,7 @@ fn stepWcwidth(ptr: *anyopaque) Benchmark.Error!void {
     const f = self.data_f orelse return;
     var r = std.io.bufferedReader(f.reader());
     var d: UTF8Decoder = .{};
-    var buf: [4096]u8 = undefined;
+    var buf: [4096]u8 align(std.atomic.cache_line) = undefined;
     while (true) {
         const n = r.read(&buf) catch |err| {
             log.warn("error reading data file err={}", .{err});
@@ -133,7 +133,7 @@ fn stepTable(ptr: *anyopaque) Benchmark.Error!void {
     const f = self.data_f orelse return;
     var r = std.io.bufferedReader(f.reader());
     var d: UTF8Decoder = .{};
-    var buf: [4096]u8 = undefined;
+    var buf: [4096]u8 align(std.atomic.cache_line) = undefined;
     while (true) {
         const n = r.read(&buf) catch |err| {
             log.warn("error reading data file err={}", .{err});
@@ -162,7 +162,7 @@ fn stepSimd(ptr: *anyopaque) Benchmark.Error!void {
     const f = self.data_f orelse return;
     var r = std.io.bufferedReader(f.reader());
     var d: UTF8Decoder = .{};
-    var buf: [4096]u8 = undefined;
+    var buf: [4096]u8 align(std.atomic.cache_line) = undefined;
     while (true) {
         const n = r.read(&buf) catch |err| {
             log.warn("error reading data file err={}", .{err});

src/benchmark/GraphemeBreak.zig

Lines changed: 2 additions & 2 deletions
@@ -92,7 +92,7 @@ fn stepNoop(ptr: *anyopaque) Benchmark.Error!void {
     const f = self.data_f orelse return;
     var r = std.io.bufferedReader(f.reader());
     var d: UTF8Decoder = .{};
-    var buf: [4096]u8 = undefined;
+    var buf: [4096]u8 align(std.atomic.cache_line) = undefined;
     while (true) {
         const n = r.read(&buf) catch |err| {
             log.warn("error reading data file err={}", .{err});
@@ -114,7 +114,7 @@ fn stepTable(ptr: *anyopaque) Benchmark.Error!void {
     var d: UTF8Decoder = .{};
     var state: unicode.GraphemeBreakState = .{};
     var cp1: u21 = 0;
-    var buf: [4096]u8 = undefined;
+    var buf: [4096]u8 align(std.atomic.cache_line) = undefined;
     while (true) {
         const n = r.read(&buf) catch |err| {
             log.warn("error reading data file err={}", .{err});

src/benchmark/IsSymbol.zig

Lines changed: 2 additions & 2 deletions
@@ -91,7 +91,7 @@ fn stepZiglyph(ptr: *anyopaque) Benchmark.Error!void {
     const f = self.data_f orelse return;
     var r = std.io.bufferedReader(f.reader());
     var d: UTF8Decoder = .{};
-    var buf: [4096]u8 = undefined;
+    var buf: [4096]u8 align(std.atomic.cache_line) = undefined;
     while (true) {
         const n = r.read(&buf) catch |err| {
             log.warn("error reading data file err={}", .{err});
@@ -115,7 +115,7 @@ fn stepTable(ptr: *anyopaque) Benchmark.Error!void {
     const f = self.data_f orelse return;
     var r = std.io.bufferedReader(f.reader());
     var d: UTF8Decoder = .{};
-    var buf: [4096]u8 = undefined;
+    var buf: [4096]u8 align(std.atomic.cache_line) = undefined;
     while (true) {
         const n = r.read(&buf) catch |err| {
             log.warn("error reading data file err={}", .{err});
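
The commit message's point about "a consistent number of cache lines" comes down to simple arithmetic, sketched below under stated assumptions. `cacheLinesTouched` is a hypothetical helper, not something in the repository, and the 128-byte line size in the comments is only an example value matching the "aligned to 128" remark in the message; `std.atomic.cache_line` varies by target.

```zig
const std = @import("std");

/// Hypothetical helper: number of cache lines touched by `len` bytes
/// starting at address `addr`, for a cache line of `line` bytes.
fn cacheLinesTouched(addr: usize, len: usize, line: usize) usize {
    std.debug.assert(len > 0);
    std.debug.assert(line > 0);
    const first = addr / line; // index of the line holding the first byte
    const last = (addr + len - 1) / line; // index of the line holding the last byte
    return last - first + 1;
}

pub fn main() void {
    const line: usize = std.atomic.cache_line;
    // A 4096-byte buffer starting on a line boundary always covers
    // exactly 4096 / line cache lines.
    std.debug.print("aligned:     {d} lines\n", .{cacheLinesTouched(0, 4096, line)});
    // The same buffer starting 8 bytes past a boundary (only 8-byte
    // aligned) spills into one additional line.
    std.debug.print("offset by 8: {d} lines\n", .{cacheLinesTouched(8, 4096, line)});
}
```

With a 128-byte line, the aligned case covers 32 lines and any start that is not on a line boundary covers 33, so without the `align` annotation the count depends on where the stack happens to place `buf`.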
