Commit e7c1b4d
benchmarks: Align buf to cache line for consistency (ghostty-org#8569)
This aligns the `buf` of `4096` bytes in the benchmarks to the cache
line, to ensure a consistent number of cache lines are used, and also to
avoid any sub-`usize` alignment issues as seen in
ghostty-org#8548.
This has less of an effect than
ghostty-org#8548, and comparing the
before and after of the current benchmarks in the repo doesn't show any
noticeable difference.
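As a minimal sketch of what this kind of alignment looks like in Zig (not the exact lines from the diff; using `std.atomic.cache_line` as the alignment value is an assumption), a stack buffer can be declared with an explicit `align` and its address checked:

```zig
const std = @import("std");

pub fn main() void {
    // Assumption: the cache line size comes from std.atomic.cache_line,
    // a target-dependent constant (128 on aarch64).
    var buf: [4096]u8 align(std.atomic.cache_line) = undefined;
    @memset(&buf, 0);

    // Print the address and whether it is a multiple of the cache line size.
    const addr = @intFromPtr(&buf);
    std.debug.print("buf addr={} aligned={}\n", .{
        addr,
        std.mem.isAligned(addr, std.atomic.cache_line),
    });
}
```

With the explicit `align`, the 4096-byte buffer always starts on a cache line boundary, so it spans the same number of cache lines on every run.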
In my case, I've been comparing the `table` option with [uucode in this
branch](https://github.com/ghostty-org/ghostty/compare/main...jacobsandlund:jacob/uucode?expand=1),
and I did see a difference.
### Before
I ran the before code several times (6 with the exact same binary, but
several more with essentially the same code), always getting something
like this, with `table` edging out `uucode` by something like 3-4ms:
```
Benchmark 1: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table
Time (mean ± σ): 927.8 ms ± 1.3 ms [User: 883.7 ms, System: 42.5 ms]
Range (min … max): 926.0 ms … 929.8 ms 10 runs
Benchmark 2: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode
Time (mean ± σ): 930.9 ms ± 1.4 ms [User: 886.8 ms, System: 42.5 ms]
Range (min … max): 928.5 ms … 933.4 ms 10 runs
```
### After
After this change, `uucode` comes in 10-11 ms (~1%) faster:
```
Benchmark 1: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table
Time (mean ± σ): 930.6 ms ± 1.3 ms [User: 886.5 ms, System: 42.4 ms]
Range (min … max): 928.9 ms … 932.4 ms 10 runs
Benchmark 2: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode
Time (mean ± σ): 920.1 ms ± 1.4 ms [User: 876.3 ms, System: 42.1 ms]
Range (min … max): 918.4 ms … 923.3 ms 10 runs
Summary
zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode ran
1.01 ± 0.00 times faster than zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table
```
This ~1% faster time checks out, since from looking at the assembly,
it's an exact match minus this small place where the compiler can
optimize `uucode` a little better:
```
# both table.asm/uucode.asm:
140 const high = cp >> 8;
141 const low = cp & 0xFF;
** 142 return self.stage3[self.stage2[self.stage1[high] + low]];
<+464>: ubfx x12, x11, #8, #13
<+468>: ldrh w12, [x27, x12, lsl #1]
<+472>: add x11, x28, w11, uxtb #1
<+476>: ldrh w11, [x11, x12, lsl #1]
# table.asm:
<+480>: lsl x11, x11, #1
** 158 table.get(@intCast(cp)).width);
159 }
160 }
<+484>: ldrb w11, [x22, x11]
# uucode.asm:
** 148 return @field(data(stages, cp), name);
<+480>: ldrh w11, [x22, x11, lsl #1]
```
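For readers unfamiliar with the three-stage lookup in the listing above, here is a small sketch of the same indexing scheme. The tables are hypothetical miniatures that I made up for illustration (they only cover U+0000..U+02FF, and the "width" values are invented); the real tables are generated from Unicode data.

```zig
const std = @import("std");

// Hypothetical miniature tables, for illustration only.
// stage1 maps each 256-codepoint block to an offset into stage2;
// blocks 0 and 1 share the same stage2 block (offset 0).
const stage1 = [_]u16{ 0, 0, 256 };
// stage2 holds indexes into stage3: one block of 0s, one block of 1s.
const stage2 = ([_]u16{0} ** 256) ++ ([_]u16{1} ** 256);
// stage3 holds the final values (made-up widths).
const stage3 = [_]u8{ 1, 2 };

fn get(cp: u21) u8 {
    const high = cp >> 8;
    const low = cp & 0xFF;
    return stage3[stage2[stage1[high] + low]];
}

pub fn main() void {
    std.debug.print("{} {} {}\n", .{ get(0x41), get(0x141), get(0x241) });
}
```

Identical blocks are stored once in `stage2` and shared through `stage1`, which is what keeps this kind of table small. Both modes use the same three dependent loads; only the final load of the result differs, as shown in the assembly.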
### More confusion with showing addresses
Confusingly, when I added `std.debug.print("buf addr={}\n",
.{@intFromPtr(&buf)})` to show the addresses, this somehow made the
`before` case show `uucode` as being faster. Then, when I added
alignment, `uucode` and `table` took about the same time
(**edit:** _`uucode` was only ~4 ms faster, but see more in "Edit: more
investigation"_).
If I run without the `std.debug.print` and with `--show-output`, the
times are different, so just making a note of this.
```
Benchmark 1: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table
Time (mean ± σ): 904.2 ms ± 1.2 ms [User: 884.6 ms, System: 40.3 ms]
Range (min … max): 902.8 ms … 906.1 ms 10 runs
Benchmark 2: zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode
Time (mean ± σ): 892.7 ms ± 2.0 ms [User: 873.2 ms, System: 40.1 ms]
Range (min … max): 887.9 ms … 895.6 ms 10 runs
Summary
zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=uucode ran
1.01 ± 0.00 times faster than zig-out/bin/ghostty-bench +codepoint-width --data=data.txt --mode=table
```
I think, even with this confusing case, aligning is going to be more
consistent than not.
### Edit: more investigation
I wasn't satisfied with the discovery that adding `std.debug.print` made
this difference and I wanted to dig in and figure out exactly what's
going on, but I didn't get a satisfactory answer. Here's what I tried:
* I compared the un-aligned addresses from `stepTable` and `stepUucode`,
but both seemed similar (not aligned to 128, different each run, but
aligned to 8). Note though that `uucode` was running ~1% faster still,
similar to the aligned case even though here it was un-aligned.
* Instead of doing `std.debug.print` in the step function, I printed in
teardown, just in case. This made no difference in the unaligned case,
but with alignment it brought the ~4 ms faster `uucode` (as noted above)
back closer to the original "after" at around 11-12 ms faster (~1%).
* I forced the `buf` in `stepUucode` to not be aligned (e.g. by making
it `= other_aligned_buf[3..4096 + 3]`). Still it was ~1% faster.
* I compared the assembly of `stepTable` and `stepUucode` for both
aligned and not aligned cases, including doing a diff of the diff of
these two across aligned and not aligned. The only difference between
`stepTable` and `stepUucode` is what's noted above, and nothing stood
out in the double diff.
* I tried going back to the original un-aligned non-printing code, but
then swapped the lines that get from `table` or `uucode`, so that
`stepTable` and `stepUucode` were actually doing the opposite. The
result was that `stepTable` (actually `uucode`) was 10-11 ms (~1%) faster, just
like the aligned case!
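The forced-misalignment experiment from the list above can be sketched roughly like this. It is a standalone illustration, not the benchmark code: the backing array size, the `line` constant, and the printing are assumptions added for the example.

```zig
const std = @import("std");

const line = std.atomic.cache_line;

pub fn main() void {
    // Hypothetical backing storage: cache-line aligned, with 3 extra bytes
    // so a 4096-byte window can start at offset 3.
    var other_aligned_buf: [4096 + 3]u8 align(line) = undefined;
    @memset(&other_aligned_buf, 0);

    // Starting 3 bytes in gives a 4096-byte slice that is neither
    // cache-line aligned nor usize aligned.
    const buf: []u8 = other_aligned_buf[3 .. 4096 + 3];

    const addr = @intFromPtr(buf.ptr);
    std.debug.print("buf addr={} mod line={} mod usize={}\n", .{
        addr,
        addr % line,
        addr % @alignOf(usize),
    });
}
```

Because the backing array itself is aligned, the offset slice is misaligned by the same amount on every run, which makes the unaligned case reproducible instead of depending on where the stack happens to land.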
In summary, I wasn't able to replicate the original benchmark behavior
_and print out buffer addresses that pointed to alignment being the
issue_. I still feel like in theory aligning the buffer ought to make
the benchmark more reliable, and indeed the original un-aligned version
gives the result that is more of an outlier, but the evidence here is
weak, so I'm alright if we stick with the status quo and close. I think
a lesson here is that benchmarks are hard to get precise.
3 files changed
Lines changed: 7 additions & 7 deletions