@@ -12,7 +12,7 @@ Originally posted as Fast TotW #39 on January 22, 2021
 
 *By [Chris Kennelly](mailto:ckennelly@google.com) and [Alkis Evlogimenos](mailto:alkis@evlogimenos.com)*
 
-Updated 2023-10-10
+Updated 2025-03-24
 
 Quicklink: [abseil.io/fast/39](https://abseil.io/fast/39)
 
@@ -146,14 +146,14 @@ would "reduce" the
 [data center tax](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44271.pdf),
 but we would actually hurt [application productivity](/fast/7)-per-CPU. Time we
 spend in malloc is
-[less important than application performance](https://research.google/pubs/pub50370.pdf).
+[less important than application performance](https://storage.googleapis.com/gweb-research2023-media/pubtools/6170.pdf).
 
 Trace-driven simulations with hardware-validated architectural simulators showed
 the prefetched data was frequently used. Additionally, it is better to stall on
 a TLB miss at the prefetch site--which has no dependencies, than to stall at the
 point of use.
 
-## Pitfalls
+## Pitfalls {#pitfalls}
 
 There are a number of things that commonly go wrong when writing benchmarks. The
 following is a non-exhaustive list:
@@ -175,15 +175,23 @@ following is a non-exhaustive list:
   [Stabilizer (by Berger, et. al.)](https://people.cs.umass.edu/~emery/pubs/stabilizer-asplos13.pdf)
   deliberately perturb these parameters to improve benchmarking statistical
   quality.
+* Sensitivity to stack alignment. Changes anywhere in the stack--added/removed
+  variables, better (or worse) spilling due to compiler optimizations,
+  etc.--can affect the alignment at the start of the function-under-test. This
+  has been seen to produce 20% performance swings.
 * Representative data. The data in the benchmark needs to be "similar" to the
   data in production - for example, imagine having short strings in the
   benchmark, and long strings in the fleet. This also extends to the code
   paths in the benchmarks being similar to the code paths that the application
-  exercises.
+  exercises. This is a common pain point for macrobenchmarks too. A loadtest
+  may cover certain request types, rather than all of those seen by production
+  servers.
+
 * Benchmarking the right code. It's very easy to introduce code into the
   benchmark that's not present in the real workload. For example, using a
   random number generator's cost for a benchmark could exceed the cost of the
   work being benchmarked.
+
 * Being aware of steady state vs dynamic behaviour. For more complex
   benchmarks it's easy to produce something that converges to a steady state -
   for example if it has a constant arrival rate and service time. Production