Commit b9e4e0e

Merge pull request #520 from aalexand/fast-99-etc
Publish fast/26, fast/94, fast/95, fast/97, fast/98, fast/99 episodes.
2 parents 39a5ea2 + 1dca477 commit b9e4e0e

24 files changed

Lines changed: 1797 additions & 71 deletions

_posts/2023-03-02-fast-21.md

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #21 on January 16, 2020

 *By [Paul Wankadia](mailto:junyer@google.com) and [Darryl Gove](mailto:djgove@google.com)*

-Updated 2024-10-21
+Updated 2025-09-03

 Quicklink: [abseil.io/fast/21](https://abseil.io/fast/21)

_posts/2023-03-02-fast-39.md

Lines changed: 6 additions & 5 deletions
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #39 on January 22, 2021

 *By [Chris Kennelly](mailto:ckennelly@google.com) and [Alkis Evlogimenos](mailto:alkis@evlogimenos.com)*

-Updated 2025-03-24
+Updated 2025-09-29

 Quicklink: [abseil.io/fast/39](https://abseil.io/fast/39)

@@ -112,10 +112,11 @@ challenging: Microbenchmarks tend to have small working sets that tend to be
 cache resident. Real code, particularly Google C++, is not.

 In production, the cacheline holding `kMasks` might be evicted, leading to much
-worse stalls (hundreds of cycles to access main memory). Additionally, on x86
-processors since Haswell, this [optimization can be past its prime](/fast/9):
-BMI2's `bzhi` instruction is both faster than loading and masking *and* delivers
-more consistent performance.
+worse stalls
+([hundreds of cycles to access main memory](https://sre.google/static/pdf/rule-of-thumb-latency-numbers-letter.pdf)).
+Additionally, on x86 processors since Haswell, this
+[optimization can be past its prime](/fast/9): BMI2's `bzhi` instruction is both
+faster than loading and masking *and* delivers more consistent performance.
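
Editorial illustration of the paragraph above, not part of the commit: a minimal sketch contrasting a mask-table lookup with computing the mask directly. The `MaskTable` type and both function names are hypothetical, and `kMasks` here is only a stand-in for the table the episode refers to. The `_bzhi_u64` path is compiled only when the build enables BMI2 on x86-64 (for example with `-mbmi2`); otherwise a portable shift-and-mask fallback is used.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__BMI2__) && defined(__x86_64__)
#include <immintrin.h>
#endif

// Hypothetical stand-in for the episode's kMasks table:
// v[n] has the low n bits set, for n in [0, 64].
struct MaskTable {
  uint64_t v[65];
  constexpr MaskTable() : v() {
    for (int n = 0; n < 64; ++n) v[n] = (uint64_t{1} << n) - 1;
    v[64] = ~uint64_t{0};
  }
};
constexpr MaskTable kMasks{};

// Table-based variant: needs a load from kMasks, which may miss the cache
// in production even though it looks cheap in a microbenchmark.
inline uint64_t LowBitsTable(uint64_t x, unsigned n) {
  assert(n <= 64);
  return x & kMasks.v[n];
}

// Computed variant: no memory access. Uses bzhi when BMI2 is enabled.
inline uint64_t LowBitsComputed(uint64_t x, unsigned n) {
  assert(n <= 64);
#if defined(__BMI2__) && defined(__x86_64__)
  return _bzhi_u64(x, n);  // Zeroes bits [n, 63]; returns x when n >= 64.
#else
  return n >= 64 ? x : x & ((uint64_t{1} << n) - 1);
#endif
}
```

Both functions return the same value for every `n` in `[0, 64]`; the difference is only whether the result depends on a cacheline staying resident.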

 When developing benchmarks for
 [SwissMap](https://abseil.io/blog/20180927-swisstables), individual operations

_posts/2023-03-02-fast-53.md

Lines changed: 7 additions & 5 deletions
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #53 on October 14, 2021

 *By [Mircea Trofin](mailto:mtrofin@google.com)*

-Updated 2024-11-19
+Updated 2025-09-03

 Quicklink: [abseil.io/fast/53](https://abseil.io/fast/53)

@@ -73,7 +73,7 @@ the process of writing a benchmark. An example of its use may be seen
 [here](https://github.com/llvm/llvm-test-suite/tree/main/MicroBenchmarks/LoopVectorization)

 The benchmark harness support for performance counters consists of allowing the
-user to specify up to 3 counters in a comma-separated list, via the
+user to specify counters in a comma-separated list, via the
 `--benchmark_perf_counters` flag, to be measured alongside the time measurement.
 Just like time measurement, each counter value is captured right before the
 benchmarked code is run, and right after. The difference is reported to the user
@@ -131,13 +131,15 @@ instructions, and 6 memory ops per iteration.

 - *Number of counters*: At most 32 events may be requested for simultaneous
 collection. Note however, that the number of hardware counters available is
-much lower (usually 4-8 on modern CPUs) -- requesting more events than the
+much lower (usually 4-8 on modern CPUs, see
+`PerfCounterValues::kMaxCounters`) -- requesting more events than the
 hardware counters will cause
 [multiplexing](https://perf.wiki.kernel.org/index.php/Tutorial#multiplexing_and_scaling_events)
 and decreased accuracy.

-- *Visualization*: There is no visualization available, so the user needs to
-rely on collecting JSON result files and summarizing the results.
+- *Visualization*: There is no dedicated visualization UI available, so for
+complex analysis, users may need to collect JSON result files and summarize
+the results.

 - *Counting vs. Sampling*: The framework only collects counters in "counting"
 mode -- it answers how many cycles/cache misses/etc. happened, but not does
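
Editorial illustration of the multiplexing caveat in this hunk, not part of the commit: when more events are requested than there are hardware counters, each event is only counted for part of the run, and the raw count is scaled up by the ratio of enabled time to running time, as described in the linked perf tutorial. The sketch below shows that arithmetic. The `CounterReading` struct and function names are hypothetical and are not part of the benchmark library.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for one event's reading.
struct CounterReading {
  uint64_t raw_count;        // Events observed while the counter was scheduled.
  uint64_t time_enabled_ns;  // Time the event was requested.
  uint64_t time_running_ns;  // Time the event actually occupied a counter.
};

// True when the event shared a hardware counter with other events.
inline bool WasMultiplexed(const CounterReading& r) {
  return r.time_running_ns < r.time_enabled_ns;
}

// Estimate of the full count: raw_count * time_enabled / time_running.
// This is an extrapolation, which is why multiplexing reduces accuracy.
inline double ScaledCount(const CounterReading& r) {
  assert(r.time_running_ns <= r.time_enabled_ns);
  if (r.time_running_ns == 0) return 0.0;  // Never scheduled: nothing to scale.
  return static_cast<double>(r.raw_count) *
         static_cast<double>(r.time_enabled_ns) /
         static_cast<double>(r.time_running_ns);
}
```

An event that ran for only half of the enabled time has its raw count doubled, so any phase behavior during the unobserved half is invisible in the reported number.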

_posts/2023-03-02-fast-9.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #9 on June 24, 2019

 *By [Chris Kennelly](mailto:ckennelly@google.com)*

-Updated 2025-03-27
+Updated 2025-10-03

 Quicklink: [abseil.io/fast/9](https://abseil.io/fast/9)

_posts/2023-09-14-fast-7.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #7 on June 6, 2019

 *By [Chris Kennelly](mailto:ckennelly@google.com)*

-Updated 2025-03-25
+Updated 2025-10-03

 Quicklink: [abseil.io/fast/7](https://abseil.io/fast/7)

_posts/2023-09-30-fast-52.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #52 on September 30, 2021

 *By [Chris Kennelly](mailto:ckennelly@google.com)*

-Updated 2025-03-24
+Updated 2025-10-03

 Quicklink: [abseil.io/fast/52](https://abseil.io/fast/52)

_posts/2023-10-10-fast-64.md

Lines changed: 2 additions & 2 deletions
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #64 on October 21, 2022

 *By [Chris Kennelly](mailto:ckennelly@google.com)*

-Updated 2025-03-24
+Updated 2025-09-29

 Quicklink: [abseil.io/fast/64](https://abseil.io/fast/64)

@@ -192,7 +192,7 @@ that can be returned. This approach has two problems:
 variable small string object buffer sizes. Returning `const std::string&`
 constrains the implementation to that particular size of buffer.

-In contrast, by returning `std::string_view` (or our
+In contrast, by returning [`std::string_view`](/tips/1) (or our
 [internal predecessor](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3442.html),
 `StringPiece`), we decouple callers from the internal representation. The API is
 the same, independent of whether the string is constant data (backed by the
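
Editorial illustration of the decoupling argument in this hunk, not part of the commit: a minimal C++17 sketch in which two classes expose the same `std::string_view` accessor while storing the name differently. `Widget`, `BuiltinWidget`, `NameLength`, and `ExampleUsage` are hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>

// Stores the name in an owned std::string.
class Widget {
 public:
  explicit Widget(std::string name) : name_(std::move(name)) {}
  std::string_view name() const { return name_; }

 private:
  std::string name_;
};

// Stores the name as constant data; no std::string exists at all.
class BuiltinWidget {
 public:
  std::string_view name() const { return kName; }

 private:
  static constexpr std::string_view kName = "builtin";
};

// Callers depend only on the string_view API, not on the storage.
template <typename T>
std::size_t NameLength(const T& w) {
  return w.name().size();
}

inline void ExampleUsage() {
  Widget custom("custom");
  BuiltinWidget builtin;
  assert(NameLength(custom) == 6);
  assert(NameLength(builtin) == 7);
}
```

Had `name()` returned `const std::string&`, `BuiltinWidget` would have been forced to materialize a `std::string` just to satisfy the signature. The usual caveat applies: the returned view is only valid while the object it points into is alive.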

_posts/2023-10-15-fast-60.md

Lines changed: 14 additions & 13 deletions
@@ -12,14 +12,15 @@ Originally posted as Fast TotW #60 on June 6, 2022

 *By [Chris Kennelly](mailto:ckennelly@google.com)*

-Updated 2025-03-24
+Updated 2025-09-29

 Quicklink: [abseil.io/fast/60](https://abseil.io/fast/60)


 [Google-Wide Profiling](https://research.google/pubs/pub36575/) collects data
 not just from our hardware performance counters, but also from in-process
-profilers.
+profilers. These have been covered in previous episodes covering
+[hashtables](/fast/26).

 In-process profilers can give deeper insights about the state of the program
 that are hard to observe from the outside, such as lock contention, where memory
@@ -39,8 +40,8 @@ decisions faster, shortening our
 The value is in pulling in the area-under-curve and landing in a better spot. An
 "imperfect" profiler that can help make a decision is better than a "perfect"
 profiler that is unwieldy to collect for performance or privacy reasons. Extra
-information or precision is only useful insofar as it helps us make a *better*
-decision or *changes* the outcome.
+information or precision is only useful insofar as it helps us make a
+[*better* decision or *changes* the outcome](/fast/94).

 For example, most new optimizations to
 [TCMalloc](https://github.com/google/tcmalloc/blob/master/tcmalloc) start from
@@ -54,7 +55,7 @@ steps didn't directly save any CPU usage or bytes of RAM, but they enabled
 better decisions. Capabilities are harder to directly quantify, but they are the
 motor of progress.

-## Leveraging existing profilers: the "No build" option
+## Leveraging existing profilers: the "No build" option {#no-build}

 Developing a new profiler takes considerable time, both in terms of
 implementation and wallclock time to ready the fleet for collection at scale.
@@ -65,19 +66,19 @@ For example, if the case for hashtable profiling was just reporting the capacity
 of hashtables, then we could also derive that information from heap profiles,
 TCMalloc's heap profiles of the fleet. Even where heap profiles might not be
 able to provide precise insights--the actual "size" of the hashtable, rather
-than its capacity--we can make an informed guess from the profile combined with
-knowledge about the typical load factors due to SwissMap's design.
+than its capacity--we can make an [informed guess](/fast/90) from the profile
+combined with knowledge about the typical load factors due to SwissMap's design.

 It is important to articulate the value of the new profiler over what is already
 provided. A key driver for hashtable-specific profiling is that the CPU profiles
 of a hashtable with a
 [bad hash function look similar to those](https://youtu.be/JZE3_0qvrMg?t=1864)
-with a good hash function. The added information collected for stuck bits helps
-us drive optimization decisions we wouldn't have been able to make. The capacity
-information collected during hashtable-profiling is incidental to the profiler's
-richer, hashtable-specific details, but wouldn't be a particularly compelling
-reason to collect it on its own given the redundant information available from
-ordinary heap profiles.
+with a good hash function. The [added information collected](/fast/26) for stuck
+bits helps us drive optimization decisions we wouldn't have been able to make.
+The capacity information collected during hashtable-profiling is incidental to
+the profiler's richer, hashtable-specific details, but wouldn't be a
+particularly compelling reason to collect it on its own given the redundant
+information available from ordinary heap profiles.
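
Editorial illustration of the "informed guess" mentioned earlier in this hunk, not part of the commit: a rough sketch of estimating a table's element count from its capacity alone. It assumes a maximum load factor of 7/8 and capacity doubling on growth, so a table that only ever grew by insertion sits roughly between half of that limit and the limit itself. Both the constants and the names `SizeEstimate` and `EstimateSizeFromCapacity` are assumptions for illustration. Erasures or explicit `reserve()` calls can push the real size below the lower figure.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type: a plausible range for the element count.
struct SizeEstimate {
  std::size_t low;
  std::size_t high;
};

// Assumption: the table grows once it is more than 7/8 full and roughly
// doubles its capacity, leaving it about 7/16 full right after growth.
inline SizeEstimate EstimateSizeFromCapacity(std::size_t capacity) {
  const std::size_t high = capacity - capacity / 8;  // About 7/8 of capacity.
  const std::size_t low = high / 2;                  // About 7/16 of capacity.
  assert(low <= high && high <= capacity);
  return SizeEstimate{low, high};
}
```

For a capacity of 1024 slots observed in a heap profile, this gives a guess of roughly 448 to 896 elements, which may be precise enough to decide whether a dedicated profiler is worth building.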

 ## Sampling strategies

_posts/2023-10-20-fast-70.md

Lines changed: 8 additions & 1 deletion
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #70 on June 26, 2023

 *By [Chris Kennelly](mailto:ckennelly@google.com)*

-Updated 2025-03-25
+Updated 2025-10-03

 Quicklink: [abseil.io/fast/70](https://abseil.io/fast/70)

@@ -129,6 +129,13 @@ performance improvements. We still need to measure the impact on application and
 service-level performance, but the proxies help us hone in on an optimization
 that we want to deploy faster.

+When we are considering multiple options for a project, secondary metrics can
+give us confirmation after the fact that our expectations were correct. For
+example, suppose we chose option A over option B because both provided
+comparable performance but A would not impact reliability. We should measure
+both the performance and reliability outcomes to support our engineering
+decision. This lets us close the loop between expectations and reality.
+
 ## Aligning with success

 The metrics we pick need to align with success. If a metric tells us to do the

_posts/2023-11-10-fast-74.md

Lines changed: 12 additions & 12 deletions
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #74 on September 29, 2023

 *By [Chris Kennelly](mailto:ckennelly@google.com) and [Matt Kulukundis](mailto:kfm@google.com)*

-Updated 2025-03-25
+Updated 2025-10-03

 Quicklink: [abseil.io/fast/74](https://abseil.io/fast/74)

@@ -74,12 +74,12 @@ understand, we might be tempted to remove it. TCMalloc's fast path would appear
 cheaper, but other code somewhere else would experience a cache miss and
 [application productivity](/fast/7) would decline.

-To make matters worse, the cost is partly a profiling artifact. The TLB miss
-blocks instruction retirement, but our processors are superscalar, out-of-order
-behemoths. The processor can continue to execute further instructions in the
-meantime, but this execution is not visible to a sampling profiler like
-Google-Wide Profiling. IPC in the application may be improved, but not in a way
-immediately associated with TCMalloc.
+To make matters worse, the cost is partly [a profiling artifact](/fast/94). The
+TLB miss blocks instruction retirement, but our processors are superscalar,
+out-of-order behemoths. The processor can continue to execute further
+instructions in the meantime, but this execution is not visible to a sampling
+profiler like Google-Wide Profiling. IPC in the application may be improved, but
+not in a way immediately associated with TCMalloc.

 ### Hidden context switch costs

@@ -104,11 +104,11 @@ increase apparent kernel scheduler latency.

 ### Sweeping away protocol buffers

-Consider an extreme example. When our hashtable profiler for Abseil's hashtables
-indicates a problematic hashtable, a user could switch the offending table from
-`absl::flat_hash_map` to `std::unordered_map`. Since the profiler doesn't
-collect information about `std` containers, the offending table would no longer
-show up, although the fleet itself would be dramatically worse.
+Consider an extreme example. When [our hashtable profiler](/fast/26) for
+Abseil's hashtables indicates a problematic hashtable, a user could switch the
+offending table from `absl::flat_hash_map` to `std::unordered_map`. Since the
+profiler doesn't collect information about `std` containers, the offending table
+would no longer show up, although the fleet itself would be dramatically worse.

 While the above example may seem contrived, an almost entirely analogous
 recommendation comes up with some regularity: migrate users from protos to
