@@ -12,7 +12,7 @@ Originally posted as Fast TotW #53 on October 14, 2021
1212
1313*By [Mircea Trofin](mailto:mtrofin@google.com)*
1414
15- Updated 2023-03-02
15+ Updated 2023-09-04
1616
1717Quicklink: [abseil.io/fast/53](https://abseil.io/fast/53)
1818
@@ -77,10 +77,9 @@ user to specify up to 3 counters in a comma-separated list, via the
7777`--benchmark_perf_counters` flag, to be measured alongside the time measurement.
7878Just like time measurement, each counter value is captured right before the
7979benchmarked code is run, and right after. The difference is reported to the user
80- as per-iteration values (similar to the time measurement). The report is only
81- available in the JSON output (` --benchmark_format=json ` ).
80+ as per-iteration values (similar to the time measurement).
8281
83- ### Simple example
82+ ### Basic usage
8483
8584**Note**: counter names are hardware vendor and version specific. The example
8685here assumes Intel Skylake. Check how this maps to other versions of Intel CPUs,
@@ -92,51 +91,41 @@ Build a benchmark executable - for example, let's use "swissmap" from
9291[fleetbench](https://github.com/google/fleetbench):
9392
9493<pre class="prettyprint code">
95- bazel build -c opt //fleetbench/swissmap:swissmap_benchmark
94+ bazel build -c opt //fleetbench/swissmap:cold_swissmap_benchmark
9695</pre>
9796
9897Run the benchmark; let's ask for instructions, cycles, and loads:
9998
10099<pre class="prettyprint code">
101- bazel-bin/fleetbench/swissmap/swissmap_benchmark --benchmarks=all --benchmark_perf_counters=INSTRUCTIONS,CYCLES,MEM_UOPS_RETIRED:ALL_LOADS --benchmark_format=json
100+ bazel-bin/fleetbench/swissmap/cold_swissmap_benchmark \
101+ --benchmark_filter='BM_.*::absl::flat_hash_set.*64.*set_size:64.*density:0' \
102+ --benchmark_perf_counters=INSTRUCTIONS,CYCLES,MEM_UOPS_RETIRED:ALL_LOADS
102103</pre>
103104
104- The output JSON file is organized as follows:
105-
106- <pre class =" prettyprint code " >
107- {
108- "benchmarks": [
109- {
110- "CYCLES": 183357.29158733244,
111- "INSTRUCTIONS": 603772.790402176,
112- "MEM_UOPS_RETIRED:ALL_LOADS": 121.63652613172722,
113- "bytes_per_second": 1804401396.9863303,
114- "cpu_time_ns": 56750.122323683696,
115- "iterations": 25735,
116- "label": "html",
117- "name": "BM_UDataBuffer/0",
118- "real_time_ns": 56900.075383718671
119- },
120- {
121- "CYCLES": 183782.38686892079,
122- "INSTRUCTIONS": 603772.91427358345,
123- "MEM_UOPS_RETIRED:ALL_LOADS": 119.59456538520921,
124- "bytes_per_second": 1825391775.0291102,
125- "cpu_time_ns": 56097.546510730273,
126- "iterations": 25908,
127- "label": "html",
128- "name": "BM_UDataBuffer/0",
129- "real_time_ns": 56245.906090782773
130- },
131- [...]
132- }
133- </pre >
134-
135- For each run of the benchmark, the requested counters and their values are
136- captured in a JSON dictionary. The values are per-iteration (note the
137- ` iterations ` field). In the first run the benchmark completed ` 25735 `
138- iterations, so the total value for CYCLES measured by the benchmark was
139- ` 183357.29158733244 * 25735 ` .
105+ The output looks like:
106+
107+ ```
108+ Running ./cold_swissmap_benchmark
109+ Run on (8 X 4667.91 MHz CPU s)
110+ CPU Caches:
111+ L1 Data 32 KiB (x4)
112+ L1 Instruction 32 KiB (x4)
113+ L2 Unified 256 KiB (x4)
114+ L3 Unified 8192 KiB (x1)
115+ Load Average: 2.31, 2.08, 1.95
116+ ---------------------------------------------------------------------------------------------------------------------------------------
117+ Benchmark Time CPU Iterations UserCounters...
118+ ---------------------------------------------------------------------------------------------------------------------------------------
119+ BM_FindMiss_Cold<::absl::flat_hash_set, 64>/set_size:64/density:0 18.4 ns 18.4 ns 39048136 CYCLES=82.9019 INSTRUCTIONS=35.7284 MEM_UOPS_RETIRED:ALL_LOADS=6.05507
120+ BM_FindHit_Cold<::absl::flat_hash_set, 64>/set_size:64/density:0 33.3 ns 33.3 ns 20600490 CYCLES=152.156 INSTRUCTIONS=55.0354 MEM_UOPS_RETIRED:ALL_LOADS=15.0034
121+ BM_InsertHit_Cold<::absl::flat_hash_set, 64>/set_size:64/density:0 34.8 ns 34.8 ns 19004416 CYCLES=157.956 INSTRUCTIONS=59.0354 MEM_UOPS_RETIRED:ALL_LOADS=16.0013
122+ BM_Iterate_Cold<::absl::flat_hash_set, 64>/set_size:64/density:0 33.5 ns 33.5 ns 25444389 CYCLES=152.431 INSTRUCTIONS=57.9225 MEM_UOPS_RETIRED:ALL_LOADS=13.3892
123+ BM_InsertManyOrdered_Cold<::absl::flat_hash_set, 64>/set_size:64/density:0 54.9 ns 54.8 ns 14141958 CYCLES=242.373 INSTRUCTIONS=111.455 MEM_UOPS_RETIRED:ALL_LOADS=33.1838
124+ BM_InsertManyUnordered_Cold<::absl::flat_hash_set, 64>/set_size:64/density:0 50.0 ns 50.0 ns 14234753 CYCLES=227.516 INSTRUCTIONS=111.415 MEM_UOPS_RETIRED:ALL_LOADS=33.1781
125+ ```
126+
127+ So we can see that `BM_FindMiss_Cold` took approximately 83 cycles, 36
128+ instructions, and 6 memory loads per iteration.
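
The per-iteration counters can also be combined into derived metrics. As a rough sketch (not part of the benchmark library), the Python snippet below parses the `UserCounters` column of one console row from the output above and computes instructions per cycle (IPC). The helper name `parse_counters` is hypothetical, and the regular expression assumes plain decimal values without SI suffixes such as `k` or `M`.

```python
import re


def parse_counters(row: str) -> dict[str, float]:
    """Extract NAME=value pairs from one row of the console report.

    Assumption: counter names are upper-case and may contain '_' and ':'
    (e.g. MEM_UOPS_RETIRED:ALL_LOADS), and values are plain decimals
    with no SI suffix.
    """
    pairs = re.findall(r"([A-Z][A-Z0-9_:]*)=([0-9.eE+-]+)", row)
    return {name: float(value) for name, value in pairs}


# The BM_FindMiss_Cold row from the report, with column padding collapsed.
row = (
    "BM_FindMiss_Cold<::absl::flat_hash_set, 64>/set_size:64/density:0 "
    "18.4 ns 18.4 ns 39048136 "
    "CYCLES=82.9019 INSTRUCTIONS=35.7284 MEM_UOPS_RETIRED:ALL_LOADS=6.05507"
)

counters = parse_counters(row)

# Both values are already per-iteration, so their ratio is the IPC.
ipc = counters["INSTRUCTIONS"] / counters["CYCLES"]
print(f"IPC: {ipc:.2f}")  # prints "IPC: 0.43"
```

For anything beyond a quick check, the JSON output (`--benchmark_format=json`) is likely easier to consume than the console table.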
140129
141130## Summary
142131