# LRU cache example

This example is inspired by a real-life scenario that produced some nice traces :)

The program is a toy LRU cache backing a data structure. Reader threads each access the data structure sequentially through the cache. When the cache contains an element, it is returned very quickly; otherwise there is a delay to simulate the element being loaded from slower storage (e.g. disk). The sizes of the data structure and the cache are configurable, as are the number of reader threads and a few other parameters.
| 6 | + |
| 7 | +## Real-life case |
| 8 | + |
| 9 | +This toy example was inspired by a real-life investigation, regarding a web-accessible application that had configurable caches, including a specific one whose performance came under scrutiny. |
| 10 | + |
| 11 | +The main symptoms were that some performance problems coincided with that cache being used intensively. It was also noticed that though the cache was used by multiple threads concurrently, that should have been completely independent in their execution, they seem to end-up synchronized, executing in lock-step, and finishing execution almost at the same time. |
| 12 | + |
| 13 | +That investigation produced some interesting traces, which we aim to approximately replicate with this example. |

### Real-life investigation - some info about that cache:

`type`: LRU (least recently used) - when the cache is full, the element that was least recently used is evicted to make space for a new element to be inserted in its place.

`configurable cache size`: The number of elements the cache can hold is configurable by the app's admins.

`cache usage`: The cache is used in a peculiar way. Though the cached data can be accessed randomly, it is often accessed sequentially, from the first element to the last, in order. The app is multi-threaded, and often multiple reader threads read the data through the cache in parallel.

The app provides cache statistics, which seemed quite good (a high cache "hit" rate was reported) despite the observed performance problems.
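
The LRU policy described above can be sketched in a few lines of Java using `LinkedHashMap` in access order. This is a generic illustration of the eviction behavior, not the app's (or this example's) actual cache code; all names below are made up:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Generic LRU cache sketch (illustrative; not the app's real cache class).
// A LinkedHashMap in access order keeps entries ordered from least to most
// recently used; removeEldestEntry evicts the LRU entry once over capacity.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}

public class LruDemo {
    public static void main(String[] args) {
        LruCache<Integer, String> cache = new LruCache<>(2);
        cache.put(1, "a");
        cache.put(2, "b");
        cache.get(1);      // touch 1, making 2 the least recently used
        cache.put(3, "c"); // cache is full, so 2 is evicted
        System.out.println(cache.containsKey(2)); // false: evicted
        System.out.println(cache.containsKey(1)); // true: recently used
    }
}
```

Note that every `get()` reorders the entries, which is why the access pattern (sequential vs random) matters so much for the hit rate.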

### Investigation goal

The cache was configured to be slightly undersized vs the data it was caching, whose size (number of elements) was slowly but surely increasing. Given the good reported cache stats, was it an issue that the cache was a bit undersized (by ~15%)? Was the cache, as configured, potentially affecting performance?

### Tracing to the rescue!

The cache and reader thread code were instrumented. From the generated traces, it was possible to distinguish cache hits from misses. Capturing traces for a few scenarios, some surprising details started to emerge.

## Build and run the example:

### Configuration

The example has several configurable parameters that can be provided as CLI arguments or through the maven config:

- `--cache-size`: number of elements the cache can hold
- `--data-size`: number of elements contained in the structure to be cached
- `--num-threads`: number of reader threads that will read the structure through the cache
- `--readers-startup-delay-ms`: delay between starting reader threads
- `--cache-miss-delay-ms`: delay applied when a cache miss occurs, to simulate the data being read from a slower source (e.g. disk or network)
- `--log-warmup`: whether or not to include the cache warm-up loop in the generated traces
- `--verbose`: whether to have verbose, detailed output on STDOUT

When run using maven, one can tweak the configuration by editing the argument values in the local `pom.xml`, in the `configuration` section under `exec-maven-plugin`:

```xml
[...]
<argument>--cache-size=95</argument>
<argument>--data-size=100</argument>
<argument>--num-threads=10</argument>
<argument>--readers-startup-delay-ms=130</argument>
<argument>--cache-miss-delay-ms=6</argument>
<argument>--log-warmup=true</argument>
<argument>--verbose=false</argument>
```

### Running the example

After configuring it as desired in `pom.xml`:

```bash
$ cd examples/lrucache
$ mvn -q compile exec:exec
```

The generated trace will be under the `traces/` sub-folder.

### Opening the generated trace in Trace Compass

First, run `jsonify.py` on the generated trace. The first time, it is necessary to install a couple of Python dependencies - see near the top of `jsonify.py` for the "pip" command to run.

```bash
$ cd examples/lrucache/traces
# "massage" the trace into valid json -
# feel free to give it a more descriptive name
# (second parameter):
../../../jsonify.py trace.log cache_trace.json
# If everything went well, "cache_trace.json" can be
# opened in Trace Compass!
```

#### Trace Compass

If you do not already have it, download the Trace Compass RCP, unzip it to a convenient place, then start it.

Create a new tracing project and expand it. Right-click "traces", select "open trace", then navigate to and select the newly generated cache trace. Next, right-click on the new trace and hover over "Select trace type". If there is a "Trace Event Format" entry, select "Generic Trace Event Trace" under it as the trace type. Otherwise, support for this format needs to be installed:

`Help` -> `Install New Software`, then search for "trace event". Under
category `Trace Types`, install:
`Trace Compass TraceEvent Parser (incubator)`

With the correct trace type selected, double-click on the trace to open it. Under it, there should be an expandable "views" item, and under it two "Thread by Component" entries - expand the second one and select "Flame Chart (new callback)" under it.


## Playing with the example

With the mechanical aspects out of the way, here are some suggestions for configurations to try. They should generate interesting traces, from which a deeper understanding of the LRU cache, as it is used in the example, can be gained.

### 1) Warm-up and then 1 reader thread

The warm-up populates the cache, hopefully making it ready for a "real" reader to use. Compare the performance of 1 reader thread when the cache is fully sized (20/20) vs slightly undersized (19/20):

#### Traces to capture

a)
```xml
<!-- Fully sized cache -->
<argument>--cache-size=20</argument>
<argument>--data-size=20</argument>
<argument>--num-threads=1</argument>
<argument>--readers-startup-delay-ms=0</argument>
<argument>--cache-miss-delay-ms=6</argument>
<argument>--log-warmup=true</argument>
<argument>--verbose=false</argument>
```

b)
```xml
<!-- Slightly undersized cache -->
<argument>--cache-size=19</argument>
<argument>--data-size=20</argument>
<argument>--num-threads=1</argument>
<argument>--readers-startup-delay-ms=0</argument>
<argument>--cache-miss-delay-ms=6</argument>
<argument>--log-warmup=true</argument>
<argument>--verbose=false</argument>
```

#### Sample results (STDOUT)

a) Fully sized cache
```
--- Cache Stats ---
Total Requests : 20
Cache Hits : 20 (100.00%)
Cache Misses : 0 (0.00%)
Avg get() Time/Op: 0.469 ms
Total get() Time : 9 ms
-------------------
```

b) Slightly undersized cache
```
--- Cache Stats ---
Total Requests : 20
Cache Hits : 0 (0.00%)
Cache Misses : 20 (100.00%)
Avg get() Time/Op: 8.044 ms
Total get() Time : 160 ms
-------------------
```

#### Leads

- How can the slightly undersized cache have a `100%` miss rate?
- Performance decreased by ~20x, presumably because the cache never hits?
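
The first lead can be reasoned through with a small single-threaded sketch (assuming a standard LRU policy; this is illustrative code, not the example's actual cache): after a sequential warm-up, a 19-entry LRU cache read sequentially over 20 elements always evicts exactly the element the reader will need next.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the sequential-scan LRU pathology (illustrative, not the
// example's actual cache code): cache-size=19, data-size=20, one reader.
public class SequentialScanMisses {
    public static void main(String[] args) {
        final int cacheSize = 19, dataSize = 20;
        Map<Integer, Integer> cache =
                new LinkedHashMap<Integer, Integer>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, Integer> e) {
                return size() > cacheSize;
            }
        };
        // Warm-up pass 0..19: inserting element 19 evicts 0, the LRU entry.
        for (int i = 0; i < dataSize; i++) cache.put(i, i);
        // Reader pass 0..19: element 0 is gone; inserting it evicts 1,
        // which is needed next... so every single access is a miss.
        int misses = 0;
        for (int i = 0; i < dataSize; i++) {
            if (cache.get(i) == null) {
                misses++;
                cache.put(i, i); // simulate loading from slow storage
            }
        }
        System.out.println("misses: " + misses + "/" + dataSize);
    }
}
```

This matches sample result b): 20 misses out of 20 requests, even though the cache is only one element short.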


### 2) Warm-up then 5 reader threads started at once

The warm-up populates the cache, hopefully making it ready for a "real" reader to use.

Compare the performance when the cache is fully sized (20/20) vs slightly undersized (19/20):

#### Traces to capture

a)
```xml
<!-- Fully sized cache -->
<argument>--cache-size=20</argument>
<argument>--data-size=20</argument>
<argument>--num-threads=5</argument>
<argument>--readers-startup-delay-ms=0</argument>
<argument>--cache-miss-delay-ms=6</argument>
<argument>--log-warmup=true</argument>
```

b)
```xml
<!-- Slightly undersized cache -->
<argument>--cache-size=19</argument>
<argument>--data-size=20</argument>
<argument>--num-threads=5</argument>
<argument>--readers-startup-delay-ms=0</argument>
<argument>--cache-miss-delay-ms=6</argument>
<argument>--log-warmup=true</argument>
<argument>--verbose=false</argument>
```

#### Sample results (STDOUT)

a) Fully sized cache
```
--- Cache Stats ---
Total Requests : 100
Cache Hits : 100 (100.00%)
Cache Misses : 0 (0.00%)
Avg get() Time/Op: 0.581 ms
Total get() Time : 58 ms
-------------------
```

b) Slightly undersized cache
```
--- Cache Stats ---
Total Requests : 100
Cache Hits : 80 (80.00%)
Cache Misses : 20 (20.00%)
Avg get() Time/Op: 7.446 ms
Total get() Time : 744 ms
-------------------
```

#### Leads

- How can the disproportionate decrease in performance between the fully sized and the 95% sized runs be explained? It's a ~12x factor!
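
One way to build intuition for the hit/miss split is a deterministic, single-threaded approximation of the 5 readers: interleave them round-robin so all 5 request the same element in turn (again, illustrative code, not the example's actual threading or cache implementation). Only the first requester of each element registers the miss; the other 4 then register hits, even though in the real concurrent run they may still be stalled waiting for the element to load:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Round-robin approximation of 5 readers sharing one undersized LRU cache
// (illustrative, not the example's actual threading): in each "round", all
// 5 readers request the same element i, one after the other.
public class LockstepReaders {
    public static void main(String[] args) {
        final int cacheSize = 19, dataSize = 20, readers = 5;
        Map<Integer, Integer> cache =
                new LinkedHashMap<Integer, Integer>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, Integer> e) {
                return size() > cacheSize;
            }
        };
        for (int i = 0; i < dataSize; i++) cache.put(i, i); // warm-up pass
        int hits = 0, misses = 0;
        for (int i = 0; i < dataSize; i++) {
            for (int r = 0; r < readers; r++) {
                if (cache.get(i) == null) {
                    misses++;        // first reader of the round misses...
                    cache.put(i, i); // ...and loads the element
                } else {
                    hits++;          // the other 4 readers then hit
                }
            }
        }
        System.out.println("hits: " + hits + ", misses: " + misses);
    }
}
```

The split matches sample result b) (80 hits, 20 misses), but the good-looking hit rate can hide that the "hitting" threads may be paying for the miss too - which is exactly what the traces can reveal.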

### 3) Warm-up then 5 reader threads started with a delay

The warm-up populates the cache, hopefully making it ready for a "real" reader to use.

Compare the performance when the cache is fully sized (20/20) vs slightly undersized (19/20):

#### Traces to capture

a)
```xml
<!-- Fully sized cache -->
<argument>--cache-size=20</argument>
<argument>--data-size=20</argument>
<argument>--num-threads=5</argument>
<argument>--readers-startup-delay-ms=25</argument>
<argument>--cache-miss-delay-ms=6</argument>
<argument>--log-warmup=true</argument>
```

b)
```xml
<!-- Slightly undersized cache -->
<argument>--cache-size=19</argument>
<argument>--data-size=20</argument>
<argument>--num-threads=5</argument>
<argument>--readers-startup-delay-ms=25</argument>
<argument>--cache-miss-delay-ms=6</argument>
<argument>--log-warmup=true</argument>
<argument>--verbose=false</argument>
```

#### Sample results (STDOUT)

a) Fully sized cache
```
--- Cache Stats ---
Total Requests : 100
Cache Hits : 100 (100.00%)
Cache Misses : 0 (0.00%)
Avg get() Time/Op: 0.187 ms
Total get() Time : 18 ms
-------------------
```

b) Slightly undersized cache
```
--- Cache Stats ---
Total Requests : 100
Cache Hits : 80 (80.00%)
Cache Misses : 20 (20.00%)
Avg get() Time/Op: 5.026 ms
Total get() Time : 502 ms
-------------------
```

#### Leads

- How can the disproportionate decrease in performance between the fully sized and the 95% sized runs be explained?
- Looking at the captured traces, do reader thread executions overlap in both cases? Is there any unexpected effect of overlap when the cache is undersized (e.g. unintended thread synchronization or a lockstep effect)?
- Zooming in on a zone of the trace where two or more threads seem to execute in "lockstep", examine one particular iteration:
  - How many of the reader threads end up having a cache miss?
  - How does iteration execution time compare between the reader thread that has a cache miss and those that have a "cache hit after waiting"?
  - In such a situation, is there a meaningful performance difference between a reported cache miss and a hit?

### 4) "At scale" run

In this run we push the limits a bit to generate a nice, rather big trace.

Compare the performance when the cache is fully sized (100/100) vs slightly undersized (95/100):

#### Traces to capture

a)
```xml
<!-- Fully sized cache -->
<argument>--cache-size=100</argument>
<argument>--data-size=100</argument>
<argument>--num-threads=10</argument>
<argument>--readers-startup-delay-ms=130</argument>
<argument>--cache-miss-delay-ms=6</argument>
<argument>--log-warmup=true</argument>
```

b)
```xml
<!-- Slightly undersized cache -->
<argument>--cache-size=95</argument>
<argument>--data-size=100</argument>
<argument>--num-threads=10</argument>
<argument>--readers-startup-delay-ms=130</argument>
<argument>--cache-miss-delay-ms=6</argument>
<argument>--log-warmup=true</argument>
<argument>--verbose=false</argument>
```

#### Sample results (STDOUT)

a) Fully sized cache
```
--- Cache Stats ---
Total Requests : 1000
Cache Hits : 1000 (100.00%)
Cache Misses : 0 (0.00%)
Avg get() Time/Op: 0.083 ms
Total get() Time : 82 ms
-------------------
```

b) Slightly undersized cache
```
--- Cache Stats ---
Total Requests : 1000
Cache Hits : 806 (80.60%)
Cache Misses : 194 (19.40%)
Avg get() Time/Op: 3.985 ms
Total get() Time : 3985 ms
-------------------
```

#### Leads

Nothing new, but look how nice the trace looks :)