Commit c10569f

marcdumais-work authored and MatthewKhouzam committed
[example] Add Instrumented LRU cache example
This toy example was inspired by a real-life investigation of a web-accessible application that had configurable caches, including one whose performance came under scrutiny. That investigation produced some interesting traces, which we aim to approximately replicate with this example.

Real-life investigation - some info about that cache:

- type: LRU (least recently used) - when the cache is full, the least recently used element is evicted to make space for a new element to be inserted in its place.
- configurable cache size: the number of elements it can hold is configurable by the app's admins.
- cache / cached data usage: the cache is used in a special way. Though the cache-backed data can be accessed randomly, it's often accessed sequentially, from the first element to the last, in order. The app is multi-threaded, and often multiple reader threads would be reading the data through the cache in parallel. The app provides cache statistics, which seemed quite good (a high cache "hit" rate was reported).

Investigation: the cache was configured to be slightly undersized vs the data it was caching, whose size (number of elements) was slowly but surely increasing. Given the good reported cache stats, was it an issue that it was a bit undersized (by ~15%)? Was the cache, as configured, potentially affecting performance?

Tracing to the rescue! The cache and reader thread code were instrumented. From the generated traces, it was possible to distinguish cache hits vs misses. Capturing traces for a few scenarios, some surprising details started to emerge! See the example's README for more details.

Signed-off-by: Marc Dumais <marc.dumais@ericsson.com>
1 parent 3b76859 commit c10569f

10 files changed

Lines changed: 1159 additions & 0 deletions

GitHub Actions workflow

Lines changed: 36 additions & 0 deletions
```yaml
# This workflow will build a Java project with Maven, and cache/restore any dependencies to improve the workflow execution time
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-java-with-maven

# This workflow uses actions that are not certified by GitHub.
# They are provided by a third-party and are governed by
# separate terms of service, privacy policy, and support
# documentation.

name: Instrumented LRU Cache Example

on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]

jobs:
  build:

    runs-on: ubuntu-latest
    strategy:
      matrix:
        java_version: [ 21 ]

    steps:
    - uses: actions/checkout@v4
    - name: Set up JDK ${{ matrix.java_version }}
      uses: actions/setup-java@v4
      with:
        java-version: ${{ matrix.java_version }}
        distribution: 'temurin'
        cache: maven
    - name: Build with Maven
      run: cd examples/lrucache && mvn -q compile exec:exec
```
README.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -46,6 +46,7 @@ To use free form tracing, check traceInstant.
 
 ### Examples
 
+(In addition to the examples below, also see under root folder `examples/lrucache` for an example that generates interesting traces, inspired by a real-life investigation about the performance of an application's cache.)
 
 To instrument
```

examples/lrucache/.gitignore

Lines changed: 1 addition & 0 deletions
```
traces/
```

examples/lrucache/README.md

Lines changed: 343 additions & 0 deletions
# LRU cache example

This is inspired from a real-life scenario where nice traces were obtained :)

The program is a toy LRU cache backing a data structure. Reader threads each access the data structure sequentially through the cache. When the cache contains an element, it is returned quickly; otherwise there is a delay to simulate it being loaded from slower storage (e.g. disk). The sizes of the data structure and of the cache are configurable, as are the number of reader threads and a few other parameters.
## Real-life case

This toy example was inspired by a real-life investigation of a web-accessible application that had configurable caches, including one whose performance came under scrutiny.

The main symptoms were that some performance problems coincided with intensive use of that cache. It was also noticed that the threads using the cache concurrently, though they should have been completely independent in their execution, seemed to end up synchronized, executing in lock-step and finishing almost at the same time.

That investigation produced some interesting traces, which we aim to approximately replicate with this example.
### Real-life investigation - some info about that cache

`type`: LRU (least recently used) - when the cache is full, the least recently used element is evicted to make space for a new element to be inserted in its place.

`configurable cache size`: the number of elements the cache can hold is configurable by the app's admins.

`cache usage`: the cache is used in a peculiar way. Though the cached data can be accessed randomly, it is often accessed sequentially, from the first element to the last, in order. The app is multi-threaded, and often multiple reader threads would be reading the data through the cache in parallel.

The app provides cache statistics, which seemed quite good: a high cache "hit" rate was reported.
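To make the eviction rule concrete, here is a minimal sketch of such an LRU cache in Java. This is a hypothetical illustration, not the app's (or this example's) actual implementation; `java.util.LinkedHashMap` in access order provides the LRU behavior almost for free:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Minimal LRU cache sketch (hypothetical; the real cache class may differ). */
public class SimpleLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public SimpleLruCache(int capacity) {
        // accessOrder=true: iteration order goes from least to most recently accessed
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion: evict the least recently used
        // entry once the configured capacity is exceeded
        return size() > capacity;
    }
}
```

Reading an entry with `get()` moves it to the most recently used position, so the entry evicted on insertion is always the one untouched for the longest time.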
### Investigation goal

The cache was configured to be slightly undersized vs the data it was caching, whose size (number of elements) was slowly but surely increasing. Given the good reported cache stats, was it an issue that the cache was a bit undersized (by ~15%)? Was the cache, as configured, potentially affecting performance?

### Tracing to the rescue!

The cache and reader thread code were instrumented. From the generated traces, it was possible to distinguish cache hits from misses. Capturing traces for a few scenarios, some surprising details started to emerge!
## Build and run the example

### Configuration

The example has several configurable parameters that can be provided as CLI arguments or through the Maven config:

- `--cache-size`: number of elements the cache can hold
- `--data-size`: number of elements contained in the structure to be cached
- `--num-threads`: number of reader threads that will read the structure through the cache
- `--readers-startup-delay-ms`: delay between starting reader threads
- `--cache-miss-delay-ms`: delay to use when a cache miss occurs, e.g. to simulate the data being read from a slower source (e.g. disk or network)
- `--log-warmup`: whether or not to include the cache warm-up loop in the generated traces
- `--verbose`: whether to have verbose, detailed output on STDOUT
When run using Maven, one can tweak the configuration by editing the values in the local `pom.xml`, in the `configuration` section under `exec-maven-plugin`:

```xml
[...]
<argument>--cache-size=95</argument>
<argument>--data-size=100</argument>
<argument>--num-threads=10</argument>
<argument>--readers-startup-delay-ms=130</argument>
<argument>--cache-miss-delay-ms=6</argument>
<argument>--log-warmup=true</argument>
<argument>--verbose=false</argument>
```
### Running the example

After having configured it as wanted in `pom.xml`:

```bash
$ cd examples/lrucache
$ mvn -q compile exec:exec
```

The generated trace will be under the sub-folder `traces/`.

### Opening the generated trace in Trace Compass

First, run `jsonify.py` on the generated trace. The first time, it is necessary to install a couple of Python dependencies - see near the top of `jsonify.py` for the `pip` command to run.

```bash
$ cd examples/lrucache/traces
# "massage" the trace into valid json -
# feel free to give it a more descriptive name
# (second parameter):
../../../jsonify.py trace.log cache_trace.json
# If everything went well, "cache_trace.json" can be
# opened in Trace Compass!
```
#### Trace Compass

If you do not already have it, download the Trace Compass RCP, unzip it to a convenient place, then start it.

Create a new tracing project and expand it. Right-click "Traces", select "Open Trace", then navigate to and select the newly generated cache trace. Then right-click on the new trace and hover over "Select Trace Type". Check whether there is a "Trace Event Format" entry. If so, select "Generic Trace Event Trace" under it as the trace type. Otherwise, install support for this format:

`Help` -> `Install New Software`, then search for "trace event". Under category `Trace Types`, install: `Trace Compass TraceEvent Parser (incubator)`

With the correct trace type selected, double-click on the trace to open it. Under it, there should be an expandable "Views" item, and under it two "Thread by Component" entries - expand the second one and select "Flame Chart (new callback)" under it.
## Playing with the example

With the mechanical aspects out of the way, here are some suggested configurations to try. They should generate interesting traces, from which a deeper understanding of the LRU cache, as it is used in the example, can be gained.

### 1) Warm-up and then 1 reader thread

The warm-up populates the cache, hopefully making it ready for a "real" reader to use. Compare the performance of 1 reader thread when the cache is fully sized (20/20) vs slightly undersized (19/20):

#### Traces to capture
a)
```xml
<!-- Fully sized cache -->
<argument>--cache-size=20</argument>
<argument>--data-size=20</argument>
<argument>--num-threads=1</argument>
<argument>--readers-startup-delay-ms=0</argument>
<argument>--cache-miss-delay-ms=6</argument>
<argument>--log-warmup=true</argument>
<argument>--verbose=false</argument>
```

b)
```xml
<!-- Slightly undersized cache -->
<argument>--cache-size=19</argument>
<argument>--data-size=20</argument>
<argument>--num-threads=1</argument>
<argument>--readers-startup-delay-ms=0</argument>
<argument>--cache-miss-delay-ms=6</argument>
<argument>--log-warmup=true</argument>
<argument>--verbose=false</argument>
```
#### Sample results (STDOUT)

a) Fully sized cache
```
--- Cache Stats ---
Total Requests   : 20
Cache Hits       : 20 (100.00%)
Cache Misses     : 0 (0.00%)
Avg get() Time/Op: 0.469 ms
Total get() Time : 9 ms
-------------------
```

b) Slightly undersized cache
```
--- Cache Stats ---
Total Requests   : 20
Cache Hits       : 0 (0.00%)
Cache Misses     : 20 (100.00%)
Avg get() Time/Op: 8.044 ms
Total get() Time : 160 ms
-------------------
```
#### Leads

- How to explain that the slightly undersized cache has a `100%` miss rate?
- Performance decreased by ~20x, probably because the cache never hits?
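The first lead has a classic answer: with LRU eviction and strictly in-order passes, a cache even one element too small evicts each element just before it is needed again. The following standalone sketch (hypothetical code, separate from the example) reproduces the effect by counting hits and misses over one reader pass:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch (not the example's code): shows LRU "sequential
 *  thrash" - a cache one element too small gets a 0% hit rate on
 *  repeated in-order passes over the data. */
public class SequentialThrashDemo {

    /** Warm the cache, then do one sequential reader pass; returns {hits, misses}. */
    static int[] run(int dataSize, int cacheSize) {
        Map<Integer, Integer> cache = new LinkedHashMap<>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<Integer, Integer> e) {
                return size() > cacheSize;   // evict least recently used beyond capacity
            }
        };
        // Warm-up pass: inserting the last element already evicts element 0
        for (int i = 0; i < dataSize; i++) cache.put(i, i);

        int hits = 0, misses = 0;
        for (int i = 0; i < dataSize; i++) {
            if (cache.get(i) != null) hits++;
            else { misses++; cache.put(i, i); }   // miss: "load" and insert, evicting i+1
        }
        return new int[] { hits, misses };
    }

    public static void main(String[] args) {
        int[] r = run(20, 19);
        System.out.println("hits=" + r[0] + " misses=" + r[1]);  // hits=0 misses=20
    }
}
```

Element 0 misses first; inserting it evicts element 1, which is exactly the next request, and so on down the line, matching the 20-miss sample result above.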
### 2) Warm-up then 5 reader threads started at once

The warm-up populates the cache, hopefully making it ready for a "real" reader to use.

Compare the performance when the cache is fully sized (20/20) vs slightly undersized (19/20):

#### Traces to capture

a)
```xml
<!-- Fully sized cache -->
<argument>--cache-size=20</argument>
<argument>--data-size=20</argument>
<argument>--num-threads=5</argument>
<argument>--readers-startup-delay-ms=0</argument>
<argument>--cache-miss-delay-ms=6</argument>
<argument>--log-warmup=true</argument>
```

b)
```xml
<!-- Slightly undersized cache -->
<argument>--cache-size=19</argument>
<argument>--data-size=20</argument>
<argument>--num-threads=5</argument>
<argument>--readers-startup-delay-ms=0</argument>
<argument>--cache-miss-delay-ms=6</argument>
<argument>--log-warmup=true</argument>
<argument>--verbose=false</argument>
```
#### Sample results (STDOUT)

a) Fully sized cache
```
--- Cache Stats ---
Total Requests   : 100
Cache Hits       : 100 (100.00%)
Cache Misses     : 0 (0.00%)
Avg get() Time/Op: 0.581 ms
Total get() Time : 58 ms
-------------------
```

b) Slightly undersized cache
```
--- Cache Stats ---
Total Requests   : 100
Cache Hits       : 80 (80.00%)
Cache Misses     : 20 (20.00%)
Avg get() Time/Op: 7.446 ms
Total get() Time : 744 ms
-------------------
```
#### Leads

- How to explain the disproportionate decrease in performance between the fully sized cache and the 95%-sized one? It's a ~12x factor!
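One plausible mechanism for the amplification (an assumption for illustration; the example's actual locking is not shown here): if the cache's `get()` is guarded by a single lock, and a miss holds that lock for the full miss delay, then every concurrent reader stalls behind each miss, so a 20% miss rate taxes all threads rather than just the one that missed. A hypothetical Java sketch of such a coarse-grained cache:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch (not the example's code): a cache behind one
 *  coarse lock, where a miss holds the lock for the simulated load
 *  time, stalling every other reader thread. */
public class LockedCache {
    private final Map<Integer, Integer> map;
    private final long missDelayMs;

    public LockedCache(int capacity, long missDelayMs) {
        this.missDelayMs = missDelayMs;
        this.map = new LinkedHashMap<>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<Integer, Integer> e) {
                return size() > capacity;   // LRU eviction beyond capacity
            }
        };
    }

    /** All readers serialize here; one thread's miss delays them all. */
    public synchronized int get(int key) {
        Integer v = map.get(key);
        if (v == null) {
            try {
                Thread.sleep(missDelayMs);  // simulated slow load, while holding the lock
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            }
            v = key;                        // stand-in for the loaded value
            map.put(key, v);
        }
        return v;
    }
}
```

Under this assumption, with 5 readers and a 6 ms miss delay, each miss can cost up to 5 x 6 ms of aggregate reader time, which is one way a modest 20% miss rate could turn into a ~12x slowdown.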
### 3) Warm-up then 5 reader threads started with a delay

The warm-up populates the cache, hopefully making it ready for a "real" reader to use.

Compare the performance when the cache is fully sized (20/20) vs slightly undersized (19/20):

#### Traces to capture

a)
```xml
<!-- Fully sized cache -->
<argument>--cache-size=20</argument>
<argument>--data-size=20</argument>
<argument>--num-threads=5</argument>
<argument>--readers-startup-delay-ms=25</argument>
<argument>--cache-miss-delay-ms=6</argument>
<argument>--log-warmup=true</argument>
```

b)
```xml
<!-- Slightly undersized cache -->
<argument>--cache-size=19</argument>
<argument>--data-size=20</argument>
<argument>--num-threads=5</argument>
<argument>--readers-startup-delay-ms=25</argument>
<argument>--cache-miss-delay-ms=6</argument>
<argument>--log-warmup=true</argument>
<argument>--verbose=false</argument>
```
#### Sample results (STDOUT)

a) Fully sized cache
```
--- Cache Stats ---
Total Requests   : 100
Cache Hits       : 100 (100.00%)
Cache Misses     : 0 (0.00%)
Avg get() Time/Op: 0.187 ms
Total get() Time : 18 ms
-------------------
```

b) Slightly undersized cache
```
--- Cache Stats ---
Total Requests   : 100
Cache Hits       : 80 (80.00%)
Cache Misses     : 20 (20.00%)
Avg get() Time/Op: 5.026 ms
Total get() Time : 502 ms
-------------------
```
#### Leads

- How to explain the disproportionate decrease in performance between the fully sized cache and the 95%-sized one?
- Looking at the captured traces, do reader thread executions overlap in both cases? Is there any unexpected effect of overlap when the cache is undersized (e.g. unintended thread synchronization or a lockstep effect)?
- Zooming in on a zone in the trace where two or more threads seem to execute in "lockstep", examine one particular iteration:
  - How many of the reader threads end up having a cache miss?
  - How does iteration execution time compare between the reader thread that has a cache miss and those that have a "cache hit after waiting"?
  - In such a situation, is there a meaningful performance gain for a reported cache hit vs a miss?
### 4) "At scale" run

In this run, we push the limits a bit to generate a nice, rather big trace.

Compare the performance when the cache is fully sized (100/100) vs slightly undersized (95/100):

#### Traces to capture

a)
```xml
<!-- Fully sized cache -->
<argument>--cache-size=100</argument>
<argument>--data-size=100</argument>
<argument>--num-threads=10</argument>
<argument>--readers-startup-delay-ms=130</argument>
<argument>--cache-miss-delay-ms=6</argument>
<argument>--log-warmup=true</argument>
```

b)
```xml
<!-- Slightly undersized cache -->
<argument>--cache-size=95</argument>
<argument>--data-size=100</argument>
<argument>--num-threads=10</argument>
<argument>--readers-startup-delay-ms=130</argument>
<argument>--cache-miss-delay-ms=6</argument>
<argument>--log-warmup=true</argument>
<argument>--verbose=false</argument>
```
#### Sample results (STDOUT)

a) Fully sized cache
```
--- Cache Stats ---
Total Requests   : 1000
Cache Hits       : 1000 (100.00%)
Cache Misses     : 0 (0.00%)
Avg get() Time/Op: 0.083 ms
Total get() Time : 82 ms
-------------------
```

b) Slightly undersized cache
```
--- Cache Stats ---
Total Requests   : 1000
Cache Hits       : 806 (80.60%)
Cache Misses     : 194 (19.40%)
Avg get() Time/Op: 3.985 ms
Total get() Time : 3985 ms
-------------------
```
#### Leads
Nothing new, but look how nice the trace looks :)
