I attempted to reproduce the hit rate results in your blog article, where you observed that TinyLFU had lower performance than S3Fifo. I ran your workloads against the reference implementation, Caffeine.
## Background
Caffeine uses an adaptive windowing technique, so this does not fully discount your observations about a static TinyLFU configuration. The paper recommended a 1% window as a starting point, since that works well in many important workloads such as database, search, and analytics. It concludes by showing workloads where a larger window is required and notes that adjusting it dynamically is left as future work. The subsequent work that evaluated adaptive approaches does appear to correct this deficiency, and it is the adaptive form that is most often used in practice (available in Java, Rust, Go, Python, and TypeScript).
Caffeine starts with a 1% window, monitors the hit rate over the TinyLFU sample period, and uses hill climbing to adjust the window size. The number of frequency counters is not known upfront, so, like a hash table, their capacity grows with the entry count. That resizing loses the popularity scores, and the policy may also have started at a poor window size. Thus, like a JIT compiler, the hit rate may be lower during this warmup period, but it quickly reaches peak performance, so the effect is noise except in very short runs. The admission decisions also vary slightly between runs due to jitter introduced to mitigate hash flooding attacks.
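For intuition, here is a minimal sketch of that hill-climbing loop, assuming a simple improved/regressed comparison per sample period; the class and field names are hypothetical, not Caffeine's internals:

```java
// Hypothetical sketch of the hill-climbing idea; the names are illustrative
// and Caffeine's actual climber is more sophisticated. Each sample period it
// compares the hit rate to the previous period's: keep moving the window in
// the same direction if it improved, otherwise reverse, and decay the step
// size so the configuration converges.
final class WindowClimber {
  private double stepSize = 0.0625; // fraction of capacity to move per period
  private double previousHitRate;
  private int direction = 1;        // +1 grows the window, -1 shrinks it

  /** Returns the window size change to apply for this sample period. */
  double adjust(double hitRate) {
    if (hitRate < previousHitRate) {
      direction = -direction;       // the last move hurt, so back off
    }
    previousHitRate = hitRate;
    stepSize *= 0.98;               // decay toward a stable window size
    return direction * stepSize;
  }
}
```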
## Analysis
I ran the two simulators side by side and compared their LRU hit rates to verify that the traces were being evaluated correctly. Caffeine's simulator does not have rich support for variable sizes (an entry's weight), as this is less common in Java because an object's heap usage is mostly hidden from the developer. There is a fork that extends this support in order to evaluate size-aware TinyLFU strategies (research paper), where the authors proposed an adaptation to further improve the hit rate. I hope to someday evaluate that proposal and incorporate the improvements.
I used libcachesim at 0f4d135 (current master) and Caffeine at cb5f75d (current master), with this patch to add support for the new trace formats.
For some traces the simulators report results for FIFO and LRU that differ by up to 0.25%. I determined that this is because libcachesim does not update an entry's size on a cache hit, yet it records the byte hit using the new weight.
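To make that divergence concrete, here is a hedged sketch (in Java rather than libcachesim's C) of the bookkeeping I would expect when a hit reports a new object size; all names are illustrative:

```java
// Illustrative sketch (not libcachesim's actual code) of the bookkeeping one
// would expect on a cache hit when the trace reports a new object size.
import java.util.HashMap;
import java.util.Map;

final class SizeAwareCache {
  private final Map<Long, Long> sizes = new HashMap<>();
  private long usedBytes;
  private long bytesHit;

  /** Records a hit for a resident key whose request now reports a new size. */
  void onHit(long key, long newSize) {
    long oldSize = sizes.get(key);   // the key is resident, since this is a hit
    bytesHit += newSize;             // the byte hit is credited at the new weight,
    usedBytes += newSize - oldSize;  // so the usage must be re-weighed to match
    sizes.put(key, newSize);         // persist the new size; skipping this step
                                     // explains the divergence described above
  }
}
```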
### trace.csv

```sh
./cachesim ../../data/trace.csv csv fifo,lru,qdlp,s3fifo,sieve 1gb -t "time-col=2, obj-id-col=5, obj-size-col=4"
```

|          | Hit Rate           | Byte Hit Rate      |
|----------|--------------------|--------------------|
| FIFO     | 36.64% (63.36% MR) | 26.77% (73.23% MR) |
| LRU      | 36.96% (63.04% MR) | 27.14% (72.86% MR) |
| QDLP     | 43.43% (56.57% MR) | 35.99% (64.01% MR) |
| S3Fifo   | 38.22% (61.78% MR) | 29.10% (70.90% MR) |
| Sieve    | 43.47% (56.53% MR) | 36.03% (63.97% MR) |
| Caffeine | 47.07% (52.93% MR) | 38.22% (61.78% MR) |

Caffeine leads Sieve (the best algorithm) by 3.60% HR / 2.19% BHR, and S3Fifo by 8.85% HR / 9.12% BHR.
### twitter_cluster52.csv

```sh
./cachesim ../../data/twitter_cluster52.csv csv fifo,lru,qdlp,s3fifo,sieve 1mb -t "time-col=1, obj-id-col=2, obj-size-col=3"
```

|          | Hit Rate           | Byte Hit Rate      |
|----------|--------------------|--------------------|
| FIFO     | 70.65% (29.35% MR) | 68.89% (31.11% MR) |
| LRU      | 73.13% (26.87% MR) | 71.53% (28.47% MR) |
| QDLP     | 75.63% (24.37% MR) | 74.82% (25.18% MR) |
| S3Fifo   | 76.66% (23.34% MR) | 75.99% (24.01% MR) |
| Sieve    | 74.86% (25.14% MR) | 73.88% (26.12% MR) |
| Caffeine | 75.91% (24.09% MR) | 75.09% (24.91% MR) |

Caffeine trails S3Fifo (the best algorithm) by 0.75% HR / 0.90% BHR, which is very different from the -10% difference observed in the blog article.
### hm_0.csv

```sh
./cachesim ../../data/hm_0.csv csv fifo,lru,qdlp,s3fifo,sieve 1gb -t "time-col=2, obj-id-col=5, obj-size-col=6"
```

|          | Hit Rate           | Byte Hit Rate      |
|----------|--------------------|--------------------|
| FIFO     | 72.86% (27.14% MR) | 59.12% (40.88% MR) |
| LRU      | 73.76% (26.24% MR) | 60.67% (39.33% MR) |
| QDLP     | 75.27% (24.73% MR) | 63.82% (36.18% MR) |
| S3Fifo   | 75.22% (24.78% MR) | 63.52% (36.48% MR) |
| Sieve    | 73.93% (26.07% MR) | 62.34% (37.66% MR) |
| Caffeine | 75.65% (24.35% MR) | 64.27% (35.73% MR) |

Caffeine leads QDLP (the best algorithm) by 0.38% HR / 0.45% BHR, and S3Fifo by 0.43% HR / 0.75% BHR.
## Multithreading
The article states that LRU is a scalability bottleneck due to reordering under a lock. This assertion may be misleading because many caches decouple their eviction policy from how they manage concurrency, which makes LRU's reordering characteristics irrelevant. That decoupling lets the cache designer consider a broader set of algorithms and data structures that may offer better time and space efficiency, or that enable additional features. For example, Caffeine scaled linearly to 380M reads/s on 16 cores (2015 hardware), while Segcache's FIFO policy was reported at 70M reads/s on 24 cores (2020 hardware). The benchmarks differ, but this shows that the policy's concurrency does not need to be the bottleneck, and designers can focus on other areas to optimize.
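As a sketch of that decoupling, in the BP-Wrapper style that Caffeine builds on, reads can be recorded into a lossy buffer and replayed against the policy under a try-lock; the class below is illustrative, not Caffeine's code:

```java
// Illustrative sketch (hypothetical names, not Caffeine's internals) of
// decoupling the eviction policy from concurrent readers, in the style of
// BP-Wrapper. Reads append to a bounded, lossy buffer instead of reordering
// a shared LRU chain under a lock; whichever thread wins the try-lock drains
// the buffer and replays the events against the policy single-threaded.
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.locks.ReentrantLock;

final class PolicyBuffer<K> {
  private final Queue<K> reads = new ArrayBlockingQueue<>(1024);
  private final ReentrantLock evictionLock = new ReentrantLock();

  /** Called on every cache read; never blocks and may drop events when full. */
  void recordRead(K key) {
    reads.offer(key); // lossy: dropping a sample barely affects the policy
    if (evictionLock.tryLock()) { // drain opportunistically, never contend
      try {
        for (K k; (k = reads.poll()) != null; ) {
          applyToPolicy(k); // e.g. bump a frequency counter or reorder a node
        }
      } finally {
        evictionLock.unlock();
      }
    }
  }

  private void applyToPolicy(K key) {
    // policy-specific maintenance, always executed under the eviction lock
  }
}
```

Because the read path never blocks on policy maintenance, the choice of eviction algorithm behind the buffer has little effect on the cache's concurrency profile.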
## Conclusion
On the workloads that this project considered exemplary and used for its policy designs, adaptive W-TinyLFU (Caffeine) demonstrates competitive performance. It is important to compare against reference implementations to avoid accidental differences; I have not attempted to debug your implementation to explain the degradation. I believe any static configuration will underperform in some scenario, and that designers should continue to explore ways to make more effective, adaptive choices.
P.S. This is filed as an issue since you disabled the discussions tab. You are welcome to close it after reading (self destructs in 10, 9, ...).