Commit 7d2adb8

Add Token-Cost-Aware W-TinyLFU eviction policy
Implement a W-TinyLFU eviction policy that combines frequency-based admission filtering with cost-weighted eviction decisions, targeting LLM caching workloads where response regeneration costs vary widely.

Architecture (following Caffeine's design):
- Window LRU (1%): absorbs burst traffic
- TinyLFU admission gate: Count-Min Sketch + Bloom doorkeeper
- Segmented main LRU (99%): probation (20%) + protected (80%)

Cost-aware extension: when enabled, admission multiplies frequency by response token count, preferring to retain expensive entries.

Components:
- count_min_sketch.py: 4-bit packed counters with periodic aging
- doorkeeper.py: Bloom filter to reject one-hit-wonders
- segmented_lru.py: two-tier LRU with promotion/demotion
- wtinylfu_eviction.py: orchestrator implementing EvictionBase

Registered as name="wtinylfu" in the eviction factory. Tunable via window_pct, probation_pct, cost_aware, and CMS parameters.

No new external dependencies (uses numpy + stdlib only). 32 unit tests covering all components and algorithm properties. Usage examples added to examples/eviction/.
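The cost-aware admission rule in the commit message can be sketched as below. This is only an illustration, not the code in wtinylfu_eviction.py (which is not shown on this page): the name `should_admit`, its parameters, and the reject-on-tie rule are all assumptions.

```python
def should_admit(
    candidate_freq: int,
    victim_freq: int,
    candidate_tokens: int = 1,
    victim_tokens: int = 1,
    cost_aware: bool = True,
) -> bool:
    """Decide whether a candidate should replace the eviction victim.

    Hypothetical helper: with cost_aware=True each frequency estimate is
    multiplied by the entry's response token count before the comparison.
    """
    if cost_aware:
        candidate_score = candidate_freq * max(candidate_tokens, 1)
        victim_score = victim_freq * max(victim_tokens, 1)
    else:
        candidate_score = candidate_freq
        victim_score = victim_freq
    # Assumption: on a tie the incumbent (victim) stays in the cache.
    return candidate_score > victim_score
```

With these assumptions, a rarely requested but long response (frequency 2, 500 tokens) would displace a more popular short one (frequency 3, 50 tokens) when `cost_aware=True`, and would be rejected when `cost_aware=False`.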
1 parent c59fb3a commit 7d2adb8

12 files changed

Lines changed: 1026 additions & 3 deletions

README.md

Lines changed: 1 addition & 1 deletion
@@ -372,7 +372,7 @@ The **Cache Manager** is responsible for controlling the operation of both the *
 - [x] Support FIFO eviction policy.
 - [x] Support LFU eviction policy.
 - [x] Support RR eviction policy.
-- [ ] Support more complicated eviction policies.
+- [x] Support W-TinyLFU eviction policy with cost-aware admission.
 - **Distributed Caching**
 
 If you were to scale your GPTCache deployment horizontally using in-memory caching, it won't be possible. Since the cached information would be limited to the single pod.

examples/README.md

Lines changed: 27 additions & 0 deletions
@@ -16,6 +16,7 @@
 - [Start server](#start-server)
 - [Benchmark](#benchmark)
 - [How to use post-process function](#how-to-use-post-process-function)
+- [How to set the eviction policy](#how-to-set-the-eviction-policy)
 
 ## How to run Visual Question Answering with MiniGPT-4
 
@@ -715,3 +716,29 @@ cache.init(
 ```
 
 See [processor/post_example.py](./processor/post_example.py) for a runnable example.
+
+## How to set the `eviction` policy
+
+GPTCache supports several eviction policies: LRU (default), FIFO, LFU, RR, and W-TinyLFU.
+
+### W-TinyLFU eviction
+
+The W-TinyLFU policy combines a TinyLFU admission filter with a segmented LRU, following the design Caffeine uses to achieve near-optimal hit rates. It optionally supports cost-aware admission for LLM workloads where response regeneration costs vary.
+
+See [eviction/wtinylfu_eviction.py](./eviction/wtinylfu_eviction.py) for full examples.
+
+```python
+from gptcache.manager import get_data_manager, CacheBase, VectorBase
+from gptcache.manager.eviction import EvictionBase
+
+data_manager = get_data_manager(
+    cache_base=CacheBase("sqlite"),
+    vector_base=VectorBase("faiss", dimension=onnx.dimension),  # onnx: a gptcache.embedding.Onnx() instance
+    eviction_base=EvictionBase(
+        "wtinylfu",
+        maxsize=200,
+        clean_size=50,
+        cost_aware=True,  # weight admission by response token count
+    ),
+)
+```
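
The segmented LRU mentioned in the README text keeps new entries in a probation segment and moves them to a protected segment on a second hit. The sketch below is one common way to do this; it is not the commit's segmented_lru.py, which is not shown on this page. The class `TinySegmentedLRU` and its method names are hypothetical.

```python
from collections import OrderedDict


class TinySegmentedLRU:
    """Illustrative two-segment LRU (hypothetical, not the commit's class)."""

    def __init__(self, probation_size: int, protected_size: int):
        self.probation_size = probation_size
        self.protected_size = protected_size
        self.probation = OrderedDict()  # least recently used entry first
        self.protected = OrderedDict()

    def insert(self, key):
        """Add a new key to probation; return the evicted key, if any."""
        self.probation[key] = True
        if len(self.probation) > self.probation_size:
            victim, _ = self.probation.popitem(last=False)
            return victim
        return None

    def access(self, key) -> bool:
        """Record a hit: promote from probation or refresh in protected."""
        if key in self.protected:
            self.protected.move_to_end(key)
            return True
        if key in self.probation:
            del self.probation[key]
            self.protected[key] = True
            if len(self.protected) > self.protected_size:
                # Demote the least recently used protected entry.
                demoted, _ = self.protected.popitem(last=False)
                self.probation[demoted] = True
            return True
        return False
```

Promotion removes one probation entry and demotion adds at most one back, so a hit never grows the probation segment; only `insert` can trigger an eviction in this sketch.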

examples/eviction/wtinylfu_eviction.py

Lines changed: 92 additions & 0 deletions
@@ -0,0 +1,92 @@
+from gptcache import Cache
+from gptcache.embedding import Onnx
+from gptcache.manager import get_data_manager, CacheBase, VectorBase
+from gptcache.manager.eviction import EvictionBase
+
+
+def wtinylfu_basic_example():
+    """
+    Basic W-TinyLFU eviction example.
+
+    Uses the default settings: 1% window, 20/80 probation/protected split,
+    cost-aware admission enabled. The policy combines frequency-based
+    admission filtering (TinyLFU) with cost-weighted eviction decisions,
+    preferring to retain expensive-to-regenerate cache entries.
+    """
+    onnx = Onnx()
+    data_manager = get_data_manager(
+        cache_base=CacheBase("sqlite"),
+        vector_base=VectorBase("faiss", dimension=onnx.dimension),
+        eviction_base=EvictionBase(
+            "wtinylfu",
+            maxsize=200,
+            clean_size=50,
+        ),
+    )
+
+    cache = Cache()
+    cache.init(data_manager=data_manager)
+    question = "What is github?"
+    answer = "Online platform for version control and code collaboration."
+    embedding = onnx.to_embeddings(question)
+    cache.import_data([question], [answer], [embedding])
+
+
+def wtinylfu_custom_params_example():
+    """
+    W-TinyLFU with custom parameters.
+
+    Tunable parameters:
+    - window_pct: window cache as % of total capacity (default: 1.0)
+    - probation_pct: probation segment as % of main cache (default: 20.0)
+    - cost_aware: enable cost-weighted admission (default: True)
+    - cms_width_multiplier: Count-Min Sketch width scaling (default: 1)
+    - reset_multiplier: CMS aging interval as multiple of capacity (default: 10)
+    """
+    onnx = Onnx()
+    data_manager = get_data_manager(
+        cache_base=CacheBase("sqlite"),
+        vector_base=VectorBase("faiss", dimension=onnx.dimension),
+        eviction_base=EvictionBase(
+            "wtinylfu",
+            maxsize=500,
+            clean_size=100,
+            window_pct=2.0,
+            probation_pct=25.0,
+            cost_aware=True,
+        ),
+    )
+
+    cache = Cache()
+    cache.init(data_manager=data_manager)
+    question = "Explain quantum computing"
+    answer = "Quantum computing uses quantum bits (qubits) that can exist in superposition..."
+    embedding = onnx.to_embeddings(question)
+    cache.import_data([question], [answer], [embedding])
+
+
+def wtinylfu_no_cost_example():
+    """
+    W-TinyLFU without cost awareness (pure frequency-based admission).
+
+    When cost_aware=False, the admission decision uses only the TinyLFU
+    frequency estimate, equivalent to Caffeine's default policy.
+    """
+    onnx = Onnx()
+    data_manager = get_data_manager(
+        cache_base=CacheBase("sqlite"),
+        vector_base=VectorBase("faiss", dimension=onnx.dimension),
+        eviction_base=EvictionBase(
+            "wtinylfu",
+            maxsize=200,
+            clean_size=50,
+            cost_aware=False,
+        ),
+    )
+
+    cache = Cache()
+    cache.init(data_manager=data_manager)
+    question = "What is machine learning?"
+    answer = "A subset of AI that enables systems to learn from data."
+    embedding = onnx.to_embeddings(question)
+    cache.import_data([question], [answer], [embedding])
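
To make the `window_pct` and `probation_pct` parameters concrete, the sketch below works out how a total capacity might be divided between the three segments. The helper `segment_sizes` is hypothetical, and the rounding (round down, at least one slot per segment) is an assumption; the policy in this commit may round differently.

```python
def segment_sizes(maxsize: int, window_pct: float = 1.0, probation_pct: float = 20.0):
    """Split total capacity into (window, probation, protected) slots.

    Assumption: sizes are rounded down and the window and probation
    segments each get at least one slot.
    """
    window = max(int(maxsize * window_pct / 100.0), 1)
    main = maxsize - window  # the main cache is whatever the window leaves
    probation = max(int(main * probation_pct / 100.0), 1)
    protected = main - probation
    return window, probation, protected


print(segment_sizes(200))              # (2, 39, 159)
print(segment_sizes(500, 2.0, 25.0))   # (10, 122, 368)
```

Under these assumptions the default settings with `maxsize=200` leave only two slots for the window, so very small caches may benefit from a larger `window_pct`.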

count_min_sketch.py

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
+"""4-bit packed Count-Min Sketch for frequency estimation.
+
+Uses the same design as Caffeine/Theine: 4 hash functions, 4-bit counters
+packed 16 per uint64 word, periodic halving for aging.
+"""
+
+import numpy as np
+
+
+def _next_power_of_2(n: int) -> int:
+    if n <= 0:
+        return 1
+    n -= 1
+    n |= n >> 1
+    n |= n >> 2
+    n |= n >> 4
+    n |= n >> 8
+    n |= n >> 16
+    n |= n >> 32
+    return n + 1
+
+
+def _rehash(h: int) -> int:
+    h = (h ^ (h >> 32)) & 0xFFFFFFFFFFFFFFFF
+    h = (h * 0x94D049BB133111EB) & 0xFFFFFFFFFFFFFFFF
+    h = (h ^ (h >> 32)) & 0xFFFFFFFFFFFFFFFF
+    return h
+
+
+_RESET_MASK = np.uint64(0x7777777777777777)  # clears the top bit of every nibble after a right shift
+_MAX_COUNT = 15  # largest value a 4-bit counter can hold
+
+
+class CountMinSketch:
+    """4-bit packed Count-Min Sketch with 4 hash functions.
+
+    Each counter is 4 bits (max value 15). 16 counters are packed into
+    one uint64 word. The sketch uses 4 independent hash functions derived
+    via iterative rehashing.
+
+    :param capacity: expected max number of tracked items (determines width)
+    :param width_multiplier: width = next_power_of_2(capacity * multiplier)
+    :param sample_size_multiplier: reset after this * capacity increments
+    """
+
+    def __init__(
+        self,
+        capacity: int,
+        width_multiplier: int = 1,
+        sample_size_multiplier: int = 10,
+    ):
+        self._width = _next_power_of_2(max(capacity * width_multiplier, 16))
+        self._mask = self._width - 1
+        # 4 rows, each row has width counters, packed 16 per uint64
+        words_per_row = max(self._width // 16, 1)
+        self._table = np.zeros(4 * words_per_row, dtype=np.uint64)
+        self._words_per_row = words_per_row
+        self._additions = 0
+        self._sample_size = max(capacity * sample_size_multiplier, 16)
+
+    def increment(self, key_hash: int) -> bool:
+        """Increment counters for the given hash. Returns True if any counter changed."""
+        h0 = _rehash(key_hash)
+        h1 = _rehash(h0)
+        h2 = _rehash(h1)
+        h3 = _rehash(h2)
+
+        added = self._inc_counter(0, h0 & self._mask)
+        added |= self._inc_counter(1, h1 & self._mask)
+        added |= self._inc_counter(2, h2 & self._mask)
+        added |= self._inc_counter(3, h3 & self._mask)
+
+        if added:
+            self._additions += 1
+
+        return added
+
+    def estimate(self, key_hash: int) -> int:
+        """Return the estimated frequency (minimum across all rows)."""
+        h0 = _rehash(key_hash)
+        h1 = _rehash(h0)
+        h2 = _rehash(h1)
+        h3 = _rehash(h2)
+
+        c0 = self._read_counter(0, h0 & self._mask)
+        c1 = self._read_counter(1, h1 & self._mask)
+        c2 = self._read_counter(2, h2 & self._mask)
+        c3 = self._read_counter(3, h3 & self._mask)
+
+        return min(c0, c1, c2, c3)
+
+    def reset(self):
+        """Halve all counters (aging / decay)."""
+        self._table = (self._table >> np.uint64(1)) & _RESET_MASK
+        self._additions = self._additions // 2
+
+    def _inc_counter(self, row: int, index: int) -> bool:
+        word_idx = row * self._words_per_row + index // 16
+        nibble_pos = np.uint64((index % 16) * 4)
+        current = int((self._table[word_idx] >> nibble_pos) & np.uint64(0xF))
+        if current < _MAX_COUNT:  # saturate at 15 so a nibble never overflows into its neighbour
+            self._table[word_idx] += np.uint64(1) << nibble_pos
+            return True
+        return False
+
+    def _read_counter(self, row: int, index: int) -> int:
+        word_idx = row * self._words_per_row + index // 16
+        nibble_pos = np.uint64((index % 16) * 4)
+        return int((self._table[word_idx] >> nibble_pos) & np.uint64(0xF))
+
+    @property
+    def additions(self) -> int:
+        return self._additions
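
The halving step in `reset` relies on a bit trick: shifting a packed word right by one bit halves every 4-bit counter at once, and the `0x7777...` mask discards the bit that leaks in from the neighbouring counter. The worked example below uses plain Python integers rather than numpy; the helpers `pack`, `unpack`, and `halve` are illustrative names, not part of the commit.

```python
RESET_MASK = 0x7777777777777777  # 0111 repeated: clears the top bit of every nibble


def pack(counters):
    """Pack up to 16 counters (0..15 each) into one 64-bit integer."""
    word = 0
    for i, c in enumerate(counters):
        word |= (c & 0xF) << (i * 4)
    return word


def unpack(word, n=16):
    """Read the first n 4-bit counters back out of a packed word."""
    return [(word >> (i * 4)) & 0xF for i in range(n)]


def halve(word):
    """Halve all 16 counters at once, mirroring the sketch's reset step."""
    return (word >> 1) & RESET_MASK


word = pack([15, 1, 8, 3])
print(unpack(halve(word), 4))  # [7, 0, 4, 1]
print(unpack(word >> 1, 4))    # [15, 0, 12, 1] -- without the mask, low bits leak into neighbours
```

The second line of output shows why the mask matters: the low bit of each counter would otherwise become the high bit of the counter below it.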

doorkeeper.py

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
+"""Bloom filter doorkeeper for TinyLFU admission control.
+
+Filters out one-hit-wonders: only items seen at least twice get their
+Count-Min Sketch counters incremented. This prevents long-tail pollution.
+"""
+
+import math
+
+import numpy as np
+
+
+class Doorkeeper:
+    """Simple Bloom filter that tracks whether an item has been seen before.
+
+    :param capacity: expected number of insertions before reset
+    :param fp_rate: target false positive rate (default 1%)
+    """
+
+    def __init__(self, capacity: int = 10000, fp_rate: float = 0.01):
+        # Optimal sizing: m = -n*ln(p) / (ln2)^2, k = (m/n)*ln2
+        if capacity <= 0:
+            capacity = 16
+        m = int(-capacity * math.log(fp_rate) / (math.log(2) ** 2))
+        m = max(m, 64)
+        self._num_bits = m
+        self._num_hashes = max(int((m / capacity) * math.log(2)), 1)
+        # Bit array stored as uint64 words
+        self._words = np.zeros((m + 63) // 64, dtype=np.uint64)
+
+    def allow(self, key_hash: int) -> bool:
+        """Check if key was seen before, then add it.
+
+        Returns True if the key was already present (second+ access).
+        Always adds the key regardless.
+        """
+        already_present = self.contains(key_hash)
+        self.add(key_hash)
+        return already_present
+
+    def contains(self, key_hash: int) -> bool:
+        """Check membership without modifying the filter."""
+        for i in range(self._num_hashes):
+            bit_pos = self._hash_pos(key_hash, i)
+            word_idx = bit_pos >> 6  # bit_pos // 64
+            bit_idx = np.uint64(bit_pos & 63)
+            if not (self._words[word_idx] & (np.uint64(1) << bit_idx)):
+                return False
+        return True
+
+    def add(self, key_hash: int):
+        """Add a key to the filter."""
+        for i in range(self._num_hashes):
+            bit_pos = self._hash_pos(key_hash, i)
+            word_idx = bit_pos >> 6
+            bit_idx = np.uint64(bit_pos & 63)
+            self._words[word_idx] |= np.uint64(1) << bit_idx
+
+    def clear(self):
+        """Reset the filter (remove all entries)."""
+        self._words[:] = np.uint64(0)
+
+    def _hash_pos(self, key_hash: int, i: int) -> int:
+        # Double hashing: h(i) = (h1 + i*h2) mod m
+        h1 = key_hash & 0xFFFFFFFF
+        h2 = ((key_hash >> 32) & 0xFFFFFFFF) | 1  # force a nonzero stride so hashes below 2**32 still probe distinct bits
+        return ((h1 + i * h2) & 0xFFFFFFFFFFFFFFFF) % self._num_bits
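
The sizing comment in `Doorkeeper.__init__` uses the standard Bloom filter formulas. The sketch below evaluates them for the default arguments (`capacity=10000`, `fp_rate=0.01`) and checks the resulting theoretical false-positive rate. The helper names are illustrative, and the `max(m, 64)` clamp from the class is omitted because it does not apply at this size.

```python
import math


def bloom_parameters(capacity: int, fp_rate: float):
    """Return (num_bits, num_hashes) from m = -n*ln(p)/(ln 2)^2 and k = (m/n)*ln 2."""
    m = int(-capacity * math.log(fp_rate) / (math.log(2) ** 2))
    k = max(int((m / capacity) * math.log(2)), 1)
    return m, k


def expected_fp_rate(capacity: int, num_bits: int, num_hashes: int) -> float:
    """Theoretical false-positive rate after `capacity` insertions."""
    return (1.0 - math.exp(-num_hashes * capacity / num_bits)) ** num_hashes


m, k = bloom_parameters(10_000, 0.01)
print(m, k)  # 95850 6
print(expected_fp_rate(10_000, m, k))
```

The filter therefore needs roughly 9.6 bits per expected item (about 12 KB for 10,000 items), and truncating `k` from about 6.64 to 6 keeps the predicted false-positive rate close to the 1% target.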

gptcache/manager/eviction/manager.py

Lines changed: 7 additions & 2 deletions
@@ -43,6 +43,11 @@ def get(
             from gptcache.manager.eviction.distributed_cache import NoOpEviction
             eviction_base = NoOpEviction()
             return eviction_base
+        if name == "wtinylfu":
+            from gptcache.manager.eviction.wtinylfu_eviction import WTinyLFUEviction
+            eviction_base = WTinyLFUEviction(
+                maxsize=maxsize, clean_size=clean_size, on_evict=on_evict, **kwargs
+            )
+            return eviction_base
 
-        else:
-            raise NotFoundError("eviction base", name)
+        raise NotFoundError("eviction base", name)
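
The factory change above follows a simple pattern: each branch returns early, extra keyword arguments are forwarded untouched to the policy constructor, and an unknown name falls through to a single `raise`. The self-contained sketch below imitates that pattern with stand-in classes; `DemoEviction`, `get_eviction`, and this local `NotFoundError` are hypothetical and are not the GPTCache implementations.

```python
class NotFoundError(Exception):
    """Stand-in for GPTCache's NotFoundError (simplified for this sketch)."""


class DemoEviction:
    """Hypothetical policy class that records the options it was built with."""

    def __init__(self, maxsize, clean_size=0, on_evict=None, **kwargs):
        self.maxsize = maxsize
        self.clean_size = clean_size
        self.on_evict = on_evict
        self.options = kwargs  # e.g. window_pct, probation_pct, cost_aware


def get_eviction(name, maxsize, clean_size=0, on_evict=None, **kwargs):
    """Return a policy by name; extra keyword arguments pass through unchanged."""
    if name == "demo":
        return DemoEviction(
            maxsize=maxsize, clean_size=clean_size, on_evict=on_evict, **kwargs
        )
    # No branch matched: fail loudly instead of silently returning None.
    raise NotFoundError(f"eviction base not found: {name}")


policy = get_eviction("demo", maxsize=200, clean_size=50, cost_aware=False)
print(policy.options)  # {'cost_aware': False}
```

Forwarding `**kwargs` is what lets policy-specific options such as `cost_aware` reach the constructor without the factory having to know about them.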
