Commit 79148ba
[None][fix] Draft KV cache should not allocate host memory
When using one-model speculative decoding with a separate draft KV cache
(e.g. EAGLE3), the draft cache inherits the target's KvCacheConfig, which
may have a non-zero host_cache_size. This causes unnecessary host memory
allocation for the draft cache. Only the target model should use host
offloading, since draft tokens are transient and may be rejected during
verification.
Fix: set host_cache_size=0 on the draft KV cache config before creating
the draft KV cache manager.
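A minimal sketch of the idea, assuming a simplified stand-in for KvCacheConfig (the class below and the helper `make_draft_kv_cache_config` are hypothetical and model only the fields needed here). It derives the draft config from the target's, keeps every other inherited setting, and forces host_cache_size to 0. The sketch copies the config so the target keeps its host offloading; whether the actual change copies or mutates a separate object is not visible on this page.

```python
import copy
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-in for the real KvCacheConfig; only the fields used
# in this illustration are modeled.
@dataclass
class KvCacheConfig:
    max_tokens: Optional[int] = None
    host_cache_size: Optional[int] = None  # bytes of host memory for offloading


def make_draft_kv_cache_config(target_config: KvCacheConfig) -> KvCacheConfig:
    """Derive the draft KV cache config from the target's config.

    All other settings are inherited, but host offloading is disabled
    because draft tokens are transient and may be rejected during
    verification.
    """
    # Deep copy so that disabling offloading does not affect the target.
    draft_config = copy.deepcopy(target_config)
    draft_config.host_cache_size = 0
    return draft_config


target = KvCacheConfig(max_tokens=8192, host_cache_size=4 * 1024**3)
draft = make_draft_kv_cache_config(target)
print(target.host_cache_size, draft.host_cache_size)  # prints "4294967296 0"
```

The draft config created this way would then be passed to whatever constructs the draft KV cache manager, while the target manager keeps the original config unchanged.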
Signed-off-by: Shang-Pin Sheng <shang-pin@tmatehq.com>
1 parent 64b5c79
1 file changed
Lines changed: 5 additions & 0 deletions
Diff (file name and line contents not preserved): 5 lines added after original line 697, at new lines 698-702; context lines 695-700 are otherwise unchanged.