Currently `MLXLMCommon` has some basic support for a cache, but it isn't persisted across calls to `generate()`.
Although it looks like there could be a way to pass a `KVCache` to `generate()`, for the app to manage the cache it would ultimately have to cross the `Sendable` boundary. That isn't possible, since `MLXArray` is not `Sendable`, and it isn't desirable or necessary either.
A prompt cache could be managed by the `ModelContainer` actor and stored in its context as `ModelContext.promptCache`. Note that the prompt cache is an array of `KVCache`. In `mlx_lm`, the `PromptCache` object also stores the token ids of the cached prompt and the model key, which is used to check whether the model has changed.
We could implement a similar struct:
```swift
public struct PromptCache {
    public let cache: [KVCache]
    public let modelKey: String
    public let tokens: MLXArray
}
```
The `PromptCache` struct could also have functions for trimming.
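As a rough sketch of what trimming might look like, assuming `KVCache` exposes a trim operation and an `offset` of cached tokens (analogous to `mlx_lm`'s `trim_prompt_cache`) — the method names below are assumptions, not existing API:

```swift
// Hypothetical sketch: find how much of a new prompt is already cached,
// and trim each layer's cache back to that common prefix.
// `trim(_:)` and `offset` on KVCache are assumed names here.
extension PromptCache {
    /// Number of leading tokens the cached prompt shares with `prompt`.
    func commonPrefixLength(with prompt: [Int]) -> Int {
        let cached = tokens.asArray(Int.self)
        var n = 0
        while n < cached.count, n < prompt.count, cached[n] == prompt[n] {
            n += 1
        }
        return n
    }

    /// Trim each layer's cache so only the first `count` tokens remain.
    /// KVCache is a reference type, so this mutates the caches in place.
    func trim(to count: Int) {
        for layerCache in cache {
            layerCache.trim(layerCache.offset - count) // assumed trim-by-suffix API
        }
    }
}
```

With something like this, `generate()` could be handed only the uncached suffix of a new prompt when the cached prompt is a prefix of it.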
Functions analogous to `mlx_lm`'s `get_prompt_cache` could go in the `ModelContainer` actor.
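For example, a `get_prompt_cache` analogue might look roughly like this. Everything beyond the `PromptCache` struct itself (the `promptCache` property, the cache factory, and treating `promptCache` as mutable) is an assumption for illustration:

```swift
// Hypothetical sketch inside the ModelContainer actor: reuse the existing
// cache when the model is unchanged, otherwise start a fresh one.
// `context.promptCache` being settable and `newCache(parameters:)` existing
// are assumptions, not current MLXLMCommon API.
extension ModelContainer {
    func getPromptCache(for prompt: MLXArray, modelKey: String) -> [KVCache] {
        if let existing = context.promptCache, existing.modelKey == modelKey {
            // Same model: reuse the cache (possibly after trimming to the
            // common prefix of the cached and new prompts).
            return existing.cache
        }
        // Model changed, or no cache yet: create a fresh cache.
        let fresh = context.model.newCache(parameters: nil) // assumed factory
        context.promptCache = PromptCache(
            cache: fresh, modelKey: modelKey, tokens: prompt
        )
        return fresh
    }
}
```

Because `ModelContainer` is an actor, callers would go through actor isolation to reach the cache, which keeps `MLXArray` from ever needing to cross the `Sendable` boundary.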
I'm currently having a go at implementing this. Interested in any suggestions on the best approach.