Currently `MLXLMCommon` has some basic support for a cache, but it isn't persisted across calls to `generate()`.
Although it looks like there could be a way to pass a `KVCache` to `generate()`, for the app to manage the cache it would ultimately have to cross the `Sendable` boundary. That isn't possible, since `MLXArray` is not `Sendable`, and it isn't desirable or necessary either.
A prompt cache could be managed by the `ModelContainer` actor and stored in its context as `ModelContext.promptCache`. Note that the prompt cache is an array of `KVCache`. In `mlx_lm`, the `PromptCache` object also stores the token ids of the cached prompt and the model key, which is used to check whether the model has changed.
We could implement a similar struct:
```swift
public struct PromptCache {
    public let cache: [KVCache]
    public let modelKey: String
    public let tokens: MLXArray
}
```
The `PromptCache` struct could also have functions for trimming.
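As a rough sketch of what trimming might look like, assuming `KVCache` exposes a trim operation and an `offset` of cached tokens (analogous to `mlx_lm`'s `trim_prompt_cache`) — the method names below are assumptions, not existing API:

```swift
// Hypothetical sketch: find how much of a new prompt is already cached,
// and trim each layer's cache back to that common prefix.
// `trim(_:)` and `offset` on KVCache are assumed names here.
extension PromptCache {
    /// Number of leading tokens the cached prompt shares with `prompt`.
    func commonPrefixLength(with prompt: [Int]) -> Int {
        let cached = tokens.asArray(Int.self)
        var n = 0
        while n < cached.count, n < prompt.count, cached[n] == prompt[n] {
            n += 1
        }
        return n
    }

    /// Trim each layer's cache so only the first `count` tokens remain.
    /// KVCache is a reference type, so this mutates the caches in place.
    func trim(to count: Int) {
        for layerCache in cache {
            layerCache.trim(layerCache.offset - count) // assumed trim-by-suffix API
        }
    }
}
```

With something like this, `generate()` could be handed only the uncached suffix of a new prompt when the cached prompt is a prefix of it.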
Functions analogous to `mlx_lm`'s `get_prompt_cache` could go in the `ModelContainer` actor.
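For example, a `get_prompt_cache` analogue might look roughly like this. Everything beyond the `PromptCache` struct itself (the `promptCache` property, the cache factory, and treating `promptCache` as mutable) is an assumption for illustration:

```swift
// Hypothetical sketch inside the ModelContainer actor: reuse the existing
// cache when the model is unchanged, otherwise start a fresh one.
// `context.promptCache` being settable and `newCache(parameters:)` existing
// are assumptions, not current MLXLMCommon API.
extension ModelContainer {
    func getPromptCache(for prompt: MLXArray, modelKey: String) -> [KVCache] {
        if let existing = context.promptCache, existing.modelKey == modelKey {
            // Same model: reuse the cache (possibly after trimming to the
            // common prefix of the cached and new prompts).
            return existing.cache
        }
        // Model changed, or no cache yet: create a fresh cache.
        let fresh = context.model.newCache(parameters: nil) // assumed factory
        context.promptCache = PromptCache(
            cache: fresh, modelKey: modelKey, tokens: prompt
        )
        return fresh
    }
}
```

Because `ModelContainer` is an actor, callers would go through actor isolation to reach the cache, which keeps `MLXArray` from ever needing to cross the `Sendable` boundary.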
I'm currently having a go at implementing this. Interested in any suggestions on the best approach.