I've been using `--spec-type ngram-mod` and it's surprisingly good. On my machine it actually beats using a draft model. I'm genuinely impressed.

Right now the ngram cache only learns from the current conversation and recent generations. I want to give it a head start by feeding it a text file full of domain-specific content (like my own codebase, documentation, or outputs from bigger models).
What I want:

A simple parameter like `--ngram-text ./my_stuff.txt`. The server would read the file, tokenize it, and build the ngram cache from it at startup (a rough sketch of what this could look like follows the list below).

Why this would be awesome:
- Achieve functionality similar to LoRA or MoE: switch the ngram model based on different scenarios.
- Enable ngram tuning: we can control the ngram data to easily fine-tune the ngram model and run automated benchmarks.
- Possibly transfer model capabilities: collect outputs from the strongest LLMs (models like GLM, Kimi, DeepSeek; there are many distilled datasets on Hugging Face) as ngram source material, allowing the ngram model to guide a smaller model and improve its performance.
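To make the request concrete, here is a minimal sketch of what the startup seeding could look like, built on `common/common.h` and `common/ngram-cache.h` as I understand them today. The helper name `seed_ngram_cache_from_file` is hypothetical; the `common_tokenize` and `common_ngram_cache_update` calls are my best reading of the current headers, so the exact signatures should be checked against the tree.

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

#include "common.h"       // common_tokenize
#include "ngram-cache.h"  // common_ngram_cache, common_ngram_cache_update

// Hypothetical helper, not existing llama.cpp code: read a plain-text file,
// tokenize it with the loaded model, and build an ngram cache from it.
static common_ngram_cache seed_ngram_cache_from_file(
        llama_context * ctx, const std::string & path,
        int ngram_min, int ngram_max) {
    // Slurp the whole file; domain corpora for this use case should be
    // small enough that streaming isn't worth the complexity in a sketch.
    std::ifstream file(path);
    std::stringstream ss;
    ss << file.rdbuf();
    std::string text = ss.str();

    // Tokenize with the target model's vocab so the cached ngrams line up
    // with the token ids the server will actually see during generation.
    std::vector<llama_token> tokens = common_tokenize(ctx, text, /*add_special=*/false);

    // One-shot build: passing nnew == tokens.size() marks every token as new.
    common_ngram_cache cache;
    common_ngram_cache_update(cache, ngram_min, ngram_max, tokens,
                              (int) tokens.size(), /*print_progress=*/true);
    return cache;
}
```

The command-line hook would then just call this with the value of the proposed flag, e.g. `llama-server -m model.gguf --spec-type ngram-mod --ngram-text ./my_stuff.txt`, and use the returned cache wherever the dynamic cache is consulted today.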
Why this discussion:
I almost opened an issue but saw the note that new features should be discussed first. So here I am.
I think the internal API already exists; this just needs a command-line hook. See llama.cpp/common/ngram-cache.h, lines 76 to 86 at commit 0d0764d.
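The save/load/merge half of that header also seems to cover the scenario-switching idea above: build per-domain caches offline, then pick one at startup. A minimal sketch, assuming `common_ngram_cache_save`, `common_ngram_cache_load`, and `common_ngram_cache_merge` keep their current shapes; the function name and the merge-into-dynamic-cache wiring are my assumptions, not existing code:

```cpp
#include <string>

#include "ngram-cache.h"

// Sketch: load a precomputed, domain-specific cache at startup and fold it
// into the server's dynamic cache. File name and wiring are assumptions.
static void load_domain_ngram_cache(common_ngram_cache & dynamic_cache,
                                    std::string path) {
    // Load a cache previously written with common_ngram_cache_save().
    common_ngram_cache static_cache = common_ngram_cache_load(path);

    // Merge it into the dynamic cache so subsequent generations keep
    // updating on top of the seeded statistics.
    common_ngram_cache_merge(dynamic_cache, static_cache);
}
```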
Thanks!
Replies: 1 comment

That would be very awesome, if possible!