
Commit 8da1c15

feat(llama-cpp): expose new speculative-decoding option keys
Upstream `spec : parallel drafting support` (ggml-org/llama.cpp#22838) adds the `ngram_mod`, `ngram_map_k`, and `ngram_map_k4v` speculative families and beefs up the draft-model knobs. The previous bump only adapted the API; this exposes the new fields through the grpc-server options dictionary so model configs can drive them.

New `options:` keys (all under `backend: llama-cpp`):

  ngram_mod (`ngram_mod` type):
    spec_ngram_mod_n_min / spec_ngram_mod_n_max / spec_ngram_mod_n_match

  ngram_map_k (`ngram_map_k` type):
    spec_ngram_map_k_size_n / spec_ngram_map_k_size_m / spec_ngram_map_k_min_hits

  ngram_map_k4v (`ngram_map_k4v` type):
    spec_ngram_map_k4v_size_n / spec_ngram_map_k4v_size_m / spec_ngram_map_k4v_min_hits

  ngram lookup caches (`ngram_cache` type):
    spec_lookup_cache_static / lookup_cache_static
    spec_lookup_cache_dynamic / lookup_cache_dynamic

  Draft-model tuning (active when `spec_type` is `draft`):
    draft_cache_type_k / spec_draft_cache_type_k
    draft_cache_type_v / spec_draft_cache_type_v
    draft_threads / spec_draft_threads
    draft_threads_batch / spec_draft_threads_batch
    draft_cpu_moe / spec_draft_cpu_moe (bool flag)
    draft_n_cpu_moe / spec_draft_n_cpu_moe (first N MoE layers on CPU)
    draft_override_tensor / spec_draft_override_tensor
      (comma-separated <tensor regex>=<buffer type>; re-implements
      upstream's static parse_tensor_buffer_overrides since it isn't
      exported)

`spec_type` already accepted comma-separated lists after the previous commit, matching upstream's `common_speculative_types_from_names`.

Docs: refresh `docs/content/advanced/model-configuration.md` with per-family tables and a note about multi-type chaining.

Builds locally with `make docker-build-llama-cpp` (linux/amd64 cpu-llama-cpp AVX variant).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
1 parent 39ee9a9 commit 8da1c15
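
Reviewer note: the "(bool flag)" annotation on `draft_cpu_moe` means the key may appear without a value. Below is a minimal sketch of the truthiness rule as read from the `draft_cpu_moe` handler in the diff below. The helper name `option_flag_enabled` is hypothetical and does not exist in the commit, which inlines the comparison.

```cpp
#include <string>

// Hypothetical helper (not part of the commit): mirrors the rule used by the
// draft_cpu_moe handler. A missing value counts as "enabled"; otherwise only
// a fixed set of lowercase spellings enables the flag.
bool option_flag_enabled(const char * optval) {
    if (optval == nullptr) {
        return true;  // bare key with no value
    }
    const std::string v(optval);
    return v == "true" || v == "1" || v == "yes" || v == "on" || v == "enabled";
}
```

The comparison is case-sensitive, so `TRUE` would not enable the flag, and any unrecognized value leaves it off rather than raising an error.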

2 files changed

Lines changed: 186 additions & 3 deletions


backend/cpp/llama-cpp/grpc-server.cpp

Lines changed: 131 additions & 0 deletions
@@ -36,6 +36,8 @@
 #include <cstdlib>
 #include <fstream>
 #include <iterator>
+#include <list>
+#include <map>
 #include <mutex>
 #include <signal.h>
 #include <thread>
@@ -728,6 +730,135 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
             // The draft context size is no longer a separate field upstream: the draft
             // shares the target context size. Accept the option for backward
             // compatibility but silently ignore it.
+
+        // --- ngram_mod family (upstream --spec-ngram-mod-*) ---
+        } else if (!strcmp(optname, "spec_ngram_mod_n_min")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_mod.n_min = std::stoi(optval_str); } catch (...) {}
+            }
+        } else if (!strcmp(optname, "spec_ngram_mod_n_max")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_mod.n_max = std::stoi(optval_str); } catch (...) {}
+            }
+        } else if (!strcmp(optname, "spec_ngram_mod_n_match")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_mod.n_match = std::stoi(optval_str); } catch (...) {}
+            }
+
+        // --- ngram_map_k family (upstream --spec-ngram-map-k-*) ---
+        } else if (!strcmp(optname, "spec_ngram_map_k_size_n")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_map_k.size_n = (uint16_t)std::stoi(optval_str); } catch (...) {}
+            }
+        } else if (!strcmp(optname, "spec_ngram_map_k_size_m")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_map_k.size_m = (uint16_t)std::stoi(optval_str); } catch (...) {}
+            }
+        } else if (!strcmp(optname, "spec_ngram_map_k_min_hits")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_map_k.min_hits = (uint16_t)std::stoi(optval_str); } catch (...) {}
+            }
+
+        // --- ngram_map_k4v family (upstream --spec-ngram-map-k4v-*) ---
+        } else if (!strcmp(optname, "spec_ngram_map_k4v_size_n")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_map_k4v.size_n = (uint16_t)std::stoi(optval_str); } catch (...) {}
+            }
+        } else if (!strcmp(optname, "spec_ngram_map_k4v_size_m")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_map_k4v.size_m = (uint16_t)std::stoi(optval_str); } catch (...) {}
+            }
+        } else if (!strcmp(optname, "spec_ngram_map_k4v_min_hits")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_map_k4v.min_hits = (uint16_t)std::stoi(optval_str); } catch (...) {}
+            }
+
+        // --- ngram lookup caches (upstream --lookup-cache-static / -dynamic) ---
+        } else if (!strcmp(optname, "spec_lookup_cache_static") || !strcmp(optname, "lookup_cache_static")) {
+            params.speculative.ngram_cache.lookup_cache_static = optval_str;
+        } else if (!strcmp(optname, "spec_lookup_cache_dynamic") || !strcmp(optname, "lookup_cache_dynamic")) {
+            params.speculative.ngram_cache.lookup_cache_dynamic = optval_str;
+
+        // --- draft model KV cache types (upstream --spec-draft-type-k / -v) ---
+        } else if (!strcmp(optname, "draft_cache_type_k") || !strcmp(optname, "spec_draft_cache_type_k")) {
+            params.speculative.draft.cache_type_k = kv_cache_type_from_str(optval_str);
+        } else if (!strcmp(optname, "draft_cache_type_v") || !strcmp(optname, "spec_draft_cache_type_v")) {
+            params.speculative.draft.cache_type_v = kv_cache_type_from_str(optval_str);
+
+        // --- draft model thread counts (upstream --spec-draft-threads / -batch) ---
+        } else if (!strcmp(optname, "draft_threads") || !strcmp(optname, "spec_draft_threads")) {
+            if (optval != NULL) {
+                try {
+                    int n = std::stoi(optval_str);
+                    if (n <= 0) n = (int)std::thread::hardware_concurrency();
+                    params.speculative.draft.cpuparams.n_threads = n;
+                } catch (...) {}
+            }
+        } else if (!strcmp(optname, "draft_threads_batch") || !strcmp(optname, "spec_draft_threads_batch")) {
+            if (optval != NULL) {
+                try {
+                    int n = std::stoi(optval_str);
+                    if (n <= 0) n = (int)std::thread::hardware_concurrency();
+                    params.speculative.draft.cpuparams_batch.n_threads = n;
+                } catch (...) {}
+            }
+
+        // --- draft model MoE on CPU (upstream --spec-draft-cpu-moe / --spec-draft-n-cpu-moe) ---
+        } else if (!strcmp(optname, "draft_cpu_moe") || !strcmp(optname, "spec_draft_cpu_moe")) {
+            // Bool-style flag: optval may be missing, "true"/"1"/"yes" enables.
+            const bool enable = (optval == NULL) ||
+                optval_str == "true" || optval_str == "1" || optval_str == "yes" ||
+                optval_str == "on" || optval_str == "enabled";
+            if (enable) {
+                params.speculative.draft.tensor_buft_overrides.push_back(llm_ffn_exps_cpu_override());
+            }
+        } else if (!strcmp(optname, "draft_n_cpu_moe") || !strcmp(optname, "spec_draft_n_cpu_moe")) {
+            if (optval != NULL) {
+                try {
+                    int n = std::stoi(optval_str);
+                    if (n < 0) n = 0;
+                    // Keep override-name storage alive for the lifetime of the params struct
+                    // (mirrors upstream arg.cpp behavior with a function-local static).
+                    static std::list<std::string> buft_overrides_draft;
+                    for (int i = 0; i < n; ++i) {
+                        buft_overrides_draft.push_back(llm_ffn_exps_block_regex(i));
+                        params.speculative.draft.tensor_buft_overrides.push_back(
+                            {buft_overrides_draft.back().c_str(), ggml_backend_cpu_buffer_type()});
+                    }
+                } catch (...) {}
+            }
+
+        // --- draft model tensor buffer overrides (upstream --spec-draft-override-tensor) ---
+        } else if (!strcmp(optname, "draft_override_tensor") || !strcmp(optname, "spec_draft_override_tensor")) {
+            // Format: <tensor regex>=<buffer type>,<tensor regex>=<buffer type>,...
+            // We replicate upstream's parse_tensor_buffer_overrides (static in arg.cpp).
+            ggml_backend_load_all();
+            std::map<std::string, ggml_backend_buffer_type_t> buft_list;
+            for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
+                auto * dev = ggml_backend_dev_get(i);
+                auto * buft = ggml_backend_dev_buffer_type(dev);
+                if (buft) {
+                    buft_list[ggml_backend_buft_name(buft)] = buft;
+                }
+            }
+            static std::list<std::string> draft_override_names;
+            std::string cur;
+            auto flush = [&](const std::string & spec) {
+                auto pos = spec.find('=');
+                if (pos == std::string::npos) return;
+                const std::string name = spec.substr(0, pos);
+                const std::string type = spec.substr(pos + 1);
+                auto it = buft_list.find(type);
+                if (it == buft_list.end()) return; // unknown buffer type: ignore
+                draft_override_names.push_back(name);
+                params.speculative.draft.tensor_buft_overrides.push_back(
+                    {draft_override_names.back().c_str(), it->second});
+            };
+            for (char c : optval_str) {
+                if (c == ',') { if (!cur.empty()) { flush(cur); cur.clear(); } }
+                else { cur.push_back(c); }
+            }
+            if (!cur.empty()) flush(cur);
         }
     }
 
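
Reviewer note: the `draft_override_tensor` handler above can be hard to reason about in isolation because it depends on ggml's device registry. The following is a standalone sketch of the same splitting rule under stated assumptions: `buffer_type_id` is a stand-in for `ggml_backend_buffer_type_t`, and `parse_overrides` and the `known` table are hypothetical names that do not appear in the commit.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Stand-in for ggml_backend_buffer_type_t so the sketch compiles without ggml.
using buffer_type_id = int;

// Splits "<tensor regex>=<buffer type>,..." on commas and keeps only entries
// whose buffer type is present in `known`. Entries without '=' or with an
// unknown buffer type are skipped silently, as in the handler above.
std::vector<std::pair<std::string, buffer_type_id>> parse_overrides(
        const std::string & spec_list,
        const std::map<std::string, buffer_type_id> & known) {
    std::vector<std::pair<std::string, buffer_type_id>> out;
    std::size_t start = 0;
    while (start <= spec_list.size()) {
        std::size_t end = spec_list.find(',', start);
        if (end == std::string::npos) {
            end = spec_list.size();
        }
        const std::string spec = spec_list.substr(start, end - start);
        start = end + 1;

        const std::size_t eq = spec.find('=');  // split at the first '='
        if (eq == std::string::npos) {
            continue;  // empty or malformed entry
        }
        const auto it = known.find(spec.substr(eq + 1));
        if (it == known.end()) {
            continue;  // unknown buffer type
        }
        out.emplace_back(spec.substr(0, eq), it->second);
    }
    return out;
}
```

Two consequences of this rule are worth keeping in mind when writing configs: entries are split on every comma, so a tensor regex that itself contains a comma cannot be expressed, and a misspelled buffer type is dropped without any diagnostic.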

docs/content/advanced/model-configuration.md

Lines changed: 55 additions & 3 deletions
@@ -251,18 +251,68 @@ options:
 
 These are set via the `options:` array in the model configuration (format: `key:value`):
 
+**Common options**
+
 | Option | Type | Default | Description |
 |--------|------|---------|-------------|
-| `spec_type` | string | `none` | Speculative decoding type (see table below) |
+| `spec_type` / `speculative_type` | string | `none` | Speculative decoding type, or comma-separated list to chain multiple (see table below) |
 | `spec_n_max` / `draft_max` | int | 16 | Maximum number of tokens to draft per step |
 | `spec_n_min` / `draft_min` | int | 0 | Minimum draft tokens required to use speculation |
 | `spec_p_min` / `draft_p_min` | float | 0.75 | Minimum probability threshold for greedy acceptance |
 | `spec_p_split` | float | 0.1 | Split probability for tree-based branching |
+
+**Draft-model options** (apply when `spec_type=draft`, i.e. a `draft_model` is configured)
+
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `draft_gpu_layers` | int | -1 | GPU layers for the draft model (-1 = use default) |
+| `draft_threads` / `spec_draft_threads` | int | same as main | Threads used by the draft model (`<= 0` = hardware concurrency) |
+| `draft_threads_batch` / `spec_draft_threads_batch` | int | same as `draft_threads` | Threads used by the draft model during batch / prompt processing |
+| `draft_cache_type_k` / `spec_draft_cache_type_k` | string | `f16` | KV cache K data type for the draft model (same values as `cache_type_k`) |
+| `draft_cache_type_v` / `spec_draft_cache_type_v` | string | `f16` | KV cache V data type for the draft model |
+| `draft_cpu_moe` / `spec_draft_cpu_moe` | bool | false | Keep all MoE expert weights of the draft model on CPU |
+| `draft_n_cpu_moe` / `spec_draft_n_cpu_moe` | int | 0 | Keep MoE expert weights of the first N draft-model layers on CPU |
+| `draft_override_tensor` / `spec_draft_override_tensor` | string | "" | Comma-separated `<tensor regex>=<buffer type>` overrides for the draft model |
+| `draft_ctx_size` | int | (ignored) | Deprecated upstream: the draft now shares the target context size. Accepted for backward compatibility but has no effect. |
+
+**`ngram_simple` options** (used when `spec_type` includes `ngram_simple`)
+
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
 | `spec_ngram_size_n` / `ngram_size_n` | int | 12 | N-gram lookup size |
 | `spec_ngram_size_m` / `ngram_size_m` | int | 48 | M-gram proposal size |
 | `spec_ngram_min_hits` / `ngram_min_hits` | int | 1 | Minimum hits for accepting n-gram proposals |
-| `draft_gpu_layers` | int | -1 | GPU layers for the draft model (-1 = use default) |
-| `draft_ctx_size` | int | 0 | Context size for the draft model (0 = auto) |
+
+**`ngram_mod` options** (used when `spec_type` includes `ngram_mod`)
+
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `spec_ngram_mod_n_min` | int | 48 | Minimum number of ngram tokens to use |
+| `spec_ngram_mod_n_max` | int | 64 | Maximum number of ngram tokens to use |
+| `spec_ngram_mod_n_match` | int | 24 | Ngram lookup length |
+
+**`ngram_map_k` options** (used when `spec_type` includes `ngram_map_k`)
+
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `spec_ngram_map_k_size_n` | int | 12 | N-gram lookup size |
+| `spec_ngram_map_k_size_m` | int | 48 | M-gram proposal size |
+| `spec_ngram_map_k_min_hits` | int | 1 | Minimum hits for accepting proposals |
+
+**`ngram_map_k4v` options** (used when `spec_type` includes `ngram_map_k4v`)
+
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `spec_ngram_map_k4v_size_n` | int | 12 | N-gram lookup size |
+| `spec_ngram_map_k4v_size_m` | int | 48 | M-gram proposal size |
+| `spec_ngram_map_k4v_min_hits` | int | 1 | Minimum hits for accepting proposals |
+
+**`ngram_cache` lookup files**
+
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `spec_lookup_cache_static` / `lookup_cache_static` | string | "" | Path to a static ngram lookup cache file |
+| `spec_lookup_cache_dynamic` / `lookup_cache_dynamic` | string | "" | Path to a dynamic ngram lookup cache file (updated by generation) |
 
 #### Speculative Type Values
 
@@ -277,6 +327,8 @@ These are set via the `options:` array in the model configuration (format: `key:
 | `ngram_mod` | Modified n-gram speculation |
 | `ngram_cache` | 3-level n-gram cache |
 
+Multiple types can be chained by passing a comma-separated list to `spec_type` (e.g. `spec_type:ngram_simple,ngram_mod`). The runtime tries them in order and accepts the first proposal that meets the acceptance criteria.
+
 {{% notice note %}}
 Speculative decoding is automatically disabled when multimodal models (with `mmproj`) are active. The `n_draft` parameter can also be overridden per-request.
 {{% /notice %}}
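
Reviewer note: the docs change describes chaining through a comma-separated `spec_type` value such as `ngram_simple,ngram_mod`. The sketch below shows one way such a value could be split into an ordered list of type names. It is illustrative only: `split_spec_types` is a hypothetical helper, and it makes no claim about how upstream's `common_speculative_types_from_names` validates or trims names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits a comma-separated spec_type value into names,
// preserving order and skipping empty segments. No whitespace trimming and no
// validation against the list of known type names is performed here.
std::vector<std::string> split_spec_types(const std::string & value) {
    std::vector<std::string> names;
    std::size_t start = 0;
    while (start <= value.size()) {
        std::size_t end = value.find(',', start);
        if (end == std::string::npos) {
            end = value.size();
        }
        if (end > start) {
            names.push_back(value.substr(start, end - start));
        }
        start = end + 1;
    }
    return names;
}
```

Order is preserved because, per the docs text above, the runtime tries the chained types in the order given.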
