Commit e8abfc1

Surface model ftype (quantization) through the Java layer and /v1/models
Wires the new b9862 llama_ftype_name / llama_model_ftype quant-type info (exposed by server_context_meta::model_ftype) up through JNI to Java:

- jllama.cpp getModelMetaJson now emits "ftype" from server_context_meta.
- ModelMeta.getFtype() and the convenience LlamaModel.getModelFtype() expose the quant label (e.g. "Q4_K - Medium"; a guessed type is prefixed with "(guessed) "), empty when the native layer does not report it.
- OpenAiCompatServer advertises it as data[].ftype in GET /v1/models, matching the upstream server's get_model_info() key. The value is threaded through OpenAiServerConfig.modelFtype (new field/builder/getter) from the loaded model, mirroring how supportsVision is threaded, which keeps the "models built from config alone" invariant. The field is omitted when unknown/blank.

Tests: +2 ModelMeta, +1 OpenAiServerConfig, +1 OpenAiSseFormatter (ftype present/omitted).

Verified end-to-end: full native jllama build against b9862 with all six patches applied (100% link), model-free load smoke test green, and 63 model-free Java unit tests pass; clang-format 22.1.5 + spotless + javadoc all clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
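The commit message says a guessed type is prefixed with "(guessed) " and that an empty string means the native layer reported nothing. A caller that wants those two cases separated from the label itself could split the string as in the sketch below. `FtypeLabel` and its members are hypothetical names used for illustration only; they are not part of this commit or of llama.cpp.

```java
/** Hypothetical helper (not part of the commit): splits an ftype label into its parts. */
final class FtypeLabel {
    /** Prefix that, per the commit message, marks a guessed file type. */
    static final String GUESSED_PREFIX = "(guessed) ";

    /** Label without the prefix; "" when the type is unknown. */
    final String name;
    /** True when the raw label carried the "(guessed) " prefix. */
    final boolean guessed;

    private FtypeLabel(String name, boolean guessed) {
        this.name = name;
        this.guessed = guessed;
    }

    /** Parses a raw label such as the value returned by getModelFtype(). */
    static FtypeLabel parse(String raw) {
        if (raw == null || raw.isEmpty()) {
            return new FtypeLabel("", false); // unknown: nothing was reported
        }
        if (raw.startsWith(GUESSED_PREFIX)) {
            return new FtypeLabel(raw.substring(GUESSED_PREFIX.length()), true);
        }
        return new FtypeLabel(raw, false);
    }

    /** True when a non-empty label is available. */
    boolean isKnown() {
        return !name.isEmpty();
    }
}
```

Under these assumptions, `FtypeLabel.parse(model.getModelFtype())` would give a caller the bare label plus a flag, without string matching at each call site.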
1 parent 58628fd commit e8abfc1

11 files changed

Lines changed: 136 additions & 5 deletions

docs/history/llama-cpp-breaking-changes.md

Lines changed: 1 addition & 1 deletion
@@ -412,6 +412,6 @@ Used during `llama.cpp` version bumps: when upgrading, scan this file from the r
| b9842–b9859 | `common/arg.cpp` + `common/http.h` + `tools/server/server-{http,models}.cpp` + `tools/server/server-cors-proxy.h` | **IPv6 URL handling + hf-split primary fix**, all inside upstream-compiled TUs the project already builds. (1) `common/http.h` gains a `common_http_format_host()` helper that brackets an IPv6 literal host (`[::1]`) per RFC 3986, and `common_http_parse_url` now splits the authority so a bracketed IPv6 literal keeps its inner colons; `server-http.cpp` (listening-address string), `server-models.cpp` (proxy `Host` header) and `server-cors-proxy.h` (proxy log) each `#include "http.h"` and route the host through it. `server-http.cpp`/`server-models.cpp`/`server-cors-proxy.h` are already compiled into `jllama`; the project binds none of these symbols and passes host/port as plain params, so behaviour is unchanged for localhost binds. (2) `common/arg.cpp` `common_models_handler_apply` now threads a `primary` hf-split file (the `00001-of` part) through the `add_tasks` lambda instead of assuming index 0 — internal to the `--hf`/`--hf-repo-v`/`--spec-draft-hf` download planner, which the project never calls (`grep -rn "common_models_handler\|common_http_format_host" src/main/cpp src/test/cpp` → zero matches). No project source changes required. |
| b9842–b9859 | `ggml/src/ggml-cpu/` + `ggml/src/ggml-cuda/` + `ggml/src/ggml-opencl/` + `ggml/src/ggml-vulkan/` + `ggml/src/ggml-webgpu/` + `ggml/src/ggml-hexagon/` + `ggml/src/ggml-backend.cpp` + `src/models/qwen3next.cpp` + `tools/ui/**` | Backend-internal only, no API surface visible to `jllama.cpp`. CPU adds an AVX2/AVX `ggml_vec_dot_nvfp4_q8_0` + a UE4M3 lookup table (`kvalues_mxfp4` renamed to shared `kvalues_fp4`); CUDA adds head-dim-512 flash-attention MMA/tile instances, a strided `get_rows_back` grid-clamp fix (new `test-backend-ops` case for row count > 65535), a gfx900 MMQ gate, and drops the CPU→CUDA async-copy path (scheduler now copies inputs synchronously); OpenCL adds full Q1_0 mul_mat/mul_mv + a `GGML_OPENCL_USE_ADRENO_BIN_KERNELS` prebuilt-binary-kernel loader (OFF by default; affects only the `opencl-*` classifiers); Vulkan rolls the mul_mm BK loop on Asahi/Honeykrisp; WebGPU adds NVFP4 support; Hexagon reworks HVX/HMX flash-attention (new `flash-attn-ops.h`/`hmx-fa-kernels.h`, MUL_MAT_ADD fusion). `qwen3next.cpp` records `t_layer_inp[il]` for MTP. All internal to upstream-compiled `libllama`/`ggml`/backends; the WebUI **auto-follows** the pinned `GIT_TAG` (the `build-webui` CI job rebuilds it), so its edits (PWA navigate-fallback, chat-store foreign-conversation guards) need no manual step. No project source changes required. |
| b9842–b9859 | upstream verification (sandbox) | All four patches (`0001`–`0004`) re-verified to **apply cleanly** against b9859 via `git apply --check` over the actual b9859 sources fetched from `raw.githubusercontent.com` (github.com git-clone is blocked in this sandbox, so a full `FetchContent` build could not run — exit 0 for `common/arg.{cpp,h}`, `tests/test-arg-parser.cpp`, `tools/server/server-context.{cpp,h}`, `server-common.cpp`, `tests/test-chat.cpp`). The only patch-target file that changed in this range is `common/arg.cpp`, whose b9859 edit is in `common_models_handler_apply` (~L496) — disjoint from patch 0001's `make_utf8_argv`/`common_params_parse` hunks (~L931/L971) and the ~34 standalone-main flips (unchanged in this range), so patch 0001 still applies. Patches 0002/0003/0004 target files untouched in b9842→b9859, so their hunks are byte-identical to b9842. OuteTTS generator anchors hold (`tools/tts/tts.cpp` unchanged). Full build + `ctest` (target 459/459) to be confirmed by the CI pipeline. |
- (removed; old line 415)
| b9859–b9862 | `include/llama.h` + `src/llama-model-loader.cpp` + `src/llama-model.{cpp,h}` + `tools/server/server-context.{cpp,h}` + `tools/cli/cli.cpp` | **New feature (additive C API), no break.** Upstream promoted the previously-`static` `llama_model_ftype_name(llama_ftype)` (in `llama-model-loader.cpp`) to a **public** `LLAMA_API const char * llama_ftype_name(enum llama_ftype)` and added `LLAMA_API enum llama_ftype llama_model_ftype(const llama_model *)` (backed by a new `llama_model::ftype()` / `impl::ftype` cached from `ml.ftype` at `load_hparams`). `server_context::get_meta()` now fills a **new `std::string model_ftype`** field on `server_context_meta` (`server-context.h`) and `server_routes::get_model_info()` emits a `"ftype"` key — so the **NativeServer** mode's model-info/`/props` surface gains the quant type automatically (WebUI + `llama-server` clients). `cli.cpp` prints an `ftype :` line. **All inside upstream-compiled `libllama`/server TUs the project already links** — the project binds none of the new symbols (`grep` → only a *comment* mentions `server_context_meta` in `jllama.cpp`; nothing constructs it, and adding a trailing field is source-additive). No project source changes required. Optional future work: bind `llama_model_ftype`/`llama_ftype_name` into `LlamaModel` + surface `ftype` in the Java `OpenAiCompatServer` `propsJson` (the Java `/props` is a hand-written reimpl and does **not** inherit the upstream `get_model_info` field). |
+ (added; new line 415)
| b9859–b9862 | `include/llama.h` + `src/llama-model-loader.cpp` + `src/llama-model.{cpp,h}` + `tools/server/server-context.{cpp,h}` + `tools/cli/cli.cpp` | **New feature (additive C API), no break.** Upstream promoted the previously-`static` `llama_model_ftype_name(llama_ftype)` (in `llama-model-loader.cpp`) to a **public** `LLAMA_API const char * llama_ftype_name(enum llama_ftype)` and added `LLAMA_API enum llama_ftype llama_model_ftype(const llama_model *)` (backed by a new `llama_model::ftype()` / `impl::ftype` cached from `ml.ftype` at `load_hparams`). `server_context::get_meta()` now fills a **new `std::string model_ftype`** field on `server_context_meta` (`server-context.h`) and `server_routes::get_model_info()` emits a `"ftype"` key — so the **NativeServer** mode's model-info/`/props` surface gains the quant type automatically (WebUI + `llama-server` clients). `cli.cpp` prints an `ftype :` line. **All inside upstream-compiled `libllama`/server TUs the project already links** — the project binds none of the new symbols (`grep` → only a *comment* mentions `server_context_meta` in `jllama.cpp`; nothing constructs it, and adding a trailing field is source-additive). No project source changes required for the bump itself. **Follow-up (done):** the quant type is now also surfaced through the Java layer — `getModelMetaJson` emits `"ftype"` (from `server_context_meta::model_ftype`), `ModelMeta.getFtype()` / `LlamaModel.getModelFtype()` expose it, and the Java `OpenAiCompatServer` advertises it as `data[].ftype` in `GET /v1/models` (threaded through `OpenAiServerConfig.modelFtype`, mirroring how `supportsVision` is threaded), matching the upstream `get_model_info()` key. |
| b9859–b9862 | `ggml/src/ggml-cuda/gated_delta_net.{cu,cuh}` + `ggml/src/ggml-cuda/ggml-cuda.cu` + `vendor/cpp-httplib/httplib.{cpp,h}` (v0.48.0→v0.49.0) | Backend/vendor-internal only, no API surface visible to `jllama.cpp`. (1) **CUDA gated-delta-net perf**: a fused `gated_delta_net → cpy` path (`ggml_cuda_op_gated_delta_net_fused_cache` + `ggml_cuda_try_gdn_cache_fusion`) lets the kernel scatter recurrent-state snapshots straight into the rollback cache and skip the follow-up strided copy (a decode win for gated-delta / hybrid-recurrent models, e.g. Qwen3-Next); plus a `ggml_cuda_is_view_or_noop` refactor. Affects only the `cuda13-*` classifiers. (2) **cpp-httplib bumped to v0.49.0** (the vendored copy inside llama.cpp, compiled into `jllama` via `server-http.cpp`): locale-independent ASCII classifiers (`is_ascii_digit/alpha/alnum` replacing `std::isdigit`/`isalnum`), a new additive `MultipartFormDataWriter` + `is_valid_multipart_boundary`, multipart field-name/filename escaping (WHATWG), an unsigned base64 accumulator (UB fix), a `ThreadPool` `idle_timeout_sec` ctor param (defaulted — backward-compatible), a `perform_websocket_handshake` `is_ssl` arg (internal), and a `path_encode_`-gated query-normalization skip. All internal to the compiled TU; the project binds no httplib symbol directly (it uses the upstream `server-http.cpp` transport). No project source changes required. |
| b9859–b9862 | upstream verification (sandbox) | All **six** patches (`0001`–`0006`) re-verified against b9862. The b9859→b9862 diff touches only two patch-target files — `tools/server/server-context.cpp` and `server-context.h` (the `model_ftype`/`get_meta`/`get_model_info` additions at ~L3989/~L5121 and the new struct field at ~L50). Patches **0002** (load-progress guard, ~L1152), **0003** (slot-prompt-similarity getter/setter, ~L3965 + `server_context` struct ~L106) and **0005** (near-prompt-end checkpoints, `update_slots` ~L3560) were **applied in sequence** against the actual b9862 `server-context.{cpp,h}` fetched from `raw.githubusercontent.com` — all three applied cleanly (their regions are disjoint from and far from the b9862 additions). Patches **0001** (`common/arg.{cpp,h}`, `test-arg-parser.cpp`, ~34 standalone mains), **0004** (`server-common.cpp`, `test-chat.cpp`) and **0006** (`server.cpp`) target files **not present** in the b9859→b9862 changed-file list, so their hunks are byte-identical to b9859 and apply unchanged. OuteTTS generator anchors hold (`tools/tts/tts.cpp` unchanged in this range). Full build + `ctest` (target 459/459) to be confirmed by the CI pipeline. |

llama/src/main/cpp/jllama.cpp

Lines changed: 1 addition & 0 deletions
@@ -802,6 +802,7 @@ JNIEXPORT jstring JNICALL Java_net_ladenthin_llama_LlamaModel_getModelMetaJson(J
         {"modalities", {{"vision", m.has_inp_image}, {"audio", m.has_inp_audio}}},
         {"name", m.model_name},
         {"architecture", std::string(arch_buf)},
+        {"ftype", m.model_ftype},
     };
     // Resolved default chat template (Jinja); empty when the model ships none.
     const char *chat_tmpl = mdl != nullptr ? llama_model_chat_template(mdl, /*name*/ nullptr) : nullptr;

llama/src/main/java/net/ladenthin/llama/LlamaModel.java

Lines changed: 12 additions & 0 deletions
@@ -856,6 +856,18 @@ public boolean supportsAudio() {
         return getModelMeta().supportsAudio();
     }
 
+    /**
+     * Returns the loaded model's file type (quantization) as a human-readable string, e.g.
+     * {@code "Q8_0"} or {@code "Q4_K - Medium"} (llama.cpp {@code llama_ftype_name}); a guessed
+     * type is prefixed with {@code "(guessed) "}. Returns an empty string when the native layer does
+     * not report it.
+     *
+     * @return the quantization file-type label, or {@code ""} if absent
+     */
+    public String getModelFtype() {
+        return getModelMeta().getFtype();
+    }
+
     native String getModelMetaJson();
 
     /**

llama/src/main/java/net/ladenthin/llama/server/OpenAiCompatServer.java

Lines changed: 2 additions & 2 deletions
@@ -748,7 +748,7 @@ private void handleModels(HttpExchange exchange) throws IOException {
                 sendError(exchange, HTTP_UNAUTHORIZED, ERROR_TYPE_REQUEST, "Missing or invalid API key");
                 return;
             }
-            sendJson(exchange, HTTP_OK, OpenAiSseFormatter.modelsJson(config.getModelId()));
+            sendJson(exchange, HTTP_OK, OpenAiSseFormatter.modelsJson(config.getModelId(), config.getModelFtype()));
         } finally {
             exchange.close();
         }
@@ -1064,7 +1064,7 @@ public static void main(String[] args) throws IOException {
                         "jllama-openai-shutdown"));
 
         try (LlamaModel model = new LlamaModel(options.toModelParameters())) {
-            OpenAiServerConfig config = options.toServerConfig(model.supportsVision());
+            OpenAiServerConfig config = options.toServerConfig(model.supportsVision(), model.getModelFtype());
             try (OpenAiCompatServer server = new OpenAiCompatServer(model, config)) {
                 server.start();
                 printReady(config, server.getPort());

llama/src/main/java/net/ladenthin/llama/server/OpenAiServerCli.java

Lines changed: 16 additions & 2 deletions
@@ -620,7 +620,7 @@ public ModelParameters toModelParameters() {
      * @return the server configuration
      */
     public OpenAiServerConfig toServerConfig() {
-        return toServerConfig(mmproj != null);
+        return toServerConfig(mmproj != null, "");
     }
 
     /**
@@ -632,11 +632,25 @@ public OpenAiServerConfig toServerConfig() {
      * @return the server configuration
      */
     public OpenAiServerConfig toServerConfig(boolean supportsVision) {
+        return toServerConfig(supportsVision, "");
+    }
+
+    /**
+     * Build the server configuration with capability + metadata values obtained from the loaded
+     * model. This overload lets the standalone launcher advertise the model's quantization file
+     * type in {@code /v1/models} alongside the vision capability.
+     *
+     * @param supportsVision whether the loaded model reports usable vision input
+     * @param modelFtype the model's file-type (quantization) label, or {@code ""} if unknown
+     * @return the server configuration
+     */
+    public OpenAiServerConfig toServerConfig(boolean supportsVision, String modelFtype) {
         final OpenAiServerConfig.Builder builder = OpenAiServerConfig.builder()
                 .host(host)
                 .port(port)
                 .modelId(getModelId())
-                .supportsVision(supportsVision);
+                .supportsVision(supportsVision)
+                .modelFtype(modelFtype);
         if (apiKey != null) {
             builder.apiKey(apiKey);
         }
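The overload chain above can be condensed into a small stand-alone sketch that shows how the no-argument and one-argument forms fall back to the empty "unknown" label while the two-argument form carries the value read from the loaded model. `Options` and `Config` here are simplified stand-ins written for this illustration, not the project's `OpenAiServerCli` and `OpenAiServerConfig` classes, and the null handling in `Config` only mirrors what the builder in this commit does.

```java
/** Simplified stand-ins (hypothetical) for the CLI options and server config classes. */
final class ThreadingSketch {

    /** Immutable config; a null label is normalized to the empty "unknown" marker. */
    static final class Config {
        final boolean supportsVision;
        final String modelFtype;

        Config(boolean supportsVision, String modelFtype) {
            this.supportsVision = supportsVision;
            this.modelFtype = modelFtype == null ? "" : modelFtype;
        }
    }

    /** Mirrors the three toServerConfig overloads: each shorter form delegates to the longest. */
    static final class Options {
        final String mmproj; // null when no projector file is configured

        Options(String mmproj) {
            this.mmproj = mmproj;
        }

        Config toServerConfig() {
            return toServerConfig(mmproj != null, "");
        }

        Config toServerConfig(boolean supportsVision) {
            return toServerConfig(supportsVision, "");
        }

        Config toServerConfig(boolean supportsVision, String modelFtype) {
            return new Config(supportsVision, modelFtype);
        }
    }
}
```

Because the label lives in the config object, a request handler can answer from configuration alone and never needs to query the model, which is the invariant the commit message describes.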

llama/src/main/java/net/ladenthin/llama/server/OpenAiServerConfig.java

Lines changed: 26 additions & 0 deletions
@@ -55,6 +55,7 @@ public final class OpenAiServerConfig {
     private final String corsAllowOrigin;
     private final boolean supportsVision;
     private final int maxRequestBodyBytes;
+    private final String modelFtype;
 
     private OpenAiServerConfig(Builder builder) {
         this.host = builder.host;
@@ -67,6 +68,7 @@ private OpenAiServerConfig(Builder builder) {
         this.corsAllowOrigin = builder.corsAllowOrigin;
         this.supportsVision = builder.supportsVision;
         this.maxRequestBodyBytes = builder.maxRequestBodyBytes;
+        this.modelFtype = builder.modelFtype;
     }
 
     /**
@@ -169,6 +171,17 @@ public boolean isSupportsVision() {
         return supportsVision;
     }
 
+    /**
+     * The served model's file type (quantization) as a human-readable string, e.g. {@code "Q8_0"}
+     * or {@code "Q4_K - Medium"}, advertised in the {@code GET /v1/models} {@code data[].ftype} field
+     * (matching the upstream llama.cpp server). Empty when unknown.
+     *
+     * @return the quantization file-type label, or {@code ""} if unknown
+     */
+    public String getModelFtype() {
+        return modelFtype;
+    }
+
     /**
      * Whether bearer-token authentication is enabled (an API key is configured).
      *
@@ -217,6 +230,7 @@ public static final class Builder {
         private String corsAllowOrigin = DEFAULT_CORS_ALLOW_ORIGIN;
         private boolean supportsVision;
         private int maxRequestBodyBytes = DEFAULT_MAX_REQUEST_BODY_BYTES;
+        private String modelFtype = "";
 
         private Builder() {}
 
@@ -319,6 +333,18 @@ public Builder supportsVision(boolean supportsVision) {
             return this;
         }
 
+        /**
+         * Sets the served model's file type (quantization) label to advertise in {@code /v1/models}.
+         *
+         * @param modelFtype the quantization file-type label (e.g. {@code "Q4_K - Medium"}); {@code null}
+         *     is treated as empty (unknown)
+         * @return this builder
+         */
+        public Builder modelFtype(@Nullable String modelFtype) {
+            this.modelFtype = modelFtype == null ? "" : modelFtype;
+            return this;
+        }
+
         /**
          * Sets the maximum accepted request-body size in bytes. Bodies larger than this are rejected
          * with HTTP 413 before being buffered.

llama/src/main/java/net/ladenthin/llama/server/OpenAiSseFormatter.java

Lines changed: 16 additions & 0 deletions
@@ -122,10 +122,26 @@ static String ensureUsageCachedTokens(String chunkJson) {
      * @return an OpenAI model-list object serialized as JSON
      */
     static String modelsJson(String modelId) {
+        return modelsJson(modelId, "");
+    }
+
+    /**
+     * Build the {@code GET /v1/models} body advertising a single model, including the model's file
+     * type (quantization) as a {@code data[].ftype} field when known — mirroring the upstream
+     * llama.cpp server's {@code get_model_info()}.
+     *
+     * @param modelId the model id to advertise
+     * @param ftype the model's file-type (quantization) label, or {@code ""}/{@code null} to omit it
+     * @return an OpenAI model-list object serialized as JSON
+     */
+    static String modelsJson(String modelId, @Nullable String ftype) {
         ObjectNode model = OBJECT_MAPPER.createObjectNode();
         model.put("id", modelId);
         model.put("object", "model");
         model.put("owned_by", "llama.cpp");
+        if (ftype != null && !ftype.isEmpty()) {
+            model.put("ftype", ftype);
+        }
         ArrayNode data = OBJECT_MAPPER.createArrayNode();
         data.add(model);
         ObjectNode root = OBJECT_MAPPER.createObjectNode();
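To make the resulting response shape concrete, here is a dependency-free sketch of the same "omit when unknown" rule. It is a stand-in, not the project's method: the real code builds the body with Jackson, and the hunk above is cut off before the root object is filled, so the root key order used here is an assumption (the tests only confirm that the root `object` is `"list"` and that `data[0]` holds the model). `ModelsJsonSketch` and `quote` are hypothetical names.

```java
/** Hypothetical, dependency-free stand-in for the Jackson-based modelsJson(modelId, ftype). */
final class ModelsJsonSketch {

    /** Builds a one-model list body; "ftype" is written only for a non-empty label. */
    static String modelsJson(String modelId, String ftype) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"object\":\"list\",\"data\":[{"); // root key order is assumed
        sb.append("\"id\":").append(quote(modelId));
        sb.append(",\"object\":\"model\"");
        sb.append(",\"owned_by\":\"llama.cpp\"");
        if (ftype != null && !ftype.isEmpty()) {
            sb.append(",\"ftype\":").append(quote(ftype));
        }
        sb.append("}]}");
        return sb.toString();
    }

    /** Minimal JSON string quoting: escapes quotes, backslashes and control characters. */
    static String quote(String s) {
        StringBuilder out = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"' || c == '\\') {
                out.append('\\').append(c);
            } else if (c < 0x20) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.append('"').toString();
    }
}
```

Omitting the key, rather than sending an empty string, lets a client treat the presence of `ftype` as the signal that the quantization is known.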

llama/src/main/java/net/ladenthin/llama/value/ModelMeta.java

Lines changed: 12 additions & 0 deletions
@@ -129,6 +129,18 @@ public String getModelName() {
         return node.path("name").asText("");
     }
 
+    /**
+     * The model file type (quantization) as a human-readable string, e.g. {@code "Q8_0"} or
+     * {@code "Q4_K - Medium"}, from the GGUF {@code general.file_type} the model was loaded with
+     * (llama.cpp {@code llama_ftype_name}). A guessed type is prefixed with {@code "(guessed) "}.
+     * Returns an empty string if the native layer does not report it (older native builds).
+     *
+     * @return the quantization file-type label, or {@code ""} if absent
+     */
+    public String getFtype() {
+        return node.path("ftype").asText("");
+    }
+
     /**
      * The model's resolved default chat template (Jinja), from GGUF
      * {@code tokenizer.chat_template} metadata.

llama/src/test/java/net/ladenthin/llama/server/OpenAiServerConfigTest.java

Lines changed: 9 additions & 0 deletions
@@ -29,10 +29,19 @@ public void builderAppliesLocalhostDefaults() {
         assertThat(config.getHeartbeatMillis(), is(OpenAiServerConfig.DEFAULT_HEARTBEAT_MILLIS));
         assertThat(config.getCorsAllowOrigin(), is(OpenAiServerConfig.DEFAULT_CORS_ALLOW_ORIGIN));
         assertThat(config.isSupportsVision(), is(false));
+        assertThat(config.getModelFtype(), is(""));
         assertThat(config.getApiKey(), is((String) null));
         assertThat(config.isAuthenticationEnabled(), is(false));
     }
 
+    @Test
+    public void modelFtypeIsConfigurableAndNullBecomesEmpty() {
+        assertThat(
+                OpenAiServerConfig.builder().modelFtype("Q4_K - Medium").build().getModelFtype(), is("Q4_K - Medium"));
+        // null is normalized to the empty "unknown" marker
+        assertThat(OpenAiServerConfig.builder().modelFtype(null).build().getModelFtype(), is(""));
+    }
+
     @Test
     public void authenticationEnabledOnlyForNonEmptyKey() {
         assertThat(OpenAiServerConfig.builder().build().isAuthenticationEnabled(), is(false));

llama/src/test/java/net/ladenthin/llama/server/OpenAiSseFormatterTest.java

Lines changed: 22 additions & 0 deletions
@@ -103,6 +103,28 @@ public void modelsJsonAdvertisesTheConfiguredModel() throws IOException {
         assertThat(root.path("object").asText(), is("list"));
         assertThat(root.path("data").get(0).path("id").asText(), is("gemma-local"));
         assertThat(root.path("data").get(0).path("object").asText(), is("model"));
+        // no ftype supplied -> the field is omitted entirely
+        assertThat(root.path("data").get(0).has("ftype"), is(false));
+    }
+
+    @Test
+    public void modelsJsonIncludesFtypeWhenKnownAndOmitsWhenBlank() throws IOException {
+        JsonNode withFtype = MAPPER.readTree(OpenAiSseFormatter.modelsJson("gemma-local", "Q4_K - Medium"));
+        assertThat(withFtype.path("data").get(0).path("ftype").asText(), is("Q4_K - Medium"));
+
+        // empty and null are treated as "unknown" -> field omitted
+        assertThat(
+                MAPPER.readTree(OpenAiSseFormatter.modelsJson("gemma-local", ""))
+                        .path("data")
+                        .get(0)
+                        .has("ftype"),
+                is(false));
+        assertThat(
+                MAPPER.readTree(OpenAiSseFormatter.modelsJson("gemma-local", null))
+                        .path("data")
+                        .get(0)
+                        .has("ftype"),
+                is(false));
     }
 
     @Test