Surface model ftype (quantization) through the Java layer and /v1/models

claude · claude · commit e8abfc1178da · 2026-07-03T10:54:18.000Z
Wires the new b9862 llama_ftype_name / llama_model_ftype quant-type info
(exposed by server_context_meta::model_ftype) up through JNI to Java:

- jllama.cpp getModelMetaJson now emits "ftype" from server_context_meta.
- ModelMeta.getFtype() and the convenience LlamaModel.getModelFtype() expose
  the quant label (e.g. "Q4_K - Medium"; a guessed type is prefixed with
  "(guessed) "), empty when the native layer does not report it.
- OpenAiCompatServer advertises it as data[].ftype in GET /v1/models,
  matching the upstream server's get_model_info() key. The value is threaded
  through OpenAiServerConfig.modelFtype (new field/builder/getter) from the
  loaded model, mirroring how supportsVision is threaded, which keeps the
  "models built from config alone" invariant. The field is omitted when
  unknown/blank.

Tests: +2 ModelMeta, +1 OpenAiServerConfig, +1 OpenAiSseFormatter (ftype
present/omitted).

Verified end-to-end: full native jllama build against b9862 with all six
patches applied (100% link), model-free load smoke test green, and 63
model-free Java unit tests pass; clang-format 22.1.5 + spotless + javadoc all
clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
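
The omit-when-blank rule described above can be illustrated with a small, dependency-free sketch. It is not the project's code: the real formatter builds the body with Jackson `ObjectNode`s, whereas this stand-in concatenates strings so it runs on a bare JDK. The names `FtypeSketch`, `normalizeFtype` and `quote` are hypothetical, and the key order of the root object is an assumption.

```java
// Illustrative stand-in for the /v1/models body; the real code uses Jackson.
final class FtypeSketch {
    private FtypeSketch() {}

    // Same rule as the config builder in this commit: null means "unknown" and becomes "".
    static String normalizeFtype(String ftype) {
        return ftype == null ? "" : ftype;
    }

    // Minimal JSON string quoting (backslashes and double quotes only); enough for this sketch.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Builds a single-model list and adds "ftype" only when a non-empty label is known.
    static String modelsJson(String modelId, String ftype) {
        StringBuilder model = new StringBuilder();
        model.append("{\"id\":").append(quote(modelId));
        model.append(",\"object\":\"model\"");
        model.append(",\"owned_by\":\"llama.cpp\"");
        String label = normalizeFtype(ftype);
        if (!label.isEmpty()) {
            model.append(",\"ftype\":").append(quote(label));
        }
        model.append("}");
        return "{\"object\":\"list\",\"data\":[" + model + "]}";
    }

    public static void main(String[] args) {
        System.out.println(modelsJson("gemma-local", "Q4_K - Medium")); // ftype present
        System.out.println(modelsJson("gemma-local", ""));              // ftype omitted
        System.out.println(modelsJson("gemma-local", null));            // ftype omitted
    }
}
```

Omitting the key, rather than emitting `"ftype": ""`, lets a client tell "unknown" apart from a real label with a simple presence check, which is the behaviour the new `OpenAiSseFormatter` tests assert.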
diff --git a/docs/history/llama-cpp-breaking-changes.md b/docs/history/llama-cpp-breaking-changes.md
@@ -412,6 +412,6 @@ Used during `llama.cpp` version bumps: when upgrading, scan this file from the r
 | b9842–b9859 | `common/arg.cpp` + `common/http.h` + `tools/server/server-{http,models}.cpp` + `tools/server/server-cors-proxy.h` | **IPv6 URL handling + hf-split primary fix**, all inside upstream-compiled TUs the project already builds. (1) `common/http.h` gains a `common_http_format_host()` helper that brackets an IPv6 literal host (`[::1]`) per RFC 3986, and `common_http_parse_url` now splits the authority so a bracketed IPv6 literal keeps its inner colons; `server-http.cpp` (listening-address string), `server-models.cpp` (proxy `Host` header) and `server-cors-proxy.h` (proxy log) each `#include "http.h"` and route the host through it. `server-http.cpp`/`server-models.cpp`/`server-cors-proxy.h` are already compiled into `jllama`; the project binds none of these symbols and passes host/port as plain params, so behaviour is unchanged for localhost binds. (2) `common/arg.cpp` `common_models_handler_apply` now threads a `primary` hf-split file (the `00001-of` part) through the `add_tasks` lambda instead of assuming index 0 — internal to the `--hf`/`--hf-repo-v`/`--spec-draft-hf` download planner, which the project never calls (`grep -rn "common_models_handler\|common_http_format_host" src/main/cpp src/test/cpp` → zero matches). No project source changes required. |
 | b9842–b9859 | `ggml/src/ggml-cpu/` + `ggml/src/ggml-cuda/` + `ggml/src/ggml-opencl/` + `ggml/src/ggml-vulkan/` + `ggml/src/ggml-webgpu/` + `ggml/src/ggml-hexagon/` + `ggml/src/ggml-backend.cpp` + `src/models/qwen3next.cpp` + `tools/ui/**` | Backend-internal only, no API surface visible to `jllama.cpp`. CPU adds an AVX2/AVX `ggml_vec_dot_nvfp4_q8_0` + a UE4M3 lookup table (`kvalues_mxfp4` renamed to shared `kvalues_fp4`); CUDA adds head-dim-512 flash-attention MMA/tile instances, a strided `get_rows_back` grid-clamp fix (new `test-backend-ops` case for row count > 65535), a gfx900 MMQ gate, and drops the CPU→CUDA async-copy path (scheduler now copies inputs synchronously); OpenCL adds full Q1_0 mul_mat/mul_mv + a `GGML_OPENCL_USE_ADRENO_BIN_KERNELS` prebuilt-binary-kernel loader (OFF by default; affects only the `opencl-*` classifiers); Vulkan rolls the mul_mm BK loop on Asahi/Honeykrisp; WebGPU adds NVFP4 support; Hexagon reworks HVX/HMX flash-attention (new `flash-attn-ops.h`/`hmx-fa-kernels.h`, MUL_MAT_ADD fusion). `qwen3next.cpp` records `t_layer_inp[il]` for MTP. All internal to upstream-compiled `libllama`/`ggml`/backends; the WebUI **auto-follows** the pinned `GIT_TAG` (the `build-webui` CI job rebuilds it), so its edits (PWA navigate-fallback, chat-store foreign-conversation guards) need no manual step. No project source changes required. |
 | b9842–b9859 | upstream verification (sandbox) | All four patches (`0001`–`0004`) re-verified to **apply cleanly** against b9859 via `git apply --check` over the actual b9859 sources fetched from `raw.githubusercontent.com` (github.com git-clone is blocked in this sandbox, so a full `FetchContent` build could not run — exit 0 for `common/arg.{cpp,h}`, `tests/test-arg-parser.cpp`, `tools/server/server-context.{cpp,h}`, `server-common.cpp`, `tests/test-chat.cpp`). The only patch-target file that changed in this range is `common/arg.cpp`, whose b9859 edit is in `common_models_handler_apply` (~L496) — disjoint from patch 0001's `make_utf8_argv`/`common_params_parse` hunks (~L931/L971) and the ~34 standalone-main flips (unchanged in this range), so patch 0001 still applies. Patches 0002/0003/0004 target files untouched in b9842→b9859, so their hunks are byte-identical to b9842. OuteTTS generator anchors hold (`tools/tts/tts.cpp` unchanged). Full build + `ctest` (target 459/459) to be confirmed by the CI pipeline. |
-| b9859–b9862 | `include/llama.h` + `src/llama-model-loader.cpp` + `src/llama-model.{cpp,h}` + `tools/server/server-context.{cpp,h}` + `tools/cli/cli.cpp` | **New feature (additive C API), no break.** Upstream promoted the previously-`static` `llama_model_ftype_name(llama_ftype)` (in `llama-model-loader.cpp`) to a **public** `LLAMA_API const char * llama_ftype_name(enum llama_ftype)` and added `LLAMA_API enum llama_ftype llama_model_ftype(const llama_model *)` (backed by a new `llama_model::ftype()` / `impl::ftype` cached from `ml.ftype` at `load_hparams`). `server_context::get_meta()` now fills a **new `std::string model_ftype`** field on `server_context_meta` (`server-context.h`) and `server_routes::get_model_info()` emits a `"ftype"` key — so the **NativeServer** mode's model-info/`/props` surface gains the quant type automatically (WebUI + `llama-server` clients). `cli.cpp` prints an `ftype :` line. **All inside upstream-compiled `libllama`/server TUs the project already links** — the project binds none of the new symbols (`grep` → only a *comment* mentions `server_context_meta` in `jllama.cpp`; nothing constructs it, and adding a trailing field is source-additive). No project source changes required. Optional future work: bind `llama_model_ftype`/`llama_ftype_name` into `LlamaModel` + surface `ftype` in the Java `OpenAiCompatServer` `propsJson` (the Java `/props` is a hand-written reimpl and does **not** inherit the upstream `get_model_info` field). |
+| b9859–b9862 | `include/llama.h` + `src/llama-model-loader.cpp` + `src/llama-model.{cpp,h}` + `tools/server/server-context.{cpp,h}` + `tools/cli/cli.cpp` | **New feature (additive C API), no break.** Upstream promoted the previously-`static` `llama_model_ftype_name(llama_ftype)` (in `llama-model-loader.cpp`) to a **public** `LLAMA_API const char * llama_ftype_name(enum llama_ftype)` and added `LLAMA_API enum llama_ftype llama_model_ftype(const llama_model *)` (backed by a new `llama_model::ftype()` / `impl::ftype` cached from `ml.ftype` at `load_hparams`). `server_context::get_meta()` now fills a **new `std::string model_ftype`** field on `server_context_meta` (`server-context.h`) and `server_routes::get_model_info()` emits a `"ftype"` key — so the **NativeServer** mode's model-info/`/props` surface gains the quant type automatically (WebUI + `llama-server` clients). `cli.cpp` prints an `ftype :` line. **All inside upstream-compiled `libllama`/server TUs the project already links** — the project binds none of the new symbols (`grep` → only a *comment* mentions `server_context_meta` in `jllama.cpp`; nothing constructs it, and adding a trailing field is source-additive). No project source changes required for the bump itself. **Follow-up (done):** the quant type is now also surfaced through the Java layer — `getModelMetaJson` emits `"ftype"` (from `server_context_meta::model_ftype`), `ModelMeta.getFtype()` / `LlamaModel.getModelFtype()` expose it, and the Java `OpenAiCompatServer` advertises it as `data[].ftype` in `GET /v1/models` (threaded through `OpenAiServerConfig.modelFtype`, mirroring how `supportsVision` is threaded), matching the upstream `get_model_info()` key. |
 | b9859–b9862 | `ggml/src/ggml-cuda/gated_delta_net.{cu,cuh}` + `ggml/src/ggml-cuda/ggml-cuda.cu` + `vendor/cpp-httplib/httplib.{cpp,h}` (v0.48.0→v0.49.0) | Backend/vendor-internal only, no API surface visible to `jllama.cpp`. (1) **CUDA gated-delta-net perf**: a fused `gated_delta_net → cpy` path (`ggml_cuda_op_gated_delta_net_fused_cache` + `ggml_cuda_try_gdn_cache_fusion`) lets the kernel scatter recurrent-state snapshots straight into the rollback cache and skip the follow-up strided copy (a decode win for gated-delta / hybrid-recurrent models, e.g. Qwen3-Next); plus a `ggml_cuda_is_view_or_noop` refactor. Affects only the `cuda13-*` classifiers. (2) **cpp-httplib bumped to v0.49.0** (the vendored copy inside llama.cpp, compiled into `jllama` via `server-http.cpp`): locale-independent ASCII classifiers (`is_ascii_digit/alpha/alnum` replacing `std::isdigit`/`isalnum`), a new additive `MultipartFormDataWriter` + `is_valid_multipart_boundary`, multipart field-name/filename escaping (WHATWG), an unsigned base64 accumulator (UB fix), a `ThreadPool` `idle_timeout_sec` ctor param (defaulted — backward-compatible), a `perform_websocket_handshake` `is_ssl` arg (internal), and a `path_encode_`-gated query-normalization skip. All internal to the compiled TU; the project binds no httplib symbol directly (it uses the upstream `server-http.cpp` transport). No project source changes required. |
 | b9859–b9862 | upstream verification (sandbox) | All **six** patches (`0001`–`0006`) re-verified against b9862. The b9859→b9862 diff touches only two patch-target files — `tools/server/server-context.cpp` and `server-context.h` (the `model_ftype`/`get_meta`/`get_model_info` additions at ~L3989/~L5121 and the new struct field at ~L50). Patches **0002** (load-progress guard, ~L1152), **0003** (slot-prompt-similarity getter/setter, ~L3965 + `server_context` struct ~L106) and **0005** (near-prompt-end checkpoints, `update_slots` ~L3560) were **applied in sequence** against the actual b9862 `server-context.{cpp,h}` fetched from `raw.githubusercontent.com` — all three applied cleanly (their regions are disjoint from and far from the b9862 additions). Patches **0001** (`common/arg.{cpp,h}`, `test-arg-parser.cpp`, ~34 standalone mains), **0004** (`server-common.cpp`, `test-chat.cpp`) and **0006** (`server.cpp`) target files **not present** in the b9859→b9862 changed-file list, so their hunks are byte-identical to b9859 and apply unchanged. OuteTTS generator anchors hold (`tools/tts/tts.cpp` unchanged in this range). Full build + `ctest` (target 459/459) to be confirmed by the CI pipeline. |
diff --git a/llama/src/main/cpp/jllama.cpp b/llama/src/main/cpp/jllama.cpp
@@ -802,6 +802,7 @@ JNIEXPORT jstring JNICALL Java_net_ladenthin_llama_LlamaModel_getModelMetaJson(J
         {"modalities", {{"vision", m.has_inp_image}, {"audio", m.has_inp_audio}}},
         {"name", m.model_name},
         {"architecture", std::string(arch_buf)},
+        {"ftype", m.model_ftype},
     };
     // Resolved default chat template (Jinja); empty when the model ships none.
     const char *chat_tmpl = mdl != nullptr ? llama_model_chat_template(mdl, /*name*/ nullptr) : nullptr;
diff --git a/llama/src/main/java/net/ladenthin/llama/LlamaModel.java b/llama/src/main/java/net/ladenthin/llama/LlamaModel.java
@@ -856,6 +856,18 @@ public boolean supportsAudio() {
         return getModelMeta().supportsAudio();
     }
 
+    /**
+     * Returns the loaded model's file type (quantization) as a human-readable string, e.g.
+     * {@code "Q8_0"} or {@code "Q4_K - Medium"} (llama.cpp {@code llama_ftype_name}); a guessed
+     * type is prefixed with {@code "(guessed) "}. Returns an empty string when the native layer does
+     * not report it.
+     *
+     * @return the quantization file-type label, or {@code ""} if absent
+     */
+    public String getModelFtype() {
+        return getModelMeta().getFtype();
+    }
+
     native String getModelMetaJson();
 
     /**
diff --git a/llama/src/main/java/net/ladenthin/llama/server/OpenAiCompatServer.java b/llama/src/main/java/net/ladenthin/llama/server/OpenAiCompatServer.java
@@ -748,7 +748,7 @@ private void handleModels(HttpExchange exchange) throws IOException {
                 sendError(exchange, HTTP_UNAUTHORIZED, ERROR_TYPE_REQUEST, "Missing or invalid API key");
                 return;
             }
-            sendJson(exchange, HTTP_OK, OpenAiSseFormatter.modelsJson(config.getModelId()));
+            sendJson(exchange, HTTP_OK, OpenAiSseFormatter.modelsJson(config.getModelId(), config.getModelFtype()));
         } finally {
             exchange.close();
         }
@@ -1064,7 +1064,7 @@ public static void main(String[] args) throws IOException {
                         "jllama-openai-shutdown"));
 
         try (LlamaModel model = new LlamaModel(options.toModelParameters())) {
-            OpenAiServerConfig config = options.toServerConfig(model.supportsVision());
+            OpenAiServerConfig config = options.toServerConfig(model.supportsVision(), model.getModelFtype());
             try (OpenAiCompatServer server = new OpenAiCompatServer(model, config)) {
                 server.start();
                 printReady(config, server.getPort());
diff --git a/llama/src/main/java/net/ladenthin/llama/server/OpenAiServerCli.java b/llama/src/main/java/net/ladenthin/llama/server/OpenAiServerCli.java
@@ -620,7 +620,7 @@ public ModelParameters toModelParameters() {
          * @return the server configuration
          */
         public OpenAiServerConfig toServerConfig() {
-            return toServerConfig(mmproj != null);
+            return toServerConfig(mmproj != null, "");
         }
 
         /**
@@ -632,11 +632,25 @@ public OpenAiServerConfig toServerConfig() {
          * @return the server configuration
          */
         public OpenAiServerConfig toServerConfig(boolean supportsVision) {
+            return toServerConfig(supportsVision, "");
+        }
+
+        /**
+         * Build the server configuration with capability + metadata values obtained from the loaded
+         * model. This overload lets the standalone launcher advertise the model's quantization file
+         * type in {@code /v1/models} alongside the vision capability.
+         *
+         * @param supportsVision whether the loaded model reports usable vision input
+         * @param modelFtype the model's file-type (quantization) label, or {@code ""} if unknown
+         * @return the server configuration
+         */
+        public OpenAiServerConfig toServerConfig(boolean supportsVision, String modelFtype) {
             final OpenAiServerConfig.Builder builder = OpenAiServerConfig.builder()
                     .host(host)
                     .port(port)
                     .modelId(getModelId())
-                    .supportsVision(supportsVision);
+                    .supportsVision(supportsVision)
+                    .modelFtype(modelFtype);
             if (apiKey != null) {
                 builder.apiKey(apiKey);
             }
diff --git a/llama/src/main/java/net/ladenthin/llama/server/OpenAiServerConfig.java b/llama/src/main/java/net/ladenthin/llama/server/OpenAiServerConfig.java
@@ -55,6 +55,7 @@ public final class OpenAiServerConfig {
     private final String corsAllowOrigin;
     private final boolean supportsVision;
     private final int maxRequestBodyBytes;
+    private final String modelFtype;
 
     private OpenAiServerConfig(Builder builder) {
         this.host = builder.host;
@@ -67,6 +68,7 @@ private OpenAiServerConfig(Builder builder) {
         this.corsAllowOrigin = builder.corsAllowOrigin;
         this.supportsVision = builder.supportsVision;
         this.maxRequestBodyBytes = builder.maxRequestBodyBytes;
+        this.modelFtype = builder.modelFtype;
     }
 
     /**
@@ -169,6 +171,17 @@ public boolean isSupportsVision() {
         return supportsVision;
     }
 
+    /**
+     * The served model's file type (quantization) as a human-readable string, e.g. {@code "Q8_0"}
+     * or {@code "Q4_K - Medium"}, advertised in the {@code GET /v1/models} {@code data[].ftype} field
+     * (matching the upstream llama.cpp server). Empty when unknown.
+     *
+     * @return the quantization file-type label, or {@code ""} if unknown
+     */
+    public String getModelFtype() {
+        return modelFtype;
+    }
+
     /**
      * Whether bearer-token authentication is enabled (an API key is configured).
      *
@@ -217,6 +230,7 @@ public static final class Builder {
         private String corsAllowOrigin = DEFAULT_CORS_ALLOW_ORIGIN;
         private boolean supportsVision;
         private int maxRequestBodyBytes = DEFAULT_MAX_REQUEST_BODY_BYTES;
+        private String modelFtype = "";
 
         private Builder() {}
 
@@ -319,6 +333,18 @@ public Builder supportsVision(boolean supportsVision) {
             return this;
         }
 
+        /**
+         * Sets the served model's file type (quantization) label to advertise in {@code /v1/models}.
+         *
+         * @param modelFtype the quantization file-type label (e.g. {@code "Q4_K - Medium"}); {@code null}
+         *     is treated as empty (unknown)
+         * @return this builder
+         */
+        public Builder modelFtype(@Nullable String modelFtype) {
+            this.modelFtype = modelFtype == null ? "" : modelFtype;
+            return this;
+        }
+
         /**
          * Sets the maximum accepted request-body size in bytes. Bodies larger than this are rejected
          * with HTTP 413 before being buffered.
diff --git a/llama/src/main/java/net/ladenthin/llama/server/OpenAiSseFormatter.java b/llama/src/main/java/net/ladenthin/llama/server/OpenAiSseFormatter.java
@@ -122,10 +122,26 @@ static String ensureUsageCachedTokens(String chunkJson) {
      * @return an OpenAI model-list object serialized as JSON
      */
     static String modelsJson(String modelId) {
+        return modelsJson(modelId, "");
+    }
+
+    /**
+     * Build the {@code GET /v1/models} body advertising a single model, including the model's file
+     * type (quantization) as a {@code data[].ftype} field when known — mirroring the upstream
+     * llama.cpp server's {@code get_model_info()}.
+     *
+     * @param modelId the model id to advertise
+     * @param ftype the model's file-type (quantization) label, or {@code ""}/{@code null} to omit it
+     * @return an OpenAI model-list object serialized as JSON
+     */
+    static String modelsJson(String modelId, @Nullable String ftype) {
         ObjectNode model = OBJECT_MAPPER.createObjectNode();
         model.put("id", modelId);
         model.put("object", "model");
         model.put("owned_by", "llama.cpp");
+        if (ftype != null && !ftype.isEmpty()) {
+            model.put("ftype", ftype);
+        }
         ArrayNode data = OBJECT_MAPPER.createArrayNode();
         data.add(model);
         ObjectNode root = OBJECT_MAPPER.createObjectNode();
diff --git a/llama/src/main/java/net/ladenthin/llama/value/ModelMeta.java b/llama/src/main/java/net/ladenthin/llama/value/ModelMeta.java
@@ -129,6 +129,18 @@ public String getModelName() {
         return node.path("name").asText("");
     }
 
+    /**
+     * The model file type (quantization) as a human-readable string, e.g. {@code "Q8_0"} or
+     * {@code "Q4_K - Medium"}, from the GGUF {@code general.file_type} the model was loaded with
+     * (llama.cpp {@code llama_ftype_name}). A guessed type is prefixed with {@code "(guessed) "}.
+     * Returns an empty string if the native layer does not report it (older native builds).
+     *
+     * @return the quantization file-type label, or {@code ""} if absent
+     */
+    public String getFtype() {
+        return node.path("ftype").asText("");
+    }
+
     /**
      * The model's resolved default chat template (Jinja), from GGUF
      * {@code tokenizer.chat_template} metadata.
diff --git a/llama/src/test/java/net/ladenthin/llama/server/OpenAiServerConfigTest.java b/llama/src/test/java/net/ladenthin/llama/server/OpenAiServerConfigTest.java
@@ -29,10 +29,19 @@ public void builderAppliesLocalhostDefaults() {
         assertThat(config.getHeartbeatMillis(), is(OpenAiServerConfig.DEFAULT_HEARTBEAT_MILLIS));
         assertThat(config.getCorsAllowOrigin(), is(OpenAiServerConfig.DEFAULT_CORS_ALLOW_ORIGIN));
         assertThat(config.isSupportsVision(), is(false));
+        assertThat(config.getModelFtype(), is(""));
         assertThat(config.getApiKey(), is((String) null));
         assertThat(config.isAuthenticationEnabled(), is(false));
     }
 
+    @Test
+    public void modelFtypeIsConfigurableAndNullBecomesEmpty() {
+        assertThat(
+                OpenAiServerConfig.builder().modelFtype("Q4_K - Medium").build().getModelFtype(), is("Q4_K - Medium"));
+        // null is normalized to the empty "unknown" marker
+        assertThat(OpenAiServerConfig.builder().modelFtype(null).build().getModelFtype(), is(""));
+    }
+
     @Test
     public void authenticationEnabledOnlyForNonEmptyKey() {
         assertThat(OpenAiServerConfig.builder().build().isAuthenticationEnabled(), is(false));
diff --git a/llama/src/test/java/net/ladenthin/llama/server/OpenAiSseFormatterTest.java b/llama/src/test/java/net/ladenthin/llama/server/OpenAiSseFormatterTest.java
@@ -103,6 +103,28 @@ public void modelsJsonAdvertisesTheConfiguredModel() throws IOException {
         assertThat(root.path("object").asText(), is("list"));
         assertThat(root.path("data").get(0).path("id").asText(), is("gemma-local"));
         assertThat(root.path("data").get(0).path("object").asText(), is("model"));
+        // no ftype supplied -> the field is omitted entirely
+        assertThat(root.path("data").get(0).has("ftype"), is(false));
+    }
+
+    @Test
+    public void modelsJsonIncludesFtypeWhenKnownAndOmitsWhenBlank() throws IOException {
+        JsonNode withFtype = MAPPER.readTree(OpenAiSseFormatter.modelsJson("gemma-local", "Q4_K - Medium"));
+        assertThat(withFtype.path("data").get(0).path("ftype").asText(), is("Q4_K - Medium"));
+
+        // empty and null are treated as "unknown" -> field omitted
+        assertThat(
+                MAPPER.readTree(OpenAiSseFormatter.modelsJson("gemma-local", ""))
+                        .path("data")
+                        .get(0)
+                        .has("ftype"),
+                is(false));
+        assertThat(
+                MAPPER.readTree(OpenAiSseFormatter.modelsJson("gemma-local", null))
+                        .path("data")
+                        .get(0)
+                        .has("ftype"),
+                is(false));
     }
 
     @Test
diff --git a/llama/src/test/java/net/ladenthin/llama/value/ModelMetaTest.java b/llama/src/test/java/net/ladenthin/llama/value/ModelMetaTest.java