bernardladenthin · Jun 18, 2026
@@ -0,0 +1,34 @@
+# SPDX-FileCopyrightText: 2026 Bernard Ladenthin <bernard.ladenthin@gmail.com>
+#
+# SPDX-License-Identifier: MIT
+
+name: clang-format
+on:
+  push:
+  pull_request:
+  workflow_dispatch:
+
+# Enforces a single, pinned clang-format across all C++ sources so formatting is
+# reproducible between contributors and CI. Bump CLANG_FORMAT_VERSION here and in
+# CLAUDE.md (Code Formatting) together, then reformat the tree with the same version.
+env:
+  CLANG_FORMAT_VERSION: "22.1.5"
+
+jobs:
+  clang-format:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v6
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.x"
+      - name: Install pinned clang-format
+        run: pip install "clang-format==${CLANG_FORMAT_VERSION}"
+      - name: Check C++ formatting
+        run: |
+          clang-format --version
+          # All hand-written C++ sources; the generated JNI header (src/main/cpp/jllama.h,
+          # produced by `javac -h`) is intentionally excluded.
+          files=$(find src/main/cpp src/test/cpp -type f \( -name '*.cpp' -o -name '*.hpp' \) | sort)
+          echo "Checking:"; echo "$files"
+          clang-format --dry-run --Werror $files
@@ -392,10 +392,21 @@ not track the loader's own Java package). This is the same
 `spotbugs-exclude.xml`, PIT `targetClasses`, and `CMakeLists.txt` OSInfo repairs.
 
 ### Code Formatting
+
+C++ formatting is **enforced in CI** (`.github/workflows/clang-format.yml`) with a **pinned**
+clang-format — currently **22.1.5**, installed via `pip install clang-format==22.1.5`. Format with
+that exact version before committing; a different clang-format version reflows code differently and
+will fail the check.
+
 ```bash
-clang-format -i src/main/cpp/*.cpp src/main/cpp/*.hpp   # Format C++ code
+pip install "clang-format==22.1.5"
+clang-format -i src/main/cpp/*.cpp src/main/cpp/*.hpp src/test/cpp/*.cpp   # Format C++ code
 ```
 
+The generated JNI header `src/main/cpp/jllama.h` (produced by `javac -h`) is intentionally excluded.
+To bump the enforced version, update the pin in **both** the workflow (`CLANG_FORMAT_VERSION`) and
+this section (prose and `pip install` command), then reformat the whole tree in the same commit.
+
 ### Javadoc — must build cleanly before `mvn package`
 
 The release packaging job runs `mvn package` with the `release` profile, which attaches
@@ -452,6 +463,7 @@ If the local check passes (`BUILD SUCCESS`), the `mvn package` job in
 - `LlamaIterator` / `LlamaIterable` — Streaming generation via Java `Iterator`/`Iterable`.
 - `LlamaLoader` — Extracts the platform-specific native library from the JAR to a temp directory, or finds it on `java.library.path`.
 - `OSInfo` — Detects OS and architecture for library resolution.
+- `server.OpenAiCompatServer` — Optional OpenAI-compatible HTTP endpoint built on the JDK's `com.sun.net.httpserver` (no new dependency). Serves `POST /v1/chat/completions` (streaming via SSE + non-streaming) and `GET /v1/models` by delegating to `LlamaModel.chatComplete` / `LlamaModel.streamChatCompletion`, so editors that speak the OpenAI protocol (e.g. VS Code Copilot "Custom Endpoint") can drive a local model. Streaming uses the native OAI chunk path (`requestChatCompletionStream` / `receiveChatCompletionChunk`), preserving `delta.tool_calls`.
 
 **Native layer** (`src/main/cpp/`):
 - `jllama.cpp` — JNI implementation bridging Java calls to llama.cpp. ~1,215 lines; 17 native methods.
@@ -476,7 +488,7 @@ The project C++ helpers follow a strict semantic split:
 
 Functions: `get_result_error_message`, `results_to_json`, `rerank_results_to_json`,
 `parse_encoding_format`, `extract_embedding_prompt`, `is_infill_request`,
-`parse_slot_prompt_similarity`, `parse_positive_int_config`.
+`parse_slot_prompt_similarity`, `parse_positive_int_config`, `wrap_stream_chunk`.
 
 **`log_helpers.hpp`** — Pure log-formatting transforms.
 - Input: `ggml_log_level`, message text (`const char*`), an explicit `std::time_t` timestamp.
@@ -582,11 +594,11 @@ ctest --test-dir build --output-on-failure -R "ResultsToJson"
 |------|-------|-------|
 | `src/test/cpp/test_utils.cpp` | 156 | Upstream helpers: `server_tokens`, `server_grammar_trigger`, `gen_tool_call_id`, `json_value`, `json_get_nested_values`, UTF-8 helpers, `format_response_rerank`, `format_embeddings_response_oaicompat`, `oaicompat_completion_params_parse`, `oaicompat_chat_params_parse`, `are_lora_equal`, `strip_flag_from_argv`, `token_piece_value`, `json_is_array_and_contains_numbers`, `format_oai_sse`, `format_oai_resp_sse`, `format_anthropic_sse` |
 | `src/test/cpp/test_server.cpp` | 188 | Upstream result types: `result_timings`, `task_params::to_json()` (incl. `dry_sequence_breakers`, `preserved_tokens`, `timings_per_token`), `completion_token_output`, `server_task_result_cmpl_partial` (non-oaicompat + `to_json_oaicompat` + logprobs + `to_json_oaicompat_chat` + `to_json_anthropic` + dispatcher), `server_task_result_cmpl_final` (non-oaicompat + `to_json_oaicompat` + `to_json_oaicompat_chat` + `to_json_oaicompat_chat_stream` + `to_json_anthropic` + `to_json_anthropic_stream` + tool_calls + dispatcher), `server_task_result_embd`, `server_task_result_rerank`, `server_task_result_metrics`, `server_task_result_slot_save_load`, `server_task_result_slot_erase`, `server_task_result_apply_lora`, `server_task_result_error`, `format_error_response`, `server_task::need_sampling()`, `server_task::n_tokens()`, `server_task::params_from_json_cmpl()` (parsing pipeline + grammar routing + error paths), `response_fields` projection |
-| `src/test/cpp/test_json_helpers.cpp` | 42 | All functions in `json_helpers.hpp`: `get_result_error_message`, `results_to_json`, `rerank_results_to_json`, `parse_encoding_format`, `extract_embedding_prompt`, `is_infill_request`, `parse_slot_prompt_similarity`, `parse_positive_int_config` |
+| `src/test/cpp/test_json_helpers.cpp` | 47 | All functions in `json_helpers.hpp`: `get_result_error_message`, `results_to_json`, `rerank_results_to_json`, `parse_encoding_format`, `extract_embedding_prompt`, `is_infill_request`, `parse_slot_prompt_similarity`, `parse_positive_int_config`, `wrap_stream_chunk` |
 | `src/test/cpp/test_log_helpers.cpp` | 13 | All functions in `log_helpers.hpp`: `log_level_name`, `format_log_as_json` |
 | `src/test/cpp/test_jni_helpers.cpp` | 41 | All functions in `jni_helpers.hpp` using a zero-filled `JNINativeInterface_` mock |
 
-**Current total: 440 tests (all passing).**
+**Current total: 445 tests (all passing).**
 
 #### Upstream source location (in CMake build tree)
 

@@ -396,6 +396,77 @@ a JSON response, matching the HTTP server's contract:
 Server state is exposed via `getMetrics()`, `eraseSlot(int)`, `saveSlot(int, String)`,
 `restoreSlot(int, String)`, and `getModelMeta()`.
 
+### Local OpenAI endpoint for VS Code Copilot (and other OpenAI clients)
+
+`net.ladenthin.llama.server.OpenAiCompatServer` turns a loaded model into a local
+OpenAI-compatible HTTP endpoint using only the JDK's built-in `com.sun.net.httpserver` — no extra
+dependency and no separate server process. It serves:
+
+- `POST /v1/chat/completions` — streaming (Server-Sent Events) and non-streaming, forwarding
+  `messages`/`tools` verbatim. The streaming path carries `delta.tool_calls`, so agent/tool-calling
+  clients work.
+- `GET /v1/models` — advertises the configured model id.
+
+Embed it in your app:
+
+```java
+ModelParameters modelParams = new ModelParameters().setModel("models/model.gguf").setParallel(2);
+OpenAiServerConfig config = OpenAiServerConfig.builder().port(8080).modelId("local-model").build();
+try (LlamaModel model = new LlamaModel(modelParams);
+     OpenAiCompatServer server = new OpenAiCompatServer(model, config).start()) {
+    Thread.currentThread().join(); // serve until interrupted
+}
+```
+
+…or run it standalone:
+
+```bash
+java -cp target/llama-<version>.jar net.ladenthin.llama.server.OpenAiCompatServer \
+  --model models/model.gguf --port 8080 --model-id local-model
+```
+
+Verify with curl:
+
+```bash
+curl -N http://127.0.0.1:8080/v1/chat/completions \
+  -H 'Content-Type: application/json' \
+  -d '{"model":"local-model","stream":true,"messages":[{"role":"user","content":"hi"}]}'
+```
+
+**VS Code Copilot setup:** Command Palette → **Chat: Manage Language Models** → **Add Models** →
+**Custom Endpoint**; enter a group name, a display name and any non-empty API key, and pick API type
+**Chat Completions**. VS Code then opens `chatLanguageModels.json` — set the model `url` to your
+endpoint (the host/port go here, not in the form):
+
+```json
+[
+  {
+    "name": "Local llama.cpp",
+    "vendor": "customendpoint",
+    "apiKey": "local-dummy-key",
+    "apiType": "chat-completions",
+    "models": [
+      {
+        "id": "local-model",
+        "name": "Local model",
+        "url": "http://127.0.0.1:8080/v1/chat/completions",
+        "toolCalling": true,
+        "vision": false,
+        "maxInputTokens": 6144,
+        "maxOutputTokens": 2048
+      }
+    ]
+  }
+]
+```
+
+Notes: bring-your-own-key (BYOK) models power the chat/agent experience only; inline completions and
+embeddings still require a GitHub account. On CPU, prefer a smaller model and a modest context
+window; the server emits SSE heartbeats so that a long prompt prefill does not trip the client's
+stream-inactivity timeout. Agent-mode tool calling depends on the model's own tool-calling quality.
+Pass `--api-key` (or `OpenAiServerConfig.apiKey(...)`) to require an `Authorization: Bearer` token;
+the server binds to `127.0.0.1` by default.
+
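For readers who want to talk to the endpoint from Java rather than curl, the following is a minimal client-side sketch, not part of this PR. It assumes Java 11+ (for `java.net.http`), and the class and method names (`OpenAiClientSketch`, `buildChatRequest`, `extractSsePayloads`) are hypothetical. It also assumes that the stream ends with the OpenAI-style `data: [DONE]` sentinel and that heartbeats arrive as SSE comment lines starting with `:`; the README does not state either detail, so check the actual stream before relying on them.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

/** Client-side sketch; names are illustrative and not part of the library. */
public class OpenAiClientSketch {

    /** Builds the POST request; the Authorization header is added only when a key is given. */
    public static HttpRequest buildChatRequest(URI endpoint, String apiKey, String jsonBody) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(endpoint)
                .timeout(Duration.ofMinutes(5)) // generous, since CPU prefill can be slow
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody));
        if (apiKey != null && !apiKey.isEmpty()) {
            builder.header("Authorization", "Bearer " + apiKey);
        }
        return builder.build();
    }

    /**
     * Collects the payloads of "data:" events from raw SSE lines.
     * Assumptions: "[DONE]" marks the end of the stream, and lines starting
     * with ':' are SSE comments (a common way to send heartbeats).
     */
    public static List<String> extractSsePayloads(List<String> lines) {
        List<String> payloads = new ArrayList<>();
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith(":")) {
                continue; // event separator or comment/heartbeat
            }
            if (!line.startsWith("data:")) {
                continue; // other SSE fields (event:, id:, retry:) are ignored here
            }
            String payload = line.substring("data:".length()).trim();
            if (payload.equals("[DONE]")) {
                break; // assumed end-of-stream sentinel
            }
            payloads.add(payload);
        }
        return payloads;
    }

    public static void main(String[] args) {
        HttpRequest request = buildChatRequest(
                URI.create("http://127.0.0.1:8080/v1/chat/completions"),
                "local-dummy-key",
                "{\"model\":\"local-model\",\"stream\":true,"
                        + "\"messages\":[{\"role\":\"user\",\"content\":\"hi\"}]}");
        System.out.println(request.method() + " " + request.uri());

        // Hand-written sample lines, not captured from a real server.
        List<String> sample = List.of(
                ": heartbeat",
                "data: {\"choices\":[{\"delta\":{\"content\":\"Hel\"}}]}",
                "",
                "data: {\"choices\":[{\"delta\":{\"content\":\"lo\"}}]}",
                "",
                "data: [DONE]");
        extractSsePayloads(sample).forEach(System.out::println);
    }
}
```

To send the request for real, pass it to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofLines())` and feed the resulting lines to `extractSsePayloads`. Each payload is a JSON chunk that still needs a JSON parser to read `delta.content` or `delta.tool_calls`.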
 ### Model/Inference Configuration
 
 There are two sets of parameters you can configure, `ModelParameters` and `InferenceParameters`. Both provide builder 

@@ -13,6 +13,26 @@ cross-cutting initiative.
 
 ## Open — jllama-specific
 
+### OpenAI-compatible HTTP endpoint (shipped; follow-ups open)
+
+`net.ladenthin.llama.server.OpenAiCompatServer` exposes `POST /v1/chat/completions` (streaming via
+SSE + non-streaming) and `GET /v1/models` over the JDK's built-in `com.sun.net.httpserver` (no new
+dependency), so editors that speak the OpenAI protocol (e.g. VS Code Copilot "Custom Endpoint") can
+drive a local model. Streaming uses the native OAI chunk path (`requestChatCompletionStream` /
+`receiveChatCompletionChunk`), preserving `delta.tool_calls` for agent mode. Follow-ups, deferred
+until requested:
+
+- **Multi-model registry.** Only one model id is advertised/served today; support several models
+  chosen by the request `model` field (and listed in `/v1/models`).
+- **`stream_options.include_usage` passthrough** so the final streamed `usage` chunk is emitted
+  (needs a generic raw-param passthrough on `InferenceParameters`, or explicit mapping).
+- **Additional `apiType`s.** VS Code "Custom Endpoint" also offers Anthropic `messages` and OpenAI
+  `responses`; only `chat-completions` is implemented. Also consider `/v1/completions` and
+  `/v1/embeddings` routes.
+- **Gemma 4 tool-calling validation.** Confirm the pinned llama.cpp (`b9682`) includes the Gemma 4
+  tool-call parser fixes (landed upstream ~Apr 2026); if not, bump per the upgrade procedure so
+  streamed/blocking `tool_calls` come through for Gemma 4 GGUFs.
+
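As a rough illustration of the multi-model follow-up listed above, a registry keyed by model id could resolve the request `model` field and supply the ids for `/v1/models`. This is only a sketch under assumptions: `ModelRegistrySketch` is a hypothetical name, nothing like it exists in the library today, and the type parameter `M` stands in for `LlamaModel` so the example compiles on its own.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical registry; M stands in for LlamaModel. */
public class ModelRegistrySketch<M> {
    // LinkedHashMap keeps registration order, so the id listing is stable.
    private final Map<String, M> modelsById = new LinkedHashMap<>();
    private final String defaultId;

    public ModelRegistrySketch(String defaultId, M defaultModel) {
        this.defaultId = defaultId;
        modelsById.put(defaultId, defaultModel);
    }

    /** Adds (or replaces) a model under the given id and returns this registry for chaining. */
    public ModelRegistrySketch<M> register(String id, M model) {
        modelsById.put(id, model);
        return this;
    }

    /**
     * Resolves the request "model" field. A missing or blank id falls back to the
     * default model; an unknown id yields an empty Optional, which a server
     * could translate into an error response.
     */
    public Optional<M> resolve(String requestedId) {
        if (requestedId == null || requestedId.trim().isEmpty()) {
            return Optional.of(modelsById.get(defaultId));
        }
        return Optional.ofNullable(modelsById.get(requestedId));
    }

    /** Ids in registration order, for example to build a model listing. */
    public List<String> ids() {
        return new ArrayList<>(modelsById.keySet());
    }

    public static void main(String[] args) {
        // Strings stand in for loaded models in this demo.
        ModelRegistrySketch<String> registry =
                new ModelRegistrySketch<>("local-model", "model A").register("small-model", "model B");
        System.out.println(registry.ids());
        System.out.println(registry.resolve("small-model").isPresent());
        System.out.println(registry.resolve("unknown").isPresent());
    }
}
```

Falling back to the default for a missing id keeps single-model clients working unchanged; whether an unknown id should be rejected or also fall back is a design choice the follow-up would need to settle.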
 ### llama.cpp upstream feature exposure (queued, deferred by policy)
 
 These are JNI plumbing items for upstream API additions. Policy: add only after a real user request — they are mostly relevant to specific model families or specialized workflows.