Merged
Changes from 4 commits
34 changes: 34 additions & 0 deletions .github/workflows/clang-format.yml
@@ -0,0 +1,34 @@
# SPDX-FileCopyrightText: 2026 Bernard Ladenthin <bernard.ladenthin@gmail.com>
#
# SPDX-License-Identifier: MIT

name: clang-format
on:
  push:
  pull_request:
  workflow_dispatch:

# Enforces a single, pinned clang-format across all C++ sources so formatting is
# reproducible between contributors and CI. Bump CLANG_FORMAT_VERSION here and in
# CLAUDE.md (Code Formatting) together, then reformat the tree with the same version.
env:
  CLANG_FORMAT_VERSION: "22.1.5"

jobs:
  clang-format:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: actions/setup-python@v5
        with:
          python-version: "3.x"
      - name: Install pinned clang-format
        run: pip install "clang-format==${CLANG_FORMAT_VERSION}"
      - name: Check C++ formatting
        run: |
          clang-format --version
          # All hand-written C++ sources; the generated JNI header (src/main/cpp/jllama.h,
          # produced by `javac -h`) is intentionally excluded.
          files=$(find src/main/cpp src/test/cpp -type f \( -name '*.cpp' -o -name '*.hpp' \) | sort)
          echo "Checking:"; echo "$files"
          clang-format --dry-run --Werror $files
20 changes: 16 additions & 4 deletions CLAUDE.md
@@ -392,10 +392,21 @@ not track the loader's own Java package). This is the same
`spotbugs-exclude.xml`, PIT `targetClasses`, and `CMakeLists.txt` OSInfo repairs.

### Code Formatting

C++ formatting is **enforced in CI** (`.github/workflows/clang-format.yml`) with a **pinned**
clang-format — currently **22.1.5**, installed via `pip install clang-format==22.1.5`. Format with
that exact version before committing; a different clang-format version reflows code differently and
will fail the check.

```bash
-clang-format -i src/main/cpp/*.cpp src/main/cpp/*.hpp # Format C++ code
+pip install "clang-format==22.1.5"
+clang-format -i src/main/cpp/*.cpp src/main/cpp/*.hpp src/test/cpp/*.cpp # Format C++ code
```

The generated JNI header `src/main/cpp/jllama.h` (produced by `javac -h`) is intentionally excluded.
To bump the enforced version, update the pin in **both** the workflow (`CLANG_FORMAT_VERSION`) and
this line, then reformat the whole tree with the new version in the same commit.

### Javadoc — must build cleanly before `mvn package`

The release packaging job runs `mvn package` with the `release` profile, which attaches
@@ -452,6 +463,7 @@ If the local check passes (`BUILD SUCCESS`), the `mvn package` job in
- `LlamaIterator` / `LlamaIterable` — Streaming generation via Java `Iterator`/`Iterable`.
- `LlamaLoader` — Extracts the platform-specific native library from the JAR to a temp directory, or finds it on `java.library.path`.
- `OSInfo` — Detects OS and architecture for library resolution.
- `server.OpenAiCompatServer` — Optional OpenAI-compatible HTTP endpoint built on the JDK's `com.sun.net.httpserver` (no new dependency). Serves `POST /v1/chat/completions` (streaming via SSE + non-streaming) and `GET /v1/models` by delegating to `LlamaModel.chatComplete` / `LlamaModel.streamChatCompletion`, so editors that speak the OpenAI protocol (e.g. VS Code Copilot "Custom Endpoint") can drive a local model. Streaming uses the native OAI chunk path (`requestChatCompletionStream` / `receiveChatCompletionChunk`), preserving `delta.tool_calls`.

**Native layer** (`src/main/cpp/`):
- `jllama.cpp` — JNI implementation bridging Java calls to llama.cpp. ~1,215 lines; 17 native methods.
@@ -476,7 +488,7 @@ The project C++ helpers follow a strict semantic split:

Functions: `get_result_error_message`, `results_to_json`, `rerank_results_to_json`,
`parse_encoding_format`, `extract_embedding_prompt`, `is_infill_request`,
-`parse_slot_prompt_similarity`, `parse_positive_int_config`.
+`parse_slot_prompt_similarity`, `parse_positive_int_config`, `wrap_stream_chunk`.

**`log_helpers.hpp`** — Pure log-formatting transforms.
- Input: `ggml_log_level`, message text (`const char*`), an explicit `std::time_t` timestamp.
@@ -582,11 +594,11 @@ ctest --test-dir build --output-on-failure -R "ResultsToJson"
|------|-------|-------|
| `src/test/cpp/test_utils.cpp` | 156 | Upstream helpers: `server_tokens`, `server_grammar_trigger`, `gen_tool_call_id`, `json_value`, `json_get_nested_values`, UTF-8 helpers, `format_response_rerank`, `format_embeddings_response_oaicompat`, `oaicompat_completion_params_parse`, `oaicompat_chat_params_parse`, `are_lora_equal`, `strip_flag_from_argv`, `token_piece_value`, `json_is_array_and_contains_numbers`, `format_oai_sse`, `format_oai_resp_sse`, `format_anthropic_sse` |
| `src/test/cpp/test_server.cpp` | 188 | Upstream result types: `result_timings`, `task_params::to_json()` (incl. `dry_sequence_breakers`, `preserved_tokens`, `timings_per_token`), `completion_token_output`, `server_task_result_cmpl_partial` (non-oaicompat + `to_json_oaicompat` + logprobs + `to_json_oaicompat_chat` + `to_json_anthropic` + dispatcher), `server_task_result_cmpl_final` (non-oaicompat + `to_json_oaicompat` + `to_json_oaicompat_chat` + `to_json_oaicompat_chat_stream` + `to_json_anthropic` + `to_json_anthropic_stream` + tool_calls + dispatcher), `server_task_result_embd`, `server_task_result_rerank`, `server_task_result_metrics`, `server_task_result_slot_save_load`, `server_task_result_slot_erase`, `server_task_result_apply_lora`, `server_task_result_error`, `format_error_response`, `server_task::need_sampling()`, `server_task::n_tokens()`, `server_task::params_from_json_cmpl()` (parsing pipeline + grammar routing + error paths), `response_fields` projection |
-| `src/test/cpp/test_json_helpers.cpp` | 42 | All functions in `json_helpers.hpp`: `get_result_error_message`, `results_to_json`, `rerank_results_to_json`, `parse_encoding_format`, `extract_embedding_prompt`, `is_infill_request`, `parse_slot_prompt_similarity`, `parse_positive_int_config` |
+| `src/test/cpp/test_json_helpers.cpp` | 47 | All functions in `json_helpers.hpp`: `get_result_error_message`, `results_to_json`, `rerank_results_to_json`, `parse_encoding_format`, `extract_embedding_prompt`, `is_infill_request`, `parse_slot_prompt_similarity`, `parse_positive_int_config`, `wrap_stream_chunk` |
| `src/test/cpp/test_log_helpers.cpp` | 13 | All functions in `log_helpers.hpp`: `log_level_name`, `format_log_as_json` |
| `src/test/cpp/test_jni_helpers.cpp` | 41 | All functions in `jni_helpers.hpp` using a zero-filled `JNINativeInterface_` mock |

-**Current total: 440 tests (all passing).**
+**Current total: 445 tests (all passing).**

#### Upstream source location (in CMake build tree)

71 changes: 71 additions & 0 deletions README.md
@@ -396,6 +396,77 @@ a JSON response, matching the HTTP server's contract:
Server state is exposed via `getMetrics()`, `eraseSlot(int)`, `saveSlot(int, String)`,
`restoreSlot(int, String)`, and `getModelMeta()`.

### Local OpenAI endpoint for VS Code Copilot (and other OpenAI clients)

`net.ladenthin.llama.server.OpenAiCompatServer` turns a loaded model into a local
OpenAI-compatible HTTP endpoint using only the JDK's built-in `com.sun.net.httpserver` — no extra
dependency and no separate server process. It serves:

- `POST /v1/chat/completions` — streaming (Server-Sent Events) and non-streaming, forwarding
`messages`/`tools` verbatim. The streaming path carries `delta.tool_calls`, so agent/tool-calling
clients work.
- `GET /v1/models` — advertises the configured model id.
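
The Server-Sent Events framing used by the streaming path can be sketched as follows. This is a minimal illustration, not the server's actual code: the class `SseFraming` and its method names are hypothetical, the `data: [DONE]` terminator follows the OpenAI streaming convention, and using an SSE comment line as the heartbeat is an assumption about how a keep-alive might be sent.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper illustrating SSE framing; not part of the library. */
public final class SseFraming {
    private SseFraming() {}

    /** Frames one single-line JSON chunk as an SSE event: "data: <json>" plus a blank line. */
    public static String event(String json) {
        return "data: " + json + "\n\n";
    }

    /** An SSE comment line (leading colon). Clients ignore it, but it keeps the stream active. */
    public static String heartbeat() {
        return ": keep-alive\n\n";
    }

    /** End-of-stream marker used by the OpenAI streaming protocol. */
    public static String done() {
        return "data: [DONE]\n\n";
    }

    public static void main(String[] args) {
        // Illustrative chunk only; real chunks carry more fields (id, model, ...).
        String chunk = "{\"choices\":[{\"delta\":{\"content\":\"hi\"}}]}";
        byte[] wire = (heartbeat() + event(chunk) + done()).getBytes(StandardCharsets.UTF_8);
        System.out.print(new String(wire, StandardCharsets.UTF_8));
    }
}
```

Each event must be a single line of JSON, because a raw newline inside the payload would end the `data:` field early.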

Embed it in your app:

```java
ModelParameters modelParams = new ModelParameters().setModel("models/model.gguf").setParallel(2);
OpenAiServerConfig config = OpenAiServerConfig.builder().port(8080).modelId("local-model").build();
try (LlamaModel model = new LlamaModel(modelParams);
     OpenAiCompatServer server = new OpenAiCompatServer(model, config).start()) {
    Thread.currentThread().join(); // serve until interrupted
}
```

…or run it standalone:

```bash
java -cp target/llama-<version>.jar net.ladenthin.llama.server.OpenAiCompatServer \
  --model models/model.gguf --port 8080 --model-id local-model
```

Verify with curl:

```bash
curl -N http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"local-model","stream":true,"messages":[{"role":"user","content":"hi"}]}'
```
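
The same check can be made from Java with the JDK's `java.net.http.HttpClient`. This is a sketch under stated assumptions: it requires Java 11 or newer on the client side, the class `ChatClientSketch` and its `buildRequest` helper are hypothetical names introduced for illustration, and the URL and model id simply mirror the curl example above.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.stream.Stream;

/** Hypothetical client sketch; mirrors the curl command above. */
public final class ChatClientSketch {
    private ChatClientSketch() {}

    /** Builds the POST request; adds a Bearer token only when an API key is supplied. */
    public static HttpRequest buildRequest(String url, String apiKey, String jsonBody) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody));
        if (apiKey != null && !apiKey.isEmpty()) {
            builder.header("Authorization", "Bearer " + apiKey);
        }
        return builder.build();
    }

    public static void main(String[] args) throws InterruptedException {
        String body = "{\"model\":\"local-model\",\"stream\":true,"
                + "\"messages\":[{\"role\":\"user\",\"content\":\"hi\"}]}";
        HttpRequest request =
                buildRequest("http://127.0.0.1:8080/v1/chat/completions", null, body);
        try {
            // ofLines() exposes the SSE stream line by line as it arrives.
            HttpResponse<Stream<String>> response =
                    HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofLines());
            try (Stream<String> lines = response.body()) {
                lines.filter(line -> line.startsWith("data: ")).forEach(System.out::println);
            }
        } catch (IOException e) {
            // Typically means no server is listening on the assumed host and port.
            System.err.println("Could not reach the endpoint: " + e);
        }
    }
}
```

Pass a non-null `apiKey` to `buildRequest` when the server was started with an API key.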

**VS Code Copilot setup:** Command Palette → **Chat: Manage Language Models** → **Add Models** →
**Custom Endpoint**; enter a group name, a display name and any non-empty API key, and pick API type
**Chat Completions**. VS Code then opens `chatLanguageModels.json` — set the model `url` to your
endpoint (the host/port go here, not in the form):

```json
[
  {
    "name": "Local llama.cpp",
    "vendor": "customendpoint",
    "apiKey": "local-dummy-key",
    "apiType": "chat-completions",
    "models": [
      {
        "id": "local-model",
        "name": "Local model",
        "url": "http://127.0.0.1:8080/v1/chat/completions",
        "toolCalling": true,
        "vision": false,
        "maxInputTokens": 6144,
        "maxOutputTokens": 2048
      }
    ]
  }
]
```

Notes: BYOK powers the chat/agent experience only (inline completions and embeddings still require a
GitHub account). On CPU, prefer a smaller model and a modest context window — the server emits SSE
heartbeats so a long prompt prefill does not trip the client's stream-inactivity timeout. Agent-mode
tool calling depends on the model's own tool-calling quality. Pass `--api-key` (or
`OpenAiServerConfig.apiKey(...)`) to require an `Authorization: Bearer` token; the server binds to
`127.0.0.1` by default.
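
For illustration, an `Authorization: Bearer` check of the kind described above might look like the sketch below. How `OpenAiCompatServer` actually validates the token is not shown in this change, so the class `BearerCheckSketch`, the method `isAuthorized`, and the choice of a constant-time comparison are all assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

/** Hypothetical sketch of a Bearer-token check; not the server's actual code. */
public final class BearerCheckSketch {
    private BearerCheckSketch() {}

    /**
     * Returns true when no key is configured, or when the header carries
     * exactly "Bearer <expectedKey>".
     */
    public static boolean isAuthorized(String authorizationHeader, String expectedKey) {
        if (expectedKey == null || expectedKey.isEmpty()) {
            return true; // assumption: authentication is disabled when no key is configured
        }
        String prefix = "Bearer ";
        if (authorizationHeader == null || !authorizationHeader.startsWith(prefix)) {
            return false;
        }
        byte[] presented = authorizationHeader.substring(prefix.length())
                .getBytes(StandardCharsets.UTF_8);
        byte[] expected = expectedKey.getBytes(StandardCharsets.UTF_8);
        // MessageDigest.isEqual avoids leaking the mismatch position through timing.
        return MessageDigest.isEqual(presented, expected);
    }

    public static void main(String[] args) {
        System.out.println(isAuthorized("Bearer local-dummy-key", "local-dummy-key")); // true
        System.out.println(isAuthorized("Bearer wrong", "local-dummy-key"));           // false
        System.out.println(isAuthorized(null, "local-dummy-key"));                     // false
        System.out.println(isAuthorized(null, null));                                  // true
    }
}
```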

### Model/Inference Configuration

There are two sets of parameters you can configure, `ModelParameters` and `InferenceParameters`. Both provide builder
20 changes: 20 additions & 0 deletions TODO.md
@@ -13,6 +13,26 @@ cross-cutting initiative.

## Open — jllama-specific

### OpenAI-compatible HTTP endpoint (shipped; follow-ups open)

`net.ladenthin.llama.server.OpenAiCompatServer` exposes `POST /v1/chat/completions` (streaming via
SSE + non-streaming) and `GET /v1/models` over the JDK's built-in `com.sun.net.httpserver` (no new
dependency), so editors that speak the OpenAI protocol (e.g. VS Code Copilot "Custom Endpoint") can
drive a local model. Streaming uses the native OAI chunk path (`requestChatCompletionStream` /
`receiveChatCompletionChunk`), preserving `delta.tool_calls` for agent mode. Follow-ups, deferred
until requested:

- **Multi-model registry.** Only one model id is advertised/served today; support several models
chosen by the request `model` field (and listed in `/v1/models`).
- **`stream_options.include_usage` passthrough** so the final streamed `usage` chunk is emitted
(needs a generic raw-param passthrough on `InferenceParameters`, or explicit mapping).
- **Additional `apiType`s.** VS Code "Custom Endpoint" also offers Anthropic `messages` and OpenAI
`responses`; only `chat-completions` is implemented. Also consider `/v1/completions` and
`/v1/embeddings` routes.
- **Gemma 4 tool-calling validation.** Confirm the pinned llama.cpp (`b9682`) includes the Gemma 4
tool-call parser fixes (landed upstream ~Apr 2026); if not, bump per the upgrade procedure so
streamed/blocking `tool_calls` come through for Gemma 4 GGUFs.
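
As a rough sketch of the multi-model registry follow-up, the `/v1/models` body for several model ids could be assembled as below. The class `ModelsListSketch` and its methods are hypothetical; the `{"object":"list","data":[...]}` shape follows the OpenAI models-list format, and the hand-rolled escaping covers only quotes and backslashes, so a real implementation would more likely use the project's existing JSON handling.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch of a multi-model /v1/models response body. */
public final class ModelsListSketch {
    private ModelsListSketch() {}

    /** Minimal escaping for quotes and backslashes only (assumption: ids are simple strings). */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Renders the OpenAI-style list envelope with one entry per model id. */
    public static String modelsJson(List<String> ids) {
        return ids.stream()
                .map(id -> "{\"id\":\"" + escape(id) + "\",\"object\":\"model\"}")
                .collect(Collectors.joining(",", "{\"object\":\"list\",\"data\":[", "]}"));
    }

    public static void main(String[] args) {
        // Model ids here are placeholders, not models shipped with the project.
        System.out.println(modelsJson(Arrays.asList("local-model", "second-model")));
    }
}
```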

### llama.cpp upstream feature exposure (queued, deferred by policy)

These are JNI plumbing items for upstream API additions. Policy: add only after a real user request — they are mostly relevant to specific model families or specialized workflows.