CLAUDE.md
+2 lines changed: 2 additions & 0 deletions
@@ -229,6 +229,7 @@ For the full record of upstream API breaks across version ranges (b5022 →
 mvn compile # Compiles Java and generates JNI headers
 mvn test # Run all tests (requires native library and model files)
 mvn package # Build JAR
+mvn -P assembly package # Also build the fat jar-with-dependencies uber JAR (library + Java deps + native libs); CI builds it and uploads it in the `llama-jars` artifact
 mvn test -Dtest=LlamaModelTest#testGenerate # Run a single test method
 ```
@@ -452,6 +453,7 @@ If the local check passes (`BUILD SUCCESS`), the `mvn package` job in
 - `LlamaIterator` / `LlamaIterable` — Streaming generation via Java `Iterator`/`Iterable`.
 - `LlamaLoader` — Extracts the platform-specific native library from the JAR to a temp directory, or finds it on `java.library.path`.
 - `OSInfo` — Detects OS and architecture for library resolution.
+- `server.LlamaServer` — Optional OpenAI-compatible HTTP server and the fat-jar `Main-Class`. `LlamaServerArgs` parses the CLI; `OaiRouter` / `OaiHttpServer` (NanoHTTPD) map `POST /v1/chat/completions`, `/v1/completions`, `/v1/embeddings` and `GET /v1/models` to the `LlamaModel.handle*` methods. NanoHTTPD is an `<optional>` dependency (bundled only in the fat jar, not inherited by library consumers). The `server` package is a dedicated top layer in the ArchUnit `layeredArchitecture` rule (the only layer allowed to access the root `Api`). See README "OpenAI-compatible HTTP server".
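The routing described in the `server.LlamaServer` entry (HTTP method plus path selects one `LlamaModel.handle*` call) can be pictured with the minimal Java sketch below. It is not the project's `OaiRouter`: the class name `RouteTableSketch` and its placeholder handlers are hypothetical, and the actual `handle*` method names are not shown in this diff.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch of an OpenAI-style route table; NOT the project's OaiRouter.
final class RouteTableSketch {
    // Key is "METHOD path"; value maps a JSON request body to a JSON response body.
    private final Map<String, UnaryOperator<String>> routes = new LinkedHashMap<>();

    public RouteTableSketch() {
        // Placeholder handlers standing in for the LlamaModel.handle* methods.
        routes.put("POST /v1/chat/completions", body -> "{\"handler\":\"chat\"}");
        routes.put("POST /v1/completions", body -> "{\"handler\":\"completion\"}");
        routes.put("POST /v1/embeddings", body -> "{\"handler\":\"embeddings\"}");
        routes.put("GET /v1/models", body -> "{\"handler\":\"models\"}");
    }

    /** Returns the handler's JSON, or null when no route matches the method/path pair. */
    public String dispatch(String method, String path, String body) {
        UnaryOperator<String> handler = routes.get(method + " " + path);
        return handler == null ? null : handler.apply(body);
    }

    public static void main(String[] args) {
        RouteTableSketch router = new RouteTableSketch();
        System.out.println(router.dispatch("POST", "/v1/chat/completions", "{}"));
        System.out.println(router.dispatch("GET", "/v1/chat/completions", null));
    }
}
```

Keying on `"METHOD path"` keeps dispatch a single map lookup; an unmatched pair yields `null`, which an HTTP front end such as NanoHTTPD would presumably turn into an error status like 404.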
 - Pre-built native binaries for Linux (x86-64, aarch64), macOS (x86-64, arm64), and Windows (x86-64, x86); CUDA, Metal, and Vulkan supported via local build.
@@ -396,6 +397,37 @@ a JSON response, matching the HTTP server's contract:
 Server state is exposed via `getMetrics()`, `eraseSlot(int)`, `saveSlot(int, String)`,
 `restoreSlot(int, String)`, and `getModelMeta()`.
 
+### OpenAI-compatible HTTP server
+
+The fat jar built by the `assembly` profile (`mvn -P assembly package`) is runnable: its
+`Main-Class` is `net.ladenthin.llama.server.LlamaServer`, a small [NanoHTTPD](https://github.com/NanoHttpd/nanohttpd)
+server that loads a GGUF model in-process and serves OpenAI-compatible endpoints by forwarding each
+request body to the matching `LlamaModel.handle*` method:
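Once the fat jar is running, any HTTP client can call these endpoints. The sketch below uses `java.net.http` (Java 11 or later) to build and send a chat-completions request. The base URL `http://localhost:8080` is an assumption, since the default host and port parsed by `LlamaServerArgs` are not shown in this diff, and the body carries only an OpenAI-style `messages` array; whether the server expects further fields (such as `model`) is not stated here.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch only: assumes the server listens on http://localhost:8080 (hypothetical default).
final class ChatRequestSketch {
    static HttpRequest chatRequest(String baseUrl, String userMessage) {
        // Minimal OpenAI-style chat body. userMessage is inserted without JSON escaping,
        // so it must not contain quotes or backslashes in this sketch.
        String json = "{\"messages\":[{\"role\":\"user\",\"content\":\"" + userMessage + "\"}]}";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpRequest request = chatRequest("http://localhost:8080", "Hello");
        // Blocking send; prints the HTTP status and the raw JSON response body.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Building the request separately from sending it keeps the request shape testable without a running server.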
TODO.md
+34 lines changed: 34 additions & 0 deletions
@@ -55,6 +55,40 @@ These are JNI plumbing items for upstream API additions. Policy: add only after
 
 **Out of scope until evidence supports it**: actually implementing any of the above. This entry exists so that when someone asks "can I ship java-llama.cpp as a single 30 MB binary?" the answer points to a concrete investigation plan rather than restarting from zero.