| 1 | +# langchain4j-jllama |
| 2 | + |
| 3 | +[LangChain4j](https://github.com/langchain4j/langchain4j) adapters backed by an **in-process** |
| 4 | +[java-llama.cpp](https://github.com/bernardladenthin/java-llama.cpp) model over JNI — no HTTP server, |
| 5 | +no separate process. |
| 6 | + |
| 7 | +This is a **separate Maven artifact** on purpose: it depends on `langchain4j-core`, but the core |
| 8 | +`net.ladenthin:llama` binding does **not** depend on langchain4j, so plain java-llama.cpp users never |
| 9 | +pull langchain4j (or its Java 17 floor) transitively. |
| 10 | + |
| 11 | +> **Already have an OpenAI-compatible setup?** java-llama.cpp also ships |
| 12 | +> `net.ladenthin.llama.server.OpenAiCompatServer`, so you can point langchain4j's `langchain4j-open-ai` |
| 13 | +> client at a running server with zero code from this module. Use *this* module when you want the |
| 14 | +> in-process path (no HTTP hop, single process — e.g. desktop/Android/embedded). |
| 15 | + |
| 16 | +## Adapters |
| 17 | + |
| 18 | +| Class | langchain4j interface | java-llama.cpp call | |
| 19 | +|-------|-----------------------|---------------------| |
| 20 | +| `JllamaChatModel` | `ChatModel` | `LlamaModel.chat(...)` | |
| 21 | +| `JllamaStreamingChatModel` | `StreamingChatModel` | `LlamaModel.generateChat(...)` (token streaming) | |
| 22 | +| `JllamaEmbeddingModel` | `EmbeddingModel` | `LlamaModel.embed(...)` | |
| 23 | +| `JllamaScoringModel` | `ScoringModel` (re-ranking) | `LlamaModel.handleRerank(...)` | |
| 24 | + |
| 25 | +## Lifecycle: the model is *borrowed* |
| 26 | + |
| 27 | +Every adapter takes a `LlamaModel` that you have already loaded and **continue to own**. The adapter |
| 28 | +never loads or closes the native model; you manage its lifetime (try-with-resources or an explicit |
| 29 | +`close()`). One `LlamaModel` can back several adapters at once. |
| 30 | + |
| 31 | +```java |
| 32 | +try (LlamaModel llama = new LlamaModel(new ModelParameters().setModel("models/qwen3-0.6b.gguf"))) { |
| 33 | + ChatModel chat = new JllamaChatModel(llama); |
| 34 | + |
| 35 | + String reply = chat.chat("Write a haiku about lazy senior devs."); |
| 36 | + System.out.println(reply); |
| 37 | +} |
| 38 | +``` |
| 39 | + |
| 40 | +Streaming: |
| 41 | + |
| 42 | +```java |
| 43 | +StreamingChatModel chat = new JllamaStreamingChatModel(llama); |
| 44 | +chat.chat("Tell me a story.", new StreamingChatResponseHandler() { |
| 45 | + @Override public void onPartialResponse(String token) { System.out.print(token); } |
| 46 | + @Override public void onCompleteResponse(ChatResponse response) { /* done */ } |
| 47 | + @Override public void onError(Throwable error) { error.printStackTrace(); } |
| 48 | +}); |
| 49 | +``` |
| 50 | + |
| 51 | +Embeddings (model loaded with `enableEmbedding()`) and re-ranking |
| 52 | +(`enableReranking()`) plug straight into langchain4j RAG: |
| 53 | + |
| 54 | +```java |
| 55 | +EmbeddingModel embeddings = new JllamaEmbeddingModel(embeddingLlama); |
| 56 | +ScoringModel reranker = new JllamaScoringModel(rerankLlama); |
| 57 | +``` |
| 58 | + |
| 59 | +## Dependency |
| 60 | + |
| 61 | +```xml |
| 62 | +<dependency> |
| 63 | + <groupId>net.ladenthin</groupId> |
| 64 | + <artifactId>langchain4j-jllama</artifactId> |
| 65 | + <version>5.0.4-SNAPSHOT</version> |
| 66 | +</dependency> |
| 67 | +``` |
| 68 | + |
| 69 | +`langchain4j-core` is pulled transitively. You still supply a java-llama.cpp native library for your |
| 70 | +platform the usual way (bundled in the `net.ladenthin:llama` JAR or on `java.library.path`). |
| 71 | + |
| 72 | +## Building |
| 73 | + |
| 74 | +This is a **sibling module**, not part of the root reactor. Install the core artifact first, then |
| 75 | +build here: |
| 76 | + |
| 77 | +```bash |
| 78 | +# from the repo root: publish the core net.ladenthin:llama jar to your local ~/.m2 |
| 79 | +mvn -DskipTests install |
| 80 | + |
| 81 | +# then build/test this module |
| 82 | +cd langchain4j-jllama |
| 83 | +mvn test |
| 84 | +``` |
| 85 | + |
| 86 | +The end-to-end test (`JllamaChatModelIntegrationTest`) self-skips unless you pass a model: |
| 87 | + |
| 88 | +```bash |
| 89 | +mvn test -Dnet.ladenthin.llama.model.path=/abs/path/to/model.gguf |
| 90 | +``` |
| 91 | + |
| 92 | +## Not mapped yet |
| 93 | + |
| 94 | +- **Tool calling.** `ChatRequest.toolSpecifications()` are not forwarded, so the chat adapters return |
| 95 | + assistant *text*, not `AiMessage.toolExecutionRequests()`. (java-llama.cpp itself supports tool |
| 96 | + calling via `LlamaModel.chatWithTools` / typed `ToolDefinition`; bridging that to langchain4j |
| 97 | + `ToolSpecification` is the planned next step.) |
| 98 | +- **Multimodal user input.** A multi-content `UserMessage` is flattened to its text parts; image/audio |
| 99 | + content is dropped. |
| 100 | +- **Per-token tool-call / thinking stream events.** Streaming forwards plain text via |
| 101 | + `onPartialResponse`. |
| 102 | +- **`response_format` (JSON mode).** `ChatRequest.responseFormat()` (json_object / json_schema) is not forwarded. |
| 103 | +- **`modelName()`.** Ignored, because exactly one model is bound per adapter. |
| 104 | + |
| 105 | +Mapped request parameters: `temperature`, `topP`, `topK`, `maxOutputTokens`, `frequencyPenalty`, |
| 106 | +`presencePenalty`, `stopSequences`. The non-streaming chat response carries the model's real finish |
| 107 | +reason (`stop`/`length`/`tool_calls`) and token usage; the streaming completion carries assembled text |
| 108 | +(no per-token usage). |
| 109 | + |
| 110 | +Requires Java 17+ (langchain4j 1.x baseline). Targets `langchain4j-core` 1.17.1. |
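
The borrowing rule from the Lifecycle section can be illustrated without the native library. The following is a minimal sketch only: `FakeModel` and `BorrowingAdapter` are hypothetical stand-ins invented for this example, not classes from this module or from java-llama.cpp. It shows the shape of the contract the README describes: the adapter holds a reference, is deliberately not `AutoCloseable`, and several adapters can share one model whose lifetime the caller controls.

```java
public class BorrowedModelSketch {

    /** Hypothetical stand-in for a native model handle (NOT the real LlamaModel). */
    static final class FakeModel implements AutoCloseable {
        private boolean closed = false;

        String complete(String prompt) {
            if (closed) {
                throw new IllegalStateException("model already closed");
            }
            return "echo: " + prompt;
        }

        boolean isClosed() {
            return closed;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Hypothetical adapter: borrows the model and never closes it. */
    static final class BorrowingAdapter {
        private final FakeModel model;

        BorrowingAdapter(FakeModel model) {
            this.model = model;
        }

        String chat(String prompt) {
            return model.complete(prompt);
        }
    }

    public static void main(String[] args) {
        BorrowingAdapter leaked;
        // The caller owns the model; try-with-resources closes it exactly once.
        try (FakeModel model = new FakeModel()) {
            BorrowingAdapter first = new BorrowingAdapter(model);
            BorrowingAdapter second = new BorrowingAdapter(model); // same model, second adapter
            System.out.println(first.chat("hello"));
            System.out.println(second.chat("world"));
            leaked = first;
        }
        // Using an adapter after its owner closed the model fails in this sketch.
        try {
            leaked.chat("too late");
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In this sketch, an adapter that outlives the `try` block fails with an exception on its next call. How the real adapters behave after the underlying `LlamaModel` is closed is not specified here, so the safe assumption is to keep every adapter's use inside the model's scope.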