Skip to content

Commit 2987cf7

Browse files
committed
test(server): gated integration round-trips for embeddings, rerank, completion/infill/generate
Completes the live end-to-end coverage of the IDE-backend surfaces. Each fixture boots a real server over a real socket in the matching model mode, reuses a model CI already downloads, self-skips when absent, and asserts structural shapes only: - OpenAiServerEmbeddingsIntegrationTest (CodeLlama-7B + enableEmbedding): POST /v1/embeddings returns an OpenAI {object:list, data:[{object:embedding, embedding:[…]}]} shape; also covers the bare /embeddings alias. - OpenAiServerRerankIntegrationTest (jina-reranker + enableReranking): POST /v1/rerank returns sorted {index, relevance_score} results capped by top_n, with the `data` alias. - OpenAiServerCompletionIntegrationTest (CodeLlama-7B): POST /v1/completions, /infill, and Ollama /api/generate (plain + FIM via `suffix`) — CodeLlama is FIM-capable per LlamaModelTest#testGenerateInfill. Also: add TestConstants.RERANKING_MODEL_PATH and route RerankingModelTest through it (removes the duplicated literal). Used Java-8-safe idioms throughout. These run in the same CI job that already round-trips the OpenAI chat path, so the Ollama/Anthropic/Responses/embeddings/rerank/completion surfaces are now all validated end-to-end against real models; only manual editor-client validation remains (TODO). Server + arch suite green (integration fixtures self-skip without models locally); javadoc + spotless clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
1 parent 30dd843 commit 2987cf7

6 files changed

Lines changed: 307 additions & 13 deletions

File tree

TODO.md

Lines changed: 15 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -48,10 +48,17 @@ primary goal: agentic tool-calling with Qwen):
4848
/v1/responses`, SSE events — `ResponsesApiSupport` + `ResponsesStreamTranslator`).
4949
- **`GET /props`** (llama.cpp-native): `default_generation_settings.n_ctx` + `modalities` so autocomplete
5050
clients (llama.vscode) size their context window (`OpenAiSseFormatter.propsJson`).
51-
- Gated **integration round-trips** (`OpenAiCompatServerIntegrationTest`, Qwen3-0.6B over a real socket;
52-
runs in CI's `test-java-linux-x86_64` job, self-skips when the model is absent): OpenAI chat
53-
(non-stream/stream/tools/models) plus Ollama `/api/chat` + discovery, Anthropic `/v1/messages`, OpenAI
54-
`/v1/responses` (non-stream + stream) and `/props` — structural assertions only.
51+
- Gated **integration round-trips** over a real socket, run in CI's `test-java-linux-x86_64` job,
52+
self-skipping when the model is absent — structural assertions only:
53+
- `OpenAiCompatServerIntegrationTest` (Qwen3-0.6B, chat mode): OpenAI chat (non-stream/stream/tools/
54+
models) plus Ollama `/api/chat` + discovery, Anthropic `/v1/messages`, OpenAI `/v1/responses`
55+
(non-stream + stream) and `/props`.
56+
- `OpenAiServerEmbeddingsIntegrationTest` (CodeLlama-7B + `enableEmbedding`): `/v1/embeddings` (+ bare
57+
alias).
58+
- `OpenAiServerRerankIntegrationTest` (jina-reranker + `enableReranking`): `/v1/rerank` (sorted
59+
`results`/`data`, `top_n` cap).
60+
- `OpenAiServerCompletionIntegrationTest` (CodeLlama-7B): `/v1/completions`, `/infill`, and Ollama
61+
`/api/generate` (plain + FIM via `suffix`).
5562

5663
**Open follow-ups (deferred):**
5764

@@ -71,12 +78,10 @@ primary goal: agentic tool-calling with Qwen):
7178
`suffix`) applies the model's FIM tokens server-side, so this is lower value.
7279
- **Multi-model registry.** Only one model id is advertised/served today; serving several would need
7380
multi-model load + lifecycle management.
74-
- **Remaining live validation.** Gated server-side round-trips now exist for all four protocols (above).
75-
Still open: (a) manual validation against the actual editor clients — point Copilot's Ollama provider /
76-
a Custom Endpoint, Claude Code, and a Responses client at the running server; (b) gated round-trips for
77-
`/v1/embeddings`, `/v1/rerank` and `/infill`, which need their own server fixtures in the matching mode
78-
(`enableEmbedding` / `enableReranking` / a FIM-capable model). The models are already downloaded in CI
79-
(nomic-embed, jina-reranker, CodeLlama-7B), so only the test fixtures are missing.
81+
- **Manual real-client validation.** Gated server-side round-trips now exist for every surface (above).
82+
What remains is manual validation against the actual editor clients — point Copilot's Ollama provider /
83+
a Custom Endpoint, Claude Code, and a Responses client at the running server — since a server-side
84+
round-trip confirms the wire shapes but not each client's own parser.
8085
- **Gemma 4 tool-calling validation.** Confirm the pinned llama.cpp (`b9682`) includes the Gemma 4
8186
tool-call parser fixes; if not, bump per the upgrade procedure.
8287

src/test/java/net/ladenthin/llama/RerankingModelTest.java

Lines changed: 2 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -33,12 +33,11 @@ public class RerankingModelTest {
3333
@BeforeAll
3434
public static void setup() {
3535
Assumptions.assumeTrue(
36-
new File("models/jina-reranker-v1-tiny-en-Q4_0.gguf").exists(),
37-
"Reranking model not available, skipping tests");
36+
new File(TestConstants.RERANKING_MODEL_PATH).exists(), "Reranking model not available, skipping tests");
3837
int gpuLayers = Integer.getInteger(TestConstants.PROP_TEST_NGL, TestConstants.DEFAULT_TEST_NGL);
3938
model = new LlamaModel(new ModelParameters()
4039
.setCtxSize(128)
41-
.setModel("models/jina-reranker-v1-tiny-en-Q4_0.gguf")
40+
.setModel(TestConstants.RERANKING_MODEL_PATH)
4241
.setGpuLayers(gpuLayers)
4342
.enableReranking()
4443
.enableLogTimestamps()

src/test/java/net/ladenthin/llama/TestConstants.java

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -23,6 +23,9 @@ public class TestConstants {
2323
/** Path to the Qwen3 thinking model used for reasoning budget tests. */
2424
public static final String REASONING_MODEL_PATH = "models/Qwen3-0.6B-Q4_K_M.gguf";
2525

26+
/** Path to the reranking model used in tests (loaded with {@code enableReranking()}). */
27+
public static final String RERANKING_MODEL_PATH = "models/jina-reranker-v1-tiny-en-Q4_0.gguf";
28+
2629
/**
2730
* System property holding a path to a Nomic embedding model
2831
* ({@code nomic-embed-text-v1.5.f16.gguf} or a compatible BERT-family encoder).
Lines changed: 104 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,104 @@
1+
// SPDX-FileCopyrightText: 2026 Bernard Ladenthin <bernard.ladenthin@gmail.com>
2+
//
3+
// SPDX-License-Identifier: MIT
4+
5+
package net.ladenthin.llama.server;
6+
7+
import static org.hamcrest.MatcherAssert.assertThat;
8+
import static org.hamcrest.Matchers.is;
9+
10+
import com.fasterxml.jackson.databind.JsonNode;
11+
import com.fasterxml.jackson.databind.ObjectMapper;
12+
import java.io.File;
13+
import java.io.IOException;
14+
import net.ladenthin.llama.LlamaModel;
15+
import net.ladenthin.llama.TestConstants;
16+
import net.ladenthin.llama.parameters.ModelParameters;
17+
import org.junit.jupiter.api.AfterAll;
18+
import org.junit.jupiter.api.Assumptions;
19+
import org.junit.jupiter.api.BeforeAll;
20+
import org.junit.jupiter.api.Test;
21+
22+
/**
23+
* End-to-end integration test for the completion-family routes — {@code POST /v1/completions},
24+
* {@code POST /infill} (fill-in-the-middle) and the Ollama {@code POST /api/generate} (plain + FIM via a
25+
* {@code suffix}) — against a real model over a real socket. Reuses the CI text model (CodeLlama-7B,
26+
* {@link TestConstants#MODEL_PATH}), which is FIM-capable (see {@code LlamaModelTest#testGenerateInfill}).
27+
* Self-skips when the model file is absent. Assertions are structural (valid response envelopes) rather
28+
* than value-specific. HTTP plumbing is inherited from {@link OpenAiServerTestSupport}.
29+
*/
30+
public class OpenAiServerCompletionIntegrationTest extends OpenAiServerTestSupport {
31+
32+
private static final ObjectMapper MAPPER = new ObjectMapper();
33+
private static final String MODEL_ID = "completion-local";
34+
35+
private static LlamaModel model;
36+
private static OpenAiCompatServer server;
37+
private static int port;
38+
39+
@BeforeAll
40+
public static void setup() throws IOException {
41+
Assumptions.assumeTrue(
42+
new File(TestConstants.MODEL_PATH).exists(),
43+
"Text model (CodeLlama-7B) not found, skipping completion server integration test");
44+
int gpuLayers = Integer.getInteger(TestConstants.PROP_TEST_NGL, TestConstants.DEFAULT_TEST_NGL);
45+
model = new LlamaModel(new ModelParameters()
46+
.setModel(TestConstants.MODEL_PATH)
47+
.setCtxSize(512)
48+
.setGpuLayers(gpuLayers));
49+
server = new OpenAiCompatServer(
50+
model,
51+
OpenAiServerConfig.builder().port(0).modelId(MODEL_ID).build())
52+
.start();
53+
port = server.getPort();
54+
}
55+
56+
@AfterAll
57+
public static void tearDown() {
58+
if (server != null) {
59+
server.close();
60+
}
61+
if (model != null) {
62+
model.close();
63+
}
64+
}
65+
66+
@Test
67+
public void completionsReturnsTextChoice() throws IOException {
68+
String body = "{\"model\":\"" + MODEL_ID + "\",\"max_tokens\":16,\"prompt\":\"def add(a, b):\\n return\"}";
69+
Response response = post(port, "/v1/completions", body, "");
70+
assertThat(response.code, is(200));
71+
JsonNode json = MAPPER.readTree(response.body);
72+
assertThat(json.path("object").asText(), is("text_completion"));
73+
assertThat(json.path("choices").get(0).path("text").isTextual(), is(true));
74+
}
75+
76+
@Test
77+
public void infillReturnsContent() throws IOException {
78+
String body = "{\"input_prefix\":\"def add(a, b):\\n return \",\"input_suffix\":\"\\n\",\"n_predict\":16}";
79+
Response response = post(port, "/infill", body, "");
80+
assertThat(response.code, is(200));
81+
// The native infill response carries the generated middle under "content".
82+
assertThat(MAPPER.readTree(response.body).path("content").isTextual(), is(true));
83+
}
84+
85+
@Test
86+
public void ollamaGenerateNonStreamingRoundTrip() throws IOException {
87+
String body = "{\"model\":\"" + MODEL_ID + "\",\"stream\":false,"
88+
+ "\"prompt\":\"def add(a, b):\\n return\",\"options\":{\"num_predict\":16}}";
89+
Response response = post(port, "/api/generate", body, "");
90+
assertThat(response.code, is(200));
91+
JsonNode json = MAPPER.readTree(response.body);
92+
assertThat(json.path("response").isTextual(), is(true));
93+
assertThat(json.path("done").asBoolean(), is(true));
94+
}
95+
96+
@Test
97+
public void ollamaGenerateWithSuffixUsesInfill() throws IOException {
98+
String body = "{\"model\":\"" + MODEL_ID + "\",\"stream\":false,"
99+
+ "\"prompt\":\"def add(a, b):\\n return \",\"suffix\":\"\\n\",\"options\":{\"num_predict\":16}}";
100+
Response response = post(port, "/api/generate", body, "");
101+
assertThat(response.code, is(200));
102+
assertThat(MAPPER.readTree(response.body).path("response").isTextual(), is(true));
103+
}
104+
}
Lines changed: 94 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,94 @@
1+
// SPDX-FileCopyrightText: 2026 Bernard Ladenthin <bernard.ladenthin@gmail.com>
2+
//
3+
// SPDX-License-Identifier: MIT
4+
5+
package net.ladenthin.llama.server;
6+
7+
import static org.hamcrest.MatcherAssert.assertThat;
8+
import static org.hamcrest.Matchers.greaterThan;
9+
import static org.hamcrest.Matchers.is;
10+
11+
import com.fasterxml.jackson.databind.JsonNode;
12+
import com.fasterxml.jackson.databind.ObjectMapper;
13+
import java.io.File;
14+
import java.io.IOException;
15+
import net.ladenthin.llama.LlamaModel;
16+
import net.ladenthin.llama.TestConstants;
17+
import net.ladenthin.llama.parameters.ModelParameters;
18+
import org.junit.jupiter.api.AfterAll;
19+
import org.junit.jupiter.api.Assumptions;
20+
import org.junit.jupiter.api.BeforeAll;
21+
import org.junit.jupiter.api.Test;
22+
23+
/**
24+
* End-to-end integration test for the {@code POST /v1/embeddings} route against a real model loaded in
25+
* embedding mode ({@code enableEmbedding()}), served over a real socket. Reuses the CI text model
26+
* (CodeLlama-7B, {@link TestConstants#MODEL_PATH}) — the same model {@code LlamaEmbeddingsTest} drives in
27+
* embedding mode. Self-skips when the model file is absent (e.g. a local checkout without models), so it
28+
* never breaks a model-free run. Assertions are structural (valid OpenAI embeddings shape) rather than
29+
* value-specific. HTTP plumbing is inherited from {@link OpenAiServerTestSupport}.
30+
*/
31+
public class OpenAiServerEmbeddingsIntegrationTest extends OpenAiServerTestSupport {
32+
33+
private static final ObjectMapper MAPPER = new ObjectMapper();
34+
private static final String MODEL_ID = "embed-local";
35+
36+
private static LlamaModel model;
37+
private static OpenAiCompatServer server;
38+
private static int port;
39+
40+
@BeforeAll
41+
public static void setup() throws IOException {
42+
Assumptions.assumeTrue(
43+
new File(TestConstants.MODEL_PATH).exists(),
44+
"Text model (CodeLlama-7B) not found, skipping embeddings server integration test");
45+
int gpuLayers = Integer.getInteger(TestConstants.PROP_TEST_NGL, TestConstants.DEFAULT_TEST_NGL);
46+
model = new LlamaModel(new ModelParameters()
47+
.setModel(TestConstants.MODEL_PATH)
48+
.setCtxSize(512)
49+
.setGpuLayers(gpuLayers)
50+
.enableEmbedding());
51+
server = new OpenAiCompatServer(
52+
model,
53+
OpenAiServerConfig.builder().port(0).modelId(MODEL_ID).build())
54+
.start();
55+
port = server.getPort();
56+
}
57+
58+
@AfterAll
59+
public static void tearDown() {
60+
if (server != null) {
61+
server.close();
62+
}
63+
if (model != null) {
64+
model.close();
65+
}
66+
}
67+
68+
@Test
69+
public void embeddingsReturnsAVector() throws IOException {
70+
String body = "{\"model\":\"" + MODEL_ID + "\",\"input\":\"hello world\"}";
71+
Response response = post(port, "/v1/embeddings", body, "");
72+
assertThat(response.code, is(200));
73+
JsonNode json = MAPPER.readTree(response.body);
74+
assertThat(json.path("object").asText(), is("list"));
75+
JsonNode first = json.path("data").get(0);
76+
assertThat(first.path("object").asText(), is("embedding"));
77+
assertThat(first.path("embedding").isArray(), is(true));
78+
assertThat(first.path("embedding").size(), greaterThan(0));
79+
}
80+
81+
@Test
82+
public void embeddingsReachableWithoutV1Prefix() throws IOException {
83+
String body = "{\"model\":\"" + MODEL_ID + "\",\"input\":\"alias check\"}";
84+
Response response = post(port, "/embeddings", body, "");
85+
assertThat(response.code, is(200));
86+
assertThat(
87+
MAPPER.readTree(response.body)
88+
.path("data")
89+
.get(0)
90+
.path("embedding")
91+
.isArray(),
92+
is(true));
93+
}
94+
}
Lines changed: 89 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,89 @@
1+
// SPDX-FileCopyrightText: 2026 Bernard Ladenthin <bernard.ladenthin@gmail.com>
2+
//
3+
// SPDX-License-Identifier: MIT
4+
5+
package net.ladenthin.llama.server;
6+
7+
import static org.hamcrest.MatcherAssert.assertThat;
8+
import static org.hamcrest.Matchers.greaterThan;
9+
import static org.hamcrest.Matchers.is;
10+
import static org.hamcrest.Matchers.lessThanOrEqualTo;
11+
12+
import com.fasterxml.jackson.databind.JsonNode;
13+
import com.fasterxml.jackson.databind.ObjectMapper;
14+
import java.io.File;
15+
import java.io.IOException;
16+
import net.ladenthin.llama.LlamaModel;
17+
import net.ladenthin.llama.TestConstants;
18+
import net.ladenthin.llama.parameters.ModelParameters;
19+
import org.junit.jupiter.api.AfterAll;
20+
import org.junit.jupiter.api.Assumptions;
21+
import org.junit.jupiter.api.BeforeAll;
22+
import org.junit.jupiter.api.Test;
23+
24+
/**
25+
* End-to-end integration test for the {@code POST /v1/rerank} route against a real model loaded in
26+
* reranking mode ({@code enableReranking()}), served over a real socket. Reuses the CI reranking model
27+
* (jina-reranker, {@link TestConstants#RERANKING_MODEL_PATH}). Self-skips when the model file is absent.
28+
* Assertions are structural (sorted {@code results}/{@code data} of {@code index}+{@code relevance_score})
29+
* and check the {@code top_n} cap; exact scores are model-dependent. HTTP plumbing is inherited from
30+
* {@link OpenAiServerTestSupport}.
31+
*/
32+
public class OpenAiServerRerankIntegrationTest extends OpenAiServerTestSupport {
33+
34+
private static final ObjectMapper MAPPER = new ObjectMapper();
35+
private static final String MODEL_ID = "rerank-local";
36+
37+
private static LlamaModel model;
38+
private static OpenAiCompatServer server;
39+
private static int port;
40+
41+
@BeforeAll
42+
public static void setup() throws IOException {
43+
Assumptions.assumeTrue(
44+
new File(TestConstants.RERANKING_MODEL_PATH).exists(),
45+
"Reranking model (jina-reranker) not found, skipping rerank server integration test");
46+
int gpuLayers = Integer.getInteger(TestConstants.PROP_TEST_NGL, TestConstants.DEFAULT_TEST_NGL);
47+
model = new LlamaModel(new ModelParameters()
48+
.setModel(TestConstants.RERANKING_MODEL_PATH)
49+
.setCtxSize(512)
50+
.setGpuLayers(gpuLayers)
51+
.enableReranking()
52+
.skipWarmup());
53+
server = new OpenAiCompatServer(
54+
model,
55+
OpenAiServerConfig.builder().port(0).modelId(MODEL_ID).build())
56+
.start();
57+
port = server.getPort();
58+
}
59+
60+
@AfterAll
61+
public static void tearDown() {
62+
if (server != null) {
63+
server.close();
64+
}
65+
if (model != null) {
66+
model.close();
67+
}
68+
}
69+
70+
@Test
71+
public void rerankReturnsScoredResultsCappedByTopN() throws IOException {
72+
String body = "{\"model\":\"" + MODEL_ID + "\",\"query\":\"Machine learning is\","
73+
+ "\"documents\":[\"A machine applies forces to perform an action.\","
74+
+ "\"Machine learning is a field of artificial intelligence.\","
75+
+ "\"Paris is the capital of France.\"],\"top_n\":2}";
76+
Response response = post(port, "/v1/rerank", body, "");
77+
assertThat(response.code, is(200));
78+
JsonNode json = MAPPER.readTree(response.body);
79+
assertThat(json.path("object").asText(), is("list"));
80+
JsonNode results = json.path("results");
81+
assertThat(results.isArray(), is(true));
82+
assertThat(results.size(), greaterThan(0));
83+
assertThat(results.size(), lessThanOrEqualTo(2)); // top_n cap
84+
assertThat(results.get(0).path("index").isInt(), is(true));
85+
assertThat(results.get(0).path("relevance_score").isNumber(), is(true));
86+
// `data` is an alias of `results` for Continue (#6478).
87+
assertThat(json.path("data").size(), is(results.size()));
88+
}
89+
}

0 commit comments

Comments
 (0)