Skip to content

Commit 3248c1c

Browse files
committed
Add per-run timing line on net.ladenthin.llama.timings SLF4J logger
Emits a single info-level summary line at the end of every non-streaming generation (complete / chat), mirroring what the llama.cpp CLI prints: prompt: 12 tok in 84.3 ms (142.4 tok/s) | gen: 256 tok in 5031.7 ms (50.9 tok/s) | cache: 0 Speculative-decoding runs append: | draft: 50 (35 accepted) Implementation: - New TimingsLogger utility class with two public methods: format(Timings) -> single-line String (exposed so CLI sinks can reuse) log(Timings) -> emits format(...) at INFO on net.ladenthin.llama.timings (dedicated logger so users can suppress it via logback without touching the rest of net.ladenthin.llama). - log() is a no-op for null and for all-zero Timings (typical on parse failure / early cancellation). No noise from non-event paths. - Wired into both result parsers right after the Timings instance is built: json/CompletionResponseParser#parseCompletionResult json/ChatResponseParser#parseResponse - Tests: 7 unit tests in TimingsLoggerTest pin the format byte-exact for the standard case, draft segment presence/absence, cache-hit rendering, dedicated-logger SLF4J pipeline delivery, all-zero no-op, and null no-op. Uses LogCaptor (the same harness LoggingSmokeTest uses for OSInfo). Streaming generation (LlamaIterable / LlamaIterator) is not yet hooked. The streaming iterator does not surface a clean "I am done" callback visible from the public API today; threading that through is a separate follow-up. Non-streaming covers the most common code path and gives users an immediate signal. The remaining first-batch items from the feature-investigation backlog in CLAUDE.md are now the UTF-8 boundary-safe streaming decoder and a jbang single-file example. Tests run (per the test-execution policy — no full surefire): - mvn compile / mvn test-compile: clean. - mvn test -Dtest='TimingsLoggerTest': 7/7 pass. - mvn test -Dtest='CompletionResponseParserTest,ChatResponseParserTest, LoggingSmokeTest,LlamaArchitectureTest': 58/58 pass (covers both parser wire-in sites plus the architecture invariants).
1 parent 28dc9e6 commit 3248c1c

5 files changed

Lines changed: 212 additions & 1 deletion

File tree

CLAUDE.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -694,7 +694,7 @@ See [`../workspace/policies/jqwik-prompt-injection.md`](../workspace/policies/jq
694694
695695
- ~~**Abstract the Java and test writing guidelines to a workspace-level shared layer.**~~ **DONE.** This repo is Java 8; follow the workspace version chain at [`../workspace/guides/src/CODE_WRITING_GUIDE-8.md`](../workspace/guides/src/CODE_WRITING_GUIDE-8.md) and [`../workspace/guides/test/TEST_WRITING_GUIDE-8.md`](../workspace/guides/test/TEST_WRITING_GUIDE-8.md). Canonical TDD skill at [`../workspace/.claude/skills/java-tdd-guide/SKILL.md`](../workspace/.claude/skills/java-tdd-guide/SKILL.md). This repo has no project-specific writing-guide supplements.
696696
697-
- **Feature backlog from similar projects.** See [`docs/feature-investigation-similar-projects.md`](docs/feature-investigation-similar-projects.md) for the consolidated investigation across the 5 pure-Java sibling runtimes ([llama3.java](https://github.com/mukel/llama3.java), [gemma4.java](https://github.com/mukel/gemma4.java), [gptoss.java](https://github.com/mukel/gptoss.java), [qwen35.java](https://github.com/mukel/qwen35.java), [nemotron3.java](https://github.com/mukel/nemotron3.java)) plus the dormant alternative JNI binding [llamacpp4j](https://github.com/sebicom/llamacpp4j). The doc captures 18 candidate items grouped into cross-cutting themes (UTF-8 streaming boundary safety, thinking-channel router, operator timing line, jbang single-file example, README system-properties table, etc.) and per-repo unique findings (Harmony channel decoder, Qwen empty-`<think>` injection, llama_state_* save/load, llama_adapter_lora_* hot-apply, etc.), each with effort sizing (XS / S / M / L) and a prioritised backlog. **Recommended first batch** (items 1, 3, 4, 5): UTF-8 boundary-safe streaming decoder + per-run timing line + one jbang-runnable example + a README system-properties table; ~1-2 days total, no JNI changes.
697+
- **Feature backlog from similar projects.** See [`docs/feature-investigation-similar-projects.md`](docs/feature-investigation-similar-projects.md) for the consolidated investigation across the 5 pure-Java sibling runtimes ([llama3.java](https://github.com/mukel/llama3.java), [gemma4.java](https://github.com/mukel/gemma4.java), [gptoss.java](https://github.com/mukel/gptoss.java), [qwen35.java](https://github.com/mukel/qwen35.java), [nemotron3.java](https://github.com/mukel/nemotron3.java)) plus the dormant alternative JNI binding [llamacpp4j](https://github.com/sebicom/llamacpp4j). The doc captures 18 candidate items grouped into cross-cutting themes (UTF-8 streaming boundary safety, thinking-channel router, operator timing line, jbang single-file example, README system-properties table, etc.) and per-repo unique findings (Harmony channel decoder, Qwen empty-`<think>` injection, llama_state_* save/load, llama_adapter_lora_* hot-apply, etc.), each with effort sizing (XS / S / M / L) and a prioritised backlog. **Recommended first batch** (items 1, 3, 4, 5): UTF-8 boundary-safe streaming decoder + ~~per-run timing line~~ + one jbang-runnable example + ~~a README system-properties table~~; ~1-2 days total, no JNI changes. **DONE so far:** README system-properties table (`e36f631`, with two cleanups in `3ae6c81` + `28dc9e6`); per-run timing line (`TimingsLogger` class + wire-in to `CompletionResponseParser` and `ChatResponseParser`; format mirrors what `llama.cpp` CLI prints — `prompt: N tok in X ms (Y tok/s) | gen: … | cache: N | draft: …`; dedicated SLF4J logger `net.ladenthin.llama.timings` so users can suppress it independently; 7 unit tests pin format + pipeline behaviour). **Remaining first-batch items:** UTF-8 boundary-safe streaming decoder + jbang example.
698698
699699
- **Evaluate GraalVM Native Image as an alternative distribution target.** Reference: [GraalVM Native Image](https://www.graalvm.org/latest/reference-manual/native-image/). The pure-Java sibling projects in the README's "Similar Projects" list (mukel's `llama3.java` / `gemma4.java` / `gptoss.java` / `qwen35.java` / `nemotron3.java`) demonstrate that single-jar, no-JNI Java inference is viable for individual model architectures. Native Image opens an orthogonal direction for THIS project: AOT-compile the Java layer + JNI bridge to a self-contained binary that bundles the libjllama.so (or per-OS equivalent) and starts in milliseconds without a JVM, which would make jllama usable in CLI tools, serverless functions, and short-lived processes where JVM startup is the dominant cost.
700700
Lines changed: 96 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,96 @@
1+
// SPDX-FileCopyrightText: 2026 Bernard Ladenthin <bernard.ladenthin@gmail.com>
2+
//
3+
// SPDX-License-Identifier: MIT
4+
package net.ladenthin.llama;
5+
6+
import java.util.Locale;
7+
import org.slf4j.Logger;
8+
import org.slf4j.LoggerFactory;
9+
10+
/**
11+
* Emits a single-line per-run timing summary to the SLF4J logger
12+
* {@value #LOGGER_NAME}, mirroring what the {@code llama.cpp} command-line tool
13+
* prints at the end of a generation.
14+
*
15+
* <p>Format:</p>
16+
* <pre>
17+
* prompt: 12 tok in 84.3 ms (142.4 tok/s) | gen: 256 tok in 5031.7 ms (50.9 tok/s) | cache: 0
18+
* </pre>
19+
*
20+
* <p>Speculative-decoding runs append a {@code | draft: N (M accepted)} segment.
21+
* Empty {@link Timings} (both {@code promptN} and {@code predictedN} zero) are
22+
* skipped &mdash; logging the all-zero fallback on a parse failure or on early
23+
* cancellation is pure noise.</p>
24+
*
25+
* <p>The dedicated logger name lets users suppress just this per-run line in
26+
* logback without touching the rest of the {@code net.ladenthin.llama} logging
27+
* tree, e.g.:</p>
28+
* <pre>
29+
* &lt;logger name=&quot;net.ladenthin.llama.timings&quot; level=&quot;OFF&quot;/&gt;
30+
* </pre>
31+
*/
32+
public final class TimingsLogger {
33+
34+
/** Dedicated SLF4J logger name for the per-run timing line. */
35+
public static final String LOGGER_NAME = "net.ladenthin.llama.timings";
36+
37+
private static final Logger LOGGER = LoggerFactory.getLogger(LOGGER_NAME);
38+
39+
private TimingsLogger() {
40+
// utility class; not instantiable.
41+
}
42+
43+
/**
44+
* Formats a single-line timing summary suitable for the {@value #LOGGER_NAME}
45+
* SLF4J logger. Exposed for callers that want to emit the same line through
46+
* a different sink (e.g. {@code System.err} in a CLI tool).
47+
*
48+
* @param t the timings to format
49+
* @return a single-line summary (no trailing newline)
50+
*/
51+
public static String format(Timings t) {
52+
StringBuilder sb = new StringBuilder()
53+
.append("prompt: ").append(t.getPromptN()).append(" tok in ")
54+
.append(formatMs(t.getPromptMs())).append(" ms (")
55+
.append(formatRate(t.getPromptPerSecond())).append(" tok/s)")
56+
.append(" | gen: ").append(t.getPredictedN()).append(" tok in ")
57+
.append(formatMs(t.getPredictedMs())).append(" ms (")
58+
.append(formatRate(t.getPredictedPerSecond())).append(" tok/s)")
59+
.append(" | cache: ").append(t.getCacheN());
60+
if (t.getDraftN() > 0) {
61+
sb.append(" | draft: ").append(t.getDraftN())
62+
.append(" (").append(t.getDraftNAccepted()).append(" accepted)");
63+
}
64+
return sb.toString();
65+
}
66+
67+
/**
68+
* Logs the per-run timing summary at {@code INFO} level on the dedicated
69+
* {@value #LOGGER_NAME} logger.
70+
*
71+
* <p>No-op when the timings carry no useful data (both prompt and predicted
72+
* token counts are zero &mdash; typically a parse failure or an early
73+
* cancellation) or when the logger is below {@code INFO}.</p>
74+
*
75+
* @param t the timings to log; may be {@code null} (no-op)
76+
*/
77+
public static void log(Timings t) {
78+
if (t == null) {
79+
return;
80+
}
81+
if (t.getPromptN() == 0 && t.getPredictedN() == 0) {
82+
return;
83+
}
84+
if (LOGGER.isInfoEnabled()) {
85+
LOGGER.info(format(t));
86+
}
87+
}
88+
89+
private static String formatMs(double ms) {
90+
return String.format(Locale.ROOT, "%.1f", ms);
91+
}
92+
93+
private static String formatRate(double rate) {
94+
return String.format(Locale.ROOT, "%.1f", rate);
95+
}
96+
}

src/main/java/net/ladenthin/llama/json/ChatResponseParser.java

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -15,6 +15,7 @@
1515
import net.ladenthin.llama.ChatMessage;
1616
import net.ladenthin.llama.ChatResponse;
1717
import net.ladenthin.llama.Timings;
18+
import net.ladenthin.llama.TimingsLogger;
1819
import net.ladenthin.llama.ToolCall;
1920
import net.ladenthin.llama.Usage;
2021

@@ -154,6 +155,7 @@ public ChatResponse parseResponse(String json) {
154155
node.path("usage").path("prompt_tokens").asLong(0L),
155156
node.path("usage").path("completion_tokens").asLong(0L));
156157
Timings timings = Timings.fromJson(node.path("timings"));
158+
TimingsLogger.log(timings);
157159
return new ChatResponse(id, choices, usage, timings, json);
158160
} catch (IOException e) {
159161
return new ChatResponse(

src/main/java/net/ladenthin/llama/json/CompletionResponseParser.java

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -18,6 +18,7 @@
1818
import net.ladenthin.llama.LlamaOutput;
1919
import net.ladenthin.llama.StopReason;
2020
import net.ladenthin.llama.Timings;
21+
import net.ladenthin.llama.TimingsLogger;
2122
import net.ladenthin.llama.TokenLogprob;
2223
import net.ladenthin.llama.Usage;
2324

@@ -191,6 +192,7 @@ public CompletionResult parseCompletionResult(String json) {
191192
node.path("tokens_evaluated").asLong(0L),
192193
node.path("tokens_predicted").asLong(0L));
193194
Timings timings = Timings.fromJson(node.path("timings"));
195+
TimingsLogger.log(timings);
194196
List<TokenLogprob> logprobs = parseLogprobs(node);
195197
StopReason stopReason =
196198
StopReason.fromStopType(node.path("stop_type").asText(""));
Lines changed: 111 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,111 @@
1+
// SPDX-FileCopyrightText: 2026 Bernard Ladenthin <bernard.ladenthin@gmail.com>
2+
//
3+
// SPDX-License-Identifier: MIT
4+
5+
package net.ladenthin.llama;
6+
7+
import static org.junit.jupiter.api.Assertions.assertEquals;
8+
import static org.junit.jupiter.api.Assertions.assertFalse;
9+
import static org.junit.jupiter.api.Assertions.assertTrue;
10+
11+
import nl.altindag.log.LogCaptor;
12+
import org.junit.jupiter.api.Test;
13+
14+
@ClaudeGenerated(
15+
purpose = "Pin the per-run timing-line format (TimingsLogger#format) byte-for-byte "
16+
+ "and verify the SLF4J pipeline on the dedicated 'net.ladenthin.llama.timings' "
17+
+ "logger so a future format regression or accidental log-suppression is caught "
18+
+ "at test time.")
19+
public class TimingsLoggerTest {
20+
21+
/** Format check on a typical generation (no speculative decoding). */
22+
@Test
23+
public void format_standardGeneration_singleLineWithAllSegments() {
24+
Timings t = new Timings(
25+
/*cacheN*/ 0,
26+
/*promptN*/ 12,
27+
/*promptMs*/ 84.3,
28+
/*promptPerSec*/142.4,
29+
/*predictedN*/ 256,
30+
/*predictedMs*/5031.7,
31+
/*predictedPerSec*/50.9,
32+
/*draftN*/ 0,
33+
/*draftNAccepted*/0);
34+
35+
String line = TimingsLogger.format(t);
36+
37+
assertEquals(
38+
"prompt: 12 tok in 84.3 ms (142.4 tok/s)"
39+
+ " | gen: 256 tok in 5031.7 ms (50.9 tok/s)"
40+
+ " | cache: 0",
41+
line);
42+
}
43+
44+
/** Speculative-decoding runs append a {@code | draft: N (M accepted)} segment. */
45+
@Test
46+
public void format_speculativeDecoding_includesDraftSegment() {
47+
Timings t = new Timings(0, 4, 10.0, 400.0, 100, 1000.0, 100.0, 50, 35);
48+
49+
String line = TimingsLogger.format(t);
50+
51+
assertTrue(line.contains(" | draft: 50 (35 accepted)"), line);
52+
}
53+
54+
/** Non-speculative runs do NOT append the draft segment. */
55+
@Test
56+
public void format_nonSpeculativeRun_omitsDraftSegment() {
57+
Timings t = new Timings(0, 4, 10.0, 400.0, 100, 1000.0, 100.0, 0, 0);
58+
59+
String line = TimingsLogger.format(t);
60+
61+
assertFalse(line.contains("draft"), line);
62+
}
63+
64+
/** Cache-hit count is rendered as-is so users can spot prompt-prefix reuse. */
65+
@Test
66+
public void format_cacheHits_renderedExactly() {
67+
Timings t = new Timings(64, 12, 84.3, 142.4, 256, 5031.7, 50.9, 0, 0);
68+
69+
String line = TimingsLogger.format(t);
70+
71+
assertTrue(line.contains(" | cache: 64"), line);
72+
}
73+
74+
/**
75+
* Pipeline check: emit through the dedicated SLF4J logger and assert
76+
* LogCaptor sees the formatted line at INFO level.
77+
*/
78+
@Test
79+
public void log_pipelineDelivery_emitsFormattedLineAtInfo() {
80+
Timings t = new Timings(0, 12, 84.3, 142.4, 256, 5031.7, 50.9, 0, 0);
81+
82+
try (LogCaptor captor = LogCaptor.forName(TimingsLogger.LOGGER_NAME)) {
83+
TimingsLogger.log(t);
84+
85+
assertEquals(1, captor.getInfoLogs().size());
86+
assertEquals(TimingsLogger.format(t), captor.getInfoLogs().get(0));
87+
}
88+
}
89+
90+
/** Empty timings (all-zero, typically a parse failure) are not logged. */
91+
@Test
92+
public void log_allZeroTimings_skipsEmptyLine() {
93+
Timings allZero = Timings.fromJson(null);
94+
95+
try (LogCaptor captor = LogCaptor.forName(TimingsLogger.LOGGER_NAME)) {
96+
TimingsLogger.log(allZero);
97+
98+
assertTrue(captor.getInfoLogs().isEmpty(), "expected no log lines for all-zero timings");
99+
}
100+
}
101+
102+
/** Null is treated as a no-op so callers don't need to null-check. */
103+
@Test
104+
public void log_nullTimings_isNoOp() {
105+
try (LogCaptor captor = LogCaptor.forName(TimingsLogger.LOGGER_NAME)) {
106+
TimingsLogger.log(null);
107+
108+
assertTrue(captor.getInfoLogs().isEmpty(), "expected no log lines when input is null");
109+
}
110+
}
111+
}

0 commit comments

Comments
 (0)