Add per-run timing line on net.ladenthin.llama.timings SLF4J logger

claude · claude · commit 3248c1c8af83 · 2026-06-04T22:15:16.000Z
Emits a single info-level summary line at the end of every non-streaming
generation (complete / chat), mirroring what the llama.cpp CLI prints:

  prompt: 12 tok in 84.3 ms (142.4 tok/s) | gen: 256 tok in 5031.7 ms (50.9 tok/s) | cache: 0

Speculative-decoding runs append:
  | draft: 50 (35 accepted)
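
For illustration only, here is a minimal self-contained sketch of how a line of this shape can be assembled. The class name TimingLineSketch and its flat parameter list are hypothetical stand-ins for the real Timings getters (their exact numeric types are an assumption here); only the segment layout and the one-decimal Locale.ROOT formatting are taken from this commit.

```java
import java.util.Locale;

// Hypothetical stand-in for TimingsLogger#format; takes plain values instead of a Timings object.
final class TimingLineSketch {

    /** One decimal place, locale-independent so the decimal separator is always '.'. */
    static String oneDecimal(double value) {
        return String.format(Locale.ROOT, "%.1f", value);
    }

    static String format(long cacheN,
                         long promptN, double promptMs, double promptPerSec,
                         long predictedN, double predictedMs, double predictedPerSec,
                         long draftN, long draftNAccepted) {
        StringBuilder sb = new StringBuilder()
                .append("prompt: ").append(promptN).append(" tok in ")
                .append(oneDecimal(promptMs)).append(" ms (")
                .append(oneDecimal(promptPerSec)).append(" tok/s)")
                .append(" | gen: ").append(predictedN).append(" tok in ")
                .append(oneDecimal(predictedMs)).append(" ms (")
                .append(oneDecimal(predictedPerSec)).append(" tok/s)")
                .append(" | cache: ").append(cacheN);
        // The draft segment is appended only for speculative-decoding runs.
        if (draftN > 0) {
            sb.append(" | draft: ").append(draftN)
                    .append(" (").append(draftNAccepted).append(" accepted)");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Standard run, then a speculative-decoding run.
        System.out.println(format(0, 12, 84.3, 142.4, 256, 5031.7, 50.9, 0, 0));
        System.out.println(format(0, 4, 10.0, 400.0, 100, 1000.0, 100.0, 50, 35));
    }
}
```

Pinning Locale.ROOT matters because a default locale such as Locale.GERMANY would render 84.3 as "84,3", which would break any byte-exact comparison of the line.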

Implementation:
- New TimingsLogger utility class with two public methods:
    format(Timings) -> single-line String (exposed so CLI sinks can reuse)
    log(Timings)    -> emits format(...) at INFO on
                       net.ladenthin.llama.timings (dedicated logger so
                       users can suppress it via logback without touching
                       the rest of net.ladenthin.llama).
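- log() is a no-op for null and for Timings whose promptN and predictedN
  are both zero (the all-zero fallback seen on parse failure / early
  cancellation), so non-event paths add no noise. It also skips
  formatting when INFO is disabled on the logger.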
- Wired into both result parsers right after the Timings instance is
  built:
    json/CompletionResponseParser#parseCompletionResult
    json/ChatResponseParser#parseResponse
- Tests: 7 unit tests in TimingsLoggerTest cover the byte-exact format of
  the standard case, draft-segment presence/absence, cache-hit
  rendering, delivery through the dedicated SLF4J logger, the all-zero
  no-op, and the null no-op. They use LogCaptor (the same harness
  LoggingSmokeTest uses for OSInfo).
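
The guard and the dedicated-logger idea can be sketched without the project on the classpath. This sketch deliberately swaps SLF4J/logback for java.util.logging so that it runs on a bare JDK; the class DedicatedLoggerSketch, the Counts holder, and shouldLog are hypothetical names, not part of this commit. It only illustrates that a child logger can be switched off on its own while its parent stays at INFO.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical illustration; uses java.util.logging instead of the SLF4J logger from the commit.
final class DedicatedLoggerSketch {

    // Strong references: java.util.logging may garbage-collect loggers that are only looked up by name.
    static final Logger PARENT = Logger.getLogger("net.ladenthin.llama");
    static final Logger TIMINGS = Logger.getLogger("net.ladenthin.llama.timings");

    /** Hypothetical stand-in for the two Timings counters the guard inspects. */
    static final class Counts {
        final long promptN;
        final long predictedN;

        Counts(long promptN, long predictedN) {
            this.promptN = promptN;
            this.predictedN = predictedN;
        }
    }

    /** Same order of checks as described above: null, then all-zero, then logger level. */
    static boolean shouldLog(Counts c) {
        if (c == null) {
            return false;
        }
        if (c.promptN == 0 && c.predictedN == 0) {
            return false;
        }
        return TIMINGS.isLoggable(Level.INFO);
    }

    public static void main(String[] args) {
        PARENT.setLevel(Level.INFO);   // child inherits INFO while its own level is unset
        System.out.println(shouldLog(new Counts(12, 256)));
        System.out.println(shouldLog(null));
        System.out.println(shouldLog(new Counts(0, 0)));

        TIMINGS.setLevel(Level.OFF);   // suppress only the timing logger
        System.out.println(shouldLog(new Counts(12, 256)));
        System.out.println(PARENT.isLoggable(Level.INFO));
    }
}
```

With logback, the equivalent suppression is the one-line logger element shown in the TimingsLogger class Javadoc.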

Streaming generation (LlamaIterable / LlamaIterator) is not yet hooked.
The streaming iterator does not surface a clean "I am done" callback
visible from the public API today; threading that through is a separate
follow-up. Non-streaming covers the most common code path and gives
users an immediate signal.

The remaining first-batch items from the feature-investigation backlog
in CLAUDE.md are now the UTF-8 boundary-safe streaming decoder and a
jbang single-file example.

Tests run (per the test-execution policy — no full surefire):
- mvn compile / mvn test-compile: clean.
- mvn test -Dtest='TimingsLoggerTest': 7/7 pass.
- mvn test -Dtest='CompletionResponseParserTest,ChatResponseParserTest,
  LoggingSmokeTest,LlamaArchitectureTest': 58/58 pass (covers both
  parser wire-in sites plus the architecture invariants).
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -694,7 +694,7 @@ See [`../workspace/policies/jqwik-prompt-injection.md`](../workspace/policies/jq
 
 - ~~**Abstract the Java and test writing guidelines to a workspace-level shared layer.**~~ **DONE.** This repo is Java 8; follow the workspace version chain at [`../workspace/guides/src/CODE_WRITING_GUIDE-8.md`](../workspace/guides/src/CODE_WRITING_GUIDE-8.md) and [`../workspace/guides/test/TEST_WRITING_GUIDE-8.md`](../workspace/guides/test/TEST_WRITING_GUIDE-8.md). Canonical TDD skill at [`../workspace/.claude/skills/java-tdd-guide/SKILL.md`](../workspace/.claude/skills/java-tdd-guide/SKILL.md). This repo has no project-specific writing-guide supplements.
 
-- **Feature backlog from similar projects.** See [`docs/feature-investigation-similar-projects.md`](docs/feature-investigation-similar-projects.md) for the consolidated investigation across the 5 pure-Java sibling runtimes ([llama3.java](https://github.com/mukel/llama3.java), [gemma4.java](https://github.com/mukel/gemma4.java), [gptoss.java](https://github.com/mukel/gptoss.java), [qwen35.java](https://github.com/mukel/qwen35.java), [nemotron3.java](https://github.com/mukel/nemotron3.java)) plus the dormant alternative JNI binding [llamacpp4j](https://github.com/sebicom/llamacpp4j). The doc captures 18 candidate items grouped into cross-cutting themes (UTF-8 streaming boundary safety, thinking-channel router, operator timing line, jbang single-file example, README system-properties table, etc.) and per-repo unique findings (Harmony channel decoder, Qwen empty-`<think>` injection, llama_state_* save/load, llama_adapter_lora_* hot-apply, etc.), each with effort sizing (XS / S / M / L) and a prioritised backlog. **Recommended first batch** (items 1, 3, 4, 5): UTF-8 boundary-safe streaming decoder + per-run timing line + one jbang-runnable example + a README system-properties table; ~1-2 days total, no JNI changes.
+- **Feature backlog from similar projects.** See [`docs/feature-investigation-similar-projects.md`](docs/feature-investigation-similar-projects.md) for the consolidated investigation across the 5 pure-Java sibling runtimes ([llama3.java](https://github.com/mukel/llama3.java), [gemma4.java](https://github.com/mukel/gemma4.java), [gptoss.java](https://github.com/mukel/gptoss.java), [qwen35.java](https://github.com/mukel/qwen35.java), [nemotron3.java](https://github.com/mukel/nemotron3.java)) plus the dormant alternative JNI binding [llamacpp4j](https://github.com/sebicom/llamacpp4j). The doc captures 18 candidate items grouped into cross-cutting themes (UTF-8 streaming boundary safety, thinking-channel router, operator timing line, jbang single-file example, README system-properties table, etc.) and per-repo unique findings (Harmony channel decoder, Qwen empty-`<think>` injection, llama_state_* save/load, llama_adapter_lora_* hot-apply, etc.), each with effort sizing (XS / S / M / L) and a prioritised backlog. **Recommended first batch** (items 1, 3, 4, 5): UTF-8 boundary-safe streaming decoder + ~~per-run timing line~~ + one jbang-runnable example + ~~a README system-properties table~~; ~1-2 days total, no JNI changes. **DONE so far:** README system-properties table (`e36f631`, with two cleanups in `3ae6c81` + `28dc9e6`); per-run timing line (`TimingsLogger` class + wire-in to `CompletionResponseParser` and `ChatResponseParser`; format mirrors what `llama.cpp` CLI prints — `prompt: N tok in X ms (Y tok/s) | gen: … | cache: N | draft: …`; dedicated SLF4J logger `net.ladenthin.llama.timings` so users can suppress it independently; 7 unit tests pin format + pipeline behaviour). **Remaining first-batch items:** UTF-8 boundary-safe streaming decoder + jbang example.
 
 - **Evaluate GraalVM Native Image as an alternative distribution target.** Reference: [GraalVM Native Image](https://www.graalvm.org/latest/reference-manual/native-image/). The pure-Java sibling projects in the README's "Similar Projects" list (mukel's `llama3.java` / `gemma4.java` / `gptoss.java` / `qwen35.java` / `nemotron3.java`) demonstrate that single-jar, no-JNI Java inference is viable for individual model architectures. Native Image opens an orthogonal direction for THIS project: AOT-compile the Java layer + JNI bridge to a self-contained binary that bundles the libjllama.so (or per-OS equivalent) and starts in milliseconds without a JVM, which would make jllama usable in CLI tools, serverless functions, and short-lived processes where JVM startup is the dominant cost.
 
diff --git a/src/main/java/net/ladenthin/llama/TimingsLogger.java b/src/main/java/net/ladenthin/llama/TimingsLogger.java
@@ -0,0 +1,96 @@
+// SPDX-FileCopyrightText: 2026 Bernard Ladenthin <bernard.ladenthin@gmail.com>
+//
+// SPDX-License-Identifier: MIT
+package net.ladenthin.llama;
+
+import java.util.Locale;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Emits a single-line per-run timing summary to the SLF4J logger
+ * {@value #LOGGER_NAME}, mirroring what the {@code llama.cpp} command-line tool
+ * prints at the end of a generation.
+ *
+ * <p>Format:</p>
+ * <pre>
+ * prompt: 12 tok in 84.3 ms (142.4 tok/s) | gen: 256 tok in 5031.7 ms (50.9 tok/s) | cache: 0
+ * </pre>
+ *
+ * <p>Speculative-decoding runs append a {@code | draft: N (M accepted)} segment.
+ * Empty {@link Timings} (both {@code promptN} and {@code predictedN} zero) are
+ * skipped &mdash; logging the all-zero fallback on a parse failure or on early
+ * cancellation is pure noise.</p>
+ *
+ * <p>The dedicated logger name lets users suppress just this per-run line in
+ * logback without touching the rest of the {@code net.ladenthin.llama} logging
+ * tree, e.g.:</p>
+ * <pre>
+ * &lt;logger name=&quot;net.ladenthin.llama.timings&quot; level=&quot;OFF&quot;/&gt;
+ * </pre>
+ */
+public final class TimingsLogger {
+
+    /** Dedicated SLF4J logger name for the per-run timing line. */
+    public static final String LOGGER_NAME = "net.ladenthin.llama.timings";
+
+    private static final Logger LOGGER = LoggerFactory.getLogger(LOGGER_NAME);
+
+    private TimingsLogger() {
+        // utility class; not instantiable.
+    }
+
+    /**
+     * Formats a single-line timing summary suitable for the {@value #LOGGER_NAME}
+     * SLF4J logger. Exposed for callers that want to emit the same line through
+     * a different sink (e.g. {@code System.err} in a CLI tool).
+     *
+     * @param t the timings to format; must not be {@code null}
+     * @return a single-line summary (no trailing newline)
+     */
+    public static String format(Timings t) {
+        StringBuilder sb = new StringBuilder()
+                .append("prompt: ").append(t.getPromptN()).append(" tok in ")
+                .append(formatMs(t.getPromptMs())).append(" ms (")
+                .append(formatRate(t.getPromptPerSecond())).append(" tok/s)")
+                .append(" | gen: ").append(t.getPredictedN()).append(" tok in ")
+                .append(formatMs(t.getPredictedMs())).append(" ms (")
+                .append(formatRate(t.getPredictedPerSecond())).append(" tok/s)")
+                .append(" | cache: ").append(t.getCacheN());
+        if (t.getDraftN() > 0) {
+            sb.append(" | draft: ").append(t.getDraftN())
+                    .append(" (").append(t.getDraftNAccepted()).append(" accepted)");
+        }
+        return sb.toString();
+    }
+
+    /**
+     * Logs the per-run timing summary at {@code INFO} level on the dedicated
+     * {@value #LOGGER_NAME} logger.
+     *
+     * <p>No-op when the timings carry no useful data (both prompt and predicted
+     * token counts are zero &mdash; typically a parse failure or an early
+     * cancellation) or when {@code INFO} is disabled for the logger.</p>
+     *
+     * @param t the timings to log; may be {@code null} (no-op)
+     */
+    public static void log(Timings t) {
+        if (t == null) {
+            return;
+        }
+        if (t.getPromptN() == 0 && t.getPredictedN() == 0) {
+            return;
+        }
+        if (LOGGER.isInfoEnabled()) {
+            LOGGER.info(format(t));
+        }
+    }
+
+    private static String formatMs(double ms) {
+        return String.format(Locale.ROOT, "%.1f", ms);
+    }
+
+    private static String formatRate(double rate) {
+        return String.format(Locale.ROOT, "%.1f", rate);
+    }
+}
diff --git a/src/main/java/net/ladenthin/llama/json/ChatResponseParser.java b/src/main/java/net/ladenthin/llama/json/ChatResponseParser.java
@@ -15,6 +15,7 @@
 import net.ladenthin.llama.ChatMessage;
 import net.ladenthin.llama.ChatResponse;
 import net.ladenthin.llama.Timings;
+import net.ladenthin.llama.TimingsLogger;
 import net.ladenthin.llama.ToolCall;
 import net.ladenthin.llama.Usage;
 
@@ -154,6 +155,7 @@ public ChatResponse parseResponse(String json) {
                     node.path("usage").path("prompt_tokens").asLong(0L),
                     node.path("usage").path("completion_tokens").asLong(0L));
             Timings timings = Timings.fromJson(node.path("timings"));
+            TimingsLogger.log(timings);
             return new ChatResponse(id, choices, usage, timings, json);
         } catch (IOException e) {
             return new ChatResponse(
diff --git a/src/main/java/net/ladenthin/llama/json/CompletionResponseParser.java b/src/main/java/net/ladenthin/llama/json/CompletionResponseParser.java
@@ -18,6 +18,7 @@
 import net.ladenthin.llama.LlamaOutput;
 import net.ladenthin.llama.StopReason;
 import net.ladenthin.llama.Timings;
+import net.ladenthin.llama.TimingsLogger;
 import net.ladenthin.llama.TokenLogprob;
 import net.ladenthin.llama.Usage;
 
@@ -191,6 +192,7 @@ public CompletionResult parseCompletionResult(String json) {
                     node.path("tokens_evaluated").asLong(0L),
                     node.path("tokens_predicted").asLong(0L));
             Timings timings = Timings.fromJson(node.path("timings"));
+            TimingsLogger.log(timings);
             List<TokenLogprob> logprobs = parseLogprobs(node);
             StopReason stopReason =
                     StopReason.fromStopType(node.path("stop_type").asText(""));
diff --git a/src/test/java/net/ladenthin/llama/TimingsLoggerTest.java b/src/test/java/net/ladenthin/llama/TimingsLoggerTest.java
@@ -0,0 +1,111 @@
+// SPDX-FileCopyrightText: 2026 Bernard Ladenthin <bernard.ladenthin@gmail.com>
+//
+// SPDX-License-Identifier: MIT
+
+package net.ladenthin.llama;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+import nl.altindag.log.LogCaptor;
+import org.junit.jupiter.api.Test;
+
+@ClaudeGenerated(
+        purpose = "Pin the per-run timing-line format (TimingsLogger#format) byte-for-byte "
+                + "and verify the SLF4J pipeline on the dedicated 'net.ladenthin.llama.timings' "
+                + "logger so a future format regression or accidental log-suppression is caught "
+                + "at test time.")
+public class TimingsLoggerTest {
+
+    /** Format check on a typical generation (no speculative decoding). */
+    @Test
+    public void format_standardGeneration_singleLineWithAllSegments() {
+        Timings t = new Timings(
+                /*cacheN*/        0,
+                /*promptN*/      12,
+                /*promptMs*/    84.3,
+                /*promptPerSec*/142.4,
+                /*predictedN*/  256,
+                /*predictedMs*/5031.7,
+                /*predictedPerSec*/50.9,
+                /*draftN*/        0,
+                /*draftNAccepted*/0);
+
+        String line = TimingsLogger.format(t);
+
+        assertEquals(
+                "prompt: 12 tok in 84.3 ms (142.4 tok/s)"
+                        + " | gen: 256 tok in 5031.7 ms (50.9 tok/s)"
+                        + " | cache: 0",
+                line);
+    }
+
+    /** Speculative-decoding runs append a {@code | draft: N (M accepted)} segment. */
+    @Test
+    public void format_speculativeDecoding_includesDraftSegment() {
+        Timings t = new Timings(0, 4, 10.0, 400.0, 100, 1000.0, 100.0, 50, 35);
+
+        String line = TimingsLogger.format(t);
+
+        assertTrue(line.contains(" | draft: 50 (35 accepted)"), line);
+    }
+
+    /** Non-speculative runs do NOT append the draft segment. */
+    @Test
+    public void format_nonSpeculativeRun_omitsDraftSegment() {
+        Timings t = new Timings(0, 4, 10.0, 400.0, 100, 1000.0, 100.0, 0, 0);
+
+        String line = TimingsLogger.format(t);
+
+        assertFalse(line.contains("draft"), line);
+    }
+
+    /** Cache-hit count is rendered as-is so users can spot prompt-prefix reuse. */
+    @Test
+    public void format_cacheHits_renderedExactly() {
+        Timings t = new Timings(64, 12, 84.3, 142.4, 256, 5031.7, 50.9, 0, 0);
+
+        String line = TimingsLogger.format(t);
+
+        assertTrue(line.contains(" | cache: 64"), line);
+    }
+
+    /**
+     * Pipeline check: emit through the dedicated SLF4J logger and assert
+     * LogCaptor sees the formatted line at INFO level.
+     */
+    @Test
+    public void log_pipelineDelivery_emitsFormattedLineAtInfo() {
+        Timings t = new Timings(0, 12, 84.3, 142.4, 256, 5031.7, 50.9, 0, 0);
+
+        try (LogCaptor captor = LogCaptor.forName(TimingsLogger.LOGGER_NAME)) {
+            TimingsLogger.log(t);
+
+            assertEquals(1, captor.getInfoLogs().size());
+            assertEquals(TimingsLogger.format(t), captor.getInfoLogs().get(0));
+        }
+    }
+
+    /** Empty timings (all-zero, typically a parse failure) are not logged. */
+    @Test
+    public void log_allZeroTimings_skipsEmptyLine() {
+        Timings allZero = Timings.fromJson(null);
+
+        try (LogCaptor captor = LogCaptor.forName(TimingsLogger.LOGGER_NAME)) {
+            TimingsLogger.log(allZero);
+
+            assertTrue(captor.getInfoLogs().isEmpty(), "expected no log lines for all-zero timings");
+        }
+    }
+
+    /** Null is treated as a no-op so callers don't need to null-check. */
+    @Test
+    public void log_nullTimings_isNoOp() {
+        try (LogCaptor captor = LogCaptor.forName(TimingsLogger.LOGGER_NAME)) {
+            TimingsLogger.log(null);
+
+            assertTrue(captor.getInfoLogs().isEmpty(), "expected no log lines when input is null");
+        }
+    }
+}