Merge pull request #23 from Deep-CodeAI/feat/963-max-tokens-budget

Skobeltsyn · web-flow · commit eb4446b19966 · 2026-05-03T22:53:02.000+03:00
feat(#963): maxTokens budget + TokenUsage on LlmResponse
diff --git a/README.md b/README.md
@@ -91,7 +91,7 @@ These APIs work in `main`, are unit-tested, and are exercised by integration tes
 - **Memory bank** — `memory(MemoryBank())` auto-injects `memory_read` / `memory_write` / `memory_search` tools. See [Agent Memory](#agent-memory).
 - **LLM skill routing** — manual `skillSelection { }` or LLM router with `skillSelectionConfidenceThreshold`; `SkillRoute(name, confidence, rationale)` is structured (#641). See [Skill Selection](#skill-selection).
 - **Tool error recovery** — per-tool `onError`, per-skill default, agent default; built-in `escalate` and `throwException` agents. See [Tool Error Recovery](#tool-error-recovery).
-- **Budget controls** — `budget { maxTurns; maxToolCalls; maxDuration; perToolTimeout }` (sacrificial-thread enforcement) (#637).
+- **Budget controls** — `budget { maxTurns; maxToolCalls; maxDuration; perToolTimeout; maxTokens }` (sacrificial-thread enforcement; token counts cumulative across turns when the provider reports usage) (#637, #963).
 - **MCP client** — `mcp { server() }` over HTTP / stdio / TCP; Bearer auth; namespaced tools (`server.tool`). See [MCP Integration](#mcp-integration).
 - **MCP server** — `McpServer.from(agent)` exposes an agent as an MCP-conformant server with explicit `tools/listChanged: false` capability (#619).
 - **`McpRunner` standalone** — picocli-style one-liner main for shipping agents as MCP services.
@@ -121,7 +121,7 @@ What the framework enforces today:
 | Repaired args | Re-validated through the typed schema before reaching the executor | #658 |
 | Tool output trust | Tool results wrapped in untrusted envelope so the model can't forge framework messages | #642 |
 | Provider errors | Surface as `LlmProviderException` — never confused with model output | #702 |
-| Budget caps | `maxTurns`, `maxToolCalls`, `maxDuration`, `perToolTimeout` (sacrificial-thread enforced) | #637 |
+| Budget caps | `maxTurns`, `maxToolCalls`, `maxDuration`, `perToolTimeout`, `maxTokens` (sacrificial-thread enforced; token cap is cumulative across turns when provider reports usage) | #637, #963 |
 
 What the framework does **not** enforce — your responsibility:
 
@@ -1193,7 +1193,7 @@ For the full contributor guide — running the live-LLM and MCP integration test
 - [x] `.branch {}` — conditional routing on sealed types, composable with `then`
 - [x] `@Generable("desc")` / `@Guide` / `@LlmDescription` — runtime reflection: `toLlmDescription()`, `jsonSchema()`, `promptFragment()`, `fromLlmOutput<T>()`, `PartiallyGenerated<T>`
 - [x] `model { }` — Ollama backend; `host`, `port`, `temperature`; injectable `ModelClient` for tests; auto-fallback to inline JSON tool-call format for models without native tool support (#706)
-- [x] Agentic execution loop — multi-turn tool calling with budget controls (`maxTurns`, `maxToolCalls`, `maxDuration`, `perToolTimeout`) + `onToolUse` observability hook (#637)
+- [x] Agentic execution loop — multi-turn tool calling with budget controls (`maxTurns`, `maxToolCalls`, `maxDuration`, `perToolTimeout`, `maxTokens`) + `onToolUse` observability hook (#637, #963)
 - [x] Skill selection — manual `skillSelection {}` + automatic LLM routing when multiple skills match
 - [x] `onError { Throwable -> }` — infrastructure-error observability hook (LLM transport, response parse, budget); pure observability — original exception always rethrows (#962)
 - [ ] `>>` — security/education wrap
diff --git a/docs/prd.md b/docs/prd.md
@@ -3922,7 +3922,8 @@ Notation: `[x]` shipped, `[ ]` planned. Mirrors the README's roadmap so contribu
 - [x] DDD package structure: `agents_engine.core` (entities) + `agents_engine.composition` (operators)
 - [x] Single-placement rule — each agent instance participates in at most one structure
 - [x] `model { }` — Ollama backend; `host`, `port`, `temperature`; injectable `ModelClient` for tests; auto-fallback to inline JSON tool-call format for models without native tool support (#706)
-- [x] Agentic execution loop — multi-turn tool calling with budget controls (`maxTurns`, `maxToolCalls`, `maxDuration`, `perToolTimeout`) + `onToolUse` observability hook (#637)
+- [x] Agentic execution loop — multi-turn tool calling with budget controls (`maxTurns`, `maxToolCalls`, `maxDuration`, `perToolTimeout`, `maxTokens`) + `onToolUse` observability hook (#637, #963)
+- [x] `TokenUsage` on `LlmResponse` — `prompt_eval_count` + `eval_count` parsed from Ollama; cumulative across turns, surfaces `BudgetReason.TOKENS` on overrun (#963)
 - [x] Skill selection — manual `skillSelection {}` + automatic LLM routing when multiple skills match
 - [x] `onSkillChosen { name -> }` — fires when an agent selects a skill to execute
 - [x] `onKnowledgeUsed { name, content -> }` — fires when the LLM fetches a knowledge entry (tools model)
@@ -3946,7 +3947,7 @@ Notation: `[x]` shipped, `[ ]` planned. Mirrors the README's roadmap so contribu
 - [ ] KSP annotation processor for compile-time `@Generable` (replaces runtime reflection); constrained decoding (Ollama/vLLM) + guided JSON mode (Anthropic/OpenAI)
 - [ ] Native CLI binary (GraalVM — no JRE required); `brew`, npm, pip, curl, apt
 - [ ] jlink minimal JRE bundle for runtime (~35 MB)
-- [ ] Agentic execution loop — extend budget controls with `maxTokens` + structure-level budgets (§5.6)
+- [ ] Structure-level budgets — `budget { }` on Pipeline / Forum / Parallel / Loop (§5.6)
 
 **Secondary (stretch):**
 - [ ] `Prompt<IN, OUT>` entity definition and DSL — typed public interface for agents (§8.6)
diff --git a/src/main/kotlin/agents_engine/model/AgenticLoop.kt b/src/main/kotlin/agents_engine/model/AgenticLoop.kt
@@ -103,6 +103,7 @@ suspend fun <IN> executeAgentic(
 
     var turns = 0
     var toolCalls = 0
+    var totalTokens = 0
     val invocationStartNanos = System.nanoTime()
     while (true) {
         val elapsedNanos = System.nanoTime() - invocationStartNanos
@@ -121,6 +122,21 @@ suspend fun <IN> executeAgentic(
         val response = withContext(Dispatchers.IO) { client.chat(messages) }
         turns++
 
+        // #963: accumulate tokens only when the provider reported usage.
+        // A null `tokenUsage` adds nothing and skips the cap check for that turn.
+        // Check after the round-trip so the LAST turn's tokens are counted
+        // even if it tips us over: the throw still surfaces the breach.
+        response.tokenUsage?.let { usage ->
+            totalTokens += usage.total
+            val cap = budget.maxTokens
+            if (cap != null && totalTokens > cap) {
+                throw BudgetExceededException(
+                    "Agent '${agent.name}' exceeded token budget of $cap (used $totalTokens)",
+                    BudgetReason.TOKENS,
+                )
+            }
+        }
+
         when (response) {
             is LlmResponse.Text -> {
                 return skill.outputTransformer?.invoke(response.content)
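Reviewer note: the accumulate-then-check behavior in the hunk above can be reproduced in isolation. The following is a minimal sketch, not the framework's code: `TokenUsage` is re-declared as a simplified stand-in, and `TokenBudgetExceeded` and `accumulateTokens` are hypothetical names introduced only for illustration.

```kotlin
// Simplified stand-in for the TokenUsage type added in this PR.
data class TokenUsage(val promptTokens: Int, val completionTokens: Int) {
    val total: Int get() = promptTokens + completionTokens
}

// Hypothetical exception standing in for BudgetExceededException(TOKENS).
class TokenBudgetExceeded(val cap: Int, val used: Int) :
    RuntimeException("exceeded token budget of $cap (used $used)")

/**
 * Folds per-turn usage into a running total and throws as soon as the
 * total passes [cap]. A null entry means the provider did not report
 * usage for that turn; it is skipped, so nothing is added or checked.
 */
fun accumulateTokens(turns: List<TokenUsage?>, cap: Int?): Int {
    var total = 0
    for (usage in turns) {
        if (usage == null) continue
        total += usage.total
        if (cap != null && total > cap) throw TokenBudgetExceeded(cap, total)
    }
    return total
}

fun main() {
    // Two reported turns under a generous cap: 15 + 22 = 37.
    println(accumulateTokens(listOf(TokenUsage(10, 5), TokenUsage(15, 7)), cap = 100))

    // Cumulative overrun: 10 after turn one, 20 after turn two, cap is 15.
    try {
        accumulateTokens(listOf(TokenUsage(5, 5), TokenUsage(5, 5)), cap = 15)
    } catch (e: TokenBudgetExceeded) {
        println(e.message)
    }

    // Unreported usage never trips the cap, even a cap of 1.
    println(accumulateTokens(listOf(null, null), cap = 1))
}
```

Because the check runs after the round-trip, the turn that crosses the cap has already been paid for; the cap bounds how far the loop continues, not the exact spend.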
diff --git a/src/main/kotlin/agents_engine/model/BudgetConfig.kt b/src/main/kotlin/agents_engine/model/BudgetConfig.kt
@@ -18,29 +18,36 @@ import kotlin.time.Duration.Companion.minutes
  * @property maxDuration wall-clock cap from agentic invocation start.
  *   Default 5 minutes.
  * @property perToolTimeout per-tool wall-clock cap. Null = no per-tool cap.
+ * @property maxTokens hard cap on cumulative LLM tokens (prompt + completion)
+ *   across all turns of the invocation. Null = no token cap. Tokens are only
+ *   accumulated when the provider reports usage on the response (#963); turns
+ *   with null `tokenUsage` add nothing toward the cap (best-effort).
  */
 data class BudgetConfig(
     val maxTurns: Int = 8,
     val maxToolCalls: Int = 32,
     val maxDuration: Duration = 5.minutes,
     val perToolTimeout: Duration? = null,
+    val maxTokens: Int? = null,
 )
 
 class BudgetBuilder {
     var maxTurns: Int = 8
     var maxToolCalls: Int = 32
     var maxDuration: Duration = 5.minutes
     var perToolTimeout: Duration? = null
+    var maxTokens: Int? = null
 
     internal fun build() = BudgetConfig(
         maxTurns = maxTurns,
         maxToolCalls = maxToolCalls,
         maxDuration = maxDuration,
         perToolTimeout = perToolTimeout,
+        maxTokens = maxTokens,
     )
 }
 
-enum class BudgetReason { TURNS, TOOL_CALLS, DURATION, PER_TOOL_TIMEOUT }
+enum class BudgetReason { TURNS, TOOL_CALLS, DURATION, PER_TOOL_TIMEOUT, TOKENS }
 
 class BudgetExceededException(
     message: String,
diff --git a/src/main/kotlin/agents_engine/model/ModelClient.kt b/src/main/kotlin/agents_engine/model/ModelClient.kt
@@ -13,9 +13,31 @@ data class ToolCall(
     val invalidArgumentsError: String? = null,
 )
 
+/**
+ * Token consumption for one LLM round-trip — null on the response when the
+ * provider doesn't report it. Sum of prompt + completion is what counts toward
+ * [BudgetConfig.maxTokens]. See #963.
+ */
+data class TokenUsage(
+    val promptTokens: Int,
+    val completionTokens: Int,
+) {
+    val total: Int get() = promptTokens + completionTokens
+}
+
 sealed interface LlmResponse {
-    data class Text(val content: String) : LlmResponse
-    data class ToolCalls(val calls: List<ToolCall>) : LlmResponse
+    /** Token usage for this response, or null if the provider didn't report it. */
+    val tokenUsage: TokenUsage?
+
+    data class Text(
+        val content: String,
+        override val tokenUsage: TokenUsage? = null,
+    ) : LlmResponse
+
+    data class ToolCalls(
+        val calls: List<ToolCall>,
+        override val tokenUsage: TokenUsage? = null,
+    ) : LlmResponse
 }
 
 fun interface ModelClient {
diff --git a/src/main/kotlin/agents_engine/model/OllamaClient.kt b/src/main/kotlin/agents_engine/model/OllamaClient.kt
@@ -205,8 +205,14 @@ open class OllamaClient(
         (root["error"] as? String)?.let { errorText ->
             throw LlmProviderException("Ollama returned an error: $errorText")
         }
+        // #963: Ollama reports prompt + completion token counts at the response root.
+        // Both must be present for the count to be trustworthy — partial reports get
+        // dropped (null) so the loop's accumulator can distinguish "0 tokens used"
+        // from "provider didn't say."
+        val tokenUsage = extractOllamaTokenUsage(root)
+
         val message = root["message"] as? Map<*, *>
-            ?: return LlmResponse.Text(body)
+            ?: return LlmResponse.Text(body, tokenUsage)
         val content = message["content"] as? String ?: ""
 
         // Native Ollama tool_calls field (models with built-in tool support)
@@ -223,14 +229,20 @@ open class OllamaClient(
                     invalidArgumentsError = parsedArgs.parseError,
                 )
             }
-            if (calls.isNotEmpty()) return LlmResponse.ToolCalls(calls)
+            if (calls.isNotEmpty()) return LlmResponse.ToolCalls(calls, tokenUsage)
         }
 
         // Inline JSON tool call in content (models without native tool support)
         val toolCall = InlineToolCallParser.parse(content)
-        if (toolCall != null) return LlmResponse.ToolCalls(listOf(toolCall))
+        if (toolCall != null) return LlmResponse.ToolCalls(listOf(toolCall), tokenUsage)
+
+        return LlmResponse.Text(content, tokenUsage)
+    }
 
-        return LlmResponse.Text(content)
+    private fun extractOllamaTokenUsage(root: Map<*, *>): TokenUsage? {
+        val prompt = (root["prompt_eval_count"] as? Number)?.toInt()
+        val completion = (root["eval_count"] as? Number)?.toInt()
+        return if (prompt != null && completion != null) TokenUsage(prompt, completion) else null
     }
 }
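Reviewer note: the all-or-nothing parsing rule in `extractOllamaTokenUsage` can be exercised without an `OllamaClient`. This is a small sketch under assumptions: the response root is modeled as an already-parsed `Map<String, Any?>`, `TokenUsage` is a simplified stand-in, and `extractTokenUsage` is a hypothetical free function mirroring the private helper above.

```kotlin
// Simplified stand-in for the TokenUsage type added in this PR.
data class TokenUsage(val promptTokens: Int, val completionTokens: Int)

/**
 * Returns usage only when both counters are present and numeric.
 * A partial report yields null so callers can tell "provider did not
 * say" apart from "zero tokens used".
 */
fun extractTokenUsage(root: Map<String, Any?>): TokenUsage? {
    // `as? Number` accepts Int, Long, or Double, whichever the JSON parser produced.
    val prompt = (root["prompt_eval_count"] as? Number)?.toInt() ?: return null
    val completion = (root["eval_count"] as? Number)?.toInt() ?: return null
    return TokenUsage(prompt, completion)
}

fun main() {
    // Both counters present: usage is reported.
    println(extractTokenUsage(mapOf("prompt_eval_count" to 25, "eval_count" to 8)))
    // Only one counter present: dropped as untrustworthy.
    println(extractTokenUsage(mapOf("prompt_eval_count" to 10)))
    // No counters at all: null, not zero.
    println(extractTokenUsage(mapOf("done" to true)))
}
```

Returning null for a partial report keeps the accumulator from under-counting a turn, at the cost of ignoring that turn entirely.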
 
diff --git a/src/test/kotlin/agents_engine/model/MaxTokensBudgetTest.kt b/src/test/kotlin/agents_engine/model/MaxTokensBudgetTest.kt
@@ -0,0 +1,203 @@
+package agents_engine.model
+
+import agents_engine.core.agent
+import org.junit.jupiter.api.assertThrows
+import kotlin.test.Test
+import kotlin.test.assertEquals
+import kotlin.test.assertNotNull
+import kotlin.test.assertNull
+
+// Tests for #963 — token-based budget control.
+// Plumbing: Ollama reports prompt_eval_count + eval_count → ModelClient
+// surfaces TokenUsage on LlmResponse → AgenticLoop accumulates → throws
+// BudgetExceededException(TOKENS) when over cap.
+class MaxTokensBudgetTest {
+
+    @Test
+    fun `TokenUsage total is the sum of prompt and completion`() {
+        val u = TokenUsage(promptTokens = 30, completionTokens = 12)
+        assertEquals(42, u.total)
+    }
+
+    @Test
+    fun `LlmResponse Text exposes tokenUsage when constructed with one`() {
+        val r = LlmResponse.Text("hello", TokenUsage(10, 5))
+        val usage = r.tokenUsage
+        assertNotNull(usage)
+        assertEquals(15, usage.total)
+    }
+
+    @Test
+    fun `LlmResponse ToolCalls exposes tokenUsage when constructed with one`() {
+        val r = LlmResponse.ToolCalls(emptyList(), TokenUsage(20, 7))
+        val usage = r.tokenUsage
+        assertNotNull(usage)
+        assertEquals(27, usage.total)
+    }
+
+    @Test
+    fun `LlmResponse default tokenUsage is null (back-compat)`() {
+        // Existing call sites (FakeModelClient { LlmResponse.Text("x") })
+        // must continue to work without specifying token usage.
+        assertNull(LlmResponse.Text("hi").tokenUsage)
+        assertNull(LlmResponse.ToolCalls(emptyList()).tokenUsage)
+    }
+
+    @Test
+    fun `BudgetConfig maxTokens default is null (no cap)`() {
+        assertNull(BudgetConfig().maxTokens)
+    }
+
+    @Test
+    fun `BudgetBuilder exposes maxTokens via DSL`() {
+        val b = BudgetBuilder()
+        b.maxTokens = 1000
+        assertEquals(1000, b.build().maxTokens)
+    }
+
+    @Test
+    fun `OllamaClient parseResponse extracts both prompt and completion counts`() {
+        // Realistic Ollama response shape — token counts at the root, not on `message`.
+        val body = """
+            {
+              "model": "llama3",
+              "message": {"role": "assistant", "content": "hello"},
+              "done": true,
+              "prompt_eval_count": 25,
+              "eval_count": 8
+            }
+        """.trimIndent()
+        val client = OllamaClient(model = "llama3")
+        val resp = client.parseResponse(body)
+        val usage = resp.tokenUsage
+        assertNotNull(usage)
+        assertEquals(25, usage.promptTokens)
+        assertEquals(8, usage.completionTokens)
+        assertEquals(33, usage.total)
+    }
+
+    @Test
+    fun `OllamaClient parseResponse drops partial token reports`() {
+        // If only one of prompt_eval_count / eval_count is present, the count
+        // is untrustworthy — surface it as null rather than half-attributing.
+        val body = """
+            {
+              "model": "llama3",
+              "message": {"role": "assistant", "content": "hi"},
+              "done": true,
+              "prompt_eval_count": 10
+            }
+        """.trimIndent()
+        val resp = OllamaClient(model = "llama3").parseResponse(body)
+        assertNull(resp.tokenUsage)
+    }
+
+    @Test
+    fun `OllamaClient parseResponse handles missing token counts`() {
+        // Provider didn't report anything — null, not zero.
+        val body = """
+            {
+              "model": "llama3",
+              "message": {"role": "assistant", "content": "hi"},
+              "done": true
+            }
+        """.trimIndent()
+        val resp = OllamaClient(model = "llama3").parseResponse(body)
+        assertNull(resp.tokenUsage)
+    }
+
+    @Test
+    fun `agentic loop accumulates tokens across turns`() {
+        // Two turns: a tool call followed by a final text (15 + 22 = 37 tokens).
+        // The cap of 100 is generous, so the loop must succeed; the cumulative
+        // overrun case is covered by a separate test below.
+        val responses = ArrayDeque<LlmResponse>()
+        responses.add(LlmResponse.ToolCalls(
+            listOf(ToolCall(name = "noop", arguments = emptyMap())),
+            TokenUsage(promptTokens = 10, completionTokens = 5),
+        ))
+        responses.add(LlmResponse.Text(
+            "done",
+            TokenUsage(promptTokens = 15, completionTokens = 7),
+        ))
+        val mock = ModelClient { _ -> responses.removeFirst() }
+
+        val a = agent<String, String>("a") {
+            model { ollama("llama3"); client = mock }
+            budget { maxTokens = 100 }
+            tools { tool("noop", "") { _ -> "ok" } }
+            skills { skill<String, String>("s", "s") { tools("noop") } }
+        }
+
+        val out = a("input")
+        assertEquals("done", out)
+    }
+
+    @Test
+    fun `agentic loop throws BudgetExceededException(TOKENS) when sum exceeds maxTokens`() {
+        // First turn alone (10 + 5 = 15) is over the cap of 10.
+        val responses = ArrayDeque<LlmResponse>()
+        responses.add(LlmResponse.Text(
+            "done",
+            TokenUsage(promptTokens = 10, completionTokens = 5),
+        ))
+        val mock = ModelClient { _ -> responses.removeFirst() }
+
+        val a = agent<String, String>("a") {
+            model { ollama("llama3"); client = mock }
+            budget { maxTokens = 10 }
+            skills { skill<String, String>("s", "s") { tools() } }
+        }
+
+        val ex = assertThrows<BudgetExceededException> { a("input") }
+        assertEquals(BudgetReason.TOKENS, ex.reason)
+        // Message should mention both the cap and the actual usage so users
+        // can see how badly they overshot.
+        val msg = ex.message.orEmpty()
+        assertEquals(true, msg.contains("10"), "message should mention cap: $msg")
+        assertEquals(true, msg.contains("15"), "message should mention used: $msg")
+    }
+
+    @Test
+    fun `agentic loop overrun triggers across cumulative turns, not per-turn`() {
+        // Each turn is 5 + 5 = 10 tokens. Cap is 15. Turn 1 lands at 10
+        // (under cap). Turn 2 brings cumulative to 20 (over) — that's where
+        // the throw must happen.
+        val responses = ArrayDeque<LlmResponse>()
+        responses.add(LlmResponse.ToolCalls(
+            listOf(ToolCall(name = "noop", arguments = emptyMap())),
+            TokenUsage(5, 5),
+        ))
+        responses.add(LlmResponse.Text("late", TokenUsage(5, 5)))
+        val mock = ModelClient { _ -> responses.removeFirst() }
+
+        val a = agent<String, String>("a") {
+            model { ollama("llama3"); client = mock }
+            budget { maxTokens = 15 }
+            tools { tool("noop", "") { _ -> "ok" } }
+            skills { skill<String, String>("s", "s") { tools("noop") } }
+        }
+
+        val ex = assertThrows<BudgetExceededException> { a("input") }
+        assertEquals(BudgetReason.TOKENS, ex.reason)
+    }
+
+    @Test
+    fun `loop with null tokenUsage on responses ignores the token cap entirely`() {
+        // Provider doesn't report token usage. The loop must not accumulate
+        // anything and the cap effectively does nothing, matching the
+        // null-`tokenUsage` behavior documented on BudgetConfig.maxTokens.
+        // Note this test cannot tell "skipped" apart from "counted as zero":
+        // neither would fire the cap. What it does pin down is that the loop
+        // completes normally rather than tripping a phantom budget.
+        val mock = ModelClient { _ -> LlmResponse.Text("done") }  // no usage
+
+        val a = agent<String, String>("a") {
+            model { ollama("llama3"); client = mock }
+            budget { maxTokens = 1 }  // hyper-tight cap; null usage means it must not fire
+            skills { skill<String, String>("s", "s") { tools() } }
+        }
+
+        assertEquals("done", a("input"))
+    }
+}