Commit eb4446b

Merge pull request #23 from Deep-CodeAI/feat/963-max-tokens-budget
feat(#963): maxTokens budget + TokenUsage on LlmResponse
2 parents db6166a + ac60ef3 commit eb4446b

7 files changed

Lines changed: 273 additions & 12 deletions

README.md

Lines changed: 3 additions & 3 deletions
@@ -91,7 +91,7 @@ These APIs work in `main`, are unit-tested, and are exercised by integration tes
 - **Memory bank** — `memory(MemoryBank())` auto-injects `memory_read` / `memory_write` / `memory_search` tools. See [Agent Memory](#agent-memory).
 - **LLM skill routing** — manual `skillSelection { }` or LLM router with `skillSelectionConfidenceThreshold`; `SkillRoute(name, confidence, rationale)` is structured (#641). See [Skill Selection](#skill-selection).
 - **Tool error recovery** — per-tool `onError`, per-skill default, agent default; built-in `escalate` and `throwException` agents. See [Tool Error Recovery](#tool-error-recovery).
-- **Budget controls** — `budget { maxTurns; maxToolCalls; maxDuration; perToolTimeout }` (sacrificial-thread enforcement) (#637).
+- **Budget controls** — `budget { maxTurns; maxToolCalls; maxDuration; perToolTimeout; maxTokens }` (sacrificial-thread enforcement; token counts cumulative across turns when the provider reports usage) (#637, #963).
 - **MCP client** — `mcp { server() }` over HTTP / stdio / TCP; Bearer auth; namespaced tools (`server.tool`). See [MCP Integration](#mcp-integration).
 - **MCP server** — `McpServer.from(agent)` exposes an agent as an MCP-conformant server with explicit `tools/listChanged: false` capability (#619).
 - **`McpRunner` standalone** — picocli-style one-liner main for shipping agents as MCP services.
@@ -121,7 +121,7 @@ What the framework enforces today:
 | Repaired args | Re-validated through the typed schema before reaching the executor | #658 |
 | Tool output trust | Tool results wrapped in untrusted envelope so the model can't forge framework messages | #642 |
 | Provider errors | Surface as `LlmProviderException` — never confused with model output | #702 |
-| Budget caps | `maxTurns`, `maxToolCalls`, `maxDuration`, `perToolTimeout` (sacrificial-thread enforced) | #637 |
+| Budget caps | `maxTurns`, `maxToolCalls`, `maxDuration`, `perToolTimeout`, `maxTokens` (sacrificial-thread enforced; token cap is cumulative across turns when provider reports usage) | #637, #963 |

 What the framework does **not** enforce — your responsibility:

@@ -1193,7 +1193,7 @@ For the full contributor guide — running the live-LLM and MCP integration test
 - [x] `.branch {}` — conditional routing on sealed types, composable with `then`
 - [x] `@Generable("desc")` / `@Guide` / `@LlmDescription` — runtime reflection: `toLlmDescription()`, `jsonSchema()`, `promptFragment()`, `fromLlmOutput<T>()`, `PartiallyGenerated<T>`
 - [x] `model { }` — Ollama backend; `host`, `port`, `temperature`; injectable `ModelClient` for tests; auto-fallback to inline JSON tool-call format for models without native tool support (#706)
-- [x] Agentic execution loop — multi-turn tool calling with budget controls (`maxTurns`, `maxToolCalls`, `maxDuration`, `perToolTimeout`) + `onToolUse` observability hook (#637)
+- [x] Agentic execution loop — multi-turn tool calling with budget controls (`maxTurns`, `maxToolCalls`, `maxDuration`, `perToolTimeout`, `maxTokens`) + `onToolUse` observability hook (#637, #963)
 - [x] Skill selection — manual `skillSelection {}` + automatic LLM routing when multiple skills match
 - [x] `onError { Throwable -> }` — infrastructure-error observability hook (LLM transport, response parse, budget); pure observability — original exception always rethrows (#962)
 - [ ] `>>` — security/education wrap

docs/prd.md

Lines changed: 3 additions & 2 deletions
@@ -3922,7 +3922,8 @@ Notation: `[x]` shipped, `[ ]` planned. Mirrors the README's roadmap so contribu
 - [x] DDD package structure: `agents_engine.core` (entities) + `agents_engine.composition` (operators)
 - [x] Single-placement rule — each agent instance participates in at most one structure
 - [x] `model { }` — Ollama backend; `host`, `port`, `temperature`; injectable `ModelClient` for tests; auto-fallback to inline JSON tool-call format for models without native tool support (#706)
-- [x] Agentic execution loop — multi-turn tool calling with budget controls (`maxTurns`, `maxToolCalls`, `maxDuration`, `perToolTimeout`) + `onToolUse` observability hook (#637)
+- [x] Agentic execution loop — multi-turn tool calling with budget controls (`maxTurns`, `maxToolCalls`, `maxDuration`, `perToolTimeout`, `maxTokens`) + `onToolUse` observability hook (#637, #963)
+- [x] `TokenUsage` on `LlmResponse` — `prompt_eval_count` + `eval_count` parsed from Ollama; cumulative across turns, surfaces `BudgetReason.TOKENS` on overrun (#963)
 - [x] Skill selection — manual `skillSelection {}` + automatic LLM routing when multiple skills match
 - [x] `onSkillChosen { name -> }` — fires when an agent selects a skill to execute
 - [x] `onKnowledgeUsed { name, content -> }` — fires when the LLM fetches a knowledge entry (tools model)
@@ -3946,7 +3947,7 @@ Notation: `[x]` shipped, `[ ]` planned. Mirrors the README's roadmap so contribu
 - [ ] KSP annotation processor for compile-time `@Generable` (replaces runtime reflection); constrained decoding (Ollama/vLLM) + guided JSON mode (Anthropic/OpenAI)
 - [ ] Native CLI binary (GraalVM — no JRE required); `brew`, npm, pip, curl, apt
 - [ ] jlink minimal JRE bundle for runtime (~35 MB)
-- [ ] Agentic execution loop — extend budget controls with `maxTokens` + structure-level budgets (§5.6)
+- [ ] Structure-level budgets — `budget { }` on Pipeline / Forum / Parallel / Loop (§5.6)

 **Secondary (stretch):**
 - [ ] `Prompt<IN, OUT>` entity definition and DSL — typed public interface for agents (§8.6)

src/main/kotlin/agents_engine/model/AgenticLoop.kt

Lines changed: 16 additions & 0 deletions
@@ -103,6 +103,7 @@ suspend fun <IN> executeAgentic(

     var turns = 0
     var toolCalls = 0
+    var totalTokens = 0
     val invocationStartNanos = System.nanoTime()
     while (true) {
         val elapsedNanos = System.nanoTime() - invocationStartNanos
@@ -121,6 +122,21 @@ suspend fun <IN> executeAgentic(
         val response = withContext(Dispatchers.IO) { client.chat(messages) }
         turns++

+        // #963: accumulate tokens only when the provider reported usage —
+        // a missing `tokenUsage` does NOT count as zero toward the cap.
+        // Check after the round-trip so the LAST turn's tokens are counted
+        // even if it tips us over: the throw still surfaces the breach.
+        response.tokenUsage?.let { usage ->
+            totalTokens += usage.total
+            val cap = budget.maxTokens
+            if (cap != null && totalTokens > cap) {
+                throw BudgetExceededException(
+                    "Agent '${agent.name}' exceeded token budget of $cap (used $totalTokens)",
+                    BudgetReason.TOKENS,
+                )
+            }
+        }
+
         when (response) {
             is LlmResponse.Text -> {
                 return skill.outputTransformer?.invoke(response.content)
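
The accounting rule in this hunk can be exercised in isolation. The following is a minimal sketch, not the project's code: `replayTurns` and `TokenBudgetExceeded` are hypothetical names, and `TokenUsage` is re-declared so the snippet compiles on its own. It replays a list of per-turn usage reports against a cap, skips turns with no report, and throws only when the running total is strictly greater than the cap.

```kotlin
// Stand-in for the TokenUsage type added in this commit, re-declared for a self-contained sketch.
data class TokenUsage(val promptTokens: Int, val completionTokens: Int) {
    val total: Int get() = promptTokens + completionTokens
}

// Hypothetical exception; the commit itself throws BudgetExceededException(..., BudgetReason.TOKENS).
class TokenBudgetExceeded(val cap: Int, val used: Int) :
    RuntimeException("exceeded token budget of $cap (used $used)")

/**
 * Replays per-turn usage reports against an optional cap and returns the running total.
 * A null entry models a turn where the provider reported nothing: the total is left
 * unchanged and the cap is not re-checked for that turn.
 */
fun replayTurns(cap: Int?, turns: List<TokenUsage?>): Int {
    var totalTokens = 0
    for (usage in turns) {
        if (usage == null) continue
        totalTokens += usage.total
        // Checked after adding, so the turn that tips the total over is included in `used`.
        if (cap != null && totalTokens > cap) throw TokenBudgetExceeded(cap, totalTokens)
    }
    return totalTokens
}
```

Skipping a null report leaves the total unchanged, which is arithmetically the same as adding zero; what differs is that an unreported turn also skips the cap check. A run that lands exactly on the cap completes, because the comparison is `>` rather than `>=`.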

src/main/kotlin/agents_engine/model/BudgetConfig.kt

Lines changed: 8 additions & 1 deletion
@@ -18,29 +18,36 @@ import kotlin.time.Duration.Companion.minutes
  * @property maxDuration wall-clock cap from agentic invocation start.
  * Default 5 minutes.
  * @property perToolTimeout per-tool wall-clock cap. Null = no per-tool cap.
+ * @property maxTokens hard cap on cumulative LLM tokens (prompt + completion)
+ * across all turns of the invocation. Null = no token cap. Tokens are only
+ * accumulated when the provider reports usage on the response (#963); turns
+ * with null `tokenUsage` count zero toward the cap.
  */
 data class BudgetConfig(
     val maxTurns: Int = 8,
     val maxToolCalls: Int = 32,
     val maxDuration: Duration = 5.minutes,
     val perToolTimeout: Duration? = null,
+    val maxTokens: Int? = null,
 )

 class BudgetBuilder {
     var maxTurns: Int = 8
     var maxToolCalls: Int = 32
     var maxDuration: Duration = 5.minutes
     var perToolTimeout: Duration? = null
+    var maxTokens: Int? = null

     internal fun build() = BudgetConfig(
         maxTurns = maxTurns,
         maxToolCalls = maxToolCalls,
         maxDuration = maxDuration,
         perToolTimeout = perToolTimeout,
+        maxTokens = maxTokens,
     )
 }

-enum class BudgetReason { TURNS, TOOL_CALLS, DURATION, PER_TOOL_TIMEOUT }
+enum class BudgetReason { TURNS, TOOL_CALLS, DURATION, PER_TOOL_TIMEOUT, TOKENS }

 class BudgetExceededException(
     message: String,
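
Callers that branch on `BudgetReason` may need a new arm for `TOKENS`. The sketch below is an illustration under assumptions: `BudgetExceededException` is re-declared as a stand-in because the diff shows only the start of the real class, and `describeBudgetFailure` is a hypothetical helper. It shows an exhaustive `when` over the enum, which Kotlin rejects at compile time if a constant is left unhandled.

```kotlin
enum class BudgetReason { TURNS, TOOL_CALLS, DURATION, PER_TOOL_TIMEOUT, TOKENS }

// Stand-in: only the first constructor parameter of the real class is visible in the diff.
// The `reason` property is inferred from the tests, which read `ex.reason`.
class BudgetExceededException(message: String, val reason: BudgetReason) : RuntimeException(message)

// Hypothetical helper. Used as an expression, `when` must cover every enum constant,
// so removing the TOKENS arm would be a compile error rather than a silent fall-through.
fun describeBudgetFailure(e: BudgetExceededException): String = when (e.reason) {
    BudgetReason.TURNS -> "turn cap reached: ${e.message}"
    BudgetReason.TOOL_CALLS -> "tool-call cap reached: ${e.message}"
    BudgetReason.DURATION -> "wall-clock cap reached: ${e.message}"
    BudgetReason.PER_TOOL_TIMEOUT -> "tool timed out: ${e.message}"
    BudgetReason.TOKENS -> "token cap reached: ${e.message}"
}
```

Code that uses an `else` branch instead keeps compiling after this commit, but it will route token overruns to whatever the fallback does.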

src/main/kotlin/agents_engine/model/ModelClient.kt

Lines changed: 24 additions & 2 deletions
@@ -13,9 +13,31 @@ data class ToolCall(
     val invalidArgumentsError: String? = null,
 )

+/**
+ * Token consumption for one LLM round-trip — null on the response when the
+ * provider doesn't report it. Sum of prompt + completion is what counts toward
+ * [BudgetConfig.maxTokens]. See #963.
+ */
+data class TokenUsage(
+    val promptTokens: Int,
+    val completionTokens: Int,
+) {
+    val total: Int get() = promptTokens + completionTokens
+}
+
 sealed interface LlmResponse {
-    data class Text(val content: String) : LlmResponse
-    data class ToolCalls(val calls: List<ToolCall>) : LlmResponse
+    /** Token usage for this response, or null if the provider didn't report it. */
+    val tokenUsage: TokenUsage?
+
+    data class Text(
+        val content: String,
+        override val tokenUsage: TokenUsage? = null,
+    ) : LlmResponse
+
+    data class ToolCalls(
+        val calls: List<ToolCall>,
+        override val tokenUsage: TokenUsage? = null,
+    ) : LlmResponse
 }

 fun interface ModelClient {
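
Because `tokenUsage` is nullable on every response, code that observes usage has to keep "not reported" separate from a count. One way to do that outside the loop is a decorating client, sketched here under assumptions: `UsageRecordingClient` is a hypothetical class, the `chat` parameter type is a placeholder (`List<String>`) since the real message type is not visible in this diff, and `LlmResponse` is trimmed to the `Text` variant.

```kotlin
data class TokenUsage(val promptTokens: Int, val completionTokens: Int) {
    val total: Int get() = promptTokens + completionTokens
}

// Trimmed stand-in for the sealed interface in this commit.
sealed interface LlmResponse {
    val tokenUsage: TokenUsage?

    data class Text(
        val content: String,
        override val tokenUsage: TokenUsage? = null,
    ) : LlmResponse
}

// Assumption: List<String> stands in for the project's real message type.
fun interface ModelClient {
    fun chat(messages: List<String>): LlmResponse
}

// Hypothetical decorator: forwards every call and tallies what the provider reported.
class UsageRecordingClient(private val delegate: ModelClient) : ModelClient {
    var reportedTokens = 0
        private set
    var unreportedTurns = 0
        private set

    override fun chat(messages: List<String>): LlmResponse {
        val response = delegate.chat(messages)
        val usage = response.tokenUsage
        // Unreported turns are counted separately instead of being folded in as zero tokens.
        if (usage == null) unreportedTurns++ else reportedTokens += usage.total
        return response
    }
}
```

The default `tokenUsage = null` is what lets existing call sites such as `LlmResponse.Text("x")` keep compiling, so a wrapper like this sees null for any client that has not been updated to report usage.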

src/main/kotlin/agents_engine/model/OllamaClient.kt

Lines changed: 16 additions & 4 deletions
@@ -205,8 +205,14 @@ open class OllamaClient(
         (root["error"] as? String)?.let { errorText ->
             throw LlmProviderException("Ollama returned an error: $errorText")
         }
+        // #963: Ollama reports prompt + completion token counts at the response root.
+        // Both must be present for the count to be trustworthy — partial reports get
+        // dropped (null) so the loop's accumulator can distinguish "0 tokens used"
+        // from "provider didn't say."
+        val tokenUsage = extractOllamaTokenUsage(root)
+
         val message = root["message"] as? Map<*, *>
-            ?: return LlmResponse.Text(body)
+            ?: return LlmResponse.Text(body, tokenUsage)
         val content = message["content"] as? String ?: ""

         // Native Ollama tool_calls field (models with built-in tool support)
@@ -223,14 +229,20 @@ open class OllamaClient(
                     invalidArgumentsError = parsedArgs.parseError,
                 )
             }
-            if (calls.isNotEmpty()) return LlmResponse.ToolCalls(calls)
+            if (calls.isNotEmpty()) return LlmResponse.ToolCalls(calls, tokenUsage)
         }

         // Inline JSON tool call in content (models without native tool support)
         val toolCall = InlineToolCallParser.parse(content)
-        if (toolCall != null) return LlmResponse.ToolCalls(listOf(toolCall))
+        if (toolCall != null) return LlmResponse.ToolCalls(listOf(toolCall), tokenUsage)
+
+        return LlmResponse.Text(content, tokenUsage)
+    }

-        return LlmResponse.Text(content)
+    private fun extractOllamaTokenUsage(root: Map<*, *>): TokenUsage? {
+        val prompt = (root["prompt_eval_count"] as? Number)?.toInt()
+        val completion = (root["eval_count"] as? Number)?.toInt()
+        return if (prompt != null && completion != null) TokenUsage(prompt, completion) else null
     }
 }
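
The "both or nothing" rule in `extractOllamaTokenUsage` is easiest to see on hand-built maps. The sketch below restates the helper as a free function (`extractTokenUsage`, a hypothetical name) with a local `TokenUsage`, so it runs without the client or a JSON parser. The input maps are invented for illustration and are not captured Ollama responses.

```kotlin
data class TokenUsage(val promptTokens: Int, val completionTokens: Int)

// Same logic as the private helper in the diff, lifted out of the class for experimentation.
fun extractTokenUsage(root: Map<*, *>): TokenUsage? {
    // `as? Number` accepts any numeric type a JSON parser might produce (Int, Long, Double)
    // and yields null for a missing key or a non-numeric value such as a string.
    val prompt = (root["prompt_eval_count"] as? Number)?.toInt()
    val completion = (root["eval_count"] as? Number)?.toInt()
    return if (prompt != null && completion != null) TokenUsage(prompt, completion) else null
}

// Both counts present: usage is reported.
val complete = extractTokenUsage(mapOf("prompt_eval_count" to 25, "eval_count" to 8))

// Only one count present: the partial report is dropped as a whole.
val partial = extractTokenUsage(mapOf("prompt_eval_count" to 10))

// A count encoded as a string fails the Number cast, so the report is dropped as well.
val stringly = extractTokenUsage(mapOf("prompt_eval_count" to "25", "eval_count" to 8))
```

One consequence of dropping partial reports is that such a turn contributes nothing to the `maxTokens` total, so the cap is best-effort for providers that report inconsistently.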

Lines changed: 203 additions & 0 deletions
@@ -0,0 +1,203 @@
+package agents_engine.model
+
+import agents_engine.core.agent
+import org.junit.jupiter.api.assertThrows
+import kotlin.test.Test
+import kotlin.test.assertEquals
+import kotlin.test.assertNotNull
+import kotlin.test.assertNull
+
+// Tests for #963 — token-based budget control.
+// Plumbing: Ollama reports prompt_eval_count + eval_count → ModelClient
+// surfaces TokenUsage on LlmResponse → AgenticLoop accumulates → throws
+// BudgetExceededException(TOKENS) when over cap.
+class MaxTokensBudgetTest {
+
+    @Test
+    fun `TokenUsage total is the sum of prompt and completion`() {
+        val u = TokenUsage(promptTokens = 30, completionTokens = 12)
+        assertEquals(42, u.total)
+    }
+
+    @Test
+    fun `LlmResponse Text exposes tokenUsage when constructed with one`() {
+        val r = LlmResponse.Text("hello", TokenUsage(10, 5))
+        val usage = r.tokenUsage
+        assertNotNull(usage)
+        assertEquals(15, usage.total)
+    }
+
+    @Test
+    fun `LlmResponse ToolCalls exposes tokenUsage when constructed with one`() {
+        val r = LlmResponse.ToolCalls(emptyList(), TokenUsage(20, 7))
+        val usage = r.tokenUsage
+        assertNotNull(usage)
+        assertEquals(27, usage.total)
+    }
+
+    @Test
+    fun `LlmResponse default tokenUsage is null (back-compat)`() {
+        // Existing call sites (FakeModelClient { LlmResponse.Text("x") })
+        // must continue to work without specifying token usage.
+        assertNull(LlmResponse.Text("hi").tokenUsage)
+        assertNull(LlmResponse.ToolCalls(emptyList()).tokenUsage)
+    }
+
+    @Test
+    fun `BudgetConfig maxTokens default is null (no cap)`() {
+        assertNull(BudgetConfig().maxTokens)
+    }
+
+    @Test
+    fun `BudgetBuilder exposes maxTokens via DSL`() {
+        val b = BudgetBuilder()
+        b.maxTokens = 1000
+        assertEquals(1000, b.build().maxTokens)
+    }
+
+    @Test
+    fun `OllamaClient parseResponse extracts both prompt and completion counts`() {
+        // Realistic Ollama response shape — token counts at the root, not on `message`.
+        val body = """
+            {
+                "model": "llama3",
+                "message": {"role": "assistant", "content": "hello"},
+                "done": true,
+                "prompt_eval_count": 25,
+                "eval_count": 8
+            }
+        """.trimIndent()
+        val client = OllamaClient(model = "llama3")
+        val resp = client.parseResponse(body)
+        val usage = resp.tokenUsage
+        assertNotNull(usage)
+        assertEquals(25, usage.promptTokens)
+        assertEquals(8, usage.completionTokens)
+        assertEquals(33, usage.total)
+    }
+
+    @Test
+    fun `OllamaClient parseResponse drops partial token reports`() {
+        // If only one of prompt_eval_count / eval_count is present, the count
+        // is untrustworthy — surface it as null rather than half-attributing.
+        val body = """
+            {
+                "model": "llama3",
+                "message": {"role": "assistant", "content": "hi"},
+                "done": true,
+                "prompt_eval_count": 10
+            }
+        """.trimIndent()
+        val resp = OllamaClient(model = "llama3").parseResponse(body)
+        assertNull(resp.tokenUsage)
+    }
+
+    @Test
+    fun `OllamaClient parseResponse handles missing token counts`() {
+        // Provider didn't report anything — null, not zero.
+        val body = """
+            {
+                "model": "llama3",
+                "message": {"role": "assistant", "content": "hi"},
+                "done": true
+            }
+        """.trimIndent()
+        val resp = OllamaClient(model = "llama3").parseResponse(body)
+        assertNull(resp.tokenUsage)
+    }
+
109+
@Test
110+
fun `agentic loop accumulates tokens across turns`() {
111+
// Two turns: a tool call followed by a final text. Cap is generous
112+
// so the loop succeeds; we then verify the cumulative count by
113+
// observing that a tighter cap would have tripped (separate test).
114+
val responses = ArrayDeque<LlmResponse>()
115+
responses.add(LlmResponse.ToolCalls(
116+
listOf(ToolCall(name = "noop", arguments = emptyMap())),
117+
TokenUsage(promptTokens = 10, completionTokens = 5),
118+
))
119+
responses.add(LlmResponse.Text(
120+
"done",
121+
TokenUsage(promptTokens = 15, completionTokens = 7),
122+
))
123+
val mock = ModelClient { _ -> responses.removeFirst() }
124+
125+
val a = agent<String, String>("a") {
126+
model { ollama("llama3"); client = mock }
127+
budget { maxTokens = 100 }
128+
tools { tool("noop", "") { _ -> "ok" } }
129+
skills { skill<String, String>("s", "s") { tools("noop") } }
130+
}
131+
132+
val out = a("input")
133+
assertEquals("done", out)
134+
}
135+
136+
@Test
137+
fun `agentic loop throws BudgetExceededException(TOKENS) when sum exceeds maxTokens`() {
138+
// First turn alone (10 + 5 = 15) is over the cap of 10.
139+
val responses = ArrayDeque<LlmResponse>()
140+
responses.add(LlmResponse.Text(
141+
"done",
142+
TokenUsage(promptTokens = 10, completionTokens = 5),
143+
))
144+
val mock = ModelClient { _ -> responses.removeFirst() }
145+
146+
val a = agent<String, String>("a") {
147+
model { ollama("llama3"); client = mock }
148+
budget { maxTokens = 10 }
149+
skills { skill<String, String>("s", "s") { tools() } }
150+
}
151+
152+
val ex = assertThrows<BudgetExceededException> { a("input") }
153+
assertEquals(BudgetReason.TOKENS, ex.reason)
154+
// Message should mention both the cap and the actual usage so users
155+
// can see how badly they overshot.
156+
val msg = ex.message.orEmpty()
157+
assertEquals(true, msg.contains("10"), "message should mention cap: $msg")
158+
assertEquals(true, msg.contains("15"), "message should mention used: $msg")
159+
}
160+
+    @Test
+    fun `agentic loop overrun triggers across cumulative turns, not per-turn`() {
+        // Each turn is 5 + 5 = 10 tokens. Cap is 15. Turn 1 lands at 10
+        // (under cap). Turn 2 brings cumulative to 20 (over) — that's where
+        // the throw must happen.
+        val responses = ArrayDeque<LlmResponse>()
+        responses.add(LlmResponse.ToolCalls(
+            listOf(ToolCall(name = "noop", arguments = emptyMap())),
+            TokenUsage(5, 5),
+        ))
+        responses.add(LlmResponse.Text("late", TokenUsage(5, 5)))
+        val mock = ModelClient { _ -> responses.removeFirst() }
+
+        val a = agent<String, String>("a") {
+            model { ollama("llama3"); client = mock }
+            budget { maxTokens = 15 }
+            tools { tool("noop", "") { _ -> "ok" } }
+            skills { skill<String, String>("s", "s") { tools("noop") } }
+        }
+
+        val ex = assertThrows<BudgetExceededException> { a("input") }
+        assertEquals(BudgetReason.TOKENS, ex.reason)
+    }
+
+    @Test
+    fun `loop with null tokenUsage on responses ignores the token cap entirely`() {
+        // Provider doesn't report token usage. The loop must not accumulate
+        // anything (a null is not zero) and the cap effectively does nothing —
+        // matching the "best-effort" contract documented on BudgetConfig.
+        // If the implementation accidentally treated null as zero, no cap
+        // would fire either; the key assertion is that the loop completes
+        // normally rather than tripping a phantom budget.
+        val mock = ModelClient { _ -> LlmResponse.Text("done") } // no usage
+
+        val a = agent<String, String>("a") {
+            model { ollama("llama3"); client = mock }
+            budget { maxTokens = 1 } // hyper-tight cap; null usage means it must not fire
+            skills { skill<String, String>("s", "s") { tools() } }
+        }
+
+        assertEquals("done", a("input"))
+    }
+}
