Feature/agent level token limit#4820
Conversation
PR Reviewer Guide 🔍(Review updated until commit 0093fb2)Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Latest suggestions up to 0093fb2 Explore these optional code suggestions:
Previous suggestionsSuggestions up to commit 763a83b
Suggestions up to commit e502cc1
Suggestions up to commit af3fc6b
Suggestions up to commit 8fdeac5
|
8fdeac5 to
af3fc6b
Compare
|
Persistent review updated to latest commit af3fc6b |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4820 +/- ##
============================================
+ Coverage 77.42% 77.47% +0.05%
- Complexity 11907 11973 +66
============================================
Files 963 964 +1
Lines 53326 53519 +193
Branches 6503 6532 +29
============================================
+ Hits 41285 41462 +177
- Misses 9289 9295 +6
- Partials 2752 2762 +10
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| private static String outputLimitField(Map<String, String> params) { | ||
| String llmInterface = params == null ? null : params.get(MLChatAgentRunner.LLM_INTERFACE); | ||
| if (llmInterface == null) { | ||
| return MAX_TOKENS_FIELD; | ||
| } | ||
|
|
||
| String normalized = llmInterface.trim().toLowerCase(Locale.ROOT); | ||
| if (normalized.startsWith(LLM_INTERFACE_BEDROCK_CONVERSE) || (normalized.contains("bedrock") && normalized.contains("converse"))) { | ||
| return BEDROCK_MAX_TOKENS_FIELD; | ||
| } | ||
| if (normalized.startsWith(LLM_INTERFACE_GEMINI_V1BETA_GENERATE_CONTENT) || normalized.contains("gemini")) { | ||
| return GEMINI_MAX_OUTPUT_TOKENS_FIELD; | ||
| } | ||
| return MAX_TOKENS_FIELD; | ||
| } |
There was a problem hiding this comment.
Blocking — provider detection silently fails open.
Two concerns:
-
LLM_INTERFACE_BEDROCK_CONVERSE = "bedrock/converse"is already a strict prefix of every legitimate Bedrock-converse interface (bedrock/converse/claude,bedrock/converse/deepseek_r1, fully-qualified model IDs). Thecontains("bedrock") && contains("converse")fallback adds no real coverage but silently accepts anything containing both substrings (incl. user-supplied junk). Same forcontains("gemini"). -
The bigger issue: when
_llm_interfaceis missing fromtargetParams, this returnsMAX_TOKENS_FIELDregardless of provider. For Bedrock-converse connectors that read${parameters.maxTokens}from their request template, the budget clamp will be written to a field the connector ignores — so the per-call cap is silently not enforced. Only the post-call exhaustion check stops the loop, which defeats half of the budget mechanism.
Fix options (any of):
- Detect provider from
mlAgent.getModel().getModelProvider()instead of an interface string. - Drop the substring fallback and fail loudly when the interface is unrecognized.
- Write all three candidate field names (
max_tokens,maxTokens,maxOutputTokens) — overhead is trivial and removes the brittle string match entirely.
There was a problem hiding this comment.
Fixed. Removed provider/interface detection for output-limit clamping. The effective clamp is now written to max_tokens, maxTokens, and maxOutputTokens.
| private static Long parsePositiveLong(Map<String, String> params, String field) { | ||
| if (params == null) { | ||
| return null; | ||
| } | ||
| String value = params.get(field); | ||
| if (value == null) { | ||
| return null; | ||
| } | ||
| try { | ||
| long parsed = Long.parseLong(value); | ||
| return parsed > 0 ? parsed : null; | ||
| } catch (NumberFormatException e) { | ||
| return null; | ||
| } | ||
| } |
There was a problem hiding this comment.
Blocking — 0 is treated as if the field were absent, then overwritten.
parsePositiveLong returns null for any non-positive value, so an existing max_tokens=0 (which several connectors interpret as "no per-call cap") is treated as missing — and applyRemainingToParams then overwrites it with remaining. That changes the connector's behavior for users who explicitly set 0.
If the intent is "ignore non-positive values", document it; otherwise treat 0 as "force unlimited per-call" and skip the clamp. Today the behavior is implicit and surprising.
There was a problem hiding this comment.
Intentional. Non-positive output-limit values are not treated as valid smaller caps; an active agent-level budget still clamps the outgoing request to the remaining budget.
| void applyRemainingToParams(Map<String, String> targetParams, Map<String, String> llmSpecParams, long consumedTokens) { | ||
| if (!isLimited() || targetParams == null) { | ||
| return; | ||
| } | ||
|
|
||
| long remaining = remaining(consumedTokens); | ||
| if (remaining <= 0) { | ||
| return; | ||
| } | ||
|
|
||
| long effectiveLimit = remaining; | ||
| Long existingLimit = smallestPositiveOutputLimit(targetParams, llmSpecParams); | ||
| if (existingLimit != null) { | ||
| effectiveLimit = Math.min(existingLimit, remaining); | ||
| } | ||
|
|
||
| targetParams.put(outputLimitField(targetParams), Long.toString(effectiveLimit)); | ||
| } |
There was a problem hiding this comment.
Note: when existingLimit (the smallest positive across targetParams/llmSpecParams) wins, this overwrites targetParams[outputLimitField]. The test testApplyRemainingToParams_PreservesSmallerExistingLimit asserts the smaller value ends up in params — which works, but the name "preserves" is misleading: we're rewriting the user's targetParams.max_tokens=80 with 30 (or with min(existing, remaining)). Consider renaming the test or adding a comment explaining the semantic.
There was a problem hiding this comment.
Fixed. Renamed the test to describe clamping to the smaller effective limit.
| if (tokenBudget.isExhausted(accumulatedTokenUsage[0])) { | ||
| long consumed = AgentTokenBudget.consumedTokens(accumulatedTokenUsage[0]); | ||
| log | ||
| .warn( | ||
| "ReAct loop stopped (token budget exhausted). agentId={}, maxTokens={}, consumedTokens={}", | ||
| agentId, | ||
| tokenBudget.getMaxTokens(), | ||
| consumed | ||
| ); | ||
| List<Message> toolInteractionMessages = toolInteractionJsonList.isEmpty() | ||
| ? null | ||
| : parseToolInteractionsForPersistence(toolInteractionJsonList, modelProvider); | ||
| Message resultMessage = toolCalls == null || toolCalls.isEmpty() | ||
| ? assistantMessage | ||
| : createBudgetExhaustedMessage(tokenBudget.exhaustedMessage(consumed)); | ||
| listener | ||
| .onResponse( | ||
| new AgentLogicResult( | ||
| resultMessage, | ||
| STOP_REASON_MAX_TOKENS, | ||
| accumulatedTokenUsage[0], | ||
| toolInteractionMessages | ||
| ) | ||
| ); | ||
| return; | ||
| } |
There was a problem hiding this comment.
Blocking — persisted memory left in inconsistent state.
messages.add(assistantMessage) ran on line 241, before this exhaustion check. So if the LLM returned tool_calls and the budget is exhausted here, the local messages list (and therefore the persisted memory built from it on the parent's success path) contains a tool-calling assistant turn followed by — nothing. No tool result, no follow-up.
The next time this conversation is loaded, the model sees a dangling tool_use block. Anthropic-style providers return a 400 ("each tool_use must be followed by a tool_result"). OpenAI is more lenient but still surprising.
Fix options:
- Move the budget check above
messages.add(assistantMessage)(line 241) so the orphan turn is never added. - Drop the trailing tool-calling turn from
messageson this branch. - Synthesize a
tool_resultcontent block with the budget-exhausted text so the conversation is well-formed.
There was a problem hiding this comment.
Fixed. V2 now checks budget exhaustion before adding the assistant tool-call message to messages, so exhausted tool-call turns are not persisted without tool results.
|
|
||
| // Chat-specific constants | ||
| private static final String STOP_REASON_MAX_ITERATIONS = "max_iterations"; | ||
| private static final String STOP_REASON_MAX_TOKENS = "max_tokens"; |
There was a problem hiding this comment.
High-priority — stop-reason name collides with provider-side stop reasons.
STOP_REASON_MAX_TOKENS = "max_tokens" overlaps with the upstream Bedrock provider's own max_tokens stop reason (see BedrockStreamingHandler.java:497). Any downstream code keying off stopReason to discriminate "model hit its own max_tokens" vs "agent budget exhausted" will be unable to tell them apart.
Suggest agent_max_tokens_budget or budget_exhausted for the agent-side reason.
There was a problem hiding this comment.
Fixed. Replaced agent-side "max_tokens" with "budget_exhausted".
| } | ||
| } | ||
|
|
||
| /** | ||
| * Get agent-level token budget from parameters. | ||
| * Returns -1 if no budget is configured (unlimited). | ||
| * | ||
| * @param params Execution parameters (merged agent + runtime params) | ||
| * @return Maximum total tokens allowed for the agent execution, or -1 for unlimited | ||
| */ | ||
| private long getMaxTokensBudget(Map<String, String> params) { |
There was a problem hiding this comment.
Blocking — dead code. Mirrors the same unused method in MLPlanExecuteAndReflectAgentRunner.java:1046. No callers in the production code; please delete.
There was a problem hiding this comment.
Fixed. Removed the unused helpers from AbstractV2, V1, and PER, along with the false-coverage tests.
| if (tokenBudget.isExhausted(tokenTracker)) { | ||
| String fallbackResponse = buildMaxIterationsFallbackResponse(maxIterations, lastThought); | ||
| responseListener.onResponse(fallbackResponse); | ||
| return; | ||
| } |
There was a problem hiding this comment.
Blocking — wrong message when budget exhausted at max-iterations time.
When both "max iterations reached" and "budget exhausted" trigger on the same final iteration, this returns the max-iterations fallback string ("Max iterations reached…") even though the budget is the actual cause. Every other budget-exhaustion site in the PR returns tokenBudget.exhaustedMessage(...), so the user-visible reason is now inconsistent across paths.
Use tokenBudget.exhaustedMessage(tokenTracker) here, or compose both reasons ("Max iterations reached AND token budget exhausted (X/Y)").
There was a problem hiding this comment.
Fixed. If V1 hits max iterations and budget exhaustion together, it now returns the budget-exhausted message and structured stop_reason=budget_exhausted.
| parentInteractionId, | ||
| allParams.get(EXECUTOR_AGENT_MEMORY_ID_FIELD), | ||
| allParams.get(EXECUTOR_AGENT_PARENT_INTERACTION_ID_FIELD), | ||
| tokenBudget.exhaustedMessage(tokenTracker) + ". Completed steps: " + completedSteps, |
There was a problem hiding this comment.
High-priority — List.toString() leaks [a, b, c] into user-visible output.
completedSteps is a List<String>, and ". Completed steps: " + completedSteps invokes List.toString(), producing "[step1, step2, step3]" with the surrounding brackets and , separators. That renders as garbage in chat UIs.
Same pattern appears at lines ~596, ~763, ~1081, ~1123. Either format as a numbered list (String.join("\n", completedSteps)) or move the steps into trace tensors / additionalInfo and keep the headline message clean.
There was a problem hiding this comment.
Fixed. Completed steps are now newline-joined instead of using Java List.toString() formatting.
| public long getCumulativeTotalTokens() { | ||
| long total = 0; | ||
| for (ModelUsageAggregation agg : perModelUsage.values()) { | ||
| TokenUsage usage = agg.getAggregatedUsage(); | ||
| Long effectiveTotal = usage.getEffectiveTotalTokens(); | ||
| if (effectiveTotal != null) { | ||
| total += effectiveTotal; | ||
| } | ||
| } | ||
| return total; |
There was a problem hiding this comment.
Minor: getEffectiveTotalTokens() returns null when only one of inputTokens/outputTokens is set, so this method silently treats partial data as zero. For providers that report only input tokens (rare), the budget would never trigger.
Consider summing (inputTokens != null ? inputTokens : 0) + (outputTokens != null ? outputTokens : 0) when at least one is non-null. Not blocking — just worth flagging since the budget assumes getEffectiveTotalTokens reflects reality.
There was a problem hiding this comment.
Fixed. getCumulativeTotalTokens() now counts input-only or output-only token data when total_tokens is absent.
| AgentTokenTracker tokenTracker, | ||
| FunctionCalling functionCalling | ||
| FunctionCalling functionCalling, | ||
| AgentTokenBudget tokenBudget |
There was a problem hiding this comment.
Nit: setupMemoryAndRun, setToolsAndRunAgent, and executePlanningLoop are now 9–11 parameters each after this PR. Consider extracting a small PerExecutionContext (carrying tokenTracker, tokenBudget, functionCalling, memory, conversationId) before signature creep gets worse. Not blocking.
There was a problem hiding this comment.
Acknowledged. I left the PerExecutionContext extraction out of this PR to avoid expanding the scope beyond the review fixes.
|
Persistent review updated to latest commit e502cc1 |
e502cc1 to
763a83b
Compare
|
Persistent review updated to latest commit 763a83b |
…ment a max_tokens parameter at the agent level that enforces a hard upper bound on total token usage across all LLM calls within a single agent execution. Changes: - AbstractV2AgentRunner: Add getMaxTokensBudget() helper method - MLChatAgentRunnerV2: Add token budget check after each LLM call with STOP_REASON_MAX_TOKENS stop reason - MLChatAgentRunner (V1): Add token budget check in ReAct loop after tokenTracker.recordTurn() - MLPlanExecuteAndReflectAgentRunner: Add token budget check in planning loop after planner LLM response - AgentTokenTracker: Add getCumulativeTotalTokens() method for budget enforcement The agent execution loop dynamically checks cumulative token usage after each LLM call. If the budget is exhausted, execution terminates gracefully with an appropriate message and stop reason. Usage: { "parameters": { "max_tokens": 10000 } } Resolves opensearch-project#4728 Signed-off-by: Kamen Tarlov
Signed-off-by: Kamen Tarlov <kamen.tarlov@itgix.com>
Signed-off-by: Guga Glonti <110097877+GugaGlonti@users.noreply.github.com>
- Introduced AgentTokenBudget class to encapsulate token budget logic. - Updated MLChatAgentRunner, MLChatAgentRunnerV2, and MLPlanExecuteAndReflectAgentRunner to utilize AgentTokenBudget for token management. - Removed redundant max_tokens checks and streamlined token exhaustion handling. - Enhanced token budget enforcement in agent execution loops. Signed-off-by: Guga Glonti <110097877+GugaGlonti@users.noreply.github.com>
- Implement tests to verify token budget behavior in various scenarios, including initial budget allocation, remaining budget application, and exhaustion handling. - Ensure correct parameter adjustments for different LLM interfaces based on token usage. Signed-off-by: Guga Glonti <110097877+GugaGlonti@users.noreply.github.com>
Signed-off-by: Guga Glonti <110097877+GugaGlonti@users.noreply.github.com>
Signed-off-by: Guga Glonti <110097877+GugaGlonti@users.noreply.github.com>
Signed-off-by: Guga Glonti <110097877+GugaGlonti@users.noreply.github.com>
Signed-off-by: Guga Glonti <110097877+GugaGlonti@users.noreply.github.com>
763a83b to
0093fb2
Compare
|
Persistent review updated to latest commit 0093fb2 |
rithinpullela
left a comment
There was a problem hiding this comment.
Thanks for addressing all the comments @GugaGlonti, the PR looks good to me to merge except for parseToolInteractionsForPersistence part in MLChatAgentRunnerV2.java, can you please take a look
| if (tokenBudget.isExhausted(consumedBeforeCall)) { | ||
| List<Message> toolInteractionMessages = toolInteractionJsonList.isEmpty() | ||
| ? null | ||
| : parseToolInteractionsForPersistence(toolInteractionJsonList, modelProvider); | ||
| listener | ||
| .onResponse( | ||
| new AgentLogicResult( | ||
| createBudgetExhaustedMessage(tokenBudget.exhaustedMessage(consumedBeforeCall)), | ||
| STOP_REASON_MAX_TOKENS, | ||
| accumulatedTokenUsage[0], | ||
| toolInteractionMessages | ||
| ) | ||
| ); | ||
| return; | ||
| } |
There was a problem hiding this comment.
Looks like this issue has not been addressed yet, can you address this? Should be a small fix
| private String buildMaxIterationsFallbackResponse(int maxIterations, AtomicReference<String> lastThought) { | ||
| return (lastThought.get() != null && !lastThought.get().isEmpty() && !"null".equals(lastThought.get())) | ||
| ? String.format("%s. Last thought: %s", String.format(MAX_ITERATIONS_MESSAGE, maxIterations), lastThought.get()) | ||
| : String.format(MAX_ITERATIONS_MESSAGE, maxIterations); | ||
| } |
There was a problem hiding this comment.
Blocking — dead helper, never called.
This private method is defined but has zero call sites. The same expression is still inlined at lines 1614-1619 inside the e -> { ... } block of generateLLMSummary's listener.
Either replace the inline expression at 1614-1619 with responseListener.onResponse(buildMaxIterationsFallbackResponse(maxIterations, lastThought));, or delete this helper. As-is, it's dead code that will rot.
$ grep -n buildMaxIterationsFallbackResponse MLChatAgentRunner.java
1629: private String buildMaxIterationsFallbackResponse(...)
(only the definition; no callers.)
| }, e -> { | ||
| String agentId = allParams.getOrDefault(AGENT_ID_LOG_FIELD, "unknown"); | ||
| String tenantIdLog = allParams.get(TENANT_ID_FIELD); | ||
| log.error("Failed to execute ReAct agent. agentId={}, tenantId={}", agentId, tenantIdLog, e); | ||
| if (tokenBudget.isLimited()) { | ||
| tokenTracker.recordReservedTokens(reActAgentId, subAgentBudget); | ||
| } | ||
| finalListener.onFailure(e); |
There was a problem hiding this comment.
High-priority — recordReservedTokens mutation is dead before onFailure.
recordReservedTokens mutates tokenTracker, but the very next line invokes finalListener.onFailure(e), terminating the agent run. Nothing else reads the tracker on this path, so the reservation is silently discarded.
Two possible intents — neither matches the current code:
- Pre-reservation against runaway retries: then the reservation must happen before dispatching the executor at line 691 (
client.execute(MLExecuteTaskAction.INSTANCE, ...)) and be debited back when the actual sub-agent usage merges in viamergeSubAgentUsage(line 710). As written, it's executed too late to matter. - Telemetry on the failed call: then the merged
tokenTracker.toOutputMap()needs to actually flow into the failure response — butfinalListener.onFailure(e)carries only the exception.
Please clarify intent and adjust accordingly, or delete the call. There's also no test for this path — MLPlanExecuteAndReflectAgentRunnerTest does not exercise the executor onFailure branch.
| if (tokenBudget.isExhausted(accumulatedTokenUsage[0])) { | ||
| long consumed = AgentTokenBudget.consumedTokens(accumulatedTokenUsage[0]); | ||
| log | ||
| .warn( | ||
| "ReAct loop stopped (token budget exhausted). agentId={}, maxTokens={}, consumedTokens={}", | ||
| agentId, | ||
| tokenBudget.getMaxTokens(), | ||
| consumed | ||
| ); | ||
| List<Message> toolInteractionMessages = toolInteractionJsonList.isEmpty() | ||
| ? null | ||
| : parseToolInteractionsForPersistence(toolInteractionJsonList, modelProvider); | ||
| Message resultMessage = toolCalls == null || toolCalls.isEmpty() | ||
| ? assistantMessage | ||
| : createBudgetExhaustedMessage(tokenBudget.exhaustedMessage(consumed)); | ||
| listener | ||
| .onResponse( | ||
| new AgentLogicResult( | ||
| resultMessage, | ||
| AgentTokenBudget.STOP_REASON_BUDGET_EXHAUSTED, | ||
| accumulatedTokenUsage[0], | ||
| toolInteractionMessages | ||
| ) | ||
| ); | ||
| return; | ||
| } |
There was a problem hiding this comment.
High-priority — stop_reason mis-classifies natural finish on the over-budget call.
When the LLM returns a complete, tool-call-free response that just happens to push cumulative usage past the budget, this branch (correctly) surfaces the model's actual final answer (assistantMessage) — but tags the result stop_reason=budget_exhausted. From the caller's perspective this is indistinguishable from "we cut you off mid-thought," but in this case the model finished cleanly.
Consider:
String stopReason = (toolCalls == null || toolCalls.isEmpty())
? STOP_REASON_END_TURN
: AgentTokenBudget.STOP_REASON_BUDGET_EXHAUSTED;If budget_exhausted is desired regardless (e.g., to signal "do not continue"), document the semantic explicitly. The existing testExecuteAgentLogic_TokenBudgetExhausted_StopsExecution only exercises the tool-call path, so this natural-finish branch has no direct coverage.
| if (tokenBudget.isExhausted(tokenTracker)) { | ||
| boolean includeTokenUsage = Boolean.parseBoolean(allParams.getOrDefault(AgentUtils.INCLUDE_TOKEN_USAGE, "false")); | ||
| saveAndReturnFinalResult( | ||
| memory, | ||
| parentInteractionId, | ||
| allParams.get(EXECUTOR_AGENT_MEMORY_ID_FIELD), | ||
| allParams.get(EXECUTOR_AGENT_PARENT_INTERACTION_ID_FIELD), | ||
| tokenBudget.exhaustedMessage(tokenTracker) + ". Completed steps:\n" + formatCompletedSteps(completedSteps), | ||
| null, | ||
| finalListener, | ||
| tokenTracker, | ||
| allParams.get(TENANT_ID_FIELD), | ||
| includeTokenUsage, | ||
| AgentTokenBudget.STOP_REASON_BUDGET_EXHAUSTED | ||
| ); | ||
| return; | ||
| } |
There was a problem hiding this comment.
Medium — duplicated boilerplate (4 sites in this file).
The same 11-line "budget exhausted → saveAndReturnFinalResult" block appears four times in this file: here (497-513), 582-605, 659-674, and 778-792. Each computes the same includeTokenUsage boolean, the same message string tokenBudget.exhaustedMessage(...) + ". Completed steps:\n" + formatCompletedSteps(completedSteps), and calls saveAndReturnFinalResult with STOP_REASON_BUDGET_EXHAUSTED.
Extract a helper:
private void returnBudgetExhausted(
Memory memory, String parentInteractionId, Map<String, String> allParams,
String exhaustedMessage, List<String> completedSteps,
AgentTokenTracker tokenTracker, ActionListener<Object> finalListener
) {
boolean includeTokenUsage = Boolean.parseBoolean(allParams.getOrDefault(AgentUtils.INCLUDE_TOKEN_USAGE, "false"));
saveAndReturnFinalResult(
memory, parentInteractionId,
allParams.get(EXECUTOR_AGENT_MEMORY_ID_FIELD),
allParams.get(EXECUTOR_AGENT_PARENT_INTERACTION_ID_FIELD),
exhaustedMessage + ". Completed steps:\n" + formatCompletedSteps(completedSteps),
null, finalListener, tokenTracker,
allParams.get(TENANT_ID_FIELD), includeTokenUsage,
AgentTokenBudget.STOP_REASON_BUDGET_EXHAUSTED
);
}Drops ~40 LOC and removes the risk of one site drifting from the others.
| Map<String, Object> dataAsMap = new HashMap<>(); | ||
| dataAsMap.put(RESPONSE_FIELD, finalResult); | ||
| if (stopReason != null) { | ||
| dataAsMap.put("stop_reason", stopReason); |
There was a problem hiding this comment.
Medium — magic string.
AgentTokenBudget.STOP_REASON_BUDGET_EXHAUSTED is a constant, but the field name "stop_reason" is hard-coded both here and at MLChatAgentRunner.java:1524. Promote to a constant (e.g., AgentTokenBudget.STOP_REASON_FIELD = "stop_reason") so a future rename doesn't drift between the two callers.
| dataAsMap.put("response", finalAnswer); | ||
| dataAsMap.put(ADDITIONAL_INFO_FIELD, additionalInfo); | ||
| if (stopReason != null) { | ||
| dataAsMap.put("stop_reason", stopReason); |
There was a problem hiding this comment.
Medium — magic string (paired with MLPlanExecuteAndReflectAgentRunner.java:1072).
Same "stop_reason" literal appears in both files. Promote to a constant and reference it from both call sites.
| void applyRemainingToParams(Map<String, String> targetParams, Map<String, String> llmSpecParams, long consumedTokens) { | ||
| if (!isLimited() || targetParams == null) { | ||
| return; | ||
| } | ||
|
|
||
| long remaining = remaining(consumedTokens); | ||
| if (remaining <= 0) { | ||
| return; | ||
| } | ||
|
|
||
| long effectiveLimit = remaining; | ||
| Long existingLimit = smallestPositiveOutputLimit(targetParams, llmSpecParams); | ||
| if (existingLimit != null) { | ||
| effectiveLimit = Math.min(existingLimit, remaining); | ||
| } | ||
|
|
||
| String effectiveLimitString = Long.toString(effectiveLimit); | ||
| for (String field : OUTPUT_LIMIT_FIELDS) { | ||
| targetParams.put(field, effectiveLimitString); | ||
| } | ||
| } |
There was a problem hiding this comment.
Medium — input-prompt tokens are not pre-checked, only output-cap is clamped.
This only writes the three output-cap fields (max_tokens, maxTokens, maxOutputTokens). The user-facing max_tokens is documented as "the total agent execution budget" (input + output across all calls), but on the next LLM call we cannot know the input-prompt token count until after the response. A large prompt can therefore push cumulative usage well past the budget before the post-call check at MLChatAgentRunner.java:594 catches it.
This isn't broken — the post-call check still fires — but the user could see substantially more than max_tokens actually charged on a single oversized prompt. Please document this semantic explicitly in the class doc, e.g.:
max_tokensis enforced post-hoc per LLM call; a single in-flight call can exceed the remaining budget by up to its input prompt size. Output limits are clamped pre-call.
(The PR description should probably say this too.)
| ) | ||
| ); | ||
|
|
||
| addSteps(completedSteps, allParams, COMPLETED_STEPS_FIELD); |
There was a problem hiding this comment.
Medium — addSteps reordering deserves a comment.
This addSteps(...) call was previously after saveTraceData(...); the PR moves it before both saveTraceData and the new budget-check so that formatCompletedSteps(completedSteps) in the budget-exhausted message includes the step that just finished.
Functionally appears safe (no caller of saveTraceData reads allParams[COMPLETED_STEPS_FIELD]), but please add a one-line comment documenting why this must precede the budget check — otherwise a future cleanup is likely to reorder it back. Something like:
// Must precede the budget check below so the just-finished step is included in the exhaust message.
addSteps(completedSteps, allParams, COMPLETED_STEPS_FIELD);| private static long effectiveTotalTokens(TokenUsage usage) { | ||
| Long totalTokens = usage.getTotalTokens(); | ||
| if (totalTokens != null) { | ||
| return totalTokens; | ||
| } | ||
| Long inputTokens = usage.getInputTokens(); | ||
| Long outputTokens = usage.getOutputTokens(); | ||
| if (inputTokens != null || outputTokens != null) { | ||
| return (inputTokens != null ? inputTokens : 0L) + (outputTokens != null ? outputTokens : 0L); | ||
| } | ||
| return 0L; | ||
| } |
There was a problem hiding this comment.
Style/nit — possible duplication of TokenUsage.getEffectiveTotalTokens().
AgentTokenBudget.consumedTokens(TokenUsage) already delegates to usage.getEffectiveTotalTokens() (per AgentTokenBudget.java:142). If TokenUsage.getEffectiveTotalTokens() does the same fallback (totalTokens → input + output → 0), this private method is duplicated logic — just call it directly and remove this helper. If it does not do the same thing, the divergence is itself a bug worth resolving.
| if (tokenCount <= 0) { | ||
| return; | ||
| } | ||
| String effectiveModelId = modelId != null ? modelId : "unknown"; |
There was a problem hiding this comment.
Nit — magic string "unknown".
If this constant appears anywhere else in the codebase (likely it does — telemetry pipelines often have a canonical "unknown model" sentinel), promote to a shared constant. Also worth pausing on: is silently dropping the reservation (when modelId is null) a better choice than emitting an "unknown" per-model row that downstream consumers may not expect? Tied to H1 — this method's correct call-site is unclear.
Description
This PR introduces agent-level token budget support for agent executions using
parameters.max_tokens.Before this change, agent executions did not enforce a cumulative token budget across LLM calls. This PR adds token usage aggregation and budget enforcement for V1 chat, V2 chat, and Plan-Execute-Reflect agents.
The implementation:
parameters.max_tokensas the total agent execution budget.max_tokensmaxTokensmaxOutputTokensRelated Issues
Resolves #4728
Check List
--signoff.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.
Usage Example:
Execute Agent request with parameter
"max_tokens": 10000Response: ML-Task terminating after detecting token-overexhaustion