Feature/agent level token limit by GugaGlonti · Pull Request #4820 · opensearch-project/ml-commons

GugaGlonti · 2026-05-11T10:06:58Z

Description

This PR introduces agent-level token budget support for agent executions using parameters.max_tokens.

Before this change, agent executions did not enforce a cumulative token budget across LLM calls. This PR adds token usage aggregation and budget enforcement for V1 chat, V2 chat, and Plan-Execute-Reflect agents.

The implementation:

Tracks cumulative token usage across agent LLM calls.
Treats positive parameters.max_tokens as the total agent execution budget.
Stops agent execution when the cumulative token budget is exhausted.
Clamps outgoing LLM request parameters to the remaining token budget before each LLM call.
Preserves smaller model-level per-call output limits when present.
Applies provider-compatible output limit fields:
- max_tokens
- maxTokens
- maxOutputTokens
Preserves existing token usage response behavior, including sub-agent usage merging for Plan-Execute-Reflect.

Related Issues

Resolves #4728

Check List

New functionality includes testing.
New functionality has been documented.
API changes companion pull request created.
Commits are signed per the DCO using --signoff.
Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Usage Example:

Execute Agent request with parameter `"max_tokens": 10000`

POST _plugins/_ml/agents/8Tp1JZ4Bd954a9UWJGmU/_execute?async=true
{
  "parameters": {
    "question": "How many flights from Beijing to Seattle?",
    "include_token_usage": true,
    "max_tokens": 10000
  }
}

Response: ML-Task terminating after detecting token-overexhaustion

{
  "task_type": "AGENT_EXECUTION",
  "function_name": "AGENT",
  "state": "COMPLETED",
  "worker_node": [
    "dLXu2EpiREqk64AL7dAC3w"
  ],
  "create_time": 1778745110465,
  "last_update_time": 1778745122576,
  "is_async": true,
  "response": {
    "memory_id": "KDp4JZ4Bd954a9UWb2rC",
    "executor_agent_memory_id": "Ljp4JZ4Bd954a9UWeGrr",
    "inference_results": [
      {
        "output": [
          {
            "result": "KDp4JZ4Bd954a9UWb2rC",
            "name": "memory_id"
          },
          {
            "result": "KTp4JZ4Bd954a9UWb2rX",
            "name": "parent_interaction_id"
          },
          {
            "result": "Ljp4JZ4Bd954a9UWeGrr",
            "name": "executor_agent_memory_id"
          },
          {
            "result": "Rjp4JZ4Bd954a9UWmmqm",
            "name": "executor_agent_parent_interaction_id"
          },
          {
            "name": "response",
            "dataAsMap": {
              "response": """Agent execution stopped: token budget exhausted (10197/10000 tokens used). Completed steps: [
<step-1>
Use ListIndexTool to identify all available indices in the OpenSearch cluster without filtering to discover what indices exist
</step-1>
, 
<step-1-result>
## Available Indices in OpenSearch Cluster

I've successfully retrieved all indices from the OpenSearch cluster. Here are all 16 available indices:

### System/Plugin Indices:
1. **`.plugins-ml-memory-message`** - green, open, 15 docs (4 deleted)
2. **`.plugins-ml-model-group`** - green, open, 5 docs (5 deleted)
3. **`.kibana_-152937574_admintenant_1`** - green, open, 2 docs
4. **`.plugins-ml-memory-meta`** - green, open, 8 docs (15 deleted)
5. **`.ql-datasources`** - green, open, 0 docs
6. **`.plugins-ml-agent`** - green, open, 4 docs (6 deleted)
7. **`.plugins-ml-task`** - green, open, 11 docs (11 deleted)
8. **`.plugins-ml-connector`** - green, open, 6 docs
9. **`.opendistro_security`** - green, open, 9 docs
10. **`.kibana_1`** - green, open, 0 docs
11. **`.plugins-ml-config`** - green, open, 1 doc
12. **`.plugins-ml-model`** - green, open, 5 docs (2 deleted)
13. **`.plugins-ml-context-management-templates`** - green, open, 3 docs

### User/Application Indices:
14. **`kb-sops-canonical`** - yellow, open, 0 docs (1 replica)
15. **`security-auditlog-2026.05.14`** - green, open, 162 docs
16. **`top_queries-2026.05.14-84928`** - green, open, 21 docs (9 deleted)

**Summary:** The cluster contains mostly plugin and system indices (13), with 3 user-facing indices. Most indices are in green health status, with one index (`kb-sops-canonical`) in yellow status. Total documents across all indices: 269 documents.
<step-traces>

</step-traces>

</step-1-result>
, 
<step-2>
Use IndexMappingTool with index pattern '*' to examine field mappings across all indices and identify which index contains flight data with origin and destination fields
</step-2>
, 
<step-2-result>
Agent execution stopped: token budget exhausted (2131/1934 tokens used)
<step-traces>

</step-traces>

</step-2-result>
]"""
            }
          },
          {
            "name": "token_usage",
            "dataAsMap": {
              "per_turn_usage": [
                {
                  "model_name": "Bedrock Claude Sonnet model",
                  "model_url": "https://bedrock-runtime.eu-west-1.amazonaws.com/model/eu.anthropic.claude-haiku-4-5-20251001-v1:0/converse",
                  "total_tokens": 1395,
                  "output_tokens": 149,
                  "turn": 1,
                  "model_id": "7Dp0JZ4Bd954a9UW_mkX",
                  "cache_creation_input_tokens": 0,
                  "input_tokens": 1246,
                  "cache_read_input_tokens": 0
                },
                {
                  "model_name": "Bedrock Claude Sonnet model",
                  "model_url": "https://bedrock-runtime.eu-west-1.amazonaws.com/model/eu.anthropic.claude-haiku-4-5-20251001-v1:0/converse",
                  "total_tokens": 1574,
                  "output_tokens": 55,
                  "turn": 2,
                  "model_id": "7Dp0JZ4Bd954a9UW_mkX",
                  "cache_creation_input_tokens": 0,
                  "input_tokens": 1519,
                  "cache_read_input_tokens": 0
                },
                {
                  "model_name": "Bedrock Claude Sonnet model",
                  "model_url": "https://bedrock-runtime.eu-west-1.amazonaws.com/model/eu.anthropic.claude-haiku-4-5-20251001-v1:0/converse",
                  "total_tokens": 2975,
                  "output_tokens": 518,
                  "turn": 3,
                  "model_id": "7Dp0JZ4Bd954a9UW_mkX",
                  "cache_creation_input_tokens": 0,
                  "input_tokens": 2457,
                  "cache_read_input_tokens": 0
                },
                {
                  "model_name": "Bedrock Claude Sonnet model",
                  "model_url": "https://bedrock-runtime.eu-west-1.amazonaws.com/model/eu.anthropic.claude-haiku-4-5-20251001-v1:0/converse",
                  "total_tokens": 2122,
                  "output_tokens": 123,
                  "turn": 4,
                  "model_id": "7Dp0JZ4Bd954a9UW_mkX",
                  "cache_creation_input_tokens": 0,
                  "input_tokens": 1999,
                  "cache_read_input_tokens": 0
                },
                {
                  "model_name": "Bedrock Claude Sonnet model",
                  "model_url": "https://bedrock-runtime.eu-west-1.amazonaws.com/model/eu.anthropic.claude-haiku-4-5-20251001-v1:0/converse",
                  "total_tokens": 2131,
                  "output_tokens": 59,
                  "turn": 5,
                  "model_id": "7Dp0JZ4Bd954a9UW_mkX",
                  "cache_creation_input_tokens": 0,
                  "input_tokens": 2072,
                  "cache_read_input_tokens": 0
                }
              ],
              "per_model_usage": [
                {
                  "model_name": "Bedrock Claude Sonnet model",
                  "model_url": "https://bedrock-runtime.eu-west-1.amazonaws.com/model/eu.anthropic.claude-haiku-4-5-20251001-v1:0/converse",
                  "call_count": 5,
                  "total_tokens": 10197,
                  "output_tokens": 904,
                  "model_id": "7Dp0JZ4Bd954a9UW_mkX",
                  "cache_creation_input_tokens": 0,
                  "input_tokens": 9293,
                  "cache_read_input_tokens": 0
                }
              ]
            }
          }
        ]
      }
    ],
    "parent_interaction_id": "KTp4JZ4Bd954a9UWb2rX",
    "executor_agent_parent_interaction_id": "Lzp4JZ4Bd954a9UWeGr-"
  }
}

github-actions · 2026-05-11T10:08:42Z

PR Reviewer Guide 🔍

(Review updated until commit `0093fb2`)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review Possible Issue In `fromExecutionParams`, when `maxTokensStr` is null, the method returns `unlimited()`. However, if `params` itself is null, it also returns `unlimited()`. This means that a missing `max_tokens` key is treated the same as a null params map. While this may be intentional, it could mask configuration errors where the params map is unexpectedly null. Consider logging a warning or distinguishing between these cases if null params indicates a problem upstream. static AgentTokenBudget fromExecutionParams(Map<String, String> params) { if (params == null) { return unlimited(); } String maxTokensStr = params.get(MAX_TOKENS_FIELD); if (maxTokensStr == null) { return unlimited(); } Possible Issue In `mergeSubAgentUsage`, the code iterates over `subTurns` and `subModels` after casting from `List<?>`. If a list element is not a `Map`, it silently continues. However, if an element is a `Map` but contains unexpected types for keys like `MODEL_ID`, `MODEL_URL`, or `MODEL_NAME` (e.g., integers instead of strings), `getStringFromMap` returns the fallback, which may be null or a default. For `MODEL_ID`, if the fallback is null, the code skips that model entry. This could silently drop valid usage data if the sub-agent response format is slightly malformed. Consider logging a warning when skipping entries due to type mismatches to aid debugging. if (perTurnObj instanceof List) { List<?> subTurns = (List<?>) perTurnObj; for (Object subTurnObj : subTurns) { if (!(subTurnObj instanceof Map)) { continue; } Map<String, Object> subTurn = toObjectMap((Map<?, ?>) subTurnObj); turnCounter++; Map<String, Object> mergedTurn = new HashMap<>(subTurn); mergedTurn.put(TURN, turnCounter); perTurnUsage.add(mergedTurn); } } // Merge per-model usage: aggregate into existing model data Object perModelObj = tokenUsageMap.get(PER_MODEL_USAGE); if (perModelObj instanceof List) { List<?> subModels = (List<?>) perModelObj; for (Object subModelObj : subModels) { if (!(subModelObj instanceof Map)) { continue; } Map<String, Object> subModel = toObjectMap((Map<?, ?>) subModelObj); String modelId = getStringFromMap(subModel, MODEL_ID, null); if (modelId == null) { continue; } String modelUrl = getStringFromMap(subModel, MODEL_URL, modelId); String modelName = getStringFromMap(subModel, MODEL_NAME, modelId); TokenUsage subUsage = TokenUsage .builder() .inputTokens(getLongFromMap(subModel, TokenUsage.INPUT_TOKENS)) .outputTokens(getLongFromMap(subModel, TokenUsage.OUTPUT_TOKENS)) .totalTokens(getLongFromMap(subModel, TokenUsage.TOTAL_TOKENS)) .cacheReadInputTokens(getLongFromMap(subModel, TokenUsage.CACHE_READ_INPUT_TOKENS)) .cacheCreationInputTokens(getLongFromMap(subModel, TokenUsage.CACHE_CREATION_INPUT_TOKENS)) .reasoningTokens(getLongFromMap(subModel, TokenUsage.REASONING_TOKENS)) .build(); int subCallCount = getIntFromMap(subModel, CALL_COUNT, 1); if (!modelMetadataMap.containsKey(modelId)) { modelMetadataMap.put(modelId, new ModelMetadata(modelUrl, modelName)); } ModelMetadata modelMetadata = modelMetadataMap.getOrDefault(modelId, new ModelMetadata(modelUrl, modelName)); perModelUsage .computeIfAbsent(modelId, k -> new ModelUsageAggregation(modelMetadata)) .mergeAggregated(subUsage, subCallCount); } } Possible Issue In `executePlanningLoop`, after a sub-agent call fails (line 826-830), if `tokenBudget.isLimited()` is true, the code calls `tokenTracker.recordReservedTokens(reActAgentId, subAgentBudget)`. However, `subAgentBudget` is computed earlier (line 656) and represents the remaining budget at that point. If the sub-agent call fails immediately without consuming tokens, recording the full `subAgentBudget` as consumed may overcount token usage. This could cause the parent agent to exhaust its budget prematurely. Consider recording zero tokens or a smaller estimate on failure, or document why the full reserved budget is counted even on failure. if (tokenBudget.isLimited()) { tokenTracker.recordReservedTokens(reActAgentId, subAgentBudget); } finalListener.onFailure(e); }));

github-actions · 2026-05-11T10:09:22Z

PR Code Suggestions ✨

Latest suggestions up to 0093fb2

Explore these optional code suggestions:

Category	Suggestion	Impact
General	Clean up resources before response The token budget exhaustion check occurs after recording token usage from the LLM response. If the budget is exhausted, `cleanUpResource(tools)` is called, but the method continues to execute subsequent code that may attempt to use the cleaned-up tools. Ensure that the return statement prevents any further execution that depends on the tools. ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/agent/MLChatAgentRunner.java [594-625] if (tokenBudget.isExhausted(tokenTracker)) { long consumed = tokenTracker.getCumulativeTotalTokens(); log .warn( "ReAct loop stopped (token budget exhausted). agentId={}, maxTokens={}, consumedTokens={}", agentId, tokenBudget.getMaxTokens(), consumed ); + cleanUpResource(tools); sendFinalAnswer( sessionId, listener, question, parentInteractionId, verbose, traceDisabled, traceTensors, memory, traceNumber, additionalInfo, tokenBudget.exhaustedMessage(consumed), usesUnifiedInterface, new ArrayList<>(interactions), modelProvider, tokenTracker, tenantId, includeTokenUsage, AgentTokenBudget.STOP_REASON_BUDGET_EXHAUSTED ); - cleanUpResource(tools); return; } Suggestion importance[1-10]: 6 __ Why: The suggestion correctly identifies that `cleanUpResource(tools)` should be called before `sendFinalAnswer` to ensure proper resource cleanup order. This improves code maintainability and prevents potential issues with resource management, though the current code already has a `return` statement preventing further execution.	Low
General	Improve error message clarity The validation logic allows negative values to pass through the `Long.parseLong()` call before checking if `value <= 0`. If a very large negative number causes an overflow or unexpected behavior during parsing, the error message may be misleading. Consider validating the string format before parsing to provide clearer error messages. ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/agent/AgentTokenBudget.java [51-53] if (value <= 0) { - throw new IllegalArgumentException("max_tokens must be a positive long value"); + throw new IllegalArgumentException("max_tokens must be a positive long value, got: " + value); } Suggestion importance[1-10]: 4 __ Why: The suggestion improves error message clarity by including the actual value in the exception message. However, the impact is minor since the validation already occurs after parsing, and the improvement is primarily cosmetic.	Low

Previous suggestions

Suggestions up to commit 763a83b

Category	Suggestion	Impact
General	Log warning for dropped map entries The `toObjectMap()` method silently drops entries with non-String keys. If the input map contains important data with non-String keys (e.g., from deserialization issues), this data loss could cause incorrect token tracking. Consider logging a warning when non-String keys are encountered to aid debugging. ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/agent/AgentTokenTracker.java [213-221] private static Map<String, Object> toObjectMap(Map<?, ?> map) { Map<String, Object> objectMap = new HashMap<>(); for (Map.Entry<?, ?> entry : map.entrySet()) { if (entry.getKey() instanceof String) { objectMap.put((String) entry.getKey(), entry.getValue()); + } else { + log.warn("Dropping non-String key in token usage map: {}", entry.getKey()); } } return objectMap; } Suggestion importance[1-10]: 5 __ Why: Adding logging for dropped non-String keys would improve debugging capabilities when malformed data is encountered. The suggestion correctly identifies that silent data loss could cause issues, and logging would help diagnose such problems. However, this is a defensive improvement rather than fixing a critical bug.	Low
	Return sentinel for unlimited budgets The `remaining()` method returns `Long.MAX_VALUE` for unlimited budgets, but callers like `applyRemainingToParams()` use this value directly as a token limit. This could cause LLM requests to fail or behave unexpectedly if providers reject extremely large token limits. Consider returning a sentinel value that callers explicitly check, or document that unlimited budgets should skip limit application entirely. ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/agent/AgentTokenBudget.java [81-86] long remaining(long consumedTokens) { if (!isLimited()) { - return Long.MAX_VALUE; + return UNLIMITED; } return Math.max(0L, maxTokens - consumedTokens); } Suggestion importance[1-10]: 3 __ Why: The suggestion identifies a potential issue where `Long.MAX_VALUE` might cause problems with LLM providers. However, the `applyRemainingToParams` method already handles unlimited budgets by checking `isLimited()` at line 97 and returning early, so the `Long.MAX_VALUE` is never actually used as a token limit. The suggestion is technically correct but has minimal practical impact.	Low
	Record reserved tokens on budget exhaustion When the sub-agent budget is exhausted (`subAgentBudget <= 0`), the code returns early but doesn't record reserved tokens for the sub-agent. This creates an inconsistency: the error path at line 821 calls `recordReservedTokens()` when the sub-agent fails, but the budget exhaustion path doesn't. This could lead to inaccurate token tracking when budgets are exhausted before sub-agent execution. ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/agent/MLPlanExecuteAndReflectAgentRunner.java [656-676] long subAgentBudget = tokenBudget.remaining(tokenTracker); if (tokenBudget.isLimited()) { if (subAgentBudget <= 0) { boolean includeTokenUsage = Boolean.parseBoolean(allParams.getOrDefault(AgentUtils.INCLUDE_TOKEN_USAGE, "false")); + tokenTracker.recordReservedTokens(reActAgentId, 0L); saveAndReturnFinalResult( memory, parentInteractionId, allParams.get(EXECUTOR_AGENT_MEMORY_ID_FIELD), allParams.get(EXECUTOR_AGENT_PARENT_INTERACTION_ID_FIELD), tokenBudget.exhaustedMessage(tokenTracker) + ". Completed steps:\n" + formatCompletedSteps(completedSteps), null, finalListener, tokenTracker, allParams.get(TENANT_ID_FIELD), includeTokenUsage, AgentTokenBudget.STOP_REASON_BUDGET_EXHAUSTED ); return; } reactParams.put(AgentTokenBudget.MAX_TOKENS_FIELD, Long.toString(subAgentBudget)); } Suggestion importance[1-10]: 2 __ Why: The suggestion proposes recording 0 tokens when the budget is exhausted before sub-agent execution. However, recording 0 tokens serves no practical purpose and doesn't improve consistency. The error path at line 827 records `subAgentBudget` (not 0) when the sub-agent fails, which is different from this scenario where no execution occurs. The suggestion misunderstands the intent of `recordReservedTokens`.	Low

Suggestions up to commit e502cc1

Category	Suggestion	Impact
General	Improve error message clarity The validation logic allows negative values to pass through the `Long.parseLong()` call before checking if `value <= 0`. If the string represents a very large number that causes overflow, it could result in a negative value that gets caught here, but the error message would be misleading. Consider validating the string format before parsing to provide clearer error messages. ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/agent/AgentTokenBudget.java [51-53] if (value <= 0) { - throw new IllegalArgumentException("max_tokens must be a positive long value"); + throw new IllegalArgumentException("max_tokens must be a positive long value, got: " + value); } Suggestion importance[1-10]: 4 __ Why: The suggestion improves error message clarity by including the actual value in the exception message. However, the impact is minor since the validation logic already works correctly and the original message is reasonably clear.	Low
General	Record reserved tokens before early return When the sub-agent budget is exhausted (`subAgentBudget <= 0`), the code returns early but doesn't record the reserved tokens. This could lead to inaccurate token tracking. Consider recording reserved tokens before returning, similar to the error handler below that calls `tokenTracker.recordReservedTokens()`. ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/agent/MLPlanExecuteAndReflectAgentRunner.java [656-676] long subAgentBudget = tokenBudget.remaining(tokenTracker); if (tokenBudget.isLimited()) { if (subAgentBudget <= 0) { + tokenTracker.recordReservedTokens(reActAgentId, 0L); ... return; } reactParams.put(AgentTokenBudget.MAX_TOKENS_FIELD, Long.toString(subAgentBudget)); } Suggestion importance[1-10]: 3 __ Why: The suggestion identifies a potential inconsistency in token tracking when the sub-agent budget is exhausted before execution. However, recording 0 tokens may not be semantically meaningful, and the impact on overall token tracking accuracy is minimal since the budget is already exhausted at this point.	Low

Suggestions up to commit af3fc6b

Category	Suggestion	Impact
General	Inconsistent state with unanswered tool calls The budget exhaustion check occurs after token usage is recorded but before tool execution. If the LLM response includes tool calls, those tools won't execute, but the assistant message with tool calls is already added to interactions. This creates inconsistent state where tool calls are present without results. ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/agent/MLChatAgentRunner.java [594-624] if (tokenBudget.isExhausted(tokenTracker)) { long consumed = tokenTracker.getCumulativeTotalTokens(); log .warn( "ReAct loop stopped (token budget exhausted). agentId={}, maxTokens={}, consumedTokens={}", agentId, tokenBudget.getMaxTokens(), consumed ); + // Remove the assistant message with tool calls to maintain consistency + if (!interactions.isEmpty()) { + interactions.remove(interactions.size() - 1); + } sendFinalAnswer( sessionId, listener, question, parentInteractionId, verbose, traceDisabled, traceTensors, memory, traceNumber, additionalInfo, tokenBudget.exhaustedMessage(consumed), usesUnifiedInterface, new ArrayList<>(interactions), modelProvider, tokenTracker, tenantId, includeTokenUsage ); cleanUpResource(tools); return; } Suggestion importance[1-10]: 7 __ Why: This identifies a legitimate issue where the assistant message containing tool calls is added to `interactions` before the budget check. If the budget is exhausted, the tool calls won't execute, leaving the conversation history in an inconsistent state. Removing the last message before sending the final answer would maintain consistency between the message history and actual execution.	Medium
	Silent failure when budget exhausted When `remaining <= 0`, the method silently returns without setting any limit parameter. This could allow an LLM request to proceed without token constraints. Consider throwing an exception or logging a warning to prevent unexpected behavior. ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/agent/AgentTokenBudget.java [94-102] void applyRemainingToParams(Map<String, String> targetParams, Map<String, String> llmSpecParams, long consumedTokens) { if (!isLimited() \|\| targetParams == null) { return; } long remaining = remaining(consumedTokens); if (remaining <= 0) { + log.warn("Cannot apply token budget: no tokens remaining (consumed={}, max={})", consumedTokens, maxTokens); return; } ... } Suggestion importance[1-10]: 4 __ Why: Adding a warning log when `remaining <= 0` would improve observability. However, this scenario should be prevented by the `isExhausted` checks that occur before `applyRemainingToParams` is called. The silent return is acceptable as a defensive measure, but logging would help identify if the exhaustion checks are bypassed.	Low
	Off-by-one error in budget exhaustion The exhaustion check uses `>=` which triggers when exactly at the limit. This prevents the final token from being used. Consider using `>` to allow full budget utilization before stopping execution. ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/agent/AgentTokenBudget.java [67-69] boolean isExhausted(long consumedTokens) { - return isLimited() && consumedTokens >= maxTokens; + return isLimited() && consumedTokens > maxTokens; } Suggestion importance[1-10]: 3 __ Why: The suggestion raises a valid point about the `>=` vs `>` comparison, but the current implementation is actually correct for budget enforcement. Using `>=` ensures execution stops when the budget is reached or exceeded, which is the expected behavior for a hard limit. Using `>` would allow one token over budget, which contradicts the purpose of having a strict limit.	Low

Suggestions up to commit 8fdeac5

Category	Suggestion	Impact
General	Clean up resources before early return The budget exhaustion check occurs after token usage is recorded but before tool execution. If the LLM response includes tool calls, those tools are never executed, potentially leaving the agent in an inconsistent state. Consider checking budget before the LLM call or handling tool cleanup explicitly. ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/agent/MLChatAgentRunner.java [594-624] if (tokenBudget.isExhausted(tokenTracker)) { long consumed = tokenTracker.getCumulativeTotalTokens(); log .warn( "ReAct loop stopped (token budget exhausted). agentId={}, maxTokens={}, consumedTokens={}", agentId, tokenBudget.getMaxTokens(), consumed ); + // Clean up tools before sending final answer to prevent resource leaks + cleanUpResource(tools); sendFinalAnswer( sessionId, listener, question, parentInteractionId, verbose, traceDisabled, traceTensors, memory, traceNumber, additionalInfo, tokenBudget.exhaustedMessage(consumed), usesUnifiedInterface, new ArrayList<>(interactions), modelProvider, tokenTracker, tenantId, includeTokenUsage ); - cleanUpResource(tools); return; } Suggestion importance[1-10]: 6 __ Why: The suggestion correctly identifies that `cleanUpResource(tools)` is called after `sendFinalAnswer()`, which could be problematic if `sendFinalAnswer()` throws an exception. Moving cleanup before sending the response ensures resources are released even if the response fails. However, the impact is moderate since the current order still executes cleanup in the normal path.	Low
Possible issue	Fix off-by-one in budget exhaustion check The exhaustion check uses `>=` which treats exact budget match as exhausted. This prevents using the full allocated budget. Change to `>` to allow consuming exactly `maxTokens` before stopping. ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/agent/AgentTokenBudget.java [67-69] boolean isExhausted(long consumedTokens) { - return isLimited() && consumedTokens >= maxTokens; + return isLimited() && consumedTokens > maxTokens; } Suggestion importance[1-10]: 3 __ Why: The suggestion argues that `>=` prevents using the full budget, but this is a design choice. Using `>=` means "exhausted when consumed equals or exceeds budget," which is reasonable. The current implementation allows consuming up to `maxTokens` tokens before stopping, which is correct behavior. The suggestion would allow exceeding the budget by 1 token, which contradicts the budget constraint.	Low
Possible issue	Prevent LLM requests when budget exhausted When `remaining <= 0`, the method silently returns without setting any limit parameter. This allows LLM requests to proceed without token constraints even when budget is exhausted. Either throw an exception or set a minimal positive limit to prevent unbounded requests. ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/agent/AgentTokenBudget.java [94-102] void applyRemainingToParams(Map<String, String> targetParams, Map<String, String> llmSpecParams, long consumedTokens) { if (!isLimited() \|\| targetParams == null) { return; } long remaining = remaining(consumedTokens); if (remaining <= 0) { - return; + throw new IllegalStateException("Cannot apply token limit: budget exhausted"); } ... } Suggestion importance[1-10]: 2 __ Why: The method `applyRemainingToParams` is called after budget checks in the calling code (e.g., lines 865-888, 1606-1609 in `MLChatAgentRunner.java`). When `remaining <= 0`, returning silently is correct because the caller should have already checked `isExhausted()` and stopped execution. Throwing an exception here would be redundant and could break the control flow. The suggestion misunderstands the method's role as a helper, not a guard.	Low

github-actions · 2026-05-11T11:19:14Z

Persistent review updated to latest commit af3fc6b

codecov · 2026-05-11T18:54:25Z

Codecov Report

❌ Patch coverage is 62.31156% with 75 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.47%. Comparing base (12f884e) to head (af3fc6b).
⚠️ Report is 3 commits behind head on main.

Files with missing lines	Patch %	Lines
...thms/agent/MLPlanExecuteAndReflectAgentRunner.java	39.58%	25 Missing and 4 partials ⚠️
...h/ml/engine/algorithms/agent/AgentTokenBudget.java	76.92%	8 Missing and 10 partials ⚠️
.../ml/engine/algorithms/agent/MLChatAgentRunner.java	50.00%	14 Missing and 3 partials ⚠️
...l/engine/algorithms/agent/MLChatAgentRunnerV2.java	63.33%	8 Missing and 3 partials ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #4820      +/-   ##
============================================
+ Coverage     77.42%   77.47%   +0.05%     
- Complexity    11907    11973      +66     
============================================
  Files           963      964       +1     
  Lines         53326    53519     +193     
  Branches       6503     6532      +29     
============================================
+ Hits          41285    41462     +177     
- Misses         9289     9295       +6     
- Partials       2752     2762      +10

Flag	Coverage Δ
ml-commons	`77.47% <62.31%> (+0.05%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

ylwu-amzn · 2026-05-11T23:50:15Z

+    private static String outputLimitField(Map<String, String> params) {
+        String llmInterface = params == null ? null : params.get(MLChatAgentRunner.LLM_INTERFACE);
+        if (llmInterface == null) {
+            return MAX_TOKENS_FIELD;
+        }
+
+        String normalized = llmInterface.trim().toLowerCase(Locale.ROOT);
+        if (normalized.startsWith(LLM_INTERFACE_BEDROCK_CONVERSE) || (normalized.contains("bedrock") && normalized.contains("converse"))) {
+            return BEDROCK_MAX_TOKENS_FIELD;
+        }
+        if (normalized.startsWith(LLM_INTERFACE_GEMINI_V1BETA_GENERATE_CONTENT) || normalized.contains("gemini")) {
+            return GEMINI_MAX_OUTPUT_TOKENS_FIELD;
+        }
+        return MAX_TOKENS_FIELD;
+    }


Blocking — provider detection silently fails open.

Two concerns:

LLM_INTERFACE_BEDROCK_CONVERSE = "bedrock/converse" is already a strict prefix of every legitimate Bedrock-converse interface (bedrock/converse/claude, bedrock/converse/deepseek_r1, fully-qualified model IDs). The contains("bedrock") && contains("converse") fallback adds no real coverage but silently accepts anything containing both substrings (incl. user-supplied junk). Same for contains("gemini").

The bigger issue: when _llm_interface is missing from targetParams, this returns MAX_TOKENS_FIELD regardless of provider. For Bedrock-converse connectors that read ${parameters.maxTokens} from their request template, the budget clamp will be written to a field the connector ignores — so the per-call cap is silently not enforced. Only the post-call exhaustion check stops the loop, which defeats half of the budget mechanism.

Fix options (any of):

Detect provider from mlAgent.getModel().getModelProvider() instead of an interface string.

Drop the substring fallback and fail loudly when the interface is unrecognized.

Write all three candidate field names (max_tokens, maxTokens, maxOutputTokens) — overhead is trivial and removes the brittle string match entirely.

Fixed. Removed provider/interface detection for output-limit clamping. The effective clamp is now written to max_tokens, maxTokens, and maxOutputTokens.

ylwu-amzn · 2026-05-11T23:50:15Z

+    private static Long parsePositiveLong(Map<String, String> params, String field) {
+        if (params == null) {
+            return null;
+        }
+        String value = params.get(field);
+        if (value == null) {
+            return null;
+        }
+        try {
+            long parsed = Long.parseLong(value);
+            return parsed > 0 ? parsed : null;
+        } catch (NumberFormatException e) {
+            return null;
+        }
+    }


Blocking — 0 is treated as if the field were absent, then overwritten.

parsePositiveLong returns null for any non-positive value, so an existing max_tokens=0 (which several connectors interpret as "no per-call cap") is treated as missing — and applyRemainingToParams then overwrites it with remaining. That changes the connector's behavior for users who explicitly set 0.

If the intent is "ignore non-positive values", document it; otherwise treat 0 as "force unlimited per-call" and skip the clamp. Today the behavior is implicit and surprising.

Intentional. Non-positive output-limit values are not treated as valid smaller caps; an active agent-level budget still clamps the outgoing request to the remaining budget.

ylwu-amzn · 2026-05-11T23:50:15Z

+    void applyRemainingToParams(Map<String, String> targetParams, Map<String, String> llmSpecParams, long consumedTokens) {
+        if (!isLimited() || targetParams == null) {
+            return;
+        }
+
+        long remaining = remaining(consumedTokens);
+        if (remaining <= 0) {
+            return;
+        }
+
+        long effectiveLimit = remaining;
+        Long existingLimit = smallestPositiveOutputLimit(targetParams, llmSpecParams);
+        if (existingLimit != null) {
+            effectiveLimit = Math.min(existingLimit, remaining);
+        }
+
+        targetParams.put(outputLimitField(targetParams), Long.toString(effectiveLimit));
+    }


Note: when existingLimit (the smallest positive across targetParams/llmSpecParams) wins, this overwrites targetParams[outputLimitField]. The test testApplyRemainingToParams_PreservesSmallerExistingLimit asserts the smaller value ends up in params — which works, but the name "preserves" is misleading: we're rewriting the user's targetParams.max_tokens=80 with 30 (or with min(existing, remaining)). Consider renaming the test or adding a comment explaining the semantic.

Fixed. Renamed the test to describe clamping to the smaller effective limit.

ylwu-amzn · 2026-05-11T23:50:15Z

+                    if (tokenBudget.isExhausted(accumulatedTokenUsage[0])) {
+                        long consumed = AgentTokenBudget.consumedTokens(accumulatedTokenUsage[0]);
+                        log
+                            .warn(
+                                "ReAct loop stopped (token budget exhausted). agentId={}, maxTokens={}, consumedTokens={}",
+                                agentId,
+                                tokenBudget.getMaxTokens(),
+                                consumed
+                            );
+                        List<Message> toolInteractionMessages = toolInteractionJsonList.isEmpty()
+                            ? null
+                            : parseToolInteractionsForPersistence(toolInteractionJsonList, modelProvider);
+                        Message resultMessage = toolCalls == null || toolCalls.isEmpty()
+                            ? assistantMessage
+                            : createBudgetExhaustedMessage(tokenBudget.exhaustedMessage(consumed));
+                        listener
+                            .onResponse(
+                                new AgentLogicResult(
+                                    resultMessage,
+                                    STOP_REASON_MAX_TOKENS,
+                                    accumulatedTokenUsage[0],
+                                    toolInteractionMessages
+                                )
+                            );
+                        return;
+                    }


Blocking — persisted memory left in inconsistent state.

messages.add(assistantMessage) ran on line 241, before this exhaustion check. So if the LLM returned tool_calls and the budget is exhausted here, the local messages list (and therefore the persisted memory built from it on the parent's success path) contains a tool-calling assistant turn followed by — nothing. No tool result, no follow-up.

The next time this conversation is loaded, the model sees a dangling tool_use block. Anthropic-style providers return a 400 ("each tool_use must be followed by a tool_result"). OpenAI is more lenient but still surprising.

Fix options:

Move the budget check above messages.add(assistantMessage) (line 241) so the orphan turn is never added.

Drop the trailing tool-calling turn from messages on this branch.

Synthesize a tool_result content block with the budget-exhausted text so the conversation is well-formed.

Fixed. V2 now checks budget exhaustion before adding the assistant tool-call message to messages, so exhausted tool-call turns are not persisted without tool results.

ylwu-amzn · 2026-05-11T23:50:15Z


    // Chat-specific constants
    private static final String STOP_REASON_MAX_ITERATIONS = "max_iterations";
+    private static final String STOP_REASON_MAX_TOKENS = "max_tokens";


High-priority — stop-reason name collides with provider-side stop reasons.

STOP_REASON_MAX_TOKENS = "max_tokens" overlaps with the upstream Bedrock provider's own max_tokens stop reason (see BedrockStreamingHandler.java:497). Any downstream code keying off stopReason to discriminate "model hit its own max_tokens" vs "agent budget exhausted" will be unable to tell them apart.

Suggest agent_max_tokens_budget or budget_exhausted for the agent-side reason.

Fixed. Replaced agent-side "max_tokens" with "budget_exhausted".

ylwu-amzn · 2026-05-11T23:50:15Z

        }
    }

+    /**
+     * Get agent-level token budget from parameters.
+     * Returns -1 if no budget is configured (unlimited).
+     *
+     * @param params Execution parameters (merged agent + runtime params)
+     * @return Maximum total tokens allowed for the agent execution, or -1 for unlimited
+     */
+    private long getMaxTokensBudget(Map<String, String> params) {


Blocking — dead code. Mirrors the same unused method in MLPlanExecuteAndReflectAgentRunner.java:1046. No callers in the production code; please delete.

Fixed. Removed the unused helpers from AbstractV2, V1, and PER, along with the false-coverage tests.

ylwu-amzn · 2026-05-11T23:50:15Z

+        if (tokenBudget.isExhausted(tokenTracker)) {
+            String fallbackResponse = buildMaxIterationsFallbackResponse(maxIterations, lastThought);
+            responseListener.onResponse(fallbackResponse);
+            return;
+        }


Blocking — wrong message when budget exhausted at max-iterations time.

When both "max iterations reached" and "budget exhausted" trigger on the same final iteration, this returns the max-iterations fallback string ("Max iterations reached…") even though the budget is the actual cause. Every other budget-exhaustion site in the PR returns tokenBudget.exhaustedMessage(...), so the user-visible reason is now inconsistent across paths.

Use tokenBudget.exhaustedMessage(tokenTracker) here, or compose both reasons ("Max iterations reached AND token budget exhausted (X/Y)").

Fixed. If V1 hits max iterations and budget exhaustion together, it now returns the budget-exhausted message and structured stop_reason=budget_exhausted.

ylwu-amzn · 2026-05-11T23:50:15Z

+                parentInteractionId,
+                allParams.get(EXECUTOR_AGENT_MEMORY_ID_FIELD),
+                allParams.get(EXECUTOR_AGENT_PARENT_INTERACTION_ID_FIELD),
+                tokenBudget.exhaustedMessage(tokenTracker) + ". Completed steps: " + completedSteps,


High-priority — List.toString() leaks [a, b, c] into user-visible output.

completedSteps is a List<String>, and ". Completed steps: " + completedSteps invokes List.toString(), producing "[step1, step2, step3]" with the surrounding brackets and , separators. That renders as garbage in chat UIs.

Same pattern appears at lines ~596, ~763, ~1081, ~1123. Either format as a numbered list (String.join("\n", completedSteps)) or move the steps into trace tensors / additionalInfo and keep the headline message clean.

Fixed. Completed steps are now newline-joined instead of using Java List.toString() formatting.

ylwu-amzn · 2026-05-11T23:50:15Z

+    public long getCumulativeTotalTokens() {
+        long total = 0;
+        for (ModelUsageAggregation agg : perModelUsage.values()) {
+            TokenUsage usage = agg.getAggregatedUsage();
+            Long effectiveTotal = usage.getEffectiveTotalTokens();
+            if (effectiveTotal != null) {
+                total += effectiveTotal;
+            }
+        }
+        return total;


Minor: getEffectiveTotalTokens() returns null when only one of inputTokens/outputTokens is set, so this method silently treats partial data as zero. For providers that report only input tokens (rare), the budget would never trigger.

Consider summing (inputTokens != null ? inputTokens : 0) + (outputTokens != null ? outputTokens : 0) when at least one is non-null. Not blocking — just worth flagging since the budget assumes getEffectiveTotalTokens reflects reality.

Fixed. getCumulativeTotalTokens() now counts input-only or output-only token data when total_tokens is absent.

ylwu-amzn · 2026-05-11T23:50:15Z

        AgentTokenTracker tokenTracker,
-        FunctionCalling functionCalling
+        FunctionCalling functionCalling,
+        AgentTokenBudget tokenBudget


Nit: setupMemoryAndRun, setToolsAndRunAgent, and executePlanningLoop are now 9–11 parameters each after this PR. Consider extracting a small PerExecutionContext (carrying tokenTracker, tokenBudget, functionCalling, memory, conversationId) before signature creep gets worse. Not blocking.

Acknowledged. I left the PerExecutionContext extraction out of this PR to avoid expanding the scope beyond the review fixes.

github-actions · 2026-05-18T07:56:17Z

Persistent review updated to latest commit e502cc1

github-actions · 2026-05-18T09:51:30Z

Persistent review updated to latest commit 763a83b

…ment a max_tokens parameter at the agent level that enforces a hard upper bound on total token usage across all LLM calls within a single agent execution. Changes: - AbstractV2AgentRunner: Add getMaxTokensBudget() helper method - MLChatAgentRunnerV2: Add token budget check after each LLM call with STOP_REASON_MAX_TOKENS stop reason - MLChatAgentRunner (V1): Add token budget check in ReAct loop after tokenTracker.recordTurn() - MLPlanExecuteAndReflectAgentRunner: Add token budget check in planning loop after planner LLM response - AgentTokenTracker: Add getCumulativeTotalTokens() method for budget enforcement The agent execution loop dynamically checks cumulative token usage after each LLM call. If the budget is exhausted, execution terminates gracefully with an appropriate message and stop reason. Usage: { "parameters": { "max_tokens": 10000 } } Resolves opensearch-project#4728 Signed-off-by: Kamen Tarlov Signed-off-by: Kamen Tarlov <kamen.tarlov@itgix.com> Signed-off-by: Guga Glonti <110097877+GugaGlonti@users.noreply.github.com>

- Introduced AgentTokenBudget class to encapsulate token budget logic. - Updated MLChatAgentRunner, MLChatAgentRunnerV2, and MLPlanExecuteAndReflectAgentRunner to utilize AgentTokenBudget for token management. - Removed redundant max_tokens checks and streamlined token exhaustion handling. - Enhanced token budget enforcement in agent execution loops. Signed-off-by: Guga Glonti <110097877+GugaGlonti@users.noreply.github.com>

- Implement tests to verify token budget behavior in various scenarios, including initial budget allocation, remaining budget application, and exhaustion handling. - Ensure correct parameter adjustments for different LLM interfaces based on token usage. Signed-off-by: Guga Glonti <110097877+GugaGlonti@users.noreply.github.com>

Signed-off-by: Guga Glonti <110097877+GugaGlonti@users.noreply.github.com>

github-actions · 2026-05-26T17:43:58Z

Persistent review updated to latest commit 0093fb2

rithinpullela

Thanks for addressing all the comments @GugaGlonti, the PR looks good to me to merge except for parseToolInteractionsForPersistence part in MLChatAgentRunnerV2.java, can you please take a look

rithinpullela · 2026-05-26T18:55:40Z

+        if (tokenBudget.isExhausted(consumedBeforeCall)) {
+            List<Message> toolInteractionMessages = toolInteractionJsonList.isEmpty()
+                ? null
+                : parseToolInteractionsForPersistence(toolInteractionJsonList, modelProvider);
+            listener
+                .onResponse(
+                    new AgentLogicResult(
+                        createBudgetExhaustedMessage(tokenBudget.exhaustedMessage(consumedBeforeCall)),
+                        STOP_REASON_MAX_TOKENS,
+                        accumulatedTokenUsage[0],
+                        toolInteractionMessages
+                    )
+                );
+            return;
+        }


Looks like this issue has not been addressed yet, can you address this? Should be a small fix

ylwu-amzn · 2026-05-26T19:46:27Z

+    private String buildMaxIterationsFallbackResponse(int maxIterations, AtomicReference<String> lastThought) {
+        return (lastThought.get() != null && !lastThought.get().isEmpty() && !"null".equals(lastThought.get()))
+            ? String.format("%s. Last thought: %s", String.format(MAX_ITERATIONS_MESSAGE, maxIterations), lastThought.get())
+            : String.format(MAX_ITERATIONS_MESSAGE, maxIterations);
+    }


Blocking — dead helper, never called.

This private method is defined but has zero call sites. The same expression is still inlined at lines 1614-1619 inside the e -> { ... } block of generateLLMSummary's listener.

Either replace the inline expression at 1614-1619 with responseListener.onResponse(buildMaxIterationsFallbackResponse(maxIterations, lastThought));, or delete this helper. As-is, it's dead code that will rot.

$ grep -n buildMaxIterationsFallbackResponse MLChatAgentRunner.java 1629: private String buildMaxIterationsFallbackResponse(...)

(only the definition; no callers.)

ylwu-amzn · 2026-05-26T19:46:27Z

                }, e -> {
                    String agentId = allParams.getOrDefault(AGENT_ID_LOG_FIELD, "unknown");
                    String tenantIdLog = allParams.get(TENANT_ID_FIELD);
                    log.error("Failed to execute ReAct agent. agentId={}, tenantId={}", agentId, tenantIdLog, e);
+                    if (tokenBudget.isLimited()) {
+                        tokenTracker.recordReservedTokens(reActAgentId, subAgentBudget);
+                    }
                    finalListener.onFailure(e);


High-priority — recordReservedTokens mutation is dead before onFailure.

recordReservedTokens mutates tokenTracker, but the very next line invokes finalListener.onFailure(e), terminating the agent run. Nothing else reads the tracker on this path, so the reservation is silently discarded.

Two possible intents — neither matches the current code:

Pre-reservation against runaway retries: then the reservation must happen before dispatching the executor at line 691 (client.execute(MLExecuteTaskAction.INSTANCE, ...)) and be debited back when the actual sub-agent usage merges in via mergeSubAgentUsage (line 710). As written, it's executed too late to matter.

Telemetry on the failed call: then the merged tokenTracker.toOutputMap() needs to actually flow into the failure response — but finalListener.onFailure(e) carries only the exception.

Please clarify intent and adjust accordingly, or delete the call. There's also no test for this path — MLPlanExecuteAndReflectAgentRunnerTest does not exercise the executor onFailure branch.

ylwu-amzn · 2026-05-26T19:46:27Z

+                    if (tokenBudget.isExhausted(accumulatedTokenUsage[0])) {
+                        long consumed = AgentTokenBudget.consumedTokens(accumulatedTokenUsage[0]);
+                        log
+                            .warn(
+                                "ReAct loop stopped (token budget exhausted). agentId={}, maxTokens={}, consumedTokens={}",
+                                agentId,
+                                tokenBudget.getMaxTokens(),
+                                consumed
+                            );
+                        List<Message> toolInteractionMessages = toolInteractionJsonList.isEmpty()
+                            ? null
+                            : parseToolInteractionsForPersistence(toolInteractionJsonList, modelProvider);
+                        Message resultMessage = toolCalls == null || toolCalls.isEmpty()
+                            ? assistantMessage
+                            : createBudgetExhaustedMessage(tokenBudget.exhaustedMessage(consumed));
+                        listener
+                            .onResponse(
+                                new AgentLogicResult(
+                                    resultMessage,
+                                    AgentTokenBudget.STOP_REASON_BUDGET_EXHAUSTED,
+                                    accumulatedTokenUsage[0],
+                                    toolInteractionMessages
+                                )
+                            );
+                        return;
+                    }


High-priority — stop_reason mis-classifies natural finish on the over-budget call.

When the LLM returns a complete, tool-call-free response that just happens to push cumulative usage past the budget, this branch (correctly) surfaces the model's actual final answer (assistantMessage) — but tags the result stop_reason=budget_exhausted. From the caller's perspective this is indistinguishable from "we cut you off mid-thought," but in this case the model finished cleanly.

Consider:

String stopReason = (toolCalls == null || toolCalls.isEmpty()) ? STOP_REASON_END_TURN : AgentTokenBudget.STOP_REASON_BUDGET_EXHAUSTED;

If budget_exhausted is desired regardless (e.g., to signal "do not continue"), document the semantic explicitly. The existing testExecuteAgentLogic_TokenBudgetExhausted_StopsExecution only exercises the tool-call path, so this natural-finish branch has no direct coverage.

ylwu-amzn · 2026-05-26T19:46:27Z

+        if (tokenBudget.isExhausted(tokenTracker)) {
+            boolean includeTokenUsage = Boolean.parseBoolean(allParams.getOrDefault(AgentUtils.INCLUDE_TOKEN_USAGE, "false"));
+            saveAndReturnFinalResult(
+                memory,
+                parentInteractionId,
+                allParams.get(EXECUTOR_AGENT_MEMORY_ID_FIELD),
+                allParams.get(EXECUTOR_AGENT_PARENT_INTERACTION_ID_FIELD),
+                tokenBudget.exhaustedMessage(tokenTracker) + ". Completed steps:\n" + formatCompletedSteps(completedSteps),
+                null,
+                finalListener,
+                tokenTracker,
+                allParams.get(TENANT_ID_FIELD),
+                includeTokenUsage,
+                AgentTokenBudget.STOP_REASON_BUDGET_EXHAUSTED
            );
            return;
        }


Medium — duplicated boilerplate (4 sites in this file).

The same 11-line "budget exhausted → saveAndReturnFinalResult" block appears four times in this file: here (497-513), 582-605, 659-674, and 778-792. Each computes the same includeTokenUsage boolean, the same message string tokenBudget.exhaustedMessage(...) + ". Completed steps:\n" + formatCompletedSteps(completedSteps), and calls saveAndReturnFinalResult with STOP_REASON_BUDGET_EXHAUSTED.

Extract a helper:

private void returnBudgetExhausted( Memory memory, String parentInteractionId, Map<String, String> allParams, String exhaustedMessage, List<String> completedSteps, AgentTokenTracker tokenTracker, ActionListener<Object> finalListener ) { boolean includeTokenUsage = Boolean.parseBoolean(allParams.getOrDefault(AgentUtils.INCLUDE_TOKEN_USAGE, "false")); saveAndReturnFinalResult( memory, parentInteractionId, allParams.get(EXECUTOR_AGENT_MEMORY_ID_FIELD), allParams.get(EXECUTOR_AGENT_PARENT_INTERACTION_ID_FIELD), exhaustedMessage + ". Completed steps:\n" + formatCompletedSteps(completedSteps), null, finalListener, tokenTracker, allParams.get(TENANT_ID_FIELD), includeTokenUsage, AgentTokenBudget.STOP_REASON_BUDGET_EXHAUSTED ); }

Drops ~40 LOC and removes the risk of one site drifting from the others.

ylwu-amzn · 2026-05-26T19:46:27Z

+        Map<String, Object> dataAsMap = new HashMap<>();
+        dataAsMap.put(RESPONSE_FIELD, finalResult);
+        if (stopReason != null) {
+            dataAsMap.put("stop_reason", stopReason);


Medium — magic string.

AgentTokenBudget.STOP_REASON_BUDGET_EXHAUSTED is a constant, but the field name "stop_reason" is hard-coded both here and at MLChatAgentRunner.java:1524. Promote to a constant (e.g., AgentTokenBudget.STOP_REASON_FIELD = "stop_reason") so a future rename doesn't drift between the two callers.

ylwu-amzn · 2026-05-26T19:46:27Z

+        dataAsMap.put("response", finalAnswer);
+        dataAsMap.put(ADDITIONAL_INFO_FIELD, additionalInfo);
+        if (stopReason != null) {
+            dataAsMap.put("stop_reason", stopReason);


Medium — magic string (paired with MLPlanExecuteAndReflectAgentRunner.java:1072).

Same "stop_reason" literal appears in both files. Promote to a constant and reference it from both call sites.

ylwu-amzn · 2026-05-26T19:46:27Z

+    void applyRemainingToParams(Map<String, String> targetParams, Map<String, String> llmSpecParams, long consumedTokens) {
+        if (!isLimited() || targetParams == null) {
+            return;
+        }
+
+        long remaining = remaining(consumedTokens);
+        if (remaining <= 0) {
+            return;
+        }
+
+        long effectiveLimit = remaining;
+        Long existingLimit = smallestPositiveOutputLimit(targetParams, llmSpecParams);
+        if (existingLimit != null) {
+            effectiveLimit = Math.min(existingLimit, remaining);
+        }
+
+        String effectiveLimitString = Long.toString(effectiveLimit);
+        for (String field : OUTPUT_LIMIT_FIELDS) {
+            targetParams.put(field, effectiveLimitString);
+        }
+    }


Medium — input-prompt tokens are not pre-checked, only output-cap is clamped.

This only writes the three output-cap fields (max_tokens, maxTokens, maxOutputTokens). The user-facing max_tokens is documented as "the total agent execution budget" (input + output across all calls), but on the next LLM call we cannot know the input-prompt token count until after the response. A large prompt can therefore push cumulative usage well past the budget before the post-call check at MLChatAgentRunner.java:594 catches it.

This isn't broken — the post-call check still fires — but the user could see substantially more than max_tokens actually charged on a single oversized prompt. Please document this semantic explicitly in the class doc, e.g.:

max_tokens is enforced post-hoc per LLM call; a single in-flight call can exceed the remaining budget by up to its input prompt size. Output limits are clamped pre-call.

(The PR description should probably say this too.)

ylwu-amzn · 2026-05-26T19:46:27Z

                                )
                        );

+                    addSteps(completedSteps, allParams, COMPLETED_STEPS_FIELD);


Medium — addSteps reordering deserves a comment.

This addSteps(...) call was previously after saveTraceData(...); the PR moves it before both saveTraceData and the new budget-check so that formatCompletedSteps(completedSteps) in the budget-exhausted message includes the step that just finished.

Functionally appears safe (no caller of saveTraceData reads allParams[COMPLETED_STEPS_FIELD]), but please add a one-line comment documenting why this must precede the budget check — otherwise a future cleanup is likely to reorder it back. Something like:

// Must precede the budget check below so the just-finished step is included in the exhaust message. addSteps(completedSteps, allParams, COMPLETED_STEPS_FIELD);

ylwu-amzn · 2026-05-26T19:46:28Z

+    private static long effectiveTotalTokens(TokenUsage usage) {
+        Long totalTokens = usage.getTotalTokens();
+        if (totalTokens != null) {
+            return totalTokens;
+        }
+        Long inputTokens = usage.getInputTokens();
+        Long outputTokens = usage.getOutputTokens();
+        if (inputTokens != null || outputTokens != null) {
+            return (inputTokens != null ? inputTokens : 0L) + (outputTokens != null ? outputTokens : 0L);
+        }
+        return 0L;
+    }


Style/nit — possible duplication of TokenUsage.getEffectiveTotalTokens().

AgentTokenBudget.consumedTokens(TokenUsage) already delegates to usage.getEffectiveTotalTokens() (per AgentTokenBudget.java:142). If TokenUsage.getEffectiveTotalTokens() does the same fallback (totalTokens → input + output → 0), this private method is duplicated logic — just call it directly and remove this helper. If it does not do the same thing, the divergence is itself a bug worth resolving.

ylwu-amzn · 2026-05-26T19:46:28Z

+        if (tokenCount <= 0) {
+            return;
+        }
+        String effectiveModelId = modelId != null ? modelId : "unknown";


Nit — magic string "unknown".

If this constant appears anywhere else in the codebase (likely it does — telemetry pipelines often have a canonical "unknown model" sentinel), promote to a shared constant. Also worth pausing on: is silently dropping the reservation (when modelId is null) a better choice than emitting an "unknown" per-model row that downstream consumers may not expect? Tied to H1 — this method's correct call-site is unclear.

GugaGlonti requested review from HenryL27, Zhangxunmt, akolarkunnu, austintlee, b4sjoo, dhrubo-os, jngz-es, mingshl, model-collapse, pyek-bot, rbhavna, sam-herman, xinyual, ylwu-amzn and zane-neo as code owners May 11, 2026 10:06

GugaGlonti requested a deployment to ml-commons-cicd-env-require-approval May 11, 2026 10:08 — with GitHub Actions Waiting

GugaGlonti force-pushed the feature/agent-level-token-limit branch from 8fdeac5 to af3fc6b Compare May 11, 2026 11:17

GugaGlonti temporarily deployed to ml-commons-cicd-env-require-approval May 11, 2026 11:19 — with GitHub Actions Inactive

GugaGlonti had a problem deploying to ml-commons-cicd-env-require-approval May 11, 2026 11:19 — with GitHub Actions Failure

GugaGlonti had a problem deploying to ml-commons-cicd-env-require-approval May 11, 2026 11:19 — with GitHub Actions Error

GugaGlonti temporarily deployed to ml-commons-cicd-env-require-approval May 11, 2026 18:55 — with GitHub Actions Inactive

ylwu-amzn reviewed May 12, 2026

View reviewed changes

GugaGlonti mentioned this pull request May 14, 2026

[DOC] Agent-level Token Limit parameter opensearch-project/documentation-website#12430

Open

4 tasks

GugaGlonti requested a deployment to ml-commons-cicd-env-require-approval May 18, 2026 07:56 — with GitHub Actions Waiting

GugaGlonti force-pushed the feature/agent-level-token-limit branch from e502cc1 to 763a83b Compare May 18, 2026 09:49

GugaGlonti temporarily deployed to ml-commons-cicd-env-require-approval May 18, 2026 09:51 — with GitHub Actions Inactive

GugaGlonti had a problem deploying to ml-commons-cicd-env-require-approval May 18, 2026 09:51 — with GitHub Actions Error

GugaGlonti had a problem deploying to ml-commons-cicd-env-require-approval May 18, 2026 09:51 — with GitHub Actions Failure

GugaGlonti temporarily deployed to ml-commons-cicd-env-require-approval May 18, 2026 09:51 — with GitHub Actions Inactive

ktarlov and others added 7 commits May 26, 2026 10:42

Harden agent token budget utilities

da48a6c

Signed-off-by: Guga Glonti <110097877+GugaGlonti@users.noreply.github.com>

Fix chat agent budget exhaustion handling

d8c4623

Signed-off-by: Guga Glonti <110097877+GugaGlonti@users.noreply.github.com>

Fix PER token budget accounting

7b58437

Signed-off-by: Guga Glonti <110097877+GugaGlonti@users.noreply.github.com>

format: spotlessApply

0093fb2

Signed-off-by: Guga Glonti <110097877+GugaGlonti@users.noreply.github.com>

rithinpullela force-pushed the feature/agent-level-token-limit branch from 763a83b to 0093fb2 Compare May 26, 2026 17:42

rithinpullela had a problem deploying to ml-commons-cicd-env-require-approval May 26, 2026 17:44 — with GitHub Actions Failure

rithinpullela temporarily deployed to ml-commons-cicd-env-require-approval May 26, 2026 17:44 — with GitHub Actions Inactive

rithinpullela reviewed May 26, 2026

View reviewed changes

ylwu-amzn reviewed May 26, 2026

View reviewed changes

rithinpullela requested a deployment to ml-commons-cicd-env-require-approval May 28, 2026 03:49 — with GitHub Actions Waiting

Conversation

GugaGlonti commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Related Issues

Check List

Usage Example:

Execute Agent request with parameter "max_tokens": 10000

Response: ML-Task terminating after detecting token-overexhaustion

Uh oh!

github-actions Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Reviewer Guide 🔍

(Review updated until commit 0093fb2)

Uh oh!

github-actions Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Code Suggestions ✨

Previous suggestions

Uh oh!

github-actions Bot commented May 11, 2026

Uh oh!

codecov Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 18, 2026

Uh oh!

github-actions Bot commented May 18, 2026

Uh oh!

github-actions Bot commented May 26, 2026

Uh oh!

rithinpullela left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

GugaGlonti commented May 11, 2026 •

edited

Loading

Execute Agent request with parameter `"max_tokens": 10000`

github-actions Bot commented May 11, 2026 •

edited

Loading

(Review updated until commit `0093fb2`)

github-actions Bot commented May 11, 2026 •

edited

Loading

codecov Bot commented May 11, 2026 •

edited

Loading