Skip to content

Feature/agent level token limit#4820

Open
GugaGlonti wants to merge 7 commits into
opensearch-project:mainfrom
GugaGlonti:feature/agent-level-token-limit
Open

Feature/agent level token limit#4820
GugaGlonti wants to merge 7 commits into
opensearch-project:mainfrom
GugaGlonti:feature/agent-level-token-limit

Conversation

@GugaGlonti
Copy link
Copy Markdown

@GugaGlonti GugaGlonti commented May 11, 2026

Description

This PR introduces agent-level token budget support for agent executions using parameters.max_tokens.

Before this change, agent executions did not enforce a cumulative token budget across LLM calls. This PR adds token usage aggregation and budget enforcement for V1 chat, V2 chat, and Plan-Execute-Reflect agents.

The implementation:

  • Tracks cumulative token usage across agent LLM calls.
  • Treats positive parameters.max_tokens as the total agent execution budget.
  • Stops agent execution when the cumulative token budget is exhausted.
  • Clamps outgoing LLM request parameters to the remaining token budget before each LLM call.
  • Preserves smaller model-level per-call output limits when present.
  • Applies provider-compatible output limit fields:
    • max_tokens
    • maxTokens
    • maxOutputTokens
  • Preserves existing token usage response behavior, including sub-agent usage merging for Plan-Execute-Reflect.

Related Issues

Resolves #4728

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Usage Example:

Execute Agent request with parameter "max_tokens": 10000

POST _plugins/_ml/agents/8Tp1JZ4Bd954a9UWJGmU/_execute?async=true
{
  "parameters": {
    "question": "How many flights from Beijing to Seattle?",
    "include_token_usage": true,
    "max_tokens": 10000
  }
}

Response: ML-Task terminating after detecting token-overexhaustion

{
  "task_type": "AGENT_EXECUTION",
  "function_name": "AGENT",
  "state": "COMPLETED",
  "worker_node": [
    "dLXu2EpiREqk64AL7dAC3w"
  ],
  "create_time": 1778745110465,
  "last_update_time": 1778745122576,
  "is_async": true,
  "response": {
    "memory_id": "KDp4JZ4Bd954a9UWb2rC",
    "executor_agent_memory_id": "Ljp4JZ4Bd954a9UWeGrr",
    "inference_results": [
      {
        "output": [
          {
            "result": "KDp4JZ4Bd954a9UWb2rC",
            "name": "memory_id"
          },
          {
            "result": "KTp4JZ4Bd954a9UWb2rX",
            "name": "parent_interaction_id"
          },
          {
            "result": "Ljp4JZ4Bd954a9UWeGrr",
            "name": "executor_agent_memory_id"
          },
          {
            "result": "Rjp4JZ4Bd954a9UWmmqm",
            "name": "executor_agent_parent_interaction_id"
          },
          {
            "name": "response",
            "dataAsMap": {
              "response": """Agent execution stopped: token budget exhausted (10197/10000 tokens used). Completed steps: [
<step-1>
Use ListIndexTool to identify all available indices in the OpenSearch cluster without filtering to discover what indices exist
</step-1>
, 
<step-1-result>
## Available Indices in OpenSearch Cluster

I've successfully retrieved all indices from the OpenSearch cluster. Here are all 16 available indices:

### System/Plugin Indices:
1. **`.plugins-ml-memory-message`** - green, open, 15 docs (4 deleted)
2. **`.plugins-ml-model-group`** - green, open, 5 docs (5 deleted)
3. **`.kibana_-152937574_admintenant_1`** - green, open, 2 docs
4. **`.plugins-ml-memory-meta`** - green, open, 8 docs (15 deleted)
5. **`.ql-datasources`** - green, open, 0 docs
6. **`.plugins-ml-agent`** - green, open, 4 docs (6 deleted)
7. **`.plugins-ml-task`** - green, open, 11 docs (11 deleted)
8. **`.plugins-ml-connector`** - green, open, 6 docs
9. **`.opendistro_security`** - green, open, 9 docs
10. **`.kibana_1`** - green, open, 0 docs
11. **`.plugins-ml-config`** - green, open, 1 doc
12. **`.plugins-ml-model`** - green, open, 5 docs (2 deleted)
13. **`.plugins-ml-context-management-templates`** - green, open, 3 docs

### User/Application Indices:
14. **`kb-sops-canonical`** - yellow, open, 0 docs (1 replica)
15. **`security-auditlog-2026.05.14`** - green, open, 162 docs
16. **`top_queries-2026.05.14-84928`** - green, open, 21 docs (9 deleted)

**Summary:** The cluster contains mostly plugin and system indices (13), with 3 user-facing indices. Most indices are in green health status, with one index (`kb-sops-canonical`) in yellow status. Total documents across all indices: 269 documents.
<step-traces>

</step-traces>

</step-1-result>
, 
<step-2>
Use IndexMappingTool with index pattern '*' to examine field mappings across all indices and identify which index contains flight data with origin and destination fields
</step-2>
, 
<step-2-result>
Agent execution stopped: token budget exhausted (2131/1934 tokens used)
<step-traces>

</step-traces>

</step-2-result>
]"""
            }
          },
          {
            "name": "token_usage",
            "dataAsMap": {
              "per_turn_usage": [
                {
                  "model_name": "Bedrock Claude Sonnet model",
                  "model_url": "https://bedrock-runtime.eu-west-1.amazonaws.com/model/eu.anthropic.claude-haiku-4-5-20251001-v1:0/converse",
                  "total_tokens": 1395,
                  "output_tokens": 149,
                  "turn": 1,
                  "model_id": "7Dp0JZ4Bd954a9UW_mkX",
                  "cache_creation_input_tokens": 0,
                  "input_tokens": 1246,
                  "cache_read_input_tokens": 0
                },
                {
                  "model_name": "Bedrock Claude Sonnet model",
                  "model_url": "https://bedrock-runtime.eu-west-1.amazonaws.com/model/eu.anthropic.claude-haiku-4-5-20251001-v1:0/converse",
                  "total_tokens": 1574,
                  "output_tokens": 55,
                  "turn": 2,
                  "model_id": "7Dp0JZ4Bd954a9UW_mkX",
                  "cache_creation_input_tokens": 0,
                  "input_tokens": 1519,
                  "cache_read_input_tokens": 0
                },
                {
                  "model_name": "Bedrock Claude Sonnet model",
                  "model_url": "https://bedrock-runtime.eu-west-1.amazonaws.com/model/eu.anthropic.claude-haiku-4-5-20251001-v1:0/converse",
                  "total_tokens": 2975,
                  "output_tokens": 518,
                  "turn": 3,
                  "model_id": "7Dp0JZ4Bd954a9UW_mkX",
                  "cache_creation_input_tokens": 0,
                  "input_tokens": 2457,
                  "cache_read_input_tokens": 0
                },
                {
                  "model_name": "Bedrock Claude Sonnet model",
                  "model_url": "https://bedrock-runtime.eu-west-1.amazonaws.com/model/eu.anthropic.claude-haiku-4-5-20251001-v1:0/converse",
                  "total_tokens": 2122,
                  "output_tokens": 123,
                  "turn": 4,
                  "model_id": "7Dp0JZ4Bd954a9UW_mkX",
                  "cache_creation_input_tokens": 0,
                  "input_tokens": 1999,
                  "cache_read_input_tokens": 0
                },
                {
                  "model_name": "Bedrock Claude Sonnet model",
                  "model_url": "https://bedrock-runtime.eu-west-1.amazonaws.com/model/eu.anthropic.claude-haiku-4-5-20251001-v1:0/converse",
                  "total_tokens": 2131,
                  "output_tokens": 59,
                  "turn": 5,
                  "model_id": "7Dp0JZ4Bd954a9UW_mkX",
                  "cache_creation_input_tokens": 0,
                  "input_tokens": 2072,
                  "cache_read_input_tokens": 0
                }
              ],
              "per_model_usage": [
                {
                  "model_name": "Bedrock Claude Sonnet model",
                  "model_url": "https://bedrock-runtime.eu-west-1.amazonaws.com/model/eu.anthropic.claude-haiku-4-5-20251001-v1:0/converse",
                  "call_count": 5,
                  "total_tokens": 10197,
                  "output_tokens": 904,
                  "model_id": "7Dp0JZ4Bd954a9UW_mkX",
                  "cache_creation_input_tokens": 0,
                  "input_tokens": 9293,
                  "cache_read_input_tokens": 0
                }
              ]
            }
          }
        ]
      }
    ],
    "parent_interaction_id": "KTp4JZ4Bd954a9UWb2rX",
    "executor_agent_parent_interaction_id": "Lzp4JZ4Bd954a9UWeGr-"
  }
}

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 11, 2026

PR Reviewer Guide 🔍

(Review updated until commit 0093fb2)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Possible Issue

In fromExecutionParams, when maxTokensStr is null, the method returns unlimited(). However, if params itself is null, it also returns unlimited(). This means that a missing max_tokens key is treated the same as a null params map. While this may be intentional, it could mask configuration errors where the params map is unexpectedly null. Consider logging a warning or distinguishing between these cases if null params indicates a problem upstream.

static AgentTokenBudget fromExecutionParams(Map<String, String> params) {
    if (params == null) {
        return unlimited();
    }

    String maxTokensStr = params.get(MAX_TOKENS_FIELD);
    if (maxTokensStr == null) {
        return unlimited();
    }
Possible Issue

In mergeSubAgentUsage, the code iterates over subTurns and subModels after casting from List<?>. If a list element is not a Map, it silently continues. However, if an element is a Map but contains unexpected types for keys like MODEL_ID, MODEL_URL, or MODEL_NAME (e.g., integers instead of strings), getStringFromMap returns the fallback, which may be null or a default. For MODEL_ID, if the fallback is null, the code skips that model entry. This could silently drop valid usage data if the sub-agent response format is slightly malformed. Consider logging a warning when skipping entries due to type mismatches to aid debugging.

if (perTurnObj instanceof List) {
    List<?> subTurns = (List<?>) perTurnObj;
    for (Object subTurnObj : subTurns) {
        if (!(subTurnObj instanceof Map)) {
            continue;
        }
        Map<String, Object> subTurn = toObjectMap((Map<?, ?>) subTurnObj);
        turnCounter++;
        Map<String, Object> mergedTurn = new HashMap<>(subTurn);
        mergedTurn.put(TURN, turnCounter);
        perTurnUsage.add(mergedTurn);
    }
}

// Merge per-model usage: aggregate into existing model data
Object perModelObj = tokenUsageMap.get(PER_MODEL_USAGE);
if (perModelObj instanceof List) {
    List<?> subModels = (List<?>) perModelObj;
    for (Object subModelObj : subModels) {
        if (!(subModelObj instanceof Map)) {
            continue;
        }
        Map<String, Object> subModel = toObjectMap((Map<?, ?>) subModelObj);
        String modelId = getStringFromMap(subModel, MODEL_ID, null);
        if (modelId == null) {
            continue;
        }

        String modelUrl = getStringFromMap(subModel, MODEL_URL, modelId);
        String modelName = getStringFromMap(subModel, MODEL_NAME, modelId);

        TokenUsage subUsage = TokenUsage
            .builder()
            .inputTokens(getLongFromMap(subModel, TokenUsage.INPUT_TOKENS))
            .outputTokens(getLongFromMap(subModel, TokenUsage.OUTPUT_TOKENS))
            .totalTokens(getLongFromMap(subModel, TokenUsage.TOTAL_TOKENS))
            .cacheReadInputTokens(getLongFromMap(subModel, TokenUsage.CACHE_READ_INPUT_TOKENS))
            .cacheCreationInputTokens(getLongFromMap(subModel, TokenUsage.CACHE_CREATION_INPUT_TOKENS))
            .reasoningTokens(getLongFromMap(subModel, TokenUsage.REASONING_TOKENS))
            .build();

        int subCallCount = getIntFromMap(subModel, CALL_COUNT, 1);

        if (!modelMetadataMap.containsKey(modelId)) {
            modelMetadataMap.put(modelId, new ModelMetadata(modelUrl, modelName));
        }

        ModelMetadata modelMetadata = modelMetadataMap.getOrDefault(modelId, new ModelMetadata(modelUrl, modelName));
        perModelUsage
            .computeIfAbsent(modelId, k -> new ModelUsageAggregation(modelMetadata))
            .mergeAggregated(subUsage, subCallCount);
    }
}
Possible Issue

In executePlanningLoop, after a sub-agent call fails (line 826-830), if tokenBudget.isLimited() is true, the code calls tokenTracker.recordReservedTokens(reActAgentId, subAgentBudget). However, subAgentBudget is computed earlier (line 656) and represents the remaining budget at that point. If the sub-agent call fails immediately without consuming tokens, recording the full subAgentBudget as consumed may overcount token usage. This could cause the parent agent to exhaust its budget prematurely. Consider recording zero tokens or a smaller estimate on failure, or document why the full reserved budget is counted even on failure.

    if (tokenBudget.isLimited()) {
        tokenTracker.recordReservedTokens(reActAgentId, subAgentBudget);
    }
    finalListener.onFailure(e);
}));

@GugaGlonti GugaGlonti requested a deployment to ml-commons-cicd-env-require-approval May 11, 2026 10:08 — with GitHub Actions Waiting
@GugaGlonti GugaGlonti requested a deployment to ml-commons-cicd-env-require-approval May 11, 2026 10:08 — with GitHub Actions Waiting
@GugaGlonti GugaGlonti requested a deployment to ml-commons-cicd-env-require-approval May 11, 2026 10:08 — with GitHub Actions Waiting
@GugaGlonti GugaGlonti requested a deployment to ml-commons-cicd-env-require-approval May 11, 2026 10:08 — with GitHub Actions Waiting
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 11, 2026

PR Code Suggestions ✨

Latest suggestions up to 0093fb2

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Clean up resources before response

The token budget exhaustion check occurs after recording token usage from the LLM
response. If the budget is exhausted, cleanUpResource(tools) is called, but the
method continues to execute subsequent code that may attempt to use the cleaned-up
tools. Ensure that the return statement prevents any further execution that depends
on the tools.

ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/agent/MLChatAgentRunner.java [594-625]

 if (tokenBudget.isExhausted(tokenTracker)) {
     long consumed = tokenTracker.getCumulativeTotalTokens();
     log
         .warn(
             "ReAct loop stopped (token budget exhausted). agentId={}, maxTokens={}, consumedTokens={}",
             agentId,
             tokenBudget.getMaxTokens(),
             consumed
         );
+    cleanUpResource(tools);
     sendFinalAnswer(
         sessionId,
         listener,
         question,
         parentInteractionId,
         verbose,
         traceDisabled,
         traceTensors,
         memory,
         traceNumber,
         additionalInfo,
         tokenBudget.exhaustedMessage(consumed),
         usesUnifiedInterface,
         new ArrayList<>(interactions),
         modelProvider,
         tokenTracker,
         tenantId,
         includeTokenUsage,
         AgentTokenBudget.STOP_REASON_BUDGET_EXHAUSTED
     );
-    cleanUpResource(tools);
     return;
 }
Suggestion importance[1-10]: 6

__

Why: The suggestion correctly identifies that cleanUpResource(tools) should be called before sendFinalAnswer to ensure proper resource cleanup order. This improves code maintainability and prevents potential issues with resource management, though the current code already has a return statement preventing further execution.

Low
Improve error message clarity

The validation logic allows negative values to pass through the Long.parseLong()
call before checking if value <= 0. If a very large negative number causes an
overflow or unexpected behavior during parsing, the error message may be misleading.
Consider validating the string format before parsing to provide clearer error
messages.

ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/agent/AgentTokenBudget.java [51-53]

 if (value <= 0) {
-    throw new IllegalArgumentException("max_tokens must be a positive long value");
+    throw new IllegalArgumentException("max_tokens must be a positive long value, got: " + value);
 }
Suggestion importance[1-10]: 4

__

Why: The suggestion improves error message clarity by including the actual value in the exception message. However, the impact is minor since the validation already occurs after parsing, and the improvement is primarily cosmetic.

Low

Previous suggestions

Suggestions up to commit 763a83b
CategorySuggestion                                                                                                                                    Impact
General
Log warning for dropped map entries

The toObjectMap() method silently drops entries with non-String keys. If the input
map contains important data with non-String keys (e.g., from deserialization
issues), this data loss could cause incorrect token tracking. Consider logging a
warning when non-String keys are encountered to aid debugging.

ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/agent/AgentTokenTracker.java [213-221]

 private static Map<String, Object> toObjectMap(Map<?, ?> map) {
     Map<String, Object> objectMap = new HashMap<>();
     for (Map.Entry<?, ?> entry : map.entrySet()) {
         if (entry.getKey() instanceof String) {
             objectMap.put((String) entry.getKey(), entry.getValue());
+        } else {
+            log.warn("Dropping non-String key in token usage map: {}", entry.getKey());
         }
     }
     return objectMap;
 }
Suggestion importance[1-10]: 5

__

Why: Adding logging for dropped non-String keys would improve debugging capabilities when malformed data is encountered. The suggestion correctly identifies that silent data loss could cause issues, and logging would help diagnose such problems. However, this is a defensive improvement rather than fixing a critical bug.

Low
Return sentinel for unlimited budgets

The remaining() method returns Long.MAX_VALUE for unlimited budgets, but callers
like applyRemainingToParams() use this value directly as a token limit. This could
cause LLM requests to fail or behave unexpectedly if providers reject extremely
large token limits. Consider returning a sentinel value that callers explicitly
check, or document that unlimited budgets should skip limit application entirely.

ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/agent/AgentTokenBudget.java [81-86]

 long remaining(long consumedTokens) {
     if (!isLimited()) {
-        return Long.MAX_VALUE;
+        return UNLIMITED;
     }
     return Math.max(0L, maxTokens - consumedTokens);
 }
Suggestion importance[1-10]: 3

__

Why: The suggestion identifies a potential issue where Long.MAX_VALUE might cause problems with LLM providers. However, the applyRemainingToParams method already handles unlimited budgets by checking isLimited() at line 97 and returning early, so the Long.MAX_VALUE is never actually used as a token limit. The suggestion is technically correct but has minimal practical impact.

Low
Record reserved tokens on budget exhaustion

When the sub-agent budget is exhausted (subAgentBudget <= 0), the code returns early
but doesn't record reserved tokens for the sub-agent. This creates an inconsistency:
the error path at line 821 calls recordReservedTokens() when the sub-agent fails,
but the budget exhaustion path doesn't. This could lead to inaccurate token tracking
when budgets are exhausted before sub-agent execution.

ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/agent/MLPlanExecuteAndReflectAgentRunner.java [656-676]

 long subAgentBudget = tokenBudget.remaining(tokenTracker);
 if (tokenBudget.isLimited()) {
     if (subAgentBudget <= 0) {
         boolean includeTokenUsage = Boolean.parseBoolean(allParams.getOrDefault(AgentUtils.INCLUDE_TOKEN_USAGE, "false"));
+        tokenTracker.recordReservedTokens(reActAgentId, 0L);
         saveAndReturnFinalResult(
             memory,
             parentInteractionId,
             allParams.get(EXECUTOR_AGENT_MEMORY_ID_FIELD),
             allParams.get(EXECUTOR_AGENT_PARENT_INTERACTION_ID_FIELD),
             tokenBudget.exhaustedMessage(tokenTracker) + ". Completed steps:\n" + formatCompletedSteps(completedSteps),
             null,
             finalListener,
             tokenTracker,
             allParams.get(TENANT_ID_FIELD),
             includeTokenUsage,
             AgentTokenBudget.STOP_REASON_BUDGET_EXHAUSTED
         );
         return;
     }
     reactParams.put(AgentTokenBudget.MAX_TOKENS_FIELD, Long.toString(subAgentBudget));
 }
Suggestion importance[1-10]: 2

__

Why: The suggestion proposes recording 0 tokens when the budget is exhausted before sub-agent execution. However, recording 0 tokens serves no practical purpose and doesn't improve consistency. The error path at line 827 records subAgentBudget (not 0) when the sub-agent fails, which is different from this scenario where no execution occurs. The suggestion misunderstands the intent of recordReservedTokens.

Low
Suggestions up to commit e502cc1
CategorySuggestion                                                                                                                                    Impact
General
Improve error message clarity

The validation logic allows negative values to pass through the Long.parseLong()
call before checking if value <= 0. If the string represents a very large number
that causes overflow, it could result in a negative value that gets caught here, but
the error message would be misleading. Consider validating the string format before
parsing to provide clearer error messages.

ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/agent/AgentTokenBudget.java [51-53]

 if (value <= 0) {
-    throw new IllegalArgumentException("max_tokens must be a positive long value");
+    throw new IllegalArgumentException("max_tokens must be a positive long value, got: " + value);
 }
Suggestion importance[1-10]: 4

__

Why: The suggestion improves error message clarity by including the actual value in the exception message. However, the impact is minor since the validation logic already works correctly and the original message is reasonably clear.

Low
Record reserved tokens before early return

When the sub-agent budget is exhausted (subAgentBudget <= 0), the code returns early
but doesn't record the reserved tokens. This could lead to inaccurate token
tracking. Consider recording reserved tokens before returning, similar to the error
handler below that calls tokenTracker.recordReservedTokens().

ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/agent/MLPlanExecuteAndReflectAgentRunner.java [656-676]

 long subAgentBudget = tokenBudget.remaining(tokenTracker);
 if (tokenBudget.isLimited()) {
     if (subAgentBudget <= 0) {
+        tokenTracker.recordReservedTokens(reActAgentId, 0L);
         ...
         return;
     }
     reactParams.put(AgentTokenBudget.MAX_TOKENS_FIELD, Long.toString(subAgentBudget));
 }
Suggestion importance[1-10]: 3

__

Why: The suggestion identifies a potential inconsistency in token tracking when the sub-agent budget is exhausted before execution. However, recording 0 tokens may not be semantically meaningful, and the impact on overall token tracking accuracy is minimal since the budget is already exhausted at this point.

Low
Suggestions up to commit af3fc6b
CategorySuggestion                                                                                                                                    Impact
General
Inconsistent state with unanswered tool calls

The budget exhaustion check occurs after token usage is recorded but before tool
execution. If the LLM response includes tool calls, those tools won't execute, but
the assistant message with tool calls is already added to interactions. This creates
inconsistent state where tool calls are present without results.

ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/agent/MLChatAgentRunner.java [594-624]

 if (tokenBudget.isExhausted(tokenTracker)) {
     long consumed = tokenTracker.getCumulativeTotalTokens();
     log
         .warn(
             "ReAct loop stopped (token budget exhausted). agentId={}, maxTokens={}, consumedTokens={}",
             agentId,
             tokenBudget.getMaxTokens(),
             consumed
         );
+    // Remove the assistant message with tool calls to maintain consistency
+    if (!interactions.isEmpty()) {
+        interactions.remove(interactions.size() - 1);
+    }
     sendFinalAnswer(
         sessionId,
         listener,
         question,
         parentInteractionId,
         verbose,
         traceDisabled,
         traceTensors,
         memory,
         traceNumber,
         additionalInfo,
         tokenBudget.exhaustedMessage(consumed),
         usesUnifiedInterface,
         new ArrayList<>(interactions),
         modelProvider,
         tokenTracker,
         tenantId,
         includeTokenUsage
     );
     cleanUpResource(tools);
     return;
 }
Suggestion importance[1-10]: 7

__

Why: This identifies a legitimate issue where the assistant message containing tool calls is added to interactions before the budget check. If the budget is exhausted, the tool calls won't execute, leaving the conversation history in an inconsistent state. Removing the last message before sending the final answer would maintain consistency between the message history and actual execution.

Medium
Silent failure when budget exhausted

When remaining <= 0, the method silently returns without setting any limit
parameter. This could allow an LLM request to proceed without token constraints.
Consider throwing an exception or logging a warning to prevent unexpected behavior.

ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/agent/AgentTokenBudget.java [94-102]

 void applyRemainingToParams(Map<String, String> targetParams, Map<String, String> llmSpecParams, long consumedTokens) {
     if (!isLimited() || targetParams == null) {
         return;
     }
 
     long remaining = remaining(consumedTokens);
     if (remaining <= 0) {
+        log.warn("Cannot apply token budget: no tokens remaining (consumed={}, max={})", consumedTokens, maxTokens);
         return;
     }
     ...
 }
Suggestion importance[1-10]: 4

__

Why: Adding a warning log when remaining <= 0 would improve observability. However, this scenario should be prevented by the isExhausted checks that occur before applyRemainingToParams is called. The silent return is acceptable as a defensive measure, but logging would help identify if the exhaustion checks are bypassed.

Low
Off-by-one error in budget exhaustion

The exhaustion check uses >= which triggers when exactly at the limit. This prevents
the final token from being used. Consider using > to allow full budget utilization
before stopping execution.

ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/agent/AgentTokenBudget.java [67-69]

 boolean isExhausted(long consumedTokens) {
-    return isLimited() && consumedTokens >= maxTokens;
+    return isLimited() && consumedTokens > maxTokens;
 }
Suggestion importance[1-10]: 3

__

Why: The suggestion raises a valid point about the >= vs > comparison, but the current implementation is actually correct for budget enforcement. Using >= ensures execution stops when the budget is reached or exceeded, which is the expected behavior for a hard limit. Using > would allow one token over budget, which contradicts the purpose of having a strict limit.

Low
Suggestions up to commit 8fdeac5
CategorySuggestion                                                                                                                                    Impact
General
Clean up resources before early return

The budget exhaustion check occurs after token usage is recorded but before tool
execution. If the LLM response includes tool calls, those tools are never executed,
potentially leaving the agent in an inconsistent state. Consider checking budget
before the LLM call or handling tool cleanup explicitly.

ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/agent/MLChatAgentRunner.java [594-624]

 if (tokenBudget.isExhausted(tokenTracker)) {
     long consumed = tokenTracker.getCumulativeTotalTokens();
     log
         .warn(
             "ReAct loop stopped (token budget exhausted). agentId={}, maxTokens={}, consumedTokens={}",
             agentId,
             tokenBudget.getMaxTokens(),
             consumed
         );
+    // Clean up tools before sending final answer to prevent resource leaks
+    cleanUpResource(tools);
     sendFinalAnswer(
         sessionId,
         listener,
         question,
         parentInteractionId,
         verbose,
         traceDisabled,
         traceTensors,
         memory,
         traceNumber,
         additionalInfo,
         tokenBudget.exhaustedMessage(consumed),
         usesUnifiedInterface,
         new ArrayList<>(interactions),
         modelProvider,
         tokenTracker,
         tenantId,
         includeTokenUsage
     );
-    cleanUpResource(tools);
     return;
 }
Suggestion importance[1-10]: 6

__

Why: The suggestion correctly identifies that cleanUpResource(tools) is called after sendFinalAnswer(), which could be problematic if sendFinalAnswer() throws an exception. Moving cleanup before sending the response ensures resources are released even if the response fails. However, the impact is moderate since the current order still executes cleanup in the normal path.

Low
Possible issue
Fix off-by-one in budget exhaustion check

The exhaustion check uses >= which treats exact budget match as exhausted. This
prevents using the full allocated budget. Change to > to allow consuming exactly
maxTokens before stopping.

ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/agent/AgentTokenBudget.java [67-69]

 boolean isExhausted(long consumedTokens) {
-    return isLimited() && consumedTokens >= maxTokens;
+    return isLimited() && consumedTokens > maxTokens;
 }
Suggestion importance[1-10]: 3

__

Why: The suggestion argues that >= prevents using the full budget, but this is a design choice. Using >= means "exhausted when consumed equals or exceeds budget," which is reasonable. The current implementation allows consuming up to maxTokens tokens before stopping, which is correct behavior. The suggestion would allow exceeding the budget by 1 token, which contradicts the budget constraint.

Low
Prevent LLM requests when budget exhausted

When remaining <= 0, the method silently returns without setting any limit
parameter. This allows LLM requests to proceed without token constraints even when
budget is exhausted. Either throw an exception or set a minimal positive limit to
prevent unbounded requests.

ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/agent/AgentTokenBudget.java [94-102]

 void applyRemainingToParams(Map<String, String> targetParams, Map<String, String> llmSpecParams, long consumedTokens) {
     if (!isLimited() || targetParams == null) {
         return;
     }
 
     long remaining = remaining(consumedTokens);
     if (remaining <= 0) {
-        return;
+        throw new IllegalStateException("Cannot apply token limit: budget exhausted");
     }
     ...
 }
Suggestion importance[1-10]: 2

__

Why: The method applyRemainingToParams is called after budget checks in the calling code (e.g., lines 865-888, 1606-1609 in MLChatAgentRunner.java). When remaining <= 0, returning silently is correct because the caller should have already checked isExhausted() and stopped execution. Throwing an exception here would be redundant and could break the control flow. The suggestion misunderstands the method's role as a helper, not a guard.

Low

@GugaGlonti GugaGlonti force-pushed the feature/agent-level-token-limit branch from 8fdeac5 to af3fc6b Compare May 11, 2026 11:17
@GugaGlonti GugaGlonti temporarily deployed to ml-commons-cicd-env-require-approval May 11, 2026 11:19 — with GitHub Actions Inactive
@GugaGlonti GugaGlonti temporarily deployed to ml-commons-cicd-env-require-approval May 11, 2026 11:19 — with GitHub Actions Inactive
@GugaGlonti GugaGlonti had a problem deploying to ml-commons-cicd-env-require-approval May 11, 2026 11:19 — with GitHub Actions Failure
@GugaGlonti GugaGlonti had a problem deploying to ml-commons-cicd-env-require-approval May 11, 2026 11:19 — with GitHub Actions Error
@github-actions
Copy link
Copy Markdown

Persistent review updated to latest commit af3fc6b

@codecov
Copy link
Copy Markdown

codecov Bot commented May 11, 2026

Codecov Report

❌ Patch coverage is 62.31156% with 75 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.47%. Comparing base (12f884e) to head (af3fc6b).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
...thms/agent/MLPlanExecuteAndReflectAgentRunner.java 39.58% 25 Missing and 4 partials ⚠️
...h/ml/engine/algorithms/agent/AgentTokenBudget.java 76.92% 8 Missing and 10 partials ⚠️
.../ml/engine/algorithms/agent/MLChatAgentRunner.java 50.00% 14 Missing and 3 partials ⚠️
...l/engine/algorithms/agent/MLChatAgentRunnerV2.java 63.33% 8 Missing and 3 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #4820      +/-   ##
============================================
+ Coverage     77.42%   77.47%   +0.05%     
- Complexity    11907    11973      +66     
============================================
  Files           963      964       +1     
  Lines         53326    53519     +193     
  Branches       6503     6532      +29     
============================================
+ Hits          41285    41462     +177     
- Misses         9289     9295       +6     
- Partials       2752     2762      +10     
Flag Coverage Δ
ml-commons 77.47% <62.31%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@GugaGlonti GugaGlonti temporarily deployed to ml-commons-cicd-env-require-approval May 11, 2026 18:55 — with GitHub Actions Inactive
@GugaGlonti GugaGlonti temporarily deployed to ml-commons-cicd-env-require-approval May 11, 2026 18:55 — with GitHub Actions Inactive
Comment on lines +177 to +191
private static String outputLimitField(Map<String, String> params) {
String llmInterface = params == null ? null : params.get(MLChatAgentRunner.LLM_INTERFACE);
if (llmInterface == null) {
return MAX_TOKENS_FIELD;
}

String normalized = llmInterface.trim().toLowerCase(Locale.ROOT);
if (normalized.startsWith(LLM_INTERFACE_BEDROCK_CONVERSE) || (normalized.contains("bedrock") && normalized.contains("converse"))) {
return BEDROCK_MAX_TOKENS_FIELD;
}
if (normalized.startsWith(LLM_INTERFACE_GEMINI_V1BETA_GENERATE_CONTENT) || normalized.contains("gemini")) {
return GEMINI_MAX_OUTPUT_TOKENS_FIELD;
}
return MAX_TOKENS_FIELD;
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking — provider detection silently fails open.

Two concerns:

  1. LLM_INTERFACE_BEDROCK_CONVERSE = "bedrock/converse" is already a strict prefix of every legitimate Bedrock-converse interface (bedrock/converse/claude, bedrock/converse/deepseek_r1, fully-qualified model IDs). The contains("bedrock") && contains("converse") fallback adds no real coverage but silently accepts anything containing both substrings (incl. user-supplied junk). Same for contains("gemini").

  2. The bigger issue: when _llm_interface is missing from targetParams, this returns MAX_TOKENS_FIELD regardless of provider. For Bedrock-converse connectors that read ${parameters.maxTokens} from their request template, the budget clamp will be written to a field the connector ignores — so the per-call cap is silently not enforced. Only the post-call exhaustion check stops the loop, which defeats half of the budget mechanism.

Fix options (any of):

  • Detect provider from mlAgent.getModel().getModelProvider() instead of an interface string.
  • Drop the substring fallback and fail loudly when the interface is unrecognized.
  • Write all three candidate field names (max_tokens, maxTokens, maxOutputTokens) — overhead is trivial and removes the brittle string match entirely.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Removed provider/interface detection for output-limit clamping. The effective clamp is now written to max_tokens, maxTokens, and maxOutputTokens.

Comment on lines +161 to +175
private static Long parsePositiveLong(Map<String, String> params, String field) {
if (params == null) {
return null;
}
String value = params.get(field);
if (value == null) {
return null;
}
try {
long parsed = Long.parseLong(value);
return parsed > 0 ? parsed : null;
} catch (NumberFormatException e) {
return null;
}
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking — 0 is treated as if the field were absent, then overwritten.

parsePositiveLong returns null for any non-positive value, so an existing max_tokens=0 (which several connectors interpret as "no per-call cap") is treated as missing — and applyRemainingToParams then overwrites it with remaining. That changes the connector's behavior for users who explicitly set 0.

If the intent is "ignore non-positive values", document it; otherwise treat 0 as "force unlimited per-call" and skip the clamp. Today the behavior is implicit and surprising.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional. Non-positive output-limit values are not treated as valid smaller caps; an active agent-level budget still clamps the outgoing request to the remaining budget.

Comment on lines +94 to +111
void applyRemainingToParams(Map<String, String> targetParams, Map<String, String> llmSpecParams, long consumedTokens) {
if (!isLimited() || targetParams == null) {
return;
}

long remaining = remaining(consumedTokens);
if (remaining <= 0) {
return;
}

long effectiveLimit = remaining;
Long existingLimit = smallestPositiveOutputLimit(targetParams, llmSpecParams);
if (existingLimit != null) {
effectiveLimit = Math.min(existingLimit, remaining);
}

targetParams.put(outputLimitField(targetParams), Long.toString(effectiveLimit));
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: when existingLimit (the smallest positive across targetParams/llmSpecParams) wins, this overwrites targetParams[outputLimitField]. The test testApplyRemainingToParams_PreservesSmallerExistingLimit asserts the smaller value ends up in params — which works, but the name "preserves" is misleading: we're rewriting the user's targetParams.max_tokens=80 with 30 (or with min(existing, remaining)). Consider renaming the test or adding a comment explaining the semantic.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Renamed the test to describe clamping to the smaller effective limit.

Comment on lines +243 to +268
if (tokenBudget.isExhausted(accumulatedTokenUsage[0])) {
long consumed = AgentTokenBudget.consumedTokens(accumulatedTokenUsage[0]);
log
.warn(
"ReAct loop stopped (token budget exhausted). agentId={}, maxTokens={}, consumedTokens={}",
agentId,
tokenBudget.getMaxTokens(),
consumed
);
List<Message> toolInteractionMessages = toolInteractionJsonList.isEmpty()
? null
: parseToolInteractionsForPersistence(toolInteractionJsonList, modelProvider);
Message resultMessage = toolCalls == null || toolCalls.isEmpty()
? assistantMessage
: createBudgetExhaustedMessage(tokenBudget.exhaustedMessage(consumed));
listener
.onResponse(
new AgentLogicResult(
resultMessage,
STOP_REASON_MAX_TOKENS,
accumulatedTokenUsage[0],
toolInteractionMessages
)
);
return;
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking — persisted memory left in inconsistent state.

messages.add(assistantMessage) ran on line 241, before this exhaustion check. So if the LLM returned tool_calls and the budget is exhausted here, the local messages list (and therefore the persisted memory built from it on the parent's success path) contains a tool-calling assistant turn followed by — nothing. No tool result, no follow-up.

The next time this conversation is loaded, the model sees a dangling tool_use block. Anthropic-style providers return a 400 ("each tool_use must be followed by a tool_result"). OpenAI is more lenient but still surprising.

Fix options:

  • Move the budget check above messages.add(assistantMessage) (line 241) so the orphan turn is never added.
  • Drop the trailing tool-calling turn from messages on this branch.
  • Synthesize a tool_result content block with the budget-exhausted text so the conversation is well-formed.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. V2 now checks budget exhaustion before adding the assistant tool-call message to messages, so exhausted tool-call turns are not persisted without tool results.


// Chat-specific constants
private static final String STOP_REASON_MAX_ITERATIONS = "max_iterations";
private static final String STOP_REASON_MAX_TOKENS = "max_tokens";
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High-priority — stop-reason name collides with provider-side stop reasons.

STOP_REASON_MAX_TOKENS = "max_tokens" overlaps with the upstream Bedrock provider's own max_tokens stop reason (see BedrockStreamingHandler.java:497). Any downstream code keying off stopReason to discriminate "model hit its own max_tokens" vs "agent budget exhausted" will be unable to tell them apart.

Suggest agent_max_tokens_budget or budget_exhausted for the agent-side reason.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Replaced agent-side "max_tokens" with "budget_exhausted".

Comment on lines +1122 to +1132
}
}

/**
* Get agent-level token budget from parameters.
* Returns -1 if no budget is configured (unlimited).
*
* @param params Execution parameters (merged agent + runtime params)
* @return Maximum total tokens allowed for the agent execution, or -1 for unlimited
*/
private long getMaxTokensBudget(Map<String, String> params) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking — dead code. Mirrors the same unused method in MLPlanExecuteAndReflectAgentRunner.java:1046. No callers in the production code; please delete.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Removed the unused helpers from AbstractV2, V1, and PER, along with the false-coverage tests.

Comment on lines +1509 to +1513
if (tokenBudget.isExhausted(tokenTracker)) {
String fallbackResponse = buildMaxIterationsFallbackResponse(maxIterations, lastThought);
responseListener.onResponse(fallbackResponse);
return;
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking — wrong message when budget exhausted at max-iterations time.

When both "max iterations reached" and "budget exhausted" trigger on the same final iteration, this returns the max-iterations fallback string ("Max iterations reached…") even though the budget is the actual cause. Every other budget-exhaustion site in the PR returns tokenBudget.exhaustedMessage(...), so the user-visible reason is now inconsistent across paths.

Use tokenBudget.exhaustedMessage(tokenTracker) here, or compose both reasons ("Max iterations reached AND token budget exhausted (X/Y)").

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. If V1 hits max iterations and budget exhaustion together, it now returns the budget-exhausted message and structured stop_reason=budget_exhausted.

parentInteractionId,
allParams.get(EXECUTOR_AGENT_MEMORY_ID_FIELD),
allParams.get(EXECUTOR_AGENT_PARENT_INTERACTION_ID_FIELD),
tokenBudget.exhaustedMessage(tokenTracker) + ". Completed steps: " + completedSteps,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High-priority — List.toString() leaks [a, b, c] into user-visible output.

completedSteps is a List<String>, and ". Completed steps: " + completedSteps invokes List.toString(), producing "[step1, step2, step3]" with the surrounding brackets and , separators. That renders as garbage in chat UIs.

Same pattern appears at lines ~596, ~763, ~1081, ~1123. Either format as a numbered list (String.join("\n", completedSteps)) or move the steps into trace tensors / additionalInfo and keep the headline message clean.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Completed steps are now newline-joined instead of using Java List.toString() formatting.

Comment on lines +221 to +230
public long getCumulativeTotalTokens() {
long total = 0;
for (ModelUsageAggregation agg : perModelUsage.values()) {
TokenUsage usage = agg.getAggregatedUsage();
Long effectiveTotal = usage.getEffectiveTotalTokens();
if (effectiveTotal != null) {
total += effectiveTotal;
}
}
return total;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: getEffectiveTotalTokens() returns null when only one of inputTokens/outputTokens is set, so this method silently treats partial data as zero. For providers that report only input tokens (rare), the budget would never trigger.

Consider summing (inputTokens != null ? inputTokens : 0) + (outputTokens != null ? outputTokens : 0) when at least one is non-null. Not blocking — just worth flagging since the budget assumes getEffectiveTotalTokens reflects reality.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. getCumulativeTotalTokens() now counts input-only or output-only token data when total_tokens is absent.

AgentTokenTracker tokenTracker,
FunctionCalling functionCalling
FunctionCalling functionCalling,
AgentTokenBudget tokenBudget
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: setupMemoryAndRun, setToolsAndRunAgent, and executePlanningLoop are now 9–11 parameters each after this PR. Consider extracting a small PerExecutionContext (carrying tokenTracker, tokenBudget, functionCalling, memory, conversationId) before signature creep gets worse. Not blocking.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged. I left the PerExecutionContext extraction out of this PR to avoid expanding the scope beyond the review fixes.

@github-actions
Copy link
Copy Markdown

Persistent review updated to latest commit e502cc1

@GugaGlonti GugaGlonti requested a deployment to ml-commons-cicd-env-require-approval May 18, 2026 07:56 — with GitHub Actions Waiting
@GugaGlonti GugaGlonti requested a deployment to ml-commons-cicd-env-require-approval May 18, 2026 07:56 — with GitHub Actions Waiting
@GugaGlonti GugaGlonti requested a deployment to ml-commons-cicd-env-require-approval May 18, 2026 07:56 — with GitHub Actions Waiting
@GugaGlonti GugaGlonti requested a deployment to ml-commons-cicd-env-require-approval May 18, 2026 07:56 — with GitHub Actions Waiting
@GugaGlonti GugaGlonti force-pushed the feature/agent-level-token-limit branch from e502cc1 to 763a83b Compare May 18, 2026 09:49
@GugaGlonti GugaGlonti temporarily deployed to ml-commons-cicd-env-require-approval May 18, 2026 09:51 — with GitHub Actions Inactive
@GugaGlonti GugaGlonti had a problem deploying to ml-commons-cicd-env-require-approval May 18, 2026 09:51 — with GitHub Actions Error
@GugaGlonti GugaGlonti had a problem deploying to ml-commons-cicd-env-require-approval May 18, 2026 09:51 — with GitHub Actions Failure
@GugaGlonti GugaGlonti temporarily deployed to ml-commons-cicd-env-require-approval May 18, 2026 09:51 — with GitHub Actions Inactive
@github-actions
Copy link
Copy Markdown

Persistent review updated to latest commit 763a83b

ktarlov and others added 7 commits May 26, 2026 10:42
…ment a max_tokens parameter at the agent level that enforces a hard upper bound on total token usage across all LLM calls within a single agent execution. Changes: - AbstractV2AgentRunner: Add getMaxTokensBudget() helper method - MLChatAgentRunnerV2: Add token budget check after each LLM call with STOP_REASON_MAX_TOKENS stop reason - MLChatAgentRunner (V1): Add token budget check in ReAct loop after tokenTracker.recordTurn() - MLPlanExecuteAndReflectAgentRunner: Add token budget check in planning loop after planner LLM response - AgentTokenTracker: Add getCumulativeTotalTokens() method for budget enforcement The agent execution loop dynamically checks cumulative token usage after each LLM call. If the budget is exhausted, execution terminates gracefully with an appropriate message and stop reason. Usage: { "parameters": { "max_tokens": 10000 } } Resolves opensearch-project#4728 Signed-off-by: Kamen Tarlov

Signed-off-by: Kamen Tarlov <kamen.tarlov@itgix.com>
Signed-off-by: Guga Glonti <110097877+GugaGlonti@users.noreply.github.com>
- Introduced AgentTokenBudget class to encapsulate token budget logic.
- Updated MLChatAgentRunner, MLChatAgentRunnerV2, and MLPlanExecuteAndReflectAgentRunner to utilize AgentTokenBudget for token management.
- Removed redundant max_tokens checks and streamlined token exhaustion handling.
- Enhanced token budget enforcement in agent execution loops.

Signed-off-by: Guga Glonti <110097877+GugaGlonti@users.noreply.github.com>
- Implement tests to verify token budget behavior in various scenarios, including initial budget allocation, remaining budget application, and exhaustion handling.
- Ensure correct parameter adjustments for different LLM interfaces based on token usage.

Signed-off-by: Guga Glonti <110097877+GugaGlonti@users.noreply.github.com>
Signed-off-by: Guga Glonti <110097877+GugaGlonti@users.noreply.github.com>
Signed-off-by: Guga Glonti <110097877+GugaGlonti@users.noreply.github.com>
Signed-off-by: Guga Glonti <110097877+GugaGlonti@users.noreply.github.com>
Signed-off-by: Guga Glonti <110097877+GugaGlonti@users.noreply.github.com>
@rithinpullela rithinpullela force-pushed the feature/agent-level-token-limit branch from 763a83b to 0093fb2 Compare May 26, 2026 17:42
@github-actions
Copy link
Copy Markdown

Persistent review updated to latest commit 0093fb2

@rithinpullela rithinpullela had a problem deploying to ml-commons-cicd-env-require-approval May 26, 2026 17:44 — with GitHub Actions Failure
@rithinpullela rithinpullela temporarily deployed to ml-commons-cicd-env-require-approval May 26, 2026 17:44 — with GitHub Actions Inactive
@rithinpullela rithinpullela temporarily deployed to ml-commons-cicd-env-require-approval May 26, 2026 17:44 — with GitHub Actions Inactive
@rithinpullela rithinpullela temporarily deployed to ml-commons-cicd-env-require-approval May 26, 2026 17:44 — with GitHub Actions Inactive
Copy link
Copy Markdown
Collaborator

@rithinpullela rithinpullela left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing all the comments @GugaGlonti, the PR looks good to me to merge except for parseToolInteractionsForPersistence part in MLChatAgentRunnerV2.java, can you please take a look

Comment on lines +179 to +193
if (tokenBudget.isExhausted(consumedBeforeCall)) {
List<Message> toolInteractionMessages = toolInteractionJsonList.isEmpty()
? null
: parseToolInteractionsForPersistence(toolInteractionJsonList, modelProvider);
listener
.onResponse(
new AgentLogicResult(
createBudgetExhaustedMessage(tokenBudget.exhaustedMessage(consumedBeforeCall)),
STOP_REASON_MAX_TOKENS,
accumulatedTokenUsage[0],
toolInteractionMessages
)
);
return;
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this issue has not been addressed yet, can you address this? Should be a small fix

Comment on lines +1629 to +1633
private String buildMaxIterationsFallbackResponse(int maxIterations, AtomicReference<String> lastThought) {
return (lastThought.get() != null && !lastThought.get().isEmpty() && !"null".equals(lastThought.get()))
? String.format("%s. Last thought: %s", String.format(MAX_ITERATIONS_MESSAGE, maxIterations), lastThought.get())
: String.format(MAX_ITERATIONS_MESSAGE, maxIterations);
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking — dead helper, never called.

This private method is defined but has zero call sites. The same expression is still inlined at lines 1614-1619 inside the e -> { ... } block of generateLLMSummary's listener.

Either replace the inline expression at 1614-1619 with responseListener.onResponse(buildMaxIterationsFallbackResponse(maxIterations, lastThought));, or delete this helper. As-is, it's dead code that will rot.

$ grep -n buildMaxIterationsFallbackResponse MLChatAgentRunner.java
1629:    private String buildMaxIterationsFallbackResponse(...)

(only the definition; no callers.)

Comment on lines 822 to 829
}, e -> {
String agentId = allParams.getOrDefault(AGENT_ID_LOG_FIELD, "unknown");
String tenantIdLog = allParams.get(TENANT_ID_FIELD);
log.error("Failed to execute ReAct agent. agentId={}, tenantId={}", agentId, tenantIdLog, e);
if (tokenBudget.isLimited()) {
tokenTracker.recordReservedTokens(reActAgentId, subAgentBudget);
}
finalListener.onFailure(e);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High-priority — recordReservedTokens mutation is dead before onFailure.

recordReservedTokens mutates tokenTracker, but the very next line invokes finalListener.onFailure(e), terminating the agent run. Nothing else reads the tracker on this path, so the reservation is silently discarded.

Two possible intents — neither matches the current code:

  1. Pre-reservation against runaway retries: then the reservation must happen before dispatching the executor at line 691 (client.execute(MLExecuteTaskAction.INSTANCE, ...)) and be debited back when the actual sub-agent usage merges in via mergeSubAgentUsage (line 710). As written, it's executed too late to matter.
  2. Telemetry on the failed call: then the merged tokenTracker.toOutputMap() needs to actually flow into the failure response — but finalListener.onFailure(e) carries only the exception.

Please clarify intent and adjust accordingly, or delete the call. There's also no test for this path — MLPlanExecuteAndReflectAgentRunnerTest does not exercise the executor onFailure branch.

Comment on lines +241 to +266
if (tokenBudget.isExhausted(accumulatedTokenUsage[0])) {
long consumed = AgentTokenBudget.consumedTokens(accumulatedTokenUsage[0]);
log
.warn(
"ReAct loop stopped (token budget exhausted). agentId={}, maxTokens={}, consumedTokens={}",
agentId,
tokenBudget.getMaxTokens(),
consumed
);
List<Message> toolInteractionMessages = toolInteractionJsonList.isEmpty()
? null
: parseToolInteractionsForPersistence(toolInteractionJsonList, modelProvider);
Message resultMessage = toolCalls == null || toolCalls.isEmpty()
? assistantMessage
: createBudgetExhaustedMessage(tokenBudget.exhaustedMessage(consumed));
listener
.onResponse(
new AgentLogicResult(
resultMessage,
AgentTokenBudget.STOP_REASON_BUDGET_EXHAUSTED,
accumulatedTokenUsage[0],
toolInteractionMessages
)
);
return;
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High-priority — stop_reason mis-classifies natural finish on the over-budget call.

When the LLM returns a complete, tool-call-free response that just happens to push cumulative usage past the budget, this branch (correctly) surfaces the model's actual final answer (assistantMessage) — but tags the result stop_reason=budget_exhausted. From the caller's perspective this is indistinguishable from "we cut you off mid-thought," but in this case the model finished cleanly.

Consider:

String stopReason = (toolCalls == null || toolCalls.isEmpty())
    ? STOP_REASON_END_TURN
    : AgentTokenBudget.STOP_REASON_BUDGET_EXHAUSTED;

If budget_exhausted is desired regardless (e.g., to signal "do not continue"), document the semantic explicitly. The existing testExecuteAgentLogic_TokenBudgetExhausted_StopsExecution only exercises the tool-call path, so this natural-finish branch has no direct coverage.

Comment on lines +497 to 513
if (tokenBudget.isExhausted(tokenTracker)) {
boolean includeTokenUsage = Boolean.parseBoolean(allParams.getOrDefault(AgentUtils.INCLUDE_TOKEN_USAGE, "false"));
saveAndReturnFinalResult(
memory,
parentInteractionId,
allParams.get(EXECUTOR_AGENT_MEMORY_ID_FIELD),
allParams.get(EXECUTOR_AGENT_PARENT_INTERACTION_ID_FIELD),
tokenBudget.exhaustedMessage(tokenTracker) + ". Completed steps:\n" + formatCompletedSteps(completedSteps),
null,
finalListener,
tokenTracker,
allParams.get(TENANT_ID_FIELD),
includeTokenUsage,
AgentTokenBudget.STOP_REASON_BUDGET_EXHAUSTED
);
return;
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — duplicated boilerplate (4 sites in this file).

The same 11-line "budget exhausted → saveAndReturnFinalResult" block appears four times in this file: here (497-513), 582-605, 659-674, and 778-792. Each computes the same includeTokenUsage boolean, the same message string tokenBudget.exhaustedMessage(...) + ". Completed steps:\n" + formatCompletedSteps(completedSteps), and calls saveAndReturnFinalResult with STOP_REASON_BUDGET_EXHAUSTED.

Extract a helper:

private void returnBudgetExhausted(
    Memory memory, String parentInteractionId, Map<String, String> allParams,
    String exhaustedMessage, List<String> completedSteps,
    AgentTokenTracker tokenTracker, ActionListener<Object> finalListener
) {
    boolean includeTokenUsage = Boolean.parseBoolean(allParams.getOrDefault(AgentUtils.INCLUDE_TOKEN_USAGE, "false"));
    saveAndReturnFinalResult(
        memory, parentInteractionId,
        allParams.get(EXECUTOR_AGENT_MEMORY_ID_FIELD),
        allParams.get(EXECUTOR_AGENT_PARENT_INTERACTION_ID_FIELD),
        exhaustedMessage + ". Completed steps:\n" + formatCompletedSteps(completedSteps),
        null, finalListener, tokenTracker,
        allParams.get(TENANT_ID_FIELD), includeTokenUsage,
        AgentTokenBudget.STOP_REASON_BUDGET_EXHAUSTED
    );
}

Drops ~40 LOC and removes the risk of one site drifting from the others.

Map<String, Object> dataAsMap = new HashMap<>();
dataAsMap.put(RESPONSE_FIELD, finalResult);
if (stopReason != null) {
dataAsMap.put("stop_reason", stopReason);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — magic string.

AgentTokenBudget.STOP_REASON_BUDGET_EXHAUSTED is a constant, but the field name "stop_reason" is hard-coded both here and at MLChatAgentRunner.java:1524. Promote to a constant (e.g., AgentTokenBudget.STOP_REASON_FIELD = "stop_reason") so a future rename doesn't drift between the two callers.

dataAsMap.put("response", finalAnswer);
dataAsMap.put(ADDITIONAL_INFO_FIELD, additionalInfo);
if (stopReason != null) {
dataAsMap.put("stop_reason", stopReason);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — magic string (paired with MLPlanExecuteAndReflectAgentRunner.java:1072).

Same "stop_reason" literal appears in both files. Promote to a constant and reference it from both call sites.

Comment on lines +96 to +116
void applyRemainingToParams(Map<String, String> targetParams, Map<String, String> llmSpecParams, long consumedTokens) {
if (!isLimited() || targetParams == null) {
return;
}

long remaining = remaining(consumedTokens);
if (remaining <= 0) {
return;
}

long effectiveLimit = remaining;
Long existingLimit = smallestPositiveOutputLimit(targetParams, llmSpecParams);
if (existingLimit != null) {
effectiveLimit = Math.min(existingLimit, remaining);
}

String effectiveLimitString = Long.toString(effectiveLimit);
for (String field : OUTPUT_LIMIT_FIELDS) {
targetParams.put(field, effectiveLimitString);
}
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — input-prompt tokens are not pre-checked, only output-cap is clamped.

This only writes the three output-cap fields (max_tokens, maxTokens, maxOutputTokens). The user-facing max_tokens is documented as "the total agent execution budget" (input + output across all calls), but on the next LLM call we cannot know the input-prompt token count until after the response. A large prompt can therefore push cumulative usage well past the budget before the post-call check at MLChatAgentRunner.java:594 catches it.

This isn't broken — the post-call check still fires — but the user could see substantially more than max_tokens actually charged on a single oversized prompt. Please document this semantic explicitly in the class doc, e.g.:

max_tokens is enforced post-hoc per LLM call; a single in-flight call can exceed the remaining budget by up to its input prompt size. Output limits are clamped pre-call.

(The PR description should probably say this too.)

)
);

addSteps(completedSteps, allParams, COMPLETED_STEPS_FIELD);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — addSteps reordering deserves a comment.

This addSteps(...) call was previously after saveTraceData(...); the PR moves it before both saveTraceData and the new budget-check so that formatCompletedSteps(completedSteps) in the budget-exhausted message includes the step that just finished.

Functionally appears safe (no caller of saveTraceData reads allParams[COMPLETED_STEPS_FIELD]), but please add a one-line comment documenting why this must precede the budget check — otherwise a future cleanup is likely to reorder it back. Something like:

// Must precede the budget check below so the just-finished step is included in the exhaust message.
addSteps(completedSteps, allParams, COMPLETED_STEPS_FIELD);

Comment on lines +265 to +276
private static long effectiveTotalTokens(TokenUsage usage) {
Long totalTokens = usage.getTotalTokens();
if (totalTokens != null) {
return totalTokens;
}
Long inputTokens = usage.getInputTokens();
Long outputTokens = usage.getOutputTokens();
if (inputTokens != null || outputTokens != null) {
return (inputTokens != null ? inputTokens : 0L) + (outputTokens != null ? outputTokens : 0L);
}
return 0L;
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style/nit — possible duplication of TokenUsage.getEffectiveTotalTokens().

AgentTokenBudget.consumedTokens(TokenUsage) already delegates to usage.getEffectiveTotalTokens() (per AgentTokenBudget.java:142). If TokenUsage.getEffectiveTotalTokens() does the same fallback (totalTokensinput + output0), this private method is duplicated logic — just call it directly and remove this helper. If it does not do the same thing, the divergence is itself a bug worth resolving.

if (tokenCount <= 0) {
return;
}
String effectiveModelId = modelId != null ? modelId : "unknown";
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit — magic string "unknown".

If this constant appears anywhere else in the codebase (likely it does — telemetry pipelines often have a canonical "unknown model" sentinel), promote to a shared constant. Also worth pausing on: is silently dropping the reservation (when modelId is null) a better choice than emitting an "unknown" per-model row that downstream consumers may not expect? Tied to H1 — this method's correct call-site is unclear.

@rithinpullela rithinpullela requested a deployment to ml-commons-cicd-env-require-approval May 28, 2026 03:49 — with GitHub Actions Waiting
@rithinpullela rithinpullela requested a deployment to ml-commons-cicd-env-require-approval May 28, 2026 03:49 — with GitHub Actions Waiting
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE] Agent-level Token Limit parameter

4 participants