opensearch-project · GugaGlonti · May 21, 2026
@@ -238,6 +238,10 @@ AG-UI agents support token usage tracking, which provides detailed metrics about
 
 For AG-UI agents, token usage tracking is enabled during agent registration by setting `"include_token_usage": true` in the `parameters` field. This applies to both the unified registration method (new interface) and the regular registration method (old interface). This setting must be provided at registration time; it cannot be changed during agent execution.
 
+### Limiting token usage
+
+To limit total token consumption for one AG-UI agent execution, set `parameters.max_tokens` to a positive integer. OpenSearch treats this value as an agent-level budget for the agent runner's LLM calls in the streaming run. Before each covered LLM call, the runner caps the outgoing model request to the remaining budget while preserving any lower per-call model limit; when reported usage exhausts the budget, the agent stops instead of making another covered LLM call. Token limiting does not require `include_token_usage` to be `true`; that parameter controls only whether detailed usage metrics are returned in the response. Omitting `max_tokens` leaves the execution unlimited; invalid, zero, or negative values are rejected. LLM calls made internally by tools, such as `AgentTool` or `MLModelTool`, are outside this budget.
+
 ### Enabling token usage tracking during registration (unified method)
 
 To enable token usage tracking for an AG-UI agent using the unified registration method, include the `include_token_usage` parameter in the `parameters` field during registration:

@@ -301,6 +301,10 @@ Conversational agents support token usage tracking, which provides detailed metr
 
 To enable token usage tracking, set the `include_token_usage` parameter to `true` when executing the agent. The response will include a `token_usage` output with per-turn and per-model aggregated metrics. For detailed information about token usage fields and how tokens are calculated by different model providers, see [Tracking token usage]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/agent-apis/execute-agent/#tracking-token-usage) in the Execute Agent API documentation.
 
+### Limiting token usage
+
+To limit total token consumption for one conversational agent execution, set `parameters.max_tokens` to a positive integer. OpenSearch treats this value as an agent-level budget for the agent runner's LLM calls, including follow-up calls after tool results. Before each covered LLM call, the runner caps the outgoing model request to the remaining budget while preserving any lower per-call model limit; when reported usage exhausts the budget, the agent stops instead of making another covered LLM call. Token limiting does not require `include_token_usage` to be `true`; that parameter controls only whether detailed usage metrics are returned in the response. Omitting `max_tokens` leaves the execution unlimited; invalid, zero, or negative values are rejected. LLM calls made internally by tools, such as `AgentTool` or `MLModelTool`, are outside this budget.
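+
+The following request is a minimal sketch of this setting, assuming a conversational agent that is already registered. The `<agent_id>` placeholder, the question text, and the budget value of `2000` are illustrative and are not taken from a real deployment:
+
+```json
+POST /_plugins/_ml/agents/<agent_id>/_execute
+{
+  "parameters": {
+    "question": "How many indexes are in my cluster?",
+    "max_tokens": 2000
+  }
+}
+```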
+
 ## Next steps
 
 - To learn more about registering agents, see [Register Agent API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/agent-apis/register-agent/).

@@ -432,10 +432,14 @@ Plan-execute-reflect agents support token usage tracking, which provides detaile
 
 To enable token usage tracking, set the `include_token_usage` parameter to `true` when executing the agent. The response will include a `token_usage` output with per-turn and per-model aggregated metrics. For detailed information about token usage fields and how tokens are calculated by different model providers, see [Tracking token usage]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/agent-apis/execute-agent/#tracking-token-usage) in the Execute Agent API documentation.
 
+### Limiting token usage
+
+To limit total token consumption for one plan-execute-reflect agent execution, set `parameters.max_tokens` to a positive integer. OpenSearch applies this value as one shared budget across planner, executor subagent, and summary or reflection LLM calls. Before each covered LLM call, the runner caps the outgoing model request to the remaining budget while preserving any lower per-call model limit; when reported usage exhausts the budget, the agent stops instead of making another covered LLM call. Token limiting does not require `include_token_usage` to be `true`; that parameter controls only whether detailed usage metrics are returned in the response. Omitting `max_tokens` leaves the execution unlimited; invalid, zero, or negative values are rejected. LLM calls made internally by tools, such as `AgentTool` or `MLModelTool`, are outside this budget.
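+
+As a hedged illustration, the following request sets one shared budget for a plan-execute-reflect agent run. It assumes asynchronous execution, which is typical for long-running plan-execute-reflect agents. The `<agent_id>` placeholder, the question text, and the budget value of `10000` are illustrative only:
+
+```json
+POST /_plugins/_ml/agents/<agent_id>/_execute?async=true
+{
+  "parameters": {
+    "question": "Which indexes have the most failed requests this week?",
+    "max_tokens": 10000
+  }
+}
+```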
+
 ## Next steps
 
 - To learn more about registering agents, see [Register Agent API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/agent-apis/register-agent/).
 - For a list of supported tools, see [Tools]({{site.url}}{{site.baseurl}}/ml-commons-plugin/agents-tools/tools/index/).
 - For a step-by-step tutorial on using a plan-execute-reflect agent, see [Building a plan-execute-reflect agent]({{site.url}}{{site.baseurl}}/tutorials/gen-ai/agents/build-plan-execute-reflect-agent/).
 - For supported APIs, see [Agent APIs]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/agent-apis/).
-- To use agents and tools in configuration automation, see [Automating configurations]({{site.url}}{{site.baseurl}}/automating-configurations/index/).
+- To use agents and tools in configuration automation, see [Automating configurations]({{site.url}}{{site.baseurl}}/automating-configurations/index/).
@@ -317,6 +317,10 @@ When `include_token_usage` is set to `true`, the response includes detailed toke
 The `conversational_v2` agent automatically includes token usage in its response format through the `metrics` field and does not require this parameter. For details, see [The `conversational_v2` agent response format](#the-conversational_v2-agent-response-format).
 {: .note}
 
+### Limiting token usage
+
+To limit total token consumption for one agent execution, set `parameters.max_tokens` to a positive integer. OpenSearch treats this value as an agent-level budget for direct agent-runner LLM calls and plan-execute-reflect subagent executions. Before each covered LLM call, the agent runner caps the outgoing model request to the remaining budget while preserving any lower per-call model limit. The remaining budget is applied to the provider-specific output-token field, such as `max_tokens` for generic or OpenAI requests, `maxTokens` for Amazon Bedrock Converse requests, and `maxOutputTokens` for Google Gemini requests. If reported token usage exhausts the budget, the agent stops instead of making another covered LLM call; when a stop reason is returned, it is `budget_exhausted`. Token limiting does not require `include_token_usage` to be `true`; that parameter controls only whether detailed usage metrics are returned in the response. Omitting `max_tokens` leaves the execution unlimited; invalid, zero, or negative values are rejected. LLM calls made internally by tools, such as `AgentTool` or `MLModelTool`, are outside this budget.
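+
+Because the two parameters are independent, they can be combined in one request. The following is a minimal sketch in which the `<agent_id>` placeholder, the question text, and the budget value of `1500` are illustrative assumptions:
+
+```json
+POST /_plugins/_ml/agents/<agent_id>/_execute
+{
+  "parameters": {
+    "question": "Summarize the errors in my application logs",
+    "include_token_usage": true,
+    "max_tokens": 1500
+  }
+}
+```
+
+In this sketch, `max_tokens` limits the run, while `include_token_usage` only adds the detailed usage metrics to the response.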
+
 ### Example request: Regular registration
 **Introduced 3.6**
 {: .label .label-purple }
@@ -455,4 +459,4 @@ Field | Data type | Present in | Description
 Token counts are calculated by the model provider and may vary based on tokenization methods. For more information about how tokens are calculated, refer to your model provider's documentation:
 - [Amazon Bedrock TokenUsage](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_runtime_TokenUsage.html)
 - [OpenAI tokenization](https://platform.openai.com/docs/guides/tokenization)
-- [Google Gemini token counting](https://ai.google.dev/gemini-api/docs/tokens)
+- [Google Gemini token counting](https://ai.google.dev/gemini-api/docs/tokens)