4 changes: 4 additions & 0 deletions _ml-commons-plugin/agents-tools/agents/ag-ui.md
@@ -238,6 +238,10 @@ AG-UI agents support token usage tracking, which provides detailed metrics about

For AG-UI agents, token usage tracking is enabled during agent registration by setting `"include_token_usage": true` in the `parameters` field. This applies to both the unified registration method (new interface) and the regular registration method (old interface). The setting must be specified at registration time; it cannot be changed during agent execution.

### Limiting token usage

To limit total token consumption for one AG-UI agent execution, set `parameters.max_tokens` to a positive integer. OpenSearch treats this value as an agent-level budget for the agent runner's LLM calls in the streaming run. Before each covered LLM call, the runner caps the outgoing model request to the remaining budget while preserving any lower per-call model limit; when reported usage exhausts the budget, the agent stops instead of making another covered LLM call. Token limiting does not require `include_token_usage` to be `true`; that parameter controls only whether detailed usage metrics are returned in the response. Omitting `max_tokens` leaves the execution unlimited; invalid, zero, or negative values are rejected. LLM calls made internally by tools, such as `AgentTool` or `MLModelTool`, are outside this budget.
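
The validation rules above can be mirrored on the client before a request is sent. The following Python sketch is illustrative only: `build_parameters` is a hypothetical helper, not part of OpenSearch, and the only field names it relies on are `max_tokens` and `include_token_usage` from the text above.

```python
def build_parameters(max_tokens=None, include_token_usage=False):
    """Build a `parameters` object (hypothetical client-side helper)."""
    params = {}
    if max_tokens is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_int = isinstance(max_tokens, int) and not isinstance(max_tokens, bool)
        if not is_int or max_tokens <= 0:
            raise ValueError("max_tokens must be a positive integer")
        params["max_tokens"] = max_tokens
    if include_token_usage:
        # Independent of max_tokens: only controls whether metrics are returned.
        params["include_token_usage"] = True
    return params


limited = build_parameters(max_tokens=2000)
unlimited = build_parameters()  # omitting max_tokens leaves the run unlimited
```

Checking the value locally only avoids a round trip; the server-side check described above remains authoritative.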

### Enabling token usage tracking during registration (unified method)

To enable token usage tracking for an AG-UI agent using the unified registration method, include the `include_token_usage` parameter in the `parameters` field during registration:
4 changes: 4 additions & 0 deletions _ml-commons-plugin/agents-tools/agents/conversational.md
@@ -301,6 +301,10 @@ Conversational agents support token usage tracking, which provides detailed metr

To enable token usage tracking, set the `include_token_usage` parameter to `true` when executing the agent. The response will include a `token_usage` output with per-turn and per-model aggregated metrics. For detailed information about token usage fields and how tokens are calculated by different model providers, see [Tracking token usage]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/agent-apis/execute-agent/#tracking-token-usage) in the Execute Agent API documentation.

### Limiting token usage

To limit total token consumption for one conversational agent execution, set `parameters.max_tokens` to a positive integer. OpenSearch treats this value as an agent-level budget for the agent runner's LLM calls, including follow-up calls after tool results. Before each covered LLM call, the runner caps the outgoing model request to the remaining budget while preserving any lower per-call model limit; when reported usage exhausts the budget, the agent stops instead of making another covered LLM call. Token limiting does not require `include_token_usage` to be `true`; that parameter controls only whether detailed usage metrics are returned in the response. Omitting `max_tokens` leaves the execution unlimited; invalid, zero, or negative values are rejected. LLM calls made internally by tools, such as `AgentTool` or `MLModelTool`, are outside this budget.
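
The capping behavior described above can be sketched as a simple loop. This is a simplified model, not the agent runner's actual implementation: `run_with_budget`, `call_llm`, and `max_calls` are hypothetical names, and `call_llm` is assumed to return the usage reported by the provider together with a flag indicating whether the agent has finished.

```python
def run_with_budget(max_tokens, call_llm, per_call_limit=None, max_calls=10):
    """Return the cap sent with each call and whether the budget ran out."""
    remaining = max_tokens
    caps = []
    for _ in range(max_calls):
        if remaining <= 0:
            # Budget exhausted: stop instead of making another covered call.
            return caps, True
        # Cap the request to the remaining budget, keeping any lower per-call limit.
        cap = remaining if per_call_limit is None else min(remaining, per_call_limit)
        caps.append(cap)
        reported_usage, finished = call_llm(cap)
        remaining -= reported_usage
        if finished:
            return caps, False
    return caps, remaining <= 0


def fake_llm(cap):
    # Stub model that always consumes its full cap and never finishes.
    return cap, False


caps, exhausted = run_with_budget(100, fake_llm, per_call_limit=60)
```

With a budget of 100 and a per-call limit of 60, the stub receives caps of 60 and then 40, after which the loop stops. Because reported usage may exceed the requested cap, the sketch checks for `remaining <= 0` rather than for exactly zero.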

## Next steps

- To learn more about registering agents, see [Register Agent API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/agent-apis/register-agent/).
@@ -432,10 +432,14 @@ Plan-execute-reflect agents support token usage tracking, which provides detaile

To enable token usage tracking, set the `include_token_usage` parameter to `true` when executing the agent. The response will include a `token_usage` output with per-turn and per-model aggregated metrics. For detailed information about token usage fields and how tokens are calculated by different model providers, see [Tracking token usage]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/agent-apis/execute-agent/#tracking-token-usage) in the Execute Agent API documentation.

### Limiting token usage

To limit total token consumption for one plan-execute-reflect agent execution, set `parameters.max_tokens` to a positive integer. OpenSearch applies this value as one shared budget across planner, executor subagent, and summary or reflection LLM calls. Before each covered LLM call, the runner caps the outgoing model request to the remaining budget while preserving any lower per-call model limit; when reported usage exhausts the budget, the agent stops instead of making another covered LLM call. Token limiting does not require `include_token_usage` to be `true`; that parameter controls only whether detailed usage metrics are returned in the response. Omitting `max_tokens` leaves the execution unlimited; invalid, zero, or negative values are rejected. LLM calls made internally by tools, such as `AgentTool` or `MLModelTool`, are outside this budget.
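
The shared-budget idea can be illustrated with a single counter that every phase draws from. The sketch below is a simplified model under stated assumptions: `TokenBudget`, `run_phases`, the phase names, and the usage figures are all hypothetical and do not come from the OpenSearch implementation.

```python
class TokenBudget:
    """One budget object shared by every phase of a single execution."""

    def __init__(self, max_tokens):
        self.remaining = max_tokens

    def exhausted(self):
        return self.remaining <= 0

    def record(self, reported_usage):
        # Deduct whatever usage the model provider reported for the call.
        self.remaining -= reported_usage


def run_phases(budget, phases):
    """Run (name, reported_usage) phases until the shared budget is spent."""
    completed = []
    for name, reported_usage in phases:
        if budget.exhausted():
            break  # no further covered LLM calls are made
        budget.record(reported_usage)
        completed.append(name)
    return completed


budget = TokenBudget(500)
completed = run_phases(
    budget,
    [("planner", 200), ("executor", 350), ("reflection", 100)],
)
```

In this run the planner and executor calls together consume 550 tokens, so the reflection call is skipped. A per-phase budget would have allowed it, which is the difference a single shared budget makes.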

## Next steps

- To learn more about registering agents, see [Register Agent API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/agent-apis/register-agent/).
- For a list of supported tools, see [Tools]({{site.url}}{{site.baseurl}}/ml-commons-plugin/agents-tools/tools/index/).
- For a step-by-step tutorial on using a plan-execute-reflect agent, see [Building a plan-execute-reflect agent]({{site.url}}{{site.baseurl}}/tutorials/gen-ai/agents/build-plan-execute-reflect-agent/).
- For supported APIs, see [Agent APIs]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/agent-apis/).
- To use agents and tools in configuration automation, see [Automating configurations]({{site.url}}{{site.baseurl}}/automating-configurations/index/).
6 changes: 5 additions & 1 deletion _ml-commons-plugin/api/agent-apis/execute-agent.md
@@ -317,6 +317,10 @@ When `include_token_usage` is set to `true`, the response includes detailed toke
The `conversational_v2` agent automatically includes token usage in its response format through the `metrics` field and does not require this parameter. For details, see [The `conversational_v2` agent response format](#the-conversational_v2-agent-response-format).
{: .note}

### Limiting token usage

To limit total token consumption for one agent execution, set `parameters.max_tokens` to a positive integer. OpenSearch treats this value as an agent-level budget for direct agent-runner LLM calls and plan-execute-reflect sub-agent executions. Before each covered LLM call, the agent runner caps the outgoing model request to the remaining budget while preserving any lower per-call model limit. The remaining budget is applied to the provider-specific output-token field, such as `max_tokens` for generic or OpenAI requests, `maxTokens` for Amazon Bedrock Converse requests, and `maxOutputTokens` for Google Gemini requests. If reported token usage exhausts the budget, the agent stops instead of making another covered LLM call; when a stop reason is returned, it is `budget_exhausted`. Token limiting does not require `include_token_usage` to be `true`; that parameter controls only whether detailed usage metrics are returned in the response. Omitting `max_tokens` leaves the execution unlimited; invalid, zero, or negative values are rejected. LLM calls made internally by tools, such as `AgentTool` or `MLModelTool`, are outside this budget.
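
The provider-specific mapping above can be sketched as a lookup table. This is an illustration, not the plugin's code: the provider keys (`openai`, `bedrock_converse`, `gemini`) and the `cap_request` helper are hypothetical, and the request is modeled as a flat dictionary even though real provider payloads may nest the field inside a configuration object.

```python
# Output-token field per provider, as named in the paragraph above.
# The dictionary keys are hypothetical labels chosen for this sketch.
OUTPUT_TOKEN_FIELD = {
    "openai": "max_tokens",
    "bedrock_converse": "maxTokens",
    "gemini": "maxOutputTokens",
}


def cap_request(provider, request, remaining_budget):
    """Return a copy of `request` capped to the remaining budget."""
    # Unknown providers fall back to the generic `max_tokens` field.
    field = OUTPUT_TOKEN_FIELD.get(provider, "max_tokens")
    capped = dict(request)
    existing = capped.get(field)
    # Preserve a lower per-call limit if the request already has one.
    capped[field] = remaining_budget if existing is None else min(existing, remaining_budget)
    return capped


bedrock = cap_request("bedrock_converse", {"maxTokens": 4096}, 1000)
gemini = cap_request("gemini", {}, 250)
openai = cap_request("openai", {"max_tokens": 100}, 1000)
```

The first call lowers an existing limit of 4096 to the remaining 1000, the second adds a limit where none was set, and the third leaves the lower per-call limit of 100 unchanged.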

### Example request: Regular registration
**Introduced 3.6**
{: .label .label-purple }
@@ -455,4 +459,4 @@ Field | Data type | Present in | Description
Token counts are calculated by the model provider and may vary based on tokenization methods. For more information about how tokens are calculated, refer to your model provider's documentation:
- [Amazon Bedrock TokenUsage](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_runtime_TokenUsage.html)
- [OpenAI tokenization](https://platform.openai.com/docs/guides/tokenization)
- [Google Gemini token counting](https://ai.google.dev/gemini-api/docs/tokens)