Is your feature request related to a problem?
Currently, there is no way to enforce a hard limit on the total number of tokens consumed during an agent execution in ML Commons.
The only available control is parameters.max_iteration, which indirectly limits usage by restricting the number of steps an agent can take. However, this is not sufficient because token consumption can vary significantly between iterations depending on the task, prompt size, and intermediate outputs. As a result, token usage is unpredictable and cannot be reliably bounded.
While parameters.max_tokens can be configured at the model level (/_plugins/_ml/models), this setting does not apply to agent executions as a whole.
This becomes a critical limitation for:
- Cost control and billing predictability
- Multi-tenant environments
- Production safety and resource governance
Without a way to cap total token usage per agent execution, it is difficult to safely operate agents in cost-sensitive or large-scale environments.
What solution would you like?
A new parameter at the agent level:
{
  "parameters": {
    "max_tokens": <max_tokens_for_agent_execution>
  }
}
With the recent addition of cumulative token tracking across multiple LLM calls, this feature can be implemented in a straightforward way.
The agent execution loop can dynamically adjust the max_tokens / maxTokens passed to each LLM call based on the remaining token budget:
{
  "parameters": {
    "max_tokens": "<max_tokens_for_agent_execution> - <totalTokensConsumedSoFar>"
  }
}
This would:
- Enforce a hard upper bound on total token usage
- Work seamlessly across multiple LLM calls within a single agent execution
- Provide predictable, bounded cost per agent execution
If the remaining token budget reaches zero, the agent execution can terminate gracefully with an appropriate error or partial result.
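The budget logic described above could look roughly like the following Python sketch. It is not ML Commons code: `run_with_token_budget`, `BudgetResult`, and the `call_llm` callback are hypothetical names used only for illustration. It also assumes that each LLM call reports how many tokens it consumed and that this count never exceeds the cap passed to it.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical LLM call: receives the per-call max_tokens cap and returns
# (output_text, tokens_consumed_by_this_call, is_final_answer).
LlmCall = Callable[[int], Tuple[str, int, bool]]


@dataclass
class BudgetResult:
    outputs: List[str]        # partial or final outputs collected so far
    tokens_used: int          # cumulative tokens across all LLM calls
    stopped_on_budget: bool   # True if the run ended because the budget ran out


def run_with_token_budget(call_llm: LlmCall, max_tokens: int, max_iteration: int) -> BudgetResult:
    outputs: List[str] = []
    used = 0
    for _ in range(max_iteration):
        remaining = max_tokens - used
        if remaining <= 0:
            # Budget exhausted: stop gracefully and return the partial result.
            return BudgetResult(outputs, used, True)
        # Pass only the remaining budget as this call's max_tokens.
        text, consumed, done = call_llm(remaining)
        used += consumed
        outputs.append(text)
        if done:
            return BudgetResult(outputs, used, False)
    # max_iteration reached before the budget ran out.
    return BudgetResult(outputs, used, False)


def fake_llm(cap: int) -> Tuple[str, int, bool]:
    # Stand-in model: wants 40 tokens per step, respects the cap, never finishes.
    consumed = min(cap, 40)
    return (f"step using {consumed} tokens", consumed, False)


result = run_with_token_budget(fake_llm, max_tokens=100, max_iteration=10)
print(result.tokens_used, result.stopped_on_budget, len(result.outputs))
```

With a budget of 100 and a model that wants 40 tokens per step, the third call is capped at the remaining 20 tokens and the run stops before a fourth call. If the cumulative count also includes prompt tokens, which a per-call max_tokens setting typically does not cap, the loop would additionally need to check the prompt size before each call.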
What alternatives have you considered?
The only current alternative is parameters.max_iteration, but it is too imprecise to be useful for controlling token usage:
- Token consumption per iteration varies widely
- There is no direct correlation between iterations and cost
- It cannot guarantee an upper bound on total tokens
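A small worked example makes the gap concrete. The per-iteration token counts below are invented purely for illustration: both runs stay within the same max_iteration, yet their totals differ by more than an order of magnitude.

```python
# Hypothetical per-iteration token counts for two agent runs,
# both limited to max_iteration = 3.
run_a = [200, 250, 300]      # short prompts, small tool outputs
run_b = [200, 4000, 9000]    # large intermediate outputs fed back into the prompt

total_a = sum(run_a)  # 750 tokens
total_b = sum(run_b)  # 13200 tokens

print(len(run_a), total_a)
print(len(run_b), total_b)
```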
Additional context
Recent PR introducing cumulative token counting across LLM calls:
https://github.com/opensearch-project/ml-commons/pull/4683/files