[FEATURE] Agent-level Token Limit parameter #4728

@GugaGlonti

Is your feature request related to a problem?

Currently, there is no way to enforce a hard limit on the total number of tokens consumed during an agent execution in ML Commons.

The only available control is parameters.max_iteration, which indirectly limits usage by restricting the number of steps an agent can take. However, this is not sufficient because token consumption can vary significantly between iterations depending on the task, prompt size, and intermediate outputs. As a result, token usage is unpredictable and cannot be reliably bounded.

While parameters.max_tokens can be configured at the model level (/_plugins/_ml/models), that setting only caps each individual LLM call. It does not bound the cumulative tokens consumed across all calls in an agent execution.

This becomes a critical limitation for:

  • Cost control and billing predictability
  • Multi-tenant environments
  • Production safety and resource governance

Without a way to cap total token usage per agent execution, it is difficult to safely operate agents in cost-sensitive or large-scale environments.

What solution would you like?

A new parameter at the agent level:

{
  "parameters": {
    "max_tokens": <max_tokens_for_agent_execution>
  }
}

With the recent addition of cumulative token tracking across multiple LLM calls, this feature should be straightforward to implement.

The agent execution loop can dynamically adjust the max_tokens / maxTokens passed to each LLM call based on the remaining token budget:

{
  "parameters": {
    "max_tokens": "<max_tokens_for_agent_execution> - <totalTokensConsumedSoFar>"
  }
}

This would:

  • Enforce a hard upper bound on total token usage
  • Work seamlessly across multiple LLM calls within a single agent execution
  • Provide deterministic and predictable cost control

If the remaining token budget reaches zero, the agent execution can terminate gracefully with an appropriate error or partial result.
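
To make the proposed loop concrete, here is a minimal Python sketch of the budget logic described above. It is only an illustration, not ML Commons code: run_agent, AgentResult, and the call_llm callback (which takes a prompt and a per-call max_tokens and returns the text, the tokens it consumed, and whether it is the final answer) are all hypothetical names. It also assumes the per-call token count reported back never exceeds the cap that was passed in.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical LLM callback: (prompt, max_tokens) -> (text, tokens_used, is_final).
LLMCall = Callable[[str, int], Tuple[str, int, bool]]


@dataclass
class AgentResult:
    outputs: List[str]
    tokens_used: int
    stopped_reason: str  # "completed", "token_budget_exhausted", or "max_iteration"


def run_agent(prompt: str, call_llm: LLMCall, max_tokens: int,
              max_iteration: int = 10) -> AgentResult:
    """Run LLM calls until done, out of budget, or out of iterations."""
    outputs: List[str] = []
    used = 0
    for _ in range(max_iteration):
        # Remaining budget = agent-level max_tokens minus tokens consumed so far.
        remaining = max_tokens - used
        if remaining <= 0:
            # Budget exhausted: stop gracefully and return the partial result.
            return AgentResult(outputs, used, "token_budget_exhausted")
        # Pass the remaining budget as the per-call max_tokens.
        text, tokens, is_final = call_llm(prompt, remaining)
        used += tokens
        outputs.append(text)
        if is_final:
            return AgentResult(outputs, used, "completed")
    return AgentResult(outputs, used, "max_iteration")
```

One caveat with this approach: with many providers the per-call max_tokens limits only generated tokens, so a strict bound on total usage would likely also need to account for prompt tokens before each call is sent.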

What alternatives have you considered?

The only current alternative is parameters.max_iteration, but it is too imprecise to be useful for controlling token usage:

  • Token consumption per iteration varies widely
  • There is no direct correlation between iterations and cost
  • It cannot guarantee an upper bound on total tokens

Additional context

Recent PR introducing cumulative token counting across LLM calls:
https://github.com/opensearch-project/ml-commons/pull/4683/files

Labels

enhancement: New feature or request

Projects

Status
On-deck
