[FEATURE] Agent-level Token Limit parameter #4728

**Is your feature request related to a problem?**

Currently, there is no way to enforce a hard limit on the total number of tokens consumed during an agent execution in ML Commons.

The only available control is `parameters.max_iteration`, which indirectly limits usage by restricting the number of steps an agent can take. However, this is not sufficient because token consumption can vary significantly between iterations depending on the task, prompt size, and intermediate outputs. As a result, token usage is unpredictable and cannot be reliably bounded.

While `parameters.max_tokens` can be configured at the model level (`/_plugins/_ml/models`), that setting applies to each individual model call and does not bound an agent execution as a whole.

This becomes a critical limitation for:
- Cost control and billing predictability
- Multi-tenant environments
- Production safety and resource governance

Without a way to cap total token usage per agent execution, it is difficult to safely operate agents in cost-sensitive or large-scale environments.

**What solution would you like?**

A new parameter at the agent level:

```json
{
  "parameters": {
    "max_tokens": <max_tokens_for_agent_execution>
  }
}
```

With the recent addition of cumulative token tracking across multiple LLM calls, this feature should be straightforward to implement.

The agent execution loop can dynamically adjust the `max_tokens` / `maxTokens` passed to each LLM call based on the remaining token budget:

```json
{
  "parameters": {
    "max_tokens": "<max_tokens_for_agent_execution> - <totalTokensConsumedSoFar>"
  }
}
```

This would:
- Enforce a hard upper bound on total token usage
- Work seamlessly across multiple LLM calls within a single agent execution
- Provide deterministic and predictable cost control

If the remaining token budget reaches zero, the agent execution can terminate gracefully with an appropriate error or partial result.
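The budget logic described above can be sketched as follows. This is a minimal illustration in Python, not ML Commons code: `run_agent`, `call_llm`, `AgentResult`, and `stopped_reason` are hypothetical names, and the sketch assumes each LLM call reports how many tokens it consumed, as cumulative token tracking would provide. Because a per-call `max_tokens` usually caps only the generated tokens, the sketch also re-checks the remaining budget before every call.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical LLM call: receives the per-call max_tokens cap and returns
# (output_text, tokens_consumed, is_final_answer).
LlmCall = Callable[[int], Tuple[str, int, bool]]


@dataclass
class AgentResult:
    output: Optional[str]   # last (possibly partial) output
    tokens_used: int        # cumulative tokens across all LLM calls
    stopped_reason: str     # "completed", "token_budget_exhausted", "max_iteration_reached"


def run_agent(call_llm: LlmCall, max_tokens: int, max_iteration: int = 10) -> AgentResult:
    tokens_used = 0
    last_output: Optional[str] = None
    for _ in range(max_iteration):
        remaining = max_tokens - tokens_used
        if remaining <= 0:
            # Budget exhausted: stop gracefully and return the partial result.
            return AgentResult(last_output, tokens_used, "token_budget_exhausted")
        # Pass the remaining budget as this call's max_tokens.
        text, consumed, is_final = call_llm(remaining)
        tokens_used += consumed
        last_output = text
        if is_final:
            return AgentResult(text, tokens_used, "completed")
    return AgentResult(last_output, tokens_used, "max_iteration_reached")


# Stand-in for a real model: uses up to 400 tokens per call and never finishes.
def fake_llm(cap: int) -> Tuple[str, int, bool]:
    return ("step", min(cap, 400), False)


result = run_agent(fake_llm, max_tokens=1000)
print(result)
```

With the stand-in model, the run stops after three calls (400, 400, and 200 tokens) because the fourth check finds no budget left, well before `max_iteration` is reached.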

**What alternatives have you considered?**

The only current alternative is `parameters.max_iteration`, but it is too imprecise to be useful for controlling token usage:
- Token consumption per iteration varies widely
- There is no direct correlation between iterations and cost
- It cannot guarantee an upper bound on total tokens
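To make the points above concrete, the short example below compares two runs that both respect the same iteration limit. The per-iteration token counts are made-up illustrative numbers, not measurements, and `total_tokens` is a hypothetical helper.

```python
from typing import Sequence


def total_tokens(tokens_per_iteration: Sequence[int], max_iteration: int) -> int:
    # Only the first max_iteration steps are executed.
    return sum(tokens_per_iteration[:max_iteration])


MAX_ITERATION = 5

# Illustrative numbers: a run with small prompts and short tool outputs...
short_run = [300, 250, 400, 350, 200]
# ...and a run where intermediate outputs grow the context on every step.
long_run = [300, 4_000, 12_000, 9_500, 15_000]

short_total = total_tokens(short_run, MAX_ITERATION)
long_total = total_tokens(long_run, MAX_ITERATION)
print(short_total, long_total)
```

Both runs stay within the same iteration limit, yet their totals differ by more than an order of magnitude, which is why an iteration cap alone cannot serve as a token bound.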

**Additional context**

Recent PR introducing cumulative token counting across LLM calls:
https://github.com/opensearch-project/ml-commons/pull/4683/files
