This guide explains how to modify and manage APIs in TensorRT LLM, focusing on the high-level LLM API.
TensorRT LLM provides multiple API levels:
- LLM API - The highest-level API (e.g., the LLM class)
- PyExecutor API - The mid-level API (e.g., the PyExecutor class)
This guide focuses on the LLM API, which is the primary interface for most users.
TensorRT LLM classifies APIs into two categories:
Committed APIs
- Stable and guaranteed to remain consistent across releases
- No breaking changes without major version updates
- Schema stored in: tests/unittest/api_stability/references_committed/
Non-committed APIs
- Under active development and may change between releases
- Marked with a status field in the docstring:
  - prototype - Early experimental stage
  - beta - More stable but still subject to change
  - deprecated - Scheduled for removal
- Schema stored in: tests/unittest/api_stability/references/
- See API status documentation for complete details
Any backwards-incompatible LLM API change is breaking, regardless of whether the affected API is committed, prototype, beta, deprecated, or otherwise non-committed. Examples include removing arguments or methods, making optional arguments required, changing defaults in a way existing callers observe, narrowing accepted values, or changing return shapes/types.
Do not break committed APIs. Prefer deprecation and migration paths. Breaking a
non-committed API is less strict, but should still be avoided unless justified.
If a pull request updates API stability reference files, classify the accepted
contract change with exactly one GitHub label enforced by the LLM API
Compatibility workflow: api-compatible or api-breaking. For api-breaking,
include BREAKING in the PR title.
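As a rough illustration of how the two labels divide changes, the sketch below classifies a parameter change using the examples of breaking changes listed above. Param and classify_change are hypothetical names invented for this guide; they are not part of TensorRT LLM or of the LLM API Compatibility workflow. The check is deliberately conservative: any type change counts as breaking.

```python
from dataclasses import dataclass
from typing import Any, Dict

_REQUIRED = object()  # sentinel meaning "this parameter has no default"


@dataclass(frozen=True)
class Param:
    """Hypothetical description of one argument (illustration only)."""
    type: str
    default: Any = _REQUIRED


def classify_change(old: Dict[str, Param], new: Dict[str, Param]) -> str:
    """Return 'api-breaking' or 'api-compatible' for a parameter-set change."""
    for name, old_param in old.items():
        new_param = new.get(name)
        if new_param is None:
            return "api-breaking"  # argument removed
        if new_param.type != old_param.type:
            return "api-breaking"  # type changed (treated conservatively)
        had_default = old_param.default is not _REQUIRED
        if had_default and new_param.default != old_param.default:
            # Default changed, or an optional argument became required.
            return "api-breaking"
    for name, new_param in new.items():
        if name not in old and new_param.default is _REQUIRED:
            return "api-breaking"  # new required argument
    return "api-compatible"


old = {"max_seq_len": Param("int", 2048)}
added_optional = {**old, "enable_cache": Param("bool", True)}
made_required = {"max_seq_len": Param("int")}
```

Under these assumptions, added_optional is classified as api-compatible, while made_required and an empty parameter set are classified as api-breaking.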
All API schemas are:
- Stored as YAML files in the codebase
- Protected by unit tests in tests/unittest/api_stability/
- Automatically validated to ensure consistency
Use Semantic Clarity
Argument names should describe what the argument represents, not how it is used internally.
✅ Good: max_new_tokens (clear meaning)
❌ Bad: num (ambiguous)
Reflect Argument Type and Granularity
- For boolean knobs, prefix with verbs like enable_ and so on.
  Examples: enable_cache, enable_flash_attention
- For numerical threshold knobs, suffix with _limit, _size, _count, _len, or _ratio.
  Examples: max_seq_len, prefill_batch_size
Avoid Redundant Prefixes
Example (in MoeConfig):
✅ Good: backend
❌ Bad: moe_backend (redundant since it's already in MoeConfig)
Use Specific Names for Narrow Scenarios
When adding knobs for specific use cases, make the name convey the restriction clearly via a prefix. It's acceptable to rename later when the knob becomes more generic or is moved into a dedicated config.
Example (argument to the LLM class):
✅ Good: rope_scaling_factor → clearly indicates it's for RoPE
❌ Bad: scaling_factor → too generic and prone to misuse
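The naming rules above are easy to check mechanically. The following is a minimal sketch of such a check, not a tool that exists in the repository: naming_issues and the toy MoeConfig are hypothetical, the prefix derivation is naive (it only lowercases the class name), and only the enable_ verb is recognized even though other verbs are acceptable.

```python
from dataclasses import dataclass, fields
from typing import List


def naming_issues(config_cls) -> List[str]:
    """Flag field names that break two of the conventions described above."""
    # "MoeConfig" -> "moe_": a field repeating this prefix is redundant.
    owner_prefix = config_cls.__name__.removesuffix("Config").lower() + "_"
    issues = []
    for f in fields(config_cls):
        if f.name.startswith(owner_prefix):
            issues.append(f"{f.name}: redundant '{owner_prefix}' prefix")
        # Assumption: only 'enable_' is checked; other verb prefixes are fine too.
        if f.type in (bool, "bool") and not f.name.startswith("enable_"):
            issues.append(f"{f.name}: boolean knob without a verb prefix")
    return issues


@dataclass
class MoeConfig:  # toy stand-in, not the real MoeConfig
    moe_backend: str = "auto"  # redundant prefix
    cache: bool = True         # boolean without a verb prefix


issues = naming_issues(MoeConfig)
```

Here issues contains one entry for moe_backend and one for cache; renaming them to backend and enable_cache clears both.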
Organize complex or hierarchical arguments into dedicated configuration dataclasses with intuitive and consistent naming.
Guidelines
- Use the XxxConfig suffix consistently
  Examples: ModelConfig, ParallelConfig, MoeConfig
- Reflect conceptual hierarchy
  The dataclass name should represent a coherent functional unit, not an arbitrary grouping
- Avoid over-nesting
  Use only one level of configuration hierarchy whenever possible (e.g., LlmArgs → ParallelConfig) to balance readability and modularity
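To make the one-level hierarchy concrete, here is a small sketch using standard-library dataclasses. ToyLlmArgs and every field name and default in it are assumptions chosen for illustration; the real LlmArgs, ParallelConfig, and MoeConfig have their own fields.

```python
from dataclasses import dataclass, field


@dataclass
class ParallelConfig:  # toy stand-in: one coherent functional unit
    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1


@dataclass
class MoeConfig:  # toy stand-in: note 'backend', not 'moe_backend'
    backend: str = "default"


@dataclass
class ToyLlmArgs:
    """Top-level arguments with exactly one level of nested configs."""
    model: str
    parallel_config: ParallelConfig = field(default_factory=ParallelConfig)
    moe_config: MoeConfig = field(default_factory=MoeConfig)


args = ToyLlmArgs(
    model="my-model",
    parallel_config=ParallelConfig(tensor_parallel_size=2),
)
```

Each nested config is reachable in a single hop (args.parallel_config.tensor_parallel_size), which keeps call sites readable while still grouping related knobs.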
LlmArgs is the central place for all configuration knobs. It integrates with our infrastructure to ensure:
- API Stability
  - Protects committed (stable) APIs
  - GitHub reviewer committee oversees API stability
- API Status Registration
  - Uncommitted (unstable) APIs must be marked as "prototype" or "beta"
  - API statuses are displayed in the documentation
- API Documentation
  - Each knob uses a Field with a description
  - Automatically rendered in public documentation
Managing knobs in LlmArgs remains scalable and maintainable thanks to our existing infrastructure and review processes.
Drawbacks of Environment Variables:
- Dispersed across the codebase
- Lack documentation and discoverability
- Pose challenges for testing and validation
Guidelines for Adding Knobs:
- ✅ Add clear, descriptive documentation for each field
- ✅ It's fine to add temporary knobs and refine them later
- ⚠️ Always mark temporary knobs as "prototype" if not stable yet
- ✅ Refactor prototype knobs as they mature, and promote them to "beta" or "stable"
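The contrast between the two approaches can be sketched as follows. This uses standard-library dataclasses with field metadata as a stand-in for the project's Field; the environment variable name TOY_GC_GEN0_THRESHOLD, ToyArgs, and describe_knobs are all hypothetical.

```python
import os
from dataclasses import dataclass, field, fields


# Environment-variable style: name, type, and default exist only at the
# call site, so nothing can enumerate or document the knob.
def gc_threshold_from_env() -> int:
    return int(os.environ.get("TOY_GC_GEN0_THRESHOLD", "20000"))


# Declared-knob style: the same setting as a typed, documented field.
@dataclass
class ToyArgs:
    garbage_collection_gen0_threshold: int = field(
        default=20000,
        metadata={
            "description": "Threshold for Python garbage collection of "
                           "generation 0 objects.",
            "status": "prototype",  # temporary knob, marked as such
        },
    )


def describe_knobs(cls) -> dict:
    """Collect every declared knob with its metadata in one place."""
    return {f.name: dict(f.metadata) for f in fields(cls)}


knobs = describe_knobs(ToyArgs)
```

Because the knob is declared, its description and status can be listed, tested, and rendered into documentation, which is not possible for a value read ad hoc from the environment.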
The LLM class accepts numerous configuration parameters for models, runtime, and other components. These are managed through a Pydantic dataclass called LlmArgs.
- The LLM's __init__ method parameters map directly to LlmArgs fields
- LlmArgs is an alias for TorchLlmArgs (defined in tensorrt_llm/llmapi/llm_args.py)
- All arguments are validated and type-checked through Pydantic
Follow these steps to add a new constructor argument:
garbage_collection_gen0_threshold: int = Field(
    default=20000,
    description=(
        "Threshold for Python garbage collection of generation 0 objects. "
        "Lower values trigger more frequent garbage collection."
    ),
    status="beta",  # Required for non-committed arguments
)

Field requirements:
- Type annotation: Required for all fields
- Default value: Recommended unless the field is mandatory
- Description: Clear explanation of the parameter's purpose
- Status: Required for non-committed arguments (prototype, beta, etc.)
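The field requirements above can be pictured as a simple check. This is a sketch under stated assumptions: it uses dataclass metadata instead of the project's Field, and check_field_requirements, ALLOWED_STATUSES, and ToyArgs are hypothetical names rather than part of the actual test suite.

```python
from dataclasses import dataclass, field, fields
from typing import List, Set

# Statuses named in this guide for non-committed arguments.
ALLOWED_STATUSES = {"prototype", "beta", "deprecated"}


def check_field_requirements(cls, committed: Set[str]) -> List[str]:
    """Report fields missing a description or a required status."""
    problems = []
    for f in fields(cls):
        if not f.metadata.get("description"):
            problems.append(f"{f.name}: missing description")
        status = f.metadata.get("status")
        if f.name not in committed and status not in ALLOWED_STATUSES:
            problems.append(f"{f.name}: non-committed field needs a status")
    return problems


@dataclass
class ToyArgs:
    # Committed argument: description only, no status.
    max_seq_len: int = field(
        default=2048, metadata={"description": "Maximum sequence length."})
    # Non-committed argument with no description and no status.
    enable_cache: bool = field(default=True)


problems = check_field_requirements(ToyArgs, committed={"max_seq_len"})
```

In this toy case problems lists two entries, both for enable_cache; adding a description and a status such as "prototype" to its metadata resolves them.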
Add the field to the appropriate schema file:
- Non-committed arguments: tests/unittest/api_stability/references/llm.yaml
  garbage_collection_gen0_threshold:
    type: int
    default: 20000
    status: beta  # Must match the status in code
- Committed arguments: tests/unittest/api_stability/references_committed/llm.yaml
  garbage_collection_gen0_threshold:
    type: int
    default: 20000
    # No status field for committed arguments
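Conceptually, the stability test compares what the code declares with what the schema file records. The sketch below mimics that comparison with an in-memory dictionary in place of the YAML file; field_as_schema, mismatches, and ToyArgs are hypothetical, and the real test may compare more attributes than these three.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List


@dataclass
class ToyArgs:
    garbage_collection_gen0_threshold: int = field(
        default=20000, metadata={"status": "beta"})


# Stand-in for the parsed contents of references/llm.yaml.
SCHEMA: Dict[str, dict] = {
    "garbage_collection_gen0_threshold": {
        "type": "int", "default": 20000, "status": "beta"},
}


def field_as_schema(f) -> dict:
    """Render one dataclass field in the same shape as a schema entry."""
    entry = {"type": getattr(f.type, "__name__", str(f.type)),
             "default": f.default}
    if "status" in f.metadata:
        entry["status"] = f.metadata["status"]
    return entry


def mismatches(cls, schema: Dict[str, dict]) -> List[str]:
    """Names of fields whose code definition differs from the schema."""
    return [f.name for f in fields(cls)
            if schema.get(f.name) != field_as_schema(f)]


in_sync = mismatches(ToyArgs, SCHEMA)
```

With the schema shown, in_sync is empty; changing the status or default on only one side makes the field name appear in the result, which is the situation the stability test is meant to catch.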
Run the API stability test to validate the change:
python -m pytest tests/unittest/api_stability/test_llm_api.py

Public methods in the LLM class constitute the API surface. All changes must be properly documented and tracked.
- The actual implementation is in the _TorchLLM class (llm.py)
- Public methods (not starting with _) are automatically exposed as APIs
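The "no leading underscore" rule can be demonstrated with the standard inspect module. ToyLLM and public_api are hypothetical names used only for this sketch; how the real tooling discovers methods is not shown here.

```python
import inspect
from typing import List


class ToyLLM:
    """Toy class standing in for the real implementation class."""

    def generate(self, prompt: str) -> str:
        return prompt

    def shutdown(self) -> None:
        pass

    def _build_engine(self) -> None:  # private helper, not part of the API
        pass


def public_api(cls) -> List[str]:
    """List methods that would count as public API under the naming rule."""
    return [
        name
        for name, _ in inspect.getmembers(cls, inspect.isfunction)
        if not name.startswith("_")
    ]


surface = public_api(ToyLLM)
```

Here surface contains generate and shutdown but not _build_engine, so adding a method without a leading underscore is enough to enlarge the API surface.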
Follow these steps to add a new API method:
For non-committed APIs, use the @set_api_status decorator:
@set_api_status("beta")
def generate_with_streaming(
    self,
    prompts: List[str],
    **kwargs
) -> Iterator[GenerationOutput]:
    """Generate text with streaming output.

    Args:
        prompts: Input prompts for generation
        **kwargs: Additional generation parameters

    Returns:
        Iterator of generation outputs
    """
    # Implementation here
    pass

For committed APIs, no decorator is needed:
def generate(self, prompts: List[str], **kwargs) -> GenerationOutput:
    """Generate text from prompts."""
    # Implementation here
    pass

Add the method to the appropriate llm.yaml file:
Non-committed API (tests/unittest/api_stability/references/llm.yaml):
generate_with_streaming:
  status: beta  # Must match @set_api_status
  parameters:
    - name: prompts
      type: List[str]
    - name: kwargs
      type: dict
  returns: Iterator[GenerationOutput]

Committed API (tests/unittest/api_stability/references_committed/llm.yaml):
generate:
  parameters:
    - name: prompts
      type: List[str]
    - name: kwargs
      type: dict
  returns: GenerationOutput

When modifying existing methods:
- Non-breaking changes (adding optional parameters):
  - Update the method signature
  - Update the schema file
  - Apply the api-compatible label if API stability reference files change
  - No status change needed
- Breaking changes (changing required parameters, return types):
  - Treat as breaking regardless of committed, prototype, beta, or deprecated status
  - Do not break committed APIs; prefer deprecation and migration paths
  - Avoid breaking non-committed APIs too, unless justified
  - Apply the api-breaking label
  - Include BREAKING in the PR title
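One way to see why adding an optional parameter is non-breaking is to replay an existing call against the new signature. The sketch below does that with inspect.signature(...).bind; still_accepts and the three generate_v* functions are hypothetical examples, and the check covers only call compatibility, not return types or changed defaults.

```python
import inspect


def still_accepts(func, *args, **kwargs) -> bool:
    """True if func's signature can bind the given call arguments."""
    try:
        inspect.signature(func).bind(*args, **kwargs)
    except TypeError:
        return False
    return True


def generate_v1(prompts):
    return prompts


def generate_v2(prompts, max_new_tokens=32):  # optional parameter added
    return prompts


def generate_v3(prompts, max_new_tokens):  # parameter made required
    return prompts


# A call written against v1 is replayed against the later versions.
old_call = (["hello"],)
compatible = still_accepts(generate_v2, *old_call)
breaking = not still_accepts(generate_v3, *old_call)
```

The old call binds to generate_v2 but not to generate_v3, which matches the split between the two bullets above.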
- Documentation: Always include comprehensive docstrings
- Type hints: Use proper type annotations for all parameters and returns
- Testing: Add unit tests for new methods
- Examples: Provide usage examples in the docstring
- Validation: Run API stability tests before submitting changes
Validate your changes:
# Run API stability tests
python -m pytest tests/unittest/api_stability/
# Run specific test for LLM API
python -m pytest tests/unittest/api_stability/test_llm_api.py -v

To promote a beta API to committed:
- Remove the @set_api_status("beta") decorator from the method
- Move the schema entry from tests/unittest/api_stability/references/ to tests/unittest/api_stability/references_committed/
- Remove the status field from the schema
- Update any documentation referring to the API's beta status
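The effect of removing the decorator can be pictured with a stand-in. toy_set_api_status below is not the real @set_api_status; it is a hypothetical decorator that merely records the status on the function, and the __api_status__ attribute name is an assumption made for this sketch.

```python
from typing import Callable, Optional


def toy_set_api_status(status: str) -> Callable:
    """Stand-in decorator: attach a status label to a method."""
    def decorator(func):
        func.__api_status__ = status  # hypothetical attribute name
        return func
    return decorator


def api_status(func) -> Optional[str]:
    """Return the recorded status, or None for an undecorated method."""
    return getattr(func, "__api_status__", None)


class BeforePromotion:
    @toy_set_api_status("beta")
    def generate_with_streaming(self, prompts):
        return iter(prompts)


class AfterPromotion:
    # Decorator removed: the method now carries no status label.
    def generate_with_streaming(self, prompts):
        return iter(prompts)


before = api_status(BeforePromotion.generate_with_streaming)
after = api_status(AfterPromotion.generate_with_streaming)
```

In this toy model before is "beta" and after is None, mirroring the schema change where the status field is dropped when the entry moves to references_committed/.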
To deprecate an API:
- Add @set_api_status("deprecated") to the method
- Update the schema with status: deprecated
- Add deprecation warning in the method:
  import warnings
  warnings.warn(
      "This method is deprecated and will be removed in v2.0. "
      "Use new_method() instead.",
      DeprecationWarning,
      stacklevel=2,
  )
- Document the migration path
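A deprecated method usually keeps working by forwarding to its replacement while it warns. The sketch below shows that pattern and how a caller or test can observe the warning; ToyLLM, old_method, and the behavior of new_method are hypothetical, and the @set_api_status("deprecated") decorator is omitted so the example stays self-contained.

```python
import warnings


class ToyLLM:
    """Toy class illustrating a deprecation with a migration path."""

    def new_method(self, prompt: str) -> str:
        return prompt.upper()

    def old_method(self, prompt: str) -> str:
        warnings.warn(
            "This method is deprecated and will be removed in v2.0. "
            "Use new_method() instead.",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        # Forward to the replacement so existing callers keep working.
        return self.new_method(prompt)


# Record warnings explicitly; DeprecationWarning is often hidden by default.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = ToyLLM().old_method("hi")
```

Forwarding keeps the old call sites functional for the deprecation period, and stacklevel=2 points the warning at the caller's code rather than at the library internals.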