
Commit eb3e6ed

[fix][5875912] Fix autoquant-autodeploy example (#878)
## What does this PR do?

**Type of change:** Bug fix

**Overview:** See the linked bug ticket (5875912) for details.

## Usage

```python
# Add a code snippet demonstrating how to use this
```

## Testing

Tested with:

```
./scripts/run_auto_quant_and_deploy.sh --hf_ckpt ./models/Qwen/Qwen3-8B --save_quantized_ckpt ./qwen3_8B_autoquant --quant fp8 --effective_bits 10.0
```

## Before your PR is "*Ready for review*"

- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes/No
- **Did you write any new necessary tests?**: Yes/No
- **Did you add or update any necessary documentation?**: Yes/No
- **Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: Yes/No

## Summary by CodeRabbit

* **Refactor**
  * Simplified LLM initialization by removing the intermediate configuration layer
  * Updated attention backend from triton to flashinfer

---

Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
1 parent: 10efcb6 · commit: eb3e6ed

File tree

1 file changed (+3, −8 lines)

examples/llm_autodeploy/api_server.py

Lines changed: 3 additions & 8 deletions
```diff
@@ -20,8 +20,7 @@
 
 import uvicorn
 from fastapi import FastAPI, HTTPException
-from tensorrt_llm._torch.auto_deploy import LLM, AutoDeployConfig
-from tensorrt_llm.builder import BuildConfig
+from tensorrt_llm._torch.auto_deploy import LLM
 from tensorrt_llm.llmapi.llm import RequestOutput
 from tensorrt_llm.sampling_params import SamplingParams
 from tensorrt_llm.serve.openai_protocol import (
@@ -45,11 +44,8 @@ def build_runner_from_config(args) -> LLM:
     """Builds a model runner from our config."""
     mto.enable_huggingface_checkpointing()
     model_kwargs = {"max_position_embeddings": args.max_seq_len, "use_cache": False}
-    build_config = BuildConfig(max_seq_len=args.max_seq_len, max_batch_size=args.max_batch_size)
-    build_config.plugin_config.tokens_per_block = args.max_seq_len
 
-    # setup AD config
-    ad_config = AutoDeployConfig(
+    llm = LLM(
         model=args.ckpt_path,
         compile_backend=args.compile_backend,
         device=args.device,
@@ -58,9 +54,8 @@ def build_runner_from_config(args) -> LLM:
         max_seq_len=args.max_seq_len,
         max_num_tokens=args.max_num_tokens,
         model_kwargs=model_kwargs,
-        attn_backend="triton",
+        attn_backend="flashinfer",
     )
-    llm = LLM(**ad_config.to_llm_kwargs())
 
     return llm
```

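The effect of the change can be sketched without the `tensorrt_llm` dependency: after this fix, the options that previously went through `AutoDeployConfig`/`BuildConfig` are passed straight to `LLM(...)`. The sketch below assembles those keyword arguments as a plain dict for illustration; the helper name `build_llm_kwargs` and the sample argument values are hypothetical, while the keys mirror the diff above.

```python
# Illustration only: after the fix, no intermediate AutoDeployConfig or
# BuildConfig object is created; everything is passed directly to LLM().
# In the real example this dict's contents are keyword arguments to
# tensorrt_llm._torch.auto_deploy.LLM.
def build_llm_kwargs(ckpt_path: str, compile_backend: str, device: str,
                     max_seq_len: int, max_num_tokens: int) -> dict:
    # HuggingFace model overrides, as in the example
    model_kwargs = {"max_position_embeddings": max_seq_len, "use_cache": False}
    return {
        "model": ckpt_path,
        "compile_backend": compile_backend,
        "device": device,
        "max_seq_len": max_seq_len,
        "max_num_tokens": max_num_tokens,
        "model_kwargs": model_kwargs,
        "attn_backend": "flashinfer",  # was "triton" before this fix
    }

# Hypothetical sample values
kwargs = build_llm_kwargs("./models/Qwen/Qwen3-8B", "torch-compile", "cuda", 4096, 8192)
```

The real call is then simply `LLM(**kwargs)`, replacing the removed `LLM(**ad_config.to_llm_kwargs())` indirection.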