### 🚀 The feature, motivation and pitch
#### Problem
In `tensorrt_llm/quantization/layers.py`, `SmoothQuantAttention.forward()` asserts
`lora_layer_params is None` with the message:

`AssertionError: lora is not supported on SmoothQuantAttention now`

This means engines built from W8A8 SmoothQuant checkpoints (`--smoothquant 0.5`) cannot use
`--lora_plugin auto` at build time or load LoRA adapters at runtime.
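For illustration, the blocking behaviour boils down to the following simplified stand-in for the real method; only the assertion and its message mirror the actual code, the signature and body here are assumptions:

```python
# Simplified stand-in for SmoothQuantAttention.forward(); only the assertion and
# its message mirror the actual code, the rest is illustrative.
def forward(hidden_states, lora_layer_params=None):
    assert lora_layer_params is None, \
        "lora is not supported on SmoothQuantAttention now"
    return hidden_states  # placeholder for the real attention computation

forward("hidden")                           # fine: no LoRA parameters
# forward("hidden", lora_layer_params={})   # raises the AssertionError quoted above
```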
#### Motivation
Ampere GPUs (A10, A100) do not have FP8 tensor cores, so W8A8 SmoothQuant is the only
activation-quantized option for these GPUs. At the same time, per-request LoRA is critical
for multi-tenant serving (task-specific fine-tunes). Today users on Ampere must choose between:
- W8A8 SQ (good quality, no LoRA) — unusable for multi-tenant serving
- W4A16 AWQ (LoRA works, but lower quality due to 4-bit weights)
This leaves no path to high-quality quantization + LoRA on Ampere hardware.
Note: Issue #2604 reported the same problem in Dec 2024 and was closed as "completed" in Jun 2025,
but the assertion is still present in v1.1.0 (tested with `nvcr.io/nvidia/tritonserver:25.02-trtllm-python-py3`).
#### Pitch
Add LoRA support for the SmoothQuant layers (`SmoothQuantAttention`, `SmoothQuantLinear`) the same
way it is done for FP8:
- Remove the assertion guard.
- Keep LoRA computation in FP16/BF16 — run the LoRA low-rank path after the INT8 GEMM and add
the result to the dequantized output.
- For the attention path, apply LoRA to the Q/K/V projections and the dense output the same way
`Attention.forward()` does.
This approach preserves the INT8 GEMM throughput for the base model while adding the LoRA delta
in higher precision, matching the pattern already used by FP8 + LoRA.
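To make the intended dataflow concrete, below is a minimal NumPy sketch of the linear case, assuming per-tensor scales. The function and argument names are illustrative, not TensorRT-LLM APIs; the real change would route through the existing INT8 GEMM and LoRA plugins.

```python
import numpy as np

def smoothquant_linear_with_lora(x, w_int8, w_scale, act_scale, lora_a, lora_b):
    """Sketch of the proposed flow: INT8 base GEMM plus a high-precision LoRA delta.

    x:         [tokens, in_features] activations in FP16/FP32
    w_int8:    [out_features, in_features] SmoothQuant INT8 weights
    w_scale:   per-tensor weight dequantization scale (float)
    act_scale: per-tensor activation quantization scale (float)
    lora_a:    [in_features, rank], lora_b: [rank, out_features]
    """
    # Base path: quantize activations, run the INT8 GEMM, dequantize the result.
    x_int8 = np.clip(np.round(x / act_scale), -128, 127).astype(np.int8)
    acc = x_int8.astype(np.int32) @ w_int8.astype(np.int32).T
    base = acc.astype(np.float32) * (act_scale * w_scale)

    # LoRA path: stays in the original precision and is added after dequantization,
    # mirroring the existing FP8 + LoRA pattern (base GEMM quantized, delta not).
    lora_delta = (x @ lora_a) @ lora_b
    return base + lora_delta
```

For `SmoothQuantAttention`, the same high-precision delta would be applied to the QKV projections and to the dense output projection, following the pattern `Attention.forward()` already uses for the non-quantized path.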
#### Reproduction
```bash
# Convert checkpoint with SmoothQuant
python3 examples/llama/convert_checkpoint.py \
    --model_dir ./Meta-Llama-3.1-8B-Instruct \
    --output_dir ./ckpt_w8a8sq \
    --dtype float16 --smoothquant 0.5

# Build engine with LoRA — fails
trtllm-build \
    --checkpoint_dir ./ckpt_w8a8sq \
    --output_dir ./engine_w8a8sq \
    --gpt_attention_plugin auto \
    --gemm_plugin auto \
    --max_num_tokens 32768 \
    --max_batch_size 64 \
    --lora_plugin auto \
    --lora_dir ./lora_adapter
```
### Alternatives
_No response_
### Additional context
- Related open issue: #12202 (FP4 + LoRA, same pattern — we contributed a patch for that one)
- We are happy to contribute a patch for the SmoothQuant + LoRA path if the team can provide
guidance on which code paths need coverage (single-GPU, TP, gemm_allreduce_plugin, etc.).
### Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and checked the [documentation](https://nvidia.github.io/TensorRT-LLM/) and [examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples) for answers to frequently asked questions.