[Feature]: LoRA support for W8A8 SmoothQuant (SmoothQuantAttention / SmoothQuantLinear) #12703

@langzhao-netizen

Description

🚀 The feature, motivation and pitch

Problem

In tensorrt_llm/quantization/layers.py, SmoothQuantAttention.forward() asserts
lora_layer_params is None with the message:

AssertionError: lora is not supported on SmoothQuantAttention now

This means engines built from W8A8 SmoothQuant checkpoints (--smoothquant 0.5) cannot use
--lora_plugin auto at build time or load LoRA adapters at runtime.
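For reference, the guard looks roughly like the following (a paraphrase reconstructed from the error message above, not the exact library source; `forward_guard` is a hypothetical stand-in for the check inside `SmoothQuantAttention.forward()`):

```python
def forward_guard(lora_layer_params=None):
    # Any non-None lora_layer_params trips the assertion, so engines built
    # from W8A8 SmoothQuant checkpoints cannot carry LoRA adapters at all.
    assert lora_layer_params is None, (
        "lora is not supported on SmoothQuantAttention now")
```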

Motivation

Ampere GPUs (A10, A100) do not have FP8 tensor cores, so W8A8 SmoothQuant is the only
activation-quantized option for these GPUs. At the same time, per-request LoRA is critical
for multi-tenant serving (task-specific fine-tunes). Today users on Ampere must choose between:

  • W8A8 SQ (good quality, no LoRA) — unusable for multi-tenant serving
  • W4A16 AWQ (LoRA works, but lower quality due to 4-bit weights)

This leaves no path to high-quality quantization + LoRA on Ampere hardware.

Note: Issue #2604 reported the same problem in Dec 2024 and was closed as "completed" in Jun 2025,
but the assertion is still present in v1.1.0 (tested with nvcr.io/nvidia/tritonserver:25.02-trtllm-python-py3).

Pitch

Add LoRA support for SmoothQuant layers (SmoothQuantAttention, SmoothQuantLinear) the same
way it is done for FP8:

  1. Remove the assertion guard.
  2. Keep LoRA computation in FP16/BF16 — run the LoRA low-rank path after the INT8 GEMM and add
    the result to the dequantized output.
  3. For the attention path, apply LoRA to Q/K/V projections and dense output the same way
    Attention.forward() does.

This approach preserves the INT8 GEMM throughput for the base model while adding the LoRA delta
in higher precision, matching the pattern already used by FP8 + LoRA.
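The steps above can be sketched in NumPy. This is a minimal illustration only (per-tensor symmetric scales for brevity, whereas real SmoothQuant uses per-channel scales and smoothing factors; all names here are hypothetical): the base path runs an INT8 GEMM with INT32 accumulation and dequantizes, while the LoRA low-rank path stays in FP16 and its delta is added to the dequantized output.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, rank = 64, 8

x = rng.standard_normal((4, hidden)).astype(np.float16)  # activations
w = rng.standard_normal((hidden, hidden)).astype(np.float16)  # base weight

# --- base path: symmetric INT8 quantization + INT8 GEMM (INT32 accumulate) ---
s_x = np.abs(x).max() / 127.0
s_w = np.abs(w).max() / 127.0
x_q = np.clip(np.round(x / s_x), -127, 127).astype(np.int8)
w_q = np.clip(np.round(w / s_w), -127, 127).astype(np.int8)
acc = x_q.astype(np.int32) @ w_q.astype(np.int32)
base = (acc * (s_x * s_w)).astype(np.float16)  # dequantized base output

# --- LoRA path stays in FP16: delta = (x @ A) @ B * (alpha / r) ---
A = (rng.standard_normal((hidden, rank)) * 0.01).astype(np.float16)
B = (rng.standard_normal((rank, hidden)) * 0.01).astype(np.float16)
alpha = 16.0
lora_delta = (x @ A) @ B * np.float16(alpha / rank)

# final output: INT8-GEMM throughput for the base, FP16 precision for the delta
y = base + lora_delta
```

The same pattern would apply per projection (Q/K/V and the dense output) in the attention path, mirroring what `Attention.forward()` already does for the unquantized case.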

Reproduction

```bash
# Convert checkpoint with SmoothQuant
python3 examples/llama/convert_checkpoint.py \
  --model_dir ./Meta-Llama-3.1-8B-Instruct \
  --output_dir ./ckpt_w8a8sq \
  --dtype float16 --smoothquant 0.5

# Build engine with LoRA (fails)
trtllm-build \
  --checkpoint_dir ./ckpt_w8a8sq \
  --output_dir ./engine_w8a8sq \
  --gpt_attention_plugin auto \
  --gemm_plugin auto \
  --max_num_tokens 32768 \
  --max_batch_size 64 \
  --lora_plugin auto \
  --lora_dir ./lora_adapter
```

### Alternatives

_No response_

### Additional context

- Related open issue: #12202 (FP4 + LoRA, same pattern — we contributed a patch for that one)
- We are happy to contribute a patch for the SmoothQuant + LoRA path if the team can provide
  guidance on which code paths need coverage (single-GPU, TP, gemm_allreduce_plugin, etc.).

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and checked the [documentation](https://nvidia.github.io/TensorRT-LLM/) and [examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples) for answers to frequently asked questions.

Metadata

Labels

  • Lora/P-tuning: Parameter-Efficient Fine-Tuning (PEFT) like LoRA/P-tuning in TRTLLM: adapter use & perf.
  • feature request: New feature or request. This includes new model, dtype, functionality support.
