[Feature]: LoRA support for W8A8 SmoothQuant (SmoothQuantAttention / SmoothQuantLinear) #12703

@langzhao-netizen

Description

🚀 The feature, motivation and pitch

Problem

In tensorrt_llm/quantization/layers.py, SmoothQuantAttention.forward() asserts
lora_layer_params is None with the message:

AssertionError: lora is not supported on SmoothQuantAttention now

This means engines built from W8A8 SmoothQuant checkpoints (--smoothquant 0.5) cannot use
--lora_plugin auto at build time or load LoRA adapters at runtime.
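For reference, the guard looks roughly like the following (a paraphrase reconstructed from the error message above, not the exact library source; `forward_guard` is a hypothetical stand-in for the check inside `SmoothQuantAttention.forward()`):

```python
def forward_guard(lora_layer_params=None):
    # Any non-None lora_layer_params trips the assertion, so engines built
    # from W8A8 SmoothQuant checkpoints cannot carry LoRA adapters at all.
    assert lora_layer_params is None, (
        "lora is not supported on SmoothQuantAttention now")
```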

Motivation

Ampere GPUs (A10, A100) do not have FP8 tensor cores, so W8A8 SmoothQuant is the only
activation-quantized option for these GPUs. At the same time, per-request LoRA is critical
for multi-tenant serving (task-specific fine-tunes). Today users on Ampere must choose between:

  • W8A8 SQ (good quality, no LoRA) — unusable for multi-tenant serving
  • W4A16 AWQ (LoRA works, but lower quality due to 4-bit weights)

This leaves no path to high-quality quantization + LoRA on Ampere hardware.

Note: Issue #2604 reported the same problem in Dec 2024 and was closed as "completed" in Jun 2025,
but the assertion is still present in v1.1.0 (tested with nvcr.io/nvidia/tritonserver:25.02-trtllm-python-py3).

Pitch

Add LoRA support for SmoothQuant layers (SmoothQuantAttention, SmoothQuantLinear) the same
way it is done for FP8:

  1. Remove the assertion guard.
  2. Keep LoRA computation in FP16/BF16 — run the LoRA low-rank path after the INT8 GEMM and add
    the result to the dequantized output.
  3. For the attention path, apply LoRA to Q/K/V projections and dense output the same way
    Attention.forward() does.

This approach preserves the INT8 GEMM throughput for the base model while adding the LoRA delta
in higher precision, matching the pattern already used by FP8 + LoRA.
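The steps above can be sketched in NumPy. This is a minimal illustration only (per-tensor symmetric scales for brevity, whereas real SmoothQuant uses per-channel scales and smoothing factors; all names here are hypothetical): the base path runs an INT8 GEMM with INT32 accumulation and dequantizes, while the LoRA low-rank path stays in FP16 and its delta is added to the dequantized output.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, rank = 64, 8

x = rng.standard_normal((4, hidden)).astype(np.float16)  # activations
w = rng.standard_normal((hidden, hidden)).astype(np.float16)  # base weight

# --- base path: symmetric INT8 quantization + INT8 GEMM (INT32 accumulate) ---
s_x = np.abs(x).max() / 127.0
s_w = np.abs(w).max() / 127.0
x_q = np.clip(np.round(x / s_x), -127, 127).astype(np.int8)
w_q = np.clip(np.round(w / s_w), -127, 127).astype(np.int8)
acc = x_q.astype(np.int32) @ w_q.astype(np.int32)
base = (acc * (s_x * s_w)).astype(np.float16)  # dequantized base output

# --- LoRA path stays in FP16: delta = (x @ A) @ B * (alpha / r) ---
A = (rng.standard_normal((hidden, rank)) * 0.01).astype(np.float16)
B = (rng.standard_normal((rank, hidden)) * 0.01).astype(np.float16)
alpha = 16.0
lora_delta = (x @ A) @ B * np.float16(alpha / rank)

# final output: INT8-GEMM throughput for the base, FP16 precision for the delta
y = base + lora_delta
```

The same pattern would apply per projection (Q/K/V and the dense output) in the attention path, mirroring what `Attention.forward()` already does for the unquantized case.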

Reproduction

```bash
# Convert checkpoint with SmoothQuant
python3 examples/llama/convert_checkpoint.py \
  --model_dir ./Meta-Llama-3.1-8B-Instruct \
  --output_dir ./ckpt_w8a8sq \
  --dtype float16 --smoothquant 0.5

# Build engine with LoRA (fails)
trtllm-build \
  --checkpoint_dir ./ckpt_w8a8sq \
  --output_dir ./engine_w8a8sq \
  --gpt_attention_plugin auto \
  --gemm_plugin auto \
  --max_num_tokens 32768 \
  --max_batch_size 64 \
  --lora_plugin auto \
  --lora_dir ./lora_adapter
```

### Alternatives

_No response_

### Additional context

- Related open issue: #12202 (FP4 + LoRA, same pattern — we contributed a patch for that one)
- We are happy to contribute a patch for the SmoothQuant + LoRA path if the team can provide
  guidance on which code paths need coverage (single-GPU, TP, gemm_allreduce_plugin, etc.).

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and checked the [documentation](https://nvidia.github.io/TensorRT-LLM/) and [examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples) for answers to frequently asked questions.

Metadata

Labels

  • Lora/P-tuning: Parameter-Efficient Fine-Tuning (PEFT) like LoRA/P-tuning in TRTLLM: adapter use & perf.
  • feature request: New feature or request. This includes new model, dtype, functionality support.
