feat: add Hygon DCU INT8 hipBLASLt GEMM path #1199

Draft
starrkk wants to merge 3 commits into ModelTC:main from starrkk:codex/hygon-dcu-int8-hipblaslt

starrkk commented Jun 30, 2026

Summary

  • add an int8-vllm-hygon-dcu MM weight backend using Hygon DCU hipBLASLt W8A8 channelwise GEMM
  • support auto-quantized BF16 weights and optional selective BF16 fallback
  • expose helpers for shared activation quantization across repeated projections
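
The arithmetic behind the bullets above can be sketched in plain Python. This is a dependency-free reference, not the hipBLASLt kernel or the helpers added by this PR: the function names are hypothetical, and the scale layout (one scale per token for activations, one per output channel for weights) is an assumption based on the terms "channelwise" and "dynamic activation". It also shows the idea of quantizing an activation once and reusing it for several projections.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quantize_per_row(x: Matrix) -> Tuple[List[List[int]], List[float]]:
    """Symmetric INT8 quantization with one scale per row.

    Applied to activations this yields per-token scales; applied to a weight
    stored as [out_features, in_features] it yields per-output-channel scales.
    """
    q_rows, scales = [], []
    for row in x:
        amax = max(abs(v) for v in row)
        scale = amax / 127.0 if amax > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a8_gemm(x_q, x_scale, w_q, w_scale) -> Matrix:
    """Integer matmul x_q @ w_q^T, rescaled by x_scale[i] * w_scale[j]."""
    out = []
    for xi, sx in zip(x_q, x_scale):
        row = []
        for wj, sw in zip(w_q, w_scale):
            acc = sum(a * b for a, b in zip(xi, wj))  # exact integer accumulation
            row.append(acc * sx * sw)                 # dequantize the accumulator
        out.append(row)
    return out


x = [[0.5, -1.0, 0.25], [2.0, 0.0, -0.5]]        # 2 tokens, 3 features
w_q_proj = [[1.0, 0.5, -0.5], [0.0, 2.0, 1.0]]   # 2 output channels each
w_k_proj = [[-1.0, 1.0, 0.0], [0.5, 0.5, 0.5]]

# Quantize the activation once, then reuse it for both projections.
x_q, x_scale = quantize_per_row(x)
q_out = w8a8_gemm(x_q, x_scale, *quantize_per_row(w_q_proj))
k_out = w8a8_gemm(x_q, x_scale, *quantize_per_row(w_k_proj))
```

In a real model the weights would be quantized once at load time rather than per call; only the activation quantization is dynamic.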

Why

This enables faster INT8 GEMM execution on Hygon DCU while keeping the path gated by the quantization scheme and clear dependency checks.
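
One way such gating could look is sketched below. It is an illustration only: `require_backend` is a hypothetical helper, the module name passed to it is a placeholder, and the PR's actual check may be structured differently. The idea is that other quantization schemes never touch the optional dependency, while the Hygon DCU scheme fails early with a descriptive error when the library is missing.

```python
import importlib.util


def require_backend(scheme: str, module_name: str) -> bool:
    """Return True when `scheme` selects the backend and its module is importable.

    Raises ImportError up front instead of failing later inside a forward pass.
    """
    # Scheme name taken from the PR summary; the helper itself is hypothetical.
    if scheme != "int8-vllm-hygon-dcu":
        return False  # other schemes never need the optional dependency
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"Quantization scheme {scheme!r} needs the optional module "
            f"{module_name!r}, which is not installed."
        )
    return True
```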

Validation

  • branch rebuilt on latest ModelTC/LightX2V:main (89dfa833)
  • git diff --check passed for the PR branch
  • validated as part of the HunyuanVideo1.5 I2V 8-card benchmark path on Hygon DCU

zhenggf added 2 commits June 30, 2026 11:50
(cherry picked from commit a3a1a1f870b768929d8ca073f0c74added572087)
Add reusable quantized-input helpers for Hygon DCU W8A8 dynamic activation GEMMs, and support selective BF16 fallback for configured INT8 weights.

(cherry picked from commit 58dab25b69c41c6ec9a24df0fe584ca93534eacc)

gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a selective BF16 fallback mechanism and integrates hipblaslt_w8a8_channelwise_gemm for Hygon DCU into the quantization pipeline, along with helper functions for managing weights, biases, and environment flags. It also updates the module loading logic to support custom load functions. The review feedback covers three performance and robustness points: cache the cast weight tensor in the BF16 fallback path to avoid repeated casting, drop the redundant .contiguous() calls on weights that are already contiguous during GEMM execution, and check that module.weight exists before loading auto-quantized biases to avoid a potential AttributeError.

Comment on lines +202 to +207
def _apply_bf16(self, input_tensor):
    weight = self.weight
    if weight.dtype != input_tensor.dtype:
        weight = weight.to(input_tensor.dtype)
    bias = _bias_or_none(self, input_tensor.dtype)
    return F.linear(input_tensor, weight, bias)

high

Casting a large weight tensor on every forward pass is inefficient: each call allocates a fresh copy, which adds latency and GPU memory churn. We should cache the cast weight in self.weight so that subsequent forward passes reuse it directly.

Suggested change
- def _apply_bf16(self, input_tensor):
-     weight = self.weight
-     if weight.dtype != input_tensor.dtype:
-         weight = weight.to(input_tensor.dtype)
-     bias = _bias_or_none(self, input_tensor.dtype)
-     return F.linear(input_tensor, weight, bias)
+ def _apply_bf16(self, input_tensor):
+     if self.weight.dtype != input_tensor.dtype:
+         self.weight = self.weight.to(input_tensor.dtype)
+     bias = _bias_or_none(self, input_tensor.dtype)
+     return F.linear(input_tensor, self.weight, bias)

        bias = bias.to(torch.float32)
    elif hasattr(module, "infer_dtype"):
        bias = bias.to(module.infer_dtype)
    module.bias = bias.to(module.weight.device)

medium

To prevent a potential AttributeError if module.weight is None or not yet initialized, we should check that it exists and otherwise fall back to the bias tensor's own device.

Suggested change
- module.bias = bias.to(module.weight.device)
+ device = module.weight.device if getattr(module, "weight", None) is not None else bias.device
+ module.bias = bias.to(device)
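
The `getattr(..., None)` guard in the suggestion handles two cases at once: a module with no `weight` attribute at all, and one whose `weight` is set to None. A minimal sketch with stand-in objects (not real tensors or LightX2V modules; `target_device` and the device strings are placeholders for illustration) shows the three outcomes:

```python
from types import SimpleNamespace


def target_device(module, bias):
    """Use the weight's device when a weight exists, else keep the bias's device."""
    weight = getattr(module, "weight", None)  # covers "missing" and "set to None"
    return weight.device if weight is not None else bias.device


# Stand-ins: only the `.device` attribute matters for this sketch.
bias = SimpleNamespace(device="cpu")
with_weight = SimpleNamespace(weight=SimpleNamespace(device="cuda:0"))
weight_is_none = SimpleNamespace(weight=None)
no_weight_attr = SimpleNamespace()

devices = [target_device(m, bias) for m in (with_weight, weight_is_none, no_weight_attr)]
```

A plain `module.weight.device` would raise AttributeError for the second and third stand-ins.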

Comment on lines +175 to +179
_, output_tensor = hipblaslt_gemm(
    a=input_tensor_quant.contiguous(),
    b=self.weight.contiguous(),
    scale_a=input_tensor_scale.contiguous(),
    scale_b=self.weight_scale.contiguous(),

medium

Since self.weight and self.weight_scale are already made contiguous during the load() phase (via _make_weight_contiguous), calling .contiguous() on them during every forward pass is redundant and adds unnecessary overhead.

Suggested change
- _, output_tensor = hipblaslt_gemm(
-     a=input_tensor_quant.contiguous(),
-     b=self.weight.contiguous(),
-     scale_a=input_tensor_scale.contiguous(),
-     scale_b=self.weight_scale.contiguous(),
+ _, output_tensor = hipblaslt_gemm(
+     a=input_tensor_quant.contiguous(),
+     b=self.weight,
+     scale_a=input_tensor_scale.contiguous(),
+     scale_b=self.weight_scale,
