[Kunlunxin]support DS R1/V3 w4a8int8 per-channel quantization #226

Merged
yghstill merged 3 commits into Tencent:main from lishaoao:ds_w4a8in8_quant
Jan 28, 2026

Contributor

@lishaoao lishaoao commented Jan 22, 2026

What's Changed

  1. Add a new quantization field, w4a8i8, that enables the w4a8int8 quantization method for DeepSeek R1/V3 models running on Kunlunxin vLLM
  2. Introduce the W4A8INT8 module, which inherits from GPTQ, to support DeepSeek R1/V3 w4a8int8 quantization with minimal changes to existing code
  3. Implement W4A8Int8QuantLinear to support quantized weight packing and a custom compressed-tensors format
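
To illustrate what "quantized weight packing" usually means for int4 weights, here is a minimal sketch in plain Python. It assumes two signed int4 values are stored per byte with the low nibble first; the actual layout used by W4A8Int8QuantLinear is not described in this PR text, and the function names `pack_int4` and `unpack_int4` are hypothetical.

```python
def pack_int4(values):
    """Pack signed int4 values (-8..7) two per byte, low nibble first.

    Assumed layout for illustration only; not taken from the PR.
    """
    if len(values) % 2:
        raise ValueError("need an even number of values")
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        for v in (lo, hi):
            if not -8 <= v <= 7:
                raise ValueError(f"{v} does not fit in int4")
        # Masking with 0xF keeps the two's-complement nibble of each value.
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)


def unpack_int4(packed):
    """Inverse of pack_int4: recover the signed int4 values."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Nibbles 8..15 encode the negative values -8..-1.
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out
```

Packing halves the storage of an int4 weight tensor compared with keeping one value per int8 element, which is the usual motivation for this step.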

Key Details

  • The W4A8INT8 module inherits from GPTQ to keep code changes small and is adapted specifically for DeepSeek R1/V3 w4a8int8 quantization
  • Quantization rules: weights use per-channel symmetric int4 quantization; activations use per-token dynamic symmetric int8 quantization
  • MoE layers use int4 per-channel symmetric quantization, while the other components use int8 per-channel symmetric quantization, matching the configuration file deepseek_r1_w4a8_int8.yaml
  • W4A8Int8QuantLinear core functions: quantized weight packing and support for the custom compressed-tensors format
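
The quantization rules above can be sketched in dependency-free Python as follows. This is an illustration under stated assumptions, not the PR's implementation: the function names are hypothetical, int4 is assumed to use the symmetric range [-7, 7] and int8 the range [-127, 127], and weight rows are assumed to be output channels.

```python
def quantize_symmetric(row, qmax):
    """Symmetric quantization of one row to integers in [-qmax, qmax]."""
    abs_max = max(abs(x) for x in row)
    # Guard against an all-zero row, which would give a zero scale.
    scale = abs_max / qmax if abs_max > 0 else 1.0
    q = [max(-qmax, min(qmax, round(x / scale))) for x in row]
    return q, scale


def quantize_weight_per_channel(weight, qmax=7):
    """One static scale per output channel (row); int4 range assumed [-7, 7]."""
    pairs = [quantize_symmetric(row, qmax) for row in weight]
    return [q for q, _ in pairs], [s for _, s in pairs]


def quantize_activation_per_token(act, qmax=127):
    """One scale per token (row), computed at run time from that token."""
    pairs = [quantize_symmetric(row, qmax) for row in act]
    return [q for q, _ in pairs], [s for _, s in pairs]
```

Weight scales are computed once at quantization time, whereas activation scales are "dynamic" because they are recomputed from each incoming token.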

Testing

  • Tested the quantized DeepSeek V3.1 w4a8int8 model on the GPQA_Diamond dataset
  • Achieved a score of 74.44 against the official baseline of 74.9, a gap of 0.46 points

Usage

#!/usr/bin/env bash

SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)

PROJECT_ROOT=$SCRIPT_DIR

export PYTHONPATH=$PROJECT_ROOT:$PYTHONPATH

NNODES=1
NPROC_PER_NODE=8
MASTER_ADDR=localhost

CONFIG=$PROJECT_ROOT/configs/deepseek_r1/w4a8_int8/deepseek_r1_w4a8_int8_kunlun.yaml

torchrun \
    --nnodes $NNODES \
    --nproc-per-node $NPROC_PER_NODE \
    --node-rank 0 \
    --master-addr $MASTER_ADDR \
    $PROJECT_ROOT/tools/run.py \
    --config $CONFIG \
    --multi-nodes

return None


class DeepSeekV3W4A8Int8Save(DeepSeekV3PTQSaveMulti):
Collaborator

Place this class in core/save.py.

scale = abs_max / qmax
quantized = torch.round(tensor / scale)
quantized = torch.clamp(quantized, -qmax, qmax)
return quantized.to(torch.int8), scale.to(torch.float32)
Contributor Author

Move this into quant_func.py.

return None


class DeepSeekV3W4A8Int8Save(DeepSeekV3PTQSaveMulti):
Contributor Author

Move this into save.py.

# Global configuration of pipeline
global:
  save_path: ./output
  deploy_backend: vllm
Contributor Author

Rename the file to configs/deepseek_r1/w4a8_int8/deepseek_r1_w4a8_int8_kunlun.yaml.

@yghstill yghstill merged commit 1ea6e3d into Tencent:main Jan 28, 2026
5 checks passed
dawnranger pushed a commit to dawnranger/AngelSlim that referenced this pull request Mar 11, 2026
4 participants