tools:Move Embed/MHA/RNN/LSTM/GRU weight scale generation to ncnn2table #6688

Merged
nihui merged 13 commits into Tencent:master from Roundaboutt:opt-quantize-int8
May 18, 2026

@Roundaboutt
Contributor

Description

This PR moves static weight scale generation for several non-convolution layers from ncnn2int8 to ncnn2table, following the same table-driven workflow already used by other quantized layers.

Changes

  • Add Embed and MultiHeadAttention weight scale generation to ncnn2table
  • Add RNN, LSTM, and GRU weight scale generation to ncnn2table
  • Update ncnn2int8 to read these scales from the calibration table instead of recomputing them locally
  • Make the calibration dataset optional for models that only need static weight scales and do not require activation calibration
  • Keep SDPA unchanged, since it uses dynamic activation quantization in forward_int8
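
For context, a static weight scale of this kind can be sketched as below. This assumes a symmetric per-row absmax scheme (scale = 127 / max|w|), which the PR description does not spell out. The helper name `compute_row_scales` and the flat row-major weight layout are hypothetical and are not taken from ncnn2table.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: one int8 scale per row of a row-major weight matrix.
// Assumed scheme: scale = 127 / absmax(row), so that row * scale fits in [-127, 127].
std::vector<float> compute_row_scales(const std::vector<float>& weights, std::size_t rows, std::size_t cols)
{
    assert(weights.size() == rows * cols);

    std::vector<float> scales(rows, 1.f);
    for (std::size_t r = 0; r < rows; r++)
    {
        float absmax = 0.f;
        for (std::size_t c = 0; c < cols; c++)
            absmax = std::max(absmax, std::fabs(weights[r * cols + c]));

        // An all-zero row keeps scale 1 to avoid dividing by zero.
        if (absmax > 0.f)
            scales[r] = 127.f / absmax;
    }
    return scales;
}
```

Because such scales depend only on the weights, they can be computed without running any calibration images through the network, which is consistent with the dataset becoming optional for static-weight-only models.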

Test

Tested with minimal RNN, LSTM, GRU, and Embed-Attn networks:

Embed-Attn

quantized param files:

7767517
3 3
Input                    in0                      0 1 in0
Embed                    embed_0                  1 1 in0 1 0=8 1=16 3=128 18=2
MultiHeadAttention       attention_1              1 1 1 out0 0=8 1=2 2=64 3=8 4=8 6=5.000000e-01 18=2

precision analysis:

fp32 model : tiny_embed_attn.ncnn.param/.bin
int8 model : tiny_embed_attn_int8.ncnn.param/.bin
samples    : 100
seq_len    : 4
input_size : 8
seed       : 0

overall metrics
  max_abs  = 0.00712827
  mean_abs = 0.00212720
  rmse     = 0.00247913

RNN

quantized param files:

7767517
3 3
Input                    in0                      0 1 in0
RNN                      rnn_1                    1 1 in0 1 0=8 1=64 8=2
Gemm                     gemm_0                   1 1 1 out0 3=1 5=1 6=1 7=4 8=4 9=8 10=4 18=2

precision analysis:

fp32 model : tiny_rnn.ncnn.param/.bin
int8 model : tiny_rnn_int8.ncnn.param/.bin
samples    : 100
seq_len    : 4
input_size : 8
seed       : 0

overall metrics
  max_abs  = 0.04329279
  mean_abs = 0.00797669
  rmse     = 0.01239488

GRU

quantized param files:

7767517
3 3
Input                    in0                      0 1 in0
GRU                      gru_1                    1 1 in0 1 0=8 1=192 8=2
Gemm                     gemm_0                   1 1 1 out0 3=1 5=1 6=1 7=4 8=4 9=8 10=4 18=2

precision analysis:

fp32 model : tiny_gru.ncnn.param/.bin
int8 model : tiny_gru_int8.ncnn.param/.bin
samples    : 100
seq_len    : 4
input_size : 8
seed       : 0

overall metrics
  max_abs  = 0.00559735
  mean_abs = 0.00107971
  rmse     = 0.00136703

LSTM

quantized param files:

7767517
3 3
Input                    in0                      0 1 in0
LSTM                     lstm_1                   1 1 in0 1 0=8 1=256 3=8 8=2
Gemm                     gemm_0                   1 1 1 out0 3=1 5=1 6=1 7=4 8=4 9=8 10=4 18=2

precision analysis:

fp32 model : tiny_lstm.ncnn.param/.bin
int8 model : tiny_lstm_int8.ncnn.param/.bin
samples    : 100
seq_len    : 4
input_size : 8
seed       : 0

overall metrics
  max_abs  = 0.00386286
  mean_abs = 0.00055465
  rmse     = 0.00072828
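
The three metrics reported in each precision analysis above could be computed as in the following sketch. It assumes the fp32 and int8 outputs are flattened into equally sized arrays and compared element-wise; the PR does not include the evaluation script, so `ErrorMetrics` and `compare_outputs` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for the three reported metrics.
struct ErrorMetrics
{
    double max_abs;  // largest absolute difference
    double mean_abs; // mean absolute difference
    double rmse;     // root mean squared error
};

// Compares a reference (fp32) output against a test (int8) output element by element.
ErrorMetrics compare_outputs(const std::vector<float>& ref, const std::vector<float>& test)
{
    assert(ref.size() == test.size() && !ref.empty());

    double max_abs = 0.0, sum_abs = 0.0, sum_sq = 0.0;
    for (std::size_t i = 0; i < ref.size(); i++)
    {
        const double d = std::fabs((double)ref[i] - (double)test[i]);
        max_abs = std::max(max_abs, d);
        sum_abs += d;
        sum_sq += d * d;
    }

    const double n = (double)ref.size();
    ErrorMetrics m = {max_abs, sum_abs / n, std::sqrt(sum_sq / n)};
    return m;
}
```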

@nihui
Member

nihui commented Apr 27, 2026

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fe827598da

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tools/quantize/ncnn2int8.cpp
@Roundaboutt
Contributor Author

@codex review

The same approach is already used in existing functions such as quantize_convolution():

        if (iter == weight_int8scale_table.end())
        {
            fprintf(stderr, "this layer need to be quantized, but no scale param!\n");
            return -1;
        }

Since the main function doesn't check this return value, I'm not sure whether this is a minor bug, so I decided to keep the existing behavior.
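
For illustration only, a caller that does check the return value might look like the sketch below. `ScaleTable`, `quantize_layer`, and `quantize_all` are hypothetical stand-ins rather than the actual ncnn2int8 code, and `std::vector<float>` replaces `ncnn::Mat` so the example stays self-contained.

```cpp
#include <cstddef>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the weight scale table; the real code stores ncnn::Mat values.
typedef std::map<std::string, std::vector<float> > ScaleTable;

// Mimics the lookup pattern quoted above: report and return -1 when the scale is missing.
int quantize_layer(const ScaleTable& table, const std::string& layer_name)
{
    const std::string key = layer_name + "_param_0";
    if (table.find(key) == table.end())
    {
        fprintf(stderr, "this layer need to be quantized, but no scale param!\n");
        return -1;
    }
    // Actual weight quantization would happen here.
    return 0;
}

// Hypothetical driver: stops at the first failure and hands the error code to its caller.
int quantize_all(const ScaleTable& table, const std::vector<std::string>& layer_names)
{
    for (std::size_t i = 0; i < layer_names.size(); i++)
    {
        const int ret = quantize_layer(table, layer_names[i]);
        if (ret != 0)
            return ret;
    }
    return 0;
}
```

With this shape, a main function could return the result of `quantize_all` directly, so a missing scale would surface as a non-zero exit code.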

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fe827598da

Comment thread tools/quantize/ncnn2int8.cpp
Contributor

Copilot AI left a comment

Pull request overview

This PR extends the int8 calibration-table workflow so that static weight scale generation for non-convolution layers (Embed, MultiHeadAttention, RNN/LSTM/GRU) is produced by ncnn2table and then consumed by ncnn2int8, while also allowing ncnn2table to run without a calibration dataset when only static weight scales are needed.

Changes:

  • Add static weight scale generation + table serialization for Embed, MultiHeadAttention, RNN, LSTM, and GRU in ncnn2table.
  • Update ncnn2int8 to read these weight scales from the calibration table (instead of recomputing).
  • Update documentation and ncnn2table CLI parsing to make the calibration dataset optional for models without conv/activation calibration needs.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Files:

  • tools/quantize/ncnn2table.cpp: Detect Embed/MHA/RNN/LSTM/GRU layers, generate and save their weight scales, and make dataset arguments optional when activation calibration isn’t needed.
  • tools/quantize/ncnn2int8.cpp: Switch recurrent/attention/embed weight quantization to consume per-layer scale entries from the table.
  • docs/how-to-use-and-FAQ/quantized-int8-inference.md: Document the dataset-less table generation flow for static-weight-only models.

Comment thread tools/quantize/ncnn2table.cpp Outdated
Comment thread tools/quantize/ncnn2table.cpp Outdated
Comment thread docs/how-to-use-and-FAQ/quantized-int8-inference.md Outdated
@tencent-adm
Member

CLA assistant check
Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ Roundaboutt
❌ nihui

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (4)

tools/quantize/ncnn2int8.cpp:421

  • Same as above: the missing-scale error message is too generic. Please include layer name/type and the specific missing key (param_0/param_1) so users can fix or regenerate the calibration table correctly.
        char key_xc[256];
        snprintf(key_xc, 256, "%s_param_0", layers[i]->name.c_str());
        std::map<std::string, ncnn::Mat>::iterator iter_xc = weight_int8scale_table.find(key_xc);
        if (iter_xc == weight_int8scale_table.end())
        {
            fprintf(stderr, "this layer need to be quantized, but no scale param!\n");
            return -1;
        }

tools/quantize/ncnn2int8.cpp:503

  • Same as above: the missing-scale error message is too generic. Please include layer name/type and the specific missing key (param_0/param_1) so users can fix or regenerate the calibration table correctly.
        char key_xc[256];
        snprintf(key_xc, 256, "%s_param_0", layers[i]->name.c_str());
        std::map<std::string, ncnn::Mat>::iterator iter_xc = weight_int8scale_table.find(key_xc);
        if (iter_xc == weight_int8scale_table.end())
        {
            fprintf(stderr, "this layer need to be quantized, but no scale param!\n");
            return -1;
        }

        char key_hc[256];
        snprintf(key_hc, 256, "%s_param_1", layers[i]->name.c_str());
        std::map<std::string, ncnn::Mat>::iterator iter_hc = weight_int8scale_table.find(key_hc);
        if (iter_hc == weight_int8scale_table.end())
        {
            fprintf(stderr, "this layer need to be quantized, but no scale param!\n");
            return -1;
        }

tools/quantize/ncnn2int8.cpp:567

  • Same as above: the missing-scale error message is too generic. Please include the layer name/type and expected key so users can determine which table entry is required.
        char key[256];
        snprintf(key, 256, "%s_param_0", layers[i]->name.c_str());
        std::map<std::string, ncnn::Mat>::iterator iter = weight_int8scale_table.find(key);
        if (iter == weight_int8scale_table.end())
        {
            fprintf(stderr, "this layer need to be quantized, but no scale param!\n");
            return -1;
        }

tools/quantize/ncnn2int8.cpp:721

  • Same as above: the missing-scale error message is too generic. Please include the layer name/type and expected key (param_0..param_3) so users can determine which entry is missing from the calibration table.
        char key_q[256];
        snprintf(key_q, 256, "%s_param_0", layers[i]->name.c_str());
        std::map<std::string, ncnn::Mat>::iterator iter_q = weight_int8scale_table.find(key_q);
        if (iter_q == weight_int8scale_table.end())
        {
            fprintf(stderr, "this layer need to be quantized, but no scale param!\n");
            return -1;
        }

Comment thread tools/quantize/ncnn2int8.cpp
Comment thread docs/how-to-use-and-FAQ/quantized-int8-inference.md
Comment thread tools/quantize/ncnn2table.cpp
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

tools/quantize/ncnn2int8.cpp:747

  • Same as above: missing-scale failures in the MultiHeadAttention path should report which key is absent (q/k/v/out) and the layer name, rather than the generic "no scale param" message, to help users regenerate/fix their table quickly.
        char key_q[256];
        snprintf(key_q, 256, "%s_param_0", layers[i]->name.c_str());
        std::map<std::string, ncnn::Mat>::iterator iter_q = weight_int8scale_table.find(key_q);
        if (iter_q == weight_int8scale_table.end())
        {
            fprintf(stderr, "this layer need to be quantized, but no scale param!\n");
            return -1;
        }

        char key_k[256];
        snprintf(key_k, 256, "%s_param_1", layers[i]->name.c_str());
        std::map<std::string, ncnn::Mat>::iterator iter_k = weight_int8scale_table.find(key_k);
        if (iter_k == weight_int8scale_table.end())
        {
            fprintf(stderr, "this layer need to be quantized, but no scale param!\n");
            return -1;
        }

        char key_v[256];
        snprintf(key_v, 256, "%s_param_2", layers[i]->name.c_str());
        std::map<std::string, ncnn::Mat>::iterator iter_v = weight_int8scale_table.find(key_v);
        if (iter_v == weight_int8scale_table.end())
        {
            fprintf(stderr, "this layer need to be quantized, but no scale param!\n");
            return -1;
        }

        char key_out[256];
        snprintf(key_out, 256, "%s_param_3", layers[i]->name.c_str());
        std::map<std::string, ncnn::Mat>::iterator iter_out = weight_int8scale_table.find(key_out);
        if (iter_out == weight_int8scale_table.end())
        {
            fprintf(stderr, "this layer need to be quantized, but no scale param!\n");
            return -1;
        }
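
One way to address the suggestion above is a small lookup helper that names the layer and the missing key. This is a sketch under assumptions: `find_scale` and `ScaleTable` are hypothetical, `std::vector<float>` stands in for `ncnn::Mat`, and the merged code may handle this differently.

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the weight scale table; the real code stores ncnn::Mat values.
typedef std::map<std::string, std::vector<float> > ScaleTable;

// Looks up "<layer_name>_param_<param_index>" in the table.
// Returns a pointer to the entry, or nullptr after printing which key is missing.
const std::vector<float>* find_scale(const ScaleTable& table, const std::string& layer_type,
                                     const std::string& layer_name, int param_index)
{
    const std::string key = layer_name + "_param_" + std::to_string(param_index);
    ScaleTable::const_iterator it = table.find(key);
    if (it == table.end())
    {
        fprintf(stderr, "%s layer %s needs to be quantized, but table entry %s is missing\n",
                layer_type.c_str(), layer_name.c_str(), key.c_str());
        return nullptr;
    }
    return &it->second;
}
```

For a MultiHeadAttention layer, the four lookups (param_0 through param_3) would then collapse into a loop over `param_index`, with each failure reporting its own key.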

Comment thread docs/how-to-use-and-FAQ/quantized-int8-inference.md
Comment thread tools/quantize/ncnn2int8.cpp

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7a11ec27a8

Comment thread tools/quantize/ncnn2table.cpp
@nihui nihui merged commit 3724d10 into Tencent:master May 18, 2026
26 of 27 checks passed
@nihui
Member

nihui commented May 18, 2026

Thanks for your contribution !
