Commit c267f8e
Optimize MatMulNBits 2-bit + float zero_point CPU dequantization with multi-threaded kernel (#28589)
### Description
Replace the naive single-threaded scalar loop that dequantizes 2-bit
weights with float/MLFloat16 zero points by a multi-threaded kernel
(`DequantizeBlockwise2Bits`) that:
- **Parallelizes via `TrySimpleParallelFor`** — distributes work across
all intra-op threads (previously single-threaded)
- **Processes 16 elements per iteration** — one `uint32_t` = 16 packed
2-bit values, reducing per-element overhead
- **Hoists scale/zp lookups** — all 16 elements share a block, so scale
and zero_point are loaded once per batch
Follows the same threading pattern as the existing 4-bit
`DequantizeBlockwise` path for consistency.
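
The three points above can be illustrated with a minimal, standalone sketch. It is not the code from this PR: the names `Dequantize16` and `DequantizeRange` are hypothetical, the flat word-indexed layout is a simplification, and both the bit order (lowest bits first) and the formula `(q - zero_point) * scale` are assumptions. `DequantizeRange` stands in for the unit of work that a parallel-for would hand to each thread; because every word writes a disjoint slice of the output, ranges can run concurrently without synchronization.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: unpack one 32-bit word into 16 dequantized floats.
// scale and zero_point arrive already loaded, so they are read once per word.
inline void Dequantize16(std::uint32_t packed, float scale, float zero_point, float* out) {
  for (int i = 0; i < 16; ++i) {
    // Assumption: element i lives in bits [2i, 2i+1] of the word.
    const std::uint32_t q = (packed >> (2 * i)) & 0x3u;
    out[i] = (static_cast<float>(q) - zero_point) * scale;
  }
}

// Hypothetical per-thread work item: dequantize words [word_begin, word_end).
// scales and zero_points hold one entry per quantization block.
void DequantizeRange(const std::uint32_t* packed, const float* scales,
                     const float* zero_points, float* out,
                     std::size_t block_size, std::size_t word_begin,
                     std::size_t word_end) {
  // All 16 values of a word share a block only if block_size is a multiple of 16.
  assert(block_size >= 16 && block_size % 16 == 0);
  const std::size_t words_per_block = block_size / 16;
  for (std::size_t w = word_begin; w < word_end; ++w) {
    const std::size_t block = w / words_per_block;
    const float scale = scales[block];     // hoisted: one load per 16 elements
    const float zp = zero_points[block];   // hoisted: one load per 16 elements
    Dequantize16(packed[w], scale, zp, out + 16 * w);
  }
}
```

A caller would split the word range `[0, n_words)` into contiguous chunks and invoke `DequantizeRange` once per chunk from its thread pool.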
**Files changed:**
- `matmul_nbits_impl.h` — declare `DequantizeBlockwise2Bits`
- `matmul_nbits_impl.cc` — implement `Dequantize2BitsKernel` +
`DequantizeBlockwise2Bits` with instantiations for `<float,float>` and
`<float,MLFloat16>`
- `matmul_nbits.cc` — replace naive loops in both `MatMulNBits<float>`
and `MatMulNBits<MLFloat16>` `ComputeBUnpacked`
### Motivation and Context
The `bits=2` + float zero_point path (added in #28354) was flagged with
`// !!!!!!!!!!!!!! naive implementation, need to be optimized
!!!!!!!!!!!!!!`. It ran ~20× slower than the `bits=4` MLAS path because
it was a tight scalar `for n × for k` loop with no threading — the
entire N×K dequantization ran on a single core before calling
`MlasGemmBatch`. With 8 intra-op threads this should recover most of
that gap.
### Benchmark Results
Tested on a 96-core x86_64 Linux machine, ORT 1.27.0 CPU Release build,
using typical LLM matrix shapes with `block_size=128` and float zero
points.
#### Multi-thread speedup (2-bit dequantization, 1 thread → 8 threads)
| Shape (M×K×N) | 1-thread (ms) | 8-thread (ms) | Speedup |
|---|---|---|---|
| 1×4096×4096 | 41.0 | 8.5 | **4.84×** |
| 32×4096×4096 | 47.9 | 8.8 | **5.46×** |
| 1×4096×11008 | 120.7 | 24.2 | **4.99×** |
| 32×4096×11008 | 146.8 | 28.2 | **5.21×** |
| 1×11008×4096 | 119.2 | 24.5 | **4.87×** |
| 32×11008×4096 | 154.4 | 28.2 | **5.47×** |
| 1×1024×1024 | 1.18 | 0.16 | **7.61×** |
#### 2-bit vs 4-bit comparison (ratio = 2-bit / 4-bit; <1.0 means 2-bit is faster)
| Shape (M×K×N) | Threads | 4-bit (ms) | 2-bit (ms) | Ratio |
|---|---|---|---|---|
| 1×4096×4096 | 1 | 52.0 | 41.0 | **0.79×** |
| 1×4096×4096 | 8 | 9.4 | 8.5 | **0.90×** |
| 1×4096×11008 | 1 | 141.6 | 120.7 | **0.85×** |
| 1×4096×11008 | 8 | 26.8 | 24.2 | **0.90×** |
| 1×11008×4096 | 1 | 141.2 | 119.2 | **0.84×** |
| 1×11008×4096 | 8 | 26.6 | 24.5 | **0.92×** |
| 32×4096×4096 | 1 | 56.1 | 47.9 | **0.85×** |
| 32×4096×4096 | 8 | 9.6 | 8.8 | **0.92×** |
| 1×1024×1024 | 1 | 1.66 | 1.18 | **0.71×** |
**Key findings:**
- Multi-threading delivers **4.8–7.6× speedup** with 8 threads across
all LLM shapes
- 2-bit now takes **8–29% less time** than 4-bit (ratio 0.71–0.92×),
likely due to fewer bytes read from memory
- The original ~20× regression (issue #28552) is fully resolved
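
The "fewer bytes read" point can be checked with simple arithmetic. The sketch below uses a hypothetical helper, `PackedWeightBytes`, and counts only the packed weights (scales and zero points are ignored). For a 4096×4096 weight matrix, the packed input is 4 MiB at 2 bits versus 8 MiB at 4 bits. One plausible reading of why the measured ratio stays well above 0.5 is that the dequantized float output has the same size in both cases, so halving the input does not halve total memory traffic.

```cpp
#include <cstdint>

// Hypothetical helper: size in bytes of K x N weights packed at `bits` bits each.
constexpr std::uint64_t PackedWeightBytes(std::uint64_t k, std::uint64_t n, unsigned bits) {
  return k * n * bits / 8;
}

// 4096 x 4096 weights: 4 MiB at 2 bits, 8 MiB at 4 bits.
static_assert(PackedWeightBytes(4096, 4096, 2) == 4194304, "2-bit packed size");
static_assert(PackedWeightBytes(4096, 4096, 4) == 8388608, "4-bit packed size");
```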
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>

1 parent c5afcc5 commit c267f8e
4 files changed
Lines changed: 502 additions & 43 deletions
File tree
- onnxruntime
- contrib_ops/cpu/quantization
- test/python/quantization
Lines changed: 22 additions & 38 deletions
Lines changed: 146 additions & 5 deletions
Lines changed: 14 additions & 0 deletions