Commit 215f643
Preserve weight dtype for LAQ amax and per-tensor scales
- StaticBlockScaleQuantizer.enable_laq no longer forces float32 on
_amax_pre, _amax_post, and _per_tensor_scale buffers/parameters;
they now inherit the dtype of the passed tensors.
- laq() calibration casts amax and per_tensor_scale to the weight
dtype before calling enable_laq so the quantizer matches module
precision (bf16/fp16) instead of silently upcasting to fp32.
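The change described above can be sketched as follows. This is a minimal illustration under assumed, simplified signatures; the real `StaticBlockScaleQuantizer.enable_laq` and `laq()` have more parameters and logic than shown here.

```python
# Sketch of the dtype-preservation fix (assumed simplified signatures).
import torch
import torch.nn as nn


class StaticBlockScaleQuantizer(nn.Module):
    def enable_laq(self, amax_pre: torch.Tensor, amax_post: torch.Tensor,
                   per_tensor_scale: torch.Tensor) -> None:
        # Before the fix, the buffers were forced to float32, e.g.
        #   self.register_buffer("_amax_pre", amax_pre.float())
        # After the fix, they inherit the dtype of the passed tensors.
        self.register_buffer("_amax_pre", amax_pre)
        self.register_buffer("_amax_post", amax_post)
        self.register_buffer("_per_tensor_scale", per_tensor_scale)


def laq(quantizer: StaticBlockScaleQuantizer, weight: torch.Tensor,
        amax_pre: torch.Tensor, amax_post: torch.Tensor,
        per_tensor_scale: torch.Tensor) -> None:
    # Calibration casts the stats to the weight dtype before enable_laq,
    # so a bf16/fp16 module does not silently gain fp32 buffers.
    dtype = weight.dtype
    quantizer.enable_laq(amax_pre.to(dtype), amax_post.to(dtype),
                         per_tensor_scale.to(dtype))


weight = torch.randn(8, 8, dtype=torch.bfloat16)
q = StaticBlockScaleQuantizer()
laq(q, weight, torch.tensor(3.0), torch.tensor(3.0), torch.tensor(0.5))
assert q._amax_pre.dtype == weight.dtype  # bf16, matching module precision
```

With the old float32-forcing behavior, `q._amax_pre.dtype` would have been `torch.float32` regardless of the module's precision.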
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: realAsma <akuriparambi@nvidia.com>

1 parent 8866b80
2 files changed: 9 additions & 5 deletions
(diff content not captured; file 1 adds 4 lines at 1821–1824)
File 2: 5 additions & 5 deletions
(diff content not captured; file 2 replaces lines 1458, 1460, 1464, 1466, and 1469)