
GPTQ vector #1223

Merged

sugunav14 merged 9 commits into main from svelury/gptq-vector on Apr 17, 2026

Conversation

@sugunav14 (Contributor) commented Apr 9, 2026

What does this PR do?

Type of change: ?

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • New Features

    • Added backend-specific GPTQ helper registration to allow backend-tailored GPTQ behavior.
  • Bug Fixes

    • Prevented KV-cache state from leaking across repeated per-layer forwards during calibration.
  • Tests

    • Added GPU-focused tests validating GPTQ combined with vector quantization, including accuracy and end-to-end comparisons.


copy-pr-bot bot commented Apr 9, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.


coderabbitai bot commented Apr 9, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a GPTQ helper registry, selects helper by backend in GPTQ, prevents KV-cache leakage during per-layer sequential calibration, inserts a whitespace edit in an example script, and adds a new GPU test module implementing and validating GPTQ + vector quantization.

Changes

  • GPTQ Helper Registry (modelopt/torch/quantization/utils/calib_utils.py): Introduce _GPTQ_HELPER_REGISTRY: dict[str, type[GPTQHelper]] and register_gptq_helper(backend: str, factory: type[GPTQHelper]) -> None to allow registering backend-specific GPTQ helper classes.
  • GPTQ selection & KV-cache handling (modelopt/torch/quantization/model_calib.py): In gptq(), choose the helper class from _GPTQ_HELPER_REGISTRY keyed by m.weight_quantizer.backend, defaulting to GPTQHelper when absent. In sequential_calibrate, clone per-replay kwargs_input and clear past_key_values (via reset() if supported, or by setting it to None) before layer forwards to avoid KV-cache carryover.
  • Example whitespace edit (examples/llm_ptq/hf_ptq.py): Non-functional whitespace change: a blank line added before a get_dataset_dataloader(...) call (no semantic change).
  • GPU tests: GPTQ + VQ (tests/gpu/torch/quantization/test_gptq_vq.py): Add a comprehensive GPU test module with reference GPTQ + vector-quantization implementations, utilities for codebook and Hessian generation, and three tests comparing VQ paths, blockwise updates, and end-to-end quantization behavior.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

  • Title check (⚠️ Warning): The title 'GPTQ vector' is vague and does not convey the actual changes. The PR implements GPTQ weight quantization with vector quantization (VQ), KV-cache handling, a backend-specific helper registry, and comprehensive tests. Resolution: use a more descriptive title such as 'Add GPTQ with vector quantization support and KV-cache management' or 'Implement GPTQ vector quantization with configurable backends'.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 71.43%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Security Anti-Patterns (✅ Passed): PR contains no security anti-patterns as defined in SECURITY.md, including unsafe torch.load, numpy.load, hardcoded trust_remote_code, dangerous eval/exec calls, or problematic dependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.


github-actions bot commented Apr 9, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-04-17 22:55 UTC


codecov bot commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 82.35294% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.55%. Comparing base (4e33368) to head (f2f2825).
⚠️ Report is 2 commits behind head on main.

Files with missing lines (patch % / lines missing):
  • modelopt/torch/quantization/model_calib.py: 85.71%, 2 missing ⚠️
  • modelopt/torch/quantization/utils/calib_utils.py: 66.66%, 1 missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1223      +/-   ##
==========================================
+ Coverage   72.74%   76.55%   +3.81%     
==========================================
  Files         459      459              
  Lines       48611    48626      +15     
==========================================
+ Hits        35361    37227    +1866     
+ Misses      13250    11399    -1851     
Flags (coverage Δ):
  • examples: 41.36% <17.64%> (+1.92%) ⬆️
  • gpu: 59.97% <82.35%> (+7.77%) ⬆️
  • unit: 52.21% <23.52%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@sugunav14 force-pushed the svelury/gptq-vector branch from 6e9b1d1 to 7a29dc0 on April 14, 2026 at 22:54
@sugunav14 marked this pull request as ready for review on April 14, 2026 at 23:28
@sugunav14 requested review from a team as code owners on April 14, 2026 at 23:28
@coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (4)
examples/llm_ptq/hf_ptq.py (2)

1075-1082: Same inconsistency: error message doesn't mention mtq attribute fallback.

For consistency, the error message should mention both supported sources.

♻️ Suggested fix
             else:
                 raise AssertionError(
-                    f"Unsupported quantization format: {args.qformat}, choices are: {list(QUANT_CFG_CHOICES.keys())}"
+                    f"Unsupported quantization format: {args.qformat}, choices are: {list(QUANT_CFG_CHOICES.keys())} or any mtq attribute"
                 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/hf_ptq.py` around lines 1075 - 1082, The AssertionError
message for unsupported quantization format only lists QUANT_CFG_CHOICES keys
but omits the mtq attribute fallback; update the error message in the block
handling args.qformat (where QUANT_CFG_CHOICES, mtq, getattr(mtq, args.qformat),
and quant_cfg are used) to mention both valid keys from QUANT_CFG_CHOICES and
that additional formats may be available as attributes on mtq (or list mtq
attributes if desirable) so the error clearly indicates both supported sources.

423-431: Inconsistent error message between low-memory mode and standard mode.

The error message in low-memory mode (line 429-431) only mentions QUANT_CFG_CHOICES.keys() but doesn't mention that mtq attributes are also supported (which lines 425-426 check for). Consider aligning the error message with the actual supported formats.

♻️ Suggested fix for consistent error message
         else:
             raise AssertionError(
                 f"Quantization format is not supported for low memory mode. "
-                f"Supported formats: {QUANT_CFG_CHOICES.keys()}"
+                f"Supported formats: {list(QUANT_CFG_CHOICES.keys())} or any mtq attribute"
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/hf_ptq.py` around lines 423 - 431, The error message for the
low-memory branch (the block that checks args.qformat against QUANT_CFG_CHOICES
and mtq via hasattr/getattr and sets quant_cfg) is inconsistent: update the
AssertionError text to reflect both supported sources (keys of QUANT_CFG_CHOICES
and supported attributes on mtq) — e.g., mention QUANT_CFG_CHOICES.keys() and
the list of mtq attribute names or a phrase like "or available mtq formats" — so
the message accurately describes what formats are accepted when raising the
AssertionError for args.qformat.
tests/gpu/torch/quantization/test_gptq_vq.py (1)

307-316: Consider adding test cases for edge cases and error conditions.

The current tests cover the happy path with specific dimensions. Consider adding tests for:

  • Misaligned dimensions (e.g., in_features not divisible by vector_size)
  • Invalid configurations that should raise assertions
  • Different block size combinations

This would improve confidence in the assertion checks added in the implementation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gpu/torch/quantization/test_gptq_vq.py` around lines 307 - 316, Add
unit tests in test_gptq_vq.py that exercise edge and error cases using the
existing constants (SEED, IN_FEATURES, OUT_FEATURES, VECTOR_SIZE,
GPTQ_BLOCK_SIZE, QUANT_BLOCK_SIZE, N_CODEWORDS, PERC_DAMP, SCALE_TYPE) by: 1)
creating a case where IN_FEATURES is not divisible by VECTOR_SIZE and asserting
the code raises the appropriate exception/AssertionError; 2) creating invalid
config combinations (e.g., VECTOR_SIZE <= 0, N_CODEWORDS out of range, PERC_DAMP
negative) and asserting they raise errors; and 3) parametrizing tests to run
across multiple GPTQ_BLOCK_SIZE/QUANT_BLOCK_SIZE combinations (including
mismatched sizes) to validate assertion checks and expected failures.
modelopt/torch/quantization/utils/calib_utils.py (1)

50-72: Adding explicit weights_only=True to torch.load would improve code clarity.

The torch.load on line 65 is already safe because the project requires PyTorch 2.8+, which defaults weights_only=True. However, making the parameter explicit improves readability and documents intent, since the codebook file only contains tensor data.

Optional improvement
     if "sorted" in encode_format:
-        cb = torch.load(encode_path + encode_format + ".pt", map_location="cpu")
+        cb = torch.load(encode_path + encode_format + ".pt", map_location="cpu", weights_only=True)
         codebook = cb["sorted_values"].cuda().float()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/utils/calib_utils.py` around lines 50 - 72, In
load_vector_lut_codebook, make the torch.load call explicit about loading
tensor-only data by adding weights_only=True to the torch.load invocation (the
branch that loads cb = torch.load(...)). Update the torch.load call inside the
load_vector_lut_codebook function so it passes weights_only=True (and keep
map_location="cpu") to clearly document intent that the file contains only
tensor data.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/quantization/utils/calib_utils.py`:
- Around line 258-314: The _blockwise_vector_update method assumes
quant_block_size aligns with vector_size for correct scale indexing; add an
assertion after loading codebook/computing vector_size and quant_block_size to
validate quant_block_size % vector_size == 0 (e.g., in _blockwise_vector_update
right after vector_size = codebook.shape[1]) so the j // quant_block_size scale
lookup in the loop is safe and fails fast with a clear message if misconfigured.

In `@tests/gpu/torch/quantization/test_gptq_vq.py`:
- Around line 320-321: The tests import luts (clip_vector_prescaled,
clip_vector_scalesign_fast) in tests/gpu/torch/quantization/test_gptq_vq.py and
calib_utils.py also imports luts at two locations; add luts as an optional test
dependency in pyproject.toml (e.g., under a [project.optional-dependencies]
group like "dev-test") or modify the test modules to skip when missing by
calling pytest.importorskip("luts") at the top of the failing test functions (or
module-level in test_gptq_vq.py) and similarly protect the imports/usage sites
in modelopt/torch/quantization/utils/calib_utils.py (the two import/usage
locations around the current lines) so runtime failures are avoided when luts is
not installed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 8f2a775c-a4a0-4aaf-aa80-b16f69d1d713

📥 Commits

Reviewing files that changed from the base of the PR and between 73be810 and 7a29dc0.

📒 Files selected for processing (5)
  • examples/llm_ptq/example_utils.py
  • examples/llm_ptq/hf_ptq.py
  • modelopt/torch/quantization/model_calib.py
  • modelopt/torch/quantization/utils/calib_utils.py
  • tests/gpu/torch/quantization/test_gptq_vq.py

Comment on lines +258 to +314
def _blockwise_vector_update(self, block_size):
    """GPTQ blockwise update for vector LUT quantizers.

    Pre-computes scales once, then runs the standard GPTQ 3-loop
    with per-vector-group static quantization via clip_vector_prescaled.
    """
    import torch.nn.functional as F
    from luts import clip_vector_prescaled, clip_vector_scalesign_fast

    codebook, quant_block_size, scale_type = load_vector_lut_codebook(
        self.module.weight_quantizer
    )

    # Get vector size from codebook
    vector_size = codebook.shape[1]

    assert self.weight is not None and self.h_inv is not None
    num_cols = self.weight.shape[1]
    assert block_size % quant_block_size == 0

    # Pre-compute scales once outside the GPTQ loop
    _, scales = clip_vector_scalesign_fast(
        self.weight,
        codebook,
        quant_block_size,
        scale_type,
        scale_algo="max",
        sign_scale=True,
        return_scales=True,
    )
    scales_2d = scales.reshape(self.weight.shape[0], -1)

    w = self.weight.clone()
    h_inv = self.h_inv

    for blk_start in range(0, num_cols, block_size):
        blk_end = min(blk_start + block_size, num_cols)
        errs = torch.zeros_like(w[:, blk_start:blk_end])

        for j in range(blk_start, blk_end, vector_size):
            d = min(vector_size, blk_end - j)
            s = scales_2d[:, j // quant_block_size].contiguous()

            sub = w[:, j : j + d].contiguous()
            if d < vector_size:
                sub = F.pad(sub, (0, vector_size - d))
            q_sub = clip_vector_prescaled(sub, codebook, s)

            for k in range(d):
                col = j + k
                self.weight[:, col] = q_sub[:, k]
                err = (w[:, col] - q_sub[:, k]) / h_inv[col, col]
                errs[:, col - blk_start] = err
                w[:, col:blk_end].addr_(err, h_inv[col, col:blk_end], alpha=-1)

        if blk_end < num_cols:
            w[:, blk_end:] -= errs @ h_inv[blk_start:blk_end, blk_end:]

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if there are any constraints documented between quant_block_size and vector_size
rg -n "quant_block_size.*vector" --type py
rg -n "vector_size.*block" --type py

Repository: NVIDIA/Model-Optimizer

Length of output: 193


🏁 Script executed:

# Find the load_vector_lut_codebook function definition
rg -n "def load_vector_lut_codebook" --type py -A 20

Repository: NVIDIA/Model-Optimizer

Length of output: 1926


🏁 Script executed:

# Check if there are any tests for this functionality
fd ".*test.*\.py" --type f | xargs rg -l "blockwise_vector_update|load_vector_lut_codebook" 2>/dev/null

Repository: NVIDIA/Model-Optimizer

Length of output: 110


🏁 Script executed:

# Look for any documentation or validation of quant_block_size and vector_size
rg -n "vector_size|quant_block_size" modelopt/torch/quantization/utils/calib_utils.py --type py -B 2 -A 2 | head -80

Repository: NVIDIA/Model-Optimizer

Length of output: 1415


🏁 Script executed:

# Examine the test file to understand the dimension relationships
cat tests/gpu/torch/quantization/test_gptq_vq.py | head -150

Repository: NVIDIA/Model-Optimizer

Length of output: 5382


🏁 Script executed:

# Check for any comments or documentation about vector_size and quant_block_size in the entire calib_utils file
cat modelopt/torch/quantization/utils/calib_utils.py | grep -n "vector_size\|quant_block_size" | head -30

Repository: NVIDIA/Model-Optimizer

Length of output: 633


🏁 Script executed:

# Check the complete load_vector_lut_codebook function
sed -n '50,100p' modelopt/torch/quantization/utils/calib_utils.py

Repository: NVIDIA/Model-Optimizer

Length of output: 1929


Add assertion to validate quant_block_size % vector_size == 0.

The scale indexing at line 299 (j // quant_block_size) assumes vector groups align with quantization block boundaries. While line 276 asserts block_size % quant_block_size == 0, there's no explicit check that quant_block_size % vector_size == 0. The reference implementation in the test file (test_gptq_vq.py) validates this constraint explicitly; adding this assertion here ensures correct scale lookups.

Suggested fix
         # Get vector size from codebook
         vector_size = codebook.shape[1]
 
         assert self.weight is not None and self.h_inv is not None
         num_cols = self.weight.shape[1]
         assert block_size % quant_block_size == 0
+        assert quant_block_size % vector_size == 0, (
+            f"quant_block_size ({quant_block_size}) must be divisible by vector_size ({vector_size})"
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/utils/calib_utils.py` around lines 258 - 314, The
_blockwise_vector_update method assumes quant_block_size aligns with vector_size
for correct scale indexing; add an assertion after loading codebook/computing
vector_size and quant_block_size to validate quant_block_size % vector_size == 0
(e.g., in _blockwise_vector_update right after vector_size = codebook.shape[1])
so the j // quant_block_size scale lookup in the loop is safe and fails fast
with a clear message if misconfigured.

Comment thread tests/gpu/torch/quantization/test_gptq_vq.py Outdated

@coderabbitai bot left a comment

♻️ Duplicate comments (1)
modelopt/torch/quantization/utils/calib_utils.py (1)

261-280: ⚠️ Potential issue | 🟠 Major

Add explicit vector/block alignment validation before scale indexing.

Line 280 uses j // quant_block_size for scale lookup. Without a guard that quant_block_size is divisible by vector_size, vector chunks can straddle quant-block boundaries and select incorrect scales.

Suggested fix
         vector_size = quantizer.backend_extra_args["vector_size"]
         quant_block_size = quantizer.backend_extra_args["block_sizes"]

         assert self.weight is not None and self.h_inv is not None
         num_cols = self.weight.shape[1]
         assert block_size % quant_block_size == 0
+        assert quant_block_size % vector_size == 0, (
+            f"quant_block_size ({quant_block_size}) must be divisible by vector_size ({vector_size})"
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/utils/calib_utils.py` around lines 261 - 280, The
code can index quantizer._psx_scales incorrectly when vector chunks cross
quant-block boundaries; add an explicit validation before the blk loop ensuring
vector/block alignment (e.g., assert quant_block_size % vector_size == 0 and
vector_size <= quant_block_size, or raise a clear ValueError) so that the
expression s = quantizer._psx_scales[:, j // quant_block_size] is safe; place
this guard near the uses of vector_size/quant_block_size (around where
quant_block_size = quantizer.backend_extra_args["block_sizes"] and before the
for blk_start... loop) and include a descriptive message referencing vector_size
and quant_block_size.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@modelopt/torch/quantization/utils/calib_utils.py`:
- Around line 261-280: The code can index quantizer._psx_scales incorrectly when
vector chunks cross quant-block boundaries; add an explicit validation before
the blk loop ensuring vector/block alignment (e.g., assert quant_block_size %
vector_size == 0 and vector_size <= quant_block_size, or raise a clear
ValueError) so that the expression s = quantizer._psx_scales[:, j //
quant_block_size] is safe; place this guard near the uses of
vector_size/quant_block_size (around where quant_block_size =
quantizer.backend_extra_args["block_sizes"] and before the for blk_start...
loop) and include a descriptive message referencing vector_size and
quant_block_size.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 48a84d8f-2322-42e2-9fb8-609c5557c778

📥 Commits

Reviewing files that changed from the base of the PR and between 7a29dc0 and 78f0d8a.

📒 Files selected for processing (1)
  • modelopt/torch/quantization/utils/calib_utils.py

@cjluo-nv (Collaborator) left a comment

The PR adds GPTQ support for vector LUT quantization with several bug fixes. While the core algorithm and tests look solid, there are several issues to address:

  1. Bare assert for runtime validation in _blockwise_vector_update — these should be RuntimeError/ValueError since asserts can be optimized away with -O.
  2. Tests depend on undocumented luts module — all 3 tests import from luts which is not a standard package and will fail without it. Tests need skip markers or the dependency needs to be documented.
  3. getattr(mtq, args.qformat) fallback in hf_ptq.py allows arbitrary attribute access — should validate the returned value is actually a valid quant config dict.
  4. Inconsistent dtype -> torch_dtype fix — one instance of dtype="auto" on line ~641 of example_utils.py was not updated.
  5. weight_quantizer.disable() after GPTQ is a broad behavioral change affecting all GPTQ paths — this needs to be explicitly called out as it may affect downstream code that expects the quantizer to still be enabled after GPTQ.
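Points 1 and 2 above could be addressed along these lines. The function name and the specific parameter checks are illustrative assumptions, not the PR's actual code; only the assert-vs-exception concern and the luts skip come from the review.

```python
def validate_vector_config(quant_block_size: int, vector_size: int) -> None:
    # Point 1: raise explicit exceptions instead of bare asserts, which
    # are stripped when Python runs with -O.
    if vector_size <= 0:
        raise ValueError(f"vector_size must be positive, got {vector_size}")
    if quant_block_size % vector_size != 0:
        raise ValueError(
            f"quant_block_size ({quant_block_size}) must be divisible "
            f"by vector_size ({vector_size})"
        )

# Point 2: at the top of test_gptq_vq.py, skip the module when the optional
# dependency is missing instead of failing at import time:
#     luts = pytest.importorskip("luts")
```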

Comment thread modelopt/torch/quantization/utils/calib_utils.py Outdated
Comment thread modelopt/torch/quantization/utils/calib_utils.py
Comment thread modelopt/torch/quantization/model_calib.py Outdated
Comment thread tests/gpu/torch/quantization/test_gptq_vq.py Outdated
Comment thread tests/gpu/torch/quantization/test_gptq_vq.py Outdated
Comment thread examples/llm_ptq/hf_ptq.py Outdated
Comment thread examples/llm_ptq/example_utils.py Outdated
Comment thread modelopt/torch/quantization/utils/calib_utils.py Outdated
@cjluo-nv (Collaborator) left a comment

This is a small, focused PR (+39 -4, 3 files) that adds two clean changes:

  1. KV cache reset in sequential calibration (model_calib.py): Prevents KV cache from leaking across forward replays during GPTQ's multi-pass calibration. The implementation correctly handles both DynamicCache (.reset()) and older cache formats (set to None), and properly shallow-copies kwargs_input to avoid mutating captured references.

  2. Backend-specific GPTQ helper registry (calib_utils.py + model_calib.py): Adds _GPTQ_HELPER_REGISTRY and register_gptq_helper() to allow backend-specific GPTQ subclasses to be registered and dispatched. The lookup in _make_gptq_handle correctly falls back to the default GPTQHelper when no backend is set or no custom helper is registered. The backend attribute access via getattr(m.weight_quantizer, "backend", None) is correct per the TensorQuantizer implementation.

Most previous review comments were from an earlier, larger iteration of the PR that included vector LUT implementation, test files, and example changes. Those files have been removed from this diff, making the previous critical comments no longer applicable. The current diff is infrastructure-only, setting up the extension point for backend-specific GPTQ helpers (presumably to be used by the vector LUT GPTQ in a follow-up).

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
@sugunav14 force-pushed the svelury/gptq-vector branch from 7f8bad7 to f2f2825 on April 17, 2026 at 18:39
@sugunav14 merged commit dc7ad66 into main on Apr 17, 2026; 45 checks passed
@sugunav14 deleted the svelury/gptq-vector branch on April 17, 2026 at 22:54
