
GPTQ vector #1223

Merged

sugunav14 merged 9 commits into main from svelury/gptq-vector on Apr 17, 2026

Conversation

@sugunav14 (Contributor) commented Apr 9, 2026

What does this PR do?

Type of change: ?

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • New Features

    • Added backend-specific GPTQ helper registration to allow backend-tailored GPTQ behavior.
  • Bug Fixes

    • Prevented KV-cache state from leaking across repeated per-layer forwards during calibration.
  • Tests

    • Added GPU-focused tests validating GPTQ combined with vector quantization, including accuracy and end-to-end comparisons.


copy-pr-bot bot commented Apr 9, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.


coderabbitai bot commented Apr 9, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a GPTQ helper registry, selects helper by backend in GPTQ, prevents KV-cache leakage during per-layer sequential calibration, inserts a whitespace edit in an example script, and adds a new GPU test module implementing and validating GPTQ + vector quantization.

Changes

  • GPTQ Helper Registry (modelopt/torch/quantization/utils/calib_utils.py): Introduce _GPTQ_HELPER_REGISTRY: dict[str, type[GPTQHelper]] and register_gptq_helper(backend: str, factory: type[GPTQHelper]) -> None to allow registering backend-specific GPTQ helper classes.
  • GPTQ selection & KV-cache handling (modelopt/torch/quantization/model_calib.py): In gptq(), choose the helper class from _GPTQ_HELPER_REGISTRY keyed by m.weight_quantizer.backend, defaulting to GPTQHelper when absent. In sequential_calibrate, clone per-replay kwargs_input and clear past_key_values (via reset() if supported, or by setting it to None) before layer forwards to avoid KV-cache carryover.
  • Example whitespace edit (examples/llm_ptq/hf_ptq.py): Non-functional whitespace change: a blank line added before a get_dataset_dataloader(...) call (no semantic change).
  • GPU tests: GPTQ + VQ (tests/gpu/torch/quantization/test_gptq_vq.py): Add a comprehensive GPU test module with reference GPTQ + vector-quantization implementations, utilities for codebook and Hessian generation, and three tests comparing VQ paths, blockwise updates, and end-to-end quantization behavior.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

  • Title check (⚠️ Warning): The title 'GPTQ vector' is vague and does not convey the actual changes. The PR implements GPTQ weight quantization with vector quantization (VQ), KV-cache handling, a backend-specific helper registry, and comprehensive tests. Resolution: use a more descriptive title such as 'Add GPTQ with vector quantization support and KV-cache management' or 'Implement GPTQ vector quantization with configurable backends'.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 71.43%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Security Anti-Patterns (✅ Passed): PR contains no security anti-patterns as defined in SECURITY.md, including unsafe torch.load, numpy.load, hardcoded trust_remote_code, dangerous eval/exec calls, or problematic dependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.


github-actions bot commented Apr 9, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-04-17 22:55 UTC


codecov bot commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 82.35294% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.55%. Comparing base (4e33368) to head (f2f2825).
⚠️ Report is 2 commits behind head on main.

Files with missing lines (patch % / lines missing):
  • modelopt/torch/quantization/model_calib.py: 85.71%, 2 missing ⚠️
  • modelopt/torch/quantization/utils/calib_utils.py: 66.66%, 1 missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1223      +/-   ##
==========================================
+ Coverage   72.74%   76.55%   +3.81%     
==========================================
  Files         459      459              
  Lines       48611    48626      +15     
==========================================
+ Hits        35361    37227    +1866     
+ Misses      13250    11399    -1851     
Flags (coverage Δ):
  • examples: 41.36% <17.64%> (+1.92%) ⬆️
  • gpu: 59.97% <82.35%> (+7.77%) ⬆️
  • unit: 52.21% <23.52%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@sugunav14 force-pushed the svelury/gptq-vector branch from 6e9b1d1 to 7a29dc0 on April 14, 2026 at 22:54
@sugunav14 marked this pull request as ready for review on April 14, 2026 at 23:28
@sugunav14 requested review from a team as code owners on April 14, 2026 at 23:28
@coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (4)
examples/llm_ptq/hf_ptq.py (2)

1075-1082: Same inconsistency: error message doesn't mention mtq attribute fallback.

For consistency, the error message should mention both supported sources.

♻️ Suggested fix
             else:
                 raise AssertionError(
-                    f"Unsupported quantization format: {args.qformat}, choices are: {list(QUANT_CFG_CHOICES.keys())}"
+                    f"Unsupported quantization format: {args.qformat}, choices are: {list(QUANT_CFG_CHOICES.keys())} or any mtq attribute"
                 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/hf_ptq.py` around lines 1075 - 1082, The AssertionError
message for unsupported quantization format only lists QUANT_CFG_CHOICES keys
but omits the mtq attribute fallback; update the error message in the block
handling args.qformat (where QUANT_CFG_CHOICES, mtq, getattr(mtq, args.qformat),
and quant_cfg are used) to mention both valid keys from QUANT_CFG_CHOICES and
that additional formats may be available as attributes on mtq (or list mtq
attributes if desirable) so the error clearly indicates both supported sources.

423-431: Inconsistent error message between low-memory mode and standard mode.

The error message in low-memory mode (line 429-431) only mentions QUANT_CFG_CHOICES.keys() but doesn't mention that mtq attributes are also supported (which lines 425-426 check for). Consider aligning the error message with the actual supported formats.

♻️ Suggested fix for consistent error message
         else:
             raise AssertionError(
                 f"Quantization format is not supported for low memory mode. "
-                f"Supported formats: {QUANT_CFG_CHOICES.keys()}"
+                f"Supported formats: {list(QUANT_CFG_CHOICES.keys())} or any mtq attribute"
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/hf_ptq.py` around lines 423 - 431, The error message for the
low-memory branch (the block that checks args.qformat against QUANT_CFG_CHOICES
and mtq via hasattr/getattr and sets quant_cfg) is inconsistent: update the
AssertionError text to reflect both supported sources (keys of QUANT_CFG_CHOICES
and supported attributes on mtq) — e.g., mention QUANT_CFG_CHOICES.keys() and
the list of mtq attribute names or a phrase like "or available mtq formats" — so
the message accurately describes what formats are accepted when raising the
AssertionError for args.qformat.
tests/gpu/torch/quantization/test_gptq_vq.py (1)

307-316: Consider adding test cases for edge cases and error conditions.

The current tests cover the happy path with specific dimensions. Consider adding tests for:

  • Misaligned dimensions (e.g., in_features not divisible by vector_size)
  • Invalid configurations that should raise assertions
  • Different block size combinations

This would improve confidence in the assertion checks added in the implementation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gpu/torch/quantization/test_gptq_vq.py` around lines 307 - 316, Add
unit tests in test_gptq_vq.py that exercise edge and error cases using the
existing constants (SEED, IN_FEATURES, OUT_FEATURES, VECTOR_SIZE,
GPTQ_BLOCK_SIZE, QUANT_BLOCK_SIZE, N_CODEWORDS, PERC_DAMP, SCALE_TYPE) by: 1)
creating a case where IN_FEATURES is not divisible by VECTOR_SIZE and asserting
the code raises the appropriate exception/AssertionError; 2) creating invalid
config combinations (e.g., VECTOR_SIZE <= 0, N_CODEWORDS out of range, PERC_DAMP
negative) and asserting they raise errors; and 3) parametrizing tests to run
across multiple GPTQ_BLOCK_SIZE/QUANT_BLOCK_SIZE combinations (including
mismatched sizes) to validate assertion checks and expected failures.
modelopt/torch/quantization/utils/calib_utils.py (1)

50-72: Adding explicit weights_only=True to torch.load would improve code clarity.

The torch.load on line 65 is already safe because the project requires PyTorch 2.8+, which defaults weights_only=True. However, making the parameter explicit improves readability and documents intent, since the codebook file only contains tensor data.

Optional improvement
     if "sorted" in encode_format:
-        cb = torch.load(encode_path + encode_format + ".pt", map_location="cpu")
+        cb = torch.load(encode_path + encode_format + ".pt", map_location="cpu", weights_only=True)
         codebook = cb["sorted_values"].cuda().float()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/utils/calib_utils.py` around lines 50 - 72, In
load_vector_lut_codebook, make the torch.load call explicit about loading
tensor-only data by adding weights_only=True to the torch.load invocation (the
branch that loads cb = torch.load(...)). Update the torch.load call inside the
load_vector_lut_codebook function so it passes weights_only=True (and keep
map_location="cpu") to clearly document intent that the file contains only
tensor data.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/quantization/utils/calib_utils.py`:
- Around line 258-314: The _blockwise_vector_update method assumes
quant_block_size aligns with vector_size for correct scale indexing; add an
assertion after loading codebook/computing vector_size and quant_block_size to
validate quant_block_size % vector_size == 0 (e.g., in _blockwise_vector_update
right after vector_size = codebook.shape[1]) so the j // quant_block_size scale
lookup in the loop is safe and fails fast with a clear message if misconfigured.

In `@tests/gpu/torch/quantization/test_gptq_vq.py`:
- Around line 320-321: The tests import luts (clip_vector_prescaled,
clip_vector_scalesign_fast) in tests/gpu/torch/quantization/test_gptq_vq.py and
calib_utils.py also imports luts at two locations; add luts as an optional test
dependency in pyproject.toml (e.g., under a [project.optional-dependencies]
group like "dev-test") or modify the test modules to skip when missing by
calling pytest.importorskip("luts") at the top of the failing test functions (or
module-level in test_gptq_vq.py) and similarly protect the imports/usage sites
in modelopt/torch/quantization/utils/calib_utils.py (the two import/usage
locations around the current lines) so runtime failures are avoided when luts is
not installed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 8f2a775c-a4a0-4aaf-aa80-b16f69d1d713

📥 Commits

Reviewing files that changed from the base of the PR and between 73be810 and 7a29dc0.

📒 Files selected for processing (5)
  • examples/llm_ptq/example_utils.py
  • examples/llm_ptq/hf_ptq.py
  • modelopt/torch/quantization/model_calib.py
  • modelopt/torch/quantization/utils/calib_utils.py
  • tests/gpu/torch/quantization/test_gptq_vq.py

Comment on lines +258 to +314
def _blockwise_vector_update(self, block_size):
    """GPTQ blockwise update for vector LUT quantizers.

    Pre-computes scales once, then runs the standard GPTQ 3-loop
    with per-vector-group static quantization via clip_vector_prescaled.
    """
    import torch.nn.functional as F
    from luts import clip_vector_prescaled, clip_vector_scalesign_fast

    codebook, quant_block_size, scale_type = load_vector_lut_codebook(
        self.module.weight_quantizer
    )

    # Get vector size from codebook
    vector_size = codebook.shape[1]

    assert self.weight is not None and self.h_inv is not None
    num_cols = self.weight.shape[1]
    assert block_size % quant_block_size == 0

    # Pre-compute scales once outside the GPTQ loop
    _, scales = clip_vector_scalesign_fast(
        self.weight,
        codebook,
        quant_block_size,
        scale_type,
        scale_algo="max",
        sign_scale=True,
        return_scales=True,
    )
    scales_2d = scales.reshape(self.weight.shape[0], -1)

    w = self.weight.clone()
    h_inv = self.h_inv

    for blk_start in range(0, num_cols, block_size):
        blk_end = min(blk_start + block_size, num_cols)
        errs = torch.zeros_like(w[:, blk_start:blk_end])

        for j in range(blk_start, blk_end, vector_size):
            d = min(vector_size, blk_end - j)
            s = scales_2d[:, j // quant_block_size].contiguous()

            sub = w[:, j : j + d].contiguous()
            if d < vector_size:
                sub = F.pad(sub, (0, vector_size - d))
            q_sub = clip_vector_prescaled(sub, codebook, s)

            for k in range(d):
                col = j + k
                self.weight[:, col] = q_sub[:, k]
                err = (w[:, col] - q_sub[:, k]) / h_inv[col, col]
                errs[:, col - blk_start] = err
                w[:, col:blk_end].addr_(err, h_inv[col, col:blk_end], alpha=-1)

        if blk_end < num_cols:
            w[:, blk_end:] -= errs @ h_inv[blk_start:blk_end, blk_end:]

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if there are any constraints documented between quant_block_size and vector_size
rg -n "quant_block_size.*vector" --type py
rg -n "vector_size.*block" --type py

Repository: NVIDIA/Model-Optimizer

Length of output: 193


🏁 Script executed:

# Find the load_vector_lut_codebook function definition
rg -n "def load_vector_lut_codebook" --type py -A 20

Repository: NVIDIA/Model-Optimizer

Length of output: 1926


🏁 Script executed:

# Check if there are any tests for this functionality
fd ".*test.*\.py" --type f | xargs rg -l "blockwise_vector_update|load_vector_lut_codebook" 2>/dev/null

Repository: NVIDIA/Model-Optimizer

Length of output: 110


🏁 Script executed:

# Look for any documentation or validation of quant_block_size and vector_size
rg -n "vector_size|quant_block_size" modelopt/torch/quantization/utils/calib_utils.py --type py -B 2 -A 2 | head -80

Repository: NVIDIA/Model-Optimizer

Length of output: 1415


🏁 Script executed:

# Examine the test file to understand the dimension relationships
cat tests/gpu/torch/quantization/test_gptq_vq.py | head -150

Repository: NVIDIA/Model-Optimizer

Length of output: 5382


🏁 Script executed:

# Check for any comments or documentation about vector_size and quant_block_size in the entire calib_utils file
cat modelopt/torch/quantization/utils/calib_utils.py | grep -n "vector_size\|quant_block_size" | head -30

Repository: NVIDIA/Model-Optimizer

Length of output: 633


🏁 Script executed:

# Check the complete load_vector_lut_codebook function
sed -n '50,100p' modelopt/torch/quantization/utils/calib_utils.py

Repository: NVIDIA/Model-Optimizer

Length of output: 1929


Add assertion to validate quant_block_size % vector_size == 0.

The scale indexing at line 299 (j // quant_block_size) assumes vector groups align with quantization block boundaries. While line 276 asserts block_size % quant_block_size == 0, there's no explicit check that quant_block_size % vector_size == 0. The reference implementation in the test file (test_gptq_vq.py) validates this constraint explicitly; adding this assertion here ensures correct scale lookups.

Suggested fix
         # Get vector size from codebook
         vector_size = codebook.shape[1]
 
         assert self.weight is not None and self.h_inv is not None
         num_cols = self.weight.shape[1]
         assert block_size % quant_block_size == 0
+        assert quant_block_size % vector_size == 0, (
+            f"quant_block_size ({quant_block_size}) must be divisible by vector_size ({vector_size})"
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/utils/calib_utils.py` around lines 258 - 314, The
_blockwise_vector_update method assumes quant_block_size aligns with vector_size
for correct scale indexing; add an assertion after loading codebook/computing
vector_size and quant_block_size to validate quant_block_size % vector_size == 0
(e.g., in _blockwise_vector_update right after vector_size = codebook.shape[1])
so the j // quant_block_size scale lookup in the loop is safe and fails fast
with a clear message if misconfigured.

Comment thread tests/gpu/torch/quantization/test_gptq_vq.py Outdated

@coderabbitai bot left a comment

♻️ Duplicate comments (1)
modelopt/torch/quantization/utils/calib_utils.py (1)

261-280: ⚠️ Potential issue | 🟠 Major

Add explicit vector/block alignment validation before scale indexing.

Line 280 uses j // quant_block_size for scale lookup. Without a guard that quant_block_size is divisible by vector_size, vector chunks can straddle quant-block boundaries and select incorrect scales.

Suggested fix
         vector_size = quantizer.backend_extra_args["vector_size"]
         quant_block_size = quantizer.backend_extra_args["block_sizes"]

         assert self.weight is not None and self.h_inv is not None
         num_cols = self.weight.shape[1]
         assert block_size % quant_block_size == 0
+        assert quant_block_size % vector_size == 0, (
+            f"quant_block_size ({quant_block_size}) must be divisible by vector_size ({vector_size})"
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/utils/calib_utils.py` around lines 261 - 280, The
code can index quantizer._psx_scales incorrectly when vector chunks cross
quant-block boundaries; add an explicit validation before the blk loop ensuring
vector/block alignment (e.g., assert quant_block_size % vector_size == 0 and
vector_size <= quant_block_size, or raise a clear ValueError) so that the
expression s = quantizer._psx_scales[:, j // quant_block_size] is safe; place
this guard near the uses of vector_size/quant_block_size (around where
quant_block_size = quantizer.backend_extra_args["block_sizes"] and before the
for blk_start... loop) and include a descriptive message referencing vector_size
and quant_block_size.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@modelopt/torch/quantization/utils/calib_utils.py`:
- Around line 261-280: The code can index quantizer._psx_scales incorrectly when
vector chunks cross quant-block boundaries; add an explicit validation before
the blk loop ensuring vector/block alignment (e.g., assert quant_block_size %
vector_size == 0 and vector_size <= quant_block_size, or raise a clear
ValueError) so that the expression s = quantizer._psx_scales[:, j //
quant_block_size] is safe; place this guard near the uses of
vector_size/quant_block_size (around where quant_block_size =
quantizer.backend_extra_args["block_sizes"] and before the for blk_start...
loop) and include a descriptive message referencing vector_size and
quant_block_size.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 48a84d8f-2322-42e2-9fb8-609c5557c778

📥 Commits

Reviewing files that changed from the base of the PR and between 7a29dc0 and 78f0d8a.

📒 Files selected for processing (1)
  • modelopt/torch/quantization/utils/calib_utils.py

@cjluo-nv (Collaborator) left a comment

The PR adds GPTQ support for vector LUT quantization with several bug fixes. While the core algorithm and tests look solid, there are several issues to address:

  1. Bare assert for runtime validation in _blockwise_vector_update — these should be RuntimeError/ValueError since asserts can be optimized away with -O.
  2. Tests depend on undocumented luts module — all 3 tests import from luts which is not a standard package and will fail without it. Tests need skip markers or the dependency needs to be documented.
  3. getattr(mtq, args.qformat) fallback in hf_ptq.py allows arbitrary attribute access — should validate the returned value is actually a valid quant config dict.
  4. Inconsistent dtype -> torch_dtype fix — one instance of dtype="auto" on line ~641 of example_utils.py was not updated.
  5. weight_quantizer.disable() after GPTQ is a broad behavioral change affecting all GPTQ paths — this needs to be explicitly called out as it may affect downstream code that expects the quantizer to still be enabled after GPTQ.
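Points 1 and 2 above could be addressed along these lines. The function name and the specific parameter checks are illustrative assumptions, not the PR's actual code; only the assert-vs-exception concern and the luts skip come from the review.

```python
def validate_vector_config(quant_block_size: int, vector_size: int) -> None:
    # Point 1: raise explicit exceptions instead of bare asserts, which
    # are stripped when Python runs with -O.
    if vector_size <= 0:
        raise ValueError(f"vector_size must be positive, got {vector_size}")
    if quant_block_size % vector_size != 0:
        raise ValueError(
            f"quant_block_size ({quant_block_size}) must be divisible "
            f"by vector_size ({vector_size})"
        )

# Point 2: at the top of test_gptq_vq.py, skip the module when the optional
# dependency is missing instead of failing at import time:
#     luts = pytest.importorskip("luts")
```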

Comment thread modelopt/torch/quantization/utils/calib_utils.py Outdated
Comment thread modelopt/torch/quantization/utils/calib_utils.py
Comment thread modelopt/torch/quantization/model_calib.py Outdated
Comment thread tests/gpu/torch/quantization/test_gptq_vq.py Outdated
Comment thread tests/gpu/torch/quantization/test_gptq_vq.py Outdated
Comment thread examples/llm_ptq/hf_ptq.py Outdated
Comment thread examples/llm_ptq/example_utils.py Outdated
Comment thread modelopt/torch/quantization/utils/calib_utils.py Outdated
@cjluo-nv (Collaborator) left a comment

This is a small, focused PR (+39 -4, 3 files) that adds two clean changes:

  1. KV cache reset in sequential calibration (model_calib.py): Prevents KV cache from leaking across forward replays during GPTQ's multi-pass calibration. The implementation correctly handles both DynamicCache (.reset()) and older cache formats (set to None), and properly shallow-copies kwargs_input to avoid mutating captured references.

  2. Backend-specific GPTQ helper registry (calib_utils.py + model_calib.py): Adds _GPTQ_HELPER_REGISTRY and register_gptq_helper() to allow backend-specific GPTQ subclasses to be registered and dispatched. The lookup in _make_gptq_handle correctly falls back to the default GPTQHelper when no backend is set or no custom helper is registered. The backend attribute access via getattr(m.weight_quantizer, "backend", None) is correct per the TensorQuantizer implementation.

Most previous review comments were from an earlier, larger iteration of the PR that included vector LUT implementation, test files, and example changes. Those files have been removed from this diff, making the previous critical comments no longer applicable. The current diff is infrastructure-only, setting up the extension point for backend-specific GPTQ helpers (presumably to be used by the vector LUT GPTQ in a follow-up).

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
@sugunav14 force-pushed the svelury/gptq-vector branch from 7f8bad7 to f2f2825 on April 17, 2026 at 18:39
@sugunav14 merged commit dc7ad66 into main on Apr 17, 2026; 45 checks passed
@sugunav14 deleted the svelury/gptq-vector branch on April 17, 2026 at 22:54
