Nothing to see here... by Qubitium · Pull Request #2456 · ModelCloud/GPTQModel

Qubitium · 2026-03-09T05:08:09Z

Too many changes to list them all:

major refactor of BaseQuantLinear
add paroquant, gguf, bitsandbytes, exllama v3, and rtn support
rtn partial support was already added for fallback moe, but rtn is now a standalone quant method
add optional dependency on the external bitsandbytes pkg, since the bitsandbytes kernel api is used
gguf and exllama v3 kernels are ported and internal; fast gguf triton kernel added
deprecate non-working exllama_eora kernel (fusing of eora/lora ops with exllama v2 code)
refactor/redesign quantization pipeline
failsafe renamed to fallback
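
For context on the standalone rtn method: rtn is round-to-nearest, a weight-only scheme that needs no calibration data. Below is a minimal sketch of asymmetric RTN over one weight group, assuming a plain min/max range. `rtn_quantize` and `rtn_dequantize` are hypothetical names used for illustration and are not GPTQModel functions; the real implementation in this PR may differ.

```python
from typing import List, Tuple


def rtn_quantize(weights: List[float], bits: int = 4) -> Tuple[List[int], float, int]:
    """Quantize one weight group with round-to-nearest (illustrative only)."""
    qmax = (1 << bits) - 1
    w_min, w_max = min(weights), max(weights)
    # Guard against a constant group, which would otherwise give a zero scale.
    scale = (w_max - w_min) / qmax if w_max > w_min else 1.0
    # The zero point maps w_min to code 0; clamp it into the valid code range.
    zero_point = max(0, min(qmax, round(-w_min / scale)))
    # Round each weight to the nearest grid step, shift, and clamp to [0, qmax].
    codes = [max(0, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def rtn_dequantize(codes: List[int], scale: float, zero_point: int) -> List[float]:
    """Map integer codes back to approximate float weights."""
    return [(q - zero_point) * scale for q in codes]


group = [-0.8, -0.1, 0.0, 0.3, 1.2]
codes, scale, zero_point = rtn_quantize(group, bits=4)
restored = rtn_dequantize(codes, scale, zero_point)
```

Apart from clamping at the range edges, each restored weight lands within half a quantization step of the original.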

Qubitium · 2026-03-10T05:26:18Z

Fused gguf: zen3

  +--------+------------+-------------+----------+---------+
  | device | case       | baseline_ms | fused_ms | speedup |
  +--------+------------+-------------+----------+---------+
  | cpu    | attn q4_k  | 27.397      | 18.865   | 1.45x   |
  | cpu    | attn q5_k  | 28.618      | 22.510   | 1.27x   |
  | cpu    | attn q6_k  | 26.458      | 28.665   | 0.92x   |
  | cpu    | mlp q4_k   | 83.428      | 45.598   | 1.83x   |
  | cpu    | mlp q5_k   | 101.261     | 50.265   | 2.01x   |
  | cpu    | mlp q6_k   | 84.662      | 51.076   | 1.66x   |
  | cuda   | attn q4_k  | 0.778       | 0.652    | 1.19x   |
  | cuda   | attn q5_k  | 0.612       | 0.625    | 0.98x   |
  | cuda   | attn q6_k  | 0.433       | 0.440    | 0.99x   |
  | cuda   | mlp q4_k   | 0.793       | 0.596    | 1.33x   |
  | cuda   | mlp q5_k   | 0.943       | 0.780    | 1.21x   |
  | cuda   | mlp q6_k   | 0.720       | 0.535    | 1.35x   |
  +--------+------------+-------------+----------+---------+

  Autotuned dispatch, shipped defaults, post-warmup steady state:

  +--------+------------+-------+-----------+-------------+---------+
  | device | case       | plan  | static_ms | autotune_ms | speedup |
  +--------+------------+-------+-----------+-------------+---------+
  | cpu    | attn q4_k  | fused | 24.534    | 22.205      | 1.10x   |
  | cpu    | attn q5_k  | fused | 24.922    | 23.628      | 1.05x   |
  | cpu    | attn q6_k  | fused | 16.560    | 14.065      | 1.18x   |
  | cpu    | mlp q4_k   | fused | 48.167    | 44.739      | 1.08x   |
  | cpu    | mlp q5_k   | fused | 58.313    | 53.621      | 1.09x   |
  | cpu    | mlp q6_k   | fused | 53.546    | 49.650      | 1.08x   |
  | cuda   | attn q4_k  | none  | 0.543     | 0.530       | 1.02x   |
  | cuda   | attn q5_k  | none  | 0.649     | 0.647       | 1.00x   |
  | cuda   | attn q6_k  | none  | 0.507     | 0.612       | 0.83x   |
  | cuda   | mlp q4_k   | fused | 0.589     | 0.593       | 0.99x   |
  | cuda   | mlp q5_k   | fused | 0.692     | 0.702       | 0.99x   |
  | cuda   | mlp q6_k   | fused | 0.525     | 0.521       | 1.01x   |
  +--------+------------+-------+-----------+-------------+---------+

  Autotuned dispatch with --force-candidate, to answer the earlier attention question directly:

  +--------+------------+---------------+-----------+-------------+---------+
  | device | case       | autotune plan | static_ms | autotune_ms | speedup |
  +--------+------------+---------------+-----------+-------------+---------+
  | cpu    | attn q4_k  | fused         | 20.209    | 18.833      | 1.07x   |
  | cpu    | attn q5_k  | fused         | 26.965    | 25.374      | 1.06x   |
  | cpu    | attn q6_k  | fused         | 20.340    | 18.419      | 1.10x   |
  | cpu    | mlp q4_k   | fused         | 54.147    | 49.410      | 1.10x   |
  | cpu    | mlp q5_k   | fused         | 58.414    | 50.907      | 1.15x   |
  | cpu    | mlp q6_k   | fused         | 46.514    | 48.151      | 0.97x   |
  | cuda   | attn q4_k  | dense         | 0.543     | 0.540       | 1.00x   |
  | cuda   | attn q5_k  | dense         | 0.656     | 0.625       | 1.05x   |
  | cuda   | attn q6_k  | dense         | 0.459     | 0.467       | 0.98x   |
  | cuda   | mlp q4_k   | fused         | 0.699     | 0.645       | 1.08x   |
  | cuda   | mlp q5_k   | fused         | 0.677     | 0.672       | 1.01x   |
  | cuda   | mlp q6_k   | fused         | 0.570     | 0.701       | 0.81x   |
  +--------+------------+---------------+-----------+-------------+---------+
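
The static_ms vs autotune_ms columns above compare a fixed plan against one picked by timing candidates after warmup. A rough sketch of that kind of dispatch, under the assumption that each candidate is a zero-argument callable; `autotune` and `speedup` are hypothetical helpers for illustration, not the autotuner shipped in this PR. The speedup columns are consistent with static_ms / autotune_ms (e.g. 24.534 / 22.205 ≈ 1.10).

```python
import time
from typing import Callable, Dict, Tuple


def autotune(candidates: Dict[str, Callable[[], object]],
             warmup: int = 2, iters: int = 5) -> Tuple[str, Dict[str, float]]:
    """Time each candidate after warmup and return (fastest name, ms per call)."""
    timings: Dict[str, float] = {}
    for name, fn in candidates.items():
        # Warmup runs are discarded so one-time setup cost is not measured.
        for _ in range(warmup):
            fn()
        start = time.perf_counter()
        for _ in range(iters):
            fn()
        timings[name] = (time.perf_counter() - start) * 1000.0 / iters
    best = min(timings, key=timings.get)
    return best, timings


def speedup(static_ms: float, autotune_ms: float) -> float:
    """Speedup as reported in the tables: values above 1.0 mean autotune won."""
    return static_ms / autotune_ms


# Hypothetical stand-ins for a dense path and a fused path.
plan, ms = autotune({
    "dense": lambda: time.sleep(0.01),
    "fused": lambda: None,
}, warmup=1, iters=3)
```

A real dispatcher would also cache the chosen plan per shape and device, which this sketch omits.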

Signed-off-by: ZX-ModelCloud <zx@modelcloud.ai>

Qubitium added 2 commits March 9, 2026 04:27

init calibration less quant refractor

1b1e8f4

refractor quant config

661326b

github-code-quality Bot found potential problems Mar 9, 2026

Comment thread gptqmodel/looper/calibrationless_gptq_processor.py Fixed

Comment thread gptqmodel/looper/calibrationless_gptq_processor.py Fixed

Qubitium added 4 commits March 9, 2026 05:26

refractor quant config 2

781ba2c

refractor quant config 3

995b5da

rename calibrationless to weight_only

55084cd

fix awq oom

c00531f

github-code-quality Bot found potential problems Mar 9, 2026

Comment thread tests/models/model_test.py Fixed

Qubitium added 11 commits March 9, 2026 07:13

v6.0.0 update

c1f125b

cleanup

dd85bc8

stable return tuples

e2cc88e

accelerate depend 1.13.0

3e29f23

cleanup hf kernel gptq/awq post_init loading

ee1ba3f

fix test

e256c76

fix SmoothMAD overly-aggressive clipping: normalize k to behave like a sigma-width window

88e4e9b

simplify

a3b5ee7

initial gguf

b5da414

gguf refractor

1cb7bca

gguf refractor

75eb3db
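
On the SmoothMAD commit in this batch (88e4e9b): the diff is not shown here, so the following is only a guess at what "normalize k to behave like a sigma-width window" could mean. For normally distributed data, 1.4826 × MAD estimates the standard deviation, so scaling MAD by that constant lets k be read in sigma units. `mad_clip` is a hypothetical name, not the SmoothMAD implementation.

```python
from statistics import median
from typing import List

# Consistency constant: for normal data, sigma is approximately 1.4826 * MAD.
MAD_TO_SIGMA = 1.4826


def mad_clip(values: List[float], k: float = 3.0) -> List[float]:
    """Clip values to median +/- k robust-sigmas (illustrative assumption)."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    sigma = MAD_TO_SIGMA * mad
    if sigma == 0.0:
        # More than half the values equal the median; leave the data unclipped.
        return list(values)
    lo, hi = center - k * sigma, center + k * sigma
    return [min(hi, max(lo, v)) for v in values]


clipped = mad_clip([1.0, 2.0, 3.0, 4.0, 100.0], k=3.0)
```

Without the 1.4826 factor, the same k gives a window about a third narrower, which matches the "overly-aggressive clipping" described in the commit title.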

github-code-quality Bot found potential problems Mar 9, 2026

Comment thread gptqmodel/nn_modules/qlinear/gguf.py Fixed

Qubitium added 3 commits March 10, 2026 01:05

gguf unit test

0287831

fix gguf should directly bypass rtn with optional smoother

9d3c96d

add test

e8da713

github-code-quality Bot found potential problems Mar 10, 2026

Comment thread gptqmodel/nn_modules/qlinear/gguf.py Fixed

Qubitium added 2 commits March 10, 2026 03:42

refractor config

c14d624

refractor config part2

c76acac

github-code-quality Bot found potential problems Mar 10, 2026

Comment thread gptqmodel/quantization/config.py Fixed

Comment thread tests/qcfg/test_config_dispatch.py Fixed

Comment thread tests/test_weight_only_config.py Fixed

Comment thread tests/test_weight_only.py Fixed

Qubitium added 2 commits March 10, 2026 04:51

gguf dequant to native type, not fp32

af124eb

fuse gguf ops

c52e82b

add comments

8a012db

github-code-quality Bot found potential problems Mar 19, 2026

Comment thread gptqmodel/looper/paroquant_processor.py Fixed

Qubitium added 2 commits March 19, 2026 18:46

extract fwd execution (serial/parallel) into own module

3ff82d0

fold execution related proerties into ExecutionConfig

5b72e41

github-code-quality Bot found potential problems Mar 19, 2026

Comment thread gptqmodel/looper/forward_executor.py Fixed

Comment thread gptqmodel/looper/module_looper.py Fixed

Comment thread gptqmodel/looper/module_looper.py Fixed

Comment thread gptqmodel/looper/forward_executor.py Fixed

Qubitium and others added 16 commits March 20, 2026 03:32

simplification of subset control flow

5ae6c1c

Merge remote-tracking branch 'origin/main' into refractor-simple-quant

5c5d736

convert run_subset_stage() to a plan-only

33a33ad

add small moe test

ebb4cf4

reduce logging noise for ci

4801b7f

reduce logbar pb ci output

0d43a6a

update log bar depend

f6f8f1e

fix test_deepseekv2_lite.py

41ab4ee

Signed-off-by: ZX-ModelCloud <zx@modelcloud.ai>

fix test_dream.py

4faed80

Signed-off-by: ZX-ModelCloud <zx@modelcloud.ai>

fix test_brumby.py

4dc4e08

Signed-off-by: ZX-ModelCloud <zx@modelcloud.ai>

fix test_ernie4_5.py

ca7d282

Signed-off-by: ZX-ModelCloud <zx@modelcloud.ai>

remove test_ernie4_5_monkeypatch.py

fd5739b

Signed-off-by: ZX-ModelCloud <zx@modelcloud.ai>

ruff

29dac9e

fix import

a5605bb

add split_by=layer save option

95467c9

cleanup split_by

1137e69

Qubitium marked this pull request as ready for review March 20, 2026 12:39

Qubitium added 6 commits March 20, 2026 13:14

cleanup

ba0d24c

fix citation

a2434eb

fix paroquant

0b38ebf

headers

12d36e3

cleanup ide

3ee3f70

update readme

701bd25

Qubitium merged commit 96ff08b into main Mar 22, 2026
6 checks passed

Qubitium deleted the refractor-simple-quant branch March 22, 2026 03:24

Qubitium merged 130 commits into main from refractor-simple-quant

2 participants
