Nothing to see here...#2456

Merged
Qubitium merged 130 commits into main from refractor-simple-quant on Mar 22, 2026
Qubitium (Collaborator) commented Mar 9, 2026

Too many changes to list them all, but the main ones are:

  1. Major refactor of BaseQuantLinear.
  2. Add paroquant, gguf, bitsandbytes, exllama v3, and rtn support.
  3. Partial rtn support was already added for the moe fallback, but rtn is now a standalone quant method.
  4. Add an optional dependency on the external bitsandbytes pkg, since the bitsandbytes kernel api is used.
  5. The gguf and exllama v3 kernels are ported and internal; a fast gguf triton kernel is added.
  6. Deprecate the non-working exllama_eora kernel (fusing of eora/lora ops with exllama v2 code).
  7. Refactor/redesign the quantization pipeline.
  8. failsafe is renamed to fallback.
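
For readers unfamiliar with item 3, here is a minimal sketch of what round-to-nearest (rtn) quantization does to one group of weights. It is a generic, symmetric illustration in plain Python and not the code added in this PR: the function names `rtn_quantize_group` and `rtn_dequantize_group` are hypothetical, and the real implementation may use zero points, different group handling, or packed storage.

```python
from typing import List, Tuple


def rtn_quantize_group(weights: List[float], bits: int = 4) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest quantization of one weight group.

    Hypothetical helper for illustration only; not part of GPTQModel.
    """
    qmax = (1 << (bits - 1)) - 1   # e.g. 7 for 4-bit signed
    qmin = -qmax - 1               # e.g. -8 for 4-bit signed
    max_abs = max(abs(w) for w in weights)
    # An all-zero group has no range, so fall back to a unit scale.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # No calibration data is involved: each weight is simply rounded
    # to the nearest representable level and clamped to the valid range.
    q = [max(qmin, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def rtn_dequantize_group(q: List[int], scale: float) -> List[float]:
    """Map integer levels back to approximate float weights."""
    return [v * scale for v in q]


group = [0.9, -0.35, 0.1, -1.2]
q, scale = rtn_quantize_group(group, bits=4)
restored = rtn_dequantize_group(q, scale)
print(q)  # [5, -2, 1, -7]
```

Because the scale is derived from the largest magnitude in the group, no value is clipped and the reconstruction error of each weight stays within half a quantization step.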

Comment thread gptqmodel/looper/calibrationless_gptq_processor.py Fixed
Comment thread gptqmodel/looper/calibrationless_gptq_processor.py Fixed
Comment thread tests/models/model_test.py Fixed
Comment thread gptqmodel/nn_modules/qlinear/gguf.py Fixed
Comment thread gptqmodel/nn_modules/qlinear/gguf.py Fixed
Comment thread gptqmodel/quantization/config.py Fixed
Comment thread tests/qcfg/test_config_dispatch.py Fixed
Comment thread tests/test_weight_only_config.py Fixed
Comment thread tests/test_weight_only.py Fixed

Qubitium (Collaborator, Author) commented Mar 10, 2026

Fused gguf benchmark, zen3:

  +--------+------------+-------------+----------+---------+
  | device | case       | baseline_ms | fused_ms | speedup |
  +--------+------------+-------------+----------+---------+
  | cpu    | attn q4_k  | 27.397      | 18.865   | 1.45x   |
  | cpu    | attn q5_k  | 28.618      | 22.510   | 1.27x   |
  | cpu    | attn q6_k  | 26.458      | 28.665   | 0.92x   |
  | cpu    | mlp q4_k   | 83.428      | 45.598   | 1.83x   |
  | cpu    | mlp q5_k   | 101.261     | 50.265   | 2.01x   |
  | cpu    | mlp q6_k   | 84.662      | 51.076   | 1.66x   |
  | cuda   | attn q4_k  | 0.778       | 0.652    | 1.19x   |
  | cuda   | attn q5_k  | 0.612       | 0.625    | 0.98x   |
  | cuda   | attn q6_k  | 0.433       | 0.440    | 0.99x   |
  | cuda   | mlp q4_k   | 0.793       | 0.596    | 1.33x   |
  | cuda   | mlp q5_k   | 0.943       | 0.780    | 1.21x   |
  | cuda   | mlp q6_k   | 0.720       | 0.535    | 1.35x   |
  +--------+------------+-------------+----------+---------+
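
  The speedup column reads as baseline time divided by fused time, so values below 1.00x mean the fused path was slower. A small sketch of that arithmetic, using three rows copied from the table above (the helper name `speedup` is just for illustration):

```python
def speedup(baseline_ms: float, fused_ms: float) -> float:
    """Ratio of baseline latency to fused latency; > 1 means fused is faster."""
    return baseline_ms / fused_ms


# (device, case, baseline_ms, fused_ms) copied from the table above.
rows = [
    ("cpu", "attn q4_k", 27.397, 18.865),
    ("cpu", "attn q6_k", 26.458, 28.665),
    ("cuda", "mlp q6_k", 0.720, 0.535),
]

lines = [f"{dev} {case}: {speedup(base, fused):.2f}x" for dev, case, base, fused in rows]
for line in lines:
    print(line)
# cpu attn q4_k: 1.45x
# cpu attn q6_k: 0.92x
# cuda mlp q6_k: 1.35x
```

  Since the millisecond values in the table are themselves rounded, a ratio recomputed this way can differ from the printed one in the last digit.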

  Autotuned dispatch, shipped defaults, post-warmup steady state:

  +--------+------------+-------+-----------+-------------+---------+
  | device | case       | plan  | static_ms | autotune_ms | speedup |
  +--------+------------+-------+-----------+-------------+---------+
  | cpu    | attn q4_k  | fused | 24.534    | 22.205      | 1.10x   |
  | cpu    | attn q5_k  | fused | 24.922    | 23.628      | 1.05x   |
  | cpu    | attn q6_k  | fused | 16.560    | 14.065      | 1.18x   |
  | cpu    | mlp q4_k   | fused | 48.167    | 44.739      | 1.08x   |
  | cpu    | mlp q5_k   | fused | 58.313    | 53.621      | 1.09x   |
  | cpu    | mlp q6_k   | fused | 53.546    | 49.650      | 1.08x   |
  | cuda   | attn q4_k  | none  | 0.543     | 0.530       | 1.02x   |
  | cuda   | attn q5_k  | none  | 0.649     | 0.647       | 1.00x   |
  | cuda   | attn q6_k  | none  | 0.507     | 0.612       | 0.83x   |
  | cuda   | mlp q4_k   | fused | 0.589     | 0.593       | 0.99x   |
  | cuda   | mlp q5_k   | fused | 0.692     | 0.702       | 0.99x   |
  | cuda   | mlp q6_k   | fused | 0.525     | 0.521       | 1.01x   |
  +--------+------------+-------+-----------+-------------+---------+

  Autotuned dispatch with --force-candidate, to answer the earlier attention question directly:

  +--------+------------+---------------+-----------+-------------+---------+
  | device | case       | autotune plan | static_ms | autotune_ms | speedup |
  +--------+------------+---------------+-----------+-------------+---------+
  | cpu    | attn q4_k  | fused         | 20.209    | 18.833      | 1.07x   |
  | cpu    | attn q5_k  | fused         | 26.965    | 25.374      | 1.06x   |
  | cpu    | attn q6_k  | fused         | 20.340    | 18.419      | 1.10x   |
  | cpu    | mlp q4_k   | fused         | 54.147    | 49.410      | 1.10x   |
  | cpu    | mlp q5_k   | fused         | 58.414    | 50.907      | 1.15x   |
  | cpu    | mlp q6_k   | fused         | 46.514    | 48.151      | 0.97x   |
  | cuda   | attn q4_k  | dense         | 0.543     | 0.540       | 1.00x   |
  | cuda   | attn q5_k  | dense         | 0.656     | 0.625       | 1.05x   |
  | cuda   | attn q6_k  | dense         | 0.459     | 0.467       | 0.98x   |
  | cuda   | mlp q4_k   | fused         | 0.699     | 0.645       | 1.08x   |
  | cuda   | mlp q5_k   | fused         | 0.677     | 0.672       | 1.01x   |
  | cuda   | mlp q6_k   | fused         | 0.570     | 0.701       | 0.81x   |
  +--------+------------+---------------+-----------+-------------+---------+
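
  To make the "autotune plan" column concrete, here is a rough sketch of the general idea behind autotuned dispatch: time each candidate after a warmup, remember the winner per case, and reuse it afterwards. This is a generic illustration and not the dispatcher in this PR. The class `AutotunedDispatch`, its `force` method (loosely mirroring the idea of forcing a candidate), and the toy `dense`/`fused` functions are all hypothetical.

```python
import time
from typing import Any, Callable, Dict, Hashable


class AutotunedDispatch:
    """Pick the fastest candidate per key once, then reuse that plan.

    Hypothetical sketch for illustration only; not part of GPTQModel.
    """

    def __init__(self, candidates: Dict[str, Callable[..., Any]],
                 warmup: int = 2, iters: int = 5) -> None:
        self.candidates = candidates
        self.warmup = warmup
        self.iters = iters
        self.plans: Dict[Hashable, str] = {}  # key -> chosen candidate name

    def _mean_seconds(self, fn: Callable[..., Any], *args: Any) -> float:
        # Warmup runs are discarded so one-time costs do not skew the result.
        for _ in range(self.warmup):
            fn(*args)
        # On CUDA, a real harness would also have to synchronize the device
        # before reading the clock, because kernel launches are asynchronous.
        start = time.perf_counter()
        for _ in range(self.iters):
            fn(*args)
        return (time.perf_counter() - start) / self.iters

    def force(self, key: Hashable, name: str) -> None:
        """Pin a key to one candidate, bypassing measurement."""
        if name not in self.candidates:
            raise KeyError(name)
        self.plans[key] = name

    def __call__(self, key: Hashable, *args: Any) -> Any:
        name = self.plans.get(key)
        if name is None:
            timings = {n: self._mean_seconds(f, *args)
                       for n, f in self.candidates.items()}
            name = min(timings, key=timings.get)
            self.plans[key] = name
        return self.candidates[name](*args)


def dense(x: int) -> int:
    time.sleep(0.002)  # stand-in for a slower path
    return x * 2


def fused(x: int) -> int:
    return x * 2


dispatch = AutotunedDispatch({"dense": dense, "fused": fused})
key = ("cpu", "mlp q4_k")
result = dispatch(key, 21)
chosen = dispatch.plans[key]
```

  In this toy setup the first call measures both candidates and later calls with the same key skip measurement entirely, which is why the tables above report post-warmup steady-state numbers rather than first-call latency.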

Comment thread gptqmodel/looper/paroquant_processor.py Fixed
Comment thread gptqmodel/looper/forward_executor.py Fixed
Comment thread gptqmodel/looper/module_looper.py Fixed
Comment thread gptqmodel/looper/module_looper.py Fixed
Comment thread gptqmodel/looper/forward_executor.py Fixed
Qubitium marked this pull request as ready for review March 20, 2026 12:39
Qubitium merged commit 96ff08b into main Mar 22, 2026
6 checks passed
Qubitium deleted the refractor-simple-quant branch March 22, 2026 03:24