Commit c6a7e02

Production-hardening: fix v2.1.0rc1 quantization-state and architecture-fallback bugs
Reflects the issues identified in the codebase review on top of PR #27.

Correctness fixes
-----------------
* TurboModel._is_quantized is now a property derived from the loaded model's
  config.quantization_config and BitsAndBytes layer types, with an opt-in
  override slot used by from_gguf. This fixes:
  - from_config_only=True returning a random-weights model that was
    misreported as quantized;
  - missing bitsandbytes installs falling through silently while the flag
    stayed True;
  - pre-quantized HF repos (GPTQ/AWQ/etc.) not being recognized when the
    user passed quantize=False.
* resolve_model_type now consults DEFAULT_ARCHITECTURE_FALLBACKS for unknown
  HF model_types and recognizes version-suffix patterns (qwen3 -> qwen2,
  llama4 -> llama, phi4 -> phi3, gemma3 -> gemma2, ...). The old logic only
  consulted the table when the config's model_type was empty, which never
  happens in practice.
* Classes registered via register_architecture(model_class=...) are now
  discoverable under the original architecture name as well as the resolved
  base family, matching the documented API.
* Removed an accidentally duplicated 'if is_bnb and is_8bit ...' block in
  the existing-quant detection branch.

Robustness for new architectures and consumer hardware
-------------------------------------------------------
* Greatly expanded DEFAULT_ARCHITECTURE_FALLBACKS (Llama 2/3/4,
  Qwen 2/2-MoE/3, Phi/3/4, Gemma/2/3, DeepSeek V2/V3, Cohere/Command-R,
  OLMo/2, SmolLM/2/3, Yi, StarCoder/2, InternLM/2, Baichuan, ChatGLM,
  StableLM, Falcon).
* Pre-quantized HF repo names (Unsloth-style *-bnb-4bit, *-AWQ, *-GPTQ,
  *-INT4, *-FP8, etc.) are detected and surfaced as a hint; the embedded
  quantization_config is honoured.
* GGUF-only repo names trigger a friendly hint pointing at from_gguf.
* New TurboModel.report() returns a structured snapshot of the actual loaded
  model state (quant_method, device, dtype, params_billion).
* The public TurboModel.is_quantized property is now the canonical answer,
  rather than an instance flag that could drift.

Production hygiene
------------------
* New .github/workflows/ci.yml runs ruff + pytest on Python 3.10/3.11/3.12
  and validates the build with python -m build / twine check.
* New pyproject.toml provides PEP 517/518 build metadata plus a conservative
  ruff lint profile (only blocker-class rules) and pytest defaults.
* New .pre-commit-config.yaml for local pre-commit enforcement.
* New CHANGELOG.md documenting every change.

Tests
-----
* tests/test_quantization_state.py covers the from_config_only and
  is_quantized property fixes, the report() schema, and the override setter.
* tests/test_resolve_model_type.py covers the fallback-table consultation,
  family-suffix matching, and registry-class lookup ergonomics.

Docs
----
* docs/guide/loading-models.md updated to reflect the now-automatic
  fallbacks, the pre-quantized repo detection, and report().
* docs/guide/consumer-hardware.md added with per-tier guidance for CPU-only,
  Apple Silicon, 4-8 GB, 12-24 GB, and multi-GPU setups.
1 parent c32c63d commit c6a7e02
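For context, here is a minimal sketch of the derived-property pattern the commit message describes: `is_quantized` is computed from the loaded model at call time instead of being cached in an instance flag, with an opt-in override slot for loaders such as `from_gguf`. The class body and helper structure below are illustrative assumptions, not the project's actual code.

```python
# Minimal sketch (not QuantLLM's actual implementation) of deriving the
# quantization state from the loaded model rather than an instance flag.
from typing import Optional


class TurboModel:  # illustrative stand-in for the real class
    def __init__(self, model):
        self.model = model
        # Opt-in override used by loaders (e.g. from_gguf) that know the
        # weights are quantized even though HF config metadata is absent.
        self._is_quantized_override: Optional[bool] = None

    @property
    def is_quantized(self) -> bool:
        if self._is_quantized_override is not None:
            return self._is_quantized_override
        # 1) Pre-quantized repos (GPTQ / AWQ / bnb / ...) ship a
        #    quantization_config in the model config.
        if getattr(self.model.config, "quantization_config", None) is not None:
            return True
        # 2) Otherwise look for BitsAndBytes layer types in the module tree.
        try:
            import bitsandbytes.nn as bnb_nn
        except ImportError:
            return False  # no bitsandbytes installed -> cannot be bnb-quantized
        bnb_types = (bnb_nn.Linear4bit, bnb_nn.Linear8bitLt)
        return any(isinstance(m, bnb_types) for m in self.model.modules())
```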

9 files changed

Lines changed: 1220 additions & 60 deletions

.github/workflows/ci.yml

Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
name: CI

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  workflow_dispatch:

# Cancel in-progress runs of the same workflow on the same branch / PR.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

permissions:
  contents: read

jobs:
  lint:
    name: Ruff lint
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: pip
      - name: Install ruff
        run: |
          python -m pip install --upgrade pip
          pip install "ruff>=0.5.0"
      - name: Run ruff
        run: ruff check .

  tests:
    name: Tests (Python ${{ matrix.python-version }})
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.10", "3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
          cache: pip
      - name: Install minimal runtime deps
        # We deliberately install CPU-only ``torch`` to keep CI fast and avoid
        # pulling CUDA / cuDNN wheels. Tests stub out the heavy I/O and never
        # touch a real GPU.
        run: |
          python -m pip install --upgrade pip
          pip install --index-url https://download.pytorch.org/whl/cpu "torch>=2.0.0"
          pip install \
            "transformers>=4.36.0" \
            "datasets>=2.14.0" \
            "accelerate>=0.24.0" \
            "peft>=0.6.0" \
            "scipy>=1.10.0" \
            "scikit-learn>=1.3.0" \
            "tqdm>=4.65.0" \
            "rich>=13.0.0" \
            "huggingface_hub>=0.20.0" \
            "psutil" \
            "gguf" \
            "py-cpuinfo" \
            "pytest>=7.4.0"
      - name: Run pytest
        env:
          QUANTLLM_BANNER: "0"
        run: pytest tests/ -ra

  build:
    name: Build sdist + wheel
    runs-on: ubuntu-latest
    needs: [lint, tests]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: pip
      - name: Install build tooling
        run: |
          python -m pip install --upgrade pip
          pip install build twine
      - name: Build distribution
        run: python -m build
      - name: Validate artifacts
        run: twine check dist/*
      - name: Upload artifacts
        uses: actions/upload-artifact@v4
        with:
          name: dist-${{ github.sha }}
          path: dist/
          retention-days: 14

.pre-commit-config.yaml

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
# Pre-commit configuration for QuantLLM contributors.
#
# Setup:
#   pip install pre-commit
#   pre-commit install
#
# The hooks below run a fast subset of CI locally before each commit. The full
# test suite still runs in GitHub Actions; pre-commit only blocks obviously
# broken commits (lint failures, leftover merge markers, accidentally
# committed large files, etc.).

repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-toml
      - id: check-merge-conflict
      - id: check-added-large-files
        args: ["--maxkb=1024"]
      - id: debug-statements
      - id: mixed-line-ending
        args: ["--fix=lf"]

  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.5.7
    hooks:
      - id: ruff
        args: ["--fix"]

CHANGELOG.md

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
# Changelog

All notable changes to QuantLLM are recorded here. The format follows
[Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and the project
adheres to [Semantic Versioning](https://semver.org/).

## [Unreleased] — production hardening on top of v2.1.0rc1

### Fixed

- **`is_quantized` no longer lies about the loaded model state.** The
  attribute is now a derived property reading
  `model.config.quantization_config` (and BitsAndBytes layer types) at
  call time. This fixes three concrete bugs in v2.1.0rc1:
  * `from_config_only=True` previously left `_is_quantized=True` even
    though `AutoModelForCausalLM.from_config(...)` returns a
    random-weights model with no quantization. The flag is now `False`
    and a warning is emitted to make the random-weights nature explicit.
  * A missing `bitsandbytes` install used to silently fall through to
    full precision while keeping `_is_quantized=True`. We now log a
    descriptive warning and report `False`.
  * Pre-quantized HF repos that already shipped a `quantization_config`
    (GPTQ, AWQ, etc.) are now correctly reported as quantized regardless
    of the user's `quantize=False` flag.
- **`DEFAULT_ARCHITECTURE_FALLBACKS` is now actually consulted.** The
  fallback table introduced by PR #27 was dead code whenever HF returned
  a non-empty `model_type` (i.e. always). `resolve_model_type` now
  checks the table directly and recognises common version-suffix
  patterns (`qwen3` → `qwen2`, `llama4` → `llama`, `phi4` → `phi3`,
  `gemma3` → `gemma2`, etc.).
- **`register_architecture` class lookup now uses the natural API.**
  Calling `register_architecture("newmodel", base_model_type="llama",
  model_class=NewModel)` previously stored the class under `"newmodel"`
  but looked it up under `"llama"`, so the fallback path silently
  ignored it. The lookup now tries the original `config.model_type`
  first and falls back to the resolved base family.
- Removed an accidentally duplicated `if is_bnb and is_8bit ...` block
  in the existing-quant detection branch of
  `TurboModel.from_pretrained`.

### Added

- **`TurboModel.is_quantized` public property** plus
  **`TurboModel.report()`** returning a structured dict (`model_id`,
  `params_billion`, `requested_bits`, `effective_loading_bits`,
  `is_quantized`, `quant_method`, `device`, `dtype`, `finetuned`,
  `lora_applied`). Use `report()` to assert programmatically what the
  loader actually produced.
- **Pre-quantized repo detection.** Repository names matching
  `*-bnb-4bit`, `*-bnb-8bit`, `*-AWQ`, `*-GPTQ`, `*-INT4`, `*-INT8`,
  `*-FP8`, `*-EETQ`, `*-HQQ`, `*-AQLM` log a friendly hint that the
  embedded `quantization_config` will be honoured rather than
  re-quantized.
- **GGUF-only repo hint.** When a name contains `-gguf` / `.gguf`,
  `from_pretrained` warns and points the user at `from_gguf`.
- **Expanded `DEFAULT_ARCHITECTURE_FALLBACKS` table** covering Llama 2/3/4,
  Mistral / Mixtral, Qwen 2 / 2-MoE / 3, Phi / Phi-3 / Phi-4, Gemma /
  Gemma 2 / Gemma 3, Falcon, Cohere / Command-R, DeepSeek (V2/V3),
  OLMo / OLMo 2, SmolLM / SmolLM 2 / SmolLM 3, Yi, StarCoder /
  StarCoder 2, InternLM / InternLM 2, Baichuan, ChatGLM and StableLM.
- **Real CI workflow** at `.github/workflows/ci.yml` running ruff,
  pytest on Python 3.10 / 3.11 / 3.12, and `python -m build` +
  `twine check` on every PR.
- **`pyproject.toml`** providing PEP 517 / 518 build metadata, a
  conservative ruff lint profile and pytest defaults.
- **`.pre-commit-config.yaml`** for local enforcement (whitespace,
  end-of-file fixer, large-file guard, ruff with autofix).
- **`docs/guide/consumer-hardware.md`** documenting expected behaviour
  on every tier of consumer hardware (CPU-only, ≤ 8 GB VRAM,
  12 – 24 GB, Apple Silicon, multi-GPU) and how to inspect the loaded
  state.
- **Regression tests** for every fix above:
  * `tests/test_quantization_state.py` — runtime quantization state
    tracking, `from_config_only` honesty, `report()` schema.
  * `tests/test_resolve_model_type.py` — fallback table consultation,
    family-suffix matching, registry-class lookup ergonomics.

### Changed

- `TurboModel.__repr__` now reads from the new `is_quantized` property
  and degrades gracefully when `num_parameters()` is unavailable
  (mocked / lazily-loaded models).
- `TurboModel.from_gguf` now sets `_is_quantized_override = True`
  rather than mutating an attribute the type system thought was a
  property -- this is functionally identical but more honest about the
  contract.
- The "bitsandbytes not installed" warning now explains how to install
  it and explicitly states that loading falls back to full precision.

## [2.0.0] — 2025-12-21

Initial public release of the `turbo()` API and the GGUF / ONNX / MLX
export pipeline. See the GitHub
[releases page](https://github.com/codewithdark-git/QuantLLM/releases/tag/v2.0.0)
for the full notes.
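Finally, a hypothetical usage sketch of the `report()` / `is_quantized` API described in the changelog. The import path, model id, and printed values are assumptions for illustration only, not actual QuantLLM output.

```python
# Hypothetical usage of the report() / is_quantized API described above.
from quantllm import TurboModel  # import path assumed from the project name

# A pre-quantized bnb-4bit repo name is illustrative; any such repo's embedded
# quantization_config is honoured, so is_quantized reports True even though
# quantize=False was passed.
model = TurboModel.from_pretrained(
    "unsloth/Llama-3.2-1B-bnb-4bit", quantize=False
)
print(model.is_quantized)

# report() returns a structured snapshot of what the loader actually produced:
# model_id, params_billion, requested_bits, effective_loading_bits,
# is_quantized, quant_method, device, dtype, finetuned, lora_applied.
snapshot = model.report()
print(snapshot["quant_method"], snapshot["device"], snapshot["dtype"])
```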