Closes the offline-PyTorch/TF-training -> embedded-inference path for
hybrid models that want the int8 affine grid at the boundaries (standard
TFLite / ONNX QDQ export shape) but a Q-format middle tier (existing
NeuralNet<Q8.8> MCU code, or a hidden layer that prefers QValue's
compile-time fixed/fractional bit split).
Runtime additions (cpp/qbridge.hpp):
* AffineToQValueIntParams / QValueToAffineIntParams integer triples.
* affineToQValueInt / qValueToAffineInt (+ buffer variants) reuse the
gemmlowp Q0.31 multiplyByQuantizedMultiplier primitive. No <cmath>,
no <type_traits>, no float at runtime. Gated on QUANT, independent
of FLOAT; deployable freestanding shape FLOAT=0 STD=0 QUANT=1.
Host helpers (FLOAT && STD gated):
* buildAffineToQValueIntParams / buildQValueToAffineIntParams emit
the integer triples at calibration time.
Importer (apps/import_pytorch/tinymind_import.py):
* QFormatDense descriptor (Q-format dense, raw QValue integer weights).
* HybridBoundary descriptor for precision-tier transitions.
* quantize_multiplier / quantize_qformat_weights helpers.
* Emitter writes one precomputed (multiplier, shift, zero_point)
triple per boundary; freestanding target reads them as data.
Documentation (apps/import_onnx/README.md, apps/import_pytorch/README.md):
* tf2onnx + onnxruntime.quantization QDQ recipe for TensorFlow / Keras
models reaching the same emitter.
* Hybrid int8 + Q-format flow worked through end-to-end.
Tests:
* 5 new qbridge unit tests in unit_test/quantization (parity vs the
float bridge within 1 LSB, round-trip closure, saturation,
buffer variants).
* embedded_smoke_test.cpp exercises the integer bridges in
quant_freestanding to lock the no-stdlib invariant.
* unit_test/integration adds the mixed_precision_mlp_int8_qformat
golden-byte fixture.
Exemplar (examples/mixed_precision_mlp_int8_qformat/):
* int8 QDense -> qrelu -> pure-integer affineToQValue bridge ->
Q8.8 dense matvec -> pure-integer qValueToAffine bridge ->
int8 QDense classifier. Same make run / bench / golden mode triple
as the other mixed-precision exemplars. Worst max-abs error vs
float reference ~0.005 on the bundled synthetic dataset.
Plan update (QUANTIZATION.md): Q-format documented as a first-class
peer in the mixed-precision storage tier list (was previously left
out alongside fp16 / bf16 / int8). Phase 17 entry added.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CLAUDE.md: 4 additions & 3 deletions
@@ -115,11 +115,11 @@ The `unit_test/embedded` matrix continues to exercise the freestanding (`quant_f
Phase 9 of the roadmap adds composability between the previously orphaned `QValue` (Q-format) and `QAffineTensor` (int8 affine) pipelines, plus a half-precision storage tier for application-class CPUs.
-
-**`cpp/qbridge.hpp`** — Pointwise type converters at layer boundaries: `affineDequantize` / `affineQuantize`, `qValueToFloat` / `floatToQValue`, `qValueToAffine` / `affineToQValue`, plus buffer-batch versions. Float at runtime, no `<cmath>` (rounding via sign-aware cast). Gated on `TINYMIND_ENABLE_FLOAT`; freestanding-safe at `STD=0`. Enables hybrid pipelines like *int8 affine CNN frontend → Q-format LSTM head → int8 affine classifier*.
+
- **`cpp/qbridge.hpp`** — Pointwise type converters at layer boundaries: `affineDequantize` / `affineQuantize`, `qValueToFloat` / `floatToQValue`, `qValueToAffine` / `affineToQValue`, plus buffer-batch versions. Float at runtime, no `<cmath>` (rounding via sign-aware cast). Gated on `TINYMIND_ENABLE_FLOAT`; freestanding-safe at `STD=0`. Enables hybrid pipelines like *int8 affine CNN frontend → Q-format LSTM head → int8 affine classifier*. Phase 17 adds a parallel pure-integer path inside the same header gated on `TINYMIND_ENABLE_QUANTIZATION` (independent of `FLOAT`): `AffineToQValueIntParams<QV>` / `QValueToAffineIntParams<QV>` + `affineToQValueInt` / `qValueToAffineInt` (and buffer variants) reuse the gemmlowp Q0.31 `multiplyByQuantizedMultiplier` primitive, so the deployable freestanding shape `FLOAT=0 STD=0 QUANT=1` can mix Q-format and int8 affine tiers at runtime without `<cmath>`. Host-side helpers `buildAffineToQValueIntParams<QV>` / `buildQValueToAffineIntParams<QV>` build the integer triples at calibration time and ship them as data.
-**`cpp/qbridge.hpp`** also provides `fp16ToAffineI8` / `affineI8ToFp16` / `bf16ToAffineI8` / `affineI8ToBf16` when `TINYMIND_ENABLE_FP16=1`.
-
The `unit_test/embedded` matrix exercises this as `fp16_freestanding` (`FLOAT=1 FP16=1 QUANT=1 STD=0`) to confirm the half-precision and bridge headers stay freestanding-clean.
+
The `unit_test/embedded` matrix exercises the float bridges as `fp16_freestanding` (`FLOAT=1 FP16=1 QUANT=1 STD=0`); the Phase 17 integer bridges ride in the `quant_freestanding` corner (`QUANT=1 FLOAT=0 STD=0`) so both halves stay freestanding-clean.
- **`qlearn/`** — Boost.Test unit tests for Q-learning
- **`quantization/`** — Boost.Test unit tests for the int8 quantization path: Requantizer round-trip, per-tensor / per-channel calibration, QConv2D / QDepthwise / QPointwise / QPool / QDense forward passes against a float reference. Phase 11 additions cover `foldBatchNorm` (fused-conv parity vs unfused conv→BN), `QBatchNorm2D` parity, `QLayerNorm1D` parity and constant-row edge case, and `QSoftmax1D` parity plus dominant-class saturation. Phase 12 additions cover `QLSTMCell` single-step parity vs a float LSTM reference, `QLSTMCell` int16-cell-state drift over a 256-step sequence, and `QGRUCell` single-step parity vs a float GRU reference. Phase 13 additions cover Q1.15 twiddle round-trip, `QFFT1D` magnitude-spectrum parity vs a naive float DFT, `QFFT1D` forward/inverse round-trip, `QAttention1D` parity vs a float linear-attention reference, `QAttentionSoftmax1D` parity vs a float softmax-attention reference, and a `QMultiHeadLinearAttention1D` stacking test. Phase 14 additions cover SIMD bit-exactness across pathological lengths, INT8 extreme-value patterns, full-layer `QDense` and `QConv2D` parity, and the `activeBackendName()` dispatch report. Phase 15 additions cover `PercentileObserver` outlier clipping + empty-buffer edge case, `KLDivergenceObserver` clip-threshold convergence vs a Gaussian + outliers dataset + empty edge case, `crossLayerEqualizeDense` output preservation under ReLU + zero-row skip, and `crossLayerEqualizeConv2D` output preservation. Builds with `TINYMIND_ENABLE_QUANTIZATION=1`; pass `-DTINYMIND_ENABLE_SIMD_*=1` plus the matching `-march=` flag to exercise a SIMD backend.
- **`embedded/`** — Cross-corner regression matrix. Builds the smoke source under eight `(FLOAT, STD, QUANT, FP16, INT16_ACCUM, SIMD_*)` configurations: `freestanding`, `no_stdlib`, `no_fpu`, `hosted`, `quant_freestanding`, `fp16_freestanding`, `int16_accum_freestanding`, and `simd_disabled` (Phase 14 scalar-fallback corner — every `TINYMIND_ENABLE_SIMD_*=0` at the deployable freestanding shape). A separate `simd_prereq_regressions` make target locks the static_assert prerequisite chain (`AVX_VNNI=1, AVX2=0` and `AVX512_VNNI=1, AVX512F=0` must fail to compile).
-
- **`integration/`** — Phase 16 golden-byte suite. One Boost.Test fixture per exemplar (`resnet18_block_int8`, `mobilenetv2_int8`, `mixed_precision_kws`, `transformer_encoder_int8`) shells out to the example binary with `--golden` and asserts the emitted int8 byte stream matches a baked-in expected string. Catches silent regressions in the inference path regardless of which SIMD backend dispatch resolves to.
+
- **`integration/`** — Phase 16 golden-byte suite (extended in Phase 17). One Boost.Test fixture per exemplar (`resnet18_block_int8`, `mobilenetv2_int8`, `mixed_precision_kws`, `transformer_encoder_int8`, `mixed_precision_mlp_int8_qformat`) shells out to the example binary with `--golden` and asserts the emitted int8 byte stream matches a baked-in expected string. Catches silent regressions in the inference path regardless of which SIMD backend dispatch resolves to.
- **`resnet18_block_int8/`** — Phase 16 exemplar. int8 ResNet-18-shaped stem + one basic-block stage on a 16x16x3 input, 4 logits out. Same `make run` / `make bench` / `make golden` mode triple as the other Phase 16 exemplars.
- **`mobilenetv2_int8/`** — Phase 16 exemplar. int8 MobileNetV2-shaped pipeline: stride-2 stem + one stride-1 inverted-residual block with skip + one stride-2 inverted-residual block, then GAP + dense. Linear bottlenecks per MNv2 convention.
QUANTIZATION.md: 23 additions & 1 deletion
@@ -433,6 +433,28 @@ CPU complex and SIMD capability are orthogonal — the rows below describe *typi
## Non-Goals (Still)
- **No QAT** in this roadmap. Post-training quantization remains the deployment path.
-
- **No sub-4-bit / mixed precision below int8.** Storage tier list is {int8, int16-accum, fp16, bf16, fp32, Q-format}.
+
- **No sub-4-bit / mixed precision below int8.** Storage tier list (in mixed-precision peer relationship via Phase 9 bridges and Phase 17 pure-integer bridges) is {int8 affine, int16-accum, fp16, bf16, fp32, **Q-format `QValue<I,F>`** as a first-class peer — Phase 17 closes the loop with integer-only `affineToQValueInt` / `qValueToAffineInt` so the Q-format tier participates in hybrid models at the deployable freestanding shape `FLOAT=0 STD=0 QUANT=1`}.
- **No dynamic / runtime model loading.** Compile-time template shapes remain — codegen-from-PyTorch flow is the integration model (Phase 15).
**Goal:** close the gap for the offline-training -> embedded-inference story where a model wants the int8 affine grid at the boundaries (PyTorch / TF / ONNX QDQ export shape) but a Q-format middle tier (existing `NeuralNet<Q8.8>` MCU code, or a hidden layer that prefers `QValue`'s compile-time fixed/fractional bit split). Phase 9 added the float-mediated `qValueToAffine` / `affineToQValue` bridges; Phase 17 ships the pure-integer counterparts so the inference path needs no `<cmath>` and no float at runtime.
+
+
**Scope (shipped):**
+
- `cpp/qbridge.hpp` additions: `AffineToQValueIntParams<QV>` / `QValueToAffineIntParams<QV>` (integer triples) plus `affineToQValueInt` / `qValueToAffineInt` (+ buffer variants). Uses the same Q0.31 `multiplyByQuantizedMultiplier` primitive that `Requantizer` does — no new runtime dependency. Gated on `TINYMIND_ENABLE_QUANTIZATION`, independent of `TINYMIND_ENABLE_FLOAT`. Host-side helper builders `buildAffineToQValueIntParams<QV>` / `buildQValueToAffineIntParams<QV>` gated on `FLOAT && STD`.
+
- `apps/import_pytorch/tinymind_import.py` additions: `QFormatDense` layer descriptor (Q-format dense weights/biases emitted as raw QValue integers, no scale or zero_point at runtime), `HybridBoundary` precision-tier transition descriptor, and `quantize_multiplier` / `quantize_qformat_weights` helpers. The emitter writes precomputed `(multiplier, shift, zero_point)` triples (plus `qmin`/`qmax` on the `qvalue_to_affine` side) directly into `weights.hpp` so the deployable target consumes them as data.
+
- `apps/import_onnx/README.md` + `apps/import_pytorch/README.md` document the TensorFlow / Keras path via `tf2onnx` + `onnxruntime.quantization.quantize_static(quant_format=QuantFormat.QDQ)` plus the hybrid `QFormatDense` + `HybridBoundary` flow.
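
The `(multiplier, shift, zero_point)` triple mentioned in the scope list can be illustrated with a small sketch. This is not the importer's `quantize_multiplier`: the function name `sketch_quantize_multiplier`, the example scale, and the example zero point are all assumptions chosen for illustration. The sketch splits a positive real multiplier into a Q0.31 mantissa and a power-of-two shift, the decomposition commonly used with gemmlowp-style requantization.

```python
import math
from typing import Tuple


def sketch_quantize_multiplier(real_multiplier: float) -> Tuple[int, int]:
    """Split a positive real multiplier into (Q0.31 mantissa, shift).

    The result satisfies real_multiplier ~= (mantissa / 2**31) * 2**shift.
    Hypothetical helper; the importer's real signature may differ.
    """
    if real_multiplier <= 0.0:
        raise ValueError("multiplier must be positive")
    mantissa, exponent = math.frexp(real_multiplier)  # mantissa in [0.5, 1)
    q31 = round(mantissa * (1 << 31))
    if q31 == (1 << 31):  # rounding pushed the mantissa up to 1.0
        q31 //= 2
        exponent += 1
    return q31, exponent


# Affine int8 -> Q8.8: real = scale * (q - zero_point), Q8.8 raw = real * 2**8,
# so the effective multiplier is scale * 256 (assumed example values below).
affine_scale = 0.05
zero_point = -3
multiplier, shift = sketch_quantize_multiplier(affine_scale * (1 << 8))
triple = (multiplier, shift, zero_point)
print(triple)
```

Because the triple is computed on the host, the target only needs integer multiplies and shifts to apply it.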
+
+
**Tests (shipped):**
+
- `qbridge_int_affine_to_qvalue_matches_float_bridge` / `qbridge_int_qvalue_to_affine_matches_float_bridge` — pure-integer bridge stays within 1 LSB of the float bridge across the int8 / Q88 grid.
+
- `qbridge_int_round_trip_within_tolerance` — float -> Q88 -> int8 (integer bridge) -> Q88 (integer bridge) -> float closes back within one affine LSB plus one QValue LSB.
+
- `qbridge_int_qvalue_to_affine_saturates` — out-of-range Q88 inputs saturate to `[qmin, qmax]`.
- `unit_test/embedded/embedded_smoke_test.cpp` exercises `affineToQValueInt` / `qValueToAffineInt` in the `quant_freestanding` corner, confirming the integer bridge stays freestanding-clean (no `<cmath>`, no `<type_traits>`, no stdlib).
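
The parity, round-trip, and saturation properties listed above can be modelled in a few lines of Python. This is a sketch under stated assumptions, not the C++ in `cpp/qbridge.hpp`: the function names, the example grid (`scale = 0.05`, `zero_point = -3`), and the hardcoded constants are hypothetical, and `multiply_by_q31` uses a single round-half-up step rather than the exact rounding sequence of gemmlowp's `multiplyByQuantizedMultiplier`.

```python
# Precomputed integer parameters, as a host-side builder might emit them.
# Assumed affine grid: scale = 0.05, zero_point = -3, int8 range.
ZERO_POINT = -3
QMIN, QMAX = -128, 127
Q88_MIN, Q88_MAX = -32768, 32767
# scale * 2**8 = 12.8 = 0.8 * 2**4  -> Q0.31 mantissa of 0.8, shift +4
TO_Q88_MULT, TO_Q88_SHIFT = 1717986918, 4
# 1 / (scale * 2**8) = 0.078125 = 0.625 * 2**-3 -> mantissa of 0.625, shift -3
TO_AFFINE_MULT, TO_AFFINE_SHIFT = 1342177280, -3


def multiply_by_q31(x, multiplier, shift):
    """Integer-only x * (multiplier / 2**31) * 2**shift, rounded half up."""
    total_shift = 31 - shift
    return (x * multiplier + (1 << (total_shift - 1))) >> total_shift


def clamp(value, lo, hi):
    return max(lo, min(hi, value))


def affine_to_q88_int(q):
    """int8 affine code -> raw Q8.8 integer, no float involved."""
    raw = multiply_by_q31(q - ZERO_POINT, TO_Q88_MULT, TO_Q88_SHIFT)
    return clamp(raw, Q88_MIN, Q88_MAX)


def q88_to_affine_int(raw):
    """Raw Q8.8 integer -> int8 affine code, saturating to [QMIN, QMAX]."""
    q = multiply_by_q31(raw, TO_AFFINE_MULT, TO_AFFINE_SHIFT) + ZERO_POINT
    return clamp(q, QMIN, QMAX)


def affine_to_q88_float(q):
    """Float-mediated reference for the same conversion."""
    return clamp(round((q - ZERO_POINT) * 0.05 * 256.0), Q88_MIN, Q88_MAX)


codes = range(QMIN, QMAX + 1)
worst_lsb = max(abs(affine_to_q88_int(q) - affine_to_q88_float(q)) for q in codes)
round_trip_ok = all(q88_to_affine_int(affine_to_q88_int(q)) == q for q in codes)
print("worst parity error (LSB):", worst_lsb)
print("round trip closes:", round_trip_ok)
print("saturation:", q88_to_affine_int(Q88_MAX), q88_to_affine_int(Q88_MIN))
```

On this particular grid the round trip is exact because one int8 step spans 12.8 Q8.8 steps; a coarser Q-format or a finer affine scale would only close within the tolerance the tests describe.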
+
+
**Example:** `examples/mixed_precision_mlp_int8_qformat/` — int8 `QDense` -> `qrelu` -> Phase 17 `affineToQValueInt` bridge -> Q8.8 dense matvec -> Phase 17 `qValueToAffineInt` bridge -> int8 `QDense` classifier. `make run` reports max-abs error vs the float reference (~0.005 on the bundled dataset, well below the 60 %-of-output-range tolerance). `make golden` emits a deterministic int8 byte stream that the new `mixed_precision_mlp_int8_qformat_golden_match` integration fixture in `unit_test/integration/` locks at byte granularity.
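
For readers unfamiliar with the middle tier of this pipeline, the Q8.8 dense matvec on raw integer weights can be sketched as follows. This is an illustration only: `to_q88_raw` and `q88_dense` are hypothetical names, the weights are made up, and the round-half-up rescale is an assumption that may not match how TinyMind's `QValue` multiplication rounds.

```python
FRAC_BITS = 8                      # Q8.8: 8 integer bits, 8 fractional bits
Q_MIN, Q_MAX = -32768, 32767       # int16 storage range


def to_q88_raw(value):
    """Quantize a float to a raw Q8.8 integer with saturation."""
    return max(Q_MIN, min(Q_MAX, round(value * (1 << FRAC_BITS))))


def q88_dense(weights_raw, bias_raw, x_raw):
    """Dense layer on raw Q8.8 integers; no scale or zero point at runtime."""
    out = []
    for row, b in zip(weights_raw, bias_raw):
        # Products carry 16 fractional bits, so lift the bias to match.
        acc = b << FRAC_BITS
        for w, x in zip(row, x_raw):
            acc += w * x
        # Drop back to 8 fractional bits with round-half-up, then saturate.
        rounded = (acc + (1 << (FRAC_BITS - 1))) >> FRAC_BITS
        out.append(max(Q_MIN, min(Q_MAX, rounded)))
    return out


weights = [[0.5, -1.25], [2.0, 0.25]]
bias = [0.5, 0.0]
x = [2.0, 1.0]

w_raw = [[to_q88_raw(w) for w in row] for row in weights]
b_raw = [to_q88_raw(b) for b in bias]
x_raw = [to_q88_raw(v) for v in x]

y_raw = q88_dense(w_raw, b_raw, x_raw)
y = [v / (1 << FRAC_BITS) for v in y_raw]
print(y_raw, y)
```

The wide accumulator is the reason the products can be summed before a single rescale, which keeps the rounding error to one step per output.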
+
+
**Success criteria:** an offline-trained model with one or more Q-format hidden layers and int8 affine boundaries deployable end-to-end at `FLOAT=0 STD=0 QUANT=1`. ✓ shipped.
+
+
**Risk:** low. Pure addition — no edits to existing runtime headers' behavior at any pre-Phase-17 gate combination.