Commit 67fc8a1

danmcleran and claude committed
Add Phase 17 pure-integer Q-format <-> int8 bridge + hybrid importer
Closes the offline-PyTorch/TF-training -> embedded-inference path for hybrid models that want the int8 affine grid at the boundaries (standard TFLite / ONNX QDQ export shape) but a Q-format middle tier (existing NeuralNet<Q8.8> MCU code, or a hidden layer that prefers QValue's compile-time fixed/fractional bit split).

Runtime additions (cpp/qbridge.hpp):
* AffineToQValueIntParams / QValueToAffineIntParams integer triples.
* affineToQValueInt / qValueToAffineInt (+ buffer variants) reuse the gemmlowp Q0.31 multiplyByQuantizedMultiplier primitive. No <cmath>, no <type_traits>, no float at runtime. Gated on QUANT, independent of FLOAT; deployable freestanding shape FLOAT=0 STD=0 QUANT=1.

Host helpers (FLOAT && STD gated):
* buildAffineToQValueIntParams / buildQValueToAffineIntParams emit the integer triples at calibration time.

Importer (apps/import_pytorch/tinymind_import.py):
* QFormatDense descriptor (Q-format dense, raw QValue integer weights).
* HybridBoundary descriptor for precision-tier transitions.
* quantize_multiplier / quantize_qformat_weights helpers.
* Emitter writes one precomputed (multiplier, shift, zero_point) triple per boundary; freestanding target reads them as data.

Documentation (apps/import_onnx/README.md, apps/import_pytorch/README.md):
* tf2onnx + onnxruntime.quantization QDQ recipe for TensorFlow / Keras models reaching the same emitter.
* Hybrid int8 + Q-format flow worked through end-to-end.

Tests:
* 5 new qbridge unit tests in unit_test/quantization (parity vs the float bridge within 1 LSB, round-trip closure, saturation, buffer variants).
* embedded_smoke_test.cpp exercises the integer bridges in quant_freestanding to lock the no-stdlib invariant.
* unit_test/integration adds the mixed_precision_mlp_int8_qformat golden-byte fixture.

Exemplar (examples/mixed_precision_mlp_int8_qformat/):
* int8 QDense -> qrelu -> pure-integer affineToQValue bridge -> Q8.8 dense matvec -> pure-integer qValueToAffine bridge -> int8 QDense classifier. Same make run / bench / golden mode triple as the other mixed-precision exemplars. Worst max-abs error vs float reference ~0.005 on the bundled synthetic dataset.

Plan update (QUANTIZATION.md): Q-format documented as a first-class peer in the mixed-precision storage tier list (was previously left out alongside fp16 / bf16 / int8). Phase 17 entry added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 55c35b6 commit 67fc8a1

13 files changed

Lines changed: 1215 additions & 21 deletions

CLAUDE.md

Lines changed: 4 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -115,11 +115,11 @@ The `unit_test/embedded` matrix continues to exercise the freestanding (`quant_f
115115

116116
Phase 9 of the roadmap adds composability between the previously orphaned `QValue` (Q-format) and `QAffineTensor` (int8 affine) pipelines, plus a half-precision storage tier for application-class CPUs.
117117

118-
- **`cpp/qbridge.hpp`** — Pointwise type converters at layer boundaries: `affineDequantize` / `affineQuantize`, `qValueToFloat` / `floatToQValue`, `qValueToAffine` / `affineToQValue`, plus buffer-batch versions. Float at runtime, no `<cmath>` (rounding via sign-aware cast). Gated on `TINYMIND_ENABLE_FLOAT`; freestanding-safe at `STD=0`. Enables hybrid pipelines like *int8 affine CNN frontend → Q-format LSTM head → int8 affine classifier*.
118+
- **`cpp/qbridge.hpp`** — Pointwise type converters at layer boundaries: `affineDequantize` / `affineQuantize`, `qValueToFloat` / `floatToQValue`, `qValueToAffine` / `affineToQValue`, plus buffer-batch versions. Float at runtime, no `<cmath>` (rounding via sign-aware cast). Gated on `TINYMIND_ENABLE_FLOAT`; freestanding-safe at `STD=0`. Enables hybrid pipelines like *int8 affine CNN frontend → Q-format LSTM head → int8 affine classifier*. Phase 17 adds a parallel pure-integer path inside the same header gated on `TINYMIND_ENABLE_QUANTIZATION` (independent of `FLOAT`): `AffineToQValueIntParams<QV>` / `QValueToAffineIntParams<QV>` + `affineToQValueInt` / `qValueToAffineInt` (and buffer variants) reuse the gemmlowp Q0.31 `multiplyByQuantizedMultiplier` primitive, so the deployable freestanding shape `FLOAT=0 STD=0 QUANT=1` can mix Q-format and int8 affine tiers at runtime without `<cmath>`. Host-side helpers `buildAffineToQValueIntParams<QV>` / `buildQValueToAffineIntParams<QV>` build the integer triples at calibration time and ship them as data.
119119
- **`cpp/include/tinymind_fp16.hpp`** — Software-only `fp16_t` (IEEE 754 binary16) and `bf16_t` (bfloat16) storage structs wrapping `uint16_t`. Conversion helpers (`floatToFp16` / `fp16ToFloat`, `floatToBf16` / `bf16ToFloat`) handle normals, subnormals, Inf, and NaN. Storage tier; SIMD specializations land via Phase 14's `simd_neon_fp16.hpp` (NEON FEAT_FP16 vector forms).
120120
- **`cpp/qbridge.hpp`** also provides `fp16ToAffineI8` / `affineI8ToFp16` / `bf16ToAffineI8` / `affineI8ToBf16` when `TINYMIND_ENABLE_FP16=1`.
121121

122-
The `unit_test/embedded` matrix exercises this as `fp16_freestanding` (`FLOAT=1 FP16=1 QUANT=1 STD=0`) to confirm the half-precision and bridge headers stay freestanding-clean.
122+
The `unit_test/embedded` matrix exercises the float bridges as `fp16_freestanding` (`FLOAT=1 FP16=1 QUANT=1 STD=0`); the Phase 17 integer bridges ride in the `quant_freestanding` corner (`QUANT=1 FLOAT=0 STD=0`) so both halves stay freestanding-clean.
123123

124124
### SIMD Performance Backend (optional, `TINYMIND_ENABLE_SIMD_*=1`)
125125

@@ -184,7 +184,7 @@ typedef NeuralNet<XorNNProperties> XorNN;
184184
- **`qlearn/`** — Boost.Test unit tests for Q-learning
185185
- **`quantization/`** — Boost.Test unit tests for the int8 quantization path: Requantizer round-trip, per-tensor / per-channel calibration, QConv2D / QDepthwise / QPointwise / QPool / QDense forward passes against a float reference. Phase 11 additions cover `foldBatchNorm` (fused-conv parity vs unfused conv→BN), `QBatchNorm2D` parity, `QLayerNorm1D` parity and constant-row edge case, and `QSoftmax1D` parity plus dominant-class saturation. Phase 12 additions cover `QLSTMCell` single-step parity vs a float LSTM reference, `QLSTMCell` int16-cell-state drift over a 256-step sequence, and `QGRUCell` single-step parity vs a float GRU reference. Phase 13 additions cover Q1.15 twiddle round-trip, `QFFT1D` magnitude-spectrum parity vs a naive float DFT, `QFFT1D` forward/inverse round-trip, `QAttention1D` parity vs a float linear-attention reference, `QAttentionSoftmax1D` parity vs a float softmax-attention reference, and a `QMultiHeadLinearAttention1D` stacking test. Phase 14 additions cover SIMD bit-exactness across pathological lengths, INT8 extreme-value patterns, full-layer `QDense` and `QConv2D` parity, and the `activeBackendName()` dispatch report. Phase 15 additions cover `PercentileObserver` outlier clipping + empty-buffer edge case, `KLDivergenceObserver` clip-threshold convergence vs a Gaussian + outliers dataset + empty edge case, `crossLayerEqualizeDense` output preservation under ReLU + zero-row skip, and `crossLayerEqualizeConv2D` output preservation. Builds with `TINYMIND_ENABLE_QUANTIZATION=1`; pass `-DTINYMIND_ENABLE_SIMD_*=1` plus the matching `-march=` flag to exercise a SIMD backend.
186186
- **`embedded/`** — Cross-corner regression matrix. Builds the smoke source under eight `(FLOAT, STD, QUANT, FP16, INT16_ACCUM, SIMD_*)` configurations: `freestanding`, `no_stdlib`, `no_fpu`, `hosted`, `quant_freestanding`, `fp16_freestanding`, `int16_accum_freestanding`, and `simd_disabled` (Phase 14 scalar-fallback corner — every `TINYMIND_ENABLE_SIMD_*=0` at the deployable freestanding shape). A separate `simd_prereq_regressions` make target locks the static_assert prerequisite chain (`AVX_VNNI=1, AVX2=0` and `AVX512_VNNI=1, AVX512F=0` must fail to compile).
187-
- **`integration/`** — Phase 16 golden-byte suite. One Boost.Test fixture per exemplar (`resnet18_block_int8`, `mobilenetv2_int8`, `mixed_precision_kws`, `transformer_encoder_int8`) shells out to the example binary with `--golden` and asserts the emitted int8 byte stream matches a baked-in expected string. Catches silent regressions in the inference path regardless of which SIMD backend dispatch resolves to.
187+
- **`integration/`** — Phase 16 golden-byte suite (extended in Phase 17). One Boost.Test fixture per exemplar (`resnet18_block_int8`, `mobilenetv2_int8`, `mixed_precision_kws`, `transformer_encoder_int8`, `mixed_precision_mlp_int8_qformat`) shells out to the example binary with `--golden` and asserts the emitted int8 byte stream matches a baked-in expected string. Catches silent regressions in the inference path regardless of which SIMD backend dispatch resolves to.
188188
189189
### Examples (`examples/`)
190190
@@ -201,6 +201,7 @@ typedef NeuralNet<XorNNProperties> XorNN;
201201
- **`resnet18_block_int8/`** — Phase 16 exemplar. int8 ResNet-18-shaped stem + one basic-block stage on a 16x16x3 input, 4 logits out. Same `make run` / `make bench` / `make golden` mode triple as the other Phase 16 exemplars.
202202
- **`mobilenetv2_int8/`** — Phase 16 exemplar. int8 MobileNetV2-shaped pipeline: stride-2 stem + one stride-1 inverted-residual block with skip + one stride-2 inverted-residual block, then GAP + dense. Linear bottlenecks per MNv2 convention.
203203
- **`mixed_precision_kws/`** — Phase 16 mixed-precision exemplar. int8 `QDense` frontend → Phase 9 `affineI8ToFp16` bridge → fp16 linear-attention head with residual skip + mean-pool → Phase 9 `fp16ToAffineI8` bridge → int8 `QDense` classifier. Requires `TINYMIND_ENABLE_FP16=1`.
204+
- **`mixed_precision_mlp_int8_qformat/`** — Phase 17 hybrid mixed-precision exemplar. int8 `QDense` frontend → `qrelu` → Phase 17 `affineToQValueIntBuffer` (pure-integer bridge) → Q8.8 dense matvec (int32 accumulator) → Phase 17 `qValueToAffineIntBuffer` (pure-integer bridge) → int8 `QDense` classifier. Deployable shape is `QUANT=1 FLOAT=0 STD=0`; the exemplar builds hosted for the parity report (~0.005 max-abs error vs the float reference) and wires into the integration suite via the same `--golden` mode.
204205
205206
### Apps (`apps/`)
206207

Makefile

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@ check :
2020
cd examples/resnet18_block_int8 && make clean && make && make release && make run && cd -
2121
cd examples/mobilenetv2_int8 && make clean && make && make release && make run && cd -
2222
cd examples/mixed_precision_kws && make clean && make && make release && make run && cd -
23+
cd examples/mixed_precision_mlp_int8_qformat && make clean && make && make release && make run && cd -
2324
cd unit_test/integration && make clean && make && make run && cd -
2425
cd examples/pytorch_quant/xor && make clean && make && make release && make run && cd -
2526
cd examples/import_demo && make clean && make && make release && make run && cd -

QUANTIZATION.md

Lines changed: 23 additions & 1 deletion
@@ -433,6 +433,28 @@ CPU complex and SIMD capability are orthogonal — the rows below describe *typi
433433
## Non-Goals (Still)
434434
435435
- **No QAT** in this roadmap. Post-training quantization remains the deployment path.
436-
- **No sub-4-bit / mixed precision below int8.** Storage tier list is {int8, int16-accum, fp16, bf16, fp32, Q-format}.
436+
- **No sub-4-bit / mixed precision below int8.** The storage tier list is {int8 affine, int16-accum, fp16, bf16, fp32, Q-format `QValue<I,F>`}, with the tiers connected as mixed-precision peers by the Phase 9 float bridges and the Phase 17 pure-integer bridges. Q-format is a first-class peer: the integer-only `affineToQValueInt` / `qValueToAffineInt` let it participate in hybrid models at the deployable freestanding shape `FLOAT=0 STD=0 QUANT=1`.
437437
- **No dynamic / runtime model loading.** Compile-time template shapes remain — codegen-from-PyTorch flow is the integration model (Phase 15).
438438
439+
## Phase 17 — Pure-Integer Q-format <-> int8 Bridge + Hybrid Importer [SHIPPED]
440+
441+
**Goal:** close the gap for the offline-training -> embedded-inference story where a model wants the int8 affine grid at the boundaries (PyTorch / TF / ONNX QDQ export shape) but a Q-format middle tier (existing `NeuralNet<Q8.8>` MCU code, or a hidden layer that prefers `QValue`'s compile-time fixed/fractional bit split). Phase 9 added the float-mediated `qValueToAffine` / `affineToQValue` bridges; Phase 17 ships the pure-integer counterparts so the inference path needs no `<cmath>` and no float at runtime.
442+
443+
**Scope (shipped):**
444+
- `cpp/qbridge.hpp` additions: `AffineToQValueIntParams<QV>` / `QValueToAffineIntParams<QV>` (integer triples) plus `affineToQValueInt` / `qValueToAffineInt` (+ buffer variants). Uses the same Q0.31 `multiplyByQuantizedMultiplier` primitive that `Requantizer` does — no new runtime dependency. Gated on `TINYMIND_ENABLE_QUANTIZATION`, independent of `TINYMIND_ENABLE_FLOAT`. Host-side helper builders `buildAffineToQValueIntParams<QV>` / `buildQValueToAffineIntParams<QV>` gated on `FLOAT && STD`.
445+
- `apps/import_pytorch/tinymind_import.py` additions: `QFormatDense` layer descriptor (Q-format dense weights/biases emitted as raw QValue integers, no scale or zero_point at runtime), `HybridBoundary` precision-tier transition descriptor, and `quantize_multiplier` / `quantize_qformat_weights` helpers. The emitter writes precomputed `(multiplier, shift, zero_point)` triples (plus `qmin`/`qmax` on the `qvalue_to_affine` side) directly into `weights.hpp` so the deployable target consumes them as data.
446+
- `apps/import_onnx/README.md` + `apps/import_pytorch/README.md` document the TensorFlow / Keras path via `tf2onnx` + `onnxruntime.quantization.quantize_static(quant_format=QuantFormat.QDQ)` plus the hybrid `QFormatDense` + `HybridBoundary` flow.
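
The triple computation described in the scope list can be sketched as follows. This is an illustration under stated assumptions, not the importer's code: the function names are hypothetical, the shift sign convention (positive means left shift) follows the common gemmlowp / TFLite decomposition and may differ from what `quantize_multiplier` actually returns, and the example scale and zero point are made up.

```python
import math
from typing import Tuple


def quantize_multiplier_sketch(real_multiplier: float) -> Tuple[int, int]:
    """Return (q31, shift) with real_multiplier ~= q31 * 2**(shift - 31).

    Hypothetical stand-in for the importer's helper; positive inputs only.
    """
    if real_multiplier <= 0.0:
        raise ValueError("sketch handles positive multipliers only")
    mantissa, shift = math.frexp(real_multiplier)  # mantissa in [0.5, 1)
    q31 = round(mantissa * (1 << 31))
    if q31 == (1 << 31):  # mantissa rounded up to 1.0: renormalize
        q31 >>= 1
        shift += 1
    return q31, shift


def affine_to_qvalue_triple(scale: float, zero_point: int, fractional_bits: int):
    # real = scale * (q - zero_point); raw QValue = real * 2**fractional_bits,
    # so the effective multiplier is scale * 2**fractional_bits.
    multiplier, shift = quantize_multiplier_sketch(scale * (1 << fractional_bits))
    return multiplier, shift, zero_point


def qvalue_to_affine_triple(scale: float, zero_point: int, fractional_bits: int):
    # q = raw / (scale * 2**fractional_bits) + zero_point
    multiplier, shift = quantize_multiplier_sketch(
        1.0 / (scale * (1 << fractional_bits))
    )
    return multiplier, shift, zero_point


# Made-up calibration result: int8 scale 0.05, zero point -3, Q8.8 middle tier.
print(affine_to_qvalue_triple(0.05, -3, 8))
print(qvalue_to_affine_triple(0.05, -3, 8))
```

Because the decomposition runs on the host, the target only ever sees three integers per boundary, which is what keeps the runtime path free of float.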
447+
448+
**Tests (shipped):**
449+
- `qbridge_int_affine_to_qvalue_matches_float_bridge` / `qbridge_int_qvalue_to_affine_matches_float_bridge` — pure-integer bridge stays within 1 LSB of the float bridge across the int8 / Q88 grid.
450+
- `qbridge_int_round_trip_within_tolerance` — float -> Q88 -> int8 (integer bridge) -> Q88 (integer bridge) -> float closes back within one affine LSB plus one QValue LSB.
451+
- `qbridge_int_qvalue_to_affine_saturates` — out-of-range Q88 inputs saturate to `[qmin, qmax]`.
452+
- `qbridge_int_buffer_round_trip` — buffer-variant parity.
453+
- `unit_test/embedded/embedded_smoke_test.cpp` exercises `affineToQValueInt` / `qValueToAffineInt` in the `quant_freestanding` corner, confirming the integer bridge stays freestanding-clean (no `<cmath>`, no `<type_traits>`, no stdlib).
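
The properties these tests lock (parity within 1 LSB, saturation, round-trip closure) can be reproduced with a small model of the bridge. The sketch below is a simplification: it uses a single rounding step on Python's unbounded integers, whereas the C++ `multiplyByQuantizedMultiplier` primitive works in 32/64-bit arithmetic and may round differently at ties. The triples are hard-coded for a made-up boundary (scale 0.05, zero point -3, Q8.8), and all function names are hypothetical.

```python
import math

# Precomputed for scale=0.05, zero_point=-3, Q8.8 (illustrative values only).
A2Q = (1717986918, 4, -3)    # affine -> QValue: multiplier ~= 12.8
Q2A = (1342177280, -3, -3)   # QValue -> affine: multiplier ~= 1 / 12.8


def rounding_shift_right(value: int, bits: int) -> int:
    """Divide by 2**bits, rounding half away from zero, integers only."""
    half = 1 << (bits - 1)
    if value >= 0:
        return (value + half) >> bits
    return -((-value + half) >> bits)


def multiply_by_quantized_multiplier(x: int, multiplier: int, shift: int) -> int:
    # x * multiplier * 2**(shift - 31), with one rounding step.
    return rounding_shift_right(x * multiplier, 31 - shift)


def affine_to_qvalue_int(q: int, multiplier: int, shift: int, zero_point: int) -> int:
    raw = multiply_by_quantized_multiplier(q - zero_point, multiplier, shift)
    return max(-32768, min(32767, raw))  # saturate to the Q8.8 raw range


def qvalue_to_affine_int(raw, multiplier, shift, zero_point, qmin=-128, qmax=127):
    q = multiply_by_quantized_multiplier(raw, multiplier, shift) + zero_point
    return max(qmin, min(qmax, q))       # saturate to [qmin, qmax]


def float_affine_to_qvalue(q: int, scale: float, zero_point: int) -> int:
    """Float-mediated reference: dequantize, then scale by 2**8 and round."""
    v = scale * (q - zero_point) * 256.0
    return int(math.floor(abs(v) + 0.5)) * (1 if v >= 0 else -1)


worst = max(
    abs(affine_to_qvalue_int(q, *A2Q) - float_affine_to_qvalue(q, 0.05, -3))
    for q in range(-128, 128)
)
print("worst integer-vs-float difference in Q8.8 LSBs:", worst)
```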
454+
455+
**Example:** `examples/mixed_precision_mlp_int8_qformat/` — int8 `QDense` -> `qrelu` -> Phase 17 `affineToQValueInt` bridge -> Q8.8 dense matvec -> Phase 17 `qValueToAffineInt` bridge -> int8 `QDense` classifier. `make run` reports max-abs error vs the float reference (~0.005 on the bundled dataset, well below the 60 %-of-output-range tolerance). `make golden` emits a deterministic int8 byte stream that the new `mixed_precision_mlp_int8_qformat_golden_match` integration fixture in `unit_test/integration/` locks at byte granularity.
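
For readers unfamiliar with the middle tier, a Q8.8 dense matvec can be sketched as below. This is not the exemplar's code: the function name is hypothetical, and Python integers do not overflow, so the int32 accumulator limit that the C++ exemplar works under is not modeled here.

```python
FRAC_BITS = 8
INT16_MIN, INT16_MAX = -32768, 32767


def q88_dense_sketch(weights_raw, bias_raw, x_raw):
    """y = W x + b on raw Q8.8 integers.

    Each product of two Q8.8 values carries 16 fractional bits, so the bias
    is pre-shifted to match and the sum is shifted back by 8 at the end.
    """
    half = 1 << (FRAC_BITS - 1)
    out = []
    for row, bias in zip(weights_raw, bias_raw):
        acc = bias << FRAC_BITS          # align bias with the product scale
        for w, x in zip(row, x_raw):
            acc += w * x                 # wide accumulator (int32 on target)
        if acc >= 0:                     # round half away from zero
            y = (acc + half) >> FRAC_BITS
        else:
            y = -((-acc + half) >> FRAC_BITS)
        out.append(max(INT16_MIN, min(INT16_MAX, y)))  # saturate to Q8.8
    return out


# 0.5 * 2.0 + (-1.0) * 1.0 + 0.25 = 0.25, i.e. raw 64 in Q8.8.
weights = [[128, -256]]   # [0.5, -1.0]
bias = [64]               # [0.25]
x = [512, 256]            # [2.0, 1.0]
print([v / 256.0 for v in q88_dense_sketch(weights, bias, x)])
```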
456+
457+
**Success criteria:** an offline-trained model with one or more Q-format hidden layers and int8 affine boundaries deployable end-to-end at `FLOAT=0 STD=0 QUANT=1`. ✓ shipped.
458+
459+
**Risk:** low. Pure addition — no edits to existing runtime headers' behavior at any pre-Phase-17 gate combination.
460+

apps/import_onnx/README.md

Lines changed: 60 additions & 0 deletions
@@ -47,3 +47,63 @@ layers = import_onnx_model(
4747

4848
`onnx` Python package is imported lazily inside `parse_onnx_model`, so
4949
the rest of the module (the emitter) is usable without it.
50+
51+
## TensorFlow / Keras via ONNX
52+
53+
TensorFlow and Keras models reach this importer through `tf2onnx` plus
54+
the ONNX runtime's static-quantization API. Recipe:
55+
56+
```bash
57+
pip install tf2onnx onnx onnxruntime
58+
59+
# 1. Export your TF / Keras model to ONNX.
60+
python -m tf2onnx.convert \
61+
--saved-model path/to/saved_model \
62+
--output model.onnx \
63+
--opset 13
64+
```
65+
66+
```python
67+
# 2. Post-training quantize to QDQ format.
68+
from onnxruntime.quantization import (
69+
quantize_static, QuantFormat, QuantType, CalibrationDataReader,
70+
)
71+
72+
class MyCalibReader(CalibrationDataReader):
73+
def __init__(self, dataset):
74+
self._it = iter([{"input": x} for x in dataset])
75+
def get_next(self):
76+
return next(self._it, None)
77+
78+
quantize_static(
79+
"model.onnx", "model_int8.onnx",
80+
calibration_data_reader=MyCalibReader(calib_inputs),
81+
quant_format=QuantFormat.QDQ,
82+
weight_type=QuantType.QInt8,
83+
activation_type=QuantType.QInt8,
84+
per_channel=False,
85+
)
86+
```
87+
88+
```python
89+
# 3. Emit weights.hpp.
90+
from tinymind_import_onnx import import_onnx_model
91+
import_onnx_model(
92+
model_path="model_int8.onnx",
93+
output_path="weights.hpp",
94+
namespace="my_model",
95+
)
96+
```
97+
98+
The same recipe works for any framework that can export to ONNX: JAX (via
99+
`jax2onnx`), MXNet, PaddlePaddle, etc.
100+
101+
## Hybrid int8 + Q-format models
102+
103+
The ONNX importer covers the int8 layers. If the deployable target
104+
inserts a Q-format hidden tier between two int8 layers (see
105+
`apps/import_pytorch/README.md` for the `QFormatDense` / `HybridBoundary`
106+
descriptors and `examples/mixed_precision_mlp_int8_qformat/` for the
107+
runnable C++ counterpart), parse the ONNX model with this importer
108+
to recover the int8 layers, then chain the result through the PyTorch
109+
importer's emitter passing the extra `boundaries` list.

apps/import_pytorch/README.md

Lines changed: 74 additions & 0 deletions
@@ -82,3 +82,77 @@ MinMax.
8282

8383
`examples/import_demo/` exercises this importer on a small MLP and
8484
verifies the C++ int8 forward against the float reference.
85+
86+
## Hybrid int8 + Q-format models
87+
88+
The importer also handles models that mix an int8 affine tier with the
89+
TinyMind Q-format (`QValue`) pipeline -- useful when an existing
90+
`NeuralNet<Q8.8>` hand-tuned for the MCU sits between two int8 layers
91+
exported from PyTorch, or when a specific hidden layer wants Q-format's
92+
compile-time fixed/fractional bit split.
93+
94+
Two extra descriptor kinds:
95+
96+
* `QFormatDense` -- Q-format dense layer carrying float weights /
97+
bias plus the QValue tag (`fixed_bits`, `fractional_bits`, `signed`).
98+
The emitter writes raw QValue integers, no scale or zero_point.
99+
* `HybridBoundary` -- precision-tier transition between two adjacent
100+
layers (`kind = "affine_to_qvalue"` or `"qvalue_to_affine"`,
101+
plus a `qformat` pointer carrying the fractional-bit count).
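
As a rough illustration of what "raw QValue integers" means for `QFormatDense`, the sketch below converts float weights to raw Q-format integers. It is not the importer's `quantize_qformat_weights`: the function name and signature are hypothetical, and it assumes the storage width is `fixed_bits + fractional_bits` with the sign bit counted inside `fixed_bits`.

```python
from typing import Iterable, List


def quantize_qformat_weights_sketch(
    values: Iterable[float],
    fixed_bits: int,
    fractional_bits: int,
    signed: bool = True,
) -> List[int]:
    """Scale by 2**fractional_bits, round half away from zero, saturate."""
    total_bits = fixed_bits + fractional_bits  # assumed storage width
    if signed:
        lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    else:
        lo, hi = 0, (1 << total_bits) - 1
    raw = []
    for v in values:
        scaled = v * (1 << fractional_bits)
        r = int(scaled + 0.5) if scaled >= 0 else -int(-scaled + 0.5)
        raw.append(max(lo, min(hi, r)))  # out-of-range values saturate
    return raw


# Q8.8 signed: 0.5 -> 128, -1.0 -> -256, 200.0 saturates at 32767.
print(quantize_qformat_weights_sketch([0.5, -1.0, 200.0], 8, 8))
```

Since the scale is fixed by the format at compile time, nothing besides the integers themselves has to be emitted for these layers.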
102+
103+
Pass a `boundaries` list to `import_pytorch_model`; the emitter writes
104+
one precomputed integer triple per boundary -- the same
105+
`(multiplier, shift, zero_point)` that `cpp/qbridge.hpp::affineToQValueInt`
106+
and `qValueToAffineInt` consume using integer arithmetic only. The deployable target shape
107+
`TINYMIND_ENABLE_QUANTIZATION=1, FLOAT=0, STD=0` reads them as data,
108+
no host-side helper call at startup.
109+
110+
```python
111+
mid = QFormatDense(name="qfmt_mid",
112+
weight=w_mid, bias=b_mid,
113+
input_name="hidden",
114+
forward=lambda x: x @ w_mid.T + b_mid,
115+
fractional_bits=8, fixed_bits=8, signed=True,
116+
observer=MinMaxObserver())
117+
layers = [
118+
Dense(name="fc1", weight=w1, bias=b1, input_name="input",
119+
forward=lambda x: x @ w1.T + b1,
120+
observer=MinMaxObserver()),
121+
ReLU(name="hidden", input_name="fc1"),
122+
mid,
123+
Dense(name="fc2", weight=w2, bias=b2, input_name="qfmt_mid",
124+
forward=lambda x: x @ w2.T + b2,
125+
observer=MinMaxObserver()),
126+
]
127+
boundaries = [
128+
HybridBoundary(from_name="hidden", to_name="qfmt_mid",
129+
kind="affine_to_qvalue", qformat=mid),
130+
HybridBoundary(from_name="qfmt_mid", to_name="fc2",
131+
kind="qvalue_to_affine", qformat=mid,
132+
qmin=-128, qmax=127),
133+
]
134+
import_pytorch_model(layers, ..., boundaries=boundaries)
135+
```
136+
137+
`examples/mixed_precision_mlp_int8_qformat/` is the runnable C++
138+
counterpart -- it builds the same pipeline shape with hand-crafted
139+
weights and reports max-abs error vs the float reference.
140+
141+
## Importing from TensorFlow / Keras
142+
143+
The PyTorch importer also covers Keras / TensorFlow models via the
144+
ONNX QDQ recipe described in `apps/import_onnx/README.md`. The short
145+
form:
146+
147+
1. Train + export TF / Keras model.
148+
2. Convert to ONNX: `python -m tf2onnx.convert --saved-model ... --output model.onnx`.
149+
3. Post-training quantize: `onnxruntime.quantization.quantize_static(
150+
"model.onnx", "model_int8.onnx", calibration_data_reader=...,
151+
quant_format=QuantFormat.QDQ, weight_type=QuantType.QInt8, activation_type=QuantType.QInt8)`.
152+
4. Parse with `apps/import_onnx/tinymind_import_onnx.py` and emit
153+
`weights.hpp`.
154+
155+
The hybrid int8 + Q-format flow above plugs into either entry point --
156+
the ONNX path emits the int8 layers' descriptors, then the caller
157+
inserts `QFormatDense` + `HybridBoundary` entries in the layer list
158+
before calling the emitter.
