danmcleran
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 4 additions & 3 deletions
diff --git a/Makefile b/Makefile
Lines changed: 1 addition & 0 deletions
diff --git a/QUANTIZATION.md b/QUANTIZATION.md
Lines changed: 23 additions & 1 deletion
diff --git a/apps/import_onnx/README.md b/apps/import_onnx/README.md
Lines changed: 60 additions & 0 deletions
diff --git a/apps/import_pytorch/README.md b/apps/import_pytorch/README.md
Lines changed: 74 additions & 0 deletions
@@ -115,11 +115,11 @@ The `unit_test/embedded` matrix continues to exercise the freestanding (`quant_f
 
 Phase 9 of the roadmap adds composability between the previously orphaned `QValue` (Q-format) and `QAffineTensor` (int8 affine) pipelines, plus a half-precision storage tier for application-class CPUs.
 
-- **`cpp/qbridge.hpp`** — Pointwise type converters at layer boundaries: `affineDequantize` / `affineQuantize`, `qValueToFloat` / `floatToQValue`, `qValueToAffine` / `affineToQValue`, plus buffer-batch versions. Float at runtime, no `<cmath>` (rounding via sign-aware cast). Gated on `TINYMIND_ENABLE_FLOAT`; freestanding-safe at `STD=0`. Enables hybrid pipelines like *int8 affine CNN frontend → Q-format LSTM head → int8 affine classifier*.
+- **`cpp/qbridge.hpp`** — Pointwise type converters at layer boundaries: `affineDequantize` / `affineQuantize`, `qValueToFloat` / `floatToQValue`, `qValueToAffine` / `affineToQValue`, plus buffer-batch versions. Float at runtime, no `<cmath>` (rounding via sign-aware cast). Gated on `TINYMIND_ENABLE_FLOAT`; freestanding-safe at `STD=0`. Enables hybrid pipelines like *int8 affine CNN frontend → Q-format LSTM head → int8 affine classifier*. Phase 17 adds a parallel pure-integer path inside the same header gated on `TINYMIND_ENABLE_QUANTIZATION` (independent of `FLOAT`): `AffineToQValueIntParams<QV>` / `QValueToAffineIntParams<QV>` + `affineToQValueInt` / `qValueToAffineInt` (and buffer variants) reuse the gemmlowp Q0.31 `multiplyByQuantizedMultiplier` primitive, so the deployable freestanding shape `FLOAT=0 STD=0 QUANT=1` can mix Q-format and int8 affine tiers at runtime without `<cmath>`. Host-side helpers `buildAffineToQValueIntParams<QV>` / `buildQValueToAffineIntParams<QV>` build the integer triples at calibration time and ship them as data.
 - **`cpp/include/tinymind_fp16.hpp`** — Software-only `fp16_t` (IEEE 754 binary16) and `bf16_t` (bfloat16) storage structs wrapping `uint16_t`. Conversion helpers (`floatToFp16` / `fp16ToFloat`, `floatToBf16` / `bf16ToFloat`) handle normals, subnormals, Inf, and NaN. Storage tier; SIMD specializations land via Phase 14's `simd_neon_fp16.hpp` (NEON FEAT_FP16 vector forms).
 - **`cpp/qbridge.hpp`** also provides `fp16ToAffineI8` / `affineI8ToFp16` / `bf16ToAffineI8` / `affineI8ToBf16` when `TINYMIND_ENABLE_FP16=1`.
 
-The `unit_test/embedded` matrix exercises this as `fp16_freestanding` (`FLOAT=1 FP16=1 QUANT=1 STD=0`) to confirm the half-precision and bridge headers stay freestanding-clean.
+The `unit_test/embedded` matrix exercises the float bridges as `fp16_freestanding` (`FLOAT=1 FP16=1 QUANT=1 STD=0`); the Phase 17 integer bridges ride in the `quant_freestanding` corner (`QUANT=1 FLOAT=0 STD=0`) so both halves stay freestanding-clean.
 
 ### SIMD Performance Backend (optional, `TINYMIND_ENABLE_SIMD_*=1`)
 
@@ -184,7 +184,7 @@ typedef NeuralNet<XorNNProperties> XorNN;
 - **`qlearn/`** — Boost.Test unit tests for Q-learning
 - **`quantization/`** — Boost.Test unit tests for the int8 quantization path: Requantizer round-trip, per-tensor / per-channel calibration, QConv2D / QDepthwise / QPointwise / QPool / QDense forward passes against a float reference. Phase 11 additions cover `foldBatchNorm` (fused-conv parity vs unfused conv→BN), `QBatchNorm2D` parity, `QLayerNorm1D` parity and constant-row edge case, and `QSoftmax1D` parity plus dominant-class saturation. Phase 12 additions cover `QLSTMCell` single-step parity vs a float LSTM reference, `QLSTMCell` int16-cell-state drift over a 256-step sequence, and `QGRUCell` single-step parity vs a float GRU reference. Phase 13 additions cover Q1.15 twiddle round-trip, `QFFT1D` magnitude-spectrum parity vs a naive float DFT, `QFFT1D` forward/inverse round-trip, `QAttention1D` parity vs a float linear-attention reference, `QAttentionSoftmax1D` parity vs a float softmax-attention reference, and a `QMultiHeadLinearAttention1D` stacking test. Phase 14 additions cover SIMD bit-exactness across pathological lengths, INT8 extreme-value patterns, full-layer `QDense` and `QConv2D` parity, and the `activeBackendName()` dispatch report. Phase 15 additions cover `PercentileObserver` outlier clipping + empty-buffer edge case, `KLDivergenceObserver` clip-threshold convergence vs a Gaussian + outliers dataset + empty edge case, `crossLayerEqualizeDense` output preservation under ReLU + zero-row skip, and `crossLayerEqualizeConv2D` output preservation. Builds with `TINYMIND_ENABLE_QUANTIZATION=1`; pass `-DTINYMIND_ENABLE_SIMD_*=1` plus the matching `-march=` flag to exercise a SIMD backend.
 - **`embedded/`** — Cross-corner regression matrix. Builds the smoke source under eight `(FLOAT, STD, QUANT, FP16, INT16_ACCUM, SIMD_*)` configurations: `freestanding`, `no_stdlib`, `no_fpu`, `hosted`, `quant_freestanding`, `fp16_freestanding`, `int16_accum_freestanding`, and `simd_disabled` (Phase 14 scalar-fallback corner — every `TINYMIND_ENABLE_SIMD_*=0` at the deployable freestanding shape). A separate `simd_prereq_regressions` make target locks the static_assert prerequisite chain (`AVX_VNNI=1, AVX2=0` and `AVX512_VNNI=1, AVX512F=0` must fail to compile).
-- **`integration/`** — Phase 16 golden-byte suite. One Boost.Test fixture per exemplar (`resnet18_block_int8`, `mobilenetv2_int8`, `mixed_precision_kws`, `transformer_encoder_int8`) shells out to the example binary with `--golden` and asserts the emitted int8 byte stream matches a baked-in expected string. Catches silent regressions in the inference path regardless of which SIMD backend dispatch resolves to.
+- **`integration/`** — Phase 16 golden-byte suite (extended in Phase 17). One Boost.Test fixture per exemplar (`resnet18_block_int8`, `mobilenetv2_int8`, `mixed_precision_kws`, `transformer_encoder_int8`, `mixed_precision_mlp_int8_qformat`) shells out to the example binary with `--golden` and asserts the emitted int8 byte stream matches a baked-in expected string. Catches silent regressions in the inference path regardless of which SIMD backend dispatch resolves to.
 
 ### Examples (`examples/`)
 
@@ -201,6 +201,7 @@ typedef NeuralNet<XorNNProperties> XorNN;
 - **`resnet18_block_int8/`** — Phase 16 exemplar. int8 ResNet-18-shaped stem + one basic-block stage on a 16x16x3 input, 4 logits out. Same `make run` / `make bench` / `make golden` mode triple as the other Phase 16 exemplars.
 - **`mobilenetv2_int8/`** — Phase 16 exemplar. int8 MobileNetV2-shaped pipeline: stride-2 stem + one stride-1 inverted-residual block with skip + one stride-2 inverted-residual block, then GAP + dense. Linear bottlenecks per MNv2 convention.
 - **`mixed_precision_kws/`** — Phase 16 mixed-precision exemplar. int8 `QDense` frontend → Phase 9 `affineI8ToFp16` bridge → fp16 linear-attention head with residual skip + mean-pool → Phase 9 `fp16ToAffineI8` bridge → int8 `QDense` classifier. Requires `TINYMIND_ENABLE_FP16=1`.
+- **`mixed_precision_mlp_int8_qformat/`** — Phase 17 hybrid mixed-precision exemplar. int8 `QDense` frontend → `qrelu` → Phase 17 `affineToQValueIntBuffer` (pure-integer bridge) → Q8.8 dense matvec (int32 accumulator) → Phase 17 `qValueToAffineIntBuffer` (pure-integer bridge) → int8 `QDense` classifier. Deployable shape is `QUANT=1 FLOAT=0 STD=0`; the exemplar builds hosted for the parity report (~0.005 max-abs error vs the float reference) and wires into the integration suite via the same `--golden` mode.
 
 ### Apps (`apps/`)
 
 
@@ -20,6 +20,7 @@ check :
 	cd examples/resnet18_block_int8 && make clean && make && make release && make run && cd -
 	cd examples/mobilenetv2_int8 && make clean && make && make release && make run && cd -
 	cd examples/mixed_precision_kws && make clean && make && make release && make run && cd -
+	cd examples/mixed_precision_mlp_int8_qformat && make clean && make && make release && make run && cd -
 	cd unit_test/integration && make clean && make && make run && cd -
 	cd examples/pytorch_quant/xor && make clean && make && make release && make run && cd -
 	cd examples/import_demo && make clean && make && make release && make run && cd -
 
@@ -433,6 +433,28 @@ CPU complex and SIMD capability are orthogonal — the rows below describe *typi
 ## Non-Goals (Still)
 
 - **No QAT** in this roadmap. Post-training quantization remains the deployment path.
-- **No sub-4-bit / mixed precision below int8.** Storage tier list is {int8, int16-accum, fp16, bf16, fp32, Q-format}.
+- **No sub-4-bit / mixed precision below int8.** The storage tiers are {int8 affine, int16-accum, fp16, bf16, fp32, Q-format `QValue<I,F>`}, and they interoperate as mixed-precision peers through the Phase 9 bridges and the Phase 17 pure-integer bridges. Phase 17 makes Q-format a first-class peer: the integer-only `affineToQValueInt` / `qValueToAffineInt` let the Q-format tier participate in hybrid models at the deployable freestanding shape `FLOAT=0 STD=0 QUANT=1`.
 - **No dynamic / runtime model loading.** Compile-time template shapes remain — codegen-from-PyTorch flow is the integration model (Phase 15).
 
+## Phase 17 — Pure-Integer Q-format <-> int8 Bridge + Hybrid Importer [SHIPPED]
+
+**Goal:** close the gap for the offline-training -> embedded-inference story where a model wants the int8 affine grid at the boundaries (PyTorch / TF / ONNX QDQ export shape) but a Q-format middle tier (existing `NeuralNet<Q8.8>` MCU code, or a hidden layer that prefers `QValue`'s compile-time fixed/fractional bit split). Phase 9 added the float-mediated `qValueToAffine` / `affineToQValue` bridges; Phase 17 ships the pure-integer counterparts so the inference path needs no `<cmath>` and no float at runtime.
+
+**Scope (shipped):**
+- `cpp/qbridge.hpp` additions: `AffineToQValueIntParams<QV>` / `QValueToAffineIntParams<QV>` (integer triples) plus `affineToQValueInt` / `qValueToAffineInt` (+ buffer variants). Uses the same Q0.31 `multiplyByQuantizedMultiplier` primitive that `Requantizer` does — no new runtime dependency. Gated on `TINYMIND_ENABLE_QUANTIZATION`, independent of `TINYMIND_ENABLE_FLOAT`. Host-side helper builders `buildAffineToQValueIntParams<QV>` / `buildQValueToAffineIntParams<QV>` gated on `FLOAT && STD`.
+- `apps/import_pytorch/tinymind_import.py` additions: `QFormatDense` layer descriptor (Q-format dense weights/biases emitted as raw QValue integers, no scale or zero_point at runtime), `HybridBoundary` precision-tier transition descriptor, and `quantize_multiplier` / `quantize_qformat_weights` helpers. The emitter writes precomputed `(multiplier, shift, zero_point)` triples (plus `qmin`/`qmax` on the `qvalue_to_affine` side) directly into `weights.hpp` so the deployable target consumes them as data.
+- `apps/import_onnx/README.md` + `apps/import_pytorch/README.md` document the TensorFlow / Keras path via `tf2onnx` + `onnxruntime.quantization.quantize_static(quant_format=QuantFormat.QDQ)` plus the hybrid `QFormatDense` + `HybridBoundary` flow.
+
+**Tests (shipped):**
+- `qbridge_int_affine_to_qvalue_matches_float_bridge` / `qbridge_int_qvalue_to_affine_matches_float_bridge` — pure-integer bridge stays within 1 LSB of the float bridge across the int8 / Q88 grid.
+- `qbridge_int_round_trip_within_tolerance` — float -> Q88 -> int8 (integer bridge) -> Q88 (integer bridge) -> float closes back within one affine LSB plus one QValue LSB.
+- `qbridge_int_qvalue_to_affine_saturates` — out-of-range Q88 inputs saturate to `[qmin, qmax]`.
+- `qbridge_int_buffer_round_trip` — buffer-variant parity.
+- `unit_test/embedded/embedded_smoke_test.cpp` exercises `affineToQValueInt` / `qValueToAffineInt` in the `quant_freestanding` corner, confirming the integer bridge stays freestanding-clean (no `<cmath>`, no `<type_traits>`, no stdlib).
+
+**Example:** `examples/mixed_precision_mlp_int8_qformat/` — int8 `QDense` -> `qrelu` -> Phase 17 `affineToQValueInt` bridge -> Q8.8 dense matvec -> Phase 17 `qValueToAffineInt` bridge -> int8 `QDense` classifier. `make run` reports max-abs error vs the float reference (~0.005 on the bundled dataset, well below the 60 %-of-output-range tolerance). `make golden` emits a deterministic int8 byte stream that the new `mixed_precision_mlp_int8_qformat_golden_match` integration fixture in `unit_test/integration/` locks at byte granularity.
+
+**Success criteria:** an offline-trained model with one or more Q-format hidden layers and int8 affine boundaries deployable end-to-end at `FLOAT=0 STD=0 QUANT=1`. ✓ shipped.
+
+**Risk:** low. Pure addition — no edits to existing runtime headers' behavior at any pre-Phase-17 gate combination.
+
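The integer triples described in the Phase 17 section can be illustrated with a small host-side sketch. It assumes the usual fixed-point identity `raw = (q - zero_point) * scale * 2^fractional_bits` and a gemmlowp-style split of the real multiplier into a Q0.31 mantissa plus a shift. All function names here are hypothetical, and the rounding is a single round-half-away-from-zero step, so this is not claimed to be bit-identical to `multiplyByQuantizedMultiplier` in `cpp/qbridge.hpp`.

```python
import math


def split_multiplier(real_multiplier):
    """Split a positive real multiplier into (Q0.31 mantissa, shift).

    The pair satisfies real_multiplier ~= mantissa * 2**(shift - 31).
    Hypothetical helper; the importer's own signature may differ.
    """
    if real_multiplier <= 0.0:
        raise ValueError("multiplier must be positive")
    frac, shift = math.frexp(real_multiplier)  # frac is in [0.5, 1)
    mantissa = round(frac * (1 << 31))
    if mantissa == (1 << 31):  # rounding pushed frac up to 1.0
        mantissa >>= 1
        shift += 1
    return mantissa, shift


def mul_by_quantized_multiplier(x, mantissa, shift):
    """Integer-only x * real_multiplier, rounding half away from zero."""
    total_shift = 31 - shift
    prod = x * mantissa
    if total_shift <= 0:
        return prod << -total_shift
    rounding = 1 << (total_shift - 1)
    if prod >= 0:
        return (prod + rounding) >> total_shift
    return -((-prod + rounding) >> total_shift)


def affine_to_qvalue(q, zero_point, mantissa, shift):
    """int8 affine value -> raw Q-format integer (assumed formula)."""
    return mul_by_quantized_multiplier(q - zero_point, mantissa, shift)


def qvalue_to_affine(raw, zero_point, mantissa, shift, qmin=-128, qmax=127):
    """Raw Q-format integer -> int8 affine value, saturated to [qmin, qmax]."""
    q = zero_point + mul_by_quantized_multiplier(raw, mantissa, shift)
    return max(qmin, min(qmax, q))


# Illustrative calibration values (not taken from the repository):
scale, zero_point, fractional_bits = 0.05, 3, 8
fwd = split_multiplier(scale * (1 << fractional_bits))
inv = split_multiplier(1.0 / (scale * (1 << fractional_bits)))
raw = affine_to_qvalue(23, zero_point, *fwd)   # (23 - 3) * 0.05 = 1.0 in Q8.8
back = qvalue_to_affine(raw, zero_point, *inv)
```

Only `split_multiplier` touches floating point, which mirrors the split the section describes: the triples are computed once on the host and the runtime path works on integers alone.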
@@ -47,3 +47,63 @@ layers = import_onnx_model(
 
 `onnx` Python package is imported lazily inside `parse_onnx_model`, so
 the rest of the module (the emitter) is usable without it.
+
+## TensorFlow / Keras via ONNX
+
+TensorFlow and Keras models reach this importer through `tf2onnx` plus
+the ONNX runtime's static-quantization API. Recipe:
+
+```bash
+pip install tf2onnx onnx onnxruntime
+
+# 1. Export your TF / Keras model to ONNX.
+python -m tf2onnx.convert \
+    --saved-model path/to/saved_model \
+    --output model.onnx \
+    --opset 13
+```
+
+```python
+# 2. Post-training quantize to QDQ format.
+from onnxruntime.quantization import (
+    quantize_static, QuantFormat, QuantType, CalibrationDataReader,
+)
+
+class MyCalibReader(CalibrationDataReader):
+    def __init__(self, dataset):
+        self._it = iter([{"input": x} for x in dataset])
+    def get_next(self):
+        return next(self._it, None)
+
+quantize_static(
+    "model.onnx", "model_int8.onnx",
+    calibration_data_reader=MyCalibReader(calib_inputs),
+    quant_format=QuantFormat.QDQ,
+    weight_type=QuantType.QInt8,
+    activation_type=QuantType.QInt8,
+    per_channel=False,
+)
+```
+
+```python
+# 3. Emit weights.hpp.
+from tinymind_import_onnx import import_onnx_model
+import_onnx_model(
+    model_path="model_int8.onnx",
+    output_path="weights.hpp",
+    namespace="my_model",
+)
+```
+
+The same recipe works for any framework that can export to ONNX: JAX
+(via `jax2onnx`), MXNet, PaddlePaddle, etc.
+
+## Hybrid int8 + Q-format models
+
+The ONNX importer covers only the int8 layers. If the deployable target
+inserts a Q-format hidden tier between two int8 layers, first parse the
+ONNX model with this importer to recover the int8 layers, then pass the
+result to the PyTorch importer's emitter together with the extra
+`boundaries` list. See `apps/import_pytorch/README.md` for the
+`QFormatDense` / `HybridBoundary` descriptors and
+`examples/mixed_precision_mlp_int8_qformat/` for the runnable C++ counterpart.
@@ -82,3 +82,77 @@ MinMax.
 
 `examples/import_demo/` exercises this importer on a small MLP and
 verifies the C++ int8 forward against the float reference.
+
+## Hybrid int8 + Q-format models
+
+The importer also handles models that mix an int8 affine tier with the
+TinyMind Q-format (`QValue`) pipeline. This is useful when an existing
+`NeuralNet<Q8.8>` network, hand-tuned for the MCU, sits between two int8
+layers exported from PyTorch, or when a specific hidden layer wants
+Q-format's compile-time fixed/fractional bit split.
+
+Two extra descriptor kinds:
+
+  * `QFormatDense` -- Q-format dense layer carrying float weights /
+    bias plus the QValue tag (`fixed_bits`, `fractional_bits`, `signed`).
+    The emitter writes raw QValue integers, no scale or zero_point.
+  * `HybridBoundary` -- precision-tier transition between two adjacent
+    layers (`kind = "affine_to_qvalue"` or `"qvalue_to_affine"`, plus a
+    `qformat` reference to the layer carrying the fractional-bit count).
+
+Pass a `boundaries` list to `import_pytorch_model`; the emitter writes
+one precomputed integer triple per boundary, the same
+`(multiplier, shift, zero_point)` that `affineToQValueInt` and
+`qValueToAffineInt` in `cpp/qbridge.hpp` consume with integer-only arithmetic.
+The deployable target shape `TINYMIND_ENABLE_QUANTIZATION=1, FLOAT=0, STD=0`
+reads them as data, with no host-side helper call at startup.
+
+```python
+mid = QFormatDense(name="qfmt_mid",
+                   weight=w_mid, bias=b_mid,
+                   input_name="hidden",
+                   forward=lambda x: x @ w_mid.T + b_mid,
+                   fractional_bits=8, fixed_bits=8, signed=True,
+                   observer=MinMaxObserver())
+layers = [
+    Dense(name="fc1", weight=w1, bias=b1, input_name="input",
+          forward=lambda x: x @ w1.T + b1,
+          observer=MinMaxObserver()),
+    ReLU(name="hidden", input_name="fc1"),
+    mid,
+    Dense(name="fc2", weight=w2, bias=b2, input_name="qfmt_mid",
+          forward=lambda x: x @ w2.T + b2,
+          observer=MinMaxObserver()),
+]
+boundaries = [
+    HybridBoundary(from_name="hidden", to_name="qfmt_mid",
+                   kind="affine_to_qvalue", qformat=mid),
+    HybridBoundary(from_name="qfmt_mid", to_name="fc2",
+                   kind="qvalue_to_affine", qformat=mid,
+                   qmin=-128, qmax=127),
+]
+import_pytorch_model(layers, ..., boundaries=boundaries)
+```
+
+`examples/mixed_precision_mlp_int8_qformat/` is the runnable C++
+counterpart -- it builds the same pipeline shape with hand-crafted
+weights and reports max-abs error vs the float reference.
+
+## Importing from TensorFlow / Keras
+
+The PyTorch importer also covers Keras / TensorFlow models via the
+ONNX QDQ recipe described in `apps/import_onnx/README.md`. The short
+form:
+
+1.  Train and export the TF / Keras model.
+2.  Convert to ONNX: `python -m tf2onnx.convert --saved-model ... --output model.onnx`.
+3.  Post-training quantize: `onnxruntime.quantization.quantize_static(
+    "model.onnx", "model_int8.onnx", calibration_data_reader=...,
+    quant_format=QuantFormat.QDQ, weight_type=QuantType.QInt8,
+    activation_type=QuantType.QInt8)`.
+4.  Parse with `apps/import_onnx/tinymind_import_onnx.py` and emit `weights.hpp`.
+
+The hybrid int8 + Q-format flow above plugs into either entry point --
+the ONNX path emits the int8 layers' descriptors, then the caller
+inserts `QFormatDense` + `HybridBoundary` entries in the layer list
+before calling the emitter.
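
The `QFormatDense` description above says the emitter writes raw QValue integers with no scale or zero_point. A minimal sketch of that conversion follows, assuming the raw value is `round(w * 2^fractional_bits)` saturated to a word of `fixed_bits + fractional_bits` bits, with the sign bit counted inside `fixed_bits`. The function name `to_qformat_raw` is hypothetical and is not the importer's `quantize_qformat_weights`.

```python
import math


def to_qformat_raw(values, fixed_bits=8, fractional_bits=8, signed=True):
    """Convert floats to raw Q-format integers (assumed layout, see lead-in)."""
    total_bits = fixed_bits + fractional_bits
    if signed:
        lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    else:
        lo, hi = 0, (1 << total_bits) - 1
    one = 1 << fractional_bits  # raw integer that represents 1.0
    raw_values = []
    for v in values:
        # Round half away from zero so positive and negative weights
        # are treated symmetrically.
        magnitude = math.floor(abs(v) * one + 0.5)
        raw = int(math.copysign(magnitude, v))
        # Saturate rather than wrap when the float is out of range.
        raw_values.append(max(lo, min(hi, raw)))
    return raw_values


weights_raw = to_qformat_raw([1.0, -0.5, 0.00390625, 200.0, -200.0])
```

With Q8.8 the smallest representable step is 1/256, so any weight finer than that collapses to a neighbouring raw value, and weights outside roughly ±128 saturate.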