danmcleran
diff --git a/.gitignore b/.gitignore
Lines changed: 4 additions & 0 deletions
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 17 additions & 0 deletions
diff --git a/Makefile b/Makefile
Lines changed: 4 additions & 0 deletions
diff --git a/QUANTIZATION.md b/QUANTIZATION.md
Lines changed: 10 additions & 13 deletions
diff --git a/README.md b/README.md
Lines changed: 4 additions & 0 deletions
diff --git a/examples/mixed_precision_kws/Makefile b/examples/mixed_precision_kws/Makefile
Lines changed: 31 additions & 0 deletions
@@ -24,3 +24,7 @@ unit_test/kan/output/
 unit_test/lookuptable/output/
 unit_test/embedded/output/
 unit_test/quantization/output/
+unit_test/integration/output/
+examples/resnet18_block_int8/output/
+examples/mobilenetv2_int8/output/
+examples/mixed_precision_kws/output/
@@ -148,6 +148,19 @@ Phase 15 reduces the accuracy gap vs a PyTorch / TFLite reference and lowers fri
 
 The Phase 15 deployable shape is unchanged: `TINYMIND_ENABLE_QUANTIZATION=1, FLOAT=0, STD=0`. All new helpers (Observers + CLE + importer scripts) live behind `FLOAT && STD` (the `qcalibration.hpp` gate) or in Python tooling. `examples/import_demo/` ships an end-to-end Phase 15 exemplar (C++ side: 3-8-4-2 MLP, three observers + CLE, calibration + int8 forward parity vs float, ~0.004 max-abs error on the bundled seed; Python side: full PyTorch-to-weights.hpp flow via `apps/import_pytorch`).
 
+### Mixed-Precision Exemplars + Verification (Phase 16)
+
+Phase 16 ships four reference int8 / mixed-precision exemplars and a `unit_test/integration` Boost.Test suite that locks their byte output across SIMD gate combos. Pure addition — no runtime header changes.
+
+- **`examples/resnet18_block_int8/`** — int8 ResNet-18-shaped stem plus one basic-block stage (`QPad2D` → `QConv2DPerChannel 7x7 s=2` → `qrelu` → `QMaxPool2D` → basic block: `QPad2D` → 3x3 conv → `qrelu` → `QPad2D` → 3x3 conv → `QAdd` skip → `qrelu` → `QGlobalAvgPool2D` → `QDense`). Demonstrates that `QMaxPool2D`, `qreluBuffer`, and `QGlobalAvgPool2D` are pass-throughs on the int8 affine grid, so consecutive layers reuse the upstream `(scale, zero_point)` rather than burning new requantizers.
+- **`examples/mobilenetv2_int8/`** — int8 MobileNetV2-shaped pipeline. Two inverted-residual blocks (one stride-1 with a residual skip, one stride-2 without), wired around a stride-2 stem and a GAP + dense head. The projection convolutions are linear (no `qrelu`), matching MNv2's "linear bottleneck" design rule. The inverted-residual unit is the load-bearing primitive of MNv2 / V3 / EfficientNet — the build pattern in this file scales linearly to a full model.
+- **`examples/transformer_encoder_int8/`** — already present from Phase 13. Phase 16 wires it into the integration suite with the same `--golden` mode as the new exemplars.
+- **`examples/mixed_precision_kws/`** — mixed-precision exemplar that exercises the Phase 9 qbridge converters in production shape: int8 `QDense` frontend → `affineI8ToFp16` bridge → fp16 linear-attention head with residual skip + mean-pool → `fp16ToAffineI8` bridge → int8 `QDense` classifier. `TINYMIND_ENABLE_FP16=1` is required at the Makefile level. Inner attention arithmetic runs in float promoted from `fp16_t`; on targets that ship vector fp16 arithmetic (NEON FEAT_FP16, AVX-512 fp16) the promote pair is near-free, while on every other target it is the cost of admission for fp16 storage on an MCU.
+
+Each exemplar Makefile exposes the same three modes: `make run` (parity report vs float reference), `make bench` (CSV cycle/byte report), `make golden` (int8 byte stream for the bundled test set in a stable text format).
+
+`unit_test/integration/` — new Boost.Test suite. One fixture per exemplar shells out to the example binary with `--golden` via `popen()` and compares the emitted byte stream to a baked-in expected string. The exemplar binaries are deterministic (hand-crafted weights, fixed synthetic dataset, pure-integer forward), so the output is invariant across SIMD gate combos by Phase 14's bit-exactness guarantee. Any silent drift in the example pipeline, the `qaffine.hpp` requantizer, the `qcalibration.hpp` helpers, or any SIMD specialization that claims bit-exactness trips the test. The root `Makefile`'s `check` target orders the integration test after the example builds so the binaries always exist when the test runs.
+
 ### Design Pattern
 
 Neural networks are configured through a properties struct that bundles all template policies:
@@ -171,6 +184,7 @@ typedef NeuralNet<XorNNProperties> XorNN;
 - **`qlearn/`** — Boost.Test unit tests for Q-learning
 - **`quantization/`** — Boost.Test unit tests for the int8 quantization path: Requantizer round-trip, per-tensor / per-channel calibration, QConv2D / QDepthwise / QPointwise / QPool / QDense forward passes against a float reference. Phase 11 additions cover `foldBatchNorm` (fused-conv parity vs unfused conv→BN), `QBatchNorm2D` parity, `QLayerNorm1D` parity and constant-row edge case, and `QSoftmax1D` parity plus dominant-class saturation. Phase 12 additions cover `QLSTMCell` single-step parity vs a float LSTM reference, `QLSTMCell` int16-cell-state drift over a 256-step sequence, and `QGRUCell` single-step parity vs a float GRU reference. Phase 13 additions cover Q1.15 twiddle round-trip, `QFFT1D` magnitude-spectrum parity vs a naive float DFT, `QFFT1D` forward/inverse round-trip, `QAttention1D` parity vs a float linear-attention reference, `QAttentionSoftmax1D` parity vs a float softmax-attention reference, and a `QMultiHeadLinearAttention1D` stacking test. Phase 14 additions cover SIMD bit-exactness across pathological lengths, INT8 extreme-value patterns, full-layer `QDense` and `QConv2D` parity, and the `activeBackendName()` dispatch report. Phase 15 additions cover `PercentileObserver` outlier clipping + empty-buffer edge case, `KLDivergenceObserver` clip-threshold convergence vs a Gaussian + outliers dataset + empty edge case, `crossLayerEqualizeDense` output preservation under ReLU + zero-row skip, and `crossLayerEqualizeConv2D` output preservation. Builds with `TINYMIND_ENABLE_QUANTIZATION=1`; pass `-DTINYMIND_ENABLE_SIMD_*=1` plus the matching `-march=` flag to exercise a SIMD backend.
 - **`embedded/`** — Cross-corner regression matrix. Builds the smoke source under eight `(FLOAT, STD, QUANT, FP16, INT16_ACCUM, SIMD_*)` configurations: `freestanding`, `no_stdlib`, `no_fpu`, `hosted`, `quant_freestanding`, `fp16_freestanding`, `int16_accum_freestanding`, and `simd_disabled` (Phase 14 scalar-fallback corner — every `TINYMIND_ENABLE_SIMD_*=0` at the deployable freestanding shape). A separate `simd_prereq_regressions` make target locks the static_assert prerequisite chain (`AVX_VNNI=1, AVX2=0` and `AVX512_VNNI=1, AVX512F=0` must fail to compile).
+- **`integration/`** — Phase 16 golden-byte suite. One Boost.Test fixture per exemplar (`resnet18_block_int8`, `mobilenetv2_int8`, `mixed_precision_kws`, `transformer_encoder_int8`) shells out to the example binary with `--golden` and asserts the emitted int8 byte stream matches a baked-in expected string. Catches silent regressions in the inference path regardless of which SIMD backend dispatch resolves to.
 
 ### Examples (`examples/`)
 
@@ -184,6 +198,9 @@ typedef NeuralNet<XorNNProperties> XorNN;
 - **`transformer_encoder_int8/`** — Phase 13 demonstration: int8 transformer encoder block (`QLayerNorm1D` → `QAttention1D` linear attention → `QAdd` skip → `QLayerNorm1D` → `QDense` + `qrelu` → `QDense` → `QAdd` skip). Calibrates per-tensor activations and per-tensor symmetric weight scales on the host, then runs the block end-to-end on int8 and reports max-abs error vs the float reference (~2% of output range on the bundled dataset).
 - **`perf_matrix/`** — Phase 14 SIMD gate bench. Builds the same `QConv2D` 3x3 + `QDense` int8 block under each enabled `TINYMIND_ENABLE_SIMD_*` combination on the host (default Makefile builds scalar / AVX2 / AVX-512F / AVX-512-VNNI). Emits one CSV row per backend with per-call timing and an `output_checksum` that is invariant across backends — the row is the bit-exactness regression and the cycle delta is the perf headline.
 - **`import_demo/`** — Phase 15 importer end-to-end. C++ binary carries a deterministic 3-8-4-2 MLP, drives a 64-sample synthetic calibration set through `RangeObserver` / `PercentileObserver` / `KLDivergenceObserver` plus `crossLayerEqualizeDense`, then runs both the float reference and the pure-integer int8 forward and reports max-abs error (~0.004 on the bundled seed; tolerance 0.08). Standalone — no torch dependency. `demo.py` is the production-flow counterpart that consumes `torch.state_dict` and drives `apps/import_pytorch/tinymind_import` to emit a real `weights.hpp`.
+- **`resnet18_block_int8/`** — Phase 16 exemplar. int8 ResNet-18-shaped stem + one basic-block stage on a 16x16x3 input, 4 logits out. Same `make run` / `make bench` / `make golden` mode triple as the other Phase 16 exemplars.
+- **`mobilenetv2_int8/`** — Phase 16 exemplar. int8 MobileNetV2-shaped pipeline: stride-2 stem + one stride-1 inverted-residual block with skip + one stride-2 inverted-residual block, then GAP + dense. Linear bottlenecks per MNv2 convention.
+- **`mixed_precision_kws/`** — Phase 16 mixed-precision exemplar. int8 `QDense` frontend → Phase 9 `affineI8ToFp16` bridge → fp16 linear-attention head with residual skip + mean-pool → Phase 9 `fp16ToAffineI8` bridge → int8 `QDense` classifier. Requires `TINYMIND_ENABLE_FP16=1`.
 
 ### Apps (`apps/`)
 
 
@@ -17,6 +17,10 @@ check :
 	cd examples/kws_cortex_m_int8 && make clean && make && make release && cd -
 	cd examples/resnet_block_int8 && make clean && make && make release && make run && cd -
 	cd examples/transformer_encoder_int8 && make clean && make && make release && make run && cd -
+	cd examples/resnet18_block_int8 && make clean && make && make release && make run && cd -
+	cd examples/mobilenetv2_int8 && make clean && make && make release && make run && cd -
+	cd examples/mixed_precision_kws && make clean && make && make release && make run && cd -
+	cd unit_test/integration && make clean && make && make run && cd -
 	cd examples/pytorch_quant/xor && make clean && make && make release && make run && cd -
 	cd examples/import_demo && make clean && make && make release && make run && cd -
 	cd examples/perf_matrix && make clean && make && make report && cd -
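The `unit_test/integration` step added to `check` is described in this PR as shelling out to each exemplar with `--golden` via `popen()` and comparing the emitted byte stream to a baked-in string. A minimal sketch of that pattern follows, assuming a POSIX host. `captureStdout` and `matchesGolden` are hypothetical names used for illustration, not the suite's actual helpers, and the real fixtures use Boost.Test assertions rather than a plain `bool`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>

// Run `command` through the shell and return everything it writes to stdout.
// popen()/pclose() are POSIX, so this sketch assumes a POSIX host.
std::string captureStdout(const std::string& command)
{
    assert(!command.empty());
    FILE* pipe = popen(command.c_str(), "r");
    if (pipe == nullptr) {
        throw std::runtime_error("popen failed: " + command);
    }

    std::string output;
    char buffer[256];
    std::size_t bytesRead = 0;
    while ((bytesRead = std::fread(buffer, 1, sizeof(buffer), pipe)) > 0) {
        output.append(buffer, bytesRead);
    }

    // A non-zero status means the child failed or could not be waited on.
    if (pclose(pipe) != 0) {
        throw std::runtime_error("command failed: " + command);
    }
    return output;
}

// Hypothetical helper: true when the binary's --golden output equals the
// expected string byte-for-byte.
bool matchesGolden(const std::string& binaryPath, const std::string& expected)
{
    return captureStdout(binaryPath + " --golden") == expected;
}
```

A fixture built on this would call `matchesGolden(pathToBinary, expectedBytes)` once per exemplar. Because the comparison is on the raw text, any change in a single emitted byte fails the check, which is why the `check` target has to build the example binaries before this suite runs.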
 
@@ -363,27 +363,24 @@ Medium. Bit-exactness across SIMD widths needs careful order-of-accumulation han
 
 **Risk:** low-medium. Mostly Python tooling, isolated from runtime.
 
-## Phase 16 — Mixed-Precision Exemplars + Verification
+## Phase 16 — Mixed-Precision Exemplars + Verification [SHIPPED]
 
 **Goal:** prove end-to-end models really run. Lock with regression tests.
 
-**Scope:** four reference models, each a directory under `examples/`:
+**Scope (shipped):** four reference exemplars, one directory per model:
 
-1. `examples/resnet18_block_int8/` — int8 ResNet stem + 1 stage. Exercises Phase 10 ops.
-2. `examples/mobilenetv2_int8/` — full MobileNetV2 (depthwise-separable + inverted-residual). Exercises Phase 10 ops at scale.
-3. `examples/transformer_encoder_int8/` — single encoder block with int8 attention + int8 LayerNorm + int8 softmax. Exercises Phase 11+13.
-4. `examples/mixed_precision_kws/` — **mixed-precision exemplar:** int8 CNN feature extractor → fp16 attention head → int8 dense classifier. Exercises Phase 9 bridges in production shape.
+1. `examples/resnet18_block_int8/` — int8 ResNet-18 stem + one basic-block stage (`QPad2D` → `QConv2DPerChannel 7x7 s=2` → `qrelu` → `QMaxPool2D` → basic block: `QPad2D` → 3x3 conv → `qrelu` → `QPad2D` → 3x3 conv → `QAdd` skip → `qrelu` → `QGlobalAvgPool2D` → `QDense`). Exercises Phase 10 `QPad2D` / `QConv2DPerChannel` / `QAdd` at deeper spatial dimensions than the original `resnet_block_int8`. Demonstrates that `QMaxPool2D`, `qreluBuffer`, and `QGlobalAvgPool2D` are pass-throughs on the int8 affine grid (max, clamp, integer-mean all preserve `(scale, zero_point)`), so consecutive layers reuse the upstream grid rather than burning new requantizers.
+2. `examples/mobilenetv2_int8/` — int8 MobileNetV2-shaped pipeline. Stride-2 stem + one stride-1 inverted-residual block with skip + one stride-2 inverted-residual block without skip, then GAP + dense. Linear bottlenecks per MNv2 convention (no `qrelu` after the 1x1 projection). Exercises the `expand → DW → project` triple — the load-bearing primitive of MNv2 / V3 / EfficientNet.
+3. `examples/transformer_encoder_int8/` — already present from Phase 13; Phase 16 wires it into the integration suite with a matching `--golden` mode.
+4. `examples/mixed_precision_kws/` — mixed-precision exemplar. int8 `QDense` frontend → Phase 9 `affineI8ToFp16` bridge → fp16 linear-attention head with residual skip + mean-pool → Phase 9 `fp16ToAffineI8` bridge → int8 `QDense` classifier. Requires `TINYMIND_ENABLE_FP16=1`. Inner attention arithmetic runs in float promoted from `fp16_t`; on targets that ship vector fp16 arithmetic the promote pair is near-free, while on every other target it is the cost of admission for fp16 storage on an MCU.
 
-Each ships with:
-- CSV cycle/byte report (extends existing `bench::report.hpp`)
-- Golden float reference in Python, parity test in C++
-- README documenting precision tier per layer
+Each exemplar Makefile exposes three modes: `make run` (parity report vs float reference; PASS within 40-50 % of output dynamic range), `make bench` (CSV cycle/byte report — one row per layer, mirrors `examples/kws_cortex_m_int8/`), `make golden` (int8 byte stream for the bundled deterministic test set in a stable text format). Each ships with a per-precision-tier README.
 
-**Tests:** `unit_test/integration/` — new directory. Boost.Test fixtures load exemplar weights and verify int8 output matches stored golden int8 output, byte-for-byte. Locks regressions across SIMD gate combos.
+**Tests (shipped):** `unit_test/integration/` — new Boost.Test suite. One fixture per exemplar shells out to the example binary with `--golden` via `popen()` and asserts the emitted byte stream matches a baked-in expected string. The exemplar binaries are deterministic (hand-crafted weights, fixed synthetic dataset, pure-integer forward), so the output is invariant across SIMD gate combos by Phase 14's bit-exactness guarantee. Any silent drift in the example pipeline, the `qaffine.hpp` requantizer, the `qcalibration.hpp` helpers, or any SIMD specialization that claims bit-exactness trips the test.
 
-**Success criteria:** repo ships four working mixed-precision models with reproducible benchmark numbers. Future PRs cannot regress them silently.
+**Success criteria:** repo ships four working mixed-precision exemplars with reproducible benchmark numbers, locked at byte granularity by the integration suite. ✓ shipped.
 
-**Risk:** low. Final phase — all components landed in prior phases.
+**Risk:** low. Final phase — every runtime component landed in Phases 9-14, every host-side helper in Phase 15.
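The pass-through claim for `QMaxPool2D` and `qreluBuffer` (max and clamp preserve `(scale, zero_point)`) can be checked with a few lines of arithmetic. The sketch below assumes the usual affine mapping `real = scale * (q - zero_point)` with `scale > 0`. `AffineGrid`, `dequantize`, `reluOnGrid`, and `maxOnGrid` are hypothetical names for illustration and are not TinyMind's API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of an int8 affine grid:
// real = scale * (q - zeroPoint), with scale > 0 and zeroPoint in [-128, 127].
struct AffineGrid {
    float scale;
    std::int32_t zeroPoint;
};

inline float dequantize(std::int8_t q, const AffineGrid& grid)
{
    return grid.scale * static_cast<float>(static_cast<std::int32_t>(q) - grid.zeroPoint);
}

// ReLU on the grid: real 0.0 is the code zeroPoint, so clamping the code from
// below at zeroPoint is ReLU in the real domain. The grid itself is unchanged.
inline std::int8_t reluOnGrid(std::int8_t q, const AffineGrid& grid)
{
    return static_cast<std::int8_t>(std::max<std::int32_t>(q, grid.zeroPoint));
}

// Max over a pooling window: dequantization is monotonic for scale > 0, so the
// largest code is the code of the largest real value. No requantization needed.
inline std::int8_t maxOnGrid(const std::vector<std::int8_t>& window)
{
    assert(!window.empty());
    return *std::max_element(window.begin(), window.end());
}
```

With `scale = 0.5` and `zeroPoint = -3`, the code `-10` (real `-3.5`) clamps to `-3` (real `0.0`), and the window `{-7, 2, -1}` pools to `2` (real `2.5`), both read back on the same grid as the input.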
 
 ## Dependency Graph
 
 
@@ -73,6 +73,10 @@ A parallel TFLite/CMSIS-NN style affine quantization path that runs **alongside*
 - **End-to-end examples**:
   - [`examples/pytorch_quant/xor/`](examples/pytorch_quant/xor/) -- PyTorch float training + per-tensor calibration + `weights.hpp` emission, then a pure-integer C++ forward pass through `QDense` + `qrelu` + `QDense` + int8 sigmoid LUT
   - [`examples/kws_cortex_m_int8/`](examples/kws_cortex_m_int8/) -- side-by-side counterpart to `examples/kws_cortex_m/`; same MobileNet-style KWS pipeline, comparable CSV cycle/byte report, ~4x smaller weight footprint on the convolutional layers
+  - [`examples/resnet18_block_int8/`](examples/resnet18_block_int8/) -- Phase 16 exemplar. int8 ResNet-18-shaped stem + one basic-block stage. `make run`, `make bench`, `make golden`.
+  - [`examples/mobilenetv2_int8/`](examples/mobilenetv2_int8/) -- Phase 16 exemplar. int8 MobileNetV2 inverted-residual block sequence with linear bottlenecks.
+  - [`examples/mixed_precision_kws/`](examples/mixed_precision_kws/) -- Phase 16 mixed-precision exemplar. int8 frontend -> fp16 attention head -> int8 classifier, exercises Phase 9 qbridge converters.
+  - [`unit_test/integration/`](unit_test/integration/) -- Phase 16 golden-byte regression suite. Locks the four exemplars' int8 output byte-for-byte across SIMD gate combos.
 
 ### Activation Functions
 
 
@@ -0,0 +1,31 @@
+# Phase 16 exemplar: mixed-precision KWS pipeline.
+#   int8 QDense frontend -> fp16 linear-attention head -> int8 QDense classifier.
+# Exercises the Phase 9 qbridge converters in production shape.
+
+MKDIR=mkdir -p ./output
+CC=g++
+WARN=-Wall -Wextra -Werror -Wpedantic
+OUT=./output/mixed_precision_kws
+SOURCES=mixed_precision_kws.cpp
+INCLUDES=-I../../cpp -I../../cpp/include -I../../include/
+DEFINES=-DTINYMIND_ENABLE_FLOAT=1 -DTINYMIND_ENABLE_STD=1 -DTINYMIND_ENABLE_QUANTIZATION=1 -DTINYMIND_ENABLE_FP16=1
+
+default :
+	$(MKDIR)
+	$(CC) -g $(WARN) -o $(OUT) $(SOURCES) $(INCLUDES) $(DEFINES)
+
+release :
+	$(MKDIR)
+	$(CC) -g -O3 $(WARN) -o $(OUT) $(SOURCES) $(INCLUDES) $(DEFINES)
+
+run :
+	cd ./output && ./mixed_precision_kws && cd ../
+
+bench :
+	cd ./output && ./mixed_precision_kws --bench > mixed_precision_kws.csv && cd ../
+
+golden :
+	cd ./output && ./mixed_precision_kws --golden > mixed_precision_kws.golden && cd ../
+
+clean :
+	rm -f ./output/*
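The pipeline this Makefile builds crosses the int8/fp16 boundary twice. As a rough sketch of what such a bridge has to compute, assuming standard affine quantization (`real = scale * (q - zero_point)`), the two directions look like the code below. The names `BridgeParams`, `affineI8ToReal`, and `realToAffineI8` are hypothetical, `float` stands in for `fp16_t`, and none of this is the actual signature of `affineI8ToFp16` or `fp16ToAffineI8`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical affine parameters for one side of the bridge.
struct BridgeParams {
    float scale;             // assumed > 0
    std::int32_t zeroPoint;  // assumed within [-128, 127]
};

// int8 code -> real value (float used here as a stand-in for fp16 storage).
inline float affineI8ToReal(std::int8_t q, const BridgeParams& params)
{
    return params.scale * static_cast<float>(static_cast<std::int32_t>(q) - params.zeroPoint);
}

// real value -> int8 code: scale onto the grid, saturate to the int8 range,
// then round to nearest. Clamping before rounding keeps std::lround in range.
inline std::int8_t realToAffineI8(float x, const BridgeParams& params)
{
    assert(params.scale > 0.0f);
    const float onGrid = x / params.scale + static_cast<float>(params.zeroPoint);
    const float clamped = std::min(127.0f, std::max(-128.0f, onGrid));
    return static_cast<std::int8_t>(std::lround(clamped));
}
```

With `scale = 0.25` and `zeroPoint = 10`, the code `14` maps to `1.0` and back to `14`, while out-of-range values such as `1000.0` saturate to `127` instead of wrapping. NaN inputs are not handled in this sketch.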