Skip to content

Commit 050a54e

Browse files
danmcleranclaude
andcommitted
Add Phase 16 mixed-precision exemplars + golden-byte regression suite
Four reference int8 / mixed-precision exemplars plus a Boost.Test integration suite that locks their int8 output byte-for-byte across SIMD gate combos: - examples/resnet18_block_int8/ int8 ResNet stem + basic block - examples/mobilenetv2_int8/ int8 MobileNetV2 inverted-residual blocks - examples/mixed_precision_kws/ int8 -> fp16 attn -> int8 (Phase 9 bridges) - examples/transformer_encoder_int8/ --golden mode added - unit_test/integration/ popen() one fixture per exemplar Each exemplar Makefile exposes make run / make bench / make golden. Root Makefile orders the integration suite after the example builds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 1ce55ee commit 050a54e

18 files changed

Lines changed: 2423 additions & 14 deletions

File tree

.gitignore

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -24,3 +24,7 @@ unit_test/kan/output/
2424
unit_test/lookuptable/output/
2525
unit_test/embedded/output/
2626
unit_test/quantization/output/
27+
unit_test/integration/output/
28+
examples/resnet18_block_int8/output/
29+
examples/mobilenetv2_int8/output/
30+
examples/mixed_precision_kws/output/

CLAUDE.md

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -148,6 +148,19 @@ Phase 15 reduces the accuracy gap vs a PyTorch / TFLite reference and lowers fri
148148

149149
The Phase 15 deployable shape is unchanged: `TINYMIND_ENABLE_QUANTIZATION=1, FLOAT=0, STD=0`. All new helpers (Observers + CLE + importer scripts) live behind `FLOAT && STD` (the `qcalibration.hpp` gate) or in Python tooling. `examples/import_demo/` ships an end-to-end Phase 15 exemplar (C++ side: 3-8-4-2 MLP, three observers + CLE, calibration + int8 forward parity vs float, ~0.004 max-abs error on the bundled seed; Python side: full PyTorch-to-weights.hpp flow via `apps/import_pytorch`).
150150

151+
### Mixed-Precision Exemplars + Verification (Phase 16)
152+
153+
Phase 16 ships four reference int8 / mixed-precision exemplars and a `unit_test/integration` Boost.Test suite that locks their byte output across SIMD gate combos. Pure addition — no runtime header changes.
154+
155+
- **`examples/resnet18_block_int8/`** — int8 ResNet-18-shaped stem plus one basic-block stage (`QPad2D``QConv2DPerChannel 7x7 s=2``qrelu``QMaxPool2D` → basic block: `QPad2D` → 3x3 conv → `qrelu``QPad2D` → 3x3 conv → `QAdd` skip → `qrelu``QGlobalAvgPool2D``QDense`). Demonstrates that `QMaxPool2D`, `qreluBuffer`, and `QGlobalAvgPool2D` are pass-throughs on the int8 affine grid, so consecutive layers reuse the upstream `(scale, zero_point)` rather than burning new requantizers.
156+
- **`examples/mobilenetv2_int8/`** — int8 MobileNetV2-shaped pipeline. Two inverted-residual blocks (one stride-1 with a residual skip, one stride-2 without), wired around a stride-2 stem and a GAP + dense head. The projection convolutions are linear (no `qrelu`), matching MNv2's "linear bottleneck" design rule. The inverted-residual unit is the load-bearing primitive of MNv2 / V3 / EfficientNet — the build pattern in this file scales linearly to a full model.
157+
- **`examples/transformer_encoder_int8/`** — already present from Phase 13. Phase 16 wires it into the integration suite with the same `--golden` mode as the new exemplars.
158+
- **`examples/mixed_precision_kws/`** — mixed-precision exemplar that exercises the Phase 9 qbridge converters in production shape: int8 `QDense` frontend → `affineI8ToFp16` bridge → fp16 linear-attention head with residual skip + mean-pool → `fp16ToAffineI8` bridge → int8 `QDense` classifier. `TINYMIND_ENABLE_FP16=1` required at the Makefile level. Inner attention arithmetic runs in float promoted from `fp16_t`; on targets that ship vector fp16 arithmetic (NEON FEAT_FP16, AVX-512 fp16) the promote pair is near-free, on every other target it is the cost of admission for fp16 storage on an MCU.
159+
160+
Each exemplar Makefile exposes the same three modes: `make run` (parity report vs float reference), `make bench` (CSV cycle/byte report), `make golden` (int8 byte stream for the bundled test set in a stable text format).
161+
162+
`unit_test/integration/` — new Boost.Test suite. One fixture per exemplar shells out to the example binary with `--golden` via `popen()` and compares the emitted byte stream to a baked-in expected string. The exemplar binaries are deterministic (hand-crafted weights, fixed synthetic dataset, pure-integer forward), so the output is invariant across SIMD gate combos by Phase 14's bit-exactness guarantee. Any silent drift in the example pipeline, the `qaffine.hpp` requantizer, the `qcalibration.hpp` helpers, or any SIMD specialization that claims bit-exactness trips the test. The root `Makefile`'s `check` target orders the integration test after the example builds so the binaries always exist when the test runs.
163+
151164
### Design Pattern
152165

153166
Neural networks are configured through a properties struct that bundles all template policies:
@@ -171,6 +184,7 @@ typedef NeuralNet<XorNNProperties> XorNN;
171184
- **`qlearn/`** — Boost.Test unit tests for Q-learning
172185
- **`quantization/`** — Boost.Test unit tests for the int8 quantization path: Requantizer round-trip, per-tensor / per-channel calibration, QConv2D / QDepthwise / QPointwise / QPool / QDense forward passes against a float reference. Phase 11 additions cover `foldBatchNorm` (fused-conv parity vs unfused conv→BN), `QBatchNorm2D` parity, `QLayerNorm1D` parity and constant-row edge case, and `QSoftmax1D` parity plus dominant-class saturation. Phase 12 additions cover `QLSTMCell` single-step parity vs a float LSTM reference, `QLSTMCell` int16-cell-state drift over a 256-step sequence, and `QGRUCell` single-step parity vs a float GRU reference. Phase 13 additions cover Q1.15 twiddle round-trip, `QFFT1D` magnitude-spectrum parity vs a naive float DFT, `QFFT1D` forward/inverse round-trip, `QAttention1D` parity vs a float linear-attention reference, `QAttentionSoftmax1D` parity vs a float softmax-attention reference, and a `QMultiHeadLinearAttention1D` stacking test. Phase 14 additions cover SIMD bit-exactness across pathological lengths, INT8 extreme-value patterns, full-layer `QDense` and `QConv2D` parity, and the `activeBackendName()` dispatch report. Phase 15 additions cover `PercentileObserver` outlier clipping + empty-buffer edge case, `KLDivergenceObserver` clip-threshold convergence vs a Gaussian + outliers dataset + empty edge case, `crossLayerEqualizeDense` output preservation under ReLU + zero-row skip, and `crossLayerEqualizeConv2D` output preservation. Builds with `TINYMIND_ENABLE_QUANTIZATION=1`; pass `-DTINYMIND_ENABLE_SIMD_*=1` plus the matching `-march=` flag to exercise a SIMD backend.
173186
- **`embedded/`** — Cross-corner regression matrix. Builds the smoke source under eight `(FLOAT, STD, QUANT, FP16, INT16_ACCUM, SIMD_*)` configurations: `freestanding`, `no_stdlib`, `no_fpu`, `hosted`, `quant_freestanding`, `fp16_freestanding`, `int16_accum_freestanding`, and `simd_disabled` (Phase 14 scalar-fallback corner — every `TINYMIND_ENABLE_SIMD_*=0` at the deployable freestanding shape). A separate `simd_prereq_regressions` make target locks the static_assert prerequisite chain (`AVX_VNNI=1, AVX2=0` and `AVX512_VNNI=1, AVX512F=0` must fail to compile).
187+
- **`integration/`** — Phase 16 golden-byte suite. One Boost.Test fixture per exemplar (`resnet18_block_int8`, `mobilenetv2_int8`, `mixed_precision_kws`, `transformer_encoder_int8`) shells out to the example binary with `--golden` and asserts the emitted int8 byte stream matches a baked-in expected string. Catches silent regressions in the inference path regardless of which SIMD backend dispatch resolves to.
174188
175189
### Examples (`examples/`)
176190
@@ -184,6 +198,9 @@ typedef NeuralNet<XorNNProperties> XorNN;
184198
- **`transformer_encoder_int8/`** — Phase 13 demonstration: int8 transformer encoder block (`QLayerNorm1D` → `QAttention1D` linear attention → `QAdd` skip → `QLayerNorm1D` → `QDense` + `qrelu` → `QDense` → `QAdd` skip). Calibrates per-tensor activations and per-tensor symmetric weight scales on the host, then runs the block end-to-end on int8 and reports max-abs error vs the float reference (~2% of output range on the bundled dataset).
185199
- **`perf_matrix/`** — Phase 14 SIMD gate bench. Builds the same `QConv2D` 3x3 + `QDense` int8 block under each enabled `TINYMIND_ENABLE_SIMD_*` combination on the host (default Makefile builds scalar / AVX2 / AVX-512F / AVX-512-VNNI). Emits one CSV row per backend with per-call timing and an `output_checksum` that is invariant across backends — the row is the bit-exactness regression and the cycle delta is the perf headline.
186200
- **`import_demo/`** — Phase 15 importer end-to-end. C++ binary carries a deterministic 3-8-4-2 MLP, drives a 64-sample synthetic calibration set through `RangeObserver` / `PercentileObserver` / `KLDivergenceObserver` plus `crossLayerEqualizeDense`, then runs both the float reference and the pure-integer int8 forward and reports max-abs error (~0.004 on the bundled seed; tolerance 0.08). Standalone — no torch dependency. `demo.py` is the production-flow counterpart that consumes `torch.state_dict` and drives `apps/import_pytorch/tinymind_import` to emit a real `weights.hpp`.
201+
- **`resnet18_block_int8/`** — Phase 16 exemplar. int8 ResNet-18-shaped stem + one basic-block stage on a 16x16x3 input, 4 logits out. Same `make run` / `make bench` / `make golden` mode triple as the other Phase 16 exemplars.
202+
- **`mobilenetv2_int8/`** — Phase 16 exemplar. int8 MobileNetV2-shaped pipeline: stride-2 stem + one stride-1 inverted-residual block with skip + one stride-2 inverted-residual block, then GAP + dense. Linear bottlenecks per MNv2 convention.
203+
- **`mixed_precision_kws/`** — Phase 16 mixed-precision exemplar. int8 `QDense` frontend → Phase 9 `affineI8ToFp16` bridge → fp16 linear-attention head with residual skip + mean-pool → Phase 9 `fp16ToAffineI8` bridge → int8 `QDense` classifier. Requires `TINYMIND_ENABLE_FP16=1`.
187204
188205
### Apps (`apps/`)
189206

Makefile

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -17,6 +17,10 @@ check :
1717
cd examples/kws_cortex_m_int8 && make clean && make && make release && cd -
1818
cd examples/resnet_block_int8 && make clean && make && make release && make run && cd -
1919
cd examples/transformer_encoder_int8 && make clean && make && make release && make run && cd -
20+
cd examples/resnet18_block_int8 && make clean && make && make release && make run && cd -
21+
cd examples/mobilenetv2_int8 && make clean && make && make release && make run && cd -
22+
cd examples/mixed_precision_kws && make clean && make && make release && make run && cd -
23+
cd unit_test/integration && make clean && make && make run && cd -
2024
cd examples/pytorch_quant/xor && make clean && make && make release && make run && cd -
2125
cd examples/import_demo && make clean && make && make release && make run && cd -
2226
cd examples/perf_matrix && make clean && make && make report && cd -

QUANTIZATION.md

Lines changed: 10 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -363,27 +363,24 @@ Medium. Bit-exactness across SIMD widths needs careful order-of-accumulation han
363363
364364
**Risk:** low-medium. Mostly Python tooling, isolated from runtime.
365365
366-
## Phase 16 — Mixed-Precision Exemplars + Verification
366+
## Phase 16 — Mixed-Precision Exemplars + Verification [SHIPPED]
367367
368368
**Goal:** prove end-to-end models really run. Lock with regression tests.
369369
370-
**Scope:** four reference models, each a directory under `examples/`:
370+
**Scope (shipped):** four reference exemplars, one directory per model:
371371
372-
1. `examples/resnet18_block_int8/` — int8 ResNet stem + 1 stage. Exercises Phase 10 ops.
373-
2. `examples/mobilenetv2_int8/` — full MobileNetV2 (depthwise-separable + inverted-residual). Exercises Phase 10 ops at scale.
374-
3. `examples/transformer_encoder_int8/` — single encoder block with int8 attention + int8 LayerNorm + int8 softmax. Exercises Phase 11+13.
375-
4. `examples/mixed_precision_kws/` — **mixed-precision exemplar:** int8 CNN feature extractor → fp16 attention head int8 dense classifier. Exercises Phase 9 bridges in production shape.
372+
1. `examples/resnet18_block_int8/` — int8 ResNet-18 stem + one basic-block stage (`QPad2D` → `QConv2DPerChannel 7x7 s=2` → `qrelu` → `QMaxPool2D` → basic block: `QPad2D` → 3x3 conv → `qrelu` → `QPad2D` → 3x3 conv → `QAdd` skip → `qrelu` → `QGlobalAvgPool2D` → `QDense`). Exercises Phase 10 `QPad2D` / `QConv2DPerChannel` / `QAdd` at deeper spatial dimensions than the original `resnet_block_int8`. Demonstrates that `QMaxPool2D`, `qreluBuffer`, and `QGlobalAvgPool2D` are pass-throughs on the int8 affine grid (max, clamp, integer-mean all preserve `(scale, zero_point)`), so consecutive layers reuse the upstream grid rather than burning new requantizers.
373+
2. `examples/mobilenetv2_int8/` — int8 MobileNetV2-shaped pipeline. Stride-2 stem + one stride-1 inverted-residual block with skip + one stride-2 inverted-residual block without skip, then GAP + dense. Linear bottlenecks per MNv2 convention (no `qrelu` after the 1x1 projection). Exercises the `expand → DW → project` triple — the load-bearing primitive of MNv2 / V3 / EfficientNet.
374+
3. `examples/transformer_encoder_int8/` — already present from Phase 13; Phase 16 wires it into the integration suite with a matching `--golden` mode.
375+
4. `examples/mixed_precision_kws/` — mixed-precision exemplar. int8 `QDense` frontend → Phase 9 `affineI8ToFp16` bridge → fp16 linear-attention head with residual skip + mean-pool → Phase 9 `fp16ToAffineI8` bridge → int8 `QDense` classifier. Requires `TINYMIND_ENABLE_FP16=1`. Inner attention arithmetic runs in float promoted from `fp16_t`; on targets that ship vector fp16 arithmetic the promote pair is near-free, on every other target it is the cost of admission for fp16 storage on an MCU.
376376
377-
Each ships with:
378-
- CSV cycle/byte report (extends existing `bench::report.hpp`)
379-
- Golden float reference in Python, parity test in C++
380-
- README documenting precision tier per layer
377+
Each exemplar Makefile exposes three modes: `make run` (parity report vs float reference; PASS within 40-50 % of output dynamic range), `make bench` (CSV cycle/byte report — one row per layer, mirrors `examples/kws_cortex_m_int8/`), `make golden` (int8 byte stream for the bundled deterministic test set in a stable text format). Each ships with a per-precision-tier README.
381378
382-
**Tests:** `unit_test/integration/` — new directory. Boost.Test fixtures load exemplar weights and verify int8 output matches stored golden int8 output, byte-for-byte. Locks regressions across SIMD gate combos.
379+
**Tests (shipped):** `unit_test/integration/` — new Boost.Test suite. One fixture per exemplar shells out to the example binary with `--golden` via `popen()` and asserts the emitted byte stream matches a baked-in expected string. The exemplar binaries are deterministic (hand-crafted weights, fixed synthetic dataset, pure-integer forward), so the output is invariant across SIMD gate combos by Phase 14's bit-exactness guarantee. Any silent drift in the example pipeline, the `qaffine.hpp` requantizer, the `qcalibration.hpp` helpers, or any SIMD specialization that claims bit-exactness trips the test.
383380
384-
**Success criteria:** repo ships four working mixed-precision models with reproducible benchmark numbers. Future PRs cannot regress them silently.
381+
**Success criteria:** repo ships four working mixed-precision exemplars with reproducible benchmark numbers, locked at byte granularity by the integration suite. ✓ shipped.
385382
386-
**Risk:** low. Final phase — all components landed in prior phases.
383+
**Risk:** low. Final phase — every runtime component landed in Phases 9-14, every host-side helper in Phase 15.
387384
388385
## Dependency Graph
389386

README.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -73,6 +73,10 @@ A parallel TFLite/CMSIS-NN style affine quantization path that runs **alongside*
7373
- **End-to-end examples**:
7474
- [`examples/pytorch_quant/xor/`](examples/pytorch_quant/xor/) -- PyTorch float training + per-tensor calibration + `weights.hpp` emission, then a pure-integer C++ forward pass through `QDense` + `qrelu` + `QDense` + int8 sigmoid LUT
7575
- [`examples/kws_cortex_m_int8/`](examples/kws_cortex_m_int8/) -- side-by-side counterpart to `examples/kws_cortex_m/`; same MobileNet-style KWS pipeline, comparable CSV cycle/byte report, ~4x smaller weight footprint on the convolutional layers
76+
- [`examples/resnet18_block_int8/`](examples/resnet18_block_int8/) -- Phase 16 exemplar. int8 ResNet-18-shaped stem + one basic-block stage. `make run`, `make bench`, `make golden`.
77+
- [`examples/mobilenetv2_int8/`](examples/mobilenetv2_int8/) -- Phase 16 exemplar. int8 MobileNetV2 inverted-residual block sequence with linear bottlenecks.
78+
- [`examples/mixed_precision_kws/`](examples/mixed_precision_kws/) -- Phase 16 mixed-precision exemplar. int8 frontend -> fp16 attention head -> int8 classifier, exercises Phase 9 qbridge converters.
79+
- [`unit_test/integration/`](unit_test/integration/) -- Phase 16 golden-byte regression suite. Locks the four exemplars' int8 output byte-for-byte across SIMD gate combos.
7680

7781
### Activation Functions
7882

Lines changed: 31 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,31 @@
1+
# Phase 16 exemplar: mixed-precision KWS pipeline.
2+
# int8 CNN feature extractor -> fp16 attention head -> int8 dense classifier.
3+
# Exercises the Phase 9 qbridge converters in production shape.
4+
5+
MKDIR=mkdir -p ./output
6+
CC=g++
7+
WARN=-Wall -Wextra -Werror -Wpedantic
8+
OUT=./output/mixed_precision_kws
9+
SOURCES=mixed_precision_kws.cpp
10+
INCLUDES=-I../../cpp -I../../cpp/include -I../../include/
11+
DEFINES=-DTINYMIND_ENABLE_FLOAT=1 -DTINYMIND_ENABLE_STD=1 -DTINYMIND_ENABLE_QUANTIZATION=1 -DTINYMIND_ENABLE_FP16=1
12+
13+
default :
14+
$(MKDIR)
15+
$(CC) -g $(WARN) -o $(OUT) $(SOURCES) $(INCLUDES) $(DEFINES)
16+
17+
release :
18+
$(MKDIR)
19+
$(CC) -g -O3 $(WARN) -o $(OUT) $(SOURCES) $(INCLUDES) $(DEFINES)
20+
21+
run :
22+
cd ./output && ./mixed_precision_kws && cd ../
23+
24+
bench :
25+
cd ./output && ./mixed_precision_kws --bench > mixed_precision_kws.csv && cd ../
26+
27+
golden :
28+
cd ./output && ./mixed_precision_kws --golden > mixed_precision_kws.golden && cd ../
29+
30+
clean :
31+
rm -f ./output/*

0 commit comments

Comments
 (0)