Add Phase 16 mixed-precision exemplars + golden-byte regression suite
Four reference int8 / mixed-precision exemplars plus a Boost.Test
integration suite that locks their int8 output byte-for-byte across
SIMD gate combos:
- examples/resnet18_block_int8/ int8 ResNet stem + basic block
- examples/mobilenetv2_int8/ int8 MobileNetV2 inverted-residual blocks
- examples/mixed_precision_kws/ int8 -> fp16 attn -> int8 (Phase 9 bridges)
- examples/transformer_encoder_int8/ --golden mode added
- unit_test/integration/ popen() one fixture per exemplar
Each exemplar Makefile exposes make run / make bench / make golden.
Root Makefile orders the integration suite after the example builds.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CLAUDE.md (17 additions & 0 deletions)
@@ -148,6 +148,19 @@ Phase 15 reduces the accuracy gap vs a PyTorch / TFLite reference and lowers fri
The Phase 15 deployable shape is unchanged: `TINYMIND_ENABLE_QUANTIZATION=1, FLOAT=0, STD=0`. All new helpers (Observers + CLE + importer scripts) live behind `FLOAT && STD` (the `qcalibration.hpp` gate) or in Python tooling. `examples/import_demo/` ships an end-to-end Phase 15 exemplar (C++ side: 3-8-4-2 MLP, three observers + CLE, calibration + int8 forward parity vs float, ~0.004 max-abs error on the bundled seed; Python side: full PyTorch-to-weights.hpp flow via `apps/import_pytorch`).
+Phase 16 ships four reference int8 / mixed-precision exemplars and a `unit_test/integration` Boost.Test suite that locks their byte output across SIMD gate combos. Pure addition — no runtime header changes.
+
+-**`examples/resnet18_block_int8/`** — int8 ResNet-18-shaped stem plus one basic-block stage (`QPad2D` → `QConv2DPerChannel 7x7 s=2` → `qrelu` → `QMaxPool2D` → basic block: `QPad2D` → 3x3 conv → `qrelu` → `QPad2D` → 3x3 conv → `QAdd` skip → `qrelu` → `QGlobalAvgPool2D` → `QDense`). Demonstrates that `QMaxPool2D`, `qreluBuffer`, and `QGlobalAvgPool2D` are pass-throughs on the int8 affine grid, so consecutive layers reuse the upstream `(scale, zero_point)` rather than burning new requantizers.
+-**`examples/mobilenetv2_int8/`** — int8 MobileNetV2-shaped pipeline. Two inverted-residual blocks (one stride-1 with a residual skip, one stride-2 without), wired around a stride-2 stem and a GAP + dense head. The projection convolutions are linear (no `qrelu`), matching MNv2's "linear bottleneck" design rule. The inverted-residual unit is the load-bearing primitive of MNv2 / V3 / EfficientNet — the build pattern in this file scales linearly to a full model.
+-**`examples/transformer_encoder_int8/`** — already present from Phase 13. Phase 16 wires it into the integration suite with the same `--golden` mode as the new exemplars.
+-**`examples/mixed_precision_kws/`** — mixed-precision exemplar that exercises the Phase 9 qbridge converters in production shape: int8 `QDense` frontend → `affineI8ToFp16` bridge → fp16 linear-attention head with residual skip + mean-pool → `fp16ToAffineI8` bridge → int8 `QDense` classifier. `TINYMIND_ENABLE_FP16=1` required at the Makefile level. Inner attention arithmetic runs in float promoted from `fp16_t`; on targets that ship vector fp16 arithmetic (NEON FEAT_FP16, AVX-512 fp16) the promote pair is near-free, on every other target it is the cost of admission for fp16 storage on an MCU.
+
+Each exemplar Makefile exposes the same three modes: `make run` (parity report vs float reference), `make bench` (CSV cycle/byte report), `make golden` (int8 byte stream for the bundled test set in a stable text format).
+
+`unit_test/integration/` — new Boost.Test suite. One fixture per exemplar shells out to the example binary with `--golden` via `popen()` and compares the emitted byte stream to a baked-in expected string. The exemplar binaries are deterministic (hand-crafted weights, fixed synthetic dataset, pure-integer forward), so the output is invariant across SIMD gate combos by Phase 14's bit-exactness guarantee. Any silent drift in the example pipeline, the `qaffine.hpp` requantizer, the `qcalibration.hpp` helpers, or any SIMD specialization that claims bit-exactness trips the test. The root `Makefile`'s `check` target orders the integration test after the example builds so the binaries always exist when the test runs.
+
### Design Pattern
Neural networks are configured through a properties struct that bundles all template policies:
- **`qlearn/`** — Boost.Test unit tests for Q-learning
- **`quantization/`** — Boost.Test unit tests for the int8 quantization path: Requantizer round-trip, per-tensor / per-channel calibration, QConv2D / QDepthwise / QPointwise / QPool / QDense forward passes against a float reference. Phase 11 additions cover `foldBatchNorm` (fused-conv parity vs unfused conv→BN), `QBatchNorm2D` parity, `QLayerNorm1D` parity and constant-row edge case, and `QSoftmax1D` parity plus dominant-class saturation. Phase 12 additions cover `QLSTMCell` single-step parity vs a float LSTM reference, `QLSTMCell` int16-cell-state drift over a 256-step sequence, and `QGRUCell` single-step parity vs a float GRU reference. Phase 13 additions cover Q1.15 twiddle round-trip, `QFFT1D` magnitude-spectrum parity vs a naive float DFT, `QFFT1D` forward/inverse round-trip, `QAttention1D` parity vs a float linear-attention reference, `QAttentionSoftmax1D` parity vs a float softmax-attention reference, and a `QMultiHeadLinearAttention1D` stacking test. Phase 14 additions cover SIMD bit-exactness across pathological lengths, INT8 extreme-value patterns, full-layer `QDense` and `QConv2D` parity, and the `activeBackendName()` dispatch report. Phase 15 additions cover `PercentileObserver` outlier clipping + empty-buffer edge case, `KLDivergenceObserver` clip-threshold convergence vs a Gaussian + outliers dataset + empty edge case, `crossLayerEqualizeDense` output preservation under ReLU + zero-row skip, and `crossLayerEqualizeConv2D` output preservation. Builds with `TINYMIND_ENABLE_QUANTIZATION=1`; pass `-DTINYMIND_ENABLE_SIMD_*=1` plus the matching `-march=` flag to exercise a SIMD backend.
- **`embedded/`** — Cross-corner regression matrix. Builds the smoke source under eight `(FLOAT, STD, QUANT, FP16, INT16_ACCUM, SIMD_*)` configurations: `freestanding`, `no_stdlib`, `no_fpu`, `hosted`, `quant_freestanding`, `fp16_freestanding`, `int16_accum_freestanding`, and `simd_disabled` (Phase 14 scalar-fallback corner — every `TINYMIND_ENABLE_SIMD_*=0` at the deployable freestanding shape). A separate `simd_prereq_regressions` make target locks the static_assert prerequisite chain (`AVX_VNNI=1, AVX2=0` and `AVX512_VNNI=1, AVX512F=0` must fail to compile).
+- **`integration/`** — Phase 16 golden-byte suite. One Boost.Test fixture per exemplar (`resnet18_block_int8`, `mobilenetv2_int8`, `mixed_precision_kws`, `transformer_encoder_int8`) shells out to the example binary with `--golden` and asserts the emitted int8 byte stream matches a baked-in expected string. Catches silent regressions in the inference path regardless of which SIMD backend dispatch resolves to.
- **`transformer_encoder_int8/`** — Phase 13 demonstration: int8 transformer encoder block (`QLayerNorm1D` → `QAttention1D` linear attention → `QAdd` skip → `QLayerNorm1D` → `QDense` + `qrelu` → `QDense` → `QAdd` skip). Calibrates per-tensor activations and per-tensor symmetric weight scales on the host, then runs the block end-to-end on int8 and reports max-abs error vs the float reference (~2% of output range on the bundled dataset).
- **`perf_matrix/`** — Phase 14 SIMD gate bench. Builds the same `QConv2D` 3x3 + `QDense` int8 block under each enabled `TINYMIND_ENABLE_SIMD_*` combination on the host (default Makefile builds scalar / AVX2 / AVX-512F / AVX-512-VNNI). Emits one CSV row per backend with per-call timing and an `output_checksum` that is invariant across backends — the row is the bit-exactness regression and the cycle delta is the perf headline.
- **`import_demo/`** — Phase 15 importer end-to-end. C++ binary carries a deterministic 3-8-4-2 MLP, drives a 64-sample synthetic calibration set through `RangeObserver` / `PercentileObserver` / `KLDivergenceObserver` plus `crossLayerEqualizeDense`, then runs both the float reference and the pure-integer int8 forward and reports max-abs error (~0.004 on the bundled seed; tolerance 0.08). Standalone — no torch dependency. `demo.py` is the production-flow counterpart that consumes `torch.state_dict` and drives `apps/import_pytorch/tinymind_import` to emit a real `weights.hpp`.
+- **`resnet18_block_int8/`** — Phase 16 exemplar. int8 ResNet-18-shaped stem + one basic-block stage on a 16x16x3 input, 4 logits out. Same `make run` / `make bench` / `make golden` mode triple as the other Phase 16 exemplars.
+- **`mobilenetv2_int8/`** — Phase 16 exemplar. int8 MobileNetV2-shaped pipeline: stride-2 stem + one stride-1 inverted-residual block with skip + one stride-2 inverted-residual block, then GAP + dense. Linear bottlenecks per MNv2 convention.
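The pass-through claim made for `QMaxPool2D` and `qreluBuffer` in the ResNet exemplar notes can be illustrated with a minimal, self-contained C++ sketch. It does not use the TinyMind headers: `AffineParams`, `dequantize`, `qrelu_sketch`, `qmax_sketch`, and `passthrough_holds` are hypothetical names, and the mapping real = scale * (q - zero_point) with scale > 0 is an assumed affine convention. Under that assumption, max and clamp-at-zero-point give the same real values whether they run before or after dequantization, which is why no new requantizer would be needed. The integer-mean case is not covered here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical affine parameters (assumption): real = scale * (q - zero_point).
struct AffineParams {
    float scale;             // assumed strictly positive
    std::int8_t zero_point;  // int8 code that represents real 0.0
};

float dequantize(std::int8_t q, AffineParams p) {
    return p.scale * static_cast<float>(static_cast<int>(q) - p.zero_point);
}

// ReLU on the int8 grid: real 0.0 maps to zero_point, so clamp from below there.
std::int8_t qrelu_sketch(std::int8_t q, AffineParams p) {
    return std::max(q, p.zero_point);
}

// Max over a non-empty window: with scale > 0, dequantize is monotonic,
// so the largest int8 code is also the largest real value.
std::int8_t qmax_sketch(const std::vector<std::int8_t>& window) {
    return *std::max_element(window.begin(), window.end());
}

// Checks that both ops commute with dequantization on one sample window.
bool passthrough_holds() {
    const AffineParams p{0.05f, -3};
    const std::vector<std::int8_t> window{-128, -40, -3, 17, 127};

    // Max in the float domain vs. max on the int8 grid.
    float float_max = dequantize(window[0], p);
    for (std::int8_t q : window) {
        float_max = std::max(float_max, dequantize(q, p));
    }
    if (float_max != dequantize(qmax_sketch(window), p)) {
        return false;
    }

    // ReLU in the float domain vs. clamp on the int8 grid.
    for (std::int8_t q : window) {
        const float float_relu = std::max(0.0f, dequantize(q, p));
        if (float_relu != dequantize(qrelu_sketch(q, p), p)) {
            return false;
        }
    }
    return true;
}
```

Calling `passthrough_holds()` from a test or a small `main` exercises both checks; the output of each op stays on the input grid, so the same `AffineParams` value describes it.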
-2. `examples/mobilenetv2_int8/` — full MobileNetV2 (depthwise-separable + inverted-residual). Exercises Phase 10 ops at scale.
-3. `examples/transformer_encoder_int8/` — single encoder block with int8 attention + int8 LayerNorm + int8 softmax. Exercises Phase 11+13.
-4. `examples/mixed_precision_kws/` — **mixed-precision exemplar:** int8 CNN feature extractor → fp16 attention head → int8 dense classifier. Exercises Phase 9 bridges in production shape.
+1. `examples/resnet18_block_int8/` — int8 ResNet-18 stem + one basic-block stage (`QPad2D` → `QConv2DPerChannel 7x7 s=2` → `qrelu` → `QMaxPool2D` → basic block: `QPad2D` → 3x3 conv → `qrelu` → `QPad2D` → 3x3 conv → `QAdd` skip → `qrelu` → `QGlobalAvgPool2D` → `QDense`). Exercises Phase 10 `QPad2D` / `QConv2DPerChannel` / `QAdd` at deeper spatial dimensions than the original `resnet_block_int8`. Demonstrates that `QMaxPool2D`, `qreluBuffer`, and `QGlobalAvgPool2D` are pass-throughs on the int8 affine grid (max, clamp, integer-mean all preserve `(scale, zero_point)`), so consecutive layers reuse the upstream grid rather than burning new requantizers.
+2. `examples/mobilenetv2_int8/` — int8 MobileNetV2-shaped pipeline. Stride-2 stem + one stride-1 inverted-residual block with skip + one stride-2 inverted-residual block without skip, then GAP + dense. Linear bottlenecks per MNv2 convention (no `qrelu` after the 1x1 projection). Exercises the `expand → DW → project` triple — the load-bearing primitive of MNv2 / V3 / EfficientNet.
+3. `examples/transformer_encoder_int8/` — already present from Phase 13; Phase 16 wires it into the integration suite with a matching `--golden` mode.
+4. `examples/mixed_precision_kws/` — mixed-precision exemplar. int8 `QDense` frontend → Phase 9 `affineI8ToFp16` bridge → fp16 linear-attention head with residual skip + mean-pool → Phase 9 `fp16ToAffineI8` bridge → int8 `QDense` classifier. Requires `TINYMIND_ENABLE_FP16=1`. Inner attention arithmetic runs in float promoted from `fp16_t`; on targets that ship vector fp16 arithmetic the promote pair is near-free, on every other target it is the cost of admission for fp16 storage on an MCU.
-- Golden float reference in Python, parity test in C++
-- README documenting precision tier per layer
+Each exemplar Makefile exposes three modes: `make run` (parity report vs float reference; PASS within 40-50 % of output dynamic range), `make bench` (CSV cycle/byte report — one row per layer, mirrors `examples/kws_cortex_m_int8/`), `make golden` (int8 byte stream for the bundled deterministic test set in a stable text format). Each ships with a per-precision-tier README.
-**Tests:** `unit_test/integration/` — new directory. Boost.Test fixtures load exemplar weights and verify int8 output matches stored golden int8 output, byte-for-byte. Locks regressions across SIMD gate combos.
+**Tests (shipped):** `unit_test/integration/` — new Boost.Test suite. One fixture per exemplar shells out to the example binary with `--golden` via `popen()` and asserts the emitted byte stream matches a baked-in expected string. The exemplar binaries are deterministic (hand-crafted weights, fixed synthetic dataset, pure-integer forward), so the output is invariant across SIMD gate combos by Phase 14's bit-exactness guarantee. Any silent drift in the example pipeline, the `qaffine.hpp` requantizer, the `qcalibration.hpp` helpers, or any SIMD specialization that claims bit-exactness trips the test.
-**Success criteria:** repo ships four working mixed-precision models with reproducible benchmark numbers. Future PRs cannot regress them silently.
+**Success criteria:** repo ships four working mixed-precision exemplars with reproducible benchmark numbers, locked at byte granularity by the integration suite. ✓ shipped.
-**Risk:** low. Final phase — all components landed in prior phases.
+**Risk:** low. Final phase — every runtime component landed in Phases 9-14, every host-side helper in Phase 15.
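The golden-byte check described under Tests (shell out with `popen()`, compare the emitted stream to a baked-in string) could look roughly like the sketch below. This is an illustration under stated assumptions, not the shipped fixture: `capture_stdout` and `matches_golden` are hypothetical helper names, the Boost.Test wiring is omitted, and a POSIX environment providing `popen` / `pclose` is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <stdio.h>   // popen / pclose are POSIX, declared in <stdio.h>
#include <string>

// Hypothetical helper: run a shell command and return everything it wrote
// to stdout. Returns an empty string if the pipe cannot be opened.
std::string capture_stdout(const std::string& command) {
    std::string output;
    FILE* pipe = ::popen(command.c_str(), "r");
    if (pipe == nullptr) {
        return output;
    }
    char buffer[256];
    std::size_t bytes_read = 0;
    while ((bytes_read = ::fread(buffer, 1, sizeof(buffer), pipe)) > 0) {
        output.append(buffer, bytes_read);
    }
    ::pclose(pipe);
    return output;
}

// Hypothetical helper: exact, byte-for-byte comparison against the golden
// string. Any difference, including trailing whitespace, counts as drift.
bool matches_golden(const std::string& command, const std::string& expected) {
    return capture_stdout(command) == expected;
}
```

In the suite the command would be the exemplar binary invoked with `--golden`; a stand-in such as `matches_golden("printf '12 -7 0\\n'", "12 -7 0\n")` exercises the same path without any exemplar built. Comparing whole strings keeps the check strict, which suits output that is expected to be bit-exact across SIMD gate combos.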