The Pages site should read standalone without requiring the reader to
cross-reference QUANTIZATION.md plan milestones. Replace "Phase 9/10/11/
12/13/14/15/16 ships X" lead-ins with feature-descriptive language,
drop "(Phase N)" suffixes from section headings, and remove "Phase N."
prefixes from table descriptions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
docs/architectures/fft.md (+1 -1: 1 addition & 1 deletion)
@@ -285,4 +285,4 @@ These estimates include twiddle multiplication (2 multiplies + 2 adds per butter
## Int8 Quantized Counterpart
-
Phase 13 ships `QFFT1D<N>`, a radix-2 DIT FFT on int16 buffers with Q1.15 twiddle factors. Twiddles are caller-owned, built host-side by `buildQFFTTwiddles(n, cos_out, sin_out)`. Scaled butterflies (right-shift by 1 per stage; total scaling 1/N) keep the int16 working register bounded. `magnitudeSquared` emits int32; the int8 boundary on either side is expressed as an ordinary `Requantizer`. Inverse via the conjugate trick. See [Int8 Affine Quantization]({{ site.baseurl }}/architectures/int8-quantization) for the surrounding integer pipeline.
+
TinyMind ships `QFFT1D<N>`, a radix-2 DIT FFT on int16 buffers with Q1.15 twiddle factors. Twiddles are caller-owned, built host-side by `buildQFFTTwiddles(n, cos_out, sin_out)`. Scaled butterflies (right-shift by 1 per stage; total scaling 1/N) keep the int16 working register bounded. `magnitudeSquared` emits int32; the int8 boundary on either side is expressed as an ordinary `Requantizer`. Inverse via the conjugate trick. See [Int8 Affine Quantization]({{ site.baseurl }}/architectures/int8-quantization) for the surrounding integer pipeline.
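The scaled butterfly described in this hunk can be sketched as follows. This is a minimal illustration, not TinyMind's `QFFT1D` code: the `ComplexQ15` and `ButterflyResult` types and the `mulQ15` and `scaledButterfly` helpers are hypothetical names, and the rounding mode is an assumption.

```cpp
#include <cstdint>

// Hypothetical types for illustration; not TinyMind's QFFT1D API.
struct ComplexQ15 {
    int16_t re;
    int16_t im;
};

struct ButterflyResult {
    ComplexQ15 top;     // (a + w*b) / 2
    ComplexQ15 bottom;  // (a - w*b) / 2
};

// Q1.15 multiply with round-to-nearest: (a * b + 2^14) >> 15, kept in int32.
inline int32_t mulQ15(int16_t a, int16_t b) {
    return (static_cast<int32_t>(a) * b + (1 << 14)) >> 15;
}

// One radix-2 DIT butterfly with a right shift by 1 on both outputs.
// Assumes the complex magnitudes of a and b stay below 1.0 in Q1.15 so the
// halved sums fit in int16, and that >> on negative values is an arithmetic
// shift (guaranteed from C++20, universal in practice before that).
inline ButterflyResult scaledButterfly(ComplexQ15 a, ComplexQ15 b, ComplexQ15 w) {
    const int32_t tr = mulQ15(b.re, w.re) - mulQ15(b.im, w.im);
    const int32_t ti = mulQ15(b.re, w.im) + mulQ15(b.im, w.re);
    ButterflyResult out;
    out.top.re    = static_cast<int16_t>((a.re + tr) >> 1);
    out.top.im    = static_cast<int16_t>((a.im + ti) >> 1);
    out.bottom.re = static_cast<int16_t>((a.re - tr) >> 1);
    out.bottom.im = static_cast<int16_t>((a.im - ti) >> 1);
    return out;
}
```

Halving at each of the log2(N) stages is what produces the total 1/N scaling mentioned above, so a caller that wants unscaled magnitudes has to account for that factor downstream.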
docs/architectures/lstm-gru.md (+1 -1: 1 addition & 1 deletion)
@@ -251,7 +251,7 @@ See the [Weight Import Export and PyTorch Interoperability]({{ site.baseurl }}/t
# Int8 Quantized Counterparts
-
For inference-only deployment that does not need the trainable Q-format pipeline at all, Phase 12 ships pure-integer int8 cells alongside `LstmNeuralNetwork` / `GruNeuralNetwork`:
+
For inference-only deployment that does not need the trainable Q-format pipeline at all, TinyMind ships pure-integer int8 cells alongside `LstmNeuralNetwork` / `GruNeuralNetwork`:
- `QLSTMCell` — four gates (i, f, g, o) in TFLite ordering. Two rescalers per gate (input-MAC + recurrent-MAC) into a shared sigmoid / tanh LUT input scale; cell update via two `multiplyByQuantizedMultiplier` calls. Cell-state storage `int8_t` (default) or `int16_t` for long unroll horizons (gate `TINYMIND_ENABLE_INT16_ACCUM=1`).
- `QGRUCell` — three gates (r, z, n) in canonical ordering. Reset-before-multiply formulation, `(1 - z_t)` computed exactly in the sigmoid grid as `-z_t`.
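The `(1 - z_t)` identity in the `QGRUCell` bullet can be checked with a small sketch. It assumes the sigmoid output grid uses the TFLite convention of scale 1/256 and zero point -128; the hunk does not state these parameters, and the helper names are hypothetical.

```cpp
#include <cstdint>

// Assumed sigmoid output grid (TFLite convention); not confirmed by the source.
constexpr float kSigmoidScale = 1.0f / 256.0f;
constexpr int kSigmoidZeroPoint = -128;

// Real value represented by an int8 sigmoid output q: (q + 128) / 256.
inline float dequantSigmoid(int8_t q) {
    return static_cast<float>(static_cast<int>(q) - kSigmoidZeroPoint) * kSigmoidScale;
}

// 1 - (q + 128)/256 = ((-q) + 128)/256, so the quantized form of (1 - z) is -q.
// The single exception is q = -128 (z = 0), where -q = 128 is clamped to 127.
inline int8_t oneMinusInSigmoidGrid(int8_t q) {
    const int negated = -static_cast<int>(q);
    return static_cast<int8_t>(negated > 127 ? 127 : negated);
}
```

Under that assumption the complement costs one negation and needs no extra rescaler, which matches the "computed exactly" wording in the bullet.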
docs/architectures/mixed-precision.md (+7 -9: 7 additions & 9 deletions)
@@ -7,19 +7,17 @@ nav_order: 8
# Mixed Precision
-
Phase 9 adds composability between the previously orphaned numeric pipelines and a software half-precision storage tier. The result is a small set of **pointwise converters** that live at layer boundaries, so a single network can run an int8 affine CNN frontend, hand off to a Q-format LSTM head, hand off again to an fp16 attention block, and project back to int8 for the classifier — every layer keeps the runtime cost of its own grid, the bridges only run once per tensor crossing.
+
TinyMind composes its three numeric pipelines through a small set of **pointwise converters** that live at layer boundaries, plus a software half-precision storage tier. A single network can run an int8 affine CNN frontend, hand off to a Q-format LSTM head, hand off again to an fp16 attention block, and project back to int8 for the classifier — every layer keeps the runtime cost of its own grid, the bridges only run once per tensor crossing.
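As a rough picture of what a pointwise converter at a layer boundary does, here is a minimal sketch of the int8 affine to float direction and back. The `AffineParams` struct and both function names are hypothetical and are not the `qbridge.hpp` API; the fp16 storage step is left out.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical per-tensor affine parameters: real = scale * (q - zeroPoint).
struct AffineParams {
    float scale;
    int32_t zeroPoint;
};

// Dequantize one int8 element into float.
inline float affineI8ToFloat(int8_t q, AffineParams p) {
    return p.scale * static_cast<float>(static_cast<int32_t>(q) - p.zeroPoint);
}

// Quantize one float element back onto an int8 affine grid, saturating at
// the int8 limits.
inline int8_t floatToAffineI8(float x, AffineParams p) {
    long v = std::lround(x / p.scale) + p.zeroPoint;
    if (v < -128) v = -128;
    if (v > 127) v = 127;
    return static_cast<int8_t>(v);
}
```

Each function touches one element at a time, which is why a bridge costs one pass per tensor crossing and adds nothing to the layers on either side.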
## Three pipelines, one model
-
Pre-Phase-9, TinyMind shipped three numeric pipelines that did not talk to each other:
-
| Pipeline | Storage | Where it lives | When it wins |
|---|---|---|---|
| `QValue` Q-format | int8 / int16 / int32 / int64 with a compile-time binary point | `cpp/qformat.hpp` + `cpp/neuralnet.hpp` | Trainable on-MCU, single global grid, no per-tensor metadata |
| Float | `float` / `double` | Same templates, different `ValueType` | Host development, training |
-
| Int8 affine | int8 weights + int8 activations + per-tensor `(scale, zero_point)` | `cpp/q*.hpp` family (Phase 1–8) | TFLite-shape inference, multi-grid (each tensor picks its own range) |
+
| Int8 affine | int8 weights + int8 activations + per-tensor `(scale, zero_point)` | `cpp/q*.hpp` family | TFLite-shape inference, multi-grid (each tensor picks its own range) |
-
Phase 9 wires the three together. Phase 14's `simd_neon_fp16.hpp` later added vector specializations for fp16 storage; this page covers the storage tier and the converters.
+
The qbridge converters tie the three together. The `simd_neon_fp16.hpp` backend adds vector specializations for fp16 storage on Arm hardware that supports it; this page covers the storage tier and the converters.
## qbridge.hpp — pointwise converters
@@ -74,20 +72,20 @@ The `unit_test/embedded/Makefile` exercises this corner as `fp16_freestanding` (
-
[`examples/mixed_precision_kws/`](https://github.com/danmcleran/tinymind/tree/master/examples/mixed_precision_kws) (Phase 16) wires the qbridge converters in production shape:
+
[`examples/mixed_precision_kws/`](https://github.com/danmcleran/tinymind/tree/master/examples/mixed_precision_kws) wires the qbridge converters in production shape:
```
input [S=8][E=8] float
----[ int8 frontend ]----------------------------
QDense E -> E (one call per sequence step)
qrelu -> [S][E] int8
-
----[ Phase 9 bridge: affineI8 -> fp16 ]---------
+
----[ qbridge: affineI8 -> fp16 ]----------------
-> [S][E] fp16
----[ fp16 attention head ]----------------------
Linear (ReLU-kernel) self-attention with residual
skip from the post-relu feature buffer, then
mean-pool over S -> [E] fp16
-
----[ Phase 9 bridge: fp16 -> affineI8 ]---------
+
----[ qbridge: fp16 -> affineI8 ]----------------
-> [E] int8
----[ int8 classifier ]--------------------------
QDense E -> NUM_CLASSES -> [NUM_CLASSES] int8 logits
@@ -105,7 +103,7 @@ The precision-tier pattern — int8 front + classifier bracketing an fp16 head
- **Not QAT.** Mixed precision is a deployment story, not a training story.
- **Not fp16 arithmetic.** The library treats fp16 as a storage tier; inner arithmetic promotes to float. The vector fp16 ISA gates (`SIMD_NEON_FP16`, AVX-512 fp16) get there on hardware that supports it, but the library does not synthesize fp16 software arithmetic.
-
- **Not int4.** Storage is int8 / int16 / int32 / fp16 / bf16 / float / double. Sub-byte storage is a non-goal of this phase.
+
- **Not int4.** Storage is int8 / int16 / int32 / fp16 / bf16 / float / double. Sub-byte storage is out of scope.
- `QAttention1D` — int8 linear (ReLU-kernel) attention. Same shape and math; ReLU on Q'/K' folded into the requantizer by raising `qmin = zero_point`. Caller-owned weight, bias, and scratch buffers.
- `QAttentionSoftmax1D` — standard softmax attention. Score requantizer folds the `1 / sqrt(d_k)` factor via `qAttentionInvSqrt(P)`; softmax uses the same 256-entry int32 exp LUT as `QSoftmax1D`.
docs/architectures/simd-backends.md (+4 -4: 4 additions & 4 deletions)
@@ -7,7 +7,7 @@ nav_order: 7
# SIMD Backends
-
Phase 14 wires ISA-capability-gated SIMD specializations into the inner reduction loop of the int8 affine layer family (`QDense`, `QConv2D`, `QConv2DPerChannel`). The library never sniffs the CPU. Every backend lives behind a `TINYMIND_ENABLE_SIMD_*` preprocessor gate, every gate defaults to `0`, and with all gates off the layer bodies fall back to a scalar dispatch that emits **byte-identical** output to the pre-Phase-14 build.
+
TinyMind ships ISA-capability-gated SIMD specializations in the inner reduction loop of the int8 affine layer family (`QDense`, `QConv2D`, `QConv2DPerChannel`). The library never sniffs the CPU. Every backend lives behind a `TINYMIND_ENABLE_SIMD_*` preprocessor gate, every gate defaults to `0`, and with all gates off the layer bodies fall back to a scalar dispatch that emits **byte-identical** output to the scalar reference.
## Design rules
@@ -59,7 +59,7 @@ The public entry point is `tinymind::simd::int8DotWithZeroPoint` in [`cpp/includ
## Bit-exactness invariant — why it matters
-
The integer SIMD backends produce byte-identical output to the scalar reference for any input. The Phase 16 integration suite (`unit_test/integration/`) leans on this: each exemplar's `make golden` mode emits an int8 byte stream, and the integration test asserts that stream matches a baked-in expected string. Because the inference path is deterministic and the SIMD backends are bit-exact, the same expected string passes regardless of which gate combination the example binary was built with. Any silent drift in `qaffine.hpp`, `qcalibration.hpp`, or any SIMD specialization that claims bit-exactness trips the test.
+
The integer SIMD backends produce byte-identical output to the scalar reference for any input. The integration suite (`unit_test/integration/`) leans on this: each exemplar's `make golden` mode emits an int8 byte stream, and the integration test asserts that stream matches a baked-in expected string. Because the inference path is deterministic and the SIMD backends are bit-exact, the same expected string passes regardless of which gate combination the example binary was built with. Any silent drift in `qaffine.hpp`, `qcalibration.hpp`, or any SIMD specialization that claims bit-exactness trips the test.
The AVX2 backend deliberately avoids `PMADDUBSW`: that instruction saturates on the pair-sum step, which would break the bit-exactness guarantee on pathological inputs. AVX-VNNI and AVX-512-VNNI use the canonical uint8-shift trick so `VPDPBUSD` reduces a uint8 / int8 product exactly.
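The saturation hazard can be reproduced without intrinsics. The sketch below models the pair-sum step of `PMADDUBSW` in scalar code and shows one reading of the uint8-shift idea; the function names are hypothetical, and whether TinyMind applies the bias correction in exactly this form is an assumption.

```cpp
#include <cstdint>

// Exact uint8 x int8 pair sum, as a scalar reference would compute it.
inline int32_t exactPairSum(uint8_t u0, int8_t s0, uint8_t u1, int8_t s1) {
    return static_cast<int32_t>(u0) * s0 + static_cast<int32_t>(u1) * s1;
}

// Scalar model of one PMADDUBSW lane: the same pair sum, saturated to int16.
inline int16_t saturatingPairSum(uint8_t u0, int8_t s0, uint8_t u1, int8_t s1) {
    int32_t sum = exactPairSum(u0, s0, u1, s1);
    if (sum > INT16_MAX) sum = INT16_MAX;
    if (sum < INT16_MIN) sum = INT16_MIN;
    return static_cast<int16_t>(sum);
}

// Assumed form of the uint8-shift trick: bias the int8 operand a by +128 so
// it becomes a uint8, multiply, then subtract the 128 * b correction.
// (a + 128) * b - 128 * b == a * b holds exactly in int32.
inline int32_t productViaUint8Shift(int8_t a, int8_t b) {
    const uint8_t shifted = static_cast<uint8_t>(static_cast<int>(a) + 128);
    return static_cast<int32_t>(shifted) * b - 128 * static_cast<int32_t>(b);
}
```

With both pairs at `255 * 127` the exact sum is 64770, which exceeds the int16 maximum, so the saturating model returns 32767 and diverges from the scalar reference.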
@@ -96,11 +96,11 @@ Run the resulting binary on the target hardware (or under `qemu-aarch64` for cor
## What about non-int8 layers?
-
Phase 14 specializes the int8 affine layer family because that is where the integer dot product wins big. The Q-format pipeline (`QValue<Q, F, signed>`) and float pipeline rely on compiler auto-vectorization with `-O3 -march=native` — no library-side specialization. The `SIMD_NEON_FP16` and `SIMD_HELIUM_MVE_F` float gates land via `cpp/include/simd/simd_neon_fp16.hpp`, used by the mixed-precision exemplar.
+
TinyMind specializes the int8 affine layer family because that is where the integer dot product wins big. The Q-format pipeline (`QValue<Q, F, signed>`) and float pipeline rely on compiler auto-vectorization with `-O3 -march=native` — no library-side specialization. The `SIMD_NEON_FP16` and `SIMD_HELIUM_MVE_F` float gates land via `cpp/include/simd/simd_neon_fp16.hpp`, used by the mixed-precision exemplar.
## See Also
- [Int8 Affine Quantization]({{ site.baseurl }}/architectures/int8-quantization) — the layer family these backends accelerate.
-
- [Mixed Precision]({{ site.baseurl }}/architectures/mixed-precision) — Phase 9 qbridge + fp16 storage, the consumer of the float vector gates.
+
- [Mixed Precision]({{ site.baseurl }}/architectures/mixed-precision) — qbridge + fp16 storage, the consumer of the float vector gates.
docs/getting-started/mobilenetv2-int8.md (+3 -3: 3 additions & 3 deletions)
@@ -7,9 +7,9 @@ nav_order: 8
# MobileNetV2-shaped int8
-
This tutorial walks the Phase 16 [`examples/mobilenetv2_int8/`](https://github.com/danmcleran/tinymind/tree/master/examples/mobilenetv2_int8) exemplar — a deterministic int8 MobileNetV2-shaped pipeline that exercises the inverted-residual block, linear bottlenecks, residual skips through `QAdd`, and the GAP + dense head. The build pattern in this file scales linearly to a full MobileNetV2-1.0 model (same block, 17× with the channel and stride schedule from the spec).
+
This tutorial walks the [`examples/mobilenetv2_int8/`](https://github.com/danmcleran/tinymind/tree/master/examples/mobilenetv2_int8) exemplar — a deterministic int8 MobileNetV2-shaped pipeline that exercises the inverted-residual block, linear bottlenecks, residual skips through `QAdd`, and the GAP + dense head. The build pattern in this file scales linearly to a full MobileNetV2-1.0 model (same block, 17× with the channel and stride schedule from the spec).
-
It is also the first exemplar that ships a `make golden` mode — the int8 logit byte stream is locked by the `unit_test/integration/` Boost.Test suite, regardless of which Phase 14 SIMD backend the build resolves to.
+
The exemplar ships a `make golden` mode — the int8 logit byte stream is locked by the `unit_test/integration/` Boost.Test suite, regardless of which SIMD backend the build resolves to.
## Pipeline (NHWC)
@@ -89,7 +89,7 @@ make golden # int8 logits for the bundled 4-sample test set
`make run` prints per-tensor affine params and the worst max-abs error vs the float reference; the bundled dataset passes within 50% of the logits range.
-
`make golden` writes a stable text dump of the int8 logit bytes that the integration suite asserts byte-for-byte. Because Phase 14's bit-exactness guarantee holds for every enabled SIMD backend, the same expected string passes regardless of which gate combination the example binary was built with.
+
`make golden` writes a stable text dump of the int8 logit bytes that the integration suite asserts byte-for-byte. Because the SIMD backends' bit-exactness guarantee holds for every enabled backend, the same expected string passes regardless of which gate combination the example binary was built with.
docs/getting-started/pytorch-importer.md (+2 -2: 2 additions & 2 deletions)
@@ -7,7 +7,7 @@ nav_order: 7
# PyTorch → TinyMind int8 (production importer)
-
This tutorial walks the **Phase 15 importer flow**: take a trained PyTorch model, pull weights from `torch.state_dict`, run per-layer calibration with any of `MinMaxObserver` / `PercentileObserver` / `KLDivergenceObserver`, optionally apply Cross-Layer Equalization to recover accuracy on imbalanced layers, and emit a TinyMind-format `weights.hpp` that snaps straight into the int8 `Q*` layer family.
+
This tutorial walks the **production PyTorch importer flow**: take a trained PyTorch model, pull weights from `torch.state_dict`, run per-layer calibration with any of `MinMaxObserver` / `PercentileObserver` / `KLDivergenceObserver`, optionally apply Cross-Layer Equalization to recover accuracy on imbalanced layers, and emit a TinyMind-format `weights.hpp` that snaps straight into the int8 `Q*` layer family.
It's the heavier-lift counterpart to [PyTorch → TinyMind int8 (XOR)]({{ site.baseurl }}/getting-started/pytorch-quant-xor). Same destination — pure-integer C++ inference — but instead of hand-rolling the calibration loop for one tiny network, you describe each layer once and the importer handles range estimation, Conv+BN fusion, weight quantization, and header emission.
@@ -88,7 +88,7 @@ The three observers cover different activation shapes:
|`PercentileObserver(lo, hi)`| Heavy-tail activations (post-conv with large receptive field, pre-softmax logits). `(0.05, 99.95)` clips the worst ~0.1% so the int8 grid is not wasted on a handful of extreme samples |
|`KLDivergenceObserver`| When percentile clipping is too crude. TensorRT-style: fix a 2048-bin histogram width, fill it, sweep threshold T in `[128, 2048]` to minimize KL between the clipped float distribution and its int8-quantized form. Heaviest but highest fidelity |
-
Match the observer to each tensor's empirical shape; the importer does not try to auto-pick. The Phase 15 [`examples/import_demo/`](https://github.com/danmcleran/tinymind/tree/master/examples/import_demo) C++ binary exercises all three on a deterministic 3-8-4-2 MLP so the calibration math is easy to inspect side by side.
+
Match the observer to each tensor's empirical shape; the importer does not try to auto-pick. The [`examples/import_demo/`](https://github.com/danmcleran/tinymind/tree/master/examples/import_demo) C++ binary exercises all three on a deterministic 3-8-4-2 MLP so the calibration math is easy to inspect side by side.
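To make the percentile-clipping idea from the observer table concrete, here is a minimal sketch of turning a clipped range into int8 affine parameters. It is not the importer's code: `Range`, `AffineParams`, `percentileRange`, and `affineFromRange` are hypothetical names, and the nearest-rank percentile is a simplification (a real observer may interpolate or work from a histogram).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Range {
    float lo;
    float hi;
};

struct AffineParams {
    float scale;
    int32_t zeroPoint;
};

// Nearest-rank percentiles over a sorted copy of the samples.
inline Range percentileRange(std::vector<float> samples, double loPct, double hiPct) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const auto at = [&samples](double pct) {
        const double pos = pct / 100.0 * static_cast<double>(samples.size() - 1);
        return samples[static_cast<std::size_t>(std::llround(pos))];
    };
    return Range{at(loPct), at(hiPct)};
}

// Map [lo, hi] onto the int8 grid [-128, 127]. The range is widened to
// include 0.0 so that zero stays exactly representable.
inline AffineParams affineFromRange(Range r) {
    const float lo = std::min(r.lo, 0.0f);
    const float hi = std::max(r.hi, 0.0f);
    float scale = (hi - lo) / 255.0f;
    if (scale == 0.0f) scale = 1.0f;  // degenerate all-zero tensor
    const auto zeroPoint = static_cast<int32_t>(std::lround(-128.0f - lo / scale));
    return AffineParams{scale, zeroPoint};
}
```

Compared with a plain min/max range, dropping the extreme tails shrinks `scale`, so the bulk of the distribution gets finer int8 resolution at the cost of saturating the clipped samples.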