|
| 1 | +# 16-Bit DWT Path for 5/3 Reversible Wavelet |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +The 16-bit DWT path performs the 5/3 reversible (lossless) inverse discrete |
| 6 | +wavelet transform entirely using `int16_t` arithmetic instead of `int32_t`. |
| 7 | +This halves memory usage and bandwidth for wavelet coefficient buffers, |
| 8 | +improving cache utilization and throughput for eligible images. |
| 9 | + |
| 10 | +## Architecture |
| 11 | + |
| 12 | +### Decision Point |
| 13 | + |
| 14 | +The 16-bit path is selected at runtime in `TileProcessor::decompressInit()` |
| 15 | +(TileProcessor.cpp) when all of the following hold: |
| 16 | + |
| 17 | +1. **Reversible wavelet** (`qmfbid == 1`) |
| 18 | +2. **Whole-tile decoding** (no region-of-interest partial decode) |
| 19 | +3. **Precision + headroom ≤ 16 bits**: |
| 20 | + - MCT components (inverse RCT): `prec + 5 ≤ 16` → max precision 11 bits |
| 21 | + - Non-MCT components (DC shift only): `prec + 4 ≤ 16` → max precision 12 bits |
| 22 | + |
| 23 | +### Data Flow |
| 24 | + |
| 25 | +``` |
| 26 | +Tier-1 decode (int32) |
| 27 | + → NarrowShiftFilter (int32 → int16, in PostDecodeFilters.h) |
| 28 | + → 16-bit DWT synthesis (WaveletReverse.cpp) |
| 29 | + → DC shift (fused into final store, or via mct.cpp for MCT) |
| 30 | + → MCT inverse RCT (mct.cpp DecompressRev16 / DecompressDcShiftRev16) |
| 31 | + → composite to output image (GrkImage.h compositePlanar with int16→int32 widening) |
| 32 | +``` |
| 33 | + |
| 34 | +### Key Files |
| 35 | + |
| 36 | +| File | Role | |
| 37 | +|------|------| |
| 38 | +| `TileProcessor.cpp` | 16-bit eligibility decision | |
| 39 | +| `WaveletReverse.cpp` | All DWT kernels (int32 and int16, scalar and HWY SIMD) | |
| 40 | +| `WaveletReverse.h` | 16-bit entry point declarations | |
| 41 | +| `PostDecodeFilters.h` | `NarrowShiftFilter` — int32→int16 narrowing after T1 decode | |
| 42 | +| `mct.cpp` | `DecompressRev16` / `DecompressDcShiftRev16` — int16 MCT+DC shift | |
| 43 | +| `GrkImage.h` | `compositePlanar()` — int16→int32 widening for multi-tile compositing | |
| 44 | +| `GrkImage.cpp` | `transferDataFrom_T()` — int16 buffer transfer | |
| 45 | +| `buffer.h` | Buffer type with `data_type` field for int16/int32 dispatch | |
| 46 | + |
| 47 | +## BIBO Gain Analysis |
| 48 | + |
| 49 | +The Bounded-Input-Bounded-Output (BIBO) gain determines the worst-case output |
| 50 | +range expansion through the inverse DWT filter bank. This analysis determines |
| 51 | +how many extra bits of headroom are needed beyond the image precision. |
| 52 | + |
| 53 | +### 5/3 Lifting Steps (Synthesis) |
| 54 | + |
| 55 | +The 5/3 reversible inverse DWT (synthesis) is performed as two lifting steps: |
| 56 | + |
| 57 | +1. **Undo update (even samples)**: `s[n] -= floor((d[n-1] + d[n] + 2) / 4)` |
| 58 | +2. **Undo predict (odd samples)**: `d[n] += floor((s[n] + s[n+1]) / 2)` |
| 59 | + |
| 60 | +where `s[]` are even (low-pass) samples and `d[]` are odd (high-pass) samples. |
| 61 | + |
| 62 | +### First-Principles BIBO Gain Derivation |
| 63 | + |
| 64 | +The BIBO gain is the worst-case ratio of output magnitude to input magnitude, |
| 65 | +computed by tracking the maximum possible amplification through each lifting |
| 66 | +step. |
| 67 | + |
| 68 | +**Single-level analysis:** |
| 69 | + |
| 70 | +Consider subband coefficients bounded by some maximum magnitude M. |
| 71 | + |
| 72 | +Step 1 (undo update): Each even sample `s[n]` is modified by subtracting |
| 73 | +`floor((d[n-1] + d[n] + 2) / 4)`. The subtracted quantity has magnitude |
| 74 | +at most `floor((M + M + 2) / 4) ≈ M/2`. So the even sample after this step |
| 75 | +has magnitude at most `M + M/2 = 3M/2`. |
| 76 | + |
| 77 | +Step 2 (undo predict): Each odd sample `d[n]` is modified by adding |
| 78 | +`floor((s[n] + s[n+1]) / 2)`. The added quantity has magnitude at most |
| 79 | +`floor((3M/2 + 3M/2) / 2) = 3M/2`. So the odd sample has magnitude |
| 80 | +at most `M + 3M/2 = 5M/2`. |
| 81 | + |
| 82 | +The worst-case 1D gain from a single synthesis level is therefore: |
| 83 | +- Even (low) output: 3/2 = 1.5 |
| 84 | +- Odd (high) output: 5/2 = 2.0 (but this is the absolute worst case; the |
| 85 | + typical bound is tighter since `d[n]` and the update correction are correlated) |
| 86 | + |
| 87 | +These match the first entries of the OJPH `bibo_gains` tables in |
| 88 | +`QuantizerOJPH.cpp`: |
| 89 | +``` |
| 90 | +gain_5x3_l[1] = 1.500 (low-pass output, 1 level) |
| 91 | +gain_5x3_h[0] = 2.000 (high-pass output, 0 additional levels) |
| 92 | +``` |
| 93 | + |
| 94 | +**Multi-level recursion:** |
| 95 | + |
| 96 | +For L decomposition levels, the LL subband is further decomposed. The gain |
| 97 | +builds recursively but converges because each additional level only affects the |
| 98 | +lowest-frequency subband through a filter with gain < 2, applied to an |
| 99 | +increasingly small fraction of the signal energy. |
| 100 | + |
| 101 | +The per-subband, per-direction gains from the OJPH `bibo_gains` class: |
| 102 | +``` |
| 103 | +gain_5x3_l[]: 1.000, 1.500, 1.625, 1.688, 1.696, 1.707, 1.712, ... → 1.716 |
| 104 | +gain_5x3_h[]: 2.000, 2.500, 2.750, 2.805, 2.820, 2.841, 2.856, ... → 2.867 |
| 105 | +``` |
| 106 | + |
| 107 | +The 2D worst-case gain to any subband is the product of horizontal and vertical |
| 108 | +gains. The overall BIBO gain (maximum across all subbands and both dimensions) |
| 109 | +converges: |
| 110 | +- For ≤5 decomposition levels: overall gain < 2^3 (= 8) |
| 111 | +- For >5 levels: gain slightly exceeds 2^3, converging to ~2^3.04 |
| 112 | + |
| 113 | +**Why it converges**: The update step has a transfer function that attenuates |
| 114 | +by 1/4 (the `>> 2` shift), and each additional decomposition level compounds |
| 115 | +this through an increasingly refined low-pass subband. The geometric series |
| 116 | +converges, yielding a finite asymptotic gain of approximately `2.867^2 ≈ 8.22` |
| 117 | +in 2D, which is `~2^3.04`. |
| 118 | + |
| 119 | +### Lifting Step Intermediate Safety |
| 120 | + |
| 121 | +The update step computes `floor((a + b + 2) / 4)` where `a` and `b` are |
| 122 | +odd-indexed (high-pass) samples. When implemented using an **averaging |
| 123 | +operation** (as in `update_avg_16_53()`), the intermediate value never exceeds |
| 124 | +the magnitude of the inputs — the hardware averaging instruction uses a wider |
| 125 | +internal accumulator, so there is no additional overflow risk beyond the BIBO |
| 126 | +gain itself. |
| 127 | + |
| 128 | +The predict step computes `d + floor((s_prev + s_next) / 2)` where `s_prev` |
| 129 | +and `s_next` are already-updated even samples. The intermediate sum |
| 130 | +`s_prev + s_next` (before the `>> 1` shift) has BIBO gain that can be shown |
| 131 | +to always remain below 2^3 (since each `s` value has gain at most ~1.716, |
| 132 | +and the sum is immediately halved). |
| 133 | + |
| 134 | +### Headroom Calculation |
| 135 | + |
| 136 | +For **16-bit synthesis**: |
| 137 | + |
| 138 | +- **Non-MCT** (`rct_comp ≤ 1`): **4 bits** of headroom. |
| 139 | + This covers ~3 bits for the DWT BIBO gain (< 2^3 for typical ≤5 levels) |
| 140 | + plus ~0.5–1 bit safety margin for quantization errors (reversibly processed |
| 141 | + content can be quantized by code-block truncation). |
| 142 | + Max precision: `prec + 4 ≤ 16` → prec ≤ 12. |
| 143 | + For 12-bit: output range 2^12 × 2^3 ≈ 2^15 fits in int16_t. |
| 144 | + |
| 145 | +- **MCT chrominance** (`rct_comp ≥ 2`): **5 bits** of headroom. |
| 146 | + The inverse RCT (Reversible Colour Transform) has additional BIBO gain |
| 147 | + of 1.5× for luminance and 2.0× for chrominance (Db/Dr) channels, |
| 148 | + since `Cr = B - G` and `Cb = R - G` can double the range, and |
| 149 | + `Y = G + floor((Cb + Cr) / 4)` adds up to 1.5× for luminance. |
| 150 | + Max precision: `prec + 5 ≤ 16` → prec ≤ 11. |
| 151 | + |
| 152 | +Note: Our implementation conservatively uses 5 bits for ALL MCT components |
| 153 | +(including luminance), simplifying the logic while remaining safe. |
| 154 | + |
| 155 | +## Overflow-Safe SIMD Averaging |
| 156 | + |
| 157 | +### The Problem |
| 158 | + |
| 159 | +The 5/3 DWT update and predict steps compute averages of two sample values: |
| 160 | + |
| 161 | +- **Update step**: `even -= floor((odd_prev + odd_next + 2) / 4)` |
| 162 | +- **Predict step**: `odd += floor((even_prev + even_next) / 2)` |
| 163 | + |
| 164 | +In scalar C++ code, `int16_t` operands are implicitly promoted to `int` before |
| 165 | +arithmetic — no overflow is possible. However, in **SIMD code (HWY/Highway)**, |
| 166 | +arithmetic stays within the vector lane width (int16). The intermediate sum |
| 167 | +`a + b` of two int16 values can overflow the int16 range. |
| 168 | + |
| 169 | +### Solution 1: Update Step — `update_avg_16_53()` |
| 170 | + |
| 171 | +Computes `floor((a + b + 2) / 4)` without overflow using unsigned hardware |
| 172 | +averaging. |
| 173 | + |
| 174 | +**Identity chain**: |
| 175 | +``` |
| 176 | +floor((a + b + 2) / 4) = (floor((a + b) / 2) + 1) >> 1 |
| 177 | +``` |
| 178 | + |
| 179 | +**Algorithm**: |
| 180 | +1. Convert signed → unsigned: `a_u = a XOR 0x8000` |
| 181 | +2. Bias b: `b_biased = b_unsigned + 0x7FFF` (equivalent to `b_unsigned - 1` mod 2^16) |
| 182 | +3. Hardware unsigned average: `AverageRound(a_u, b_biased)` = |
| 183 | + `floor((a_u + b_biased + 1) / 2)` = `floor((a_u + b_u) / 2)` |
| 184 | + (the +1 from AverageRound cancels the -1 from the bias) |
| 185 | +4. Unbias: `step = avg - 0x7FFF` → yields `floor((a+b)/2) + 1` in signed domain |
| 186 | +5. Final shift: `step >> 1` → `floor((a + b + 2) / 4)` |
| 187 | + |
| 188 | +The hardware `AverageRound` instruction (`_mm256_avg_epu16` on x86) uses a |
| 189 | +17-bit intermediate accumulator internally, so the sum never overflows. |
| 190 | + |
| 191 | +### Solution 2: Predict Step — `predict_avg_16_53()` |
| 192 | + |
| 193 | +Computes `floor((a + b) / 2)` without overflow using bit decomposition. |
| 194 | + |
| 195 | +**Identity**: |
| 196 | +``` |
| 197 | +(a + b) >> 1 = (a >> 1) + (b >> 1) + ((a & b) & 1) |
| 198 | +``` |
| 199 | + |
| 200 | +Each term is individually safe: |
| 201 | +- `a >> 1` and `b >> 1` are half the original range |
| 202 | +- `(a & b) & 1` is 0 or 1 (the "carry" from the lost LSBs) |
| 203 | + |
| 204 | +### Why Scalar Code Doesn't Need This |
| 205 | + |
| 206 | +In non-SIMD (scalar) C++ code, the standard integer promotion rules |
| 207 | +automatically widen `int16_t` operands to `int` (at least 32 bits) before any |
| 208 | +arithmetic operation. The intermediate sum `a + b` is computed in 32-bit |
| 209 | +precision, so no overflow occurs. Only the SIMD kernels need the overflow-safe |
| 210 | +averaging functions. |
0 commit comments