Skip to content

Commit 04935aa

Browse files
author
Grok Compression
committed
lossless decompression: add 16 bit pipeline
1 parent 2db974b commit 04935aa

30 files changed

Lines changed: 2210 additions & 331 deletions

doc/16BitDWT.md

Lines changed: 210 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,210 @@
1+
# 16-Bit DWT Path for 5/3 Reversible Wavelet
2+
3+
## Overview
4+
5+
The 16-bit DWT path performs the 5/3 reversible (lossless) inverse discrete
6+
wavelet transform entirely using `int16_t` arithmetic instead of `int32_t`.
7+
This halves memory usage and bandwidth for wavelet coefficient buffers,
8+
improving cache utilization and throughput for eligible images.
9+
10+
## Architecture
11+
12+
### Decision Point
13+
14+
The 16-bit path is selected at runtime in `TileProcessor::decompressInit()`
15+
(TileProcessor.cpp) when all of the following hold:
16+
17+
1. **Reversible wavelet** (`qmfbid == 1`)
18+
2. **Whole-tile decoding** (no region-of-interest partial decode)
19+
3. **Precision + headroom ≤ 16 bits**:
20+
- MCT components (inverse RCT): `prec + 5 ≤ 16` → max precision 11 bits
21+
- Non-MCT components (DC shift only): `prec + 4 ≤ 16` → max precision 12 bits
22+
23+
### Data Flow
24+
25+
```
26+
Tier-1 decode (int32)
27+
→ NarrowShiftFilter (int32 → int16, in PostDecodeFilters.h)
28+
→ 16-bit DWT synthesis (WaveletReverse.cpp)
29+
→ DC shift (fused into final store, or via mct.cpp for MCT)
30+
→ MCT inverse RCT (mct.cpp DecompressRev16 / DecompressDcShiftRev16)
31+
→ composite to output image (GrkImage.h compositePlanar with int16→int32 widening)
32+
```
33+
34+
### Key Files
35+
36+
| File | Role |
37+
|------|------|
38+
| `TileProcessor.cpp` | 16-bit eligibility decision |
39+
| `WaveletReverse.cpp` | All DWT kernels (int32 and int16, scalar and HWY SIMD) |
40+
| `WaveletReverse.h` | 16-bit entry point declarations |
41+
| `PostDecodeFilters.h` | `NarrowShiftFilter` — int32→int16 narrowing after T1 decode |
42+
| `mct.cpp` | `DecompressRev16` / `DecompressDcShiftRev16` — int16 MCT+DC shift |
43+
| `GrkImage.h` | `compositePlanar()` — int16→int32 widening for multi-tile compositing |
44+
| `GrkImage.cpp` | `transferDataFrom_T()` — int16 buffer transfer |
45+
| `buffer.h` | Buffer type with `data_type` field for int16/int32 dispatch |
46+
47+
## BIBO Gain Analysis
48+
49+
The Bounded-Input-Bounded-Output (BIBO) gain determines the worst-case output
50+
range expansion through the inverse DWT filter bank. This analysis determines
51+
how many extra bits of headroom are needed beyond the image precision.
52+
53+
### 5/3 Lifting Steps (Synthesis)
54+
55+
The 5/3 reversible inverse DWT (synthesis) is performed as two lifting steps:
56+
57+
1. **Undo update (even samples)**: `s[n] -= floor((d[n-1] + d[n] + 2) / 4)`
58+
2. **Undo predict (odd samples)**: `d[n] += floor((s[n] + s[n+1]) / 2)`
59+
60+
where `s[]` are even (low-pass) samples and `d[]` are odd (high-pass) samples.
61+
62+
### First-Principles BIBO Gain Derivation
63+
64+
The BIBO gain is the worst-case ratio of output magnitude to input magnitude,
65+
computed by tracking the maximum possible amplification through each lifting
66+
step.
67+
68+
**Single-level analysis:**
69+
70+
Consider subband coefficients bounded by some maximum magnitude M.
71+
72+
Step 1 (undo update): Each even sample `s[n]` is modified by subtracting
73+
`floor((d[n-1] + d[n] + 2) / 4)`. The subtracted quantity has magnitude
74+
at most `floor((M + M + 2) / 4) ≈ M/2`. So the even sample after this step
75+
has magnitude at most `M + M/2 = 3M/2`.
76+
77+
Step 2 (undo predict): Each odd sample `d[n]` is modified by adding
78+
`floor((s[n] + s[n+1]) / 2)`. The added quantity has magnitude at most
79+
`floor((3M/2 + 3M/2) / 2) = 3M/2`. So the odd sample has magnitude
80+
at most `M + 3M/2 = 5M/2`.
81+
82+
The worst-case 1D gain from a single synthesis level is therefore:
83+
- Even (low) output: 3/2 = 1.5
84+
- Odd (high) output: 5/2 = 2.0 (but this is the absolute worst case; the
85+
typical bound is tighter since `d[n]` and the update correction are correlated)
86+
87+
These match the first entries of the OJPH `bibo_gains` tables in
88+
`QuantizerOJPH.cpp`:
89+
```
90+
gain_5x3_l[1] = 1.500 (low-pass output, 1 level)
91+
gain_5x3_h[0] = 2.000 (high-pass output, 0 additional levels)
92+
```
93+
94+
**Multi-level recursion:**
95+
96+
For L decomposition levels, the LL subband is further decomposed. The gain
97+
builds recursively but converges because each additional level only affects the
98+
lowest-frequency subband through a filter with gain < 2, applied to an
99+
increasingly small fraction of the signal energy.
100+
101+
The per-subband, per-direction gains from the OJPH `bibo_gains` class:
102+
```
103+
gain_5x3_l[]: 1.000, 1.500, 1.625, 1.688, 1.696, 1.707, 1.712, ... → 1.716
104+
gain_5x3_h[]: 2.000, 2.500, 2.750, 2.805, 2.820, 2.841, 2.856, ... → 2.867
105+
```
106+
107+
The 2D worst-case gain to any subband is the product of horizontal and vertical
108+
gains. The overall BIBO gain (maximum across all subbands and both dimensions)
109+
converges:
110+
- For ≤5 decomposition levels: overall gain < 2^3 (= 8)
111+
- For >5 levels: gain slightly exceeds 2^3, converging to ~2^3.04
112+
113+
**Why it converges**: The update step has a transfer function that attenuates
114+
by 1/4 (the `>> 2` shift), and each additional decomposition level compounds
115+
this through an increasingly refined low-pass subband. The geometric series
116+
converges, yielding a finite asymptotic gain of approximately `2.867^2 ≈ 8.22`
117+
in 2D, which is `~2^3.04`.
118+
119+
### Lifting Step Intermediate Safety
120+
121+
The update step computes `floor((a + b + 2) / 4)` where `a` and `b` are
122+
odd-indexed (high-pass) samples. When implemented using an **averaging
123+
operation** (as in `update_avg_16_53()`), the intermediate value never exceeds
124+
the magnitude of the inputs — the hardware averaging instruction uses a wider
125+
internal accumulator, so there is no additional overflow risk beyond the BIBO
126+
gain itself.
127+
128+
The predict step computes `d + floor((s_prev + s_next) / 2)` where `s_prev`
129+
and `s_next` are already-updated even samples. The intermediate sum
130+
`s_prev + s_next` (before the `>> 1` shift) has BIBO gain that can be shown
131+
to always remain below 2^3 (since each `s` value has gain at most ~1.716,
132+
and the sum is immediately halved).
133+
134+
### Headroom Calculation
135+
136+
For **16-bit synthesis**:
137+
138+
- **Non-MCT** (`rct_comp ≤ 1`): **4 bits** of headroom.
139+
This covers ~3 bits for the DWT BIBO gain (< 2^3 for typical ≤5 levels)
140+
plus ~0.5–1 bit safety margin for quantization errors (reversibly processed
141+
content can be quantized by code-block truncation).
142+
Max precision: `prec + 4 ≤ 16` → prec ≤ 12.
143+
For 12-bit: output range 2^12 × 2^3 ≈ 2^15 fits in int16_t.
144+
145+
- **MCT chrominance** (`rct_comp ≥ 2`): **5 bits** of headroom.
146+
The inverse RCT (Reversible Colour Transform) has additional BIBO gain
147+
of 1.5× for luminance and 2.0× for chrominance (Db/Dr) channels,
148+
since `Cr = B - G` and `Cb = R - G` can double the range, and
149+
`Y = G + floor((Cb + Cr) / 4)` adds up to 1.5× for luminance.
150+
Max precision: `prec + 5 ≤ 16` → prec ≤ 11.
151+
152+
Note: Our implementation conservatively uses 5 bits for ALL MCT components
153+
(including luminance), simplifying the logic while remaining safe.
154+
155+
## Overflow-Safe SIMD Averaging
156+
157+
### The Problem
158+
159+
The 5/3 DWT update and predict steps compute averages of two sample values:
160+
161+
- **Update step**: `even -= floor((odd_prev + odd_next + 2) / 4)`
162+
- **Predict step**: `odd += floor((even_prev + even_next) / 2)`
163+
164+
In scalar C++ code, `int16_t` operands are implicitly promoted to `int` before
165+
arithmetic — no overflow is possible. However, in **SIMD code (HWY/Highway)**,
166+
arithmetic stays within the vector lane width (int16). The intermediate sum
167+
`a + b` of two int16 values can overflow the int16 range.
168+
169+
### Solution 1: Update Step — `update_avg_16_53()`
170+
171+
Computes `floor((a + b + 2) / 4)` without overflow using unsigned hardware
172+
averaging.
173+
174+
**Identity chain**:
175+
```
176+
floor((a + b + 2) / 4) = (floor((a + b) / 2) + 1) >> 1
177+
```
178+
179+
**Algorithm**:
180+
1. Convert signed → unsigned: `a_u = a XOR 0x8000`
181+
2. Bias b: `b_biased = b_unsigned + 0x7FFF` (equivalent to `b_unsigned - 1` mod 2^16)
182+
3. Hardware unsigned average: `AverageRound(a_u, b_biased)` =
183+
`floor((a_u + b_biased + 1) / 2)` = `floor((a_u + b_u) / 2)`
184+
(the +1 from AverageRound cancels the -1 from the bias)
185+
4. Unbias: `step = avg - 0x7FFF` → yields `floor((a+b)/2) + 1` in signed domain
186+
5. Final shift: `step >> 1``floor((a + b + 2) / 4)`
187+
188+
The hardware `AverageRound` instruction (`_mm256_avg_epu16` on x86) uses a
189+
17-bit intermediate accumulator internally, so the sum never overflows.
190+
191+
### Solution 2: Predict Step — `predict_avg_16_53()`
192+
193+
Computes `floor((a + b) / 2)` without overflow using bit decomposition.
194+
195+
**Identity**:
196+
```
197+
(a + b) >> 1 = (a >> 1) + (b >> 1) + ((a & b) & 1)
198+
```
199+
200+
Each term is individually safe:
201+
- `a >> 1` and `b >> 1` are half the original range
202+
- `(a & b) & 1` is 0 or 1 (the "carry" from the lost LSBs)
203+
204+
### Why Scalar Code Doesn't Need This
205+
206+
In non-SIMD (scalar) C++ code, the standard integer promotion rules
207+
automatically widen `int16_t` operands to `int` (at least 32 bits) before any
208+
arithmetic operation. The intermediate sum `a + b` is computed in 32-bit
209+
precision, so no overflow occurs. Only the SIMD kernels need the overflow-safe
210+
averaging functions.

src/lib/codec/formats/BMPFormat.h

Lines changed: 6 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -351,10 +351,14 @@ bool BMPFormat<T>::writeImageBand(uint32_t yBegin, uint32_t yEnd)
351351
for(uint32_t i = 0; i < w; i++)
352352
{
353353
uint8_t rc[4] = {0, 0, 0, 0};
354+
bool isInt16 = image_->comps[0].data_type == GRK_INT_16;
354355
for(uint16_t compno = 0; compno < decompress_num_comps; ++compno)
355356
{
356-
auto src = (T*)image_->comps[compno].data;
357-
T r = src[srcIndex_ + i];
357+
int32_t r;
358+
if(isInt16)
359+
r = ((int16_t*)image_->comps[compno].data)[srcIndex_ + i];
360+
else
361+
r = ((T*)image_->comps[compno].data)[srcIndex_ + i];
358362
r += shift[compno];
359363
if(scaleType[compno] == 1)
360364
r *= scale[compno];

src/lib/codec/formats/JPEGFormat.h

Lines changed: 44 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -124,6 +124,7 @@ class JPEGFormat : public ImageFormat
124124
struct jpeg_compress_struct cinfo;
125125
T const* planes[4];
126126
grk::PlanarToInterleaved<T>* interleaver_;
127+
grk::PlanarToInterleaved<int16_t>* interleaver16_;
127128
};
128129

129130
template<typename T>
@@ -384,13 +385,14 @@ grk_image* JPEGFormat<T>::jpegtoimage(const char* filename, grk_cparameters* par
384385
template<typename T>
385386
JPEGFormat<T>::JPEGFormat(void)
386387
: success(true), buffer(nullptr), buffer32s(nullptr), color_space(JCS_UNKNOWN), adjust(0),
387-
readFromStdin(false), planes{0, 0, 0}, interleaver_(nullptr)
388+
readFromStdin(false), planes{0, 0, 0}, interleaver_(nullptr), interleaver16_(nullptr)
388389
{}
389390

390391
template<typename T>
391392
JPEGFormat<T>::~JPEGFormat()
392393
{
393394
delete interleaver_;
395+
delete interleaver16_;
394396
}
395397

396398
template<typename T>
@@ -610,23 +612,50 @@ bool JPEGFormat<T>::writeImageBand(uint32_t yBegin, uint32_t yEnd)
610612
cinfo.next_scanline);
611613
return false;
612614
}
613-
if(!interleaver_)
615+
auto stride = image_->comps[0].stride;
616+
bool isInt16 = image_->comps[0].data_type == GRK_INT_16;
617+
if(isInt16)
614618
{
615-
interleaver_ = grk::InterleaverFactory<T>::makeInterleaver(8);
616-
if(!interleaver_)
617-
return false;
619+
if(!interleaver16_)
620+
{
621+
interleaver16_ = grk::InterleaverFactory<int16_t>::makeInterleaver(8);
622+
if(!interleaver16_)
623+
return false;
624+
}
625+
int16_t* bandPlanes16[4];
626+
for(uint16_t i = 0; i < image_->decompress_num_comps; ++i)
627+
bandPlanes16[i] = (int16_t*)image_->comps[i].data + (uint64_t)yBegin * stride;
628+
int16_t adjust16 = (int16_t)adjust;
629+
while(cinfo.next_scanline < yEnd)
630+
{
631+
interleaver16_->interleave(bandPlanes16, image_->decompress_num_comps, (uint8_t*)buffer,
632+
image_->decompress_width, stride, image_->decompress_width, 1,
633+
adjust16);
634+
JSAMPROW row_pointer[1];
635+
row_pointer[0] = buffer;
636+
jpeg_write_scanlines(&cinfo, row_pointer, 1);
637+
}
618638
}
619-
auto stride = image_->comps[0].stride;
620-
T* bandPlanes[4];
621-
for(uint16_t i = 0; i < image_->decompress_num_comps; ++i)
622-
bandPlanes[i] = (T*)image_->comps[i].data + (uint64_t)yBegin * stride;
623-
while(cinfo.next_scanline < yEnd)
639+
else
624640
{
625-
interleaver_->interleave(bandPlanes, image_->decompress_num_comps, (uint8_t*)buffer,
626-
image_->decompress_width, stride, image_->decompress_width, 1, adjust);
627-
JSAMPROW row_pointer[1];
628-
row_pointer[0] = buffer;
629-
jpeg_write_scanlines(&cinfo, row_pointer, 1);
641+
if(!interleaver_)
642+
{
643+
interleaver_ = grk::InterleaverFactory<T>::makeInterleaver(8);
644+
if(!interleaver_)
645+
return false;
646+
}
647+
T* bandPlanes[4];
648+
for(uint16_t i = 0; i < image_->decompress_num_comps; ++i)
649+
bandPlanes[i] = (T*)image_->comps[i].data + (uint64_t)yBegin * stride;
650+
while(cinfo.next_scanline < yEnd)
651+
{
652+
interleaver_->interleave(bandPlanes, image_->decompress_num_comps, (uint8_t*)buffer,
653+
image_->decompress_width, stride, image_->decompress_width, 1,
654+
adjust);
655+
JSAMPROW row_pointer[1];
656+
row_pointer[0] = buffer;
657+
jpeg_write_scanlines(&cinfo, row_pointer, 1);
658+
}
630659
}
631660

632661
return true;

src/lib/codec/formats/PGXFormat.h

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -265,16 +265,16 @@ bool PGXFormat<T>::writeImage(void)
265265
size_t outCount = 0;
266266
size_t index = 0;
267267
uint32_t stride_diff = comp->stride - w;
268+
bool isInt16 = comp->data_type == GRK_INT_16;
268269
if(nbytes == 1)
269270
{
270-
auto src = (T*)comp->data;
271271
uint8_t buf[bufSize];
272272
uint8_t* outPtr = buf;
273273
for(uint32_t j = 0; j < h; ++j)
274274
{
275275
for(uint32_t i = 0; i < w; ++i)
276276
{
277-
const T val = src[index++];
277+
const int32_t val = isInt16 ? ((int16_t*)comp->data)[index++] : ((T*)comp->data)[index++];
278278
if(!grk::writeBytes<uint8_t>((uint8_t)val, buf, &outPtr, &outCount, bufSize, true,
279279
fileIO_->getFileHandle()))
280280
{
@@ -298,12 +298,12 @@ bool PGXFormat<T>::writeImage(void)
298298
{
299299
uint16_t buf[bufSize];
300300
uint16_t* outPtr = buf;
301-
auto src = (T*)image_->comps[compno].data;
302301
for(uint32_t j = 0; j < h; ++j)
303302
{
304303
for(uint32_t i = 0; i < w; ++i)
305304
{
306-
const T val = src[index++];
305+
const int32_t val = isInt16 ? ((int16_t*)image_->comps[compno].data)[index++]
306+
: ((T*)image_->comps[compno].data)[index++];
307307
if(!grk::writeBytes<uint16_t>((uint16_t)val, buf, &outPtr, &outCount, bufSize, true,
308308
fileIO_->getFileHandle()))
309309
{

0 commit comments

Comments
 (0)