Context
j2k_dequant() reads the per-sample decoded-bit-plane index p from block_states (block_dequant.cpp:61, 91):
```cpp
N_b = 30 - (state >> 3) + 1;
```
This is the only remaining dependency on block_states once #315 is done. If we can get p from somewhere else, the block_states buffer can be deleted entirely — saving one allocation/zero per codeblock and its ~4 KB of memory traffic in the hot path.
Book recipe (§17.1.3)
Instead of storing p in a separate byte-per-sample array, embed it in sample_buf itself via a marker bit:
When a sample first becomes significant (σ transitions 0 → 1), set a marker bit in sample_buf[j2 + j1*sstride] at position p_LSB - 1 (one below the current bit-plane's LSB). This happens in sigprop's samples[...] |= 1 << p path and cleanup's equivalent.
Dequant recovers p by counting trailing zeros of the sample (after masking off the sign bit), which identifies the lowest set bit — the marker.
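A minimal scalar sketch of the recipe above, assuming a 32-bit sample word with the sign in bit 31 and magnitude planes in bits 30..0 (the text only says the sign bit is masked off, not where it sits). mark_significant, recover_p, n_b_from_sample and trailing_zeros are hypothetical names, not functions from the codebase. The sketch covers only the σ 0 → 1 transition; how later refinement passes interact with the marker is not modelled. n_b_from_sample additionally assumes the recovered p is the same index dequant currently reads as state >> 3.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: bit 31 = sign, bits 30..0 = magnitude bit-planes.
constexpr uint32_t kSignBit = 0x80000000u;

// Portable trailing-zero count for a non-zero word.
// A real kernel would presumably use a hardware instruction instead.
inline int trailing_zeros(uint32_t x) {
  assert(x != 0);
  int n = 0;
  while ((x & 1u) == 0) {
    x >>= 1;
    ++n;
  }
  return n;
}

// Called when a sample becomes significant at bit-plane p (p >= 1).
inline uint32_t mark_significant(uint32_t sample, int p, bool negative) {
  assert(p >= 1 && p <= 30);
  sample |= 1u << p;        // magnitude bit decoded at this plane
  sample |= 1u << (p - 1);  // marker bit, one position below
  if (negative) sample |= kSignBit;
  return sample;
}

// Recovers p from the marker; returns -1 for a sample that never
// became significant (no magnitude bits set).
inline int recover_p(uint32_t sample) {
  const uint32_t mag = sample & ~kSignBit;  // mask off the sign bit
  if (mag == 0) return -1;
  return trailing_zeros(mag) + 1;  // marker sits at p - 1
}

// Same arithmetic as N_b = 30 - (state >> 3) + 1, with p taken from
// the sample instead of the state byte (assumes both denote the same p).
inline int n_b_from_sample(uint32_t sample) {
  const int p = recover_p(sample);
  assert(p >= 1);
  return 30 - p + 1;
}
```

For example, a sample that becomes significant at p = 5 holds 0x30 in its magnitude bits, and recover_p returns 5 regardless of sign.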
Scope
block_decoding.cpp: at each σ=1 transition, also OR the marker bit (1 << (p - 1) is fine in the natural flow since p >= 1 at decode time).
block_dequant.cpp + NEON/AVX2/AVX512 variants: replace the block_states byte load with a trailing-zero-count on the decoded sample.
coding_units.cpp + subband_row_buf.cpp: delete block_states / blkstate_stride / all allocation and zeroing for the Part 1 decode path. HT keeps its own block_states (different bit semantics, out of scope).
j2k_codeblock: delete the block_states / blkstate_stride fields (Part 1 only; HT uses a separate allocation path after #315, "Part 1 decode: full consumer port of σ/σ̄/π/χ̂ onto packed stripe-column word", migrates them).
Why this is not shipped yet
Without #315 in place, block_states still carries σ/σ̄/π/χ̂ for several consumers, so deleting it isn't possible. D alone on top of the current branch would save only the single block_states byte load in dequant. Dequant is already SIMD'd, so the ceiling is ~2–5 % and the cost of touching all dequant kernels isn't justified.
D becomes attractive once #315 is merged: at that point block_states holds only the p-index, and D lets us delete the whole buffer.
Dependency graph
Enables: deleting block_states for Part 1 entirely. One aligned_mem_alloc + one memset per precinct per DWT level go away, plus the corresponding memory footprint (up to a few MB for large tiles).
Not applicable to: Part 15 HT (different block_states semantics, own allocation path).
Acceptance
[…] block_states allocation / zeroing / loads.
[…] p0_07.j2k / p0_08.j2k (many bit-planes, worst case for marker-bit recovery).
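The marker-bit recovery can also be exercised on its own, before running the conformance streams. The sketch below sweeps every plane p in 1..30 for both signs and compares two trailing-zero formulations: a reference loop, and popcount((x & -x) - 1), which may be easier to map onto SIMD targets that lack a per-lane trailing-zero instruction. The sign-in-bit-31 layout and all function names are assumptions for illustration; this does not replace the p0_07.j2k / p0_08.j2k checks.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: bit 31 = sign, bits 30..0 = magnitude bit-planes.
constexpr uint32_t kSignBit = 0x80000000u;

// Reference: loop-based trailing-zero count (x must be non-zero).
inline int tz_reference(uint32_t x) {
  assert(x != 0);
  int n = 0;
  while ((x & 1u) == 0) {
    x >>= 1;
    ++n;
  }
  return n;
}

// Portable population count (clears the lowest set bit each iteration).
inline int popcount32(uint32_t x) {
  int n = 0;
  while (x != 0) {
    x &= x - 1;
    ++n;
  }
  return n;
}

// Alternative: isolate the lowest set bit, then count the ones below it.
inline int tz_via_popcount(uint32_t x) {
  assert(x != 0);
  const uint32_t lowest = x & (0u - x);  // only the lowest set bit survives
  return popcount32(lowest - 1u);        // ones strictly below that bit
}

// Returns the number of mismatches over all planes and both signs.
inline int check_all_planes() {
  int mismatches = 0;
  for (int p = 1; p <= 30; ++p) {
    for (int s = 0; s < 2; ++s) {
      const uint32_t sign = s ? kSignBit : 0u;
      // Sample as left by the significance transition at plane p.
      const uint32_t sample = sign | (1u << p) | (1u << (p - 1));
      const uint32_t mag = sample & ~kSignBit;
      if (tz_reference(mag) + 1 != p) ++mismatches;
      if (tz_via_popcount(mag) + 1 != p) ++mismatches;
    }
  }
  return mismatches;
}
```

A zero return from check_all_planes() only says the two formulations agree with the marker position on freshly marked samples; samples touched by later coding passes are outside what this sketch models.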