Describe what you are looking for
Feature request: Stringzilla LUT — faster than OpenCV, float32 LUT/output, per-channel LUT
Summary
We use Stringzilla’s LUT (via sz.translate / sz_lut) in Albucore and AlbumentationsX for uint8→uint8 lookup. We need:
- Faster than OpenCV: LUT apply should be faster than cv2.LUT (or at least not slower).
- uint8 image + float32 LUT → float32 image: so we can feed a float32 LUT and get float32 output (e.g. for normalization: uint8 image normalized to float32 in one pass without a separate convert-then-normalize path).
- Per-channel LUT without a loop: accept a LUT array of shape (C, N) (e.g. C channels, 256 entries) and apply channel c’s LUT to channel c in one call, instead of looping over channels in Python.
Current usage
Albucore
- sz_lut(img, lut, inplace): uint8 image, uint8 LUT (256 entries), uses sz.translate. Used for add/multiply/power LUT-based ops, and from Albumentations for any uint8 LUT path.
- Per-channel LUT: When the LUT is per-channel (e.g. shape (256, C) or (C, 256)), the code loops over channels: result[..., i] = sz_lut(img[..., i], luts[i], inplace). Same in apply_lut for per-channel value.
- Normalization: For mean-std or min-max normalization on uint8, Albucore builds a float32 LUT and uses cv2.LUT (not Stringzilla) because sz_lut is uint8-only. So we need a path that accepts a float32 LUT and outputs a float32 image.
AlbumentationsX
We use sz_lut in several transforms. Categorized below. We also need LUT to accept any input shape: (H, W, C), (N, H, W, C), (D, H, W, C), or (N, D, H, W, C). Ideally without reshape; today we reshape to a contiguous last-dimension and it works, but native support for these shapes would be better.
Working well (single LUT, no loop) — no need to change:
- HueSaturationValue — hue shift via single LUT on H channel.
- Solarize — single threshold LUT.
- Posterize — when same bits for all channels; single LUT.
- Equalize — single-channel equalization (PIL and CV paths); one LUT per call.
- MoveToneCurve — when tone curve is scalar (same for all channels); single LUT.
- RandomGamma — single gamma LUT.
- RandomDither — binary or single quantization LUT (same for all channels).
Application in loop (need improvement — per-channel LUT in one call):
- Posterize: when bits are per-channel (e.g. bits=[3, 4, 5]); we loop over channels and call sz_lut per channel.
- MoveToneCurve: when low_y / high_y are per-channel arrays; we build LUTs of shape (C, 256) and loop: result[..., i] = sz_lut(img[..., i], luts[i], ...).
- AutoContrast: per-channel LUT from histogram bounds; loop over channels.
- EqualizeHistogram: per-channel equalization; loop over channels.
- OrderedDither: multi-level dithering; we reuse the same LUT but still loop over channels.
Need float32 output (need improvement):
- RandomBrightnessContrast / BrightnessContrast: we build a float32 LUT, then cast it to uint8 and call sz_lut; we lose precision. We want uint8 image + float32 LUT → float32 image when the output is float.
- Normalize: uint8 image normalized to float32 (mean-std or min-max); needs float32 LUT → float32 image in one pass (currently Albucore uses cv2.LUT for this path).
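To make the precision loss concrete, here is a minimal NumPy sketch. It does not call Stringzilla or Albucore; plain array indexing stands in for the LUT apply. The alpha and beta values and the [0, 1] output range are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical contrast (alpha) and brightness (beta) values, for illustration only.
alpha, beta = 1.1, 0.05

# Float32 LUT: entry v holds the adjusted value of uint8 input v, in [0, 1].
levels = np.arange(256, dtype=np.float32) / 255.0
lut_f32 = np.clip(alpha * levels + beta, 0.0, 1.0).astype(np.float32)

# Workaround described above: quantize the LUT to uint8 so a uint8-only LUT can apply it.
lut_u8 = np.round(lut_f32 * 255.0).astype(np.uint8)

img = np.random.default_rng(0).integers(0, 256, size=(4, 5, 3), dtype=np.uint8)

# Path 1: uint8 LUT, then convert the result to float (quantization error baked in).
out_via_u8 = lut_u8[img].astype(np.float32) / 255.0
# Path 2: requested behavior, uint8 image indexed through the float32 LUT directly.
out_via_f32 = lut_f32[img]

# The uint8 path can be off by up to half a quantization step (0.5 / 255).
max_err = float(np.abs(out_via_u8 - out_via_f32).max())
print(f"max abs difference: {max_err:.6f}")
```

The float32 path keeps the exact LUT values, which is why a native uint8 image + float32 LUT → float32 image call would avoid the rounding step.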
What we need
1. Performance
- LUT apply should be faster than OpenCV cv2.LUT for the same input (uint8 image, 256-element LUT). We already use Stringzilla for uint8/uint8; keeping or improving that advantage is important.
2. uint8 image + float32 LUT → float32 image
- Input: uint8 image of shape (H, W), (H, W, C), (N, H, W, C), (D, H, W, C), or (N, D, H, W, C); float32 LUT of length 256 (values in any range, e.g. normalized [0, 1] or mean-std normalized).
- Output: float32 image of the same shape; out[..., c] = lut[img[..., c]] (last dimension is channel when present).
- Use case: Normalization (mean-std, min-max) and any pipeline that wants to map uint8 indices through a float LUT without converting the whole image to float first, so that “uint8 in → float32 out” is a single LUT call.
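As a reference for the requested semantics (not a proposed implementation), the behavior can be written in plain NumPy. The function name lut_u8_to_f32 and the mean/std values are hypothetical; the point is only that indexing a 256-entry float32 LUT gives the same result as convert-then-normalize, for any input shape.

```python
import numpy as np

def lut_u8_to_f32(img: np.ndarray, lut: np.ndarray) -> np.ndarray:
    """Reference semantics only: out = lut[img], uint8 in, float32 out."""
    if img.dtype != np.uint8:
        raise TypeError("img must be uint8")
    lut = np.asarray(lut, dtype=np.float32)
    if lut.shape != (256,):
        raise ValueError("lut must have exactly 256 entries")
    # Fancy indexing works for any image shape and preserves it.
    return lut[img]

# Hypothetical mean-std normalization parameters, for illustration only.
mean, std = 0.485, 0.229

# Precompute the normalized value for each of the 256 possible inputs.
lut = (np.arange(256, dtype=np.float32) / 255.0 - mean) / std

# A (N, D, H, W, C) batch; no reshape is needed on the caller's side.
img = np.random.default_rng(0).integers(0, 256, size=(2, 3, 4, 5, 3), dtype=np.uint8)

out = lut_u8_to_f32(img, lut)

# The conventional two-step path: convert to float, then normalize.
expected = (img.astype(np.float32) / 255.0 - mean) / std
print(out.dtype, out.shape, np.allclose(out, expected, atol=1e-6))
```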
3. Per-channel LUT: shape (C, N) in one call
- Input: uint8 image of shape (H, W, C), (N, H, W, C), (D, H, W, C), or (N, D, H, W, C); LUT array of shape (C, 256) (or (C, N) with N=256): one LUT per channel.
- Output: same shape uint8 (or float32 if the LUT is float32); channel c is transformed by luts[c, :].
- Current workaround: We do a Python loop: for c in range(C): result[..., c] = sz_lut(img[..., c], luts[c], ...). We want a single API call that applies each channel’s LUT to that channel (no loop), so the backend can vectorize over channels.
If the same API supports both “single LUT (1D)” and “per-channel LUT (C, 256)”, that would cover our use cases. All of the above should accept any input shape (H, W, C), (N, H, W, C), (D, H, W, C), or (N, D, H, W, C) without the caller having to reshape; we currently reshape to a contiguous last-dimension and it works, but native support for these shapes is preferred.
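The combined behavior (1D LUT or per-channel (C, 256) LUT, any leading dimensions) can be sketched in NumPy as a semantic reference. The name apply_lut_reference is hypothetical and this is not how Stringzilla or Albucore implement it; it only pins down what a single call would be expected to return.

```python
import numpy as np

def apply_lut_reference(img: np.ndarray, lut: np.ndarray) -> np.ndarray:
    """Reference semantics: 1D LUT for all values, or (C, 256) LUT per channel."""
    if img.dtype != np.uint8:
        raise TypeError("img must be uint8")
    lut = np.asarray(lut)
    if lut.ndim == 1:
        if lut.shape != (256,):
            raise ValueError("1D lut must have 256 entries")
        return lut[img]
    channels = img.shape[-1]
    if lut.shape != (channels, 256):
        raise ValueError("per-channel lut must have shape (C, 256)")
    # np.arange(C) broadcasts against the last axis of img, so
    # out[..., c] = lut[c, img[..., c]] for any number of leading dimensions.
    return lut[np.arange(channels), img]

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(2, 3, 4, 5, 3), dtype=np.uint8)  # (N, D, H, W, C)
luts = rng.integers(0, 256, size=(3, 256), dtype=np.uint8)        # one LUT per channel

out = apply_lut_reference(img, luts)

# The Python loop used today, with NumPy indexing standing in for sz_lut.
looped = np.empty_like(img)
for c in range(img.shape[-1]):
    looped[..., c] = luts[c][img[..., c]]

print(out.shape, out.dtype, np.array_equal(out, looped))
```

The output dtype follows the LUT dtype, so passing a float32 array of shape (C, 256) to the same function gives the float32 per-channel case.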
Summary table
| Feature | Current | Request |
| --- | --- | --- |
| Speed | Stringzilla used for uint8/uint8 | Stay faster than OpenCV cv2.LUT |
| Dtypes | uint8 image + uint8 LUT → uint8 | Add: uint8 image + float32 LUT → float32 image (e.g. for normalization) |
| Per-channel | Loop in Python: one sz_lut per channel | Accept LUT shape (C, 256) and apply in one call (no loop) |
Why this helps
- Normalization: One pass uint8→float32 with a float32 LUT instead of convert-to-float then normalize (or using cv2.LUT in a separate code path). Same pipeline, better performance and simpler code.
- Per-channel ops: Equalize, tone curve, reduce_bits, and any per-channel LUT run in a loop today; a single (C, 256) LUT call would remove the loop and let the backend optimize (SIMD over channels, etc.).
If this is implemented in Stringzilla (or exposed in Albucore on top of it), we would use it in Albucore and AlbumentationsX to simplify code, improve performance, and support uint8→float32 LUT and per-channel LUT without Python loops.
Can you contribute to the implementation?
Is your feature request specific to a certain interface?
It applies to everything
Contact Details
No response
Is there an existing issue for this?
Code of Conduct