Feature: Stringzilla LUT — faster than OpenCV, float32 LUT/output, per-channel LUT #302

### Describe what you are looking for

# Feature request: Stringzilla LUT — faster than OpenCV, float32 LUT/output, per-channel LUT

## Summary

We use Stringzilla’s LUT (via `sz.translate` / `sz_lut`) in [Albucore](https://github.com/albumentations-team/albucore) and [AlbumentationsX](https://github.com/albumentations-team/albumentations) for uint8→uint8 lookup. We need:

1. **Faster than OpenCV** — LUT apply should be faster than `cv2.LUT` (or at least not slower).
2. **uint8 image + float32 LUT → float32 image** — so we can feed a float32 LUT and get float32 output (e.g. for normalization: uint8 image normalized to float32 in one pass without a separate convert-then-normalize path).
3. **Per-channel LUT without a loop** — accept a LUT array of shape **(C, N)** (e.g. C channels, 256 entries) and apply channel `c`’s LUT to channel `c` in one call, instead of looping over channels in Python.

---

## Current usage

### Albucore

- **`sz_lut(img, lut, inplace)`** — uint8 image, uint8 LUT (256 entries), uses `sz.translate`. Used for add/multiply/power LUT-based ops, and from Albumentations for any uint8 LUT path.
- **Per-channel LUT:** When the LUT is per-channel (e.g. shape `(256, C)` or `(C, 256)`), the code **loops** over channels: `result[..., i] = sz_lut(img[..., i], luts[i], inplace)`. `apply_lut` does the same when it is given a per-channel value.
- **Normalization:** For mean-std or min-max normalization on uint8, Albucore builds a **float32** LUT and uses **`cv2.LUT`** (not Stringzilla) because `sz_lut` is uint8-only. We therefore need a path that accepts a float32 LUT and outputs a float32 image.

### AlbumentationsX

We use `sz_lut` in several transforms, categorized below. We also need the LUT call to accept **any input shape**: `(H, W, C)`, `(N, H, W, C)`, `(D, H, W, C)`, or `(N, D, H, W, C)`, ideally without a reshape. Today we reshape to a layout with a contiguous last dimension, which works, but native support for these shapes would be better.

**Working well (single LUT, no loop) — no need to change:**
- **HueSaturationValue** — hue shift via single LUT on H channel.
- **Solarize** — single threshold LUT.
- **Posterize** — when same bits for all channels; single LUT.
- **Equalize** — single-channel equalization (PIL and CV paths); one LUT per call.
- **MoveToneCurve** — when tone curve is scalar (same for all channels); single LUT.
- **RandomGamma** — single gamma LUT.
- **RandomDither** — binary or single quantization LUT (same for all channels).

**Applied in a loop (needs improvement: per-channel LUT in one call):**
- **Posterize** — when bits are per-channel (e.g. `bits=[3, 4, 5]`); we loop over channels and call `sz_lut` per channel.
- **MoveToneCurve** — when `low_y` / `high_y` are per-channel arrays; we build LUTs shape `(C, 256)` and loop: `result[..., i] = sz_lut(img[..., i], luts[i], ...)`.
- **AutoContrast** — per-channel LUT from histogram bounds; loop over channels.
- **EqualizeHistogram** — per-channel equalization; loop over channels.
- **OrderedDither** — multi-level dithering; we reuse the same LUT but still loop over channels.

**Needs float32 output (needs improvement):**
- **RandomBrightnessContrast / BrightnessContrast** — we build a float32 LUT then cast to uint8 and call `sz_lut`; we lose precision. We want uint8 image + float32 LUT → float32 image when output is float.
- **Normalize** — uint8 image normalized to float32 (mean-std or min-max); needs float32 LUT → float32 image in one pass (currently Albucore uses `cv2.LUT` for this path).
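
The precision loss mentioned for the brightness/contrast path can be illustrated with plain NumPy. This is a minimal sketch, not the Albucore or AlbumentationsX code: the `alpha` and `beta` values are made up for illustration, and the cast is assumed to be a plain `astype(np.uint8)`, which truncates (rounding first would still leave an error of up to 0.5 per entry).

```python
import numpy as np

# Hypothetical brightness/contrast parameters, chosen only for illustration.
alpha, beta = 1.1, 0.3

x = np.arange(256, dtype=np.float32)
lut_f32 = np.clip(alpha * x + beta, 0, 255)  # the mapping we actually want
lut_u8 = lut_f32.astype(np.uint8)            # what a uint8-only LUT can store

# Largest per-entry error introduced by the cast (fractional part is dropped).
max_err = float(np.abs(lut_f32 - lut_u8).max())
print(f"max quantization error: {max_err:.3f}")
```

A float32 LUT with float32 output would keep `lut_f32` as is, so this error would not be introduced.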

---

## What we need

### 1. Performance

- LUT apply should be **faster than OpenCV** `cv2.LUT` for the same input (uint8 image, 256-element LUT). We already use Stringzilla for uint8/uint8; keeping or improving that advantage is important.

### 2. uint8 image + float32 LUT → float32 image

- **Input:** uint8 image of shape `(H, W)`, `(H, W, C)`, `(N, H, W, C)`, `(D, H, W, C)`, or `(N, D, H, W, C)`; float32 LUT of length 256 (values in any range, e.g. normalized [0, 1] or mean-std normalized).
- **Output:** float32 image of the same shape; `out[..., c] = lut[img[..., c]]` (last dimension is channel when present).
- **Use case:** Normalization (mean-std, min-max) and any pipeline that wants to map uint8 indices through a float LUT without converting the whole image to float first. So we can do “uint8 in → float32 out” in one LUT call.
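
The requested semantics can be written as a NumPy reference, shown here only to pin down the expected result, not as a proposed implementation. The function name `lut_f32_reference` and the `mean` / `std` values are hypothetical; a native Stringzilla path would be expected to produce the same output.

```python
import numpy as np

def lut_f32_reference(img: np.ndarray, lut: np.ndarray) -> np.ndarray:
    """NumPy reference: uint8 image + float32 LUT (256 entries) -> float32 image."""
    assert img.dtype == np.uint8 and lut.shape == (256,)
    # Fancy indexing gathers lut[v] for every pixel value v, for any image shape.
    return lut.astype(np.float32, copy=False)[img]

# Illustrative mean-std normalization folded into a single 256-entry LUT.
mean, std = 0.485, 0.229
lut = (np.arange(256, dtype=np.float32) / 255.0 - mean) / std

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(4, 5, 3), dtype=np.uint8)

out = lut_f32_reference(img, lut)
# Two-step path that the LUT replaces: convert to float, then normalize.
expected = (img.astype(np.float32) / 255.0 - mean) / std
```

A single 1D LUT covers a scalar mean and std. Per-channel statistics would need one LUT per channel, which is the `(C, 256)` case in the next section.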

### 3. Per-channel LUT: shape (C, N) in one call

- **Input:** uint8 image of shape `(H, W, C)`, `(N, H, W, C)`, `(D, H, W, C)`, or `(N, D, H, W, C)`; LUT array of shape **(C, 256)** (that is, (C, N) with N = 256): one LUT per channel.
- **Output:** same shape uint8 (or float32 if LUT is float32); channel `c` is transformed by `luts[c, :]`.
- **Current workaround:** We do a Python loop: `for c in range(C): result[..., c] = sz_lut(img[..., c], luts[c], ...)`. We want a **single API call** that applies each channel’s LUT to that channel (no loop), so the backend can vectorize over channels.
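
For reference, the loop and the desired single-call behavior can be compared in NumPy. This is a sketch of the semantics only: both function names are hypothetical, and plain indexing stands in for `sz_lut` inside the loop so the example runs without Stringzilla.

```python
import numpy as np

def per_channel_lut_loop(img: np.ndarray, luts: np.ndarray) -> np.ndarray:
    """Current pattern: one LUT application per channel, driven from Python."""
    out = np.empty(img.shape, dtype=luts.dtype)
    for c in range(img.shape[-1]):
        out[..., c] = luts[c][img[..., c]]  # stand-in for sz_lut(img[..., c], luts[c])
    return out

def per_channel_lut_reference(img: np.ndarray, luts: np.ndarray) -> np.ndarray:
    """Requested behavior in one call: out[..., c] = luts[c, img[..., c]]."""
    channels = np.arange(img.shape[-1])  # broadcasts along the last axis of img
    return luts[channels, img]

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(2, 8, 8, 3), dtype=np.uint8)  # (N, H, W, C)
luts_u8 = rng.integers(0, 256, size=(3, 256), dtype=np.uint8)  # (C, 256)
luts_f32 = rng.random((3, 256), dtype=np.float32)              # float32 variant

same_u8 = np.array_equal(per_channel_lut_loop(img, luts_u8),
                         per_channel_lut_reference(img, luts_u8))
same_f32 = np.array_equal(per_channel_lut_loop(img, luts_f32),
                          per_channel_lut_reference(img, luts_f32))
```

The output dtype follows the LUT dtype in both versions, which matches the "uint8 (or float32 if LUT is float32)" rule above.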

If the same API supports both a single 1D LUT and a per-channel LUT of shape (C, 256), that would cover our use cases. Both forms should accept **any input shape** `(H, W, C)`, `(N, H, W, C)`, `(D, H, W, C)`, or `(N, D, H, W, C)` without the caller having to reshape. Our current reshape to a contiguous last dimension works, but native support for these shapes is preferred.
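
The caller-side reshape looks roughly like the sketch below. It is an illustration under assumptions: `lut_1d_kernel` is a hypothetical stand-in for a backend that only accepts flat, contiguous uint8 buffers, and `lut_any_shape` is the wrapper that native shape support would make unnecessary.

```python
import numpy as np

def lut_1d_kernel(flat: np.ndarray, lut: np.ndarray) -> np.ndarray:
    """Hypothetical backend: handles only flat, contiguous uint8 buffers."""
    assert flat.ndim == 1 and flat.flags["C_CONTIGUOUS"]
    return lut[flat]

def lut_any_shape(img: np.ndarray, lut: np.ndarray) -> np.ndarray:
    """Caller-side wrapper: flatten, apply the LUT, restore the original shape."""
    # ascontiguousarray copies only when img is not already C-contiguous.
    flat = np.ascontiguousarray(img).reshape(-1)
    return lut_1d_kernel(flat, lut).reshape(img.shape)

rng = np.random.default_rng(0)
lut = rng.integers(0, 256, size=256, dtype=np.uint8)
video = rng.integers(0, 256, size=(2, 3, 4, 5, 3), dtype=np.uint8)  # (N, D, H, W, C)
strided = video[:, :, ::2]  # non-contiguous view, forces the copy path

out_video = lut_any_shape(video, lut)
out_strided = lut_any_shape(strided, lut)
```

The extra copy for non-contiguous inputs is the main cost of doing this on the caller side.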

---

## Summary table

| Feature | Current | Request |
|--------|---------|---------|
| **Speed** | Stringzilla used for uint8/uint8 | Stay faster than OpenCV `cv2.LUT` |
| **Dtypes** | uint8 image + uint8 LUT → uint8 | Add: uint8 image + **float32 LUT** → **float32** image (e.g. for normalization) |
| **Per-channel** | Loop in Python: one `sz_lut` per channel | Accept LUT shape **(C, 256)** and apply in one call (no loop) |

---

## Why this helps

- **Normalization:** One uint8→float32 pass with a float32 LUT replaces convert-to-float-then-normalize (or a separate `cv2.LUT` code path). The pipeline stays the same, with better performance and simpler code.
- **Per-channel ops:** Equalize, tone curve, reduce_bits, and any per-channel LUT run in a loop today; a single (C, 256) LUT call would remove the loop and let the backend optimize (SIMD over channels, etc.).

If this is implemented in Stringzilla (or exposed in Albucore on top of it), we would use it in Albucore and AlbumentationsX to simplify code, improve performance, and support uint8→float32 LUT and per-channel LUT without Python loops.


### Can you contribute to the implementation?

- [ ] I can contribute

### Is your feature request specific to a certain interface?

It applies to everything

### Contact Details

_No response_

### Is there an existing issue for this?

- [x] I have searched the existing issues

### Code of Conduct

- [x] I agree to follow this project's Code of Conduct
