Further optimized mask.from_surface() when converting 32 bit alpha surfaces#3785
Further optimized mask.from_surface() when converting 32 bit alpha surfaces#3785itzpr3d4t0r wants to merge 6 commits into
mask.from_surface() when converting 32 bit alpha surfaces#3785Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughRemoved the 4-pixel unroll macro and rewrote set_from_threshold() in ChangesAlpha-Threshold Conversion
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@itzpr3d4t0r the man the myth the legend! Cool stuff. One thing noticed is that it is not possible to have a 24 bit surface with per pixel alpha, so the 24 bit codepath is not reachable. Also for bonus points, it also is theoretically possible to reach the 4 BPP codepath with a format that would be unsupported, like SDL_PIXELFORMAT_ARGB2101010. So maybe safest patch would be instead of (bpp < 3) for the generic codepath, (SDL_PIXELLAYOUT(surf->format) == SDL_PACKEDLAYOUT_8888) or something to that end. |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src_c/mask.c (1)
852-879: ⚡ Quick winConsider accumulating the 8-pixel scan into a local register before writing back to
bits[block_idx].The inner unrolled loop does 8 separate read-modify-write operations on the same memory location
bits[block_idx]. Depending on alias analysis ofsrcp(aUint8*) vs.bits(aBITMASK_W*), the compiler may not always be able to keep that word in a register across the 8 conditional ORs, which would re-introduce the memory dependency chain you are explicitly trying to avoid. Promoting it to a local makes the optimization independent of the compiler's aliasing decisions and also lets the same pattern be reused by the 1–7-pixel tail loop.♻️ Proposed refactor
- /* Fill the whole block */ - while (x <= chunk_end - 8) { - if (srcp[0] > u_threshold) { - bits[block_idx] |= BITMASK_N((x + 0) & BITMASK_W_MASK); - } - if (srcp[4] > u_threshold) { - bits[block_idx] |= BITMASK_N((x + 1) & BITMASK_W_MASK); - } - if (srcp[8] > u_threshold) { - bits[block_idx] |= BITMASK_N((x + 2) & BITMASK_W_MASK); - } - if (srcp[12] > u_threshold) { - bits[block_idx] |= BITMASK_N((x + 3) & BITMASK_W_MASK); - } - if (srcp[16] > u_threshold) { - bits[block_idx] |= BITMASK_N((x + 4) & BITMASK_W_MASK); - } - if (srcp[20] > u_threshold) { - bits[block_idx] |= BITMASK_N((x + 5) & BITMASK_W_MASK); - } - if (srcp[24] > u_threshold) { - bits[block_idx] |= BITMASK_N((x + 6) & BITMASK_W_MASK); - } - if (srcp[28] > u_threshold) { - bits[block_idx] |= BITMASK_N((x + 7) & BITMASK_W_MASK); - } - srcp += 32; - x += 8; - } - - /* remaining 1-7 pixels */ - while (x < chunk_end) { - if (*srcp > u_threshold) { - bits[block_idx] |= BITMASK_N(x & BITMASK_W_MASK); - } - srcp += 4; - x++; - } + BITMASK_W block = bits[block_idx]; + + /* Fill the whole block */ + while (x <= chunk_end - 8) { + const int bp = x & BITMASK_W_MASK; + if (srcp[0] > u_threshold) block |= BITMASK_N(bp + 0); + if (srcp[4] > u_threshold) block |= BITMASK_N(bp + 1); + if (srcp[8] > u_threshold) block |= BITMASK_N(bp + 2); + if (srcp[12] > u_threshold) block |= BITMASK_N(bp + 3); + if (srcp[16] > u_threshold) block |= BITMASK_N(bp + 4); + if (srcp[20] > u_threshold) block |= BITMASK_N(bp + 5); + if (srcp[24] > u_threshold) block |= BITMASK_N(bp + 6); + if (srcp[28] > u_threshold) block |= BITMASK_N(bp + 7); + srcp += 32; + x += 8; + } + + /* remaining 1-7 pixels */ + while (x < chunk_end) { + if (*srcp > u_threshold) { + block |= BITMASK_N(x & BITMASK_W_MASK); + } + srcp += 4; + x++; + } + + bits[block_idx] = block;If you want to push further, the 8 branches per iteration can also be turned into a branchless byte-build (
byte |= (srcp[0] > u_threshold) << 0; ...) and thenblock |= ((BITMASK_W)byte) << bponce, which removes the per-pixel branches entirely — but the local-accumulator change alone already captures most of the win.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src_c/mask.c` around lines 852 - 879, The loop currently performs eight separate read-modify-writes to bits[block_idx] inside the unrolled loop (using srcp, u_threshold, BITMASK_N and BITMASK_W_MASK), which may prevent the compiler from keeping that word in a register; change the code to load bits[block_idx] into a local accumulator (e.g., BITMASK_W acc = bits[block_idx]), perform all eight conditional ORs into acc (using the same BITMASK_N(...) expressions), then write back bits[block_idx] = acc once at the end of the iteration; also apply the same local-accumulator pattern to the 1–7 pixel tail loop so it reuses the same optimization.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src_c/mask.c`:
- Around line 852-879: The loop currently performs eight separate
read-modify-writes to bits[block_idx] inside the unrolled loop (using srcp,
u_threshold, BITMASK_N and BITMASK_W_MASK), which may prevent the compiler from
keeping that word in a register; change the code to load bits[block_idx] into a
local accumulator (e.g., BITMASK_W acc = bits[block_idx]), perform all eight
conditional ORs into acc (using the same BITMASK_N(...) expressions), then write
back bits[block_idx] = acc once at the end of the iteration; also apply the same
local-accumulator pattern to the 1–7 pixel tail loop so it reuses the same
optimization.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: bbf10a95-b0e2-403e-8c65-c749b762c51b
📒 Files selected for processing (1)
src_c/mask.c
Updated explanation a bit.
mask.from_surface() when converting 32 - 24 bit alpha surfacesmask.from_surface() when converting 32 bit alpha surfaces

Follows #2895
In the previous implementation I circumvented SDL pixel getters by rolling in some manual alpha-getting procedures. That sped it up quite a bit but the code still wasn't parallel enough and clogged itself with unnecessary calculations while also not exploiting the way the bitmask is stored.
As documented in the code, bitmasks are stored column-group-major, with each group being 32 or 64 bits depending on the platform.
Take a 128X2 surface 64-bit platform, the bitmap will be laid out as follows.
We exploit this fact and do:
I've manually unrolled the loop so there aren't as many data dependencies and removed the duff's device.
From my testing all this makes the algorithm 60-80% faster.