Further optimized `mask.from_surface()` when converting 32 bit alpha surfaces by itzpr3d4t0r · Pull Request #3785 · pygame-community/pygame-ce

itzpr3d4t0r · 2026-05-03T13:50:58Z

Follows #2895

In the previous implementation I circumvented SDL pixel getters by rolling in some manual alpha-getting procedures. That sped it up quite a bit but the code still wasn't parallel enough and clogged itself with unnecessary calculations while also not exploiting the way the bitmask is stored.

As documented in the code, bitmasks are stored column-group-major, with each group being 32 or 64 bits depending on the platform.

Take a 128X2 surface 64-bit platform, the bitmap will be laid out as follows.

Index:     [0]         [1]          [2]         [3]
        row0,col0   row1,col0    row0,col1   row1,col1
        px 0-63      px 0-63      64-127     px 64-127

We exploit this fact and do:

Still read all source surface's alpha channel only by jumping between the alphas horizontally (so we stick to the same row-scanline as before).
write all of the bitmask's group of pixels (32 or 64) by row so:
1. Do bits[0] in groups of 8 in parallel up to the nearest multiple of 8 possible
2. Do remaining 1-7 pixels one by one up to 32 or 64
3. Jump to the next group on the same row (index 2 from the example)
4. Go to the next row and repeat the process.

I've manually unrolled the loop so there aren't as many data dependencies and removed the duff's device.
From my testing all this makes the algorithm 60-80% faster.

coderabbitai · 2026-05-03T13:51:11Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
✅ Review completed - (🔄 Check again to review again)

📝 Walkthrough

Walkthrough

Removed the 4-pixel unroll macro and rewrote set_from_threshold() in src_c/mask.c: threshold is clamped to a Uint8, <24‑bpp uses a per-pixel loop, and 32‑bpp uses an endianness-aware, chunked scanner that writes alpha bits directly into bitmask->bits.

Changes

Alpha-Threshold Conversion

Layer / File(s)	Summary
Macro removal `src_c/mask.c`	Removed the old `LOOP_UNROLLED4` macro.
Function locals `src_c/mask.c`	Removed unused local `n` from `set_from_threshold()` declaration.
Threshold clamp & per-pixel path `src_c/mask.c`	Introduce `u_threshold` (clamped `Uint8`) and update `bpp < 3` per-pixel comparisons to use `a > u_threshold`.
Early-return routing `src_c/mask.c`	Add early return so optimized scanning runs only for `bpp == 4`.
bpp == 4 chunked scanner `src_c/mask.c`	New implementation computes endianness-dependent alpha offset, writes directly to `bitmask->bits`, processes rows in chunked blocks with 8-pixel unrolled inner loops and tail handling.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

Surface

Suggested reviewers

Starbuck5
ankith26

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: further optimization of mask.from_surface() for 32-bit alpha surfaces, which matches the primary focus of the changeset.
Description check	✅ Passed	The description provides detailed technical context about the optimization approach, including how bitmask storage is exploited, performance improvements achieved, and references prior work (`#2895`).
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Starbuck5 · 2026-05-10T21:50:46Z

@itzpr3d4t0r the man the myth the legend! Cool stuff.

One thing noticed is that it is not possible to have a 24 bit surface with per pixel alpha, so the 24 bit codepath is not reachable.

Also for bonus points, it also is theoretically possible to reach the 4 BPP codepath with a format that would be unsupported, like SDL_PIXELFORMAT_ARGB2101010. So maybe safest patch would be instead of (bpp < 3) for the generic codepath, (SDL_PIXELLAYOUT(surf->format) == SDL_PACKEDLAYOUT_8888) or something to that end.

coderabbitai

🧹 Nitpick comments (1)

src_c/mask.c (1)

852-879: ⚡ Quick win

Consider accumulating the 8-pixel scan into a local register before writing back to bits[block_idx].

The inner unrolled loop does 8 separate read-modify-write operations on the same memory location bits[block_idx]. Depending on alias analysis of srcp (a Uint8*) vs. bits (a BITMASK_W*), the compiler may not always be able to keep that word in a register across the 8 conditional ORs, which would re-introduce the memory dependency chain you are explicitly trying to avoid. Promoting it to a local makes the optimization independent of the compiler's aliasing decisions and also lets the same pattern be reused by the 1–7-pixel tail loop.

♻️ Proposed refactor

-            /* Fill the whole block */
-            while (x <= chunk_end - 8) {
-                if (srcp[0] > u_threshold) {
-                    bits[block_idx] |= BITMASK_N((x + 0) & BITMASK_W_MASK);
-                }
-                if (srcp[4] > u_threshold) {
-                    bits[block_idx] |= BITMASK_N((x + 1) & BITMASK_W_MASK);
-                }
-                if (srcp[8] > u_threshold) {
-                    bits[block_idx] |= BITMASK_N((x + 2) & BITMASK_W_MASK);
-                }
-                if (srcp[12] > u_threshold) {
-                    bits[block_idx] |= BITMASK_N((x + 3) & BITMASK_W_MASK);
-                }
-                if (srcp[16] > u_threshold) {
-                    bits[block_idx] |= BITMASK_N((x + 4) & BITMASK_W_MASK);
-                }
-                if (srcp[20] > u_threshold) {
-                    bits[block_idx] |= BITMASK_N((x + 5) & BITMASK_W_MASK);
-                }
-                if (srcp[24] > u_threshold) {
-                    bits[block_idx] |= BITMASK_N((x + 6) & BITMASK_W_MASK);
-                }
-                if (srcp[28] > u_threshold) {
-                    bits[block_idx] |= BITMASK_N((x + 7) & BITMASK_W_MASK);
-                }
-                srcp += 32;
-                x += 8;
-            }
-
-            /* remaining 1-7 pixels */
-            while (x < chunk_end) {
-                if (*srcp > u_threshold) {
-                    bits[block_idx] |= BITMASK_N(x & BITMASK_W_MASK);
-                }
-                srcp += 4;
-                x++;
-            }
+            BITMASK_W block = bits[block_idx];
+
+            /* Fill the whole block */
+            while (x <= chunk_end - 8) {
+                const int bp = x & BITMASK_W_MASK;
+                if (srcp[0]  > u_threshold) block |= BITMASK_N(bp + 0);
+                if (srcp[4]  > u_threshold) block |= BITMASK_N(bp + 1);
+                if (srcp[8]  > u_threshold) block |= BITMASK_N(bp + 2);
+                if (srcp[12] > u_threshold) block |= BITMASK_N(bp + 3);
+                if (srcp[16] > u_threshold) block |= BITMASK_N(bp + 4);
+                if (srcp[20] > u_threshold) block |= BITMASK_N(bp + 5);
+                if (srcp[24] > u_threshold) block |= BITMASK_N(bp + 6);
+                if (srcp[28] > u_threshold) block |= BITMASK_N(bp + 7);
+                srcp += 32;
+                x += 8;
+            }
+
+            /* remaining 1-7 pixels */
+            while (x < chunk_end) {
+                if (*srcp > u_threshold) {
+                    block |= BITMASK_N(x & BITMASK_W_MASK);
+                }
+                srcp += 4;
+                x++;
+            }
+
+            bits[block_idx] = block;

If you want to push further, the 8 branches per iteration can also be turned into a branchless byte-build (byte |= (srcp[0] > u_threshold) << 0; ...) and then block |= ((BITMASK_W)byte) << bp once, which removes the per-pixel branches entirely — but the local-accumulator change alone already captures most of the win.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src_c/mask.c` around lines 852 - 879, The loop currently performs eight
separate read-modify-writes to bits[block_idx] inside the unrolled loop (using
srcp, u_threshold, BITMASK_N and BITMASK_W_MASK), which may prevent the compiler
from keeping that word in a register; change the code to load bits[block_idx]
into a local accumulator (e.g., BITMASK_W acc = bits[block_idx]), perform all
eight conditional ORs into acc (using the same BITMASK_N(...) expressions), then
write back bits[block_idx] = acc once at the end of the iteration; also apply
the same local-accumulator pattern to the 1–7 pixel tail loop so it reuses the
same optimization.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src_c/mask.c`:
- Around line 852-879: The loop currently performs eight separate
read-modify-writes to bits[block_idx] inside the unrolled loop (using srcp,
u_threshold, BITMASK_N and BITMASK_W_MASK), which may prevent the compiler from
keeping that word in a register; change the code to load bits[block_idx] into a
local accumulator (e.g., BITMASK_W acc = bits[block_idx]), perform all eight
conditional ORs into acc (using the same BITMASK_N(...) expressions), then write
back bits[block_idx] = acc once at the end of the iteration; also apply the same
local-accumulator pattern to the 1–7 pixel tail loop so it reuses the same
optimization.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bbf10a95-b0e2-403e-8c65-c749b762c51b

📥 Commits

Reviewing files that changed from the base of the PR and between a0f4127 and 717e165.

📒 Files selected for processing (1)

src_c/mask.c

Updated explanation a bit.

itzpr3d4t0r · 2026-05-11T21:49:01Z

Bunny was correct, brought a 15-20% perf improvement so now we're comfortably above 110% better than before this PR.

For some reason my pc doesn't like the branchless logic suggestion. I guess it's because my CPU has a very advanced two-ahead branch predictor (AMD 9800X3D). So i suggest to whoever wants to check it out with maybe older hardware, just substitute the inner loop with the 8 pixels at a time with this

const int bp = x & BITMASK_W_MASK;
Uint8 byte = 0;

byte |= (srcp[0]  > u_threshold) << 0;
byte |= (srcp[4]  > u_threshold) << 1;
byte |= (srcp[8]  > u_threshold) << 2;
byte |= (srcp[12] > u_threshold) << 3;
byte |= (srcp[16] > u_threshold) << 4;
byte |= (srcp[20] > u_threshold) << 5;
byte |= (srcp[24] > u_threshold) << 6;
byte |= (srcp[28] > u_threshold) << 7;

block |= ((BITMASK_W)byte) << bp;
srcp += 32; 
x += 8;

…ent)

Improved mask.from_surface performance in 32 and 24 bit surfaces cases.

525258b

itzpr3d4t0r requested a review from a team as a code owner May 3, 2026 13:50

itzpr3d4t0r added the Performance Related to the speed or resource usage of the project label May 3, 2026

Removed unused variable.

a0f4127

itzpr3d4t0r added the mask pygame.mask label May 3, 2026

- Removed 3bpp case and macro.

717e165

coderabbitai Bot reviewed May 11, 2026

View reviewed changes

Implemented bunny's suggestions.

5055e9c

Updated explanation a bit.

itzpr3d4t0r changed the title ~~Further optimized mask.from_surface() when converting 32 - 24 bit alpha surfaces~~ Further optimized mask.from_surface() when converting 32 bit alpha surfaces May 12, 2026

itzpr3d4t0r added 2 commits May 12, 2026 09:57

Condensed inner loop a bit.

8844f02

Micro optimization to skip writes on empty blocks. (4% perf. improvem…

758ada8

…ent)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Further optimized `mask.from_surface()` when converting 32 bit alpha surfaces#3785

Further optimized `mask.from_surface()` when converting 32 bit alpha surfaces#3785
itzpr3d4t0r wants to merge 6 commits into
pygame-community:mainfrom
itzpr3d4t0r:mask_from_surface_optim

itzpr3d4t0r commented May 3, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 3, 2026 •

edited

Loading

Reviews paused

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

Uh oh!

Starbuck5 commented May 10, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

itzpr3d4t0r commented May 11, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

itzpr3d4t0r commented May 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented May 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

Uh oh!

Starbuck5 commented May 10, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

itzpr3d4t0r commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

itzpr3d4t0r commented May 3, 2026 •

edited

Loading

coderabbitai Bot commented May 3, 2026 •

edited

Loading

itzpr3d4t0r commented May 11, 2026 •

edited

Loading