Skip to content

Further optimized mask.from_surface() when converting 32 bit alpha surfaces#3785

Open
itzpr3d4t0r wants to merge 6 commits into
pygame-community:mainfrom
itzpr3d4t0r:mask_from_surface_optim
Open

Further optimized mask.from_surface() when converting 32 bit alpha surfaces#3785
itzpr3d4t0r wants to merge 6 commits into
pygame-community:mainfrom
itzpr3d4t0r:mask_from_surface_optim

Conversation

@itzpr3d4t0r
Copy link
Copy Markdown
Member

@itzpr3d4t0r itzpr3d4t0r commented May 3, 2026

Follows #2895

In the previous implementation I circumvented SDL pixel getters by rolling in some manual alpha-getting procedures. That sped it up quite a bit but the code still wasn't parallel enough and clogged itself with unnecessary calculations while also not exploiting the way the bitmask is stored.

As documented in the code, bitmasks are stored column-group-major, with each group being 32 or 64 bits depending on the platform.

Take a 128X2 surface 64-bit platform, the bitmap will be laid out as follows.

Index:     [0]         [1]          [2]         [3]
        row0,col0   row1,col0    row0,col1   row1,col1
        px 0-63      px 0-63      64-127     px 64-127

We exploit this fact and do:

  • Still read all source surface's alpha channel only by jumping between the alphas horizontally (so we stick to the same row-scanline as before).
  • write all of the bitmask's group of pixels (32 or 64) by row so:
    1. Do bits[0] in groups of 8 in parallel up to the nearest multiple of 8 possible
    2. Do remaining 1-7 pixels one by one up to 32 or 64
    3. Jump to the next group on the same row (index 2 from the example)
    4. Go to the next row and repeat the process.

I've manually unrolled the loop so there aren't as many data dependencies and removed the duff's device.
From my testing all this makes the algorithm 60-80% faster.

Figure_2

@itzpr3d4t0r itzpr3d4t0r requested a review from a team as a code owner May 3, 2026 13:50
@itzpr3d4t0r itzpr3d4t0r added the Performance Related to the speed or resource usage of the project label May 3, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 3, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • ✅ Review completed - (🔄 Check again to review again)
📝 Walkthrough

Walkthrough

Removed the 4-pixel unroll macro and rewrote set_from_threshold() in src_c/mask.c: threshold is clamped to a Uint8, <24‑bpp uses a per-pixel loop, and 32‑bpp uses an endianness-aware, chunked scanner that writes alpha bits directly into bitmask->bits.

Changes

Alpha-Threshold Conversion

Layer / File(s) Summary
Macro removal
src_c/mask.c
Removed the old LOOP_UNROLLED4 macro.
Function locals
src_c/mask.c
Removed unused local n from set_from_threshold() declaration.
Threshold clamp & per-pixel path
src_c/mask.c
Introduce u_threshold (clamped Uint8) and update bpp < 3 per-pixel comparisons to use a > u_threshold.
Early-return routing
src_c/mask.c
Add early return so optimized scanning runs only for bpp == 4.
bpp == 4 chunked scanner
src_c/mask.c
New implementation computes endianness-dependent alpha offset, writes directly to bitmask->bits, processes rows in chunked blocks with 8-pixel unrolled inner loops and tail handling.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

Surface

Suggested reviewers

  • Starbuck5
  • ankith26
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: further optimization of mask.from_surface() for 32-bit alpha surfaces, which matches the primary focus of the changeset.
Description check ✅ Passed The description provides detailed technical context about the optimization approach, including how bitmask storage is exploited, performance improvements achieved, and references prior work (#2895).
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@itzpr3d4t0r itzpr3d4t0r added the mask pygame.mask label May 3, 2026
@Starbuck5
Copy link
Copy Markdown
Member

@itzpr3d4t0r the man the myth the legend! Cool stuff.

One thing noticed is that it is not possible to have a 24 bit surface with per pixel alpha, so the 24 bit codepath is not reachable.

Also for bonus points, it also is theoretically possible to reach the 4 BPP codepath with a format that would be unsupported, like SDL_PIXELFORMAT_ARGB2101010. So maybe safest patch would be instead of (bpp < 3) for the generic codepath, (SDL_PIXELLAYOUT(surf->format) == SDL_PACKEDLAYOUT_8888) or something to that end.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src_c/mask.c (1)

852-879: ⚡ Quick win

Consider accumulating the 8-pixel scan into a local register before writing back to bits[block_idx].

The inner unrolled loop does 8 separate read-modify-write operations on the same memory location bits[block_idx]. Depending on alias analysis of srcp (a Uint8*) vs. bits (a BITMASK_W*), the compiler may not always be able to keep that word in a register across the 8 conditional ORs, which would re-introduce the memory dependency chain you are explicitly trying to avoid. Promoting it to a local makes the optimization independent of the compiler's aliasing decisions and also lets the same pattern be reused by the 1–7-pixel tail loop.

♻️ Proposed refactor
-            /* Fill the whole block */
-            while (x <= chunk_end - 8) {
-                if (srcp[0] > u_threshold) {
-                    bits[block_idx] |= BITMASK_N((x + 0) & BITMASK_W_MASK);
-                }
-                if (srcp[4] > u_threshold) {
-                    bits[block_idx] |= BITMASK_N((x + 1) & BITMASK_W_MASK);
-                }
-                if (srcp[8] > u_threshold) {
-                    bits[block_idx] |= BITMASK_N((x + 2) & BITMASK_W_MASK);
-                }
-                if (srcp[12] > u_threshold) {
-                    bits[block_idx] |= BITMASK_N((x + 3) & BITMASK_W_MASK);
-                }
-                if (srcp[16] > u_threshold) {
-                    bits[block_idx] |= BITMASK_N((x + 4) & BITMASK_W_MASK);
-                }
-                if (srcp[20] > u_threshold) {
-                    bits[block_idx] |= BITMASK_N((x + 5) & BITMASK_W_MASK);
-                }
-                if (srcp[24] > u_threshold) {
-                    bits[block_idx] |= BITMASK_N((x + 6) & BITMASK_W_MASK);
-                }
-                if (srcp[28] > u_threshold) {
-                    bits[block_idx] |= BITMASK_N((x + 7) & BITMASK_W_MASK);
-                }
-                srcp += 32;
-                x += 8;
-            }
-
-            /* remaining 1-7 pixels */
-            while (x < chunk_end) {
-                if (*srcp > u_threshold) {
-                    bits[block_idx] |= BITMASK_N(x & BITMASK_W_MASK);
-                }
-                srcp += 4;
-                x++;
-            }
+            BITMASK_W block = bits[block_idx];
+
+            /* Fill the whole block */
+            while (x <= chunk_end - 8) {
+                const int bp = x & BITMASK_W_MASK;
+                if (srcp[0]  > u_threshold) block |= BITMASK_N(bp + 0);
+                if (srcp[4]  > u_threshold) block |= BITMASK_N(bp + 1);
+                if (srcp[8]  > u_threshold) block |= BITMASK_N(bp + 2);
+                if (srcp[12] > u_threshold) block |= BITMASK_N(bp + 3);
+                if (srcp[16] > u_threshold) block |= BITMASK_N(bp + 4);
+                if (srcp[20] > u_threshold) block |= BITMASK_N(bp + 5);
+                if (srcp[24] > u_threshold) block |= BITMASK_N(bp + 6);
+                if (srcp[28] > u_threshold) block |= BITMASK_N(bp + 7);
+                srcp += 32;
+                x += 8;
+            }
+
+            /* remaining 1-7 pixels */
+            while (x < chunk_end) {
+                if (*srcp > u_threshold) {
+                    block |= BITMASK_N(x & BITMASK_W_MASK);
+                }
+                srcp += 4;
+                x++;
+            }
+
+            bits[block_idx] = block;

If you want to push further, the 8 branches per iteration can also be turned into a branchless byte-build (byte |= (srcp[0] > u_threshold) << 0; ...) and then block |= ((BITMASK_W)byte) << bp once, which removes the per-pixel branches entirely — but the local-accumulator change alone already captures most of the win.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src_c/mask.c` around lines 852 - 879, The loop currently performs eight
separate read-modify-writes to bits[block_idx] inside the unrolled loop (using
srcp, u_threshold, BITMASK_N and BITMASK_W_MASK), which may prevent the compiler
from keeping that word in a register; change the code to load bits[block_idx]
into a local accumulator (e.g., BITMASK_W acc = bits[block_idx]), perform all
eight conditional ORs into acc (using the same BITMASK_N(...) expressions), then
write back bits[block_idx] = acc once at the end of the iteration; also apply
the same local-accumulator pattern to the 1–7 pixel tail loop so it reuses the
same optimization.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src_c/mask.c`:
- Around line 852-879: The loop currently performs eight separate
read-modify-writes to bits[block_idx] inside the unrolled loop (using srcp,
u_threshold, BITMASK_N and BITMASK_W_MASK), which may prevent the compiler from
keeping that word in a register; change the code to load bits[block_idx] into a
local accumulator (e.g., BITMASK_W acc = bits[block_idx]), perform all eight
conditional ORs into acc (using the same BITMASK_N(...) expressions), then write
back bits[block_idx] = acc once at the end of the iteration; also apply the same
local-accumulator pattern to the 1–7 pixel tail loop so it reuses the same
optimization.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bbf10a95-b0e2-403e-8c65-c749b762c51b

📥 Commits

Reviewing files that changed from the base of the PR and between a0f4127 and 717e165.

📒 Files selected for processing (1)
  • src_c/mask.c

Updated explanation a bit.
@itzpr3d4t0r
Copy link
Copy Markdown
Member Author

itzpr3d4t0r commented May 11, 2026

Bunny was correct, brought a 15-20% perf improvement so now we're comfortably above 110% better than before this PR.
image

For some reason my pc doesn't like the branchless logic suggestion. I guess it's because my CPU has a very advanced two-ahead branch predictor (AMD 9800X3D). So i suggest to whoever wants to check it out with maybe older hardware, just substitute the inner loop with the 8 pixels at a time with this

const int bp = x & BITMASK_W_MASK;
Uint8 byte = 0;

byte |= (srcp[0]  > u_threshold) << 0;
byte |= (srcp[4]  > u_threshold) << 1;
byte |= (srcp[8]  > u_threshold) << 2;
byte |= (srcp[12] > u_threshold) << 3;
byte |= (srcp[16] > u_threshold) << 4;
byte |= (srcp[20] > u_threshold) << 5;
byte |= (srcp[24] > u_threshold) << 6;
byte |= (srcp[28] > u_threshold) << 7;

block |= ((BITMASK_W)byte) << bp;
srcp += 32; 
x += 8;

@itzpr3d4t0r itzpr3d4t0r changed the title Further optimized mask.from_surface() when converting 32 - 24 bit alpha surfaces Further optimized mask.from_surface() when converting 32 bit alpha surfaces May 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

mask pygame.mask Performance Related to the speed or resource usage of the project

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants