gh-145742: Manually emit _LOAD_FAST_BORROW to reduce stencil bloat #148217

Open
corona10 wants to merge 3 commits into python:main from corona10:gh-145742-impl

corona10 (Member) commented Apr 7, 2026

  • Manually emit _LOAD_FAST_BORROW at JIT compile time, encoding the operand offset directly into the instruction instead of loading it from the GOT at runtime.
  • This shrinks the generic case (oparg ≥ 8) from 28 bytes to 8 bytes and eliminates 27 stencil functions.
  • I've compared machine code through godbolt:

corona10 (Member, Author) commented Apr 8, 2026

For i686: https://godbolt.org/z/cdjdzev5Y

diegorusso (Contributor) commented

Some initial feedback on this:

  • we should not pollute jit.c with uop implementations. They should live in separate compilation units and have the same signature as the other ones, e.g. void emit__UOP_NAME(unsigned char *code, unsigned char *data, _PyExecutorObject *executor, const _PyUOpInstruction *instruction, jit_state *state)
  • ifdefs can select the right architecture for the custom implementation
  • in bytecodes.c we should have a way to tell the JIT machinery not to generate any code for a specific uop, but the uop's implementation should still be accounted for in the table in jit-stencils-*.h (static const StencilGroup stencil_groups[MAX_UOP_REGS_ID + 1])
  • The linker will later pick up our own version of the uop implementation.
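
To make the layout suggested above concrete, here is a minimal, self-contained sketch. It is not the PR's code: the `demo_*` types and names are hypothetical stand-ins for `_PyExecutorObject`, `_PyUOpInstruction`, `jit_state` and the generated table, and the emitter body only writes the oparg as a 4-byte little-endian immediate as a placeholder for real machine code. What it is meant to show is the shape: a prototype with the shared signature, a table entry that still refers to the uop, a definition that could live in a separate compilation unit, and an `#if` that selects the architecture.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for CPython's internal types, so this compiles alone. */
typedef struct { int unused; } demo_executor;
typedef struct { uint16_t opcode; uint32_t oparg; } demo_uop;
typedef struct { size_t code_size; } demo_jit_state;

/* Same parameter list as the signature quoted above, with stand-in types. */
typedef void (*demo_emitter)(unsigned char *code, unsigned char *data,
                             demo_executor *executor,
                             const demo_uop *instruction,
                             demo_jit_state *state);

/* What a generated header might hold for a hand-written uop:
 * a prototype only, with no generated body. */
void demo_emit_LOAD_FAST_BORROW(unsigned char *code, unsigned char *data,
                                demo_executor *executor,
                                const demo_uop *instruction,
                                demo_jit_state *state);

enum { DEMO_LOAD_FAST_BORROW = 0, DEMO_UOP_COUNT };

/* The table still accounts for the uop; the linker resolves the symbol. */
static const demo_emitter demo_emitters[DEMO_UOP_COUNT] = {
    [DEMO_LOAD_FAST_BORROW] = demo_emit_LOAD_FAST_BORROW,
};

/* From here on: what the separate, hand-written unit might contain. */
#if defined(__x86_64__) || defined(_M_X64)
#  define DEMO_ARCH "x86_64"
#elif defined(__aarch64__) || defined(_M_ARM64)
#  define DEMO_ARCH "aarch64"
#else
#  define DEMO_ARCH "generic"
#endif

/* Reports which branch of the #if was taken. */
const char *demo_arch(void) { return DEMO_ARCH; }

void demo_emit_LOAD_FAST_BORROW(unsigned char *code, unsigned char *data,
                                demo_executor *executor,
                                const demo_uop *instruction,
                                demo_jit_state *state)
{
    (void)data;
    (void)executor;
    /* Placeholder: a real emitter would write instructions for DEMO_ARCH.
     * Here the oparg is stored as a little-endian 32-bit immediate. */
    uint32_t arg = instruction->oparg;
    for (int i = 0; i < 4; i++) {
        code[i] = (unsigned char)(arg >> (8 * i));
    }
    state->code_size += 4;
}
```

A caller would go through the table, for example `demo_emitters[DEMO_LOAD_FAST_BORROW](buf, NULL, NULL, &ins, &st)`, without needing to know whether the entry was generated or written by hand.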

@corona10
Copy link
Copy Markdown
Member Author

corona10 commented Apr 8, 2026

Thanks, @diegorusso. I’ll keep working on this based on your feedback.

markshannon (Member) commented

A couple of other things:

  • This PR asserts that the immediate value fits into the space given, but this will fail for larger opargs.
  • I don't know if this matters, but the x86 code is inferior to that generated by the stencils for oparg 0-5. For example, for LOAD_FAST_BORROW_1_r01 the stencil-generated code uses a 1-byte offset instead of the 4-byte offset this PR generates. For oparg > 5, the code is the same.

I think you need to split _LOAD_FAST_BORROW into two variants for the JIT. One for all normal opargs, that can use manual code generation, and a generated fallback for huge opargs.
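
Both points can be illustrated with a small sketch of x86-64 displacement encoding. This is not the PR's emitter. The registers (`rax`, `rdi`), the base offset of 80 bytes and the 8-byte slot size are assumptions, chosen only so that the 1-byte/4-byte crossover lands between oparg 5 and 6 as described above, and every `demo_*` name is hypothetical. Only the load is shown. The emitter picks the short form when the offset fits in a signed byte, the long form when it fits in 32 bits, and otherwise returns 0 rather than asserting, so a caller could switch to a generated fallback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed for illustration only, not taken from CPython: byte offset of the
 * first local slot from the frame pointer, and the size of one slot. */
#define DEMO_LOCALS_BASE 80
#define DEMO_SLOT_SIZE 8

/* Byte offset of local number `oparg` under the assumptions above. */
static int64_t demo_local_offset(uint32_t oparg)
{
    return DEMO_LOCALS_BASE + (int64_t)oparg * DEMO_SLOT_SIZE;
}

/* Encode `mov rax, [rdi + disp]`. Returns the number of bytes written,
 * or 0 (writing nothing) if disp does not fit in 32 bits. */
static size_t demo_emit_load(unsigned char *code, int64_t disp)
{
    if (disp < INT32_MIN || disp > INT32_MAX) {
        return 0;                      /* caller must use a fallback */
    }
    code[0] = 0x48;                    /* REX.W: 64-bit operand size */
    code[1] = 0x8B;                    /* MOV r64, r/m64 */
    if (disp >= INT8_MIN && disp <= INT8_MAX) {
        code[2] = 0x47;                /* ModRM: mod=01 (disp8), reg=rax, rm=rdi */
        code[3] = (unsigned char)disp;
        return 4;
    }
    code[2] = 0x87;                    /* ModRM: mod=10 (disp32), reg=rax, rm=rdi */
    uint32_t bits = (uint32_t)disp;
    for (int i = 0; i < 4; i++) {
        code[3 + i] = (unsigned char)(bits >> (8 * i));  /* little-endian */
    }
    return 7;
}

/* Small usage check of the three cases. */
void demo_self_check(void)
{
    unsigned char buf[7];
    assert(demo_emit_load(buf, demo_local_offset(5)) == 4);    /* 120 fits disp8 */
    assert(buf[3] == 120);
    assert(demo_emit_load(buf, demo_local_offset(6)) == 7);    /* 128 needs disp32 */
    assert(demo_emit_load(buf, (int64_t)INT32_MAX + 1) == 0);  /* fallback case */
}
```

Under these assumptions the short form saves three bytes per load for the first six locals, and the zero return marks the "huge oparg" case that would go to the generated variant.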
