perf: METH_FASTCALL, tp_vectorcall, and related optimizations#147
perf: METH_FASTCALL, tp_vectorcall, and related optimizations#147
Conversation
Vectorcall bypasses tp_new→tp_init Python-level dispatch, parsing arguments directly in C. Makes the _parse_fastcall_args helper accept an input_required parameter so type constructors can accept optional input while module functions require it. Type constructor speedup: 41-66% depending on arguments.
There was a problem hiding this comment.
Code Review
This pull request implements vectorcall support for the xxh32, xxh64, xxh3_64, and xxh3_128 hash types in the _xxhash C extension, which improves performance for function calls. It also refactors the internal _parse_fastcall_args function to handle optional input arguments. The review feedback suggests releasing the Global Interpreter Lock (GIL) using Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS during the initial hash update within the new vectorcall implementations to ensure consistent multi-threaded performance across the library.
| if (buf.buf) { | ||
| XXH32_update(self->xxhash_state, buf.buf, buf.len); | ||
| PyBuffer_Release(&buf); | ||
| } |
There was a problem hiding this comment.
The XXH32_update call should be wrapped in Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS to release the GIL. This is consistent with the existing PYXXH32_do_update helper and is important for performance when hashing large buffers in multi-threaded environments, which is a primary use case for this library.
if (buf.buf) {
Py_BEGIN_ALLOW_THREADS
XXH32_update(self->xxhash_state, buf.buf, buf.len);
Py_END_ALLOW_THREADS
PyBuffer_Release(&buf);
}| if (buf.buf) { | ||
| XXH64_update(self->xxhash_state, buf.buf, buf.len); | ||
| PyBuffer_Release(&buf); | ||
| } |
There was a problem hiding this comment.
The XXH64_update call should be wrapped in Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS to release the GIL, maintaining consistency with the PYXXH64_do_update implementation and ensuring multi-threaded performance for large inputs.
if (buf.buf) {
Py_BEGIN_ALLOW_THREADS
XXH64_update(self->xxhash_state, buf.buf, buf.len);
Py_END_ALLOW_THREADS
PyBuffer_Release(&buf);
}| if (buf.buf) { | ||
| XXH3_64bits_update(self->xxhash_state, buf.buf, buf.len); | ||
| PyBuffer_Release(&buf); | ||
| } |
There was a problem hiding this comment.
| if (buf.buf) { | ||
| XXH3_128bits_update(self->xxhash_state, buf.buf, buf.len); | ||
| PyBuffer_Release(&buf); | ||
| } |
There was a problem hiding this comment.
Merging this PR will not alter performance
Performance Changes
Comparing Footnotes
|
Wrap XXH*_update calls in Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS in all four tp_vectorcall functions, consistent with the existing PYXXH*_do_update helpers. Important for multi-threaded hashing of large buffers.
Replaces large-buffer-only benchmarks with a mix that captures the recent optimizations: - 5-byte inputs (call overhead dominates, 25-60% improvement) - str input (tests _get_buffer_or_str UTF-8 path) - type constructors (tests tp_vectorcall, 41-66% improvement) - keyword seed passing - Retains 1KB/10KB macro benchmarks for covering hash throughput
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements vectorcall support for the xxhash hash types (xxh32, xxh64, xxh3_64, and xxh3_128) to optimize performance for small inputs and constructor calls. It also refactors the internal _parse_fastcall_args function to handle optional input arguments and updates the benchmark suite to include micro-benchmarks for these new paths. Review feedback suggests refining error handling in the new vectorcall implementations by returning NULL directly when tp_alloc fails, rather than calling PyErr_NoMemory(), to avoid potentially overwriting existing exceptions.
| if (buf.buf) | ||
| PyBuffer_Release(&buf); | ||
| Py_XDECREF(buf_owner); | ||
| return PyErr_NoMemory(); |
There was a problem hiding this comment.
In PYXXH32_vectorcall, if tp_alloc fails, it typically sets a MemoryError. Calling PyErr_NoMemory() again is redundant but safe. However, if tp_alloc were to fail for a reason other than memory exhaustion (though unlikely for this type), PyErr_NoMemory() would overwrite the actual error. It is generally better to return NULL directly if an error is already set, or only call PyErr_NoMemory() if XXH32_createState() fails (as it doesn't set a Python exception).
return NULL;| if (buf.buf) | ||
| PyBuffer_Release(&buf); | ||
| Py_XDECREF(buf_owner); | ||
| return PyErr_NoMemory(); |
There was a problem hiding this comment.
| if (buf.buf) | ||
| PyBuffer_Release(&buf); | ||
| Py_XDECREF(buf_owner); | ||
| return PyErr_NoMemory(); |
| if (buf.buf) | ||
| PyBuffer_Release(&buf); | ||
| Py_XDECREF(buf_owner); | ||
| return PyErr_NoMemory(); |
tp_alloc already sets MemoryError (or another exception) on failure. Calling PyErr_NoMemory() in the shared error path would overwrite it. Now return NULL directly for tp_alloc failure and only call PyErr_NoMemory() for XXH*_createState failure (which sets no exception).
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements the tp_vectorcall protocol for the xxh32, xxh64, xxh3_64, and xxh3_128 types to optimize constructor calls and reduce overhead. It also refactors internal argument parsing to support optional input requirements and adds new micro-benchmarks to measure call performance. Feedback highlights critical resource leaks in the new vectorcall functions where buffers and owners are not released if memory allocation fails. Additionally, it is recommended to add type checks to the vectorcall implementations to ensure that Python-level subclass overrides (like __init__) are not bypassed.
| if (self == NULL) | ||
| return NULL; |
| if (self == NULL) | ||
| return NULL; |
| if (self == NULL) | ||
| return NULL; |
| if (self == NULL) | ||
| return NULL; |
| PYXXH32_vectorcall(PyObject *type, PyObject *const *args, | ||
| size_t nargsf, PyObject *kwnames) | ||
| { |
There was a problem hiding this comment.
Implementing tp_vectorcall on a type object effectively replaces the standard tp_new and tp_init sequence for constructors. However, this implementation bypasses Python-level overrides in subclasses. If a user subclasses xxh32 in Python and defines a custom __init__, it will be ignored when the class is called via the vectorcall path (which is the default in modern Python). To maintain compatibility with subclasses, you should check if type is exactly &PYXXH32Type and fall back to the standard call path if it is not.
| PYXXH64_vectorcall(PyObject *type, PyObject *const *args, | ||
| size_t nargsf, PyObject *kwnames) | ||
| { |
| PYXXH3_64_vectorcall(PyObject *type, PyObject *const *args, | ||
| size_t nargsf, PyObject *kwnames) | ||
| { |
| PYXXH3_128_vectorcall(PyObject *type, PyObject *const *args, | ||
| size_t nargsf, PyObject *kwnames) | ||
| { |
When _parse_fastcall_args succeeds but tp_alloc subsequently fails, the acquired buffer was leaked. Now release buf/buf_owner before returning NULL in all four vectorcall functions.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements vectorcall support for the xxh32, xxh64, xxh3_64, and xxh3_128 hash types to optimize performance. It refactors _parse_fastcall_args to support optional input arguments and updates the benchmark suite to focus on micro-benchmarks and constructor performance. The reviewer suggested using existing do_update helper functions within the new vectorcall implementations to reduce code duplication.
| if (buf.buf) { | ||
| Py_BEGIN_ALLOW_THREADS | ||
| XXH32_update(self->xxhash_state, buf.buf, buf.len); | ||
| Py_END_ALLOW_THREADS | ||
| PyBuffer_Release(&buf); | ||
| } |
There was a problem hiding this comment.
| if (buf.buf) { | ||
| Py_BEGIN_ALLOW_THREADS | ||
| XXH64_update(self->xxhash_state, buf.buf, buf.len); | ||
| Py_END_ALLOW_THREADS | ||
| PyBuffer_Release(&buf); | ||
| } |
| if (buf.buf) { | ||
| Py_BEGIN_ALLOW_THREADS | ||
| XXH3_64bits_update(self->xxhash_state, buf.buf, buf.len); | ||
| Py_END_ALLOW_THREADS | ||
| PyBuffer_Release(&buf); | ||
| } |
| if (buf.buf) { | ||
| Py_BEGIN_ALLOW_THREADS | ||
| XXH3_128bits_update(self->xxhash_state, buf.buf, buf.len); | ||
| Py_END_ALLOW_THREADS | ||
| PyBuffer_Release(&buf); | ||
| } |
Replace duplicated Py_BEGIN_ALLOW_THREADS/update/Py_END_ALLOW_THREADS/ PyBuffer_Release blocks with calls to existing PYXXH*_do_update helpers. Reduces code duplication across all four vectorcall implementations.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements the Python vectorcall protocol for xxhash hash objects (xxh32, xxh64, xxh3_64, and xxh3_128) to optimize performance for small inputs. It also refactors the benchmark suite to include micro-benchmarks for tiny inputs, string encoding paths, and constructors. Feedback was provided regarding the buffer release logic in the new vectorcall implementations, suggesting that checking buf.obj instead of buf.buf is a safer way to ensure PyBuffer_Release is called correctly, especially for empty inputs.
| if (buf.buf) | ||
| PYXXH32_do_update(self, &buf); |
There was a problem hiding this comment.
The check if (buf.buf) to decide whether to call PYXXH32_do_update (which releases the buffer) might be insufficient. If _parse_fastcall_args successfully acquired a buffer (e.g., for an empty input where buf.len == 0), buf.buf might be non-NULL but buf.len would be 0. More importantly, if PyObject_GetBuffer succeeded, buf.obj will be non-NULL. To avoid potential memory leaks or inconsistent state, it is safer to check if (buf.obj) or simply always call PYXXH32_do_update if the input argument was found, as XXH32_update handles zero-length inputs correctly and PyBuffer_Release must be called if a buffer was acquired.
if (buf.obj)
PYXXH32_do_update(self, &buf);buf.obj is set by PyObject_GetBuffer when a buffer is successfully acquired, making it a more semantically clear check than buf.buf. Initialize buf->obj = NULL in _parse_fastcall_args to match.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements vectorcall support for xxhash hash objects in the C extension, optimizing object creation and initial hashing. It also refactors the benchmark suite to include micro-benchmarks for call overhead and string inputs. Feedback suggests adding a type check within the vectorcall implementations to ensure they only execute for the intended types, falling back to the standard calling mechanism otherwise to safely support potential future subclassing.
| if (_parse_fastcall_args(args, nargs, kwnames, "xxhash.xxh32()", 0, | ||
| &buf, &buf_owner, &raw_seed) < 0) | ||
| return NULL; | ||
| seed = (XXH32_hash_t)raw_seed; |
There was a problem hiding this comment.
The PYXXH32_vectorcall implementation (and others like it) bypasses the standard tp_new and tp_init mechanism. While xxhash.xxh32 is not currently subclassable (it lacks Py_TPFLAGS_BASETYPE), it is best practice to include a check to ensure the type argument is exactly &PYXXH32Type. If it is not, the function should fall back to the default calling mechanism to avoid breaking potential future subclassing or unexpected behavior if the flag is ever added.
if (type != (PyObject *)&PYXXH32Type) {
return PyType_Type.tp_call(type, PyTuple_New(0), NULL); // Simplified fallback
}
if (_parse_fastcall_args(args, nargs, kwnames, "xxhash.xxh32()", 0,
&buf, &buf_owner, &raw_seed) < 0)
return NULL;Covers one-shot intdigest/hexdigest and streaming constructors across all four algorithms at 2MB where hashing dominates.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements tp_vectorcall for the xxhash types (XXH32, XXH64, XXH3_64, and XXH3_128) to optimize object creation and performance. It also refactors the internal _parse_fastcall_args function to handle optional input arguments and updates the benchmark suite to measure micro-performance improvements. I have no feedback to provide.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements vectorcall support for xxhash hash objects, aiming to reduce overhead during object creation. It also refactors internal argument parsing to support optional inputs and ensures safer buffer handling by initializing buf->obj. The benchmark suite has been updated to better measure call overhead and string processing. Feedback was provided regarding the use of tp_alloc in the vectorcall constructors, suggesting a potential simplification using PyObject_New if the performance gains from direct allocation are not significant.
| PYXXH32Object *self = (PYXXH32Object *) | ||
| ((PyTypeObject *)type)->tp_alloc((PyTypeObject *)type, 0); |
There was a problem hiding this comment.
The use of tp_alloc directly in the vectorcall constructor is a valid optimization, but ensure that the type does not have a custom tp_new that performs essential setup not covered here. In this case, PYXXH32_new only performs state creation and reset, which are duplicated in this vectorcall implementation, so it is safe. However, for consistency with the rest of the codebase which uses PyObject_New, you might consider if tp_alloc is strictly necessary for the performance gain or if PyObject_New (which eventually calls tp_alloc) would be cleaner.
Summary
METH_FASTCALL for 12 module-level one-shot functions
tp_vectorcall for 4 type constructors
Other optimizations
PyLong_FromUnsigned*replacesPy_BuildValue(avoids format string parse)Py_ALWAYS_INLINEon internal arg-parsing helpersFeatures
_parse_fastcall_argshelper with consistent error reportingTests
Bug fixes
_newand_copywhencreateStatefailsPyErr_NoMemory()on allocation failuresconstqualifier mismatch onPyModuleDef.slotsPYXXH3_128_digestdocstring (8-byte → 16-byte)PyLong_AsUnsignedLongLongMasktp_allocfailure in vectorcall functions