Skip to content

feat: hashlib compatibility#148

Merged
ifduyue merged 1 commit intomasterfrom
hashlib-compliant
Apr 27, 2026
Merged

feat: hashlib compatibility#148
ifduyue merged 1 commit intomasterfrom
hashlib-compliant

Conversation

@ifduyue
Copy link
Copy Markdown
Owner

@ifduyue ifduyue commented Apr 27, 2026

Summary

Makes xxhash hashlib-compatible:

  • algorithms_available and algorithms_guaranteed module attributes
  • Str rejection: TypeError: Strings must be encoded before hashing (matches hashlib exactly)
  • data= keyword argument renamed from input=
  • Full buffer type support: bytes, bytearray, memoryview, array, mmap, PickleBuffer, ctypes
  • 13 hashlib compatibility tests (digest_size, block_size, name, update, copy, etc.)

Changes

  • _get_buffer_or_str: simplified to reject str instead of auto-encoding
  • Error messages: input -> data throughout
  • Tests: updated all str usages to bytes, added buffer type tests, hashlib compat suite
  • Version: 3.8.0.dev5

@codspeed-hq
Copy link
Copy Markdown
Contributor

codspeed-hq Bot commented Apr 27, 2026

Merging this PR will not alter performance

✅ 30 untouched benchmarks
⏩ 90 skipped benchmarks1


Comparing hashlib-compliant (90b376d) with master (b24c873)

Open in CodSpeed

Footnotes

  1. 90 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to improve hashlib compatibility by renaming the 'input' keyword argument to 'data' and modifying the C implementation to reject string objects in favor of encoded bytes. It includes updates to docstrings, a version bump, and a new test suite for hashlib compliance. The review feedback correctly identifies that the update() method and tp_init still use the 's*' format unit in PyArg_ParseTuple, which implicitly encodes strings and contradicts the PR's goal. Additionally, some test cases in the xxh32 and related test files still use string literals for update() calls, which should be updated to bytes to ensure the new rejection logic is properly tested.

Comment thread src/_xxhash.c
Comment thread tests/test_xxh32.py
@ifduyue ifduyue force-pushed the hashlib-compliant branch from 8d98528 to db87d55 Compare April 27, 2026 07:47
@ifduyue
Copy link
Copy Markdown
Owner Author

ifduyue commented Apr 27, 2026

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request improves hashlib compatibility by rejecting string inputs and renaming the input keyword to data. Changes include C extension updates, docstring revisions, and a new compatibility test suite. Feedback suggests correcting docstrings that still mention "string data" and using the _get_buffer_or_str helper in update methods for consistent error reporting. Additionally, expanding compatibility tests to include update() and the xxh128 alias is recommended.

Comment thread src/_xxhash.c Outdated
Comment thread src/_xxhash.c Outdated
Comment thread src/_xxhash.c Outdated
Comment thread src/_xxhash.c Outdated
Comment on lines 932 to 937
Py_buffer buf;

buf.buf = buf.obj = NULL;

if (!PyArg_ParseTuple(args, "s*:update", &buf)) {
if (PyObject_GetBuffer(arg, &buf, PyBUF_SIMPLE) < 0)
return NULL;
}

PYXXH64_do_update(self, &buf);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The update method should use the _get_buffer_or_str helper to ensure the error message for string inputs matches hashlib exactly.

    Py_buffer buf;
    PyObject *owner = NULL;

    if (_get_buffer_or_str(arg, &buf, &owner) < 0)
        return NULL;

    PYXXH64_do_update(self, &buf);
    Py_XDECREF(owner);

Comment thread src/_xxhash.c Outdated
Comment thread src/_xxhash.c Outdated
Comment thread src/_xxhash.c Outdated
Comment thread src/_xxhash.c Outdated
Comment thread tests/test_hashlib_compat.py
@ifduyue
Copy link
Copy Markdown
Owner Author

ifduyue commented Apr 27, 2026

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aligns the library with Python's hashlib by removing automatic UTF-8 encoding for strings, renaming the 'input' keyword argument to 'data', and adding standard attributes like algorithms_guaranteed, digest_size, and block_size. The test suite has been comprehensively updated to use byte literals and includes new compatibility checks. Review feedback focuses on maintaining consistency in error messages across different calling paths, updating docstrings to reflect the rejection of string inputs, and simplifying the internal C API by removing now-redundant parameters.

Comment thread src/_xxhash.c Outdated
Comment thread src/_xxhash.c
Comment thread src/_xxhash.c Outdated
Comment thread src/_xxhash.c Outdated
Comment thread src/_xxhash.c Outdated
@ifduyue
Copy link
Copy Markdown
Owner Author

ifduyue commented Apr 27, 2026

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances hashlib compatibility by renaming the input parameter to data, explicitly rejecting str inputs with a specific TypeError, and adding standard attributes like algorithms_guaranteed, digest_size, and block_size. Feedback highlights that the current str rejection logic in constructors only applies to positional arguments, leading to inconsistent error messages when data is passed as a keyword. It is recommended to refactor the initialization logic for all hash types to use the _get_buffer_or_str helper for consistent validation and to add test cases covering keyword-based str rejection.

Comment thread src/_xxhash.c Outdated
Comment thread src/_xxhash.c Outdated
Comment thread src/_xxhash.c Outdated
Comment thread src/_xxhash.c Outdated
Comment thread tests/test_hashlib_compat.py
@ifduyue ifduyue force-pushed the hashlib-compliant branch from 9468e01 to fba326c Compare April 27, 2026 11:11
@ifduyue
Copy link
Copy Markdown
Owner Author

ifduyue commented Apr 27, 2026

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request renames the input parameter to data and enforces bytes-like input by rejecting strings with a TypeError to align with hashlib standards. It also adds compatibility attributes such as algorithms_guaranteed, digest_size, and block_size, and updates the test suite accordingly. Review feedback suggests removing explicit None checks in constructors to ensure a TypeError is correctly raised for None inputs and identifies a version string mismatch in version.py.

Comment thread src/_xxhash.c Outdated
Comment thread src/_xxhash.c Outdated
Comment thread src/_xxhash.c Outdated
Comment thread src/_xxhash.c Outdated
Comment thread xxhash/version.py Outdated
@ifduyue
Copy link
Copy Markdown
Owner Author

ifduyue commented Apr 27, 2026

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the xxhash library to improve compatibility with Python's hashlib. Key changes include the removal of automatic UTF-8 encoding for string inputs, which now raise a TypeError, and the renaming of the input keyword argument to data across all hashing functions and constructors. The update method has been optimized to use METH_O, and new tests have been added to verify hashlib compatibility and support for various buffer-compliant types like mmap and PickleBuffer. Additionally, the version has been bumped to 3.8.0.dev6. I have no feedback to provide.

@ifduyue
Copy link
Copy Markdown
Owner Author

ifduyue commented Apr 27, 2026

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances hashlib compatibility by requiring bytes-like input, renaming the input parameter to data, and adding standard attributes such as algorithms_guaranteed and digest_size. While the changes include a transition to manual argument parsing in the C extension, the current implementation for __init__ and update methods is buggy; it fails to correctly process keyword-only arguments and ignores unknown keywords. It is recommended to use PyArg_ParseTupleAndKeywords or refine the manual parsing logic to ensure robust argument handling and accurate error messages.

Comment thread src/_xxhash.c
Comment thread src/_xxhash.c
Comment thread src/_xxhash.c
Comment thread src/_xxhash.c
Comment thread src/_xxhash.c
Comment thread src/_xxhash.c
Comment thread src/_xxhash.c
Comment thread src/_xxhash.c
@ifduyue
Copy link
Copy Markdown
Owner Author

ifduyue commented Apr 27, 2026

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aligns the xxhash library with Python's hashlib by renaming the input argument to data, rejecting string inputs to require explicit encoding, and adding the algorithms_guaranteed attribute. It also refactors internal argument parsing to use the fastcall convention and introduces a new test suite for hashlib compatibility. Review feedback suggests improving error message specificity in _check_kwargs for better consistency and correcting a versioning inconsistency in version.py.

Comment thread src/_xxhash.c
Comment thread xxhash/version.py
@ifduyue
Copy link
Copy Markdown
Owner Author

ifduyue commented Apr 27, 2026

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aligns the xxhash API with Python's hashlib by requiring data to be encoded before hashing and renaming the input keyword argument to data. It removes automatic string encoding, adds algorithms_guaranteed, and updates internal argument parsing to use FASTCALL. Feedback suggests refactoring the duplicated initialization and update logic across the various hash types into shared helper functions or macros to improve maintainability.

Comment thread src/_xxhash.c
Comment on lines 544 to 601
static int PYXXH32_init(PYXXH32Object *self, PyObject *args, PyObject *kwargs)
{
XXH32_hash_t seed = 0;
char *keywords[] = {"input", "seed", NULL};
Py_buffer buf;
PyObject *data_obj = NULL;
Py_buffer buf = {NULL, NULL};
Py_ssize_t nargs = PyTuple_GET_SIZE(args);

buf.buf = buf.obj = NULL;
if (_check_kwargs(kwargs) < 0)
return -1;

if (!PyArg_ParseTupleAndKeywords(args, kwargs, "|s*I:__init__", keywords, &buf, &seed)) {
if (nargs >= 1) {
data_obj = PyTuple_GET_ITEM(args, 0);
if (kwargs && PyDict_GetItemString(kwargs, "data")) {
PyErr_SetString(PyExc_TypeError,
"__init__() got multiple values for argument 'data'");
return -1;
}
}
if (nargs >= 2) {
seed = (XXH32_hash_t)PyLong_AsUnsignedLongMask(PyTuple_GET_ITEM(args, 1));
if (PyErr_Occurred()) return -1;
if (kwargs && PyDict_GetItemString(kwargs, "seed")) {
PyErr_SetString(PyExc_TypeError,
"__init__() got multiple values for argument 'seed'");
return -1;
}
}
if (nargs > 2) {
PyErr_SetString(PyExc_TypeError,
"__init__() takes at most 2 positional arguments");
return -1;
}

if (kwargs) {
PyObject *val = PyDict_GetItemString(kwargs, "data");
if (val) {
if (data_obj) return -1; /* unreachable, caught above */
data_obj = val;
}
val = PyDict_GetItemString(kwargs, "seed");
if (val) {
seed = (XXH32_hash_t)PyLong_AsUnsignedLongMask(val);
if (PyErr_Occurred()) return -1;
}
}

if (data_obj) {
if (_get_buffer_or_str(data_obj, &buf) < 0)
return -1;
}

self->seed = seed;
XXH32_reset(self->xxhash_state, seed);

if (buf.obj) {
if (buf.obj)
PYXXH32_do_update(self, &buf);
}

return 0;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There is significant code duplication across the init, update, and hexdigest methods for the four hash types (XXH32, XXH64, XXH3_64, XXH3_128). The manual keyword parsing and error handling logic is almost identical. Consider refactoring this into shared helper functions or using macros to improve maintainability and reduce the risk of inconsistencies.

- algorithms_available / algorithms_guaranteed: module attributes
- Str rejection: TypeError('Strings must be encoded before hashing')
- None rejection: TypeError('object supporting the buffer API required')
- data= keyword argument across all entry points (constructor, update, one-shot)
- Full buffer type support: bytes, bytearray, memoryview, array, mmap, PickleBuffer, ctypes
- tp_vectorcall on all 4 type constructors (CPython fast path)
- METH_FASTCALL on all 12 module-level one-shot functions
- METH_FASTCALL|METH_KEYWORDS on all 4 update() methods
- Manual arg parsing in tp_init for PyPy fallback
- Reject unknown keywords, duplicate args, extra positional args globally
- _get_buffer_or_str, _parse_fastcall_args, _check_kwargs shared helpers
- Py_ALWAYS_INLINE on all performance-critical helpers
- PyLong_FromUnsigned* replaces Py_BuildValue
- Remove hexdigits lookup table (regressed)
- 120 tests (15 hashlib compat, 32 fastcall, 34 benchmark, 39 original)
- Tested on CPython 3.9-3.15 and PyPy 3.9-3.11
@ifduyue ifduyue force-pushed the hashlib-compliant branch from 290258a to 90b376d Compare April 27, 2026 13:28
@ifduyue ifduyue merged commit ff51b84 into master Apr 27, 2026
42 checks passed
@ifduyue ifduyue deleted the hashlib-compliant branch April 27, 2026 13:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant