Hash frozensets deterministically#8284

Open
sarathfrancis90 wants to merge 1 commit into
huggingface:mainfrom
sarathfrancis90:deterministic-frozenset-hash

While looking at how transforms get fingerprinted, I noticed that frozenset, unlike set, isn't hashed deterministically.

set has a custom reducer (_save_set) that pickles its elements sorted, so the fingerprint is stable across runs. frozenset has no such reducer and falls back to the default one, which pickles the elements in iteration order. That order depends on the elements' hashes, and Python randomizes the string hash seed per process, so a transform that captures a frozenset of strings can get a different fingerprint every session and the cache misses.

Repro:

from datasets.fingerprint import Hasher
# prints a different hash on every run (PYTHONHASHSEED is random by default)
print(Hasher.hash(frozenset("abcdefghij")))
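
The underlying cause can also be seen without datasets. The following is a minimal stdlib-only sketch, not part of this PR's diff: it runs a child interpreter under fixed PYTHONHASHSEED values and compares the raw iteration order of a frozenset of strings with its sorted order. The names RAW, SORTED, and run_with_seed are hypothetical helpers introduced here for illustration.

```python
import os
import subprocess
import sys

# Hypothetical one-liners executed in a child interpreter.
RAW = "print(list(frozenset('abcdefghij')))"
SORTED = "print(sorted(frozenset('abcdefghij')))"


def run_with_seed(snippet: str, seed: str) -> str:
    """Run `snippet` in a fresh interpreter with PYTHONHASHSEED set to `seed`."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", snippet],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    for seed in ("1", "2", "3"):
        # Raw iteration order typically changes with the seed...
        print(seed, "raw:   ", run_with_seed(RAW, seed))
        # ...while the sorted order does not depend on it.
        print(seed, "sorted:", run_with_seed(SORTED, seed))
```

Anything that serializes the raw order inherits the seed dependence, which is why sorting before pickling makes the fingerprint reproducible.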

I registered the same sorted reducer for frozenset that set already uses (falling back to sorting by Hasher.hash for unorderable elements). Added a test that hashes a frozenset in two subprocesses with different PYTHONHASHSEED and checks the hashes match, mirroring the existing set coverage.
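
To illustrate the idea outside of datasets, here is a rough stdlib-only sketch of a sorted reducer. It is not the code in this PR: SortedSetPickler, stable_dumps, and stable_hash are hypothetical names, and sha256 stands in for whatever Hasher uses internally. It relies on the pure-Python pickle._Pickler (a private class) because the C pickler may skip reducer_override for exact set and frozenset instances.

```python
import hashlib
import io
import pickle


def _sorted_elements(items):
    """Sort elements; fall back to sorting by their stable hash if unorderable."""
    try:
        return sorted(items)
    except TypeError:
        return sorted(items, key=stable_hash)


class SortedSetPickler(pickle._Pickler):  # pure-Python pickler (private API)
    """Hypothetical pickler that serializes sets and frozensets in sorted order."""

    def reducer_override(self, obj):
        if type(obj) in (set, frozenset):
            # Rebuild as frozenset([...]) / set([...]) from a sorted list,
            # so the byte stream no longer depends on iteration order.
            return type(obj), (_sorted_elements(obj),)
        return NotImplemented  # everything else uses the default behaviour


def stable_dumps(obj) -> bytes:
    buf = io.BytesIO()
    # Pin the protocol so the bytes do not change with the interpreter default.
    SortedSetPickler(buf, protocol=4).dump(obj)
    return buf.getvalue()


def stable_hash(obj) -> str:
    return hashlib.sha256(stable_dumps(obj)).hexdigest()


if __name__ == "__main__":
    print(stable_hash(frozenset("abcdefghij")))
    print(stable_hash(frozenset({1, "a"})))  # mixed types use the fallback key
```

The output still unpickles to an equal frozenset; only the order in which the elements are written changes.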

Sets are pickled with sorted elements so their fingerprint is stable across
Python sessions, but frozensets fell back to the default reducer that keeps
the iteration order. Since the string hash seed is randomized per process, a
transform capturing a frozenset got a different fingerprint every run and the
cache never hit. Register the same sorted reducer for frozenset.