fix(embedder): stabilize offline embedding cache key, prevent cross-model collision (#321) #337
Merged
…s-model collision (#321)

Part 1 (collision fix): `get_hash()` in the non-local-path branch now hashes `model_name` before the commit SHA. Previously, hashing only the SHA meant two different models that both fell back to the same revision string (e.g. "main" when offline) produced identical cache keys, silently serving one model's embeddings to another.

Part 2 (offline SHA resolution): `_get_latest_commit_hash()` now detects `huggingface_hub.constants.HF_HUB_OFFLINE` before attempting a network call. When offline it reads the commit SHA from the local HF cache ref file at `$HF_HUB_CACHE/<repo_folder>/refs/<rev>` (written by huggingface_hub on every successful download; equals the remote SHA). If the file is present, the cached SHA is returned with no network access and no warning. If absent, it falls back to the revision string at DEBUG level (not WARNING), since offline-with-no-cache is expected. The online path (model_info try/except) is unchanged.

APIs used: `huggingface_hub.constants.{HF_HUB_OFFLINE,HF_HUB_CACHE}` and `huggingface_hub.file_download.repo_folder_name(repo_id, repo_type="model")`, both confirmed present in the installed huggingface_hub.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
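The offline lookup described in Part 2 can be sketched roughly as follows. This is a minimal, stdlib-only illustration, not the code from this PR: the helper names (`repo_folder`, `read_cached_sha`, `resolve_revision_offline`) are hypothetical, and `repo_folder` re-implements the `models--<org>--<name>` folder naming by hand as an assumption rather than calling `repo_folder_name` from huggingface_hub.

```python
import logging
import tempfile
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def repo_folder(repo_id: str, repo_type: str = "model") -> str:
    # Assumption: mirrors the "models--org--name" folder naming of the HF cache.
    return "--".join([f"{repo_type}s", *repo_id.split("/")])


def read_cached_sha(cache_dir: Path, repo_id: str, revision: str = "main") -> Optional[str]:
    # The ref file holds the commit SHA that the revision pointed to at download time.
    ref_file = cache_dir / repo_folder(repo_id) / "refs" / revision
    try:
        sha = ref_file.read_text(encoding="utf-8").strip()
    except OSError:
        # Missing file, missing parent folder, or unreadable path.
        return None
    return sha or None


def resolve_revision_offline(cache_dir: Path, repo_id: str, revision: str = "main") -> str:
    sha = read_cached_sha(cache_dir, repo_id, revision)
    if sha is not None:
        return sha
    # Offline with no cached ref is expected, so log quietly and fall back.
    logger.debug("No cached ref for %s@%s; using revision string", repo_id, revision)
    return revision


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp)
        refs = cache / repo_folder("example-org/example-model") / "refs"
        refs.mkdir(parents=True)
        (refs / "main").write_text("0123abcd\n", encoding="utf-8")
        print(resolve_revision_offline(cache, "example-org/example-model"))
        print(resolve_revision_offline(cache, "example-org/not-downloaded"))
```

The fallback returns the revision string unchanged, which is why Part 1 matters: without `model_name` in the key, every uncached model on the same revision would resolve to the same value.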
Pin HF_HUB_CACHE to a tmp dir so the cross-model collision test never reads the developer's real cache, matching the sibling offline tests.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
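One way to get this kind of isolation is to patch the cache constant for the duration of a test. The sketch below is an illustration under assumptions, not the fixture used in this PR: `constants` is a stand-in namespace so the example runs without huggingface_hub installed, and `isolated_offline_cache` is a hypothetical helper name.

```python
import os
import tempfile
from contextlib import contextmanager
from types import SimpleNamespace
from unittest import mock

# Stand-in for huggingface_hub.constants (assumption, for a library-free sketch).
constants = SimpleNamespace(HF_HUB_OFFLINE=False, HF_HUB_CACHE="/path/to/real/cache")


@contextmanager
def isolated_offline_cache():
    # Point the cache at an empty temp dir and force offline mode;
    # both values are restored when the block exits.
    with tempfile.TemporaryDirectory() as tmp:
        with mock.patch.object(constants, "HF_HUB_OFFLINE", True), \
             mock.patch.object(constants, "HF_HUB_CACHE", tmp):
            yield tmp


if __name__ == "__main__":
    with isolated_offline_cache() as cache_dir:
        print(constants.HF_HUB_OFFLINE, os.listdir(cache_dir))
    print(constants.HF_HUB_CACHE)
```

Because the temp dir starts empty, any ref file a test finds there must have been written by the test itself.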
Summary
Fixes #321 (description + comment). Two complementary fixes to the embedder cache key in `sentence_transformers.py`:

- `get_hash()` hashed only the resolved commit hash + `max_length`, never `model_name`. Offline, two different models both degrade to the revision string `"main"` → identical key → one served the other's embeddings (wrong data). Now `model_name` is hashed in the remote-model branch (not the trained/local-path branch, which uses an ephemeral tempdir path), so distinct models can never collide even when the SHA can't be resolved.
- When `HF_HUB_OFFLINE` is set, `_get_latest_commit_hash` now resolves the SHA from the local HF cache ref file (`$HF_HUB_CACHE/<repo_folder>/refs/<rev>`) instead of calling `model_info`. This removes the spurious per-cold-model warning and keeps the key stable/auto-invalidating offline, identical to the online path. The online path is unchanged; the missing-ref fallback is demoted to DEBUG.

Note: common models pinned in `DEFAULT_REVISIONS` already short-circuit on a SHA; these bugs bite non-pinned models or explicitly-set non-SHA revisions.

Test plan

- `tests/embedder/test_hash.py`: cross-model no-collision offline; offline local-ref resolution (asserts `model_info` not called); missing-ref fallback. All network-free, no model loads, with `lru_cache` cleared and `HF_HUB_CACHE` isolated to tmp.
- `huggingface_hub` 0.36.2; ruff + mypy clean.

🤖 Generated with Claude Code
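The cross-model collision and its fix can be reproduced in a few lines. This is a sketch under assumptions, not the body of `get_hash()`: the SHA-256 digest, the NUL separator, the function names, and the two model names are all hypothetical and chosen only to show why adding `model_name` to the hashed fields separates the keys.

```python
import hashlib


def key_revision_only(revision: str, max_length: int) -> str:
    # Old behaviour (illustrative): the model identity never enters the key.
    h = hashlib.sha256()
    h.update(revision.encode("utf-8"))
    h.update(str(max_length).encode("utf-8"))
    return h.hexdigest()


def key_with_model_name(model_name: str, revision: str, max_length: int) -> str:
    # New behaviour (illustrative): model_name is hashed before the revision.
    h = hashlib.sha256()
    for part in (model_name, revision, str(max_length)):
        h.update(part.encode("utf-8"))
        h.update(b"\0")  # separator keeps field boundaries unambiguous
    return h.hexdigest()


if __name__ == "__main__":
    # Offline, both hypothetical models fall back to the revision string "main".
    a, b = "example-org/model-a", "example-org/model-b"
    print(key_revision_only("main", 512) == key_revision_only("main", 512))
    print(key_with_model_name(a, "main", 512) == key_with_model_name(b, "main", 512))
```

The separator is a design choice of this sketch: without one, concatenated fields such as `("ab", "c")` and `("a", "bc")` would feed identical bytes to the hash.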