Skip to content

fix(embedder): stabilize offline embedding cache key, prevent cross-model collision (#321)#337

Merged
voorhs merged 2 commits into
devfrom
fix/issue-321-offline-embedder-cache
Jun 22, 2026
Merged

fix(embedder): stabilize offline embedding cache key, prevent cross-model collision (#321)#337
voorhs merged 2 commits into
devfrom
fix/issue-321-offline-embedder-cache

Conversation

@voorhs

@voorhs voorhs commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

Summary

Fixes #321 (description + comment). Two complementary fixes to the embedder cache key in sentence_transformers.py:

  1. Cross-model collision (the dangerous one, from the issue comment): get_hash() hashed only the resolved commit hash + max_length, never model_name. Offline, two different models both degrade to the revision string "main" → identical key → one served the other's embeddings (wrong data). Now model_name is hashed in the remote-model branch (not the trained/local-path branch, which uses an ephemeral tempdir path), so distinct models can never collide even when the SHA can't be resolved.
  2. Offline network call + warning + staleness: under HF_HUB_OFFLINE, _get_latest_commit_hash now resolves the SHA from the local HF cache ref file ($HF_HUB_CACHE/<repo_folder>/refs/<rev>) instead of calling model_info. This removes the spurious per-cold-model warning and keeps the key stable/auto-invalidating offline, identical to the online path. The online path is unchanged; the missing-ref fallback is demoted to DEBUG.

Note: common models pinned in DEFAULT_REVISIONS already short-circuit on a SHA; these bugs bite non-pinned models or explicitly-set non-SHA revisions.

Test plan

  • New tests in tests/embedder/test_hash.py: cross-model no-collision offline; offline local-ref resolution (asserts model_info not called); missing-ref fallback. All network-free, no model loads, with lru_cache cleared and HF_HUB_CACHE isolated to tmp.
  • Verified fail→pass via standalone snippets; HF APIs confirmed against huggingface_hub 0.36.2; ruff + mypy clean.
  • CI runs the full suite.

🤖 Generated with Claude Code

voorhs and others added 2 commits June 22, 2026 12:12
…s-model collision (#321)

Part 1 (collision fix): `get_hash()` in the non-local-path branch now hashes
`model_name` before the commit SHA. Previously, hashing only the SHA meant two
different models that both fell back to the same revision string (e.g. "main"
when offline) produced identical cache keys, silently serving one model's
embeddings to another.

Part 2 (offline SHA resolution): `_get_latest_commit_hash()` now detects
`huggingface_hub.constants.HF_HUB_OFFLINE` before attempting a network call.
When offline it reads the commit SHA from the local HF cache ref file at
`$HF_HUB_CACHE/<repo_folder>/refs/<rev>` (written by huggingface_hub on every
successful download; equals the remote SHA). If the file is present the cached
SHA is returned with no network access and no warning. If absent it falls back
to the revision string at DEBUG level (not WARNING), since offline-with-no-cache
is expected. The online path (model_info try/except) is unchanged.

APIs used: `huggingface_hub.constants.{HF_HUB_OFFLINE,HF_HUB_CACHE}` and
`huggingface_hub.file_download.repo_folder_name(repo_id, repo_type="model")`
— both confirmed present in the installed huggingface_hub.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pin HF_HUB_CACHE to a tmp dir so the cross-model collision test never
reads the developer's real cache, matching the sibling offline tests.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@voorhs voorhs merged commit b2f57be into dev Jun 22, 2026
19 checks passed
@voorhs voorhs deleted the fix/issue-321-offline-embedder-cache branch June 22, 2026 09:39
@voorhs voorhs mentioned this pull request Jun 22, 2026
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Embedder cache-key resolution still hits the HF API under HF_HUB_OFFLINE (warning + cache-key staleness)

1 participant