Skip to content

[Storage] Add Hugging Face Buckets and repos as a storage backend#9698

Open
XciD wants to merge 10 commits into
skypilot-org:masterfrom
XciD:adrien/hf-buckets-storage-pr
Open

[Storage] Add Hugging Face Buckets and repos as a storage backend#9698
XciD wants to merge 10 commits into
skypilot-org:masterfrom
XciD:adrien/hf-buckets-storage-pr

Conversation

@XciD
Copy link
Copy Markdown

@XciD XciD commented May 22, 2026

This is a continuation of #9418 by @nikhiljha, opened directly against master to ease the review process. The original commits are preserved in history.

Summary

Adds a new hf storage type that wraps Hugging Face Buckets (Xet-backed S3-like object storage, read-write) and, via the same handle, also supports mounting HF model/dataset/space repos read-only.

All bucket lifecycle operations (create/upload/download/delete) go through huggingface_hub (>=1.10). MOUNT / MOUNT_CACHED modes install and invoke hf-mount with its FUSE backend, which aligns with SkyPilot's existing FUSE-based mount infrastructure (gcsfuse / blobfuse2 / rclone mount / goofys).

URL formats

hf://buckets/<ns>/<bucket>[/<sub>]        (bucket, rw)
hf://<ns>/<model>[@<rev>][/<sub>]         (repo, ro)
hf://datasets/<ns>/<ds>[@<rev>][/<sub>]   (repo, ro)
hf://spaces/<ns>/<sp>[@<rev>][/<sub>]     (repo, ro)

Other notes

  • Auth: reads HF_TOKEN from env or the HF CLI token file and propagates it to clusters via the existing credential-file-mounts pipeline.
  • Fail-loud uploads: uploading to a repo with a local source raises StorageUploadError instead of silently no-op'ing.
  • glibc constraint: hf-mount v0.6.5 Linux binaries are linked against glibc >= 2.34, so MOUNT-mode tasks on the default SkyPilot Kubernetes image (Ubuntu 20.04, glibc 2.31) need an explicit newer base image via resources.image_id (e.g. docker:mirror.gcr.io/ubuntu:22.04). COPY mode works on any image. See docs/source/reference/storage.rst.

hf-mount bump (v0.3.1 → v0.6.5)

The originally referenced hf-mount v0.3.1 could not mount in unprivileged Kubernetes pods (the default SkyPilot k8s execution context). Two bugs blocked it, both now fixed upstream:

  1. fuser hard-failed before reaching the fusermount fallback: fuser's direct /dev/fuse open path didn't fall back to fusermount when the device wasn't openable, so the mount failed before the setuid fusermount-shim could ever be invoked. Fixed in hf-mount#173 (picking up fuser#1). Released as v0.6.3.

  2. clone_fd opened /dev/fuse N-1 more times after the initial mount: after the initial mount succeeded via the proxy, fuser's session loop opened /dev/fuse an additional n_threads - 1 times to clone per-thread fds. Those direct opens hit EPERM in the unprivileged pod and the daemon crashed with FUSE session error: Operation not permitted (os error 1) right after the FUSE_INIT handshake. Fixed in hf-mount#174: probe /dev/fuse openability once at startup and disable clone_fd if denied. Released as v0.6.4. v0.6.5 is a maintenance release on top.

Also corrects the inline comment's glibc minimum: the binaries actually require glibc 2.34 (verified with strings hf-mount-fuse-x86_64-linux | grep GLIBC_ against the published release), not 2.32 as previously documented.

nikhiljha and others added 3 commits April 21, 2026 10:37
Adds a new hf storage type that wraps Hugging Face Buckets
(Xet-backed S3-like object storage, read-write) and, via the same
handle, also supports mounting HF model/dataset/space repos read-only.

All bucket lifecycle operations (create/upload/download/delete) go
through huggingface_hub (>=1.10). MOUNT/MOUNT_CACHED modes install
and invoke hf-mount (https://github.com/huggingface/hf-mount) with
its FUSE backend, which aligns with SkyPilot's existing FUSE-based
mount infrastructure (gcsfuse/blobfuse2/rclone mount/goofys).

Note: the published v0.3.1 hf-mount Linux binaries are linked against
glibc >= 2.32, so MOUNT-mode tasks on the default SkyPilot k8s image
(Ubuntu 20.04, glibc 2.31) need an explicit newer base image via
resources.image_id (e.g. docker:mirror.gcr.io/ubuntu:22.04). COPY
mode works on any image. See docs/source/reference/storage.rst.

- URL formats:
    hf://buckets/<ns>/<bucket>[/<sub>]           (bucket, rw)
    hf://<ns>/<model>[@<rev>][/<sub>]            (repo, ro)
    hf://datasets/<ns>/<ds>[@<rev>][/<sub>]      (repo, ro)
    hf://spaces/<ns>/<sp>[@<rev>][/<sub>]        (repo, ro)

- Auth: reads HF_TOKEN from env or the HF CLI token file and propagates
  it to clusters via the existing credential-file-mounts pipeline.

- Uploading to a repo with a local source raises StorageUploadError
  instead of silently no-op'ing, so misconfigurations fail loud.

Tested:
- Unit tests covering URL parsing, name validation, classify/upload
  guards, public Storage construction paths, COPY-mode translation,
  credential propagation, cloud-store download command generation, and
  the hf-mount command generator (including --fuse and fuse3 bootstrap
  assertions).
- Focused HF test pass:
    uv run --with-editable .[huggingface] --with pytest --with pytest-xdist
      --with pytest-env --with buildkite-test-collector --with boto3
      pytest tests/unit_tests/test_sky/storage/test_huggingface.py -q -n 0
- git diff --check, yapf --diff, isort --diff.
- End-to-end on a Kubernetes cluster:
    * create bucket -> upload dir+files (COPY mode via HfApi.sync_bucket)
    * pod starts, HF token file propagated to pod at ~/.cache/huggingface
    * pod lists bucket via huggingface_hub, downloads hello.txt
    * pod writes pod-wrote.txt back via batch_bucket_files
    * re-list shows pod-written file, re-download confirms content
    * clean teardown (sky down + HfApi.delete_bucket)
    * MOUNT mode via hf-mount --fuse on Ubuntu 22.04
- Broaden Python support: relax huggingface_hub>=1.10 (Python 3.10+) to
  >=1.5 (>=1.5,<1.9 on Python 3.9). 1.5 is the first release with the
  full Buckets API and still supports Python 3.9; 1.9+ drops 3.9.
  This fixes the CI unit-test job (runs on 3.9) that was failing with
  ModuleNotFoundError.
- cloud_stores.HFCloudStorage.is_directory: use list_repo_tree instead
  of repo_info().siblings for repo-path checks. repo_info fetches
  metadata for every file in the repo; list_repo_tree(path_in_repo=p)
  only fetches the specified path (and raises EntryNotFoundError for
  file paths, which is how we now distinguish file vs directory).
- HuggingFaceStore.mount_cached_command: propagate MountCachedConfig
  read_only to hf-mount via --read-only instead of silently dropping
  the config. Other (rclone-specific) fields still have no hf-mount
  equivalent; documented in the docstring.
- Doc nits from review: shorten HF mention in intro, drop the full
  hf:// URL enumeration from the example (users can refer to the HF
  CLI docs), single hf:// example in the Storage YAML reference.
- Add unit tests covering the list_repo_tree optimization and
  mount_cached_command propagation.

Tested:
- Full HF test suite passes on Python 3.9 (278) and 3.11 (310).
- bash format.sh clean (black, yapf, isort, mypy, pylint 10.00/10).
v0.3.1 cannot mount in unprivileged k8s pods (the default SkyPilot
Kubernetes execution context). Two bugs blocked it:

1. fuser's direct /dev/fuse open path didn't fall back to fusermount
   when the device wasn't openable, so the mount failed before the
   setuid fusermount-shim could ever be invoked. Fixed in hf-mount
   v0.6.3 (huggingface/hf-mount#173, picking up huggingface/fuser#1).

2. After the initial mount succeeded via fusermount-shim, fuser's
   session loop opened /dev/fuse N-1 more times to clone per-thread
   fds. Those direct opens hit EPERM in the unprivileged pod and
   the daemon crashed with 'FUSE session error: Operation not
   permitted (os error 1)' immediately after the FUSE_INIT
   handshake, tearing the mount down (visible to the pod as
   'mountpoint /data' returning 'not a mountpoint'). Fixed in
   hf-mount v0.6.4 (huggingface/hf-mount#174): probe /dev/fuse
   openability once at startup and disable clone_fd if denied.

v0.6.5 is a maintenance release on top of v0.6.4.

Verified end-to-end on a private sandbox Kubernetes cluster with the
fusermount-shim/fusermount-server setup from this PR: mount, list,
and read of a dataset repo all succeed.

Also corrects the comment's glibc minimum (the actual binaries
require 2.34, not 2.32; verified with 'strings ... | grep GLIBC_'
against the published release).
@XciD XciD marked this pull request as ready for review May 22, 2026 17:49
gemini-code-assist[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

@Michaelvll Michaelvll requested a review from zpoint May 22, 2026 18:10
Devin findings:
- Fix attribute name bug in HuggingFaceStore._classify/_validate: the sub-path
  was being assigned to `self.bucket_sub_path` (no leading underscore), but
  every consumer reads `self._bucket_sub_path`. The assignment silently
  created a dead attribute. Masked in the common flow (Storage.add_store
  injects _bucket_sub_path at the parent level), but corrupted state if
  HuggingFaceStore is constructed directly with a sub-path URL.
- Update docs/source/reference/storage.rst FAQ to reference hf-mount v0.6.5
  and glibc >= 2.34 (was still pointing at v0.3.1 / glibc 2.32 after the
  version bump).

Gemini findings:
- Harden HF token cache directory permissions: os.makedirs(mode=0o700)
  ignores the mode flag for pre-existing dirs, and the prior code swallowed
  any chmod failure. Now we raise StorageSpecError on chmod failure and
  verify the final mode before writing the token; we refuse to write if the
  dir is group- or world-accessible.
- Remove `eval echo` from the hf-mount token-file path expansion. Expand a
  leading `~/` to `$HOME/` in Python instead so the generated shell command
  never invokes eval.
- Improve StorageSpecError wording when an HF repo source disagrees with the
  storage name ('Hugging Face' instead of the cryptic 'HF').
- Parallelize multi-source bucket uploads with a small ThreadPoolExecutor
  (max 4 workers). Each sync_bucket call is already internally parallel, so
  the pool just overlaps I/O between distinct source dirs. Also classifies
  all sources up front so missing paths fail loud before any upload starts.
devin-ai-integration[bot]

This comment was marked as resolved.

Devin finding: HuggingFaceStore._get_bucket() wrapped both bucket_info()
and _validate_existing_bucket() in a single try/except Exception block,
so a StorageSpecError raised by _validate_existing_bucket (e.g. when a
user tries to mount an externally-created bucket without ``source:``)
was caught by the broad handler and re-raised as a misleading
StorageBucketGetError. Move the validation out of the try block so the
spec error surfaces its actionable message.
devin-ai-integration[bot]

This comment was marked as resolved.

Devin finding: ``snapshot_download`` preserves repo-relative paths under
``local_dir``, so passing ``allow_patterns=['<sub_path>/*']`` places
files at ``<destination>/<sub_path>/<files>`` rather than
``<destination>/<files>``. For a source like
``hf://openai-community/gpt2/onnx`` mounted at ``/data/onnx``, files
ended up at ``/data/onnx/onnx/model.onnx`` instead of
``/data/onnx/model.onnx``.

Both COPY-mode paths are affected:

- ``HFCloudStorage.make_sync_dir_command`` (remote heredoc).
- ``HuggingFaceStore.download_remote_dir`` (in-process).

Fix: when a sub-path is present, stage ``snapshot_download`` into a
temporary directory and move only the sub-path contents up to the
real destination. The no-sub-path path is unchanged.

Added an assertion to ``test_repo_dir_download_command`` plus a new
``test_repo_dir_download_command_no_sub_path`` to lock in the
behavior.
devin-ai-integration[bot]

This comment was marked as resolved.

…path_prefix

Devin finding: when ``storage.source`` is an HF URL with a sub-path, the
sub-path was relied upon entirely from the source URL, while every other
store type also runs the constructed blob path through
``get_bucket_sub_path_prefix``. This left a gap when ``_bucket_sub_path``
is set independently of the source URL (e.g. constructed via the
internal API or restored from a handle that no longer carries the sub-
path in ``source``).

Naively always calling ``get_bucket_sub_path_prefix`` would double the
sub-path in the common case (``hf://buckets/ns/bucket/sub`` →
``hf://buckets/ns/bucket/sub/sub``). Instead, parse the source URL,
drop any embedded sub-path to recover a canonical base URL, then run
that through ``get_bucket_sub_path_prefix`` — symmetric with the other
``StoreType`` branches.
devin-ai-integration[bot]

This comment was marked as resolved.

…porting

Devin / AGENTS.md: ``huggingface_hub`` is already exposed at module level
via ``common.LazyImport``, so the in-function ``from huggingface_hub
import HfApi`` was redundant. Drop the local import and call
``huggingface_hub.HfApi()`` directly — the LazyImport defers loading
until first attribute access, so ``sky`` import cost is unchanged.
Copy link
Copy Markdown
Collaborator

@zpoint zpoint left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be a copy from #9418. Could you address these comments?

…~ expansion)

Addresses the review feedback from PR skypilot-org#9418 (referenced on skypilot-org#9698):

- installation.rst: add a Hugging Face storage backend section with the
  _huggingface-installation anchor (pip extras, hf auth login / HF_TOKEN,
  sky check huggingface).
- check.py: include HuggingFace in the storage-only clouds comment.
- adaptors/huggingface.py: add a "For more info:" docs link to the
  token-not-set hint, like the other storage-only clouds. On whoami()
  failure, stop unconditionally blaming the token: surface the underlying
  error on a separate "Details:" line and make the refresh advice
  conditional (a missing httpx[socks], network errors, etc. are not token
  problems).
- cloud_stores.py: wrap every COPY-mode download destination with
  os.path.expanduser. We invoke huggingface_hub directly (no shell), so a
  wrapped "~/.sky/file_mounts/..." destination was treated as a literal
  "~" directory and files never reached the mount point.
- Add unit tests for the new hint behavior and the ~ expansion.
@XciD
Copy link
Copy Markdown
Author

XciD commented May 29, 2026

This seems to be a copy from #9418. Could you address these comments?

Addressed here: 1b30839

The doc-build job treats Sphinx warnings as errors. The new "Hugging Face"
section heading tripped the sentence-case linter ("Unexpected capitals:
Face"). Add it to MULTI_WORD_TERMS, like "Prime Intellect".
@Michaelvll Michaelvll requested a review from zpoint May 30, 2026 00:56
Copy link
Copy Markdown

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 new potential issue.

View 19 additional findings in Devin Review.

Open in Devin Review

Comment thread sky/cloud_stores.py
Comment on lines +902 to +915
code = '\n'.join([
'import os',
'import shutil',
'import tempfile',
'from huggingface_hub import hf_hub_download',
self._TOKEN_HELPER,
f'dest = os.path.expanduser({destination!r})',
'tmp_dir = tempfile.mkdtemp()',
(f'downloaded = hf_hub_download(repo_id={repo_id!r}, '
f'repo_type={repo_type!r}, revision={revision!r}, '
f'filename={path!r}, local_dir=tmp_dir, token=token)'),
'os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)',
'shutil.copy2(downloaded, dest)',
])
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Temp directory leaked in make_sync_file_command for HF repo file downloads

In HFCloudStorage.make_sync_file_command for repo (non-bucket) sources, a temporary directory is created via tempfile.mkdtemp() but is never cleaned up after the file is copied. The generated Python code downloads the file into tmp_dir, copies it to dest with shutil.copy2, and then exits — leaving tmp_dir and its contents on disk.

The comparable make_sync_dir_command at sky/cloud_stores.py:859-872 correctly wraps the download+move in a try/finally block that calls shutil.rmtree(tmp_dir, ignore_errors=True). This file-sync variant is missing the same cleanup, so every repo file download accumulates a leaked temp directory on the remote VM.

Suggested change
code = '\n'.join([
'import os',
'import shutil',
'import tempfile',
'from huggingface_hub import hf_hub_download',
self._TOKEN_HELPER,
f'dest = os.path.expanduser({destination!r})',
'tmp_dir = tempfile.mkdtemp()',
(f'downloaded = hf_hub_download(repo_id={repo_id!r}, '
f'repo_type={repo_type!r}, revision={revision!r}, '
f'filename={path!r}, local_dir=tmp_dir, token=token)'),
'os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)',
'shutil.copy2(downloaded, dest)',
])
code = '\n'.join([
'import os',
'import shutil',
'import tempfile',
'from huggingface_hub import hf_hub_download',
self._TOKEN_HELPER,
f'dest = os.path.expanduser({destination!r})',
'tmp_dir = tempfile.mkdtemp()',
'try:',
(f' downloaded = hf_hub_download(repo_id={repo_id!r}, '
f'repo_type={repo_type!r}, revision={revision!r}, '
f'filename={path!r}, local_dir=tmp_dir, token=token)'),
' os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)',
' shutil.copy2(downloaded, dest)',
'finally:',
' shutil.rmtree(tmp_dir, ignore_errors=True)',
])
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants