[Storage] Add Hugging Face Buckets and repos as a storage backend#9698
[Storage] Add Hugging Face Buckets and repos as a storage backend#9698XciD wants to merge 10 commits into
Conversation
Adds a new hf storage type that wraps Hugging Face Buckets (Xet-backed S3-like object storage, read-write) and, via the same handle, also supports mounting HF model/dataset/space repos read-only. All bucket lifecycle operations (create/upload/download/delete) go through huggingface_hub (>=1.10). MOUNT/MOUNT_CACHED modes install and invoke hf-mount (https://github.com/huggingface/hf-mount) with its FUSE backend, which aligns with SkyPilot's existing FUSE-based mount infrastructure (gcsfuse/blobfuse2/rclone mount/goofys). Note: the published v0.3.1 hf-mount Linux binaries are linked against glibc >= 2.32, so MOUNT-mode tasks on the default SkyPilot k8s image (Ubuntu 20.04, glibc 2.31) need an explicit newer base image via resources.image_id (e.g. docker:mirror.gcr.io/ubuntu:22.04). COPY mode works on any image. See docs/source/reference/storage.rst. - URL formats: hf://buckets/<ns>/<bucket>[/<sub>] (bucket, rw) hf://<ns>/<model>[@<rev>][/<sub>] (repo, ro) hf://datasets/<ns>/<ds>[@<rev>][/<sub>] (repo, ro) hf://spaces/<ns>/<sp>[@<rev>][/<sub>] (repo, ro) - Auth: reads HF_TOKEN from env or the HF CLI token file and propagates it to clusters via the existing credential-file-mounts pipeline. - Uploading to a repo with a local source raises StorageUploadError instead of silently no-op'ing, so misconfigurations fail loud. Tested: - Unit tests covering URL parsing, name validation, classify/upload guards, public Storage construction paths, COPY-mode translation, credential propagation, cloud-store download command generation, and the hf-mount command generator (including --fuse and fuse3 bootstrap assertions). - Focused HF test pass: uv run --with-editable .[huggingface] --with pytest --with pytest-xdist --with pytest-env --with buildkite-test-collector --with boto3 pytest tests/unit_tests/test_sky/storage/test_huggingface.py -q -n 0 - git diff --check, yapf --diff, isort --diff. - End-to-end on a Kubernetes cluster: * create bucket -> upload dir+files (COPY mode via HfApi.sync_bucket) * pod starts, HF token file propagated to pod at ~/.cache/huggingface * pod lists bucket via huggingface_hub, downloads hello.txt * pod writes pod-wrote.txt back via batch_bucket_files * re-list shows pod-written file, re-download confirms content * clean teardown (sky down + HfApi.delete_bucket) * MOUNT mode via hf-mount --fuse on Ubuntu 22.04
- Broaden Python support: relax huggingface_hub>=1.10 (Python 3.10+) to >=1.5 (>=1.5,<1.9 on Python 3.9). 1.5 is the first release with the full Buckets API and still supports Python 3.9; 1.9+ drops 3.9. This fixes the CI unit-test job (runs on 3.9) that was failing with ModuleNotFoundError. - cloud_stores.HFCloudStorage.is_directory: use list_repo_tree instead of repo_info().siblings for repo-path checks. repo_info fetches metadata for every file in the repo; list_repo_tree(path_in_repo=p) only fetches the specified path (and raises EntryNotFoundError for file paths, which is how we now distinguish file vs directory). - HuggingFaceStore.mount_cached_command: propagate MountCachedConfig read_only to hf-mount via --read-only instead of silently dropping the config. Other (rclone-specific) fields still have no hf-mount equivalent; documented in the docstring. - Doc nits from review: shorten HF mention in intro, drop the full hf:// URL enumeration from the example (users can refer to the HF CLI docs), single hf:// example in the Storage YAML reference. - Add unit tests covering the list_repo_tree optimization and mount_cached_command propagation. Tested: - Full HF test suite passes on Python 3.9 (278) and 3.11 (310). - bash format.sh clean (black, yapf, isort, mypy, pylint 10.00/10).
v0.3.1 cannot mount in unprivileged k8s pods (the default SkyPilot Kubernetes execution context). Two bugs blocked it: 1. fuser's direct /dev/fuse open path didn't fall back to fusermount when the device wasn't openable, so the mount failed before the setuid fusermount-shim could ever be invoked. Fixed in hf-mount v0.6.3 (huggingface/hf-mount#173, picking up huggingface/fuser#1). 2. After the initial mount succeeded via fusermount-shim, fuser's session loop opened /dev/fuse N-1 more times to clone per-thread fds. Those direct opens hit EPERM in the unprivileged pod and the daemon crashed with 'FUSE session error: Operation not permitted (os error 1)' immediately after the FUSE_INIT handshake, tearing the mount down (visible to the pod as 'mountpoint /data' returning 'not a mountpoint'). Fixed in hf-mount v0.6.4 (huggingface/hf-mount#174): probe /dev/fuse openability once at startup and disable clone_fd if denied. v0.6.5 is a maintenance release on top of v0.6.4. Verified end-to-end on a private sandbox Kubernetes cluster with the fusermount-shim/fusermount-server setup from this PR: mount, list, and read of a dataset repo all succeed. Also corrects the comment's glibc minimum (the actual binaries require 2.34, not 2.32; verified with 'strings ... | grep GLIBC_' against the published release).
Devin findings:
- Fix attribute name bug in HuggingFaceStore._classify/_validate: the sub-path
was being assigned to `self.bucket_sub_path` (no leading underscore), but
every consumer reads `self._bucket_sub_path`. The assignment silently
created a dead attribute. Masked in the common flow (Storage.add_store
injects _bucket_sub_path at the parent level), but corrupted state if
HuggingFaceStore is constructed directly with a sub-path URL.
- Update docs/source/reference/storage.rst FAQ to reference hf-mount v0.6.5
and glibc >= 2.34 (was still pointing at v0.3.1 / glibc 2.32 after the
version bump).
Gemini findings:
- Harden HF token cache directory permissions: os.makedirs(mode=0o700)
ignores the mode flag for pre-existing dirs, and the prior code swallowed
any chmod failure. Now we raise StorageSpecError on chmod failure and
verify the final mode before writing the token; we refuse to write if the
dir is group- or world-accessible.
- Remove `eval echo` from the hf-mount token-file path expansion. Expand a
leading `~/` to `$HOME/` in Python instead so the generated shell command
never invokes eval.
- Improve StorageSpecError wording when an HF repo source disagrees with the
storage name ('Hugging Face' instead of the cryptic 'HF').
- Parallelize multi-source bucket uploads with a small ThreadPoolExecutor
(max 4 workers). Each sync_bucket call is already internally parallel, so
the pool just overlaps I/O between distinct source dirs. Also classifies
all sources up front so missing paths fail loud before any upload starts.
Devin finding: HuggingFaceStore._get_bucket() wrapped both bucket_info() and _validate_existing_bucket() in a single try/except Exception block, so a StorageSpecError raised by _validate_existing_bucket (e.g. when a user tries to mount an externally-created bucket without ``source:``) was caught by the broad handler and re-raised as a misleading StorageBucketGetError. Move the validation out of the try block so the spec error surfaces its actionable message.
Devin finding: ``snapshot_download`` preserves repo-relative paths under ``local_dir``, so passing ``allow_patterns=['<sub_path>/*']`` places files at ``<destination>/<sub_path>/<files>`` rather than ``<destination>/<files>``. For a source like ``hf://openai-community/gpt2/onnx`` mounted at ``/data/onnx``, files ended up at ``/data/onnx/onnx/model.onnx`` instead of ``/data/onnx/model.onnx``. Both COPY-mode paths are affected: - ``HFCloudStorage.make_sync_dir_command`` (remote heredoc). - ``HuggingFaceStore.download_remote_dir`` (in-process). Fix: when a sub-path is present, stage ``snapshot_download`` into a temporary directory and move only the sub-path contents up to the real destination. The no-sub-path path is unchanged. Added an assertion to ``test_repo_dir_download_command`` plus a new ``test_repo_dir_download_command_no_sub_path`` to lock in the behavior.
…path_prefix Devin finding: when ``storage.source`` is an HF URL with a sub-path, the sub-path was relied upon entirely from the source URL, while every other store type also runs the constructed blob path through ``get_bucket_sub_path_prefix``. This left a gap when ``_bucket_sub_path`` is set independently of the source URL (e.g. constructed via the internal API or restored from a handle that no longer carries the sub- path in ``source``). Naively always calling ``get_bucket_sub_path_prefix`` would double the sub-path in the common case (``hf://buckets/ns/bucket/sub`` → ``hf://buckets/ns/bucket/sub/sub``). Instead, parse the source URL, drop any embedded sub-path to recover a canonical base URL, then run that through ``get_bucket_sub_path_prefix`` — symmetric with the other ``StoreType`` branches.
…porting Devin / AGENTS.md: ``huggingface_hub`` is already exposed at module level via ``common.LazyImport``, so the in-function ``from huggingface_hub import HfApi`` was redundant. Drop the local import and call ``huggingface_hub.HfApi()`` directly — the LazyImport defers loading until first attribute access, so ``sky`` import cost is unchanged.
…~ expansion) Addresses the review feedback from PR skypilot-org#9418 (referenced on skypilot-org#9698): - installation.rst: add a Hugging Face storage backend section with the _huggingface-installation anchor (pip extras, hf auth login / HF_TOKEN, sky check huggingface). - check.py: include HuggingFace in the storage-only clouds comment. - adaptors/huggingface.py: add a "For more info:" docs link to the token-not-set hint, like the other storage-only clouds. On whoami() failure, stop unconditionally blaming the token: surface the underlying error on a separate "Details:" line and make the refresh advice conditional (a missing httpx[socks], network errors, etc. are not token problems). - cloud_stores.py: wrap every COPY-mode download destination with os.path.expanduser. We invoke huggingface_hub directly (no shell), so a wrapped "~/.sky/file_mounts/..." destination was treated as a literal "~" directory and files never reached the mount point. - Add unit tests for the new hint behavior and the ~ expansion.
The doc-build job treats Sphinx warnings as errors. The new "Hugging Face"
section heading tripped the sentence-case linter ("Unexpected capitals:
Face"). Add it to MULTI_WORD_TERMS, like "Prime Intellect".
| code = '\n'.join([ | ||
| 'import os', | ||
| 'import shutil', | ||
| 'import tempfile', | ||
| 'from huggingface_hub import hf_hub_download', | ||
| self._TOKEN_HELPER, | ||
| f'dest = os.path.expanduser({destination!r})', | ||
| 'tmp_dir = tempfile.mkdtemp()', | ||
| (f'downloaded = hf_hub_download(repo_id={repo_id!r}, ' | ||
| f'repo_type={repo_type!r}, revision={revision!r}, ' | ||
| f'filename={path!r}, local_dir=tmp_dir, token=token)'), | ||
| 'os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)', | ||
| 'shutil.copy2(downloaded, dest)', | ||
| ]) |
There was a problem hiding this comment.
🟡 Temp directory leaked in make_sync_file_command for HF repo file downloads
In HFCloudStorage.make_sync_file_command for repo (non-bucket) sources, a temporary directory is created via tempfile.mkdtemp() but is never cleaned up after the file is copied. The generated Python code downloads the file into tmp_dir, copies it to dest with shutil.copy2, and then exits — leaving tmp_dir and its contents on disk.
The comparable make_sync_dir_command at sky/cloud_stores.py:859-872 correctly wraps the download+move in a try/finally block that calls shutil.rmtree(tmp_dir, ignore_errors=True). This file-sync variant is missing the same cleanup, so every repo file download accumulates a leaked temp directory on the remote VM.
| code = '\n'.join([ | |
| 'import os', | |
| 'import shutil', | |
| 'import tempfile', | |
| 'from huggingface_hub import hf_hub_download', | |
| self._TOKEN_HELPER, | |
| f'dest = os.path.expanduser({destination!r})', | |
| 'tmp_dir = tempfile.mkdtemp()', | |
| (f'downloaded = hf_hub_download(repo_id={repo_id!r}, ' | |
| f'repo_type={repo_type!r}, revision={revision!r}, ' | |
| f'filename={path!r}, local_dir=tmp_dir, token=token)'), | |
| 'os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)', | |
| 'shutil.copy2(downloaded, dest)', | |
| ]) | |
| code = '\n'.join([ | |
| 'import os', | |
| 'import shutil', | |
| 'import tempfile', | |
| 'from huggingface_hub import hf_hub_download', | |
| self._TOKEN_HELPER, | |
| f'dest = os.path.expanduser({destination!r})', | |
| 'tmp_dir = tempfile.mkdtemp()', | |
| 'try:', | |
| (f' downloaded = hf_hub_download(repo_id={repo_id!r}, ' | |
| f'repo_type={repo_type!r}, revision={revision!r}, ' | |
| f'filename={path!r}, local_dir=tmp_dir, token=token)'), | |
| ' os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)', | |
| ' shutil.copy2(downloaded, dest)', | |
| 'finally:', | |
| ' shutil.rmtree(tmp_dir, ignore_errors=True)', | |
| ]) |
Was this helpful? React with 👍 or 👎 to provide feedback.
This is a continuation of #9418 by @nikhiljha, opened directly against
masterto ease the review process. The original commits are preserved in history.Summary
Adds a new
hfstorage type that wraps Hugging Face Buckets (Xet-backed S3-like object storage, read-write) and, via the same handle, also supports mounting HF model/dataset/space repos read-only.All bucket lifecycle operations (create/upload/download/delete) go through
huggingface_hub(>=1.10).MOUNT/MOUNT_CACHEDmodes install and invokehf-mountwith its FUSE backend, which aligns with SkyPilot's existing FUSE-based mount infrastructure (gcsfuse/blobfuse2/rclone mount/goofys).URL formats
Other notes
HF_TOKENfrom env or the HF CLI token file and propagates it to clusters via the existing credential-file-mounts pipeline.sourceraisesStorageUploadErrorinstead of silently no-op'ing.hf-mountv0.6.5 Linux binaries are linked against glibc >= 2.34, soMOUNT-mode tasks on the default SkyPilot Kubernetes image (Ubuntu 20.04, glibc 2.31) need an explicit newer base image viaresources.image_id(e.g.docker:mirror.gcr.io/ubuntu:22.04).COPYmode works on any image. Seedocs/source/reference/storage.rst.hf-mount bump (v0.3.1 → v0.6.5)
The originally referenced
hf-mountv0.3.1 could not mount in unprivileged Kubernetes pods (the default SkyPilot k8s execution context). Two bugs blocked it, both now fixed upstream:fuser hard-failed before reaching the fusermount fallback: fuser's direct
/dev/fuseopen path didn't fall back tofusermountwhen the device wasn't openable, so the mount failed before the setuidfusermount-shimcould ever be invoked. Fixed in hf-mount#173 (picking up fuser#1). Released as v0.6.3.clone_fdopened/dev/fuseN-1 more times after the initial mount: after the initial mount succeeded via the proxy, fuser's session loop opened/dev/fusean additionaln_threads - 1times to clone per-thread fds. Those direct opens hit EPERM in the unprivileged pod and the daemon crashed withFUSE session error: Operation not permitted (os error 1)right after the FUSE_INIT handshake. Fixed in hf-mount#174: probe/dev/fuseopenability once at startup and disableclone_fdif denied. Released as v0.6.4. v0.6.5 is a maintenance release on top.Also corrects the inline comment's glibc minimum: the binaries actually require glibc 2.34 (verified with
strings hf-mount-fuse-x86_64-linux | grep GLIBC_against the published release), not 2.32 as previously documented.