Skip to content

feat(io): add Hugging Face Storage Bucket support#6731

Draft
everettVT wants to merge 4 commits into
mainfrom
everettVT/hf-storage-buckets
Draft

feat(io): add Hugging Face Storage Bucket support#6731
everettVT wants to merge 4 commits into
mainfrom
everettVT/hf-storage-buckets

Conversation

@everettVT
Copy link
Copy Markdown
Contributor

@everettVT everettVT commented Apr 20, 2026

Summary

  • add support for hf://buckets/<owner>/<bucket>/... paths in Daft
  • route HF bucket reads, writes, lists, and deletes through the OpenDAL Hugging Face backend
  • align HF path parsing, URL generation, docs, and tests with upstream HF/OpenDAL behavior, including Xet-backed reads

Notes

Validation

  • cargo fmt --all
  • cargo test -p daft-io test_parse_hf_parts --lib
  • cargo test -p daft-io test_get_hf_ --lib
  • cargo test -p daft-io test_parse_bucket_hf_parts --lib
  • cargo test -p daft-io test_full_get_from_xet_backed_hf_dataset --lib -- --ignored
  • HF_TOKEN=<from .env> HF_BUCKET=Eventual-Inc/datasets cargo test -p daft-io test_hf_bucket --lib -- --ignored

@everettVT everettVT requested a review from a team as a code owner April 20, 2026 02:00
@github-actions github-actions Bot added the feat label Apr 20, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 20, 2026

Greptile Summary

This PR adds Hugging Face Storage Bucket (hf://buckets/<owner>/<bucket>/...) support by routing all ObjectSource operations (get, put, delete, ls, glob, multipart write) through the OpenDAL huggingface backend, caching per-repository OpenDAL operators inside HFSource, and adding a new HFBucket variant to SourceType for routing and native-writer detection.

  • P1 — git dependency on unreleased opendal commit: Cargo.toml pins opendal to a specific git SHA, blocking crates.io publishing and degrading reproducibility.
  • P1 — panic in glob closure: from_opendal_uri(...).expect(...) inside map_ok in glob will panic rather than propagate an error if OpenDAL returns a malformed URI.

Confidence Score: 3/5

Not ready to merge as-is: the git-pinned dependency blocks publishing and the glob panic can crash callers on malformed backend responses.

Two P1 findings should be addressed before merging. All other findings are P2 style/hygiene. The core routing logic and URL rewriting are well-structured and the test coverage is good.

src/daft-io/src/huggingface.rs (glob closure panic) and Cargo.toml (git dependency)

Important Files Changed

Filename Overview
Cargo.toml Pins opendal to an unreleased git commit (a0c1d81) instead of a published version — blocks crates.io publishing and reduces reproducibility.
src/daft-io/src/huggingface.rs Core implementation: adds bucket_source/bucket_source_and_uri caching layer, routes all ObjectSource methods to OpenDALSource for hf://buckets/ paths; has a P1 panic risk in the glob closure.
src/daft-io/src/lib.rs Adds HFBucket SourceType, routes hf://buckets/ prefix to it in parse_url, enables supports_native_writer for HFBucket; logic is correct.
src/common/io-config/src/huggingface.rs Adds to_opendal_config helper and tests; correctly omits token in anonymous mode.
src/daft-io/src/opendal_source.rs Adds 'huggingface' to available_schemes and migrates Operator construction to string-based via_iter; error message improved.
src/daft-io/Cargo.toml Enables services-huggingface opendal feature; straightforward change.
docs/connectors/huggingface.md Documents Storage Bucket read/write and auth patterns; accurate and consistent with the implementation.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["User: hf://buckets/owner/bucket/path"] --> B["parse_url()"]
    B --> C{host_str == 'buckets'?}
    C -- yes --> D["SourceType::HFBucket"]
    C -- no --> E["SourceType::HF"]
    D --> F["HFSource::get_client()"]
    E --> F
    F --> G["ObjectSource method called"]
    G --> H["bucket_source_and_uri(uri)"]
    H --> I{is_storage_bucket?}
    I -- no --> J["Existing HTTP / HF Dataset path"]
    I -- yes --> K["Lookup/create OpenDALSource (cached by repository)"]
    K --> L["hf_config.to_opendal_config"]
    L --> M["OpenDALSource::get_client(scheme='huggingface')"]
    M --> N["opendal_uri() → huggingface://repo/path"]
    N --> O["opendal huggingface backend"]
    O -- ls/glob --> P["from_opendal_uri() → hf://buckets/owner/bucket/path"]
Loading

Comments Outside Diff (2)

  1. Cargo.toml, line 10 (link)

    P1 Git dependency on unreleased opendal commit

    Pinning opendal to a specific git commit (a0c1d81) rather than a published version introduces several risks: the dependency could disappear if the upstream repo is force-pushed or rebased, it blocks publishing any Daft crates to crates.io (Cargo forbids git deps in published crates), and CI reproducibility degrades over time as the Cargo.lock diverges from the tree. The PR notes this is intentional while waiting for a v0.55.x patch release — it would be worth tracking this with a TODO comment and a GitHub issue so it isn't accidentally left in place long-term.

  2. src/daft-io/src/huggingface.rs, line 783-804 (link)

    P2 Integration tests don't clean up on assertion failure

    test_hf_bucket_put_get_roundtrip and test_hf_bucket_multipart_and_ls_roundtrip call client.delete(...) only as the last statement, so any assert_eq! or earlier ? failure leaves orphaned objects in the real HF bucket. Consider using a RAII guard or a defer-style pattern to guarantee cleanup regardless of how the test exits.

Reviews (1): Last reviewed commit: "feat(io): add Hugging Face Storage Bucke..." | Re-trigger Greptile

Comment thread src/daft-io/src/huggingface.rs Outdated
Comment on lines +684 to +688
.map_ok(move |mut file| {
file.filepath = path_parts
.from_opendal_uri(&file.filepath)
.expect("OpenDAL bucket glob returned an invalid URI");
file
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 expect inside async stream closure can panic instead of propagating an error

from_opendal_uri can return an Err if OpenDAL returns a URI that fails url::Url::parse. Calling .expect() inside the map_ok closure means a malformed URI from the underlying backend will panic the task rather than surfacing as a typed Err in the stream. Use map(|result| result.and_then(...)) instead of map_ok to propagate the error properly.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 97027f046f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +427 to +429
let config = self
.hf_config
.to_opendal_config("bucket", &path_parts.repository, None);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Forward deprecated HTTP bearer token to bucket OpenDAL config

HFSource::get_client still supports Hugging Face auth via HTTPConfig.bearer_token (with a deprecation warning), but bucket clients are initialized from self.hf_config only. When users rely on that deprecated token path and leave hf.token unset, dataset HTTP reads can still authenticate while private hf://buckets/... operations are created without a token and fail with unauthorized/permission errors.

Useful? React with 👍 / 👎.

Comment thread src/daft-io/src/huggingface.rs Outdated
Comment on lines +685 to +687
file.filepath = path_parts
.from_opendal_uri(&file.filepath)
.expect("OpenDAL bucket glob returned an invalid URI");
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Return an error instead of panicking on invalid bucket glob URIs

The bucket glob remapping path uses .expect(...) after from_opendal_uri, so any unexpected OpenDAL filepath that cannot be parsed as a URL will panic the process instead of returning an I/O error. This makes globbing fragile for edge-case object keys and turns a recoverable conversion issue into a hard crash.

Useful? React with 👍 / 👎.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 20, 2026

Codecov Report

❌ Patch coverage is 62.93706% with 159 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.00%. Comparing base (d48924c) to head (ff93d7d).
⚠️ Report is 22 commits behind head on main.

Files with missing lines Patch % Lines
src/daft-io/src/huggingface.rs 58.91% 152 Missing ⚠️
src/common/io-config/src/huggingface.rs 87.09% 4 Missing ⚠️
src/daft-io/src/lib.rs 85.00% 3 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #6731      +/-   ##
==========================================
- Coverage   75.02%   75.00%   -0.02%     
==========================================
  Files        1067     1079      +12     
  Lines      147155   149815    +2660     
==========================================
+ Hits       110408   112374    +1966     
- Misses      36747    37441     +694     
Files with missing lines Coverage Δ
src/daft-io/src/opendal_source.rs 81.70% <100.00%> (+3.83%) ⬆️
src/daft-io/src/lib.rs 81.86% <85.00%> (-0.06%) ⬇️
src/common/io-config/src/huggingface.rs 90.90% <87.09%> (-4.93%) ⬇️
src/daft-io/src/huggingface.rs 53.48% <58.91%> (+7.01%) ⬆️

... and 89 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@everettVT
Copy link
Copy Markdown
Contributor Author

Addressed the main review findings in follow-up commits.

  • 81c531a96: propagate deprecated HTTPConfig.bearer_token into HF bucket OpenDAL config when hf.token is unset, and replace the bucket glob expect(...) path with typed error propagation.
  • ff93d7d3a: fix HF revision parsing for encoded slash revisions like @feature%2Fabc/... and @refs%2Fheads%2Fmain/..., and add regression tests for those cases.

Revalidated locally with the parser/unit tests and the ignored live HF bucket round-trip tests.

@everettVT
Copy link
Copy Markdown
Contributor Author

@universalmind303 - some notes on the pyspark impl


Compared to pyspark_huggingface, our implementation is closer to a filesystem/object-store adapter, while Spark’s is a DataFrame datasource/sink.

The main differences:

  • Spark is dataset-first. Its read path goes through datasets.load_dataset_builder(...) and as_streaming_dataset(), with split/config/filter options and Parquet-friendly pushdown. Even for bucket paths, it first validates the bucket with api.bucket_info(...) and then still drives the read through the dataset builder.
  • Daft is path-first for buckets. We parse hf://buckets/<owner>/<bucket>/... as a dedicated HFBucket source in src/daft-io/src/lib.rs, then expose bucket operations through ObjectSource in src/daft-io/src/huggingface.rs. That gives us get, put, get_size, ls, glob, delete, and multipart-writer support for arbitrary bucket paths, not just dataset-shaped reads.
  • Spark’s bucket writer is more native/self-contained. It writes Parquet partition files in memory, uploads temp files with api.batch_bucket_files(add=...), then does a commit phase with copy/delete operations. Our bucket writes are broader at the IO surface, but the HF-specific bucket semantics are currently delegated to OpenDAL in src/daft-io/src/huggingface.rs.
  • Spark’s public API is narrower: .load("buckets/owner/bucket") plus options like data_dir and split, and its sink writes Parquet. Daft uses the canonical HF bucket URI form documented in docs/connectors/huggingface.md, which is a better fit for generic file IO.
  • Spark’s source code explicitly says dataset reads are currently public-only. Daft’s docs and implementation already support token-based auth for private buckets and authenticated HF access in docs/connectors/huggingface.md.

So the short version is:

  • Spark is more native at the HF application layer.
  • Daft is more general at the object-store layer.
  • If we want a future fully native Daft replacement for OpenDAL, Spark’s bucket write flow is a good model for staging/commit semantics, but it is not a drop-in template because it relies heavily on Python huggingface_hub and datasets.

Sources:

@everettVT
Copy link
Copy Markdown
Contributor Author

@lhoestq I was hoping to use OpenDAL to vendor the integration since it comes with a few extra benefits over the dataset conversion, but I think the feature is more complicated than I'm prepared to finish.

@universalmind303 will help review sometime this week, but right now I don't have a timeline.

Would love any feedback or if there is a simpler set of requirements at the ffspec level. Happy to start from a scratch if there's a cleaner approach.

@everettVT everettVT marked this pull request as draft May 12, 2026 23:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant