feat(io): add Hugging Face Storage Bucket support #6731
Greptile Summary
This PR adds Hugging Face Storage Bucket (
Confidence Score: 3/5
Not ready to merge as-is: the git-pinned dependency blocks publishing, and the glob panic can crash callers on malformed backend responses. Two P1 findings should be addressed before merging; all other findings are P2 style/hygiene. The core routing logic and URL rewriting are well structured, and the test coverage is good.
Files to review closely: src/daft-io/src/huggingface.rs (glob closure panic) and Cargo.toml (git dependency).
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["User: hf://buckets/owner/bucket/path"] --> B["parse_url()"]
B --> C{host_str == 'buckets'?}
C -- yes --> D["SourceType::HFBucket"]
C -- no --> E["SourceType::HF"]
D --> F["HFSource::get_client()"]
E --> F
F --> G["ObjectSource method called"]
G --> H["bucket_source_and_uri(uri)"]
H --> I{is_storage_bucket?}
I -- no --> J["Existing HTTP / HF Dataset path"]
I -- yes --> K["Lookup/create OpenDALSource (cached by repository)"]
K --> L["hf_config.to_opendal_config"]
L --> M["OpenDALSource::get_client(scheme='huggingface')"]
M --> N["opendal_uri() → huggingface://repo/path"]
N --> O["opendal huggingface backend"]
O -- ls/glob --> P["from_opendal_uri() → hf://buckets/owner/bucket/path"]
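The routing and URI rewriting in the flowchart can be sketched roughly as follows. This is a minimal std-only illustration, not the PR's code: `BucketParts`, its fields, `parse_bucket_parts`, and the exact `huggingface://` layout are assumptions inferred from the diagram, and plain string handling stands in for the `url::Url` parsing the review mentions.

```rust
/// Which source a `hf://` URL is routed to (names mirror the flowchart).
#[derive(Debug, PartialEq)]
enum SourceType {
    HF,
    HFBucket,
}

/// Hypothetical stand-in for the parsed parts of a bucket URL.
#[derive(Debug, PartialEq)]
struct BucketParts {
    repository: String, // "owner/bucket"
    path: String,       // object key inside the bucket
}

/// Route on the URL host: `hf://buckets/...` is treated as a storage bucket.
fn source_type(url: &str) -> Option<SourceType> {
    let rest = url.strip_prefix("hf://")?;
    let host = rest.split('/').next()?;
    Some(if host == "buckets" {
        SourceType::HFBucket
    } else {
        SourceType::HF
    })
}

/// Split `hf://buckets/<owner>/<bucket>/<path>` into repository and path.
fn parse_bucket_parts(url: &str) -> Option<BucketParts> {
    let rest = url.strip_prefix("hf://buckets/")?;
    let mut it = rest.splitn(3, '/');
    let owner = it.next().filter(|s| !s.is_empty())?;
    let bucket = it.next().filter(|s| !s.is_empty())?;
    let path = it.next().unwrap_or("");
    Some(BucketParts {
        repository: format!("{}/{}", owner, bucket),
        path: path.to_string(),
    })
}

impl BucketParts {
    /// Rewrite to the URI handed to the OpenDAL backend (layout assumed).
    fn opendal_uri(&self) -> String {
        format!("huggingface://{}/{}", self.repository, self.path)
    }

    /// Map a URI returned by the backend (ls/glob) back to the user-facing form.
    fn from_opendal_uri(&self, uri: &str) -> Result<String, String> {
        let prefix = format!("huggingface://{}/", self.repository);
        uri.strip_prefix(prefix.as_str())
            .map(|p| format!("hf://buckets/{}/{}", self.repository, p))
            .ok_or_else(|| format!("unexpected OpenDAL URI: {}", uri))
    }
}
```

In this sketch `from_opendal_uri` returns a `Result` rather than panicking, so a caller in the ls/glob path can decide how to surface an unexpected backend URI.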
.map_ok(move |mut file| {
    file.filepath = path_parts
        .from_opendal_uri(&file.filepath)
        .expect("OpenDAL bucket glob returned an invalid URI");
    file
expect inside async stream closure can panic instead of propagating an error
from_opendal_uri can return an Err if OpenDAL returns a URI that fails url::Url::parse. Calling .expect() inside the map_ok closure means a malformed URI from the underlying backend will panic the task rather than surfacing as a typed Err in the stream. Use map(|result| result.and_then(...)) instead of map_ok to propagate the error properly.
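The suggested shape, `map(|result| result.and_then(...))`, can be illustrated with a plain iterator of `Result`s. This is a sketch under stated assumptions rather than the PR's fix: it uses std iterators instead of the async stream, and `FileMetadata`, `GlobError`, and the free function `from_opendal_uri` are hypothetical stand-ins for the real types.

```rust
/// Hypothetical, trimmed-down listing entry.
#[derive(Debug, Clone, PartialEq)]
struct FileMetadata {
    filepath: String,
}

/// Hypothetical error type for the listing pipeline.
#[derive(Debug, PartialEq)]
enum GlobError {
    Backend(String),
    InvalidUri(String),
}

/// Stand-in for `path_parts.from_opendal_uri`: fallible URI remapping.
fn from_opendal_uri(uri: &str) -> Result<String, GlobError> {
    uri.strip_prefix("huggingface://")
        .map(|rest| format!("hf://buckets/{}", rest))
        .ok_or_else(|| GlobError::InvalidUri(uri.to_string()))
}

/// Remap every successful entry; a failed remap becomes an `Err` item
/// instead of a panic, and upstream errors pass through untouched.
fn remap(
    results: impl Iterator<Item = Result<FileMetadata, GlobError>>,
) -> impl Iterator<Item = Result<FileMetadata, GlobError>> {
    results.map(|result| {
        result.and_then(|mut file| {
            file.filepath = from_opendal_uri(&file.filepath)?;
            Ok(file)
        })
    })
}
```

The difference from the `map_ok` plus `expect` version is that the closure itself returns a `Result`, so the `?` inside it turns a bad URI into one failed item that the consumer can handle.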
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 97027f046f
let config = self
    .hf_config
    .to_opendal_config("bucket", &path_parts.repository, None);
Forward deprecated HTTP bearer token to bucket OpenDAL config
HFSource::get_client still supports Hugging Face auth via HTTPConfig.bearer_token (with a deprecation warning), but bucket clients are initialized from self.hf_config only. When users rely on that deprecated token path and leave hf.token unset, dataset HTTP reads can still authenticate while private hf://buckets/... operations are created without a token and fail with unauthorized/permission errors.
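One way to close this gap is to resolve a single effective token before building the bucket config. The sketch below only illustrates the fallback order; `HuggingFaceConfig`, `HttpConfig`, and `effective_bucket_token` are hypothetical, trimmed-down names, not Daft's actual types or API.

```rust
/// Hypothetical config holding the preferred `hf.token` setting.
#[derive(Default)]
struct HuggingFaceConfig {
    token: Option<String>,
}

/// Hypothetical config holding the deprecated HTTP bearer token.
#[derive(Default)]
struct HttpConfig {
    bearer_token: Option<String>,
}

/// Prefer the Hugging Face token; fall back to the deprecated HTTP bearer
/// token so dataset reads and bucket operations authenticate the same way.
fn effective_bucket_token(hf: &HuggingFaceConfig, http: &HttpConfig) -> Option<String> {
    hf.token.clone().or_else(|| http.bearer_token.clone())
}
```

The resolved value would then be passed into whatever builds the OpenDAL bucket configuration, so private `hf://buckets/...` paths do not silently run unauthenticated.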
file.filepath = path_parts
    .from_opendal_uri(&file.filepath)
    .expect("OpenDAL bucket glob returned an invalid URI");
Return an error instead of panicking on invalid bucket glob URIs
The bucket glob remapping path uses .expect(...) after from_opendal_uri, so any unexpected OpenDAL filepath that cannot be parsed as a URL will panic the process instead of returning an I/O error. This makes globbing fragile for edge-case object keys and turns a recoverable conversion issue into a hard crash.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #6731 +/- ##
==========================================
- Coverage 75.02% 75.00% -0.02%
==========================================
Files 1067 1079 +12
Lines 147155 149815 +2660
==========================================
+ Hits 110408 112374 +1966
- Misses 36747 37441 +694
Addressed the main review findings in follow-up commits.
Revalidated locally with the parser/unit tests and the ignored live HF bucket round-trip tests.
@universalmind303 - some notes on the pyspark impl Compared to The main differences:
So the short version is:
Sources:
@lhoestq I was hoping to use OpenDAL to vendor the integration, since it comes with a few extra benefits over the dataset conversion, but I think the feature is more complicated than I'm prepared to finish. @universalmind303 will help review sometime this week, but right now I don't have a timeline. I would love any feedback, including whether there is a simpler set of requirements at the fsspec level. Happy to start from scratch if there's a cleaner approach.
Summary
- Support hf://buckets/<owner>/<bucket>/... paths in Daft
Notes
- Pin opendal to a0c1d81237f9a558e8682ec3d24773f865d3ceea because the latest released version (v0.55.0) predates Hugging Face Storage Bucket support
Validation
- cargo fmt --all
- cargo test -p daft-io test_parse_hf_parts --lib
- cargo test -p daft-io test_get_hf_ --lib
- cargo test -p daft-io test_parse_bucket_hf_parts --lib
- cargo test -p daft-io test_full_get_from_xet_backed_hf_dataset --lib -- --ignored
- HF_TOKEN=<from .env> HF_BUCKET=Eventual-Inc/datasets cargo test -p daft-io test_hf_bucket --lib -- --ignored