huggingface: add bucket scanning #5017
		return nil, fmt.Errorf("failed to download bucket file %s: status %d", path, resp.StatusCode)
	}
	return resp.Body, nil
}
HTTP client timeout breaks large bucket file downloads
High Severity
DownloadBucketFile returns resp.Body for the caller to read later, but the shared HFClient.HTTPClient has a 10-second Timeout that covers the entire request lifecycle including reading the response body. Go's docs explicitly state the timer "remains running after Do returns and will interrupt reading of the Response.Body." Since maxBucketFileSize allows files up to 250MB, virtually any non-trivial file download will time out during body reading by handlers.HandleFile. The API methods get and getPage aren't affected because they fully consume the body within the call.
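One possible way to address this, sketched under assumptions: use a client without `Client.Timeout` for downloads and bound each download with its own context deadline, released when the caller closes the body. `openStream`, `cancelOnClose`, and `demo` are hypothetical names, not code from this PR, and the 100ms/300ms/5s values exist only to make the effect visible against a local test server.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// cancelOnClose releases the per-download context when the caller closes the body.
type cancelOnClose struct {
	io.ReadCloser
	cancel context.CancelFunc
}

func (c cancelOnClose) Close() error {
	err := c.ReadCloser.Close()
	c.cancel()
	return err
}

// openStream (hypothetical helper) returns a body that the caller reads later.
// The deadline is chosen per call, so it can be sized for large files, and the
// client passed in is assumed to have no Client.Timeout set.
func openStream(ctx context.Context, client *http.Client, url string, limit time.Duration) (io.ReadCloser, error) {
	ctx, cancel := context.WithTimeout(ctx, limit)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		cancel()
		return nil, err
	}
	resp, err := client.Do(req)
	if err != nil {
		cancel()
		return nil, err
	}
	if resp.StatusCode != http.StatusOK {
		resp.Body.Close()
		cancel()
		return nil, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return cancelOnClose{ReadCloser: resp.Body, cancel: cancel}, nil
}

// demo serves a body that takes about 300ms to finish and reads it two ways.
func demo() (errClientTimeout, errContextDeadline error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("first half, "))
		if f, ok := w.(http.Flusher); ok {
			f.Flush() // headers and first chunk reach the client immediately
		}
		time.Sleep(300 * time.Millisecond)
		w.Write([]byte("second half"))
	}))
	defer srv.Close()

	// (a) Client.Timeout of 100ms: Do returns, then the body read is interrupted.
	short := &http.Client{Timeout: 100 * time.Millisecond}
	resp, err := short.Get(srv.URL)
	if err == nil {
		_, err = io.ReadAll(resp.Body)
		resp.Body.Close()
	}
	errClientTimeout = err

	// (b) No Client.Timeout; a generous per-download deadline instead.
	body, err := openStream(context.Background(), &http.Client{}, srv.URL, 5*time.Second)
	if err == nil {
		_, err = io.ReadAll(body)
		body.Close()
	}
	errContextDeadline = err
	return
}

func main() {
	a, b := demo()
	fmt.Println("read error with Client.Timeout:", a)
	fmt.Println("read error with context deadline:", b)
}
```

The context deadline still covers the body read, so it would need to be sized for the 250MB cap; the difference is that it no longer shares the 10-second limit used for the API calls.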
Reviewed by Cursor Bugbot for commit 0ead96c. Configure here.
Data: &source_metadatapb.MetaData_Huggingface{
	Huggingface: &source_metadatapb.Huggingface{
		File: sanitizer.UTF8(file.Path),
		Link: sanitizer.UTF8(fmt.Sprintf("%s/%s/%s/resolve/%s", s.conn.Endpoint, BucketsRoute, bucketID, file.Path)),
Metadata link uses unescaped path unlike download URL
Low Severity
The metadata Link in scanBucketFile constructs the resolve URL with file.Path unescaped (preserving literal slashes), while DownloadBucketFile uses url.PathEscape(path) which encodes slashes as %2F. The comment in DownloadBucketFile states the resolve endpoint expects paths "fully URL-encoded as a single segment (including slashes)," so the metadata link will be incorrect for any file in a subdirectory.
Reviewed by Cursor Bugbot for commit 0ead96c. Configure here.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 3 total unresolved issues (including 2 from previous reviews).
Reviewed by Cursor Bugbot for commit 41edf0a. Configure here.
func (c *HFClient) DownloadBucketFile(ctx context.Context, bucketID string, path string) (io.ReadCloser, error) {
	// The resolve endpoint expects the file path fully URL-encoded as a
	// single segment (including slashes).
	downloadURL := fmt.Sprintf("%s/%s/%s/resolve/%s", c.BaseURL, BucketsRoute, bucketID, url.PathEscape(path))
Bucket file path incorrectly encoded with PathEscape
Medium Severity
url.PathEscape(path) encodes forward slashes as %2F, so a file at subdir/file.txt produces a URL like .../resolve/subdir%2Ffile.txt instead of .../resolve/subdir/file.txt. HuggingFace resolve endpoints typically expect literal path separators. This would cause 404 errors when downloading any file in a subdirectory of a bucket.
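To make the encoding difference concrete, here is a small sketch comparing `url.PathEscape` on the whole path with escaping each segment separately. `escapeSegments` is a hypothetical helper, not part of this PR, and which form the bucket resolve endpoint actually accepts is not settled in this thread, so that would need checking against the API.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// escapeSegments (hypothetical helper) percent-encodes each path segment
// on its own, so the "/" separators between segments stay literal.
func escapeSegments(p string) string {
	parts := strings.Split(p, "/")
	for i, s := range parts {
		parts[i] = url.PathEscape(s)
	}
	return strings.Join(parts, "/")
}

func main() {
	p := "subdir/my file.txt"
	fmt.Println(url.PathEscape(p)) // subdir%2Fmy%20file.txt
	fmt.Println(escapeSegments(p)) // subdir/my%20file.txt
}
```

Whichever form turns out to be correct, building the metadata Link and the download URL through the same helper would keep the two from drifting apart.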
Reviewed by Cursor Bugbot for commit 41edf0a. Configure here.


HF recently shipped storage buckets (Xet-backed object storage)
--bucket <namespace/name> flag, plus --include-buckets / --ignore-buckets / --skip-all-buckets; --org / --user scans pick up buckets automatically
handlers.HandleFile
Tested against a real public bucket (results identical to trufflehog filesystem on the same file), and a planted canary AWS key comes back as a verified finding with correct bucket metadata.
cc @dxa4481
Note
Low Risk
Feature addition on the HuggingFace source with external API calls and file downloads; no changes to auth or core scanning infrastructure beyond new optional scan targets.
Overview
Adds Hugging Face storage bucket support to the huggingface scan source, alongside models, datasets, and spaces.
The CLI and config surface new --bucket, --include-buckets, --ignore-buckets, and --skip-all-buckets flags (wired through HuggingfaceConfig, protobuf, and docs). Org/user enumeration now discovers buckets when not skipped, and validation requires at least one target type including buckets.
Buckets are not git repos: a dedicated path lists files via the HF tree API (with Link pagination), downloads objects through the resolve endpoint (250MB cap), and chunks content via handlers.HandleFile with bucket-specific metadata, parallel to the existing S3-style object scan rather than clone-and-scan.
Reviewed by Cursor Bugbot for commit 41edf0a. Bugbot is set up for automated code reviews on this repo. Configure here.
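For readers unfamiliar with Link pagination as mentioned in the overview, a minimal sketch of pulling the rel="next" URL out of an RFC 8288 Link header value follows. `nextLink` is a hypothetical name, the example URLs are made up, and this is not the PR's getPage implementation. It splits on commas, so it assumes target URLs contain no commas.

```go
package main

import (
	"fmt"
	"strings"
)

// nextLink (hypothetical helper) returns the URL tagged rel="next" in a
// Link header value, or "" when there is no next page.
// Simplification: entries are split on ",", so URLs with commas are not handled.
func nextLink(header string) string {
	for _, part := range strings.Split(header, ",") {
		fields := strings.Split(part, ";")
		target := strings.TrimSpace(fields[0])
		if !strings.HasPrefix(target, "<") || !strings.HasSuffix(target, ">") {
			continue // not a well-formed <url> entry
		}
		for _, param := range fields[1:] {
			param = strings.TrimSpace(param)
			if param == `rel="next"` || param == "rel=next" {
				return target[1 : len(target)-1]
			}
		}
	}
	return ""
}

func main() {
	h := `<https://example.com/tree?cursor=a>; rel="prev", <https://example.com/tree?cursor=c>; rel="next"`
	fmt.Println(nextLink(h)) // https://example.com/tree?cursor=c
}
```

A listing loop would keep requesting the returned URL until `nextLink` yields an empty string.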