huggingface: add bucket scanning #5017

Open
julien-c wants to merge 5 commits into trufflesecurity:main from julien-c:huggingface-buckets

@julien-c julien-c commented Jun 4, 2026

HF recently shipped storage buckets (Xet-backed object storage)

  • new --bucket <namespace/name> flag, plus --include-buckets / --ignore-buckets / --skip-all-buckets; --org / --user scans pick up buckets automatically
  • buckets aren't git repos, so this is a separate scan path modeled on the S3 source: list via the tree API, download, chunk through handlers.HandleFile

Tested against a real public bucket (results identical to trufflehog filesystem on the same file), and a planted canary AWS key comes back as a verified finding with correct bucket metadata.

cc @dxa4481


Note

Low Risk
Feature addition on the HuggingFace source with external API calls and file downloads; no changes to auth or core scanning infrastructure beyond new optional scan targets.

Overview
Adds Hugging Face storage bucket support to the huggingface scan source, alongside models, datasets, and spaces.

The CLI and config surface new --bucket, --include-buckets, --ignore-buckets, and --skip-all-buckets flags (wired through HuggingfaceConfig, protobuf, and docs). Org/user enumeration now discovers buckets unless they are skipped, and validation still requires at least one scan target, with buckets now counting as a target type.

Buckets are not git repos, so a dedicated path lists files via the HF tree API (following Link pagination), downloads objects through the resolve endpoint (capped at 250MB), and chunks content via handlers.HandleFile with bucket-specific metadata. This mirrors the existing S3-style object scan rather than the clone-and-scan path.
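For readers unfamiliar with Link-based pagination, here is a minimal sketch of how a "next page" URL might be pulled out of a Link response header. It assumes the tree API follows the usual `rel="next"` convention, which this thread does not confirm. `nextPageURL` is a hypothetical helper, not the PR's `getPage`, and the example URLs are placeholders.

```go
package main

import (
	"fmt"
	"strings"
)

// nextPageURL returns the target of the rel="next" entry in a Link header
// value, or "" when there is no further page.
// Simplification: entries are split on commas, so a target URL that itself
// contains a comma would not be handled.
func nextPageURL(linkHeader string) string {
	for _, entry := range strings.Split(linkHeader, ",") {
		parts := strings.Split(entry, ";")
		if len(parts) < 2 {
			continue
		}
		target := strings.TrimSpace(parts[0])
		if !strings.HasPrefix(target, "<") || !strings.HasSuffix(target, ">") {
			continue
		}
		for _, param := range parts[1:] {
			p := strings.TrimSpace(param)
			if p == `rel="next"` || p == "rel=next" {
				return target[1 : len(target)-1]
			}
		}
	}
	return ""
}

func main() {
	// Placeholder header value, not captured from the real API.
	header := `<https://example.invalid/tree?cursor=abc>; rel="next"`
	fmt.Println(nextPageURL(header))
}
```

A listing loop would keep requesting the returned URL until the helper yields an empty string.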

Reviewed by Cursor Bugbot for commit 41edf0a. Bugbot is set up for automated code reviews on this repo.

@julien-c julien-c requested a review from a team June 4, 2026 16:05
@julien-c julien-c requested review from a team as code owners June 4, 2026 16:05
return nil, fmt.Errorf("failed to download bucket file %s: status %d", path, resp.StatusCode)
}
return resp.Body, nil
}

HTTP client timeout breaks large bucket file downloads

High Severity

DownloadBucketFile returns resp.Body for the caller to read later, but the shared HFClient.HTTPClient has a 10-second Timeout that covers the entire request lifecycle, including reading the response body. Go's docs explicitly state that the timer "remains running after Get, Head, Post, or Do return and will interrupt reading of the Response.Body." Since maxBucketFileSize allows files up to 250MB, any download whose body takes longer than 10 seconds to read will be cut off while handlers.HandleFile is still consuming it. The API methods get and getPage aren't affected because they fully consume the body within the call.

Reviewed by Cursor Bugbot for commit 0ead96c.

Data: &source_metadatapb.MetaData_Huggingface{
Huggingface: &source_metadatapb.Huggingface{
File: sanitizer.UTF8(file.Path),
Link: sanitizer.UTF8(fmt.Sprintf("%s/%s/%s/resolve/%s", s.conn.Endpoint, BucketsRoute, bucketID, file.Path)),

Metadata link uses unescaped path unlike download URL

Low Severity

The metadata Link in scanBucketFile constructs the resolve URL with file.Path unescaped (preserving literal slashes), while DownloadBucketFile uses url.PathEscape(path) which encodes slashes as %2F. The comment in DownloadBucketFile states the resolve endpoint expects paths "fully URL-encoded as a single segment (including slashes)," so the metadata link will be incorrect for any file in a subdirectory.

Reviewed by Cursor Bugbot for commit 0ead96c.

Comment thread pkg/sources/huggingface/client.go Outdated

CLAassistant commented Jun 4, 2026

CLA assistant check
All committers have signed the CLA.


julien-c commented Jun 4, 2026

i'll sign the CLA only if this is deemed worthy of merging, if that sounds ok

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Reviewed by Cursor Bugbot for commit 41edf0a.

func (c *HFClient) DownloadBucketFile(ctx context.Context, bucketID string, path string) (io.ReadCloser, error) {
// The resolve endpoint expects the file path fully URL-encoded as a
// single segment (including slashes).
downloadURL := fmt.Sprintf("%s/%s/%s/resolve/%s", c.BaseURL, BucketsRoute, bucketID, url.PathEscape(path))

Bucket file path incorrectly encoded with PathEscape

Medium Severity

url.PathEscape(path) encodes forward slashes as %2F, so a file at subdir/file.txt produces a URL like .../resolve/subdir%2Ffile.txt instead of .../resolve/subdir/file.txt. HuggingFace resolve endpoints typically expect literal path separators. This would cause 404 errors when downloading any file in a subdirectory of a bucket.

Reviewed by Cursor Bugbot for commit 41edf0a.
