fix: pass hub token when embedding external files during push_to_hub #8285

Open
itxsamad1 wants to merge 1 commit into huggingface:main from itxsamad1:fix/push-embed-token-per-repo-id

@itxsamad1

Summary

  • Fixes gated-repo failures when Dataset.push_to_hub(..., embed_external_files=True) embeds hf:// image/audio paths into Parquet shards.
  • _push_parquet_shards_to_hub_single now builds a token_per_repo_id map from shard paths (and destination repo metadata) and passes it to embed_table_storage, matching the existing IterableDataset.push_to_hub behavior.
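One way to picture the map described above is the sketch below. It is not the code from this PR: `repo_id_from_hf_path` and `build_token_per_repo_id` are hypothetical helper names, and the parsing assumes paths shaped like `hf://datasets/<namespace>/<name>[@revision]/<file>`. The idea is only that every source repo referenced by an hf:// path, plus the destination repo, ends up keyed to the token the caller supplied.

```python
from typing import Dict, Iterable, Optional

HF_PREFIX = "hf://"
# Assumed repo-type prefixes that may follow "hf://"; illustrative only.
REPO_TYPE_PREFIXES = ("datasets/", "spaces/", "models/")


def repo_id_from_hf_path(path: str) -> Optional[str]:
    """Return 'namespace/name' for an hf:// path, or None for anything else."""
    if not path.startswith(HF_PREFIX):
        return None
    rest = path[len(HF_PREFIX):]
    for prefix in REPO_TYPE_PREFIXES:
        if rest.startswith(prefix):
            rest = rest[len(prefix):]
            break
    parts = rest.split("/")
    if len(parts) < 2:
        return None
    namespace = parts[0]
    # Drop an optional "@revision" suffix from the repo name.
    name = parts[1].split("@", 1)[0]
    if not namespace or not name:
        return None
    return f"{namespace}/{name}"


def build_token_per_repo_id(
    paths: Iterable[str],
    token: Optional[str],
    destination_repo_id: Optional[str] = None,
) -> Dict[str, str]:
    """Hypothetical helper: map each referenced repo_id to the caller's token."""
    mapping: Dict[str, str] = {}
    if token is None:
        return mapping
    if destination_repo_id:
        mapping[destination_repo_id] = token
    for path in paths:
        repo_id = repo_id_from_hf_path(path)
        if repo_id is not None:
            mapping[repo_id] = token
    return mapping
```

Local file paths are ignored in this sketch, so a dataset that mixes local images with hf:// references would only contribute the remote repos to the map.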

Root cause

IterableDataset.push_to_hub already used partial(embed_table_storage, token_per_repo_id=...), but the map-style Dataset push path called embed_table_storage without token_per_repo_id, so downloads from gated source repos failed during Parquet conversion.
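The difference between the two call styles can be illustrated with a toy stand-in. Everything below is an assumption made for illustration: `fake_embed_table_storage` is not the library function, `GATED_REPOS` is invented, and the "table" is just a list of `(repo_id, filename)` pairs. It only shows how binding `token_per_repo_id` with `functools.partial` changes the outcome for a gated source repo.

```python
from functools import partial
from typing import Dict, List, Optional, Tuple

# Invented set of repos that require a token in this toy model.
GATED_REPOS = {"org/gated-images"}


def fake_embed_table_storage(
    rows: List[Tuple[str, str]],
    token_per_repo_id: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Toy stand-in: 'download' each file, failing on gated repos without a token."""
    tokens = token_per_repo_id or {}
    embedded = []
    for repo_id, filename in rows:
        if repo_id in GATED_REPOS and repo_id not in tokens:
            raise PermissionError(f"no token available for gated repo {repo_id}")
        embedded.append(filename)
    return embedded


rows = [("org/gated-images", "0.png"), ("org/public-images", "1.png")]

# Call without the map: the gated download is rejected.
try:
    fake_embed_table_storage(rows)
    bare_call_failed = False
except PermissionError:
    bare_call_failed = True

# Bind the map once with partial, then reuse the bound function per shard.
embed_with_tokens = partial(
    fake_embed_table_storage,
    token_per_repo_id={"org/gated-images": "hf_xxx"},
)
embedded_files = embed_with_tokens(rows)
```

Binding the map once keeps the per-shard call site unchanged, which is presumably why the iterable path used `partial` in the first place.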

Test plan

  • Added test_get_token_per_repo_id_for_embed unit test
  • Manual: push dataset with hf:// image paths from a gated repo using token=...

Fixes #6348

Map-style Dataset.push_to_hub was calling embed_table_storage without token_per_repo_id, so embedding hf:// image/audio paths from gated source repos failed. Build a repo_id-to-token map from shard paths and pass it through, matching IterableDataset.push_to_hub.

Fixes huggingface#6348

Development

Successfully merging this pull request may close these issues.

Parquet stream-conversion fails to embed images/audio files from gated repos

1 participant