Fix MetadataConfigs dropping parquet shards for non-consecutive configs#8299
Open
rodriguescarson wants to merge 1 commit into
Open
Conversation
_from_exported_parquet_files_and_dataset_infos() groups the exported parquet files with itertools.groupby, which only groups *consecutive* equal keys. When a config (or split) appears again after another config in the exported list, groupby yields it as multiple groups; because the result is built as a dict keyed by config name, the later group overwrites the earlier one and its shard URLs are silently lost. Sort the exported files by (config, split) before grouping so all rows for a config/split are consecutive. The sort is stable, so shard order within a split is preserved. Adds a regression test. Fixes huggingface#8269
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #8269
Problem
MetadataConfigs._from_exported_parquet_files_and_dataset_infos()groups the exported parquet files withitertools.groupby(exported_parquet_files, itemgetter("config")).groupbyonly groups consecutive equal keys, but the result is built as a dict keyed by config name:So if the same config appears again after a different config (e.g.
default,other,default),groupbyyieldsdefaultas two separate groups and the second one overwrites the first in the dict — the earlier shard URLs are silently lost. The same applies to the innergroupbyover splits.Fix
Sort the exported files by
(config, split)before grouping, so all rows for a given config/split are consecutive. The sort is stable, so shard order within a split (0000.parquet,0001.parquet, …) is preserved.Tests
Added
test_from_exported_parquet_files_keeps_all_shards_when_configs_non_consecutive— it feeds a non-consecutivedefaultconfig and asserts both shard URLs survive (the test fails onmainand passes with this change). The existingtest_split_order_...test still passes (final ordering is still driven bydataset_infos).pytest tests/test_metadata_util.py→ 9 passed;ruff checkclean.🤖 Generated with Claude Code