Fix MetadataConfigs dropping parquet shards for non-consecutive configs#8299

Open
rodriguescarson wants to merge 1 commit into
huggingface:mainfrom
rodriguescarson:fix/metadata-configs-non-consecutive-shards

@rodriguescarson

Fixes #8269

Problem

MetadataConfigs._from_exported_parquet_files_and_dataset_infos() groups the exported parquet files with itertools.groupby(exported_parquet_files, itemgetter("config")). groupby only groups consecutive equal keys, but the result is built as a dict keyed by config name:

metadata_configs = {
    config_name: { ... }
    for config_name, parquet_files_for_config in groupby(exported_parquet_files, itemgetter("config"))
}

So if the same config appears again after a different config (e.g. default, other, default), groupby yields default as two separate groups and the second one overwrites the first in the dict — the earlier shard URLs are silently lost. The same applies to the inner groupby over splits.
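
For reviewers who want to see the failure mode in isolation, here is a minimal sketch that uses only the standard library. The rows and URLs are made up for illustration, and the dict comprehension is a simplified stand-in for the real one in MetadataConfigs, not a copy of it.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical exported rows; only the "config", "split" and "url" keys are assumed.
exported_parquet_files = [
    {"config": "default", "split": "train", "url": "default/train/0000.parquet"},
    {"config": "other", "split": "train", "url": "other/train/0000.parquet"},
    {"config": "default", "split": "train", "url": "default/train/0001.parquet"},
]

# groupby starts a new group every time the key changes,
# so "default" is yielded twice for this input.
group_keys = [key for key, _ in groupby(exported_parquet_files, itemgetter("config"))]

# Building a dict from those groups keeps only the last group seen for each key.
urls_by_config = {
    config_name: [row["url"] for row in rows]
    for config_name, rows in groupby(exported_parquet_files, itemgetter("config"))
}

print(group_keys)      # ['default', 'other', 'default']
print(urls_by_config)  # 'default' holds only the 0001 shard; 0000 is gone
```

The second default group replaces the first in the dict, which is how the 0000.parquet URL disappears without any error being raised.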

Fix

Sort the exported files by (config, split) before grouping, so all rows for a given config/split are consecutive. The sort is stable, so shard order within a split (0000.parquet, 0001.parquet, …) is preserved.
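
The idea can be sketched on the same kind of toy input. This is an illustration of sort-then-group under assumed row contents, not the exact diff; the variable names below are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical exported rows with a non-consecutive "default" config.
exported_parquet_files = [
    {"config": "default", "split": "train", "url": "default/train/0000.parquet"},
    {"config": "other", "split": "train", "url": "other/train/0000.parquet"},
    {"config": "default", "split": "train", "url": "default/train/0001.parquet"},
    {"config": "default", "split": "test", "url": "default/test/0000.parquet"},
]

# sorted() is stable: rows with the same (config, split) keep their input order,
# so 0000.parquet still comes before 0001.parquet.
sorted_files = sorted(exported_parquet_files, key=itemgetter("config", "split"))

# With equal keys now adjacent, each config and each split forms exactly one group.
urls_by_config_and_split = {
    config_name: {
        split_name: [row["url"] for row in rows_for_split]
        for split_name, rows_for_split in groupby(rows_for_config, itemgetter("split"))
    }
    for config_name, rows_for_config in groupby(sorted_files, itemgetter("config"))
}

print(urls_by_config_and_split["default"]["train"])
# ['default/train/0000.parquet', 'default/train/0001.parquet']
```

In this sketch the splits come out in alphabetical order (test before train) as a side effect of the sort. As noted under Tests, the final ordering in the real code is still driven by dataset_infos.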

Tests

Added test_from_exported_parquet_files_keeps_all_shards_when_configs_non_consecutive — it feeds a non-consecutive default config and asserts both shard URLs survive (the test fails on main and passes with this change). The existing test_split_order_... test still passes (final ordering is still driven by dataset_infos). pytest tests/test_metadata_util.py → 9 passed; ruff check clean.

🤖 Generated with Claude Code

_from_exported_parquet_files_and_dataset_infos() groups the exported parquet
files with itertools.groupby, which only groups *consecutive* equal keys. When
a config (or split) appears again after another config in the exported list,
groupby yields it as multiple groups; because the result is built as a dict
keyed by config name, the later group overwrites the earlier one and its shard
URLs are silently lost.

Sort the exported files by (config, split) before grouping so all rows for a
config/split are consecutive. The sort is stable, so shard order within a split
is preserved.

Adds a regression test.

Fixes huggingface#8269

Development

Successfully merging this pull request may close these issues.

MetadataConfigs drops parquet shards when exported config rows are non-consecutive
