[fix] handle parquet schema mismatch in dataset concatenation #146
Merged
Conversation
When loading multiple parquet files via YAML config, concatenate_datasets() fails if columns have different inferred Arrow types. Add a safe wrapper that falls back to row-wise concatenation on schema mismatch.
kcz358 reviewed Mar 19, 2026

Comment on lines +42 to +49
```python
except Exception as e:
    logger.warning(
        f"Direct concatenation failed due to schema mismatch: {e}. "
        f"Falling back to row-wise concatenation."
    )
    all_rows = []
    for ds in data_list:
        all_rows.extend(ds.to_list())
    return Dataset.from_list(all_rows)
```
Collaborator

This would be extremely slow and require a large amount of memory at large data scales. Recommend preprocessing or casting the datasets before training to ensure they all share the same features.
Collaborator
Author
Good point. I've replaced the `to_list()` fallback with `cast()`-based schema alignment in e28d773. The new approach:

- Tries `cast()` to the first dataset's features (fast, zero-copy for compatible schemas)
- If that fails, builds a merged feature set preferring non-null types and casts all datasets to it

This keeps the fix zero-overhead on the happy path and avoids materializing the full dataset into memory on schema mismatch.
Replace the `to_list()` + `from_list()` fallback with `cast()`-based schema alignment. This avoids materializing the entire dataset into memory, making it safe for large-scale data.

Strategy:
1. Try `cast()` to the first dataset's features (fast, zero-copy)
2. If that fails, build a merged feature set preferring non-null types and cast all datasets to the merged schema
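A minimal sketch of what the `cast()`-based helper might look like, reusing the `_safe_concatenate_datasets` and `data_list` names from this PR. The merged-feature step and the `Value("null")` comparison are illustrative guesses at the strategy, not the exact code from e28d773, and the sketch assumes all files share the same column names:

```python
import logging

from datasets import Features, Value, concatenate_datasets

logger = logging.getLogger(__name__)


def _safe_concatenate_datasets(data_list):
    """Concatenate datasets, aligning schemas via cast() on mismatch."""
    # Happy path: identical schemas concatenate with zero extra cost.
    try:
        return concatenate_datasets(data_list)
    except Exception as e:
        logger.warning(
            f"Direct concatenation failed due to schema mismatch: {e}. "
            f"Falling back to cast()-based schema alignment."
        )

    # Step 1: cast every dataset to the first dataset's features.
    # cast() is cheap when the schemas are already compatible.
    try:
        target = data_list[0].features
        return concatenate_datasets([ds.cast(target) for ds in data_list])
    except Exception:
        pass

    # Step 2: build a merged feature set, preferring concrete types over
    # columns inferred as the Arrow null type, then cast everything to it.
    merged = dict(data_list[0].features)
    for ds in data_list[1:]:
        for name, feature in ds.features.items():
            if merged.get(name) is None or merged[name] == Value("null"):
                merged[name] = feature
    merged_features = Features(merged)
    return concatenate_datasets([ds.cast(merged_features) for ds in data_list])
```

On the happy path this is identical to calling `concatenate_datasets()` directly; the casts only run after a failure, so well-formed configs pay no extra cost.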
kcz358 approved these changes Mar 19, 2026
Motivation
When using a YAML config to load multiple parquet files from different sources, `concatenate_datasets()` in `DataUtilities.load_yaml()` crashes with `ArrowInvalid` if the files have incompatible inferred Arrow schemas.

This is a common scenario in practice. For example, one parquet file may contain rows with `image_url: {"url": "http://..."}` (inferred as `struct<url: string>`) while another file has all-null values in the same column (inferred as the Arrow `null` type). Arrow refuses to concatenate the two because their schemas don't match, even though the data is semantically compatible.

Reproduction:
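A minimal stand-in sketch of the mismatch, using small in-memory datasets in place of the original parquet files (the file contents and URL are illustrative):

```python
from datasets import Dataset, concatenate_datasets

# One source where image_url is a populated struct ...
ds_a = Dataset.from_list([{"image_url": {"url": "http://example.com/1.png"}}])
# ... and another where the same column is entirely null.
ds_b = Dataset.from_list([{"image_url": None}])

print(ds_a.features)  # image_url inferred as struct<url: string>
print(ds_b.features)  # image_url inferred as the null type

# In the parquet-loading path described above this fails with ArrowInvalid;
# the exact error (or whether newer `datasets` releases auto-align a
# top-level null column) can vary by version.
combined = concatenate_datasets([ds_a, ds_b])
```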
Modifications
- Add a `_safe_concatenate_datasets()` helper in `data_utils.py` that wraps `concatenate_datasets()` with a try/except fallback
- Fall back to `to_list()` + `from_list()` row-wise concatenation, with a warning log so users know the fallback was triggered
- Update the `load_yaml()` call sites (line 134 and line 239); see the sketch below
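At those two call sites the change is presumably a drop-in swap (hypothetical shape, since the surrounding `load_yaml()` code is not shown here):

```python
# Inside DataUtilities.load_yaml(), at both call sites (lines 134 and 239):
# before: dataset = concatenate_datasets(data_list)
dataset = _safe_concatenate_datasets(data_list)  # "dataset" is a hypothetical variable name
```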