
[fix] handle parquet schema mismatch in dataset concatenation#146

Merged
kcz358 merged 2 commits into main from fix/safe-concatenate-datasets on Mar 19, 2026

Conversation

@mwxely (Collaborator) commented on Mar 18, 2026

Motivation

When using YAML config to load multiple parquet files from different sources, concatenate_datasets() in DataUtilities.load_yaml() crashes with ArrowInvalid if the files have incompatible inferred Arrow schemas.

This is a common scenario in practice. For example, when one parquet file contains rows with image_url: {"url": "http://..."} (inferred as struct{url: string}) while another file has all-null values in the same column (inferred as null type), Arrow refuses to concatenate the two because their schemas don't match — even though the data is semantically compatible.

Reproduction:

datasets:
  - path: dataset_a.parquet   # has image_url with values → struct{url: string}
    data_type: parquet
  - path: dataset_b.parquet   # has image_url all null → null type
    data_type: parquet
ArrowInvalid: Schema at index 1 was different:
  image_url: null
vs
  image_url: struct<url: string>

Modifications

  • Add a _safe_concatenate_datasets() helper in data_utils.py that wraps concatenate_datasets() in a try/except fallback
  • On success: zero overhead, same code path as before
  • On schema mismatch: fall back to to_list() + from_list() row-wise concatenation, with a warning log so users know the fallback was triggered
  • Applied at both load_yaml() call sites (lines 134 and 239)

When loading multiple parquet files via YAML config, concatenate_datasets()
fails if columns have different inferred Arrow types. Add a safe wrapper
that falls back to row-wise concatenation on schema mismatch.
@mwxely mwxely requested a review from kcz358 March 18, 2026 13:26
Comment thread on src/lmms_engine/utils/data_utils.py (Outdated), lines +42 to +49:
    except Exception as e:
        logger.warning(
            f"Direct concatenation failed due to schema mismatch: {e}. "
            f"Falling back to row-wise concatenation."
        )
        all_rows = []
        for ds in data_list:
            all_rows.extend(ds.to_list())
        return Dataset.from_list(all_rows)
Collaborator
This would be extremely slow and require a lot of memory at large data scales. I recommend preprocessing or casting the datasets before training so they share the same features.

Collaborator Author

Good point — I've replaced the to_list() fallback with cast()-based schema alignment in e28d773. The new approach:

  1. Tries cast() to the first dataset's features (fast, zero-copy for compatible schemas)
  2. If that fails, builds a merged feature set preferring non-null types and casts all datasets to it

This keeps the fix zero-overhead on the happy path and avoids materializing the full dataset into memory on schema mismatch.

Replace the to_list() + from_list() fallback with cast()-based schema
alignment. This avoids materializing the entire dataset into memory,
making it safe for large-scale data.

Strategy:
1. Try cast() to the first dataset's features (fast, zero-copy)
2. If that fails, build a merged feature set preferring non-null types
   and cast all datasets to the merged schema
@kcz358 kcz358 merged commit 73ac00a into main Mar 19, 2026
3 checks passed
@kcz358 kcz358 deleted the fix/safe-concatenate-datasets branch March 19, 2026 06:36