
[fix] handle parquet schema mismatch in dataset concatenation#146

Merged
kcz358 merged 2 commits into main from fix/safe-concatenate-datasets on Mar 19, 2026

Conversation

@mwxely (Collaborator) commented on Mar 18, 2026

Motivation

When using YAML config to load multiple parquet files from different sources, concatenate_datasets() in DataUtilities.load_yaml() crashes with ArrowInvalid if the files have incompatible inferred Arrow schemas.

This is a common scenario in practice. For example, when one parquet file contains rows with image_url: {"url": "http://..."} (inferred as struct{url: string}) while another file has all-null values in the same column (inferred as null type), Arrow refuses to concatenate the two because their schemas don't match — even though the data is semantically compatible.

Reproduction:

datasets:
  - path: dataset_a.parquet   # has image_url with values → struct{url: string}
    data_type: parquet
  - path: dataset_b.parquet   # has image_url all null → null type
    data_type: parquet
ArrowInvalid: Schema at index 1 was different:
  image_url: null
vs
  image_url: struct<url: string>

Modifications

  • Add a _safe_concatenate_datasets() helper in data_utils.py that wraps concatenate_datasets() in a try/except fallback
  • On success: zero overhead, same code path as before
  • On schema mismatch: fall back to to_list() + from_list() row-wise concatenation, with a warning log so users know the fallback was triggered
  • Applied at both load_yaml() call sites (lines 134 and 239)

When loading multiple parquet files via YAML config, concatenate_datasets()
fails if columns have different inferred Arrow types. Add a safe wrapper
that falls back to row-wise concatenation on schema mismatch.
@mwxely mwxely requested a review from kcz358 March 18, 2026 13:26
Comment thread on src/lmms_engine/utils/data_utils.py (Outdated), lines +42 to +49:
    except Exception as e:
        logger.warning(
            f"Direct concatenation failed due to schema mismatch: {e}. "
            f"Falling back to row-wise concatenation."
        )
        all_rows = []
        for ds in data_list:
            all_rows.extend(ds.to_list())
        return Dataset.from_list(all_rows)
Collaborator
This would be extremely slow and require a lot of memory at large data scales. I recommend preprocessing or casting the datasets before training so they share the same features.

Collaborator Author

Good point — I've replaced the to_list() fallback with cast()-based schema alignment in e28d773. The new approach:

  1. Tries cast() to the first dataset's features (fast, zero-copy for compatible schemas)
  2. If that fails, builds a merged feature set preferring non-null types and casts all datasets to it

This keeps the fix zero-overhead on the happy path and avoids materializing the full dataset into memory on schema mismatch.

Replace the to_list() + from_list() fallback with cast()-based schema
alignment. This avoids materializing the entire dataset into memory,
making it safe for large-scale data.

Strategy:
1. Try cast() to the first dataset's features (fast, zero-copy)
2. If that fails, build a merged feature set preferring non-null types
   and cast all datasets to the merged schema
@kcz358 kcz358 merged commit 73ac00a into main Mar 19, 2026
3 checks passed
@kcz358 kcz358 deleted the fix/safe-concatenate-datasets branch March 19, 2026 06:36