[feat] support messages column as JSON string in iterable datasets by mwxely · Pull Request #147 · EvolvingLMMs-Lab/lmms-engine

mwxely · 2026-03-18T13:25:00Z

Motivation

Parquet's Arrow schema inference struggles with deeply nested, heterogeneous structures like OpenAI-format chat messages. A single messages column may contain varying content types (text, image_url, video) with different nested fields across rows, leading to schema conflicts during both file creation and multi-file loading.

A pragmatic and widely-used workaround is to store the messages column as a JSON-encoded string. This is already common practice in the community — many ShareGPT-format datasets on HuggingFace Hub use this approach. However, the current iterable dataset loaders assume messages is always a native list[dict] and crash on string input with:

TypeError: string indices must be integers, not 'str'

Modifications

Add an isinstance(messages, str) check with json.loads() fallback at all entry points where messages is read from data:

Qwen3VLIterableDataset.load_from_json() — handles video/image multimodal data
VisionSFTIterableDataset.load_from_json() — handles image-only data
VisionSFTIterableDataset.load_from_hf() — handles HuggingFace dataset format

The change is backward-compatible: native list[dict] messages pass through the isinstance check with zero overhead.

Add automatic JSON string decoding for the messages column in iterable dataset loaders. This allows parquet files to store chat messages as JSON strings instead of nested Arrow structs, avoiding schema inference issues with deeply nested message formats.

kcz358

It's okay to do this but usually we want parquet to be field structure so simply inserting a field like messages as str into parquet might loosing quite a lot info. But this ops is okay here as would not hurt other operations.

mwxely · 2026-03-19T04:04:30Z

Thanks for the review! Agreed that structured parquet fields are preferred in general. This is mainly a compatibility fallback for datasets that already store messages as JSON strings (common in ShareGPT-format datasets on HF Hub). The isinstance check has zero overhead on properly structured data. LGTM to merge?

kcz358 · 2026-03-19T06:20:56Z

Config the ssh or ghp key for the future for verified signature.

mwxely requested a review from kcz358 March 18, 2026 13:26

kcz358 reviewed Mar 19, 2026

View reviewed changes

kcz358 merged commit 87a1f86 into main Mar 19, 2026
4 of 5 checks passed

kcz358 deleted the feat/json-string-messages branch March 19, 2026 06:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[feat] support messages column as JSON string in iterable datasets#147

[feat] support messages column as JSON string in iterable datasets#147
kcz358 merged 1 commit intomainfrom
feat/json-string-messages

mwxely commented Mar 18, 2026

Uh oh!

kcz358 left a comment

Uh oh!

mwxely commented Mar 19, 2026

Uh oh!

kcz358 commented Mar 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mwxely commented Mar 18, 2026

Motivation

Modifications

Uh oh!

kcz358 left a comment

Choose a reason for hiding this comment

Uh oh!

mwxely commented Mar 19, 2026

Uh oh!

kcz358 commented Mar 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants