Skip to content

[feat] support messages column as JSON string in iterable datasets#147

Merged
kcz358 merged 1 commit intomainfrom
feat/json-string-messages
Mar 19, 2026
Merged

[feat] support messages column as JSON string in iterable datasets#147
kcz358 merged 1 commit intomainfrom
feat/json-string-messages

Conversation

@mwxely
Copy link
Copy Markdown
Collaborator

@mwxely mwxely commented Mar 18, 2026

Motivation

Parquet's Arrow schema inference struggles with deeply nested, heterogeneous structures like OpenAI-format chat messages. A single messages column may contain varying content types (text, image_url, video) with different nested fields across rows, leading to schema conflicts during both file creation and multi-file loading.

A pragmatic and widely-used workaround is to store the messages column as a JSON-encoded string. This is already common practice in the community — many ShareGPT-format datasets on HuggingFace Hub use this approach. However, the current iterable dataset loaders assume messages is always a native list[dict] and crash on string input with:

TypeError: string indices must be integers, not 'str'

Modifications

Add an isinstance(messages, str) check with json.loads() fallback at all entry points where messages is read from data:

  • Qwen3VLIterableDataset.load_from_json() — handles video/image multimodal data
  • VisionSFTIterableDataset.load_from_json() — handles image-only data
  • VisionSFTIterableDataset.load_from_hf() — handles HuggingFace dataset format

The change is backward-compatible: native list[dict] messages pass through the isinstance check with zero overhead.

Add automatic JSON string decoding for the messages column in iterable
dataset loaders. This allows parquet files to store chat messages as
JSON strings instead of nested Arrow structs, avoiding schema inference
issues with deeply nested message formats.
@mwxely mwxely requested a review from kcz358 March 18, 2026 13:26
Copy link
Copy Markdown
Collaborator

@kcz358 kcz358 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's okay to do this but usually we want parquet to be field structure so simply inserting a field like messages as str into parquet might loosing quite a lot info. But this ops is okay here as would not hurt other operations.

@mwxely
Copy link
Copy Markdown
Collaborator Author

mwxely commented Mar 19, 2026

Thanks for the review! Agreed that structured parquet fields are preferred in general. This is mainly a compatibility fallback for datasets that already store messages as JSON strings (common in ShareGPT-format datasets on HF Hub). The isinstance check has zero overhead on properly structured data. LGTM to merge?

@kcz358
Copy link
Copy Markdown
Collaborator

kcz358 commented Mar 19, 2026

Config the ssh or ghp key for the future for verified signature.

@kcz358 kcz358 merged commit 87a1f86 into main Mar 19, 2026
4 of 5 checks passed
@kcz358 kcz358 deleted the feat/json-string-messages branch March 19, 2026 06:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants