Skip to content

[feat] support messages column as JSON string in iterable datasets#145

Closed
mwxely wants to merge 1 commit intoEvolvingLMMs-Lab:mainfrom
mwxely:feat/json-string-messages
Closed

[feat] support messages column as JSON string in iterable datasets#145
mwxely wants to merge 1 commit intoEvolvingLMMs-Lab:mainfrom
mwxely:feat/json-string-messages

Conversation

@mwxely
Copy link
Copy Markdown
Collaborator

@mwxely mwxely commented Mar 18, 2026

Motivation

Parquet's Arrow schema inference struggles with deeply nested, heterogeneous structures like OpenAI-format chat messages. A single messages column may contain varying content types (text, image_url, video) with different nested fields across rows, leading to schema conflicts during both file creation and multi-file loading.

A pragmatic and widely-used workaround is to store the messages column as a JSON-encoded string. This is already common practice in the community — many ShareGPT-format datasets on HuggingFace Hub use this approach. However, the current iterable dataset loaders assume messages is always a native list[dict] and crash on string input with:

TypeError: string indices must be integers, not 'str'

Modifications

Add an isinstance(messages, str) check with json.loads() fallback at all entry points where messages is read from data:

  • Qwen3VLIterableDataset.load_from_json() — handles video/image multimodal data
  • VisionSFTIterableDataset.load_from_json() — handles image-only data
  • VisionSFTIterableDataset.load_from_hf() — handles HuggingFace dataset format

The change is backward-compatible: native list[dict] messages pass through the isinstance check with zero overhead.

Add automatic JSON string decoding for the messages column in iterable
dataset loaders. This allows parquet files to store chat messages as
JSON strings instead of nested Arrow structs, avoiding schema inference
issues with deeply nested message formats.
@mwxely mwxely closed this Mar 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant