
[Fix] preserve image data in preprocess for VLM training on multimodal data#532

Open
jamesahou wants to merge 2 commits into sgl-project:main from jamesahou:mm-preprocess-fix

Conversation

@jamesahou

Motivation

When training on multimodal conversations, safe_conversations_generator drops the image field while loading and cleaning the data, so preprocess_vlm_conversations later fails when it tries to access examples["image"]. For example, running this toy script

import json, os
from datasets import Dataset
from specforge.utils import safe_conversations_generator
from specforge.data.preprocessing import preprocess_vlm_conversations
from specforge.data.template import TEMPLATE_REGISTRY

# Write a single multimodal example that includes an "image" field.
test_file = "test_vlm_image.jsonl"
with open(test_file, "w") as f:
    f.write(json.dumps({
        "id": 1,
        "image": "/data/images/test.jpg",
        "conversations": [
            {"role": "user", "content": "Describe this image."},
            {"role": "assistant", "content": "A cat."},
        ],
    }) + "\n")

# Load the data through safe_conversations_generator, which drops the
# "image" field during cleaning.
dataset = Dataset.from_generator(
    generator=safe_conversations_generator,
    gen_kwargs={"file_path": test_file},
)

# Raises KeyError: 'image' before the processor is ever used, so
# processor=None is enough to reproduce the crash.
preprocess_vlm_conversations(
    processor=None,
    examples=dataset[:],
    chat_template=TEMPLATE_REGISTRY.get("qwen2-vl"),
    max_length=2048,
)

os.remove(test_file)

yields the following:

Error: Exit code 1
     /home/ec2-user/qwenvl-eagle/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the
     flash attention backend.
       warnings.warn(
     <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
     <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
     Set TORCH_CUDA_ARCH_LIST to 8.9

     Generating train split: 0 examples [00:00, ? examples/s]
     Generating train split: 1 examples [00:00, 587.85 examples/s]
     Traceback (most recent call last):
       File "/home/ec2-user/qwenvl-eagle/test.py", line 23, in <module>
         preprocess_vlm_conversations(
       File "/home/ec2-user/qwenvl-eagle/SpecForge/specforge/data/preprocessing.py", line 217, in preprocess_vlm_conversations
         for i, image in enumerate(examples["image"]):
                                   ~~~~~~~~^^^^^^^^^
     KeyError: 'image'

Modifications

  • Added image field preservation to safe_conversations_generator
  • Added a test (TestBuildEagle3Dataset.test_vlm_image_field_preserved) to prevent future regressions. If the image field were dropped silently instead of crashing, multimodal training quality would likely degrade.

Related Issues

Accuracy Test

Benchmark & Profiling

Checklist

@jamesahou jamesahou requested a review from FrankLeeeee as a code owner April 13, 2026 20:52
@jamesahou jamesahou changed the title [Bug] fix to preserve image data in preprocess for VLM training on multimodal data [Fix] preserve image data in preprocess for VLM training on multimodal data Apr 13, 2026
