
[Fix] preserve image data in preprocess for VLM training on multimodal data#532

Open
jamesahou wants to merge 2 commits into sgl-project:main from jamesahou:mm-preprocess-fix

Conversation

@jamesahou

Motivation

When training on multimodal conversations, safe_conversations_generator drops the image field while loading and cleaning the data, so preprocess_vlm_conversations later fails when it tries to access examples["image"]. For example, running this toy script

import json, os
from datasets import Dataset
from specforge.utils import safe_conversations_generator
from specforge.data.preprocessing import preprocess_vlm_conversations
from specforge.data.template import TEMPLATE_REGISTRY

# Write a single multimodal example that includes an "image" field.
test_file = "test_vlm_image.jsonl"
with open(test_file, "w") as f:
    f.write(json.dumps({
        "id": 1,
        "image": "/data/images/test.jpg",
        "conversations": [
            {"role": "user", "content": "Describe this image."},
            {"role": "assistant", "content": "A cat."},
        ],
    }) + "\n")

# Load the data through safe_conversations_generator, which drops the
# "image" field during cleaning.
dataset = Dataset.from_generator(
    generator=safe_conversations_generator,
    gen_kwargs={"file_path": test_file},
)

# Raises KeyError: 'image' before the processor is ever used, so
# processor=None is enough to reproduce the crash.
preprocess_vlm_conversations(
    processor=None,
    examples=dataset[:],
    chat_template=TEMPLATE_REGISTRY.get("qwen2-vl"),
    max_length=2048,
)

os.remove(test_file)

yields the following:

Error: Exit code 1
     /home/ec2-user/qwenvl-eagle/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the
     flash attention backend.
       warnings.warn(
     <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
     <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
     Set TORCH_CUDA_ARCH_LIST to 8.9

     Generating train split: 0 examples [00:00, ? examples/s]
     Generating train split: 1 examples [00:00, 587.85 examples/s]
     Traceback (most recent call last):
       File "/home/ec2-user/qwenvl-eagle/test.py", line 23, in <module>
         preprocess_vlm_conversations(
       File "/home/ec2-user/qwenvl-eagle/SpecForge/specforge/data/preprocessing.py", line 217, in preprocess_vlm_conversations
         for i, image in enumerate(examples["image"]):
                                   ~~~~~~~~^^^^^^^^^
     KeyError: 'image'

Modifications

  • Added image field preservation to safe_conversations_generator
  • Added a test (TestBuildEagle3Dataset.test_vlm_image_field_preserved) to prevent future regressions. If the image field were dropped silently instead of crashing, multimodal training quality would likely degrade.

Related Issues

Accuracy Test

Benchmark & Profiling

Checklist

@jamesahou jamesahou requested a review from FrankLeeeee as a code owner April 13, 2026 20:52
@jamesahou jamesahou changed the title [Bug] fix to preserve image data in preprocess for VLM training on multimodal data [Fix] preserve image data in preprocess for VLM training on multimodal data Apr 13, 2026
