This directory contains ready-to-run offline inference examples for MOSS-VL.
The script supports full-modality offline inference through `model.offline_generate(...)`; a minimal call sketch follows the list below. Supported inputs include:
- pure text
- single image
- multiple images
- single video
- multiple videos
- interleaved image-video inputs in the `messages` format
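A minimal call sketch under stated assumptions: the checkpoint is assumed to load through transformers' Auto classes with `trust_remote_code=True`, and `offline_generate` is assumed to accept a `messages` list plus generation kwargs directly; only the method name comes from this README.

```python
# Hedged sketch of a single offline_generate call; the loading details are
# assumptions, not this repo's documented API.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "/path/to/dummy-checkpoint",
    torch_dtype=torch.bfloat16,  # may be `dtype=` on newer transformers
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "assets/images/bill.png"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

print(model.offline_generate(messages, max_new_tokens=256, do_sample=False))
```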
`run_inference.py` is compatible with both releases of the MOSS-VL checkpoint:
- `transformers==4.57.1`: uses `MossVLImageProcessorFast` and exposes a `vision_chunked_length` knob inside the vision tower.
- `transformers==5.5.4`: uses the slow PIL-based `MossVLImageProcessor` and processes the entire vision input in a single forward pass.
The `offline_generate` / `offline_batch_generate` / `offline_image_generate` / `offline_video_generate` API is identical on both versions, so the script requires no changes when switching checkpoints. Empirically, the two checkpoints produce token-identical outputs under `do_sample=false` on the example queries in this directory.
`vision_chunked_length` is accepted but has no effect on MOSS-VL-Instruct. The new modeling file runs the visual tower over all media in one pass, so the value is silently ignored when using `transformers==5.5.4`. You can keep it in `generate_kwargs` for backward compatibility; it will still shard prefill on the legacy `transformers==4.57.1` checkpoint.
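Because the knob is a silent no-op on the newer stack, one `generate_kwargs` dict can target both checkpoints. A sketch, reusing `model` and `messages` from the example above (the kwargs-forwarding shown is an assumption):

```python
# One kwargs dict for both checkpoints: vision_chunked_length shards the
# vision-tower prefill on transformers==4.57.1 and is silently ignored
# on transformers==5.5.4.
generate_kwargs = {
    "max_new_tokens": 256,
    "do_sample": False,
    "vision_chunked_length": 64,  # no-op on the 5.5.4 checkpoint
}
output = model.offline_generate(messages, **generate_kwargs)
```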
Image examples:

```bash
python inference/run_inference.py \
    --checkpoint /path/to/dummy-checkpoint \
    --mode offline \
    --input inference/image_queries.json
```

Video example:

```bash
python inference/run_inference.py \
    --checkpoint /path/to/dummy-checkpoint \
    --mode offline \
    --input inference/video_queries.json
```

Text-only example:

```bash
python inference/run_inference.py \
    --checkpoint /path/to/dummy-checkpoint \
    --mode offline \
    --input inference/batch_queries.json
```

SFT / validation-set example in training JSONL format:

```bash
python inference/run_inference.py \
    --checkpoint /path/to/dummy-checkpoint \
    --mode offline \
    --input /path/to/valid.jsonl
```

If `--output` is omitted, the script writes results to `<input_stem>_results.json`.
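The default naming rule is easy to reproduce; a minimal sketch with a hypothetical helper (not the script's actual code):

```python
from pathlib import Path

def default_output_path(input_path: str) -> Path:
    """Mirror the documented default: <input_stem>_results.json."""
    p = Path(input_path)
    return p.with_name(f"{p.stem}_results.json")

assert default_output_path("inference/image_queries.json").name == "image_queries_results.json"
assert default_output_path("/path/to/valid.jsonl").name == "valid_results.json"
```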
The input file can be either:
- a JSON list, where each item is one query
- a JSONL file, where each line is one sample
Each query can use either of the following formats:
- `messages`
- `prompt` with optional `images` and `videos`
Optional fields such as `media_kwargs`, `generate_kwargs`, and `system_prompt` are also supported; see the loader sketch below.
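A hedged loader sketch covering both container formats (JSON list vs. JSONL) and both query formats; function names are illustrative, not the script's internals:

```python
import json
from pathlib import Path

def load_queries(path: str) -> list[dict]:
    """Accept a JSON list (one query per item) or JSONL (one sample per line)."""
    text = Path(path).read_text(encoding="utf-8")
    if path.endswith(".jsonl"):
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return json.loads(text)

def to_messages(query: dict) -> list[dict]:
    """Normalize the prompt format into the messages format."""
    if "messages" in query:
        return query["messages"]
    content = [{"type": "image", "image": p} for p in query.get("images", [])]
    content += [{"type": "video", "video": p} for p in query.get("videos", [])]
    content.append({"type": "text", "text": query["prompt"]})
    return [{"role": "user", "content": content}]
```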
For JSONL inputs, the script also accepts the standard SFT training formats documented in `mossvl_finetune/README.md`:
- `messages` or `conversations` with top-level `images`/`videos`
- `prompt`/`response` with optional `images`/`videos`
When a training sample includes assistant targets, the loader automatically trims trailing assistant turns at inference time and keeps the remaining context up to the last user turn. For conversation-style samples that use `<|image|>` or `<|video|>` placeholders in text, the loader also reconstructs structured multimodal `messages` content before calling `offline_generate`.
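The trimming rule is simple to state in code; a sketch with a hypothetical helper, not the loader's actual implementation:

```python
def trim_for_inference(messages: list[dict]) -> list[dict]:
    """Drop trailing assistant turns so the context ends at the last user turn."""
    trimmed = list(messages)
    while trimmed and trimmed[-1].get("role") == "assistant":
        trimmed.pop()
    return trimmed
```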
The provided examples use the `messages` format. Example:
```json
[
  {
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "image", "image": "assets/images/bill.png" },
          { "type": "text", "text": "Describe this image." }
        ]
      }
    ],
    "media_kwargs": {},
    "generate_kwargs": {
      "max_new_tokens": 256,
      "do_sample": false,
      "vision_chunked_length": 64
    }
  }
]
```

The `prompt` format is also supported. Example:
```json
[
  {
    "prompt": "Describe this image.",
    "images": ["assets/images/bill.png"],
    "videos": [],
    "media_kwargs": {},
    "generate_kwargs": {
      "max_new_tokens": 256,
      "do_sample": false,
      "vision_chunked_length": 64
    }
  }
]
```

Relative media paths are resolved relative to the JSON file location.
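A sketch of that resolution rule (hypothetical helper, not the script's code):

```python
from pathlib import Path

def resolve_media_path(input_file: str, media_path: str) -> Path:
    """Interpret relative media paths relative to the input JSON's directory."""
    p = Path(media_path)
    return p if p.is_absolute() else Path(input_file).parent / p
```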
- `image_queries.json`: image and multi-image examples
- `video_queries.json`: video example
- `batch_queries.json`: text-only example
- `assets/images`: demo images
- `assets/videos`: demo videos