feat: add Open-X VQA task #1346

Open
njb-nvidia wants to merge 1 commit into EvolvingLMMs-Lab:main from njb-nvidia:add-openxvqa-task

@njb-nvidia
Contributor

Summary

Adds Open-X VQA, a multiple-choice VQA benchmark for embodied AI / robotic manipulation scenes derived from the Open-X-Embodiment data. Each item is a single-image MCQ where the model selects one of A-D.

  • Dataset: `nv-njb/OpenXVQA` on Hugging Face (6,676 items in a single `test` split).
  • Metric: `openxvqa_accuracy` — exact-match on the extracted MCQ letter.
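
For readers unfamiliar with this style of metric, the following is a minimal sketch of letter extraction followed by exact-match scoring. It is not the code in `utils.py`; the function names `extract_letter` and `accuracy` and the two regular expressions are assumptions chosen for illustration.

```python
import re
from typing import Optional, Sequence

# Hypothetical patterns, not taken from the PR.
# 1) The response starts with the letter: "B", "B.", "(B) the red block".
_LEADING = re.compile(r"^\(?([A-D])\)?(?:[.:,\s]|$)")
# 2) A parenthesised letter anywhere: "The answer is (D)".
_PAREN = re.compile(r"\(([A-D])\)")


def extract_letter(response: str) -> Optional[str]:
    """Return the first option letter found in the response, or None."""
    text = response.strip()
    for pattern in (_LEADING, _PAREN):
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None


def accuracy(responses: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of responses whose extracted letter equals the gold letter."""
    if not answers:
        return 0.0
    correct = sum(extract_letter(r) == a for r, a in zip(responses, answers))
    return correct / len(answers)
```

A response with no recognisable letter extracts to `None` and therefore counts as wrong. Note that a free-form answer beginning with the article "A" would be read as option A under this sketch, so it is only suitable when the prompt asks for the letter alone.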

Files

  • `lmms_eval/tasks/openxvqa/openxvqa.yaml` — task config.
  • `lmms_eval/tasks/openxvqa/utils.py` — image bytes -> PIL, MCQ letter extraction.
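
As a rough illustration of the "image bytes -> PIL" step, the sketch below decodes raw bytes with Pillow and returns the image in a list. The field name `image`, the helper names, and the assumption that the dataset stores encoded bytes directly under that field are all hypothetical; the actual schema and helpers are defined by the dataset and `utils.py`.

```python
import io

from PIL import Image


def bytes_to_pil(image_bytes: bytes) -> Image.Image:
    """Decode encoded image bytes (e.g. PNG or JPEG) into an RGB PIL image."""
    return Image.open(io.BytesIO(image_bytes)).convert("RGB")


def doc_to_visual(doc: dict) -> list:
    """Return the visuals for one item as a list of PIL images.

    Assumes a hypothetical `image` field holding the encoded bytes.
    """
    return [bytes_to_pil(doc["image"])]
```

Converting to RGB up front avoids mode mismatches (for example palette or RGBA inputs) further down the pipeline.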

Parity vs. local fork

Qwen3-VL-2B-Instruct, full `test` split (6,676 items), 8x H100, greedy decoding.

| Source | Accuracy | Identical predictions |
|---|---|---|
| Fork | 0.5685 | - |
| Upstream | 0.5785 | 5,711 / 6,676 (85.6%) |

The +1.0 pp delta (upstream minus fork) is within the noise we have seen on other ports, which we attribute to minor drift in the upstream `qwen3_vl` model class.

Test plan

  • `uv run lmms-eval --tasks openxvqa --limit 8` smoke (single GPU)
  • Full `test` run on 8x H100 with Qwen3-VL-2B-Instruct; scores match the fork within noise
  • Per-doc analysis: 85.6% of `filtered_resps` are identical between fork and upstream
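
The per-doc comparison can be reproduced along these lines. This is a sketch under stated assumptions: predictions from each run have already been loaded into dictionaries keyed by document id, and the function name `prediction_agreement` is hypothetical. Loading the per-sample logs is left out because their exact layout is not described here.

```python
from typing import Dict, Hashable, Tuple


def prediction_agreement(
    fork: Dict[Hashable, str], upstream: Dict[Hashable, str]
) -> Tuple[int, int]:
    """Count documents with identical predictions across two runs.

    Returns (identical, compared), where `compared` is the number of
    document ids present in both runs.
    """
    shared = fork.keys() & upstream.keys()
    identical = sum(fork[doc_id] == upstream[doc_id] for doc_id in shared)
    return identical, len(shared)


# Toy data for illustration only, not results from the PR.
fork_preds = {0: "A", 1: "B", 2: "C", 3: "D"}
upstream_preds = {0: "A", 1: "C", 2: "C", 3: "D"}
identical, compared = prediction_agreement(fork_preds, upstream_preds)
print(f"{identical} / {compared} identical")
```

Restricting the comparison to shared ids keeps the ratio meaningful if one run was executed with `--limit`.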
