
Commit 5753e51

Authored by nssmd, Ubuntu, HAPI, claude, and mwxely
Add UniG2U benchmark task with model support (#1297)
* Add UniG2U benchmark task with model support

  Add UniG2U (Unified Generation-to-Understanding) benchmark covering 11 sub-tasks across chart understanding, geometry, physics, spatial planning, and visual puzzles.

  Two evaluation modes:
  - unig2u: standard understanding (single-stage)
  - unig2u_GtA: Visual CoT two-stage (generate auxiliary image, then answer)

  GtA is triggered explicitly via generation_kwargs.visual_cot: true.

  New model implementations: Ovis-U1, ILLUME+, MMaDa, Qwen-Image-Edit.

  via [HAPI](https://hapi.run)

* Fix qwen_image_edit device_map for diffusion pipeline

  DiffusionPipeline does not support device_map="auto"; use "balanced" instead.

* Add Visual CoT (GtA) support check for models

  Models without a GtA implementation now raise a clear error when run on visual_cot tasks, instead of silently degrading with garbled prompts.

  - Add `supports_visual_cot` class attribute to lmms base (default False)
  - Add `_check_visual_cot_support()` guard in base class
  - Call the check in the evaluator before generate_until
  - Set `supports_visual_cot = True` on ovis_u1, bagel_unig2u, illume_plus, qwen_image_edit

* Revert "Add Visual CoT (GtA) support check for models"

  This reverts commit 2bc4f01.

* Add generate_visual_cot as dedicated output type for GtA tasks

  Visual CoT (GtA) tasks now use output_type: generate_visual_cot instead of generate_until. Models must implement generate_visual_cot() to run these tasks; models without it get a clear NotImplementedError.

  - Add generate_visual_cot() default in lmms base class (raises NotImplementedError)
  - Register generate_visual_cot in ALL_OUTPUT_TYPES and construct_requests
  - All 31 visual_cot yaml configs: output_type → generate_visual_cot
  - 4 GtA models (ovis_u1, bagel_unig2u, illume_plus, qwen_image_edit): implement generate_visual_cot() delegating to generate_until()

* Update README with generate_visual_cot documentation

  Reflect the new GtA mechanism: models must implement generate_visual_cot() to run visual_cot tasks.

* Remove duplicate entries in .gitignore

* Organize unig2u task folder by sub-task

  Move yaml configs into per-task subdirectories for clarity. Symlink utils.py into each subdirectory for !function resolution.

  tasks/unig2u/
  ├── chartqa100/
  ├── geometry3k/
  ├── auxsolidmath/
  ├── babyvision/
  ├── illusionbench/
  ├── mmsi/
  ├── phyx/
  ├── realunify/
  ├── uni_mmmu/
  ├── vsp/
  ├── visualpuzzles/
  ├── unig2u.yaml       # top-level standard group
  ├── unig2u_GtA.yaml   # top-level GtA group
  ├── utils.py          # shared utilities
  └── README.md

* Address PR review feedback

  - Fix function name collisions: rename generic doc_to_visual/doc_to_text/process_results/aggregate_results to babyvision_* and realunify_* prefixed versions; update 10 yaml files accordingly
  - Fix Python 3.9 compat: _JudgeClient | None → Optional[_JudgeClient]
  - Remove eval_logger reassignment from loguru to stdlib logging
  - Remove parse_response random fallback (return None instead)
  - Remove gradient_checkpointing_enable() during eval in illume_plus
  - Add OOM warning for dual model loading in qwen_image_edit
  - Fix bagel_umm registration to use short class name
  - Fix != None → is not None
  - Fix GPT-5.1 → GPT-4o in docstrings
  - Fix bare except → except (json.JSONDecodeError, ValueError, KeyError)

* Split merged utils.py into per-task independent utils

  Replace the single 2900-line merged utils.py with the original per-task utils.py files from the reference implementation. Each sub-task folder now has its own independent utils.py, eliminating function name collisions entirely.

* style: run black + isort formatting

* Fix code quality issues in per-task utils

  - Remove eval_logger reassignment in mmsi, visualpuzzles utils
  - Remove random fallback in visualpuzzles parse_response (return None)
  - Fix != None → is not None in visualpuzzles
  - Fix GPT-5.1 → GPT-4o in auxsolidmath, geometry3k docstrings
  - Fix bare except → specific exceptions in uni_mmmu

* Fix lint: run black + isort on mmsi and visualpuzzles utils

---------

Co-authored-by: Ubuntu <xinjiezhang@RobustDNN-A100-47.anzplpv4vzne3ajfbcfpzr1dkd.ix.internal.cloudapp.net>
Co-authored-by: HAPI <noreply@hapi.run>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: mwxely <yang0756@e.ntu.edu.sg>
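Based on the mechanism described above (a dedicated `output_type: generate_visual_cot` plus `generation_kwargs.visual_cot: true`), a GtA sub-task config would look roughly like the following sketch. The task name, function names, and token limit are illustrative, not copied from the repo:

```yaml
# Hypothetical sketch of a unig2u GtA sub-task config; values are illustrative.
task: chartqa100_GtA
output_type: generate_visual_cot   # dedicated output type added by this commit
doc_to_visual: !function utils.doc_to_visual
doc_to_text: !function utils.doc_to_text
generation_kwargs:
  visual_cot: true                 # triggers the two-stage GtA pipeline
  max_new_tokens: 1024
```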
1 parent be9c135 commit 5753e51

101 files changed

Lines changed: 11471 additions & 1 deletion


.gitignore

Lines changed: 2 additions & 0 deletions
```diff
@@ -81,3 +81,5 @@ CLAUDE.md
 .opencode
 .ignored/
 .worktrees/
+Bagel/
+MMaDA/
```

lmms_eval/api/model.py

Lines changed: 11 additions & 0 deletions
```diff
@@ -118,6 +118,17 @@ def generate_until_multi_round(self, requests) -> List[str]:
         """
         pass

+    def generate_visual_cot(self, requests) -> List[str]:
+        """Visual CoT (GtA) generation: two-stage pipeline that generates an
+        auxiliary visualization image (Stage 1) and then answers using both
+        the original and generated images (Stage 2).
+
+        Models that support GtA must override this method.
+        """
+        raise NotImplementedError(
+            f"{type(self).__name__} does not support Visual CoT (GtA). " f"To run visual_cot tasks, the model must implement generate_visual_cot(). " f"Supported models: ovis_u1, bagel_unig2u, illume_plus, qwen_image_edit"
+        )
+
     @classmethod
     def create_from_arg_string(cls: Type[T], arg_string: str, additional_config: Optional[dict] = None) -> T:
         """
```
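A model that supports GtA overrides this hook. The following standalone sketch stubs the `lmms` base class with the default added above and shows a hypothetical subclass; `ToyGtAModel`, `NoGtAModel`, and `_generate_auxiliary_image` are illustrative names, not code from the repo (real implementations live in ovis_u1, bagel_unig2u, illume_plus, and qwen_image_edit):

```python
from typing import List


class lmms:
    """Stub of the lmms base class, mirroring the default added in this commit."""

    def generate_visual_cot(self, requests) -> List[str]:
        raise NotImplementedError(
            f"{type(self).__name__} does not support Visual CoT (GtA). "
            f"To run visual_cot tasks, the model must implement generate_visual_cot()."
        )


class ToyGtAModel(lmms):
    """Hypothetical GtA model: Stage 1 draws an auxiliary image, Stage 2 answers."""

    def _generate_auxiliary_image(self, context: str) -> str:
        # Stage 1: a real model would call its image-generation head here.
        return f"<aux image for: {context}>"

    def generate_visual_cot(self, requests) -> List[str]:
        answers = []
        for context in requests:
            aux_image = self._generate_auxiliary_image(context)
            # Stage 2: answer using both the original and the generated image.
            answers.append(f"answer({context}, {aux_image})")
        return answers


class NoGtAModel(lmms):
    """A model that never implemented the hook: fails loudly, not silently."""


if __name__ == "__main__":
    print(ToyGtAModel().generate_visual_cot(["q1"]))
    try:
        NoGtAModel().generate_visual_cot(["q1"])
    except NotImplementedError as exc:
        print("raised:", type(exc).__name__)
```

The point of the dedicated method (versus the reverted attribute-flag approach) is that unsupported models fail with a clear `NotImplementedError` at dispatch time rather than producing garbled prompts.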

lmms_eval/api/task.py

Lines changed: 4 additions & 1 deletion
```diff
@@ -59,6 +59,7 @@
     "generate_until",
     "generate_until_multi_round",
     "generate_until_agentic",
+    "generate_visual_cot",
 ]


@@ -1563,6 +1564,8 @@ def construct_requests(self, doc_id: int, ctx: str, **kwargs) -> Union[List[Inst

         elif self.OUTPUT_TYPE == "generate_until":
             arguments = (ctx, copy.deepcopy(self.config.generation_kwargs), self.doc_to_visual, doc_id, self.config.task, split)
+        elif self.OUTPUT_TYPE == "generate_visual_cot":
+            arguments = (ctx, copy.deepcopy(self.config.generation_kwargs), self.doc_to_visual, doc_id, self.config.task, split)
         elif self.OUTPUT_TYPE == "generate_until_multi_round":
             arguments = (ctx, copy.deepcopy(self.config.generation_kwargs), self.doc_to_visual, partial(self.config.doc_to_text, lmms_eval_specific_kwargs=self.lmms_eval_specific_kwargs), doc_id, self.config.task, split)
         elif self.OUTPUT_TYPE == "generate_until_agentic":
@@ -1572,7 +1575,7 @@ def construct_requests(self, doc_id: int, ctx: str, **kwargs) -> Union[List[Inst
     # TODO: we add a full_docs interface here for some evaluations that needs to access the full datasets during process_results function. we may have better ways to handle this.
     @retry(stop=(stop_after_attempt(5) | stop_after_delay(1200)), wait=wait_fixed(2))
     def process_results(self, doc, results, full_docs=None):
-        if self.OUTPUT_TYPE == "generate_until":
+        if self.OUTPUT_TYPE in ("generate_until", "generate_visual_cot"):
             if isinstance(results, list) and isinstance(results[0], list):
                 results = [res.strip() for res in results[0]]
             else:
```
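The task-side change is deliberately small: `generate_visual_cot` reuses the exact argument shape of `generate_until`, so only the model-side entry point differs. A minimal sketch of that dispatch pattern (`build_arguments` is a hypothetical helper mirroring the branch above, not a function in the repo):

```python
import copy

def build_arguments(output_type, ctx, generation_kwargs, doc_to_visual, doc_id, task, split):
    """Hypothetical mirror of the construct_requests branch: map a doc context
    to the argument tuple handed to the model for the given output type."""
    if output_type in ("generate_until", "generate_visual_cot"):
        # generate_visual_cot shares generate_until's argument shape;
        # the deepcopy keeps per-request generation_kwargs independent.
        return (ctx, copy.deepcopy(generation_kwargs), doc_to_visual, doc_id, task, split)
    raise ValueError(f"unsupported output type: {output_type}")


args = build_arguments("generate_visual_cot", "What does the chart show?", {"visual_cot": True}, None, 0, "unig2u_GtA", "test")
print(len(args))
```

Because `process_results` now accepts both output types in one branch, existing per-task scoring code works unchanged for GtA runs.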

lmms_eval/models/__init__.py

Lines changed: 5 additions & 0 deletions
```diff
@@ -23,6 +23,7 @@
     "auroracap": "AuroraCap",
     "bagel": "Bagel",
     "bagel_umm": "BagelUMM",
+    "bagel_unig2u": "BagelUniG2U",
     "baichuan_omni": "BaichuanOmni",
     "batch_gpt4": "BatchGPT4",
     "claude": "Claude",
@@ -40,6 +41,7 @@
     "gemma3": "Gemma3",
     "gpt4v": "GPT4V",
     "idefics2": "Idefics2",
+    "illume_plus": "ILLUMEPlus",
     "instructblip": "InstructBLIP",
     "internvideo2_5": "InternVideo2_5",
     "internvideo2": "InternVideo2",
@@ -63,12 +65,14 @@
     "minicpm_o": "MiniCPM_O",
     "minicpm_v": "MiniCPM_V",
     "minimonkey": "MiniMonkey",
+    "mmada": "MMaDA",
     "moviechat": "MovieChat",
     "mplug_owl_video": "mplug_Owl",
     "ola": "Ola",
     "omnivinci": "OmniVinci",
     "openai": "OpenAICompatible",
     "oryx": "Oryx",
+    "ovis_u1": "OvisU1",
     "penguinvl": "PenguinVL",
     "phi3v": "Phi3v",
     "phi4_multimodal": "Phi4",
@@ -79,6 +83,7 @@
     "qwen2_5_vl": "Qwen2_5_VL",
     "qwen2_audio": "Qwen2_Audio",
     "qwen2_vl": "Qwen2_VL",
+    "qwen_image_edit": "QwenImageEdit",
     "qwen3_omni": "Qwen3_Omni",
     "qwen3_vl": "Qwen3_VL",
     "qwen3_5": "Qwen3_5",
```
0 commit comments