feat(text-metrics): add text-based VLM judge metrics #639
davidberenstein1957 wants to merge 3 commits into
Conversation
- Add LLM2Vec from OneIG vendor source
- Includes Llama encoder and bidirectional models
- Self-contained, no dependencies on Pruna internals
- Licensed under Apache 2.0
- Add BaseVLM abstract interface
- Add LitellmVLM for API-based inference (OpenAI, Anthropic, etc.)
- Add TransformersVLM for local Hugging Face models
- Add StatefulVLMMeanScoresMetric base class for judge metrics
- Add vlm_utils.py with image/batch utilities
- Add pyproject.toml dependency pins (peft, litellm)
- Add unit tests for infrastructure
- Add TextScoreMetric for semantic similarity
- Add OneIGTextScoreMetric (OneIG-tuned variant)
- Add QAAccuracyMetric for question-answering
- Add OneIGAlignmentMetric for vision-instruction alignment
- Add OneIGReasoningMetric for step-by-step reasoning
- Register all text metrics in registry
- Add benchmark configs for text-based evaluation
- Include unit and integration tests with mocked VLM
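A rough sketch of how the classes named in the commits above might fit together. The method names and signatures below (`generate`, `update`, `compute`) are assumptions for illustration only, not the PR's actual interface:

```python
from abc import ABC, abstractmethod


class BaseVLM(ABC):
    """Minimal judge-VLM interface (hypothetical sketch, not the PR's actual API)."""

    @abstractmethod
    def generate(self, images: list, prompt: str) -> str:
        """Return the model's textual judgement for the given images and prompt."""


class StatefulVLMMeanScoresMetric:
    """Accumulates per-sample judge scores and reports their mean (sketch only)."""

    def __init__(self, vlm: BaseVLM) -> None:
        self.vlm = vlm
        self.scores: list[float] = []

    def update(self, images: list, prompt: str) -> None:
        # Ask the VLM for a score on one sample and remember it.
        self.scores.append(float(self.vlm.generate(images, prompt)))

    def compute(self) -> float:
        # Mean over all accumulated scores; 0.0 if nothing was scored.
        return sum(self.scores) / len(self.scores) if self.scores else 0.0
```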
Cursor Bugbot has reviewed your changes and found 3 potential issues.
Reviewed by Cursor Bugbot for commit 077bcca. Configure here.
name="OneIG Text Rendering",
description="OneIG subset: text and graphics painted into the image.",
metrics=["oneig_text_score"],
task_type=TASK_TYPE_TEXT_IMAGE,
OneIG subset benchmarks crash at import time
High Severity
The six OneIG subset benchmarks (e.g. "OneIG Anime Stylization") produce lookup_key values like "OneIGAnimeStylization" via name.replace(" ", ""), but base_datasets only contains a single "OneIG" entry. The _register method checks benchmark.lookup_key not in base_datasets and raises ValueError when it's missing. Since this loop runs at module level, importing benchmarks.py (done by task.py) will crash immediately on the first OneIG subset benchmark.
Additional Locations (1)
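A condensed, hypothetical reproduction of the failure path described above (names mirror the comment; the actual code in benchmarks.py may differ):

```python
# Sketch of the registration path described above; this loop runs at module import.
base_datasets = {"OneIG": "oneig"}  # only the parent dataset is registered

subset_names = ["OneIG Anime Stylization", "OneIG Text Rendering"]

for name in subset_names:
    lookup_key = name.replace(" ", "")  # "OneIGAnimeStylization", "OneIGTextRendering"
    if lookup_key not in base_datasets:
        # The very first subset already fails, so importing benchmarks.py crashes.
        raise ValueError(f"No base dataset registered for {lookup_key!r}")
```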
self.pooling_mode = pooling_mode
self.skip_instruction = skip_instruction
self.max_length = max_length
self.doc_max_length = 512
LLM2Vec ignores doc_max_length constructor parameter
Low Severity
LLM2Vec.__init__ accepts a doc_max_length parameter but hardcodes self.doc_max_length = 512 instead of using the passed value. Any caller providing a custom doc_max_length will have it silently ignored. Currently the only caller also passes 512, so there's no runtime impact yet, but it's a latent bug.
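A minimal sketch of the fix, with unrelated constructor arguments omitted (the real signature has more parameters):

```python
class LLM2Vec:
    def __init__(self, pooling_mode: str, skip_instruction: bool,
                 max_length: int, doc_max_length: int = 512) -> None:
        self.pooling_mode = pooling_mode
        self.skip_instruction = skip_instruction
        self.max_length = max_length
        # Respect the caller's value instead of hardcoding 512.
        self.doc_max_length = doc_max_length
```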
            out.append(Image.fromarray(np.asarray(img)).convert("RGB"))
        else:
            out.append(img)
    return out
Duplicate tensor-to-PIL conversion logic across files
Low Severity
_to_pil_list in metric_oneig_reasoning.py largely duplicates _tensor_to_pil and _process_images from vlm_utils.py. Both convert tensors to PIL images with nearly identical logic (handle 4D batch, normalize, transpose CHW→HWC). Having two copies risks divergent bug fixes — notably _to_pil_list conditionally transposes only when shape[0] == 3, while _tensor_to_pil always transposes. The existing utility could be reused.
Additional Locations (1)
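One way to deduplicate, assuming the shared helper accepts the same inputs (the module path and function name below follow the comment and are not verified against the PR):

```python
# metric_oneig_reasoning.py (sketch): delegate to the shared conversion helper
# instead of keeping a local copy of the tensor -> PIL logic.
from pruna.evaluation.metrics.vlm_utils import _process_images  # path assumed


def _to_pil_list(images):
    # A single conversion path keeps normalization and CHW->HWC handling in one place.
    return _process_images(images)
```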


Summary
Adds five text-based judge metrics that use vision-language understanding to evaluate text outputs (captions, answers, reasoning) against images and reference text.
Metrics Added
TextScoreMetric (372 lines): Zero-shot semantic similarity between reference text and generated output, given an image.
OneIGTextScoreMetric: OneIG-tuned variant of TextScore (same logic, tuned prompts/weights).
QAAccuracyMetric (204 lines): Visual question-answering correctness.
OneIGAlignmentMetric (234 lines): Vision-instruction alignment (does the output follow the task instructions?).
OneIGReasoningMetric (357 lines): Step-by-step reasoning quality evaluation.
Files
New:
- src/pruna/evaluation/metrics/metric_text_score.py: TextScore + OneIGTextScore
- src/pruna/evaluation/metrics/metric_text_score_utils.py: Shared prompt templates
- src/pruna/evaluation/metrics/metric_qa_accuracy.py: QA metric
- src/pruna/evaluation/metrics/metric_oneig_alignment.py: OneIG alignment
- src/pruna/evaluation/metrics/metric_oneig_reasoning.py: OneIG reasoning
- tests/evaluation/test_text_metrics.py (228 lines): Unit + integration tests
Modified:
- src/pruna/evaluation/metrics/__init__.py: Export 5 metrics
- src/pruna/evaluation/metrics/registry.py: Register in metric registry
- src/pruna/evaluation/benchmarks.py: Add text metric benchmark configs
Testing
Usage
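A hypothetical sketch of driving one of the new metrics directly. The import path follows the Files section above, but the constructor and the update/compute signatures are assumptions, not this PR's confirmed API:

```python
# Hypothetical usage sketch: score generated captions against reference text + images.
from pruna.evaluation.metrics import TextScoreMetric  # exported per the Files section

metric = TextScoreMetric()                      # assumed to work with default arguments
for images, references, outputs in dataloader:  # any iterable of evaluation batches
    metric.update(images, references, outputs)  # assumed update signature
print(metric.compute())                         # mean judge score across batches
```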
Benchmarks
New benchmark configs:
- text-judge-zero-shot: TextScore baseline
- text-judge-qa: QA accuracy
- text-judge-oneig-align: OneIG alignment
- text-judge-oneig-reason: OneIG reasoning
Context
Part of a 5-PR split:
Dependencies: PR-2 (infrastructure + VLM base classes)
Blocks: PR-5 (e2e tests depend on all metrics)
Review Focus
🤖 Generated with Claude Code