feat(text-metrics): add text-based VLM judge metrics #639
davidberenstein1957 wants to merge 3 commits into
Conversation
- Add LLM2Vec from OneIG vendor source
- Includes Llama encoder and bidirectional models
- Self-contained, no dependencies on Pruna internals
- Licensed under Apache 2.0
- Add BaseVLM abstract interface
- Add LitellmVLM for API-based inference (OpenAI, Anthropic, etc.)
- Add TransformersVLM for local Hugging Face models
- Add StatefulVLMMeanScoresMetric base class for judge metrics
- Add vlm_utils.py with image/batch utilities
- Add pyproject.toml dependency pins (peft, litellm)
- Add unit tests for infrastructure
- Add TextScoreMetric for semantic similarity
- Add OneIGTextScoreMetric (OneIG-tuned variant)
- Add QAAccuracyMetric for question-answering
- Add OneIGAlignmentMetric for vision-instruction alignment
- Add OneIGReasoningMetric for step-by-step reasoning
- Register all text metrics in registry
- Add benchmark configs for text-based evaluation
- Include unit and integration tests with mocked VLM
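A rough sketch of how the classes named in the commits above might fit together. The method names and signatures below (`generate`, `update`, `compute`) are assumptions for illustration only, not the PR's actual interface:

```python
from abc import ABC, abstractmethod


class BaseVLM(ABC):
    """Minimal judge-VLM interface (hypothetical sketch, not the PR's actual API)."""

    @abstractmethod
    def generate(self, images: list, prompt: str) -> str:
        """Return the model's textual judgement for the given images and prompt."""


class StatefulVLMMeanScoresMetric:
    """Accumulates per-sample judge scores and reports their mean (sketch only)."""

    def __init__(self, vlm: BaseVLM) -> None:
        self.vlm = vlm
        self.scores: list[float] = []

    def update(self, images: list, prompt: str) -> None:
        # Ask the VLM for a score on one sample and remember it.
        self.scores.append(float(self.vlm.generate(images, prompt)))

    def compute(self) -> float:
        # Mean over all accumulated scores; 0.0 if nothing was scored.
        return sum(self.scores) / len(self.scores) if self.scores else 0.0
```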
Cursor Bugbot has reviewed your changes and found 3 potential issues.
Reviewed by Cursor Bugbot for commit 077bcca. Configure here.
name="OneIG Text Rendering",
description="OneIG subset: text and graphics painted into the image.",
metrics=["oneig_text_score"],
task_type=TASK_TYPE_TEXT_IMAGE,
OneIG subset benchmarks crash at import time
High Severity
The six OneIG subset benchmarks (e.g. "OneIG Anime Stylization") produce lookup_key values like "OneIGAnimeStylization" via name.replace(" ", ""), but base_datasets only contains a single "OneIG" entry. The _register method checks benchmark.lookup_key not in base_datasets and raises ValueError when it's missing. Since this loop runs at module level, importing benchmarks.py (done by task.py) will crash immediately on the first OneIG subset benchmark.
Additional Locations (1)
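A condensed, hypothetical reproduction of the failure path described above (names mirror the comment; the actual code in benchmarks.py may differ):

```python
# Sketch of the registration path described above; this loop runs at module import.
base_datasets = {"OneIG": "oneig"}  # only the parent dataset is registered

subset_names = ["OneIG Anime Stylization", "OneIG Text Rendering"]

for name in subset_names:
    lookup_key = name.replace(" ", "")  # "OneIGAnimeStylization", "OneIGTextRendering"
    if lookup_key not in base_datasets:
        # The very first subset already fails, so importing benchmarks.py crashes.
        raise ValueError(f"No base dataset registered for {lookup_key!r}")
```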
self.pooling_mode = pooling_mode
self.skip_instruction = skip_instruction
self.max_length = max_length
self.doc_max_length = 512
LLM2Vec ignores doc_max_length constructor parameter
Low Severity
LLM2Vec.__init__ accepts a doc_max_length parameter but hardcodes self.doc_max_length = 512 instead of using the passed value. Any caller providing a custom doc_max_length will have it silently ignored. Currently the only caller also passes 512, so there's no runtime impact yet, but it's a latent bug.
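A minimal sketch of the fix, with unrelated constructor arguments omitted (the real signature has more parameters):

```python
class LLM2Vec:
    def __init__(self, pooling_mode: str, skip_instruction: bool,
                 max_length: int, doc_max_length: int = 512) -> None:
        self.pooling_mode = pooling_mode
        self.skip_instruction = skip_instruction
        self.max_length = max_length
        # Respect the caller's value instead of hardcoding 512.
        self.doc_max_length = doc_max_length
```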
            out.append(Image.fromarray(np.asarray(img)).convert("RGB"))
        else:
            out.append(img)
    return out
Duplicate tensor-to-PIL conversion logic across files
Low Severity
_to_pil_list in metric_oneig_reasoning.py largely duplicates _tensor_to_pil and _process_images from vlm_utils.py. Both convert tensors to PIL images with nearly identical logic (handle 4D batch, normalize, transpose CHW→HWC). Having two copies risks divergent bug fixes — notably _to_pil_list conditionally transposes only when shape[0] == 3, while _tensor_to_pil always transposes. The existing utility could be reused.
Additional Locations (1)
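One way to deduplicate, assuming the shared helper accepts the same inputs (the module path and function name below follow the comment and are not verified against the PR):

```python
# metric_oneig_reasoning.py (sketch): delegate to the shared conversion helper
# instead of keeping a local copy of the tensor -> PIL logic.
from pruna.evaluation.metrics.vlm_utils import _process_images  # path assumed


def _to_pil_list(images):
    # A single conversion path keeps normalization and CHW->HWC handling in one place.
    return _process_images(images)
```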


Summary
Adds five text-based judge metrics that use vision-language understanding to evaluate text outputs (captions, answers, reasoning) against images and reference text.
Metrics Added
TextScoreMetric (372 lines): Zero-shot semantic similarity between reference text and generated output, given an image.
OneIGTextScoreMetric: OneIG-tuned variant of TextScore (same logic, tuned prompts/weights).
QAAccuracyMetric (204 lines): Visual question-answering correctness.
OneIGAlignmentMetric (234 lines): Vision-instruction alignment (does the output follow the task instructions?).
OneIGReasoningMetric (357 lines): Step-by-step reasoning quality evaluation.
Files
New:
- src/pruna/evaluation/metrics/metric_text_score.py: TextScore + OneIGTextScore
- src/pruna/evaluation/metrics/metric_text_score_utils.py: Shared prompt templates
- src/pruna/evaluation/metrics/metric_qa_accuracy.py: QA metric
- src/pruna/evaluation/metrics/metric_oneig_alignment.py: OneIG alignment
- src/pruna/evaluation/metrics/metric_oneig_reasoning.py: OneIG reasoning
- tests/evaluation/test_text_metrics.py (228 lines): Unit + integration tests
Modified:
- src/pruna/evaluation/metrics/__init__.py: Export 5 metrics
- src/pruna/evaluation/metrics/registry.py: Register in metric registry
- src/pruna/evaluation/benchmarks.py: Add text metric benchmark configs
Testing
Usage
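A hypothetical sketch of driving one of the new metrics directly. The import path follows the Files section above, but the constructor and the update/compute signatures are assumptions, not this PR's confirmed API:

```python
# Hypothetical usage sketch: score generated captions against reference text + images.
from pruna.evaluation.metrics import TextScoreMetric  # exported per the Files section

metric = TextScoreMetric()                      # assumed to work with default arguments
for images, references, outputs in dataloader:  # any iterable of evaluation batches
    metric.update(images, references, outputs)  # assumed update signature
print(metric.compute())                         # mean judge score across batches
```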
Benchmarks
New benchmark configs:
- text-judge-zero-shot: TextScore baseline
- text-judge-qa: QA accuracy
- text-judge-oneig-align: OneIG alignment
- text-judge-oneig-reason: OneIG reasoning
Context
Part of a 5-PR split:
Dependencies: PR-2 (infrastructure + VLM base classes)
Blocks: PR-5 (e2e tests depend on all metrics)
Review Focus
🤖 Generated with Claude Code