feat(text-metrics): add text-based VLM judge metrics #639

Closed

davidberenstein1957 wants to merge 3 commits into main from feat/vlm-pr-3-text-metrics

Conversation

@davidberenstein1957 (Member)

Summary

Adds five text-based judge metrics that use vision-language understanding to evaluate text outputs (captions, answers, reasoning) against images and reference text.

Metrics Added

TextScoreMetric (372 lines)

Zero-shot semantic similarity between reference text and generated output given an image.

  • Input: image + reference text + generated text
  • Output: 0-1 similarity score
  • Use case: Image caption evaluation, description quality

OneIGTextScoreMetric

OneIG-tuned variant of TextScore (same logic, tuned prompts/weights).

QAAccuracyMetric (204 lines)

Visual question-answering correctness.

  • Input: image + question + answer
  • Output: 0-1 accuracy score
  • Use case: VQA task evaluation

OneIGAlignmentMetric (234 lines)

Vision-instruction alignment (does the output align with the task instructions?).

  • Input: image + instruction + output
  • Output: 0-1 alignment score
  • OneIG-specific tuning for alignment tasks

OneIGReasoningMetric (357 lines)

Step-by-step reasoning quality evaluation.

  • Input: image + reasoning steps
  • Output: 0-1 reasoning quality score
  • Evaluates logical flow and task relevance

Files

New:

  • src/pruna/evaluation/metrics/metric_text_score.py — TextScore + OneIGTextScore
  • src/pruna/evaluation/metrics/metric_text_score_utils.py — Shared prompt templates
  • src/pruna/evaluation/metrics/metric_qa_accuracy.py — QA metric
  • src/pruna/evaluation/metrics/metric_oneig_alignment.py — OneIG alignment
  • src/pruna/evaluation/metrics/metric_oneig_reasoning.py — OneIG reasoning
  • tests/evaluation/test_text_metrics.py (228 lines) — Unit + integration tests

Modified:

  • src/pruna/evaluation/metrics/__init__.py — Export 5 metrics
  • src/pruna/evaluation/metrics/registry.py — Register in metric registry
  • src/pruna/evaluation/benchmarks.py — Add text metric benchmark configs

Testing

# Unit tests (mocked VLM)
pytest tests/evaluation/test_text_metrics.py -v

# Specific metric
pytest tests/evaluation/test_text_metrics.py::test_text_score_metric -v

# With real VLM (requires OPENAI_API_KEY)
pytest tests/evaluation/test_text_metrics.py::test_text_score_real -v

Usage

from pruna.evaluation.metrics import TextScoreMetric

metric = TextScoreMetric(
    vlm_type="litellm",  # or "transformers"
    model_name="openai/gpt-4o",
    api_key="sk-..."  # optional if env var set
)

score = metric(
    image=image_tensor,
    reference_text="A cat sleeping",
    generated_text="A sleeping feline"
)
print(score)  # ~0.95
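
The remaining metrics follow the same construction pattern; the keyword arguments below are illustrative, taken from the input descriptions under "Metrics Added" rather than from a final API:

from pruna.evaluation.metrics import QAAccuracyMetric, OneIGReasoningMetric

# Hypothetical call signatures, mirroring the TextScoreMetric example above.
qa_metric = QAAccuracyMetric(vlm_type="litellm", model_name="openai/gpt-4o")
qa_score = qa_metric(
    image=image_tensor,
    question="What animal is shown?",
    answer="A cat",
)

reasoning_metric = OneIGReasoningMetric(vlm_type="litellm", model_name="openai/gpt-4o")
reasoning_score = reasoning_metric(
    image=image_tensor,
    reasoning_steps=[
        "The image shows a cat lying on a sofa.",
        "Its eyes are closed, so it is most likely asleep.",
    ],
)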

Benchmarks

New benchmark configs:

  • text-judge-zero-shot — TextScore baseline
  • text-judge-qa — QA accuracy
  • text-judge-oneig-align — OneIG alignment
  • text-judge-oneig-reason — OneIG reasoning

Context

Part of a 5-PR split:

  1. PR-1: Vendor Code ✓
  2. PR-2: Infrastructure ✓
  3. [THIS] PR-3: Text Metrics — First metric family
  4. PR-4: Vision Metrics — Second metric family
  5. PR-5: E2E Tests — Integration + docs

Dependencies: PR-2 (infrastructure + VLM base classes)
Blocks: PR-5 (e2e tests depend on all metrics)

Review Focus

  • Prompt engineering (are prompts effective? any injection risks?)
  • Scoring logic (how are VLM outputs converted to 0-1 scores?)
  • OneIG tuning (what makes OneIG variants different?)
  • Error handling (VLM failures, malformed inputs)
  • Test coverage (edge cases, empty batches)
  • Docstring accuracy (do docstrings match behavior?)

🤖 Generated with Claude Code

Commits

- Add LLM2Vec from OneIG vendor source
- Includes Llama encoder and bidirectional models
- Self-contained, no dependencies on Pruna internals
- Licensed under Apache 2.0
- Add BaseVLM abstract interface
- Add LitellmVLM for API-based inference (OpenAI, Anthropic, etc.)
- Add TransformersVLM for local Hugging Face models
- Add StatefulVLMMeanScoresMetric base class for judge metrics
- Add vlm_utils.py with image/batch utilities
- Add pyproject.toml dependency pins (peft, litellm)
- Add unit tests for infrastructure
- Add TextScoreMetric for semantic similarity
- Add OneIGTextScoreMetric (OneIG-tuned variant)
- Add QAAccuracyMetric for question-answering
- Add OneIGAlignmentMetric for vision-instruction alignment
- Add OneIGReasoningMetric for step-by-step reasoning
- Register all text metrics in registry
- Add benchmark configs for text-based evaluation
- Include unit and integration tests with mocked VLM
Copy link
Copy Markdown

@cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Reviewed by Cursor Bugbot for commit 077bcca.

name="OneIG Text Rendering",
description="OneIG subset: text and graphics painted into the image.",
metrics=["oneig_text_score"],
task_type=TASK_TYPE_TEXT_IMAGE,

OneIG subset benchmarks crash at import time

High Severity

The six OneIG subset benchmarks (e.g. "OneIG Anime Stylization") produce lookup_key values like "OneIGAnimeStylization" via name.replace(" ", ""), but base_datasets only contains a single "OneIG" entry. The _register method checks benchmark.lookup_key not in base_datasets and raises ValueError when it's missing. Since this loop runs at module level, importing benchmarks.py (done by task.py) will crash immediately on the first OneIG subset benchmark.
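
A minimal sketch of one possible fix, assuming the base_datasets mapping and the name.replace(" ", "") key derivation described above (names and shapes here are illustrative, not the final implementation):

# Illustrative only: collapse OneIG subset benchmark keys onto the
# registered "OneIG" base dataset so module-level registration succeeds.
base_datasets = {"OneIG": object()}  # stand-in for the real dataset config

def resolve_lookup_key(benchmark_name: str) -> str:
    key = benchmark_name.replace(" ", "")
    if key.startswith("OneIG") and key not in base_datasets:
        return "OneIG"
    return key

assert resolve_lookup_key("OneIG Anime Stylization") == "OneIG"
assert resolve_lookup_key("OneIG") == "OneIG"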

self.pooling_mode = pooling_mode
self.skip_instruction = skip_instruction
self.max_length = max_length
self.doc_max_length = 512

LLM2Vec ignores doc_max_length constructor parameter

Low Severity

LLM2Vec.__init__ accepts a doc_max_length parameter but hardcodes self.doc_max_length = 512 instead of using the passed value. Any caller providing a custom doc_max_length will have it silently ignored. Currently the only caller also passes 512, so there's no runtime impact yet, but it's a latent bug.
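
A minimal sketch of the fix, with the constructor trimmed to the attributes shown in the excerpt above (the real signature likely has more parameters):

class LLM2Vec:
    def __init__(self, pooling_mode, skip_instruction, max_length, doc_max_length=512):
        self.pooling_mode = pooling_mode
        self.skip_instruction = skip_instruction
        self.max_length = max_length
        self.doc_max_length = doc_max_length  # use the argument instead of hardcoding 512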

            out.append(Image.fromarray(np.asarray(img)).convert("RGB"))
        else:
            out.append(img)
    return out

Duplicate tensor-to-PIL conversion logic across files

Low Severity

_to_pil_list in metric_oneig_reasoning.py largely duplicates _tensor_to_pil and _process_images from vlm_utils.py. Both convert tensors to PIL images with nearly identical logic (handle 4D batch, normalize, transpose CHW→HWC). Having two copies risks divergent bug fixes — notably _to_pil_list conditionally transposes only when shape[0] == 3, while _tensor_to_pil always transposes. The existing utility could be reused.
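
A sketch of what a single shared helper could look like, following the conversion steps described above (squeeze a 4D batch of one, transpose CHW to HWC, scale to uint8); this is illustrative, not the existing vlm_utils implementation:

import numpy as np
import torch
from PIL import Image

def tensor_to_pil(img: torch.Tensor) -> Image.Image:
    """Convert a single image tensor to an RGB PIL image."""
    if img.ndim == 4:  # batch of one: (1, C, H, W)
        img = img.squeeze(0)
    arr = img.detach().cpu().float().numpy()
    if arr.ndim == 3 and arr.shape[0] in (1, 3):  # CHW -> HWC
        arr = arr.transpose(1, 2, 0)
    if arr.max() <= 1.0:  # assume floats in [0, 1]
        arr = arr * 255.0
    return Image.fromarray(arr.astype(np.uint8).squeeze()).convert("RGB")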

@davidberenstein1957 (Member, Author)

Superseded by metric-focused stacked PRs: #645, #646, #647, #648, #649, #650, #651, and stacked e2e #641.
