Feat/vlm evaluation pipeline by camilobrownpinilla · Pull Request #124 · KempnerInstitute/KempnerForge

camilobrownpinilla · 2026-06-25T15:00:10Z

Summary

Adds a standalone VLM evaluation pipeline that evaluates any KempnerForge VLM checkpoint on the standard multimodal benchmarks lmms-eval implements as generate_until tasks (MMMU, MMBench, ScienceQA, SEED, AI2D, …), by integrating the lmms-eval harness through a custom model adapter. The adapter wraps VLMWrapper and loads directly from a DCP checkpoint. It registers as an lmms-eval chat-model plugin via a pyproject.toml entry point, so lmms-eval stays unmodified and is not added as a dependency. v1 is single-GPU, image-only, and generation-only. All changes are additive and backward compatible; the only edit to existing code is a behavior-preserving refactor.

lmms-eval chat-model adapter (kempnerforge/eval/vlm/adapter.py)

KempnerForgeVLM(lmms), a chat model (is_simple = False) wrapping VLMWrapper. Arch-agnostic across the generative arches (joint-decoder / cross-attention / MoT); MoMa fails fast — its non-causal expert-choice routing cannot autoregressively generate, and eval requires generation.
Implements generate_until only; loglikelihood and generate_until_multi_round raise NotImplementedError (chat tasks are generation-only — standard multiple-choice benchmarks run as generate_until).
Cache-less, single-GPU, batched decode. Re-runs VLMWrapper.forward over the growing right-padded batch each step — no transformer KV cache (Transformer.forward forbids kv_caches + any image-conditioning route, and there is no image-conditioned KV-cache path) — reusing kempnerforge.model.generate.sample. Requests are grouped by gen_kwargs and right-padded to the batch-max length (the layout training uses: image prefix at 0..n-1, text contiguous from n, trailing pads causally masked), so a batched forward gives each row the same real-position logits as decoding it alone (batch-equivalence pinned by a test); EOS / until / max_new_tokens are tracked per row. Single-GPU is the validated invocation, not baked in: rank/world_size come from the lmms base and model construction sits behind a _build_model seam.
Guards with clear NotImplementedError / ValueError: MoMa, video/audio, multi-image, multi-turn/few-shot.

Checkpoint loading + preprocessing reuse

Single-process dcp.load of the model shards only (DCP reshards, so FSDP/PP checkpoints load into the full model); resolve_resume_path with a specific-step_N fallback; reads plain-JSON metadata.json for step/tokens_seen.
Reuses the exact training-time preprocessing: a behavior-preserving refactor of kempnerforge/data/vlm_dataset.py exposes pil_to_tensor, build_tokenizer, resolve_pad_id, and DEFAULT_IMAGE_MEAN/STD (the image/text paths stay bit-exact). Prompt rendering flattens the ChatMessages text blocks (no chat template, no <image> placeholder — images are conditioned via pixel_values); model-specific chat-template support is a documented follow-up.

Registration (lmms-eval stays unmodified)

kempnerforge/eval/vlm/registry.py: MANIFEST = ModelManifest(model_id="kempnerforge_vlm", chat_class_path="kempnerforge.eval.vlm.adapter.KempnerForgeVLM").
pyproject.toml: a [project.entry-points."lmms_eval.models"] entry for auto-discovery — metadata only; lmms-eval is not a dependency (install separately with uv pip install lmms-eval, mirroring how lm-eval is handled). The eval subsystem is import-isolated — kempnerforge/eval/ is not imported on the default import kempnerforge path, so the main package keeps working with lmms-eval absent (pinned by test_import_isolation.py); a file-level # pyright: reportMissingImports=false keeps pyright kempnerforge/ green in CI (where the undeclared lmms-eval is absent).

Harness + docs

scripts/vlm_eval_harness.py: CLI mirroring scripts/eval_harness.py (no conversion). --config / --checkpoint / --tasks required (default suite TBD), plus --limit / --output / --device / --dtype / --batch-size / --max-new-tokens; lazy lmms_eval.evaluator.simple_evaluate import with a helpful error if it is not installed.
docs/how-to/run-vlm-evaluation.md, wired into the how-to toctree; CHANGELOG.

Design faithfulness

New, parallel subsystem under kempnerforge/eval/vlm/; no changes to training, the existing loss/perplexity eval path, or model code. Visual input is modeled as an ordered list of frames (a single image is the F == 1 case) so video is a localized future addition.

Deferred to follow-up PRs:
video, multi-turn, few-shot, and multi-image tasks (need the video model / model-side changes); loglikelihood / multiple_choice scoring and multi-round (chat tasks are generation-only); MoMa eval (needs generation support); data-parallel and sharded multi-GPU inference; single image-encode per request (a model-side seam); model-specific chat templates; a representative default benchmark suite; and whether to formalize the lmms-eval dependency.

Testing

uv run ruff check kempnerforge/ tests/ passes
uv run ruff format --check kempnerforge/ tests/ scripts/ passes
uv run pyright kempnerforge/ passes (0 errors)
uv run pytest tests/unit/ -v --timeout=60 passes
(N/A) [ ] If distributed code changed: uv run torchrun --nproc_per_node=4 -m pytest tests/distributed/ -v
(N/A) [ ] If training loop / parallelism / optimizers changed: uv run pytest tests/e2e/ --e2e -v

Closes #122

codecov · 2026-06-25T15:03:23Z

Codecov Report

❌ Patch coverage is 5.63380% with 201 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
kempnerforge/eval/vlm/adapter.py	0.00%	199 Missing ⚠️
kempnerforge/eval/vlm/registry.py	0.00%	2 Missing ⚠️

Files with missing lines	Coverage Δ
kempnerforge/data/vlm_dataset.py	`99.01% <100.00%> (+0.05%)`	⬆️
kempnerforge/eval/vlm/registry.py	`0.00% <0.00%> (ø)`
kempnerforge/eval/vlm/adapter.py	`0.00% <0.00%> (ø)`

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot

Pull request overview

Adds a standalone VLM evaluation subsystem that integrates the optional lmms-eval harness via a KempnerForge chat-model adapter, enabling benchmarking of KempnerForge VLM DCP checkpoints without modifying or depending on lmms-eval at install time. It also refactors VLM dataset preprocessing helpers so the adapter can reuse the training-time image/text preprocessing paths exactly.

Changes:

Introduce kempnerforge/eval/vlm/ with an lmms-eval chat-model adapter (KempnerForgeVLM) + registration manifest, and wire it via a pyproject.toml entry point.
Add a CLI harness (scripts/vlm_eval_harness.py) plus docs and changelog entries for running VLM evaluation.
Refactor kempnerforge/data/vlm_dataset.py to expose pil_to_tensor, build_tokenizer, and resolve_pad_id, updating tests accordingly and adding new unit/integration tests for the adapter.

Reviewed changes

Copilot reviewed 16 out of 18 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
tests/unit/test_vlm_dataset.py	Updates unit tests to use the now-public `pil_to_tensor`.
tests/unit/eval/vlm/test_registry.py	Verifies the `lmms-eval` manifest shape when `lmms-eval` is installed.
tests/unit/eval/vlm/test_import_isolation.py	Ensures `import kempnerforge` (and `kempnerforge.eval.*`) does not import `lmms_eval`.
tests/unit/eval/vlm/test_adapter.py	CPU unit tests for rendering, preprocessing, gen_kwargs resolution, and batched decode behavior.
tests/unit/eval/vlm/init.py	New test package marker for VLM eval tests.
tests/unit/eval/init.py	New eval test package marker.
tests/integration/test_vlm_eval.py	DCP round-trip integration test for `generate_until` + env-gated real-task path.
tests/conftest.py	Adds tiny VLM config/wrapper fixtures for CPU-side adapter tests.
scripts/vlm_eval_harness.py	New CLI to run `lmms-eval` tasks against KempnerForge VLM checkpoints.
pyproject.toml	Registers `kempnerforge_vlm` via `[project.entry-points."lmms_eval.models"]`.
kempnerforge/eval/vlm/registry.py	Adds `ModelManifest` for `lmms-eval` plugin discovery.
kempnerforge/eval/vlm/adapter.py	Implements the `lmms-eval` chat-model adapter, loader, preprocessing, and batched decode.
kempnerforge/eval/vlm/init.py	Documents import-isolation constraints (intentionally no eager imports).
kempnerforge/eval/init.py	Introduces the eval namespace and documents import-isolation constraints.
kempnerforge/data/vlm_dataset.py	Refactors preprocessing/tokenizer helpers into reusable public functions.
docs/how-to/run-vlm-evaluation.md	New how-to doc for installing `lmms-eval` and running the VLM eval harness.
docs/how-to/index.md	Wires the new VLM eval doc into the how-to index/toctree.
CHANGELOG.md	Documents the new VLM evaluation pipeline and related refactor/tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

…close file Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

camilobrownpinilla added 2 commits June 24, 2026 17:06

Add support for VLM evaluation via 'lmms-eval' harness

b9b6ce3

Add batched VLM evaluation

9d4affd

camilobrownpinilla requested review from amazloumi, Copilot and mmshad June 25, 2026 15:00

camilobrownpinilla added the enhancement New feature or request label Jun 25, 2026

Copilot started reviewing on behalf of camilobrownpinilla June 25, 2026 15:00 View session

Copilot AI reviewed Jun 25, 2026

View reviewed changes

Comment thread scripts/vlm_eval_harness.py Outdated

Comment thread scripts/vlm_eval_harness.py

Comment thread kempnerforge/eval/vlm/adapter.py

Fix task limit guarding

ac3ce11

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

amazloumi requested a review from Naeemkh June 25, 2026 15:08

camilobrownpinilla and others added 2 commits June 25, 2026 11:09

Fix batch size argument passing

d4d1820

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Fix potential file mishandling; use context manager to properly open/…

91dc80b

…close file Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Feat/vlm evaluation pipeline#124

Feat/vlm evaluation pipeline#124
camilobrownpinilla wants to merge 5 commits into
mainfrom
feat/vlm-evaluation-pipeline

camilobrownpinilla commented Jun 25, 2026

Uh oh!

codecov Bot commented Jun 25, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

camilobrownpinilla commented Jun 25, 2026

Summary

Testing

Uh oh!

codecov Bot commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented Jun 25, 2026 •

edited

Loading