Feat/vlm evaluation pipeline #124

Draft
camilobrownpinilla wants to merge 5 commits into main from feat/vlm-evaluation-pipeline

@camilobrownpinilla

Collaborator

Summary

Adds a standalone VLM evaluation pipeline that evaluates any KempnerForge VLM checkpoint on the standard multimodal benchmarks lmms-eval implements as generate_until tasks (MMMU, MMBench, ScienceQA, SEED, AI2D, …), by integrating the lmms-eval harness through a custom model adapter. The adapter wraps VLMWrapper and loads directly from a DCP checkpoint. It registers as an lmms-eval chat-model plugin via a pyproject.toml entry point, so lmms-eval stays unmodified and is not added as a dependency. v1 is single-GPU, image-only, and generation-only. All changes are additive and backward compatible; the only edit to existing code is a behavior-preserving refactor.

lmms-eval chat-model adapter (kempnerforge/eval/vlm/adapter.py)

  • KempnerForgeVLM(lmms), a chat model (is_simple = False) wrapping VLMWrapper. Arch-agnostic across the generative arches (joint-decoder / cross-attention / MoT); MoMa fails fast — its non-causal expert-choice routing cannot autoregressively generate, and eval requires generation.
  • Implements generate_until only; loglikelihood and generate_until_multi_round raise NotImplementedError (chat tasks are generation-only — standard multiple-choice benchmarks run as generate_until).
  • Cache-less, single-GPU, batched decode. Each step re-runs VLMWrapper.forward over the growing right-padded batch and reuses kempnerforge.model.generate.sample. There is no transformer KV cache: Transformer.forward forbids kv_caches combined with any image-conditioning route, and there is no image-conditioned KV-cache path. Requests are grouped by gen_kwargs and right-padded to the batch-max length (the layout training uses: image prefix at 0..n-1, text contiguous from n, trailing pads causally masked), so a batched forward gives each row the same real-position logits as decoding it alone (batch-equivalence is pinned by a test). EOS / until / max_new_tokens are tracked per row. Single-GPU is the validated invocation, not baked in: rank/world_size come from the lmms base, and model construction sits behind a _build_model seam.
  • Guards with clear NotImplementedError / ValueError: MoMa, video/audio, multi-image, multi-turn/few-shot.
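
For reviewers, the cache-less batched decode described above can be pictured with the following minimal sketch. It is not the adapter's code: toy_next_token and forward are hypothetical stand-ins for a causal model whose prediction at the last real position depends only on that row's real tokens (which is what causally masked trailing pads give), the PAD/EOS ids are made up, and until stop-strings are omitted.

```python
from typing import List, Sequence

PAD = 0  # hypothetical pad id
EOS = 1  # hypothetical end-of-sequence id


def toy_next_token(prefix: Sequence[int]) -> int:
    """Stand-in for a causal model: the next token depends only on the real prefix."""
    if len(prefix) >= 6:
        return EOS
    return (sum(prefix) % 7) + 2  # never collides with PAD or EOS


def forward(padded: List[List[int]], lengths: List[int]) -> List[int]:
    """Full forward over the right-padded batch, read at each row's last real position.

    Slicing off the trailing pads models the causal mask: pads sit to the
    right of every real token, so they cannot influence real-position outputs.
    """
    return [toy_next_token(row[:n]) for row, n in zip(padded, lengths)]


def generate_batch(prompts: List[List[int]], max_new_tokens: int) -> List[List[int]]:
    """Greedy decode without a KV cache: re-run the whole batch every step."""
    rows = [list(p) for p in prompts]
    outputs: List[List[int]] = [[] for _ in rows]
    done = [max_new_tokens <= 0 for _ in rows]
    while not all(done):
        lengths = [len(r) for r in rows]
        width = max(lengths)
        # Right-pad every row to the current batch-max length.
        padded = [r + [PAD] * (width - len(r)) for r in rows]
        next_tokens = forward(padded, lengths)
        for i, tok in enumerate(next_tokens):
            if done[i]:
                continue  # finished rows stay frozen
            if tok == EOS:
                done[i] = True
                continue
            rows[i].append(tok)
            outputs[i].append(tok)
            if len(outputs[i]) >= max_new_tokens:
                done[i] = True  # per-row token budget
    return outputs
```

Decoding a row inside a batch and decoding it alone give the same tokens here because nothing a row sees depends on the other rows or on its own padding, which is the property the batch-equivalence test is said to pin.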

Checkpoint loading + preprocessing reuse

  • Single-process dcp.load of the model shards only (DCP reshards, so FSDP/PP checkpoints load into the full model); resolve_resume_path with a specific-step_N fallback; reads plain-JSON metadata.json for step/tokens_seen.
  • Reuses the exact training-time preprocessing: a behavior-preserving refactor of kempnerforge/data/vlm_dataset.py exposes pil_to_tensor, build_tokenizer, resolve_pad_id, and DEFAULT_IMAGE_MEAN/STD (the image/text paths stay bit-exact). Prompt rendering flattens the ChatMessages text blocks (no chat template, no <image> placeholder — images are conditioned via pixel_values); model-specific chat-template support is a documented follow-up.
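
A rough sketch of the prompt rendering and input guards described above, under stated assumptions: messages are plain dicts with a role and a list of typed content blocks (the real ChatMessages type from lmms-eval may be shaped differently), text blocks are joined with a newline, and the split between NotImplementedError and ValueError is a guess. render_prompt is a hypothetical name, not the adapter's function.

```python
from typing import Any, Dict, List, Tuple


def render_prompt(messages: List[Dict[str, Any]]) -> Tuple[str, List[Any]]:
    """Flatten text blocks into one prompt string and collect images separately.

    No chat template is applied and no <image> placeholder is inserted; the
    images are returned on the side so they can be fed as pixel values.
    """
    roles = [m["role"] for m in messages]
    if roles.count("user") != 1 or "assistant" in roles:
        # Assumed guard: exactly one user turn, no prior assistant turns.
        raise NotImplementedError("multi-turn / few-shot prompts are not supported")

    texts: List[str] = []
    images: List[Any] = []
    for message in messages:
        for block in message["content"]:
            kind = block.get("type")
            if kind == "text":
                texts.append(block["text"])
            elif kind == "image":
                images.append(block["image"])  # hypothetical key name
            elif kind in ("video", "audio"):
                raise NotImplementedError(f"{kind} input is not supported")
            else:
                raise ValueError(f"unknown content block type: {kind!r}")

    if len(images) > 1:
        raise NotImplementedError("multi-image requests are not supported")
    return "\n".join(texts), images
```

Keeping the image list separate from the text mirrors the stated design: the prompt carries only text, and image conditioning happens through the model's visual input rather than through a token in the string.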

Registration (lmms-eval stays unmodified)

  • kempnerforge/eval/vlm/registry.py: MANIFEST = ModelManifest(model_id="kempnerforge_vlm", chat_class_path="kempnerforge.eval.vlm.adapter.KempnerForgeVLM").
  • pyproject.toml: a [project.entry-points."lmms_eval.models"] entry for auto-discovery. This is metadata only; lmms-eval is not a dependency (install it separately with uv pip install lmms-eval, mirroring how lm-eval is handled). The eval subsystem is import-isolated: kempnerforge/eval/ is not imported on the default import kempnerforge path, so the main package keeps working with lmms-eval absent (pinned by test_import_isolation.py). A file-level # pyright: reportMissingImports=false keeps pyright kempnerforge/ green in CI, where the undeclared lmms-eval is absent.
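
One way an import-isolation check like the one mentioned above could be written is sketched below. This is an illustration, not the contents of test_import_isolation.py: loads_module is a hypothetical helper that imports a package in a fresh interpreter and reports whether a given module ended up in sys.modules.

```python
import subprocess
import sys


def loads_module(package: str, forbidden: str) -> bool:
    """Return True if importing `package` in a fresh interpreter also loads `forbidden`.

    A subprocess is used so modules already imported by the current process
    (for example by the test runner) cannot mask or fake the result.
    """
    code = (
        "import importlib, sys\n"
        f"importlib.import_module({package!r})\n"
        f"print({forbidden!r} in sys.modules)\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,  # surface import errors instead of silently returning False
    )
    return result.stdout.strip().splitlines()[-1] == "True"


# Hypothetical use for this PR (requires kempnerforge to be installed):
#     assert not loads_module("kempnerforge", "lmms_eval")
```

Running the import in a child process matters because an in-process check would be affected by whatever earlier tests have already imported.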

Harness + docs

  • scripts/vlm_eval_harness.py: CLI mirroring scripts/eval_harness.py (no conversion). --config / --checkpoint / --tasks required (default suite TBD), plus --limit / --output / --device / --dtype / --batch-size / --max-new-tokens; lazy lmms_eval.evaluator.simple_evaluate import with a helpful error if it is not installed.
  • docs/how-to/run-vlm-evaluation.md, wired into the how-to toctree; CHANGELOG.
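
The CLI surface and lazy import described above might look roughly like this sketch. The flag names come from the description; the defaults, help strings, and the lazy_import helper are assumptions for illustration and are not taken from scripts/vlm_eval_harness.py.

```python
import argparse
import importlib
from types import ModuleType


def build_parser() -> argparse.ArgumentParser:
    """Argument parser with the flags listed in the PR; defaults are assumed."""
    parser = argparse.ArgumentParser(description="Run lmms-eval tasks on a VLM checkpoint")
    parser.add_argument("--config", required=True, help="model config path")
    parser.add_argument("--checkpoint", required=True, help="DCP checkpoint directory")
    parser.add_argument("--tasks", required=True, help="comma-separated task names")
    parser.add_argument("--limit", type=int, default=None, help="max examples per task")
    parser.add_argument("--output", default=None, help="where to write results")
    parser.add_argument("--device", default="cuda")
    parser.add_argument("--dtype", default="bfloat16")
    parser.add_argument("--batch-size", type=int, default=1)
    parser.add_argument("--max-new-tokens", type=int, default=128)
    return parser


def lazy_import(module_name: str, install_hint: str) -> ModuleType:
    """Import an optional dependency at call time, failing with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise SystemExit(
            f"{module_name} is required for this command; install it with: {install_hint}"
        ) from exc


def main(argv=None) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    # Deferred until after argument parsing, so --help works without lmms-eval.
    lazy_import("lmms_eval.evaluator", "uv pip install lmms-eval")
    return args
```

Deferring the import until after argument parsing keeps --help and argument errors usable on machines where lmms-eval is not installed.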

Design faithfulness

  • New, parallel subsystem under kempnerforge/eval/vlm/; no changes to training, the existing loss/perplexity eval path, or model code. Visual input is modeled as an ordered list of frames (a single image is the F == 1 case) so video is a localized future addition.

Deferred to follow-up PRs:
video, multi-turn, few-shot, and multi-image tasks (need the video model / model-side changes); loglikelihood / multiple_choice scoring and multi-round (chat tasks are generation-only); MoMa eval (needs generation support); data-parallel and sharded multi-GPU inference; single image-encode per request (a model-side seam); model-specific chat templates; a representative default benchmark suite; and whether to formalize the lmms-eval dependency.

Testing

  • uv run ruff check kempnerforge/ tests/ passes
  • uv run ruff format --check kempnerforge/ tests/ scripts/ passes
  • uv run pyright kempnerforge/ passes (0 errors)
  • uv run pytest tests/unit/ -v --timeout=60 passes
  • (N/A) [ ] If distributed code changed: uv run torchrun --nproc_per_node=4 -m pytest tests/distributed/ -v
  • (N/A) [ ] If training loop / parallelism / optimizers changed: uv run pytest tests/e2e/ --e2e -v

Closes #122

@codecov

codecov Bot commented Jun 25, 2026

Codecov Report

❌ Patch coverage is 5.63380% with 201 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| kempnerforge/eval/vlm/adapter.py | 0.00% | 199 Missing ⚠️ |
| kempnerforge/eval/vlm/registry.py | 0.00% | 2 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| kempnerforge/data/vlm_dataset.py | 99.01% <100.00%> (+0.05%) ⬆️ |
| kempnerforge/eval/vlm/registry.py | 0.00% <0.00%> (ø) |
| kempnerforge/eval/vlm/adapter.py | 0.00% <0.00%> (ø) |

Copilot AI left a comment

Pull request overview

Adds a standalone VLM evaluation subsystem that integrates the optional lmms-eval harness via a KempnerForge chat-model adapter, enabling benchmarking of KempnerForge VLM DCP checkpoints without modifying or depending on lmms-eval at install time. It also refactors VLM dataset preprocessing helpers so the adapter can reuse the training-time image/text preprocessing paths exactly.

Changes:

  • Introduce kempnerforge/eval/vlm/ with an lmms-eval chat-model adapter (KempnerForgeVLM) + registration manifest, and wire it via a pyproject.toml entry point.
  • Add a CLI harness (scripts/vlm_eval_harness.py) plus docs and changelog entries for running VLM evaluation.
  • Refactor kempnerforge/data/vlm_dataset.py to expose pil_to_tensor, build_tokenizer, and resolve_pad_id, updating tests accordingly and adding new unit/integration tests for the adapter.

Reviewed changes

Copilot reviewed 16 out of 18 changed files in this pull request and generated 3 comments.

Show a summary per file
| File | Description |
|---|---|
| tests/unit/test_vlm_dataset.py | Updates unit tests to use the now-public pil_to_tensor. |
| tests/unit/eval/vlm/test_registry.py | Verifies the lmms-eval manifest shape when lmms-eval is installed. |
| tests/unit/eval/vlm/test_import_isolation.py | Ensures import kempnerforge (and kempnerforge.eval.*) does not import lmms_eval. |
| tests/unit/eval/vlm/test_adapter.py | CPU unit tests for rendering, preprocessing, gen_kwargs resolution, and batched decode behavior. |
| tests/unit/eval/vlm/__init__.py | New test package marker for VLM eval tests. |
| tests/unit/eval/__init__.py | New eval test package marker. |
| tests/integration/test_vlm_eval.py | DCP round-trip integration test for generate_until + env-gated real-task path. |
| tests/conftest.py | Adds tiny VLM config/wrapper fixtures for CPU-side adapter tests. |
| scripts/vlm_eval_harness.py | New CLI to run lmms-eval tasks against KempnerForge VLM checkpoints. |
| pyproject.toml | Registers kempnerforge_vlm via [project.entry-points."lmms_eval.models"]. |
| kempnerforge/eval/vlm/registry.py | Adds ModelManifest for lmms-eval plugin discovery. |
| kempnerforge/eval/vlm/adapter.py | Implements the lmms-eval chat-model adapter, loader, preprocessing, and batched decode. |
| kempnerforge/eval/vlm/__init__.py | Documents import-isolation constraints (intentionally no eager imports). |
| kempnerforge/eval/__init__.py | Introduces the eval namespace and documents import-isolation constraints. |
| kempnerforge/data/vlm_dataset.py | Refactors preprocessing/tokenizer helpers into reusable public functions. |
| docs/how-to/run-vlm-evaluation.md | New how-to doc for installing lmms-eval and running the VLM eval harness. |
| docs/how-to/index.md | Wires the new VLM eval doc into the how-to index/toctree. |
| CHANGELOG.md | Documents the new VLM evaluation pipeline and related refactor/tests. |

Comment thread scripts/vlm_eval_harness.py Outdated
Comment thread scripts/vlm_eval_harness.py
Comment thread kempnerforge/eval/vlm/adapter.py
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@amazloumi amazloumi requested a review from Naeemkh June 25, 2026 15:08
camilobrownpinilla and others added 2 commits June 25, 2026 11:09
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…close file

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add VLM evaluation pipeline

2 participants