Feat/vlm evaluation pipeline #124
Draft
camilobrownpinilla wants to merge 5 commits into
Pull request overview
Adds a standalone VLM evaluation subsystem that integrates the optional lmms-eval harness via a KempnerForge chat-model adapter, enabling benchmarking of KempnerForge VLM DCP checkpoints without modifying or depending on lmms-eval at install time. It also refactors VLM dataset preprocessing helpers so the adapter can reuse the training-time image/text preprocessing paths exactly.
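Because lmms-eval is optional, any code that needs it has to import it on demand and fail with a readable message when it is missing. The following is a minimal sketch of that pattern, not the PR's code: `load_optional` is a hypothetical helper name, and the real harness may word or structure its error differently.

```python
import importlib


def load_optional(module_name: str, install_hint: str):
    """Import module_name on demand; raise a readable error if it is missing.

    Hypothetical helper for illustration only.
    """
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        # Translate the error only when the missing module is the one we asked
        # for (or its top-level package). A missing transitive dependency is
        # re-raised unchanged so the real cause stays visible.
        top_level = module_name.split(".")[0]
        if exc.name not in (module_name, top_level):
            raise
        raise RuntimeError(
            f"{module_name!r} is not installed. Install it with: {install_hint}"
        ) from exc
```

Under these assumptions, a harness could call `load_optional("lmms_eval.evaluator", "uv pip install lmms-eval")` at the point where evaluation starts, so that importing the main package never touches lmms-eval.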
Changes:
- Introduce `kempnerforge/eval/vlm/` with an `lmms-eval` chat-model adapter (`KempnerForgeVLM`) + registration manifest, and wire it via a `pyproject.toml` entry point.
- Add a CLI harness (`scripts/vlm_eval_harness.py`) plus docs and changelog entries for running VLM evaluation.
- Refactor `kempnerforge/data/vlm_dataset.py` to expose `pil_to_tensor`, `build_tokenizer`, and `resolve_pad_id`, updating tests accordingly and adding new unit/integration tests for the adapter.
Reviewed changes
Copilot reviewed 16 out of 18 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/unit/test_vlm_dataset.py | Updates unit tests to use the now-public pil_to_tensor. |
| tests/unit/eval/vlm/test_registry.py | Verifies the lmms-eval manifest shape when lmms-eval is installed. |
| tests/unit/eval/vlm/test_import_isolation.py | Ensures import kempnerforge (and kempnerforge.eval.*) does not import lmms_eval. |
| tests/unit/eval/vlm/test_adapter.py | CPU unit tests for rendering, preprocessing, gen_kwargs resolution, and batched decode behavior. |
| `tests/unit/eval/vlm/__init__.py` | New test package marker for VLM eval tests. |
| `tests/unit/eval/__init__.py` | New eval test package marker. |
| tests/integration/test_vlm_eval.py | DCP round-trip integration test for generate_until + env-gated real-task path. |
| tests/conftest.py | Adds tiny VLM config/wrapper fixtures for CPU-side adapter tests. |
| scripts/vlm_eval_harness.py | New CLI to run lmms-eval tasks against KempnerForge VLM checkpoints. |
| pyproject.toml | Registers kempnerforge_vlm via [project.entry-points."lmms_eval.models"]. |
| kempnerforge/eval/vlm/registry.py | Adds ModelManifest for lmms-eval plugin discovery. |
| kempnerforge/eval/vlm/adapter.py | Implements the lmms-eval chat-model adapter, loader, preprocessing, and batched decode. |
| `kempnerforge/eval/vlm/__init__.py` | Documents import-isolation constraints (intentionally no eager imports). |
| `kempnerforge/eval/__init__.py` | Introduces the eval namespace and documents import-isolation constraints. |
| kempnerforge/data/vlm_dataset.py | Refactors preprocessing/tokenizer helpers into reusable public functions. |
| docs/how-to/run-vlm-evaluation.md | New how-to doc for installing lmms-eval and running the VLM eval harness. |
| docs/how-to/index.md | Wires the new VLM eval doc into the how-to index/toctree. |
| CHANGELOG.md | Documents the new VLM evaluation pipeline and related refactor/tests. |
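The import-isolation property listed in the table can be checked by importing a package in a fresh interpreter and inspecting `sys.modules`. The sketch below shows one way to do that with a hypothetical helper, `import_pulls_in`; it is not the contents of `test_import_isolation.py`, which may take a different approach.

```python
import subprocess
import sys

# Exit code the child uses to signal "forbidden module was loaded".
# An uncaught exception in the child exits with 1, so 3 avoids a collision.
_LOADED = 3


def import_pulls_in(package: str, forbidden: str) -> bool:
    """Return True if importing `package` in a fresh interpreter loads `forbidden`.

    Hypothetical helper for illustration only.
    """
    code = (
        "import importlib, sys\n"
        f"importlib.import_module({package!r})\n"
        f"sys.exit({_LOADED} if {forbidden!r} in sys.modules else 0)\n"
    )
    result = subprocess.run([sys.executable, "-c", code], capture_output=True)
    if result.returncode not in (0, _LOADED):
        # The import itself failed; surface the child's traceback.
        raise RuntimeError(result.stderr.decode(errors="replace"))
    return result.returncode == _LOADED
```

A test in this style would assert that `import_pulls_in("kempnerforge", "lmms_eval")` is false. Running the check in a subprocess matters because modules imported earlier in the same test session would otherwise already be in `sys.modules`.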
Summary
Adds a standalone VLM evaluation pipeline that evaluates any KempnerForge VLM checkpoint on the standard multimodal benchmarks that `lmms-eval` implements as `generate_until` tasks (MMMU, MMBench, ScienceQA, SEED, AI2D, …) by integrating the lmms-eval harness through a custom model adapter. The adapter wraps `VLMWrapper` and loads directly from a DCP checkpoint. It registers as an lmms-eval chat-model plugin via a `pyproject.toml` entry point, so lmms-eval stays unmodified and is not added as a dependency. v1 is single-GPU, image-only, and generation-only. All changes are additive and backward compatible; the only edit to existing code is a behavior-preserving refactor.

lmms-eval chat-model adapter (`kempnerforge/eval/vlm/adapter.py`)
- `KempnerForgeVLM(lmms)`, a chat model (`is_simple = False`) wrapping `VLMWrapper`. Arch-agnostic across the generative arches (joint-decoder / cross-attention / MoT). MoMa fails fast: its non-causal expert-choice routing cannot generate autoregressively, and eval requires generation.
- `generate_until` only; `loglikelihood` and `generate_until_multi_round` raise `NotImplementedError` (chat tasks are generation-only; standard multiple-choice benchmarks run as `generate_until`).
- Decoding runs `VLMWrapper.forward` over the growing right-padded batch each step, reusing `kempnerforge.model.generate.sample`. There is no transformer KV cache: `Transformer.forward` forbids `kv_caches` combined with any image-conditioning route, and there is no image-conditioned KV-cache path.
- Requests are grouped by `gen_kwargs` and right-padded to the batch-max length. This is the layout training uses: image prefix at `0..n-1`, text contiguous from `n`, trailing pads causally masked. A batched forward therefore gives each row the same real-position logits as decoding it alone (batch-equivalence is pinned by a test). EOS / `until` / `max_new_tokens` are tracked per row.
- Single-GPU is the validated invocation, not baked in: rank/world_size come from the lmms base, and model construction sits behind a `_build_model` seam.
- Unsupported cases raise `NotImplementedError` / `ValueError`: MoMa, video/audio, multi-image, multi-turn/few-shot.

Checkpoint loading + preprocessing reuse
- `dcp.load` of the model shards only (DCP reshards, so FSDP/PP checkpoints load into the full model); `resolve_resume_path` with a specific-`step_N` fallback; reads plain-JSON `metadata.json` for step/tokens_seen.
- `kempnerforge/data/vlm_dataset.py` exposes `pil_to_tensor`, `build_tokenizer`, `resolve_pad_id`, and `DEFAULT_IMAGE_MEAN/STD` (the image/text paths stay bit-exact).
- Prompt rendering flattens the `ChatMessages` text blocks (no chat template, no `<image>` placeholder; images are conditioned via `pixel_values`). Model-specific chat-template support is a documented follow-up.

Registration (lmms-eval stays unmodified)
- `kempnerforge/eval/vlm/registry.py`: `MANIFEST = ModelManifest(model_id="kempnerforge_vlm", chat_class_path="kempnerforge.eval.vlm.adapter.KempnerForgeVLM")`.
- `pyproject.toml`: a `[project.entry-points."lmms_eval.models"]` entry for auto-discovery. It is metadata only; lmms-eval is not a dependency (install it separately with `uv pip install lmms-eval`, mirroring how lm-eval is handled).
- The eval subsystem is import-isolated: `kempnerforge/eval/` is not imported on the default `import kempnerforge` path, so the main package keeps working with lmms-eval absent (pinned by `test_import_isolation.py`). A file-level `# pyright: reportMissingImports=false` keeps `pyright kempnerforge/` green in CI, where the undeclared lmms-eval is absent.

Harness + docs
- `scripts/vlm_eval_harness.py`: CLI mirroring `scripts/eval_harness.py` (no conversion). `--config` / `--checkpoint` / `--tasks` are required (default suite TBD), plus `--limit` / `--output` / `--device` / `--dtype` / `--batch-size` / `--max-new-tokens`. `lmms_eval.evaluator.simple_evaluate` is imported lazily, with a helpful error if it is not installed.
- `docs/how-to/run-vlm-evaluation.md`, wired into the how-to `toctree`; CHANGELOG.

Design faithfulness
- The eval code lives under `kempnerforge/eval/vlm/`; no changes to training, the existing loss/perplexity eval path, or model code.
- Visual input is modeled as an ordered list of frames (a single image is the `F == 1` case), so video is a localized future addition.

Deferred to follow-up PRs:
- video, multi-turn, few-shot, and multi-image tasks (need the video model / model-side changes);
- `loglikelihood` / `multiple_choice` scoring and multi-round (chat tasks are generation-only);
- MoMa eval (needs generation support);
- data-parallel and sharded multi-GPU inference;
- single image-encode per request (a model-side seam);
- model-specific chat templates;
- a representative default benchmark suite;
- whether to formalize the lmms-eval dependency.

Testing
- `uv run ruff check kempnerforge/ tests/` passes
- `uv run ruff format --check kempnerforge/ tests/ scripts/` passes
- `uv run pyright kempnerforge/` passes (0 errors)
- `uv run pytest tests/unit/ -v --timeout=60` passes
- `uv run torchrun --nproc_per_node=4 -m pytest tests/distributed/ -v`
- `uv run pytest tests/e2e/ --e2e -v`

Closes #122