Thanks for your interest in contributing to lmms-eval! This guide covers everything you need to get started.
- Python >= 3.9
- uv (package manager)
- Git
```bash
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval.git
cd lmms-eval
uv sync
```

Verify your setup with a quick run:

```bash
python -m lmms_eval \
    --model qwen2_5_vl \
    --model_args pretrained=Qwen/Qwen2.5-VL-3B-Instruct \
    --tasks mme \
    --batch_size 1 \
    --limit 8
```

- Fork the repository
- Create a branch from `main` (`git checkout -b my-feature`)
- Make changes and test locally
- Run linting before committing:

  ```bash
  pip install pre-commit
  pre-commit install
  pre-commit run --all-files
  ```

- Commit with a descriptive message (see Commit Style)
- Open a pull request against `main`
To make onboarding predictable, we use a lightweight funnel:
Discover -> First Run -> First Issue -> First PR -> Repeat PRs -> Maintainer Track
Issue labels to use and follow:
- `good first issue` - Small task, scoped for first-time contributors
- `help wanted` - Community contribution requested with maintainer support
- `priority:high` - Important work for upcoming release
- `needs reproduction` - Missing minimal repro before triage
- `needs decision` - Maintainer decision required
- `blocked` - Waiting on dependency or external change
Target response times:
- Triage first response: within 48 hours
- First PR review: within 72 hours
For design-heavy or follow-up issues, use the same structure as PRs for consistency:
- Summary - max 3 bullets (problem, impact, desired outcome)
- In scope - explicit list of intended changes
- Out of scope - explicit list of excluded work
- Proposed plan - 3-6 concrete implementation steps
- Validation plan - commands/benchmarks and expected pass criteria
- Risk / Compatibility - behavior changes, migration risk, blockers
- Acceptance criteria - objective done conditions (checkbox style)
GitHub supports this via `.github/ISSUE_TEMPLATE/design_proposal.yml`.
For Linear tickets, use the same headings and keep each section concise.
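A possible shape for that issue form, using GitHub's issue-forms YAML schema (a sketch only; the section labels come from the list above, all other wording is illustrative):

```yaml
name: Design proposal
description: Structured proposal for design-heavy or follow-up work
labels: ["needs decision"]
body:
  - type: textarea
    attributes:
      label: Summary
      description: Max 3 bullets - problem, impact, desired outcome
    validations:
      required: true
  - type: textarea
    attributes:
      label: In scope / Out of scope
  - type: textarea
    attributes:
      label: Proposed plan
  - type: textarea
    attributes:
      label: Validation plan
  - type: textarea
    attributes:
      label: Risk / Compatibility
  - type: textarea
    attributes:
      label: Acceptance criteria
    validations:
      required: true
```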
Use conventional commit prefixes:
- `feat:` - New feature or benchmark
- `fix:` - Bug fix
- `refactor:` - Code restructuring without behavior change
- `docs:` - Documentation only
- `test:` - Adding or updating tests
- `chore:` - Maintenance (deps, CI, configs)
Examples:
```
feat: add SWE-bench Verified benchmark
fix: handle empty response in MMMU parsing
refactor: extract shared flatten() to model_utils
```
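If you want to check these prefixes locally, a minimal commit-message validator might look like this (a hypothetical helper, not part of the repo's tooling):

```python
import re

# Prefixes from the list above; an optional scope like "feat(tasks):" is allowed.
CONVENTIONAL_RE = re.compile(r"^(feat|fix|refactor|docs|test|chore)(\([\w./-]+\))?: .+")


def is_conventional(message: str) -> bool:
    """Return True if the first line follows the conventional-commit style."""
    first_line = message.splitlines()[0] if message else ""
    return bool(CONVENTIONAL_RE.match(first_line))
```

Such a check could be wired into a local `commit-msg` hook if you find the prefixes easy to forget.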
- Formatter: Black (line length 240) + isort (handled by pre-commit)
- Naming: PEP 8 - `snake_case` for functions/variables, `PascalCase` for classes
- Type hints: Required for all new code
- Docstrings: Required for public APIs
- Line length: 88 characters recommended, 240 max (enforced by Black)
- Imports: Use `from lmms_eval.imports import optional_import` for model-specific packages
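Put together, a new helper following these conventions might look like the snippet below (names are illustrative, not from the codebase):

```python
def exact_match_rate(predictions: list[str], targets: list[str]) -> float:
    """Return the fraction of predictions that exactly match their targets.

    Public API, so it carries a docstring; the name is snake_case and the
    signature is fully type-hinted.
    """
    if not predictions:
        return 0.0
    matches = sum(p == t for p, t in zip(predictions, targets))
    return matches / len(predictions)


class ScoreAggregator:
    """Classes use PascalCase; collects per-sample scores and averages them."""

    def __init__(self) -> None:
        self.scores: list[float] = []

    def add(self, score: float) -> None:
        self.scores.append(score)

    def mean(self) -> float:
        return sum(self.scores) / len(self.scores) if self.scores else 0.0
```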
This is the most common contribution. Each benchmark lives in its own directory under `lmms_eval/tasks/`.

- Create the directory:

  ```
  lmms_eval/tasks/my_benchmark/
      my_benchmark.yaml        # Task config
      utils.py                 # Processing functions
      _default_template_yaml   # Shared defaults (optional)
  ```

- Write the YAML config:

  ```yaml
  task: my_benchmark
  dataset_path: hf-org/my-dataset  # HuggingFace dataset
  test_split: test
  output_type: generate_until
  doc_to_visual: !function utils.doc_to_visual
  doc_to_text: !function utils.doc_to_text
  doc_to_messages: !function utils.doc_to_messages
  doc_to_target: "answer"
  process_results: !function utils.process_results
  metric_list:
    - metric: accuracy
      aggregation: mean
      higher_is_better: true
  generation_kwargs:
    max_new_tokens: 128
  ```

- Implement the functions in `utils.py`:

  - `doc_to_visual(doc)` - Extract images/video/audio from a dataset sample
  - `doc_to_text(doc, lmms_eval_specific_kwargs)` - Format the text prompt
  - `doc_to_messages(doc, lmms_eval_specific_kwargs)` - Build structured chat messages (recommended)
  - `process_results(doc, results)` - Parse model output and compute metrics

- Test your benchmark:

  ```bash
  python -m lmms_eval --model qwen2_5_vl \
      --model_args pretrained=Qwen/Qwen2.5-VL-3B-Instruct \
      --tasks my_benchmark --limit 8
  ```
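The `utils.py` functions above might start out like this sketch, assuming a dataset whose samples expose `image`, `question`, and `answer` fields (the field names are illustrative, so adapt them to your dataset):

```python
def doc_to_visual(doc):
    # Return the sample's visual inputs as a list (here: a single image).
    return [doc["image"]]


def doc_to_text(doc, lmms_eval_specific_kwargs=None):
    # Wrap the question with optional pre/post prompts from the task config.
    kwargs = lmms_eval_specific_kwargs or {}
    pre = kwargs.get("pre_prompt", "")
    post = kwargs.get("post_prompt", "")
    return f"{pre}{doc['question']}{post}"


def process_results(doc, results):
    # results[0] is the model's generated text; score it against the target.
    prediction = results[0].strip().lower()
    target = doc["answer"].strip().lower()
    return {"accuracy": 1.0 if prediction == target else 0.0}
```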
See `docs/guides/task_guide.md` for a detailed walkthrough.
Models live under `lmms_eval/models/chat/` (recommended) or `lmms_eval/models/simple/` (legacy).

- Create `models/chat/my_model.py`
- Inherit from `lmms`, set `is_simple = False`
- Implement required methods: `generate_until`, `loglikelihood`, `generate_until_multi_round`
- Register in `models/__init__.py`: `AVAILABLE_CHAT_TEMPLATE_MODELS = {"my_model": "MyModel", ...}`
- Use `optional_import` for model-specific dependencies:

  ```python
  from lmms_eval.imports import optional_import

  MyLib, _has_mylib = optional_import("mylib", "MyLib")
  ```
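The `optional_import` helper roughly follows a common lazy-import pattern; a self-contained sketch of the idea (not the actual lmms_eval implementation) looks like:

```python
import importlib


def optional_import(module_name, attr=None):
    """Import a module (or one attribute from it) if available.

    Returns (object, True) on success and (None, False) when the
    dependency is missing, so model files stay importable even when a
    heavyweight backend is not installed.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None, False
    obj = getattr(module, attr) if attr is not None else module
    return obj, True
```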
See `docs/guides/model_guide.md` for details.
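Putting the steps above together, a model file might be skeletonized as follows. The `lmms` base class here is a stand-in so the sketch is self-contained; the real one lives in lmms_eval, and the method contracts are simplified:

```python
from abc import ABC, abstractmethod


class lmms(ABC):
    """Stand-in for the real lmms_eval base class (illustrative only)."""

    is_simple = True

    @abstractmethod
    def generate_until(self, requests):
        ...


class MyModel(lmms):
    is_simple = False  # chat-style model, not a legacy "simple" model

    def __init__(self, pretrained="my-org/my-model", **kwargs):
        self.pretrained = pretrained  # a real model would load weights here

    def generate_until(self, requests):
        # Return one decoded string per request; a real model runs inference.
        return ["" for _ in requests]

    def loglikelihood(self, requests):
        # Return a (log_prob, is_greedy) pair per request.
        return [(0.0, True) for _ in requests]

    def generate_until_multi_round(self, requests):
        # Multi-round generation can often delegate to generate_until.
        return self.generate_until(requests)
```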
- Check existing issues for duplicates
- If the bug is new, open an issue first using the Bug Report template
- Reference the issue in your PR
Documentation improvements are always welcome. Key docs:
- `README.md` - Project overview (available in 16 languages under `docs/`)
- `docs/guides/task_guide.md` - How to add benchmarks
- `docs/guides/model_guide.md` - How to add models
- `docs/releases/` - Release notes and changelog
Always use uv, never pip.
```bash
uv sync            # Install from lockfile
uv add package     # Add a dependency
uv remove package  # Remove a dependency
uv run tool        # Run a tool in the environment
```

- Discord: discord.gg/zdkwKUqrPy
- Issues: GitHub Issues
- Quick-start: Evaluate Your Model in 5 Minutes