|
| 1 | +# Agent Guidelines for lmms-eval |
| 2 | + |
| 3 | +This file provides context for AI coding agents (Codex, Devin, SWE-Agent, etc.) working on this codebase. |
| 4 | + |
| 5 | +For detailed development guidelines, see [CLAUDE.md](CLAUDE.md). |
| 6 | + |
| 7 | +## Quick Reference |
| 8 | + |
| 9 | +**Setup**: `uv sync && pre-commit install` |
| 10 | +**Run eval**: `python -m lmms_eval --model qwen2_5_vl --tasks mme --limit 5 --batch_size 1` |
| 11 | +**Lint**: `pre-commit run --all-files` |
| 12 | +**Test registry**: `uv run python -m unittest discover -s test/eval -p "test_model_registry_v2.py"` |
| 13 | + |
| 14 | +### Useful CLI Flags |
| 15 | + |
| 16 | +| Flag | Description | |
| 17 | +|------|-------------| |
| 18 | +| `--model` | Model backend name (e.g., `qwen2_5_vl`, `openai`, `vllm`) | |
| 19 | +| `--model_args` | Comma-separated key=value pairs (e.g., `pretrained=org/model,device_map=auto`) | |
| 20 | +| `--tasks` | Comma-separated task names | |
| 21 | +| `--limit N` | Only evaluate first N samples (use for quick testing) | |
| 22 | +| `--batch_size N` | Batch size for inference | |
| 23 | +| `--num_fewshot N` | Number of fewshot examples | |
| 24 | +| `--device cuda:0` | Device for local models | |
| 25 | +| `--output_path dir/` | Directory for result output | |
| 26 | +| `--log_samples` | Save per-sample predictions to output | |
| 27 | +| `--verbosity DEBUG` | Set log level (DEBUG, INFO, WARNING, ERROR) | |
| 28 | + |
| 29 | +### Environment Variables |
| 30 | + |
| 31 | +```bash |
| 32 | +export OPENAI_API_KEY="..." # Required for OpenAI/API-backed models |
| 33 | +export HF_TOKEN="..." # Required for gated HuggingFace datasets |
| 34 | +export HF_HOME="/path/to/cache" # HuggingFace cache directory |
| 35 | +export HF_HUB_ENABLE_HF_TRANSFER="1" # Faster downloads |
| 36 | +``` |
| 37 | + |
| 38 | +## Project Overview |
| 39 | + |
| 40 | +lmms-eval evaluates Large Multimodal Models (LMMs) across image, video, and audio tasks. It supports 100+ benchmarks and 30+ model backends. |
| 41 | + |
| 42 | +### Key Directories |
| 43 | + |
| 44 | +| Path | Purpose | |
| 45 | +|------|---------| |
| 46 | +| `lmms_eval/models/chat/` | Chat model wrappers (recommended for new models) | |
| 47 | +| `lmms_eval/models/simple/` | Legacy model wrappers | |
| 48 | +| `lmms_eval/models/__init__.py` | Model registry: `AVAILABLE_SIMPLE_MODELS`, `AVAILABLE_CHAT_TEMPLATE_MODELS`, `MODEL_ALIASES` | |
| 49 | +| `lmms_eval/models/registry_v2.py` | `ModelManifest`, `ModelRegistryV2` - aliasing and resolution | |
| 50 | +| `lmms_eval/tasks/<task_name>/` | Task configs (YAML) + helper functions (utils.py) | |
| 51 | +| `lmms_eval/protocol.py` | `ChatMessages` - structured multimodal message protocol | |
| 52 | +| `lmms_eval/api/model.py` | Base class `lmms` - all models subclass this | |
| 53 | +| `lmms_eval/api/instance.py` | `Instance` - request object passed to models | |
| 54 | +| `lmms_eval/entrypoints/` | HTTP eval server (EvalClient, ServerArgs) | |
| 55 | +| `lmms_eval/llm_judge/` | LLM-as-judge scoring providers | |
| 56 | + |
| 57 | +### Model Registry V2 |
| 58 | + |
| 59 | +Models are registered via two dicts in `__init__.py` that map `model_id` -> `ClassName`: |
| 60 | + |
| 61 | +- `AVAILABLE_CHAT_TEMPLATE_MODELS` - chat models in `models/chat/` |
| 62 | +- `AVAILABLE_SIMPLE_MODELS` - simple models in `models/simple/` |
| 63 | + |
| 64 | +If the same `model_id` exists in both, the registry creates one `ModelManifest` with both paths. Resolution prefers chat over simple. |
| 65 | + |
| 66 | +`MODEL_ALIASES` provides backward-compatible name mappings: `{"new_name": ("old_name_1", "old_name_2")}`. |
| 67 | + |
| 68 | +### Pipeline |
| 69 | + |
| 70 | +``` |
| 71 | +Dataset --> doc_to_messages (or doc_to_visual + doc_to_text) |
| 72 | + --> Model.generate_until() or Model.loglikelihood() |
| 73 | + --> process_results() |
| 74 | + --> metric aggregation |
| 75 | +``` |
| 76 | + |
| 77 | +## Common Tasks |
| 78 | + |
| 79 | +### Adding a New Model |
| 80 | + |
| 81 | +1. Create `lmms_eval/models/chat/<name>.py` |
| 82 | +2. Subclass `lmms`, set `is_simple = False`, implement `generate_until` |
| 83 | +3. Use `@register_model("<name>")` decorator |
| 84 | +4. Add `"<name>": "ClassName"` to `AVAILABLE_CHAT_TEMPLATE_MODELS` in `__init__.py` |
| 85 | +5. Test: `python -m lmms_eval --model <name> --model_args pretrained=org/model --tasks mme --limit 5` |
| 86 | + |
| 87 | +### Adding a New Task |
| 88 | + |
| 89 | +1. Create `lmms_eval/tasks/<name>/<name>.yaml` + `utils.py` |
| 90 | +2. YAML needs: `task`, `dataset_path`, `test_split`, `output_type`, `doc_to_messages`, `process_results`, `metric_list` |
| 91 | +3. Tasks auto-register from YAML - no manual registration needed |
| 92 | +4. Test: `python -m lmms_eval --model qwen2_5_vl --tasks <name> --limit 8` |
| 93 | + |
| 94 | +#### Task YAML Advanced Features |
| 95 | + |
| 96 | +**`lmms_eval_specific_kwargs`** - Model-specific prompt overrides. Framework selects matching key based on model, falls back to `default`: |
| 97 | + |
| 98 | +```yaml |
| 99 | +lmms_eval_specific_kwargs: |
| 100 | + default: |
| 101 | + pre_prompt: "" |
| 102 | + post_prompt: "\nAnswer directly." |
| 103 | + qwen3_vl: |
| 104 | + format: "qwen3_vl" |
| 105 | + pre_prompt: "Question: " |
| 106 | + post_prompt: "Answer with the option letter only." |
| 107 | +``` |
| 108 | +
|
| 109 | +These kwargs are passed to `doc_to_messages(doc, lmms_eval_specific_kwargs=...)`. |
| 110 | + |
| 111 | +**`include`** - Inherit shared config from a template file (avoids duplication across variants): |
| 112 | + |
| 113 | +```yaml |
| 114 | +include: _default_template_yaml |
| 115 | +``` |
| 116 | + |
| 117 | +**`group` + `task` list** - Define task families: |
| 118 | + |
| 119 | +```yaml |
| 120 | +group: mmmu |
| 121 | +task: |
| 122 | +- mmmu_val |
| 123 | +- mmmu_test |
| 124 | +``` |
| 125 | + |
| 126 | +**`output_type`** options: `generate_until` (free-form), `loglikelihood` (multiple-choice), `generate_until_multi_round` (multi-turn conversation) |
| 127 | + |
| 128 | +### Fixing a Model Bug |
| 129 | + |
| 130 | +1. Find the model file in `models/chat/` or `models/simple/` |
| 131 | +2. Check `generate_until` for generation issues, `loglikelihood` for multiple-choice |
| 132 | +3. Look at `req.args` unpacking - chat models get 5 elements, simple get 6 |
| 133 | +4. Run with `--limit 5` to verify the fix quickly |
| 134 | + |
| 135 | +### Fixing a Task Bug |
| 136 | + |
| 137 | +1. Task YAML is in `lmms_eval/tasks/<task_name>/` |
| 138 | +2. Helper functions are in `utils.py` next to the YAML |
| 139 | +3. `process_results` handles scoring, `doc_to_messages` handles input formatting |
| 140 | +4. Test with `--limit 8` to verify |
| 141 | + |
| 142 | +## Debugging |
| 143 | + |
| 144 | +### Quick Diagnostics |
| 145 | + |
| 146 | +- **Verbose logging**: `python -m lmms_eval --model ... --verbosity DEBUG` - shows detailed traces |
| 147 | +- **Small test run**: `--limit 5` evaluates only 5 samples - always use this when testing changes |
| 148 | +- **Log samples**: `--log_samples` saves per-sample predictions to output directory for inspection |
| 149 | + |
| 150 | +### Common Errors and Fixes |
| 151 | + |
| 152 | +| Error | Cause | Fix | |
| 153 | +|-------|-------|-----| |
| 154 | +| `ValueError: gen_kwargs['until']` | Wrong type for `until` in generation_kwargs | Must be `str` or `list[str]` | |
| 155 | +| `NotImplementedError: loglikelihood` | Model doesn't support multiple-choice | Implement `loglikelihood()` or use `generate_until` tasks only | |
| 156 | +| `AttributeError: '_max_length'` | Missing initialization in model `__init__` | Set `self._max_length` in constructor | |
| 157 | +| Visual is `None` or `[]` | Dataset sample has no image/video | Guard with `if visual is not None and len(visual) > 0` | |
| 158 | +| API timeout/rate limit | API model hitting limits | Use `max_retries` and `retry_backoff_s` in model_args | |
| 159 | + |
| 160 | +### Logging |
| 161 | + |
| 162 | +The codebase uses `eval_logger` from loguru. To add debug logging in your code: |
| 163 | + |
| 164 | +```python |
| 165 | +from lmms_eval.utils import eval_logger |
| 166 | +eval_logger.debug("Processing batch of {} samples", len(batch)) |
| 167 | +eval_logger.warning("Missing visual for doc_id={}", doc_id) |
| 168 | +``` |
| 169 | + |
| 170 | +### Retry Patterns (API Models) |
| 171 | + |
| 172 | +API-backed models (openai, gemini, etc.) support retry configuration: |
| 173 | + |
| 174 | +```bash |
| 175 | +python -m lmms_eval --model openai --model_args pretrained=gpt-4o,max_retries=5,retry_backoff_s=2.0 --tasks mme |
| 176 | +``` |
| 177 | + |
| 178 | +## Constraints |
| 179 | + |
| 180 | +- **Package manager**: uv only, never pip |
| 181 | +- **Formatting**: Black (line-length=240) + isort (profile=black). Run `pre-commit run --all-files` before committing. |
| 182 | +- **No type suppression**: Never use `as any`, `@ts-ignore`, `type: ignore` to suppress type errors |
| 183 | +- **Commits**: Never mention co-authored-by or AI tools |
| 184 | +- **Minimal changes**: Fix the specific issue, don't refactor unrelated code |
| 185 | +- **Follow patterns**: Match the style of neighboring files exactly |
0 commit comments