Skip to content

Commit 9bac529

Browse files
authored
docs: add comprehensive developer guidance for AI agents and contributors (#1085)
* docs: add comprehensive developer guidance for AI agents and contributors Rewrite CLAUDE.md with accurate setup instructions (pre-commit install), correct formatting tools (Black+isort, not Ruff+Prettier), architecture overview, model creation guide, task creation guide, and v0.6 summary. Create AGENTS.md as a self-contained reference for Codex, Devin, and other AI coding agents with project overview, registry v2 conventions, common task walkthroughs, and codebase constraints. * docs: add lmms_eval_specific_kwargs, debugging guide, task YAML advanced features Address gaps identified in deep analysis audit: - Document lmms_eval_specific_kwargs model-specific prompt overrides - Add task YAML include/group inheritance patterns - Add Debugging section with common errors, logging, retry patterns - Add CLI flags table and environment variables to AGENTS.md - Document generate_until_multi_round output type - Add model_specific_generation_kwargs to task YAML example
1 parent 8710077 commit 9bac529

2 files changed

Lines changed: 455 additions & 132 deletions

File tree

AGENTS.md

Lines changed: 185 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,185 @@
1+
# Agent Guidelines for lmms-eval
2+
3+
This file provides context for AI coding agents (Codex, Devin, SWE-Agent, etc.) working on this codebase.
4+
5+
For detailed development guidelines, see [CLAUDE.md](CLAUDE.md).
6+
7+
## Quick Reference
8+
9+
**Setup**: `uv sync && pre-commit install`
10+
**Run eval**: `python -m lmms_eval --model qwen2_5_vl --tasks mme --limit 5 --batch_size 1`
11+
**Lint**: `pre-commit run --all-files`
12+
**Test registry**: `uv run python -m unittest discover -s test/eval -p "test_model_registry_v2.py"`
13+
14+
### Useful CLI Flags
15+
16+
| Flag | Description |
17+
|------|-------------|
18+
| `--model` | Model backend name (e.g., `qwen2_5_vl`, `openai`, `vllm`) |
19+
| `--model_args` | Comma-separated key=value pairs (e.g., `pretrained=org/model,device_map=auto`) |
20+
| `--tasks` | Comma-separated task names |
21+
| `--limit N` | Only evaluate first N samples (use for quick testing) |
22+
| `--batch_size N` | Batch size for inference |
23+
| `--num_fewshot N` | Number of fewshot examples |
24+
| `--device cuda:0` | Device for local models |
25+
| `--output_path dir/` | Directory for result output |
26+
| `--log_samples` | Save per-sample predictions to output |
27+
| `--verbosity DEBUG` | Set log level (DEBUG, INFO, WARNING, ERROR) |
28+
29+
### Environment Variables
30+
31+
```bash
32+
export OPENAI_API_KEY="..." # Required for OpenAI/API-backed models
33+
export HF_TOKEN="..." # Required for gated HuggingFace datasets
34+
export HF_HOME="/path/to/cache" # HuggingFace cache directory
35+
export HF_HUB_ENABLE_HF_TRANSFER="1" # Faster downloads
36+
```
37+
38+
## Project Overview
39+
40+
lmms-eval evaluates Large Multimodal Models (LMMs) across image, video, and audio tasks. It supports 100+ benchmarks and 30+ model backends.
41+
42+
### Key Directories
43+
44+
| Path | Purpose |
45+
|------|---------|
46+
| `lmms_eval/models/chat/` | Chat model wrappers (recommended for new models) |
47+
| `lmms_eval/models/simple/` | Legacy model wrappers |
48+
| `lmms_eval/models/__init__.py` | Model registry: `AVAILABLE_SIMPLE_MODELS`, `AVAILABLE_CHAT_TEMPLATE_MODELS`, `MODEL_ALIASES` |
49+
| `lmms_eval/models/registry_v2.py` | `ModelManifest`, `ModelRegistryV2` - aliasing and resolution |
50+
| `lmms_eval/tasks/<task_name>/` | Task configs (YAML) + helper functions (utils.py) |
51+
| `lmms_eval/protocol.py` | `ChatMessages` - structured multimodal message protocol |
52+
| `lmms_eval/api/model.py` | Base class `lmms` - all models subclass this |
53+
| `lmms_eval/api/instance.py` | `Instance` - request object passed to models |
54+
| `lmms_eval/entrypoints/` | HTTP eval server (EvalClient, ServerArgs) |
55+
| `lmms_eval/llm_judge/` | LLM-as-judge scoring providers |
56+
57+
### Model Registry V2
58+
59+
Models are registered via two dicts in `__init__.py` that map `model_id` -> `ClassName`:
60+
61+
- `AVAILABLE_CHAT_TEMPLATE_MODELS` - chat models in `models/chat/`
62+
- `AVAILABLE_SIMPLE_MODELS` - simple models in `models/simple/`
63+
64+
If the same `model_id` exists in both, the registry creates one `ModelManifest` with both paths. Resolution prefers chat over simple.
65+
66+
`MODEL_ALIASES` provides backward-compatible name mappings: `{"new_name": ("old_name_1", "old_name_2")}`.
67+
68+
### Pipeline
69+
70+
```
71+
Dataset --> doc_to_messages (or doc_to_visual + doc_to_text)
72+
--> Model.generate_until() or Model.loglikelihood()
73+
--> process_results()
74+
--> metric aggregation
75+
```
76+
77+
## Common Tasks
78+
79+
### Adding a New Model
80+
81+
1. Create `lmms_eval/models/chat/<name>.py`
82+
2. Subclass `lmms`, set `is_simple = False`, implement `generate_until`
83+
3. Use `@register_model("<name>")` decorator
84+
4. Add `"<name>": "ClassName"` to `AVAILABLE_CHAT_TEMPLATE_MODELS` in `__init__.py`
85+
5. Test: `python -m lmms_eval --model <name> --model_args pretrained=org/model --tasks mme --limit 5`
86+
87+
### Adding a New Task
88+
89+
1. Create `lmms_eval/tasks/<name>/<name>.yaml` + `utils.py`
90+
2. YAML needs: `task`, `dataset_path`, `test_split`, `output_type`, `doc_to_messages`, `process_results`, `metric_list`
91+
3. Tasks auto-register from YAML - no manual registration needed
92+
4. Test: `python -m lmms_eval --model qwen2_5_vl --tasks <name> --limit 8`
93+
94+
#### Task YAML Advanced Features
95+
96+
**`lmms_eval_specific_kwargs`** - Model-specific prompt overrides. Framework selects matching key based on model, falls back to `default`:
97+
98+
```yaml
99+
lmms_eval_specific_kwargs:
100+
default:
101+
pre_prompt: ""
102+
post_prompt: "\nAnswer directly."
103+
qwen3_vl:
104+
format: "qwen3_vl"
105+
pre_prompt: "Question: "
106+
post_prompt: "Answer with the option letter only."
107+
```
108+
109+
These kwargs are passed to `doc_to_messages(doc, lmms_eval_specific_kwargs=...)`.
110+
111+
**`include`** - Inherit shared config from a template file (avoids duplication across variants):
112+
113+
```yaml
114+
include: _default_template_yaml
115+
```
116+
117+
**`group` + `task` list** - Define task families:
118+
119+
```yaml
120+
group: mmmu
121+
task:
122+
- mmmu_val
123+
- mmmu_test
124+
```
125+
126+
**`output_type`** options: `generate_until` (free-form), `loglikelihood` (multiple-choice), `generate_until_multi_round` (multi-turn conversation)
127+
128+
### Fixing a Model Bug
129+
130+
1. Find the model file in `models/chat/` or `models/simple/`
131+
2. Check `generate_until` for generation issues, `loglikelihood` for multiple-choice
132+
3. Look at `req.args` unpacking - chat models get 5 elements, simple get 6
133+
4. Run with `--limit 5` to verify the fix quickly
134+
135+
### Fixing a Task Bug
136+
137+
1. Task YAML is in `lmms_eval/tasks/<task_name>/`
138+
2. Helper functions are in `utils.py` next to the YAML
139+
3. `process_results` handles scoring, `doc_to_messages` handles input formatting
140+
4. Test with `--limit 8` to verify
141+
142+
## Debugging
143+
144+
### Quick Diagnostics
145+
146+
- **Verbose logging**: `python -m lmms_eval --model ... --verbosity DEBUG` - shows detailed traces
147+
- **Small test run**: `--limit 5` evaluates only 5 samples - always use this when testing changes
148+
- **Log samples**: `--log_samples` saves per-sample predictions to output directory for inspection
149+
150+
### Common Errors and Fixes
151+
152+
| Error | Cause | Fix |
153+
|-------|-------|-----|
154+
| `ValueError: gen_kwargs['until']` | Wrong type for `until` in generation_kwargs | Must be `str` or `list[str]` |
155+
| `NotImplementedError: loglikelihood` | Model doesn't support multiple-choice | Implement `loglikelihood()` or use `generate_until` tasks only |
156+
| `AttributeError: '_max_length'` | Missing initialization in model `__init__` | Set `self._max_length` in constructor |
157+
| Visual is `None` or `[]` | Dataset sample has no image/video | Guard with `if visual is not None and len(visual) > 0` |
158+
| API timeout/rate limit | API model hitting limits | Use `max_retries` and `retry_backoff_s` in model_args |
159+
160+
### Logging
161+
162+
The codebase uses `eval_logger` from loguru. To add debug logging in your code:
163+
164+
```python
165+
from lmms_eval.utils import eval_logger
166+
eval_logger.debug("Processing batch of {} samples", len(batch))
167+
eval_logger.warning("Missing visual for doc_id={}", doc_id)
168+
```
169+
170+
### Retry Patterns (API Models)
171+
172+
API-backed models (openai, gemini, etc.) support retry configuration:
173+
174+
```bash
175+
python -m lmms_eval --model openai --model_args pretrained=gpt-4o,max_retries=5,retry_backoff_s=2.0 --tasks mme
176+
```
177+
178+
## Constraints
179+
180+
- **Package manager**: uv only, never pip
181+
- **Formatting**: Black (line-length=240) + isort (profile=black). Run `pre-commit run --all-files` before committing.
182+
- **No type suppression**: Never use `as any`, `@ts-ignore`, `type: ignore` to suppress type errors
183+
- **Commits**: Never mention co-authored-by or AI tools
184+
- **Minimal changes**: Fix the specific issue, don't refactor unrelated code
185+
- **Follow patterns**: Match the style of neighboring files exactly

0 commit comments

Comments
 (0)