docs/lmms-eval-0.6.md
Lines changed: 118 additions & 7 deletions
Original file line number
Diff line number
Diff line change
@@ -11,9 +11,11 @@ Building on lmms-eval's existing framework, v0.6 transforms evaluation from a on
11
11
- **During training**: Evaluation runs as a standalone service, decoupled from the training loop. Submit checkpoints for async evaluation without blocking GPU training.
12
12
- **Post-training**: Rapid, comprehensive evaluation across all modalities with statistical guarantees on the results.
13
13
14
+
From the engineering side, v0.6 also ships a substantial API-throughput upgrade. With the latest API control-path updates (adaptive concurrency, refill scheduling, prefix-aware queueing, and retry/backoff decoupling), we observe about **7.5x throughput improvement** on a fixed `LIMIT=100` benchmark (`0.3278 -> 2.4584 req/s`), while preserving metric outputs for the same task/model setup.
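As a quick sanity check on the headline number, the speedup can be recomputed from the two throughput figures quoted above. This is only arithmetic on the reported values; the variable names are illustrative.

```python
# Throughput figures quoted in the paragraph above (requests per second).
baseline_rps = 0.3278
latest_rps = 2.4584

# Speedup is the ratio of the new throughput to the baseline throughput.
speedup = latest_rps / baseline_rps
print(f"speedup: {speedup:.2f}x")
```

The ratio comes out just under 7.5, which matches the "about 7.5x" wording.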
15
+
14
16
| Area | Key Features |
15
17
|------|--------------|
16
-
|**Performance**| Fully async and decoupled inference; data layer optimization for high-throughput multimodal access|
18
+
|**Performance**| Fully async and decoupled inference; adaptive API concurrency control; prefix-aware queueing; measured ~7.5x throughput gain on the API benchmark path|
17
19
|**Evaluation as a Service**| Async job submission without blocking GPU training; separately hosted eval service on dedicated GPUs |
We recommend `async_openai` for all API-backed evaluation — it uses native async I/O and achieves significantly higher throughput.
66
+
67
+
Both resolve to **chat mode by default** via Model Registry V2. The simple mode (`doc_to_visual` + `doc_to_text`) is deprecated and will be removed in a future release. See [Model Registry V2](#model-registry-v2) below for details.
68
+
69
+
> **Naming change in v0.6**: the canonical model names have been shortened from `openai_compatible` / `async_openai_compatible` to `openai` / `async_openai`. These are the names used in filenames, registry keys, and `@register_model` decorators. The old names (`openai_compatible`, `openai_compatible_chat`, `async_openai_compatible`, `async_openai_compatible_chat`) continue to work as aliases via `MODEL_ALIASES` in `__init__.py`, so existing scripts are not affected.
70
+
71
+
#### Adaptive Concurrency Control
72
+
73
+
v0.6 adds adaptive concurrency for API-backed evaluation (`async_openai`, `openai`).
74
+
75
+
The controller continuously adjusts in-flight request count using three online signals:
76
+
- request failure rate
77
+
- rate-limit hit rate (e.g., 429 / throttling)
78
+
- p95 latency against a target latency budget
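To make the three signals concrete, here is a minimal sketch of a controller step, assuming a simple additive-increase / multiplicative-decrease (AIMD) policy. The policy, the `WindowStats` type, the function name, and every threshold below are assumptions for illustration; they are not taken from the lmms-eval implementation.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Signals gathered over a recent batch of completed requests."""
    failure_rate: float     # fraction of requests that failed
    rate_limit_rate: float  # fraction that hit 429 / throttling
    p95_latency_s: float    # observed p95 latency in seconds


def next_concurrency(
    current: int,
    stats: WindowStats,
    target_latency_s: float = 15.0,     # hypothetical latency budget
    max_failure_rate: float = 0.05,     # hypothetical threshold
    max_rate_limit_rate: float = 0.02,  # hypothetical threshold
    min_concurrency: int = 1,
    max_concurrency: int = 128,
) -> int:
    """Return the in-flight request limit to use for the next batch."""
    overloaded = (
        stats.failure_rate > max_failure_rate
        or stats.rate_limit_rate > max_rate_limit_rate
        or stats.p95_latency_s > target_latency_s
    )
    if overloaded:
        # Any signal over budget: halve the limit (multiplicative decrease).
        return max(min_concurrency, current // 2)
    # All signals healthy: probe upward by one slot (additive increase).
    return min(max_concurrency, current + 1)
```

With these assumed defaults, a healthy batch at a limit of 16 moves to 17, while a batch with heavy throttling drops to 8.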
79
+
80
+
Execution also uses refill-style scheduling (no full-window barrier), so completed requests immediately release slots for new work.
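Refill-style scheduling can be sketched with plain `asyncio`: a fixed number of workers each pull the next pending request as soon as their current one finishes, so no request waits for a whole window to drain. The function names and the fake `send` coroutine are hypothetical; this illustrates the idea and is not the lmms-eval code.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def run_with_refill(
    requests: Sequence[str],
    send: Callable[[str], Awaitable[str]],
    concurrency: int,
) -> List[str]:
    """Keep up to `concurrency` requests in flight, refilling on each completion."""
    results: List[str] = [""] * len(requests)
    next_index = 0

    async def worker() -> None:
        nonlocal next_index
        # Claim the next pending index; there is no await between the check
        # and the increment, so two workers never claim the same request.
        while next_index < len(requests):
            i = next_index
            next_index += 1
            results[i] = await send(requests[i])

    await asyncio.gather(*(worker() for _ in range(concurrency)))
    return results


async def fake_send(prompt: str) -> str:
    """Stand-in for an API call; longer prompts take longer."""
    await asyncio.sleep(0.01 * len(prompt))
    return prompt.upper()


if __name__ == "__main__":
    print(asyncio.run(run_with_refill(["a", "bbb", "cc"], fake_send, concurrency=2)))
```

Results are written back by index, so output order matches input order even though completions arrive out of order.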
81
+
82
+
For API models with repeated prompt prefixes, v0.6 also supports prefix-aware queueing to improve prefill-cache hit opportunities by dispatching same-prefix requests close together.
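One way to picture prefix-aware queueing is as a stable reordering of the dispatch queue by a hash of each prompt's leading characters. The sketch below assumes a fixed-length character prefix (`prefix_chars`) as the grouping key; the real grouping key and function names may differ.

```python
import hashlib
from typing import Dict, List


def prefix_key(prompt: str, prefix_chars: int = 256) -> str:
    """Hash the leading characters of a prompt (prefix length is an assumed knob)."""
    return hashlib.sha256(prompt[:prefix_chars].encode("utf-8")).hexdigest()


def order_by_prefix(prompts: List[str], prefix_chars: int = 256) -> List[int]:
    """Return a dispatch order (as indices) that places same-prefix prompts together.

    Groups are ordered by first appearance, and the original order is kept
    within each group.
    """
    keys = [prefix_key(p, prefix_chars) for p in prompts]
    first_seen: Dict[str, int] = {}
    for i, key in enumerate(keys):
        first_seen.setdefault(key, i)
    return sorted(range(len(prompts)), key=lambda i: (first_seen[keys[i]], i))


if __name__ == "__main__":
    queue = ["SYS-A q1", "SYS-B q1", "SYS-A q2", "SYS-B q2"]
    print(order_by_prefix(queue, prefix_chars=5))
```

Returning indices rather than reordered prompts makes it easy to map responses back to their original documents.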
- The latest API control path reaches about **7.5x throughput** over baseline on the same `LIMIT=100` setup.
137
+
- Compared to the previous adaptive run (`v1`), the latest adaptive run (`v2`) still improves (`2.4047 -> 2.4584 req/s`, `+2.23%`). This is a small but measurable delta in a noisy environment (shared network + provider-side scheduling), so the right takeaway is not "a new ceiling", but "less overhead and better utilization under the same constraints."
138
+
- The core point: this speedup is not from changing benchmark difficulty. We keep the same task (`mme`), model (`bytedance-seed/seed-1.6-flash`), limit (`100`), and evaluation prompts/settings. The gain comes from changes in the API request scheduling/control path.
139
+
- What `adaptive (v2)` means in practice:
140
+
- Refill scheduling (no window barrier): maintain a steady pool of in-flight requests and immediately dispatch new work as soon as a request completes. This reduces idle gaps and prevents the slowest request in a window from gating progress.
141
+
- Rolling controller updates: adjust concurrency based on a rolling batch of completions (failure rate, rate-limit hits, and p95 latency vs target) rather than only after fixed windows. This makes the controller more responsive and less sensitive to outliers.
142
+
- Hysteresis for stability: use separate "reduce" vs "increase" conditions (and minimum sample thresholds) to avoid oscillating on a single transient 429 or a brief latency spike.
143
+
- Retry/backoff decoupling: `retry_backoff_s` is explicitly separate from the request timeout, so a retry waits only for the short backoff instead of sleeping for a full timeout while holding a worker slot.
144
+
- Prefix-aware queueing (when enabled): reorder dispatch by prefix hash so same-prefix requests are sent close together, improving prefill-cache hit opportunities on providers that support prefix caching. (Some routing layers may dilute this benefit; the mechanism is still safe.)
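The retry/backoff decoupling described in the list above can be illustrated with a small wrapper in which the per-attempt timeout and the pause between attempts are independent parameters. Only `retry_backoff_s` is a name from the text; `call_with_retry`, `timeout_s`, and `max_retries` are hypothetical, and the sketch makes no claim about how lmms-eval structures its retry loop.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def call_with_retry(
    send: Callable[[], Awaitable[str]],
    timeout_s: float = 60.0,       # per-attempt request timeout
    retry_backoff_s: float = 1.0,  # pause between attempts, independent of timeout_s
    max_retries: int = 3,          # hypothetical retry cap
) -> str:
    """Run `send`, retrying on failure with a short backoff between attempts."""
    last_error: Optional[Exception] = None
    for attempt in range(max_retries + 1):
        try:
            # The timeout bounds how long a single attempt may run.
            return await asyncio.wait_for(send(), timeout=timeout_s)
        except Exception as exc:  # broad on purpose: this is only a sketch
            last_error = exc
            if attempt < max_retries:
                # The backoff is its own (short) knob, so a failed request
                # does not hold its slot for another full timeout.
                await asyncio.sleep(retry_backoff_s)
    raise RuntimeError(f"request failed after {max_retries + 1} attempts") from last_error
```

Keeping the two values separate means a long timeout (useful for slow multimodal requests) does not translate into long idle gaps after a quick failure such as a 429.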
v0.6 introduces `ModelRegistryV2` — a unified model registry that replaces the previous ad-hoc import system. All model names (`--model X`) resolve through a single path.
154
+
155
+
**How it works**
156
+
157
+
Two dicts in `lmms_eval/models/__init__.py` declare available models:
158
+
159
+
- `AVAILABLE_SIMPLE_MODELS`: maps `model_id` -> `ClassName` for simple (legacy) models in `models/simple/`
160
+
- `AVAILABLE_CHAT_TEMPLATE_MODELS`: maps `model_id` -> `ClassName` for chat models in `models/chat/`
161
+
162
+
At startup, the registry merges both dicts into `ModelManifest` objects. Each manifest holds a `model_id` and up to two class paths (simple + chat). Class paths are auto-constructed: `lmms_eval.models.{type}.{model_id}.{ClassName}`, so the dict key **must match the filename**.
163
+
164
+
**Resolution**: chat is always preferred over simple (unless `force_simple=True`). This means `--model openai` transparently resolves to the chat implementation.
165
+
166
+
**Aliasing**: backward-compatible names are supported via `MODEL_ALIASES` in `__init__.py` and via `ModelManifest.aliases`. Old names like `openai_compatible`, `openai_compatible_chat`, `async_openai_compatible`, and `async_openai_compatible_chat` continue to work.
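The merge, resolution, and aliasing rules above can be sketched in a few lines. This is a toy model and not the `ModelRegistryV2` source: the `Manifest` class, the `build_manifests` and `resolve` helpers, the class names in the dicts, and the alias-to-canonical shape of `MODEL_ALIASES` are all assumptions. Only the dict names, the class-path pattern, and the chat-over-simple rule come from the text.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Toy stand-ins for the two dicts (class names here are hypothetical).
AVAILABLE_SIMPLE_MODELS: Dict[str, str] = {"openai": "OpenAISimple"}
AVAILABLE_CHAT_TEMPLATE_MODELS: Dict[str, str] = {
    "openai": "OpenAIChat",
    "async_openai": "AsyncOpenAIChat",
}
# Assumed shape: old name -> canonical name.
MODEL_ALIASES: Dict[str, str] = {
    "openai_compatible": "openai",
    "async_openai_compatible": "async_openai",
}


@dataclass
class Manifest:
    """Hypothetical stand-in for ModelManifest: one id, up to two class paths."""
    model_id: str
    simple_class_path: Optional[str] = None
    chat_class_path: Optional[str] = None


def build_manifests(simple: Dict[str, str], chat: Dict[str, str]) -> Dict[str, Manifest]:
    """Merge both dicts into one manifest per model_id."""
    manifests: Dict[str, Manifest] = {}
    for kind, table in (("simple", simple), ("chat", chat)):
        for model_id, class_name in table.items():
            manifest = manifests.setdefault(model_id, Manifest(model_id))
            # Path pattern from the text: the dict key doubles as the filename.
            path = f"lmms_eval.models.{kind}.{model_id}.{class_name}"
            if kind == "simple":
                manifest.simple_class_path = path
            else:
                manifest.chat_class_path = path
    return manifests


def resolve(name: str, manifests: Dict[str, Manifest], force_simple: bool = False) -> str:
    """Map a --model name (or alias) to a class path, preferring chat."""
    manifest = manifests[MODEL_ALIASES.get(name, name)]
    if force_simple:
        if manifest.simple_class_path is None:
            raise ValueError(f"{manifest.model_id} has no simple implementation")
        return manifest.simple_class_path
    path = manifest.chat_class_path or manifest.simple_class_path
    if path is None:
        raise ValueError(f"{manifest.model_id} has no registered implementation")
    return path
```

Under these assumptions, `resolve("openai_compatible", manifests)` yields the chat class path for `openai`, and passing `force_simple=True` yields the simple one.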
167
+
168
+
**Simple mode deprecation**: the simple model interface (`doc_to_visual` + `doc_to_text`) for API models is deprecated. New integrations should always use chat (`doc_to_messages` + `ChatMessages`). The simple implementations in `models/simple/openai.py` will be removed in a future release.
169
+
59
170
### 1.2 Data Layer
60
171
61
172
#### Storage Format
@@ -317,20 +428,20 @@ Same accuracy, but Model A is 3× more stable.
317
428
318
429
---
319
430
320
-
## 3. Frontier Multimodal Evaluation (TODO)
431
+
## 3. Evaluating Multimodal Models in 2026 (TODO)
321
432
322
-
> **Note**: This section outlines planned frontier evaluation features. Implementation is in progress.
433
+
> **Note**: This section outlines planned evaluation features. Implementation is in progress.
323
434
324
-
### 3.1 Why Frontier Scenarios Matter
435
+
### 3.1 More features are expected
325
436
326
-
Static image QA benchmarks are saturating. Building frontier multimodal systems requires setting more challenging tasks and evaluating them in more realistic scenarios:
437
+
Static image QA benchmarks are saturating. Building multimodal systems requires setting more challenging tasks and evaluating them in more realistic scenarios:
327
438
328
439
| Capability | Challenge | Current Gap |
329
440
|------------|-----------|-------------|
330
441
| Long video understanding | 10min+ videos, 1000+ frames | Most benchmarks use <128 frames |
331
-
| High temporal resolution | Event detection at 30fps | Sparse sampling loses fine-grained actions |
442
+
| High motion | Object movement at 30fps | Sparse sampling loses fine-grained actions |
332
443
| Spatial reasoning | 3D world understanding | 2D perception ≠ physical grounding |
| Agentic interaction | Multi-step task execution and feedback | Static QA can't measure planning/tool use well|
334
445
335
446
**Key insight**: These capabilities require **in-environment evaluation**: the model must interact with simulators, receive feedback, and adapt. Static input-output pairs cannot capture this.