# MaxText vLLM Eval Framework

A vLLM-native evaluation framework for MaxText models supporting harness-based eval (lm-eval, evalchemy) and custom datasets.

## Quick Start

All runners share a single entry point:

```bash
python -m maxtext.eval.runner.run --runner <eval|lm_eval|evalchemy> [flags]
```
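
The dispatch behind `--runner` can be pictured as a small table from runner name to implementation. Below is a minimal sketch of that pattern, with hypothetical stand-in runner functions; it is not the actual MaxText entry point, which forwards many more flags.

```python
import argparse

# Hypothetical stand-ins for the real runner entry points.
def run_eval(args):
    return f"custom eval: {args.extra}"

def run_lm_eval(args):
    return f"lm_eval: {args.extra}"

def run_evalchemy(args):
    return f"evalchemy: {args.extra}"

RUNNERS = {"eval": run_eval, "lm_eval": run_lm_eval, "evalchemy": run_evalchemy}

def main(argv=None):
    parser = argparse.ArgumentParser()
    # --runner selects the implementation; remaining flags pass through untouched.
    parser.add_argument("--runner", choices=sorted(RUNNERS), required=True)
    args, extra = parser.parse_known_args(argv)
    args.extra = extra
    return RUNNERS[args.runner](args)

print(main(["--runner", "eval"]))  # → custom eval: []
```

Unrecognized flags are kept by `parse_known_args`, so each runner can define its own flag set without the dispatcher knowing about it.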
### Custom dataset (MLPerf OpenOrca, ROUGE scoring, other)

```bash
python -m maxtext.eval.runner.run \
  --runner eval \
  --config src/maxtext/eval/configs/mlperf.yml \
  --checkpoint_path gs://<bucket>/checkpoints/0/items \
  --model_name llama3.1-8b \
  --hf_path meta-llama/Llama-3.1-8B-Instruct \
  --base_output_directory gs://<bucket>/ \
  --run_name eval_run \
  --max_model_len 8192 \
  --hf_token $HF_TOKEN
```
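
ROUGE-L, used for the OpenOrca scoring above, rewards longest-common-subsequence overlap between a generation and its reference. Here is a self-contained illustration of the metric; the framework's actual scorer in `src/maxtext/eval/scoring/` will differ in tokenization and aggregation details.

```python
def rouge_l_f1(candidate: str, reference: str) -> float:
    """F1 over the longest common subsequence of whitespace tokens."""
    cand, ref = candidate.split(), reference.split()
    # Classic LCS dynamic program over token sequences.
    dp = [[0] * (len(ref) + 1) for _ in range(len(cand) + 1)]
    for i, c in enumerate(cand):
        for j, r in enumerate(ref):
            dp[i + 1][j + 1] = dp[i][j] + 1 if c == r else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(round(rouge_l_f1("the cat sat on the mat", "the cat lay on the mat"), 3))  # → 0.833
```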

HF safetensors mode (no MaxText checkpoint):

```bash
python -m maxtext.eval.runner.run \
  --runner eval \
  --config src/maxtext/eval/configs/mlperf.yml \
  --hf_path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --model_name tinyllama \
  --base_output_directory gs://<bucket>/ \
  --run_name eval_test \
  --hf_mode \
  --num_samples 20 \
  --max_model_len 2048 \
  --tensor_parallel_size 1
```

### LM Eval

Uses lm-evaluation-harness with loglikelihood scoring.

Requires: `pip install "lm_eval[api]"`

```bash
python -m maxtext.eval.runner.run \
  --runner lm_eval \
  --checkpoint_path gs://<bucket>/checkpoints/0/items \
  --model_name qwen3-30b-a3b \
  --hf_path Qwen/Qwen3-30B-A3B \
  --tasks gsm8k \
  --base_output_directory gs://<bucket>/ \
  --run_name my_run \
  --max_model_len 8192 \
  --tensor_parallel_size 8 \
  --expert_parallel_size 8 \
  --hf_token $HF_TOKEN
```

### Evalchemy

Uses [mlfoundations/evalchemy](https://github.com/mlfoundations/evalchemy), which extends lm-evaluation-harness with chat-completions-based benchmarks.

Requires: `pip install git+https://github.com/mlfoundations/evalchemy.git`

```bash
python -m maxtext.eval.runner.run \
  --runner evalchemy \
  --checkpoint_path gs://<bucket>/checkpoints/0/items \
  --model_name llama3.1-8b \
  --hf_path meta-llama/Llama-3.1-8B-Instruct \
  --tasks ifeval math500 gpqa_diamond \
  --base_output_directory gs://<bucket>/ \
  --run_name eval_run \
  --max_model_len 8192 \
  --tensor_parallel_size 4 \
  --hf_token $HF_TOKEN
```

## Common Flags

| Flag | Description |
|---|---|
| `--checkpoint_path` | MaxText Orbax checkpoint path. Enables `MaxTextForCausalLM` mode. |
| `--model_name` | MaxText model name (e.g. `llama3.1-8b`) |
| `--hf_path` | HF model ID or local path |
| `--max_model_len` | vLLM max context length |
| `--tensor_parallel_size` | Total number of chips |
| `--expert_parallel_size` | Number of EP chips |
| `--hbm_memory_utilization` | Fraction of HBM reserved for KV cache |
| `--hf_token` | HF token (or set `HF_TOKEN` env var) |
| `--hf_mode` | HF safetensors mode; no MaxText checkpoint loading |
| `--server_host` / `--server_port` | vLLM server address (default: localhost:8000) |
| `--max_num_batched_tokens` | vLLM tokens per scheduler step |
| `--max_num_seqs` | vLLM max concurrent sequences |
| `--gcs_results_path` | GCS path to upload results JSON |
| `--log_level` | Logging verbosity (default: INFO) |

Gated models (e.g. Llama, Gemma) require an HF token and an accepted model license on huggingface.co. In `MaxTextForCausalLM` mode the token is only needed to download the tokenizer, not model weights.

Flags specific to the custom `eval` runner:

| Flag | Description |
|---|---|
| `--config` | Benchmark YAML config (required) |
| `--num_samples` | Limit eval samples |
| `--max_tokens` | Max tokens per generation |
| `--temperature` | Sampling temperature (default: 0.0) |
| `--concurrency` | HTTP request concurrency (default: 64) |
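
`--concurrency` bounds how many HTTP requests the custom `eval` runner keeps in flight against the vLLM server. A minimal sketch of that pattern with an `asyncio` semaphore; `send` here is a purely illustrative stand-in for the actual request call:

```python
import asyncio

async def send(prompt: str) -> str:
    # Stand-in for an HTTP completion request to the vLLM server.
    await asyncio.sleep(0)
    return f"completion for: {prompt}"

async def run_all(prompts, concurrency: int = 64):
    # The semaphore caps in-flight requests, mirroring --concurrency.
    sem = asyncio.Semaphore(concurrency)

    async def bounded(prompt):
        async with sem:
            return await send(prompt)

    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(bounded(p) for p in prompts))

results = asyncio.run(run_all([f"prompt {i}" for i in range(10)], concurrency=4))
print(len(results))  # → 10
```

Raising concurrency increases server load but not per-request context limits; those are governed by `--max_num_seqs` and `--max_num_batched_tokens` on the vLLM side.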

Flags specific to the harness runners (`lm_eval`, `evalchemy`):

| Flag | Description |
|---|---|
| `--tasks` | Space-separated task names |
| `--num_fewshot` | Few-shot examples per task (default: 0) |
| `--num_samples` | Limit samples per task (default: full dataset) |

## Eval on RL Checkpoints

After an RL training run saves a checkpoint to GCS, you can evaluate it asynchronously on a separate machine or job. Do not run eval on the same machine as an active training job: both use vLLM and will contend for TPU HBM.

Example (Qwen3-30B-A3B, v6e-8):

```bash
STEP=244
MODEL=qwen3-30b-a3b
HF_PATH=Qwen/Qwen3-30B-A3B
CHECKPOINT=gs://<bucket>/run/checkpoints/actor/${STEP}/model_params
OUTPUT=gs://<bucket>/eval/

python -m maxtext.eval.runner.run \
  --runner lm_eval \
  --checkpoint_path ${CHECKPOINT} \
  --model_name ${MODEL} \
  --hf_path ${HF_PATH} \
  --tasks gsm8k \
  --base_output_directory ${OUTPUT} \
  --run_name rl_${MODEL}_step${STEP} \
  --max_model_len 4096 \
  --tensor_parallel_size 8 \
  --expert_parallel_size 8 \
  --num_samples 20 \
  --hf_token $HF_TOKEN
```

Results are written to `${OUTPUT}/rl_${MODEL}_step${STEP}/eval_results/` as JSON, and optionally uploaded to GCS via `--gcs_results_path`. Note that `--hf_path` is still required: vLLM uses it to fetch the model architecture config and tokenizer even when loading weights from the MaxText checkpoint.

## Adding a Custom Benchmark

For custom datasets not covered by lm-eval or evalchemy:

1. Implement `BenchmarkDataset` in `src/maxtext/eval/datasets/`:

```python
class MyDataset(BenchmarkDataset):
    ...
    # load dataset, build prompts, return SampleRequest list
```

2. Register in `src/maxtext/eval/datasets/registry.py`:

```python
from maxtext.eval.datasets.my_dataset import MyDataset
DATASET_REGISTRY["my_benchmark"] = MyDataset
```

3. Add a scorer in `src/maxtext/eval/scoring/` and register it in `src/maxtext/eval/scoring/registry.py`.
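
The three steps above follow one registry pattern end to end. Below is a self-contained sketch using stand-in types; the real `BenchmarkDataset`, `SampleRequest`, and registry modules live under `src/maxtext/eval/` and differ in detail.

```python
from dataclasses import dataclass

@dataclass
class SampleRequest:
    # Stand-in for the framework's request type.
    prompt: str
    reference: str

class BenchmarkDataset:
    # Stand-in base class for the framework's dataset interface.
    def load(self) -> list[SampleRequest]:
        raise NotImplementedError

# Step 1: implement the dataset.
class MyDataset(BenchmarkDataset):
    def load(self) -> list[SampleRequest]:
        # load dataset, build prompts, return SampleRequest list
        return [SampleRequest(prompt="2 + 2 = ?", reference="4")]

# Step 3: a trivial exact-match scorer.
def exact_match(generation: str, reference: str) -> float:
    return float(generation.strip() == reference.strip())

# Step 2/3: register both by name so configs can refer to them.
DATASET_REGISTRY = {"my_benchmark": MyDataset}
SCORER_REGISTRY = {"exact_match": exact_match}

samples = DATASET_REGISTRY["my_benchmark"]().load()
score = SCORER_REGISTRY["exact_match"]("4", samples[0].reference)
print(score)  # → 1.0
```

Registration by string key is what lets a benchmark YAML config select a dataset and scorer without importing them directly.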
