# MaxText vLLM Eval Framework

A vLLM-native evaluation framework for MaxText models, supporting harness-based evals (lm-eval, evalchemy) and custom datasets.

## Quick Start

All runners share a single entry point:

```bash
python -m maxtext.eval.runner.run --runner <eval|lm_eval|evalchemy> [flags]
```

### Custom dataset (MLPerf OpenOrca, ROUGE scoring, others)

```bash
python -m maxtext.eval.runner.run \
  --runner eval \
  --config src/maxtext/eval/configs/mlperf.yml \
  --checkpoint_path gs://<bucket>/checkpoints/0/items \
  --model_name llama3.1-8b \
  --hf_path meta-llama/Llama-3.1-8B-Instruct \
  --base_output_directory gs://<bucket>/ \
  --run_name eval_run \
  --max_model_len 8192 \
  --hf_token $HF_TOKEN
```

HF safetensors mode (no MaxText checkpoint):

```bash
python -m maxtext.eval.runner.run \
  --runner eval \
  --config src/maxtext/eval/configs/mlperf.yml \
  --hf_path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --model_name tinyllama \
  --base_output_directory gs://<bucket>/ \
  --run_name eval_test \
  --hf_mode \
  --num_samples 20 \
  --max_model_len 2048 \
  --tensor_parallel_size 1
```
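
The MLPerf OpenOrca path scores generations with ROUGE. As a rough illustration of the metric itself (a sketch, not the framework's scorer), ROUGE-L compares a generation to a reference via their longest common subsequence of tokens:

```python
def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F1 over whitespace tokens, via longest common subsequence."""
    p, r = prediction.split(), reference.split()
    # LCS length by dynamic programming.
    dp = [[0] * (len(r) + 1) for _ in range(len(p) + 1)]
    for i, pt in enumerate(p):
        for j, rt in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if pt == rt else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[len(p)][len(r)]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

# Over-long generations lower precision and therefore the F1.
score = rouge_l_f1("the cat sat on the mat", "the cat sat")
```

Production ROUGE implementations add stemming and tokenization details; this only shows the shape of the computation.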

### LM Eval

Uses lm-evaluation-harness with loglikelihood scoring.

Requires: `pip install "lm_eval[api]"`

```bash
python -m maxtext.eval.runner.run \
  --runner lm_eval \
  --checkpoint_path gs://<bucket>/checkpoints/0/items \
  --model_name qwen3-30b-a3b \
  --hf_path Qwen/Qwen3-30B-A3B \
  --tasks gsm8k \
  --base_output_directory gs://<bucket>/ \
  --run_name my_run \
  --max_model_len 8192 \
  --tensor_parallel_size 8 \
  --expert_parallel_size 8 \
  --hf_token $HF_TOKEN
```

### Evalchemy

Uses [mlfoundations/evalchemy](https://github.com/mlfoundations/evalchemy), which extends lm-evaluation-harness with chat-completions-based benchmarks.

Requires: `pip install git+https://github.com/mlfoundations/evalchemy.git`

```bash
python -m maxtext.eval.runner.run \
  --runner evalchemy \
  --checkpoint_path gs://<bucket>/checkpoints/0/items \
  --model_name llama3.1-8b \
  --hf_path meta-llama/Llama-3.1-8B-Instruct \
  --tasks ifeval math500 gpqa_diamond \
  --base_output_directory gs://<bucket>/ \
  --run_name eval_run \
  --max_model_len 8192 \
  --tensor_parallel_size 4 \
  --hf_token $HF_TOKEN
```

## Common Flags

| Flag | Description |
|---|---|
| `--checkpoint_path` | MaxText Orbax checkpoint path; enables `MaxTextForCausalLM` mode |
| `--model_name` | MaxText model name (e.g. `llama3.1-8b`) |
| `--hf_path` | HF model ID or local path |
| `--max_model_len` | vLLM max context length |
| `--tensor_parallel_size` | Total number of chips |
| `--expert_parallel_size` | Number of expert-parallel chips |
| `--hbm_memory_utilization` | Fraction of HBM reserved for the KV cache |
| `--hf_token` | HF token for gated models (or set the `HF_TOKEN` env var) |
| `--hf_mode` | HF safetensors mode; no MaxText checkpoint loading |
| `--server_host` / `--server_port` | vLLM server address (default: `localhost:8000`) |
| `--max_num_batched_tokens` | vLLM tokens per scheduler step |
| `--max_num_seqs` | vLLM max concurrent sequences |
| `--gcs_results_path` | GCS path to upload results JSON |
| `--log_level` | Logging verbosity (default: `INFO`) |
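
`--hbm_memory_utilization` trades weight and activation headroom against KV-cache capacity. A back-of-envelope sketch of the relationship, using generic transformer KV-cache arithmetic (the shapes below are assumptions for illustration, not values read from MaxText or vLLM):

```python
def kv_cache_tokens(hbm_gib, utilization, num_layers, num_kv_heads, head_dim,
                    dtype_bytes=2):
    """Approximate KV-cache token capacity for one chip.

    Each cached token stores one key and one value vector per layer
    per KV head, at dtype_bytes per element.
    """
    kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return int(hbm_gib * 2**30 * utilization // kv_bytes_per_token)

# Assumed Llama-3.1-8B-like shapes: 32 layers, 8 KV heads, head_dim 128, bf16,
# on a chip with 32 GiB HBM and 30% of it reserved for the KV cache.
capacity = kv_cache_tokens(32, 0.30, num_layers=32, num_kv_heads=8, head_dim=128)
```

Capacity like this is what bounds `--max_num_seqs` x `--max_model_len` in practice.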

Flags specific to the custom `eval` runner:

| Flag | Description |
|---|---|
| `--config` | Benchmark YAML config (required) |
| `--num_samples` | Limit eval samples |
| `--max_tokens` | Max tokens per generation |
| `--temperature` | Sampling temperature (default: 0.0) |
| `--concurrency` | HTTP request concurrency (default: 64) |
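
`--concurrency` caps the number of in-flight HTTP requests the eval client keeps open against the vLLM server. The pattern is a bounded fan-out; a minimal sketch of the idea (not the framework's client code):

```python
import asyncio

async def run_with_concurrency(prompts, send_request, limit=64):
    """Fan requests out while keeping at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)

    async def bounded(prompt):
        async with sem:
            return await send_request(prompt)

    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(bounded(p) for p in prompts))

# Toy stand-in for the HTTP call to the vLLM server:
async def fake_request(prompt):
    await asyncio.sleep(0)
    return prompt.upper()

results = asyncio.run(run_with_concurrency(["a", "b", "c"], fake_request, limit=2))
```

A higher limit improves throughput until the server's scheduler (see `--max_num_seqs`) becomes the bottleneck.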

Flags specific to the harness runners (`lm_eval`, `evalchemy`):

| Flag | Description |
|---|---|
| `--tasks` | Space-separated task names |
| `--num_fewshot` | Few-shot examples per task (default: 0) |
| `--num_samples` | Limit samples per task (default: full dataset) |

## Eval on RL Checkpoints

After an RL training run saves a checkpoint to GCS, you can evaluate it asynchronously on a separate machine or job.

Example (Qwen3-30B-A3B, v6e-8):

```bash
STEP=244
MODEL=qwen3-30b-a3b
HF_PATH=Qwen/Qwen3-30B-A3B
CHECKPOINT=gs://<bucket>/run/checkpoints/actor/${STEP}/model_params
OUTPUT=gs://<bucket>/eval/

python -m maxtext.eval.runner.run \
  --runner lm_eval \
  --checkpoint_path ${CHECKPOINT} \
  --model_name ${MODEL} \
  --hf_path ${HF_PATH} \
  --tasks gsm8k \
  --base_output_directory ${OUTPUT} \
  --run_name rl_${MODEL}_step${STEP} \
  --max_model_len 4096 \
  --tensor_parallel_size 8 \
  --expert_parallel_size 8 \
  --num_samples 20 \
  --hf_token $HF_TOKEN
```

Results are written to `${OUTPUT}/rl_${MODEL}_step${STEP}/eval_results/` as JSON, and optionally uploaded to GCS via `--gcs_results_path`.

**Notes:**
- `--hf_path` is required since vLLM uses it to fetch the model architecture config and tokenizer even when loading weights from the MaxText checkpoint.
- Do not run evals on the same machine as an active training job; both use vLLM and will contend for TPU HBM.

## Adding a Custom Benchmark

1. Implement `BenchmarkDataset` in `src/maxtext/eval/datasets/`:

```python
class MyDataset(BenchmarkDataset):
    ...
    # load dataset, build prompts, return SampleRequest list
```

2. Register in `src/maxtext/eval/datasets/registry.py`:

```python
from maxtext.eval.datasets.my_dataset import MyDataset

DATASET_REGISTRY["my_benchmark"] = MyDataset
```

3. Add a scorer in `src/maxtext/eval/scoring/` and register it in `src/maxtext/eval/scoring/registry.py`.
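
The scorer side mirrors the dataset registry pattern above. A self-contained sketch of what step 3 might look like (`SCORER_REGISTRY`, the class name, and the `score` signature are assumptions for illustration, not the framework's actual API):

```python
# Illustrative only: names below are assumed, not MaxText's actual scoring API.
SCORER_REGISTRY = {}

class ExactMatchScorer:
    """Scores generations by case-insensitive exact match against references."""

    def score(self, predictions, references):
        hits = sum(p.strip().lower() == r.strip().lower()
                   for p, r in zip(predictions, references))
        return {"exact_match": hits / len(predictions)}

SCORER_REGISTRY["my_benchmark"] = ExactMatchScorer

# Look up and run the scorer the way a runner might:
metrics = SCORER_REGISTRY["my_benchmark"]().score(["Paris", "berlin "], ["paris", "Rome"])
```

Keying the scorer registry on the same benchmark name as the dataset registry lets the runner pair them automatically.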