| `--hf_token` | HuggingFace token for gated models |
| `--hf_mode` | Force HF safetensors mode |

## Async Evaluation for RL Training

After an RL training run saves a checkpoint, you can evaluate it asynchronously
on a separate machine or job using `evalchemy_runner`.

**Prerequisites:**
- The checkpoint is written to GCS.
- evalchemy is installed (check with `pip show evalchemy`, install with `pip install evalchemy`).
- `HF_TOKEN` is exported (needed for the tokenizer download only).

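The prerequisite checks above can be sketched as a small preflight script. This is a minimal sketch; `check_prereqs` is a hypothetical helper (not part of MaxText), and the GCS checkpoint check is omitted since it depends on your bucket layout.

```python
import importlib.util
import os


def check_prereqs(env=os.environ):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    # evalchemy must be importable (pip install evalchemy).
    if importlib.util.find_spec("evalchemy") is None:
        problems.append("evalchemy is not installed")
    # HF_TOKEN is needed to download the tokenizer for gated models.
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not exported")
    return problems


if __name__ == "__main__":
    for problem in check_prereqs():
        print("missing:", problem)
```

Run it before launching the eval job to fail fast instead of partway through model loading.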
**Supported math tasks:** `math500`, `aime24`, `aime25`, `amc23`, `gsm8k`

```bash
STEP=1000  # training step to evaluate
MODEL=qwen3-30b-a3b
HF_PATH=Qwen/Qwen3-30B-A3B
CHECKPOINT=gs://<bucket>/run/checkpoints/${STEP}/items
OUTPUT=gs://<bucket>/eval/

python -m maxtext.eval.runner.evalchemy_runner \
  --checkpoint_path ${CHECKPOINT} \
  --model_name ${MODEL} \
  --hf_path ${HF_PATH} \
  --tasks math500 aime24 gsm8k \
  --base_output_directory ${OUTPUT} \
  --run_name rl_${MODEL}_step${STEP} \
  --max_model_len 8192 \
  --tensor_parallel_size 8 \
  --hf_token $HF_TOKEN
```

Results are written to `${OUTPUT}/rl_${MODEL}_step${STEP}/eval_results/` as JSON,
and optionally uploaded to GCS via `--gcs_results_path`.
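
A sketch of collecting those per-task JSON files into one summary. The exact result schema is an assumption here (a `task` name plus a `metrics` dict); adjust the keys to whatever `evalchemy_runner` actually emits in your version.

```python
import json
from pathlib import Path


def summarize_results(results_dir):
    """Collect per-task metrics from the eval_results/ JSON files.

    Assumed (hypothetical) file schema:
        {"task": "math500", "metrics": {"accuracy": 0.82}}
    """
    summary = {}
    for path in sorted(Path(results_dir).glob("*.json")):
        data = json.loads(path.read_text())
        summary[data["task"]] = data["metrics"]
    return summary


if __name__ == "__main__":
    # Demo against a locally written sample file (no GCS access needed).
    import tempfile

    tmp = Path(tempfile.mkdtemp())
    (tmp / "math500.json").write_text(
        json.dumps({"task": "math500", "metrics": {"accuracy": 0.82}})
    )
    print(summarize_results(tmp))
```

For results under `gs://`, copy them down first (e.g. `gsutil -m cp`) and point `summarize_results` at the local directory.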

**Notes:**
- `--hf_path` is required because vLLM uses it to fetch the model architecture
  config and tokenizer even when the weights are loaded from the MaxText checkpoint.
- Do not run this on the same machine as an active training job; both use vLLM
  and will contend for TPU HBM.
- To limit dataset size, add `--num_samples 50`.

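Since each checkpoint step maps to one runner invocation, a watcher job can assemble the command programmatically. A minimal sketch, assuming the flags shown above; `build_eval_command` and the bucket paths in the comments are hypothetical, not part of MaxText.

```python
def build_eval_command(step, model, hf_path, checkpoint_root, output_dir, tasks):
    """Assemble the evalchemy_runner argv for one saved training step.

    checkpoint_root is the run's checkpoints/ prefix (e.g. a hypothetical
    gs://bucket/run/checkpoints); the /<step>/items suffix follows the
    layout shown in the example above.
    """
    return [
        "python", "-m", "maxtext.eval.runner.evalchemy_runner",
        "--checkpoint_path", f"{checkpoint_root}/{step}/items",
        "--model_name", model,
        "--hf_path", hf_path,
        "--tasks", *tasks,
        "--base_output_directory", output_dir,
        "--run_name", f"rl_{model}_step{step}",
    ]
```

A watcher that polls GCS for new steps can then launch each eval with `subprocess.run(build_eval_command(...), check=True)` on a separate machine, keeping the training job's TPUs free.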
## Adding a New Benchmark

For custom datasets not covered by lm-eval or evalchemy: