
Commit 6352537

Add async RL instructions
1 parent 7dc5809 commit 6352537

2 files changed

Lines changed: 42 additions & 0 deletions


src/maxtext/eval/README.md

Lines changed: 41 additions & 0 deletions
@@ -148,6 +148,47 @@ python -m maxtext.eval.runner.eval_runner ...
| `--hf_token` | HuggingFace token for gated models |
| `--hf_mode` | Force HF safetensors mode |

## Async Evaluation for RL Training

After an RL training run saves a checkpoint, you can evaluate it asynchronously on a separate machine or job using `evalchemy_runner`.

**Prerequisites:**

- The checkpoint has been written to GCS.
- evalchemy is installed (`pip show evalchemy` to check, `pip install evalchemy` to install).
- `HF_TOKEN` is exported (needed only for the tokenizer download).

**Supported math tasks:** `math500`, `aime24`, `aime25`, `amc23`, `gsm8k`

```bash
STEP=1000  # training step to evaluate
MODEL=qwen3-30b-a3b
HF_PATH=Qwen/Qwen3-30B-A3B
CHECKPOINT=gs://<bucket>/run/checkpoints/${STEP}/items
OUTPUT=gs://<bucket>/eval/

python -m maxtext.eval.runner.evalchemy_runner \
  --checkpoint_path ${CHECKPOINT} \
  --model_name ${MODEL} \
  --hf_path ${HF_PATH} \
  --tasks math500 aime24 gsm8k \
  --base_output_directory ${OUTPUT} \
  --run_name rl_${MODEL}_step${STEP} \
  --max_model_len 8192 \
  --tensor_parallel_size 8 \
  --hf_token $HF_TOKEN
```
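When evaluating several checkpoints from the same run, it can help to assemble the command programmatically rather than editing the shell snippet by hand. A minimal sketch (the `build_eval_cmd` helper is hypothetical, not part of MaxText; it just mirrors the flags shown above):

```python
# Hypothetical helper: builds the evalchemy_runner invocation for one
# training step so a scheduler can launch evaluations for many checkpoints.
def build_eval_cmd(step, model, hf_path, bucket, tasks, hf_token):
    checkpoint = f"gs://{bucket}/run/checkpoints/{step}/items"
    output = f"gs://{bucket}/eval/"
    return [
        "python", "-m", "maxtext.eval.runner.evalchemy_runner",
        "--checkpoint_path", checkpoint,
        "--model_name", model,
        "--hf_path", hf_path,
        "--tasks", *tasks,
        "--base_output_directory", output,
        "--run_name", f"rl_{model}_step{step}",
        "--max_model_len", "8192",
        "--tensor_parallel_size", "8",
        "--hf_token", hf_token,
    ]

# One evaluation per saved checkpoint, launched in sequence on the eval
# machine (e.g. with subprocess.run(cmd, check=True)).
for step in (1000, 2000, 3000):
    cmd = build_eval_cmd(step, "qwen3-30b-a3b", "Qwen/Qwen3-30B-A3B",
                         "my-bucket", ["math500", "gsm8k"], "<token>")
```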

Results are written to `${OUTPUT}/rl_${MODEL}_step${STEP}/eval_results/` as JSON, and can optionally be uploaded to GCS via `--gcs_results_path`.

**Notes:**

- `--hf_path` is required because vLLM uses it to fetch the model architecture config and tokenizer even when loading weights from the MaxText checkpoint.
- Do not run this on the same machine as an active training job: both use vLLM and will contend for TPU HBM.
- To limit the dataset size, add `--num_samples 50`.
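
To track scores across training steps, a small script can aggregate the result files. A sketch, assuming the JSON files have been copied locally and each holds a top-level `results` mapping of task names to metrics (a common lm-eval-style layout; check your actual output files and adjust):

```python
import json
from pathlib import Path

def collect_scores(results_dir):
    """Gather per-task metrics from every *.json file under results_dir.

    Assumes each file holds a {"results": {task: {metric: value}}} object;
    this schema is an assumption, not guaranteed by evalchemy_runner.
    """
    scores = {}
    for path in sorted(Path(results_dir).glob("*.json")):
        data = json.loads(path.read_text())
        for task, metrics in data.get("results", {}).items():
            scores[task] = metrics
    return scores
```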
191+
## Adding a New Benchmark

For custom datasets not covered by lm-eval or evalchemy:

src/maxtext/eval/runner/evalchemy_runner.py

Lines changed: 1 addition & 0 deletions

@@ -60,6 +60,7 @@
     "gpqa_diamond": "gpqa_diamond",
     "humaneval": "humaneval",
     "livecodebench": "livecodebench",
+    "gsm8k": "gsm8k",
 }
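
The runner resolves task names through a mapping like the one in the hunk above. A caller can validate requested tasks before launching an expensive job; a sketch with a partial mirror of the mapping inlined (only the entries visible in this diff, not imported from the real module):

```python
# Partial mirror of the task-name mapping from evalchemy_runner.py,
# inlined here for illustration; the authoritative dict lives in that file.
TASK_MAP = {
    "gpqa_diamond": "gpqa_diamond",
    "humaneval": "humaneval",
    "livecodebench": "livecodebench",
    "gsm8k": "gsm8k",
}

def resolve_tasks(requested):
    """Return mapped task names, raising early on anything unsupported."""
    unknown = [t for t in requested if t not in TASK_MAP]
    if unknown:
        raise ValueError(f"Unsupported tasks: {unknown}; choose from {sorted(TASK_MAP)}")
    return [TASK_MAP[t] for t in requested]
```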