# MaxText vLLM Eval Framework

A vLLM-native evaluation framework for MaxText models, supporting harness-based evaluation (lm-eval, evalchemy) as well as custom datasets.
| 4 | + |
## Quick Start

All runners share a single entry point:

```bash
python -m maxtext.eval.runner.run --runner <eval|lm_eval|evalchemy> [flags]
```
| 12 | + |
### Custom dataset (MLPerf OpenOrca with ROUGE scoring, and others)

```bash
python -m maxtext.eval.runner.run \
  --runner eval \
  --config src/maxtext/eval/configs/mlperf.yml \
  --checkpoint_path gs://<bucket>/checkpoints/0/items \
  --model_name llama3.1-8b \
  --hf_path meta-llama/Llama-3.1-8B-Instruct \
  --base_output_directory gs://<bucket>/ \
  --run_name eval_run \
  --max_model_len 8192 \
  --hf_token $HF_TOKEN
```

HF safetensors mode (no MaxText checkpoint):

```bash
python -m maxtext.eval.runner.run \
  --runner eval \
  --config src/maxtext/eval/configs/mlperf.yml \
  --hf_path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --model_name tinyllama \
  --base_output_directory gs://<bucket>/ \
  --run_name eval_test \
  --hf_mode \
  --num_samples 20 \
  --max_model_len 2048 \
  --tensor_parallel_size 1
```
| 43 | + |
### LM Eval

Requires: `pip install "lm_eval[api]"`

```bash
python -m maxtext.eval.runner.run \
  --runner lm_eval \
  --checkpoint_path gs://<bucket>/checkpoints/0/items \
  --model_name qwen3-30b-a3b \
  --hf_path Qwen/Qwen3-30B-A3B \
  --tasks gsm8k \
  --base_output_directory gs://<bucket>/ \
  --run_name my_run \
  --max_model_len 8192 \
  --tensor_parallel_size 8 \
  --expert_parallel_size 8 \
  --hf_token $HF_TOKEN
```
| 62 | + |
### Evalchemy

Requires: `pip install git+https://github.com/mlfoundations/evalchemy.git`

```bash
python -m maxtext.eval.runner.run \
  --runner evalchemy \
  --checkpoint_path gs://<bucket>/checkpoints/0/items \
  --model_name llama3.1-8b \
  --hf_path meta-llama/Llama-3.1-8B-Instruct \
  --tasks ifeval math500 gpqa_diamond \
  --base_output_directory gs://<bucket>/ \
  --run_name eval_run \
  --max_model_len 8192 \
  --tensor_parallel_size 4 \
  --hf_token $HF_TOKEN
```
| 80 | + |
## Common Flags

| Flag | Description |
|---|---|
| `--checkpoint_path` | MaxText Orbax checkpoint path. Enables `MaxTextForCausalLM` mode. |
| `--model_name` | MaxText model name (e.g. `llama3.1-8b`) |
| `--hf_path` | HF model ID or local path |
| `--max_model_len` | vLLM max context length |
| `--tensor_parallel_size` | Chips per model replica |
| `--expert_parallel_size` | Chips for the expert mesh axis |
| `--data_parallel_size` | Number of model replicas |
| `--hbm_memory_utilization` | Fraction of HBM reserved for KV cache |
| `--hf_token` | HF token (or set `HF_TOKEN` env var) |
| `--hf_mode` | HF safetensors mode, no MaxText checkpoint loading |
| `--server_host` / `--server_port` | vLLM server address (default: `localhost:8000`) |
| `--max_num_batched_tokens` | vLLM tokens per scheduler step |
| `--max_num_seqs` | vLLM max concurrent sequences |
| `--gcs_results_path` | GCS path to upload results JSON |
| `--log_level` | Logging verbosity (default: `INFO`) |

Custom `eval` specific:

| Flag | Description |
|---|---|
| `--config` | Benchmark YAML config (required) |
| `--num_samples` | Limit eval samples |
| `--max_tokens` | Max tokens per generation |
| `--temperature` | Sampling temperature (default: 0.0) |
| `--concurrency` | HTTP request concurrency (default: 64) |

Harness `lm_eval` / `evalchemy` specific:

| Flag | Description |
|---|---|
| `--tasks` | Space-separated task names |
| `--num_fewshot` | Few-shot examples per task (default: 0) |
| `--num_samples` | Limit samples per task (default: full dataset) |
| 118 | + |
## Eval on RL Checkpoints

Point `--checkpoint_path` at the actor checkpoint for the RL training step you want to evaluate.

Example (Qwen3-30B-A3B, v6e-8):

```bash
STEP=244
MODEL=qwen3-30b-a3b
HF_PATH=Qwen/Qwen3-30B-A3B
CHECKPOINT=gs://<bucket>/run/checkpoints/actor/${STEP}/model_params
OUTPUT=gs://<bucket>/eval/

python -m maxtext.eval.runner.run \
  --runner lm_eval \
  --checkpoint_path ${CHECKPOINT} \
  --model_name ${MODEL} \
  --hf_path ${HF_PATH} \
  --tasks gsm8k \
  --base_output_directory ${OUTPUT} \
  --run_name rl_${MODEL}_step${STEP} \
  --max_model_len 4096 \
  --tensor_parallel_size 8 \
  --expert_parallel_size 8 \
  --num_samples 20 \
  --hf_token $HF_TOKEN
```
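To track accuracy across training, the same invocation can be swept over several checkpoint steps. A minimal sketch of a sweep driver; the bucket name, step numbers, and the `build_eval_cmd` helper are illustrative placeholders, not part of the framework:

```python
MODEL = "qwen3-30b-a3b"
HF_PATH = "Qwen/Qwen3-30B-A3B"

def build_eval_cmd(step: int, bucket: str) -> list[str]:
    """Assemble the runner argv for one RL actor checkpoint step."""
    return [
        "python", "-m", "maxtext.eval.runner.run",
        "--runner", "lm_eval",
        "--checkpoint_path", f"gs://{bucket}/run/checkpoints/actor/{step}/model_params",
        "--model_name", MODEL,
        "--hf_path", HF_PATH,
        "--tasks", "gsm8k",
        "--base_output_directory", f"gs://{bucket}/eval/",
        "--run_name", f"rl_{MODEL}_step{step}",
        "--max_model_len", "4096",
        "--tensor_parallel_size", "8",
        "--expert_parallel_size", "8",
    ]

# Print the command for each step; pass the list to subprocess.run(..., check=True)
# to actually launch the evals sequentially.
for step in (122, 244, 366):  # placeholder step numbers
    print(" ".join(build_eval_cmd(step, "my-bucket")))
```

Each run gets a distinct `--run_name`, so per-step results land in separate output directories and can be compared afterwards.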
| 146 | + |
| 147 | + |
## Adding a Custom Benchmark

1. Implement `BenchmarkDataset` in `src/maxtext/eval/datasets/`:

```python
from maxtext.eval.datasets.base import BenchmarkDataset, SampleRequest

class MyDataset(BenchmarkDataset):
    name = "my_benchmark"

    def sample_requests(self, num_samples, tokenizer) -> list[SampleRequest]:
        # Load the dataset, build prompts, and return a list of SampleRequest objects.
        ...
```

2. Register it in `src/maxtext/eval/datasets/registry.py`:

```python
from maxtext.eval.datasets.my_dataset import MyDataset

DATASET_REGISTRY["my_benchmark"] = MyDataset
```

3. Add a scorer in `src/maxtext/eval/scoring/` and register it in `src/maxtext/eval/scoring/registry.py`.
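As an illustration of the scoring step only, a standalone exact-match scorer might look like the sketch below. The function names and signature are assumptions for this example, not the framework's actual scorer interface — check `src/maxtext/eval/scoring/` for the real base class to subclass:

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't count as errors."""
    return " ".join(text.lower().split())

def exact_match_score(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match their reference after normalization."""
    if not predictions:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(predictions)
```

For example, `exact_match_score(["The answer is 42"], ["the  answer is 42"])` returns `1.0`, since normalization removes the case and whitespace differences.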