Commit b842fe3: Merge pull request #3413 from AI-Hypercomputer:nicogrande/add-docs-fixit
PiperOrigin-RevId: 884655914
2 parents 01fbe6d + afcb2ee

1 file changed: docs/tutorials/inference.md (152 additions, 0 deletions)
<!--
Copyright 2023–2026 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

(inference)=

# Inference on MaxText

We support inference of MaxText models on vLLM via an [out-of-tree](https://github.com/vllm-project/tpu-inference/blob/main/docs/getting_started/out-of-tree.md) model plugin for vLLM. In this guide, we show how to leverage this plugin for offline inference, online inference, and our reinforcement learning (RL) workflows.

> **_NOTE:_**
> The commands in this tutorial assume access to a v6e-8 VM.

# Installation

Follow the instructions in [install maxtext](https://maxtext.readthedocs.io/en/latest/install_maxtext.html) to install MaxText with post-training dependencies. We recommend installing from PyPI to ensure you have the latest stable version and set of dependencies.

After finishing the installation, ensure that the MaxText on vLLM adapter plugin has been installed. To do so, run the following command:

```bash
uv pip show maxtext_vllm_adapter
```

You should see output similar to the following if everything has been installed correctly:

```bash
Using Python 3.12.12 environment at: maxtext_venv
Name: maxtext-vllm-adapter
Version: 0.1.0
Location: ~/maxtext/maxtext_venv/lib/python3.12/site-packages
Requires:
Required-by:
```

If the plugin is not installed, run the post-training extra dependencies install script again with the following command:

```bash
install_maxtext_tpu_post_train_extra_deps
```

# Offline Inference

We include a script for convenient offline inference of MaxText models in `src/maxtext/inference/vllm_decode.py`. This is helpful for verifying the correctness of MaxText checkpoints. The script invokes the [`LLM`](https://docs.vllm.ai/en/latest/serving/offline_inference/#offline-inference) API from vLLM.

> **_NOTE:_**
> You will need to convert a checkpoint from HuggingFace in order to run the command. Do so first by following the steps in the [convert checkpoint](https://maxtext.readthedocs.io/en/latest/guides/checkpointing_solutions/convert_checkpoint.html) tutorial.

> **_NOTE:_**
> The remainder of this tutorial assumes that the path to the converted MaxText checkpoint is stored in `$CHECKPOINT_PATH`.

An example of how to run this script can be found below:

```bash
python3 -m maxtext.inference.vllm_decode src/maxtext/configs/base.yml \
  model_name=qwen3-30b-a3b \
  tokenizer_path=Qwen/Qwen3-30B-A3B \
  load_parameters_path=$CHECKPOINT_PATH \
  vllm_hf_overrides='{architectures: ["MaxTextForCausalLM"]}' \
  ici_tensor_parallelism=8 \
  enable_dp_attention=True \
  hbm_utilization_vllm=0.5 \
  prompt="Suggest some famous landmarks in London." \
  decode_sampling_temperature=0.0 \
  decode_sampling_nucleus_p=1.0 \
  decode_sampling_top_k=0.0 \
  use_chat_template=True
```

In the command above, we pass the `vllm_hf_overrides='{architectures: ["MaxTextForCausalLM"]}'` argument, which tells vLLM to use the MaxText implementation of the target model architecture.
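For reference, `vllm_decode.py` builds on vLLM's offline `LLM` API. The following is a minimal, illustrative sketch of that API only (it is not the script's implementation): the model name, parallelism, and memory fraction loosely mirror the command above, and MaxText weight loading is omitted since the adapter plugin and MaxText config handle it.

```python
# Illustrative sketch of the vLLM offline `LLM` API that vllm_decode.py wraps.
# The hf_overrides argument plays the same role as vllm_hf_overrides above.
def run_offline_inference(prompt: str) -> list[str]:
    # Imported lazily so the sketch can be read without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen3-30B-A3B",
        hf_overrides={"architectures": ["MaxTextForCausalLM"]},
        tensor_parallel_size=8,          # mirrors ici_tensor_parallelism=8
        gpu_memory_utilization=0.5,      # loosely mirrors hbm_utilization_vllm=0.5
    )
    # temperature=0.0 gives greedy decoding, matching decode_sampling_temperature=0.0.
    params = SamplingParams(temperature=0.0, top_p=1.0)
    outputs = llm.generate([prompt], params)
    return [o.outputs[0].text for o in outputs]
```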

# Online Inference

We can also run online inference (an inference server) with a MaxText model by using the [`vllm serve`](https://docs.vllm.ai/en/stable/cli/serve/) API. In order to invoke this with a MaxText model, we provide the following additional arguments:

```bash
# --hf-overrides specifies that the MaxText model architecture should be used.
--hf-overrides "{\"architectures\": [\"MaxTextForCausalLM\"]}"

# --additional-config passes in "maxtext_config", which contains overrides used to initialize the model.
--additional-config "{\"maxtext_config\": {\"model_name\": \"qwen3-235b-a22b\", \"log_config\": false, \"load_parameters_path\": \"$CHECKPOINT_PATH\"}}"
```
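Hand-escaping these JSON strings is error-prone. As a sketch, you can build the flag values with Python's `json` module and paste the result into the command; the keys mirror those shown above, and the checkpoint path is a placeholder for your `$CHECKPOINT_PATH`.

```python
import json

# Build the value for --hf-overrides.
hf_overrides = json.dumps({"architectures": ["MaxTextForCausalLM"]})

# Build the value for --additional-config.
additional_config = json.dumps({
    "maxtext_config": {
        "model_name": "qwen3-235b-a22b",
        "log_config": False,
        "load_parameters_path": "/path/to/checkpoint",  # substitute $CHECKPOINT_PATH
    }
})

print(hf_overrides)
print(additional_config)
```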

An example of how to run `vllm serve` can be found below:

```bash
vllm serve Qwen/Qwen3-30B-A3B \
  --seed 42 \
  --max-model-len 5120 \
  --gpu-memory-utilization 0.8 \
  --no-enable-prefix-caching \
  --disable-log-requests \
  --tensor-parallel-size 4 \
  --data-parallel-size 2 \
  --max-num-batched-tokens 4096 \
  --max-num-seqs 128 \
  --hf-overrides "{\"architectures\": [\"MaxTextForCausalLM\"]}" \
  --additional-config "{\"maxtext_config\": {\"model_name\": \"qwen3-30b-a3b\", \"log_config\": false, \"load_parameters_path\": \"$CHECKPOINT_PATH\"}}"
```
In a separate shell, you can send a request to this server by running the following:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-30B-A3B",
    "prompt": ["Suggest some famous landmarks in London."],
    "max_tokens": 4096,
    "temperature": 0
  }'
```
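The same request can be issued from Python using only the standard library. The sketch below builds the payload from the curl example; the `send_completion_request` helper is illustrative and assumes the `vllm serve` instance from above is running on `localhost:8000`.

```python
import json
import urllib.request

# The same completion request as the curl example above.
payload = {
    "model": "Qwen/Qwen3-30B-A3B",
    "prompt": ["Suggest some famous landmarks in London."],
    "max_tokens": 4096,
    "temperature": 0,
}

def send_completion_request(url: str = "http://localhost:8000/v1/completions") -> str:
    """POST the payload and return the first completion's text.

    Assumes the vllm serve instance from the previous section is running.
    """
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # OpenAI-compatible completions responses carry text under choices[0].text.
    return body["choices"][0]["text"]
```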

# Reinforcement Learning (RL)

> **_NOTE:_**
> Please refer to the [reinforcement learning tutorial](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/rl.html) to get started with reinforcement learning on MaxText.

> **_NOTE:_**
> In addition to a MaxText model checkpoint, you will need a HuggingFace token to run this command. Please see the following [guide](https://huggingface.co/docs/hub/en/security-tokens) to generate one.

To use a MaxText model architecture for samplers in reinforcement learning algorithms like GRPO, we can override the vLLM model architecture and pass in MaxText-specific config arguments, similar to the [online inference](online-inference) use case. An example of an RL command using the MaxText model for samplers can be found below:

```bash
python3 -m src.maxtext.trainers.post_train.rl.train_rl src/maxtext/configs/post_train/rl.yml \
  model_name=qwen3-0.6b \
  tokenizer_path=Qwen/Qwen3-0.6B \
  run_name=$WORKLOAD \
  base_output_directory=$OUTPUT_PATH \
  hf_access_token=$HF_TOKEN \
  batch_size=4 \
  num_batches=5 \
  scan_layers=True \
  hbm_utilization_vllm=0.4 \
  rollout_data_parallelism=2 \
  rollout_tensor_parallelism=4 \
  allow_split_physical_axes=true \
  load_parameters_path=$CHECKPOINT_PATH \
  vllm_hf_overrides='{architectures: ["MaxTextForCausalLM"]}' \
  vllm_additional_config='{"maxtext_config": {"model_name": "qwen3-0.6b", "log_config": false}}'
```
