<!--
 Copyright 2023–2026 Google LLC

 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

      https://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

(inference)=

# Inference on MaxText

We support inference of MaxText models on vLLM via an [out-of-tree](https://github.com/vllm-project/tpu-inference/blob/main/docs/getting_started/out-of-tree.md) model plugin. In this guide we show how to leverage this for offline inference, online inference, and for use in our reinforcement learning (RL) workflows.

> **_NOTE:_**
> The commands in this tutorial assume access to a v6e-8 VM.

## Installation

Follow the instructions in [install MaxText](https://maxtext.readthedocs.io/en/latest/install_maxtext.html) to install MaxText with post-training dependencies. We recommend installing from PyPI to ensure you get the latest stable version and a consistent set of dependencies.

After finishing the installation, verify that the MaxText on vLLM adapter plugin has been installed by running the following command:

```bash
uv pip show maxtext_vllm_adapter
```

You should see output similar to the following if everything has been installed correctly:

```bash
Using Python 3.12.12 environment at: maxtext_venv
Name: maxtext-vllm-adapter
Version: 0.1.0
Location: ~/maxtext/maxtext_venv/lib/python3.12/site-packages
Requires:
Required-by:
```
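
You can also check for the plugin programmatically. Below is a minimal sketch using only the Python standard library; the distribution name `maxtext-vllm-adapter` is taken from the `uv pip show` output above:

```python
from importlib import metadata

# Look up the installed adapter plugin by its distribution name
# (as reported by `uv pip show` above).
try:
    version = metadata.version("maxtext-vllm-adapter")
    print(f"maxtext-vllm-adapter {version} is installed")
except metadata.PackageNotFoundError:
    print("maxtext-vllm-adapter is not installed")
```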

If the plugin is not installed, re-run the post-training extra dependencies installation script:

```bash
install_maxtext_tpu_post_train_extra_deps
```
## Offline Inference

We include a script for convenient offline inference of MaxText models in `src/maxtext/inference/vllm_decode.py`. This is helpful for verifying the correctness of MaxText checkpoints. The script invokes the [`LLM`](https://docs.vllm.ai/en/latest/serving/offline_inference/#offline-inference) API from vLLM.

> **_NOTE:_**
> You will need to convert a checkpoint from HuggingFace in order to run the command. Do so first by following the steps in the [convert checkpoint](https://maxtext.readthedocs.io/en/latest/guides/checkpointing_solutions/convert_checkpoint.html) tutorial.

> **_NOTE:_**
> The remainder of this tutorial assumes that the path to the converted MaxText checkpoint is stored in \$CHECKPOINT_PATH.

An example of how to run this script can be found below:

```bash
python3 -m maxtext.inference.vllm_decode src/maxtext/configs/base.yml \
  model_name=qwen3-30b-a3b \
  tokenizer_path=Qwen/Qwen3-30B-A3B \
  load_parameters_path=$CHECKPOINT_PATH \
  vllm_hf_overrides='{architectures: ["MaxTextForCausalLM"]}' \
  ici_tensor_parallelism=8 \
  enable_dp_attention=True \
  hbm_utilization_vllm=0.5 \
  prompt="Suggest some famous landmarks in London." \
  decode_sampling_temperature=0.0 \
  decode_sampling_nucleus_p=1.0 \
  decode_sampling_top_k=0.0 \
  use_chat_template=True
```

In the command above we pass the `vllm_hf_overrides='{architectures: ["MaxTextForCausalLM"]}'` argument, which tells vLLM to use the MaxText implementation of the target model architecture.

(online-inference)=

## Online Inference

We can also run online inference (an inference server) for a MaxText model by using the [`vllm serve`](https://docs.vllm.ai/en/stable/cli/serve/) API. To invoke this with a MaxText model, we provide the following additional arguments:

```bash
# --hf-overrides specifies that the MaxText model architecture should be used.
--hf-overrides "{\"architectures\": [\"MaxTextForCausalLM\"]}"

# --additional-config passes in "maxtext_config", which contains overrides used to initialize the model.
--additional-config "{\"maxtext_config\": {\"model_name\": \"qwen3-235b-a22b\", \"log_config\": false, \"load_parameters_path\": \"$CHECKPOINT_PATH\"}}"
```
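
Because these flag values are JSON embedded in shell quoting, it is easy to drop a brace or an escape. As a sanity-check technique, you can build the same values with `json.dumps`; the model name and the `$CHECKPOINT_PATH` placeholder below are taken from the example above:

```python
import json

# Build the vLLM flag values programmatically so the brace nesting and
# quoting stay correct. "$CHECKPOINT_PATH" is left as a literal
# placeholder for the converted MaxText checkpoint path.
hf_overrides = json.dumps({"architectures": ["MaxTextForCausalLM"]})
additional_config = json.dumps(
    {
        "maxtext_config": {
            "model_name": "qwen3-235b-a22b",
            "log_config": False,
            "load_parameters_path": "$CHECKPOINT_PATH",
        }
    }
)
print(f"--hf-overrides '{hf_overrides}' --additional-config '{additional_config}'")
```

Note that the `maxtext_config` object is nested inside the outer object, so the `--additional-config` value must end with two closing braces.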

An example of how to run `vllm serve` can be found below:

```bash
vllm serve Qwen/Qwen3-30B-A3B \
  --seed 42 \
  --max-model-len=5120 \
  --gpu-memory-utilization 0.8 \
  --no-enable-prefix-caching \
  --disable-log-requests \
  --tensor-parallel-size 4 \
  --data-parallel-size 2 \
  --max-num-batched-tokens 4096 \
  --max-num-seqs 128 \
  --hf-overrides "{\"architectures\": [\"MaxTextForCausalLM\"]}" \
  --additional-config "{\"maxtext_config\": {\"model_name\": \"qwen3-30b-a3b\", \"log_config\": false, \"load_parameters_path\": \"$CHECKPOINT_PATH\"}}"
```

In a separate bash shell, you can send a request to this server by running the following:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-30B-A3B",
    "prompt": ["Suggest some famous landmarks in London."],
    "max_tokens": 4096,
    "temperature": 0
  }'
```
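
The same request can be sent from Python with only the standard library. This is a hedged sketch that assumes the `vllm serve` process from the previous step is listening on `localhost:8000`:

```python
import json
from urllib import request

# Mirror the curl request above; the model name and prompt match the
# serving example, and the server address is assumed to be localhost:8000.
payload = {
    "model": "Qwen/Qwen3-30B-A3B",
    "prompt": ["Suggest some famous landmarks in London."],
    "max_tokens": 4096,
    "temperature": 0,
}
req = request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
try:
    with request.urlopen(req, timeout=60) as resp:
        print(json.load(resp)["choices"][0]["text"])
except OSError as exc:  # e.g. connection refused if the server is not up
    print(f"request failed: {exc}")
```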

## Reinforcement Learning (RL)

> **_NOTE:_**
> Please refer to the [reinforcement learning tutorial](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/rl.html) to get started with reinforcement learning on MaxText.

> **_NOTE:_**
> In addition to a MaxText model checkpoint, you will need a HuggingFace token to run this command. Please see the following [guide](https://huggingface.co/docs/hub/en/security-tokens) to generate one.

To use a MaxText model architecture for samplers in reinforcement learning algorithms like GRPO, we can override the vLLM model architecture and pass in MaxText-specific config arguments, similar to the [online inference](online-inference) use case. An example of an RL command using the MaxText model for samplers can be found below:

```bash
python3 -m src.maxtext.trainers.post_train.rl.train_rl src/maxtext/configs/post_train/rl.yml \
  model_name=qwen3-0.6b \
  tokenizer_path=Qwen/Qwen3-0.6B \
  run_name=$WORKLOAD \
  base_output_directory=$OUTPUT_PATH \
  hf_access_token=$HF_TOKEN \
  batch_size=4 \
  num_batches=5 \
  scan_layers=True \
  hbm_utilization_vllm=0.4 \
  rollout_data_parallelism=2 \
  rollout_tensor_parallelism=4 \
  allow_split_physical_axes=true \
  load_parameters_path=$CHECKPOINT_PATH \
  vllm_hf_overrides='{architectures: ["MaxTextForCausalLM"]}' \
  vllm_additional_config='{"maxtext_config": {"model_name": "qwen3-0.6b", "log_config": "false"}}'
```
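
Note that `vllm_additional_config` is single-quoted in the shell, so the inner double quotes need no backslash escaping and the value itself is plain JSON. A quick sanity check of the value used above:

```python
import json

# The vllm_additional_config value from the RL command above, verbatim.
raw = '{"maxtext_config": {"model_name": "qwen3-0.6b", "log_config": "false"}}'
cfg = json.loads(raw)
print(cfg["maxtext_config"]["model_name"])
```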