
Commit 16d567f

Merge branch 'main' of github.com:AI-Hypercomputer/maxtext into shuningjin-qwix1
2 parents 17800bf + ca7e2df commit 16d567f

39 files changed

Lines changed: 563 additions & 199 deletions

.github/workflows/run_jupyter_notebooks.yml

Lines changed: 0 additions & 2 deletions
@@ -64,8 +64,6 @@ jobs:
 
 # 2. Install MaxText package and all the post training dependencies
 uv pip install ${maxtext_wheel}[tpu-post-train] --resolution=lowest
-#TODO: @mazumdera: replace this with the following after release
-# uv pip install maxtext[tpu-post-train] --resolution=lowest
 install_maxtext_tpu_post_train_extra_deps
 .venv/bin/python3 -m ipykernel install --user --name maxtext_venv
 

PREFLIGHT.md

Lines changed: 4 additions & 4 deletions
@@ -7,12 +7,12 @@ Before you run ML workload on Multihost with GCE or GKE, simply apply `bash pref
 
 Here is an example for GCE:
 ```
-bash preflight.sh PLATFORM=GCE && python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml run_name=${YOUR_JOB_NAME?}
+bash preflight.sh PLATFORM=GCE && python3 -m maxtext.trainers.pre_train.train run_name=${YOUR_JOB_NAME?}
 ```
 
 Here is an example for GKE:
 ```
-bash preflight.sh PLATFORM=GKE && python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml run_name=${YOUR_JOB_NAME?}
+bash preflight.sh PLATFORM=GKE && python3 -m maxtext.trainers.pre_train.train run_name=${YOUR_JOB_NAME?}
 ```
 
 # Optimization 2: Numa binding (You can only apply this to v4 and v5p)
@@ -22,14 +22,14 @@ For GCE,
 [preflight.sh](https://github.com/google/maxtext/blob/main/preflight.sh) will help you install `numactl` dependency, so you can use it directly, here is an example:
 
 ```
-bash preflight.sh PLATFORM=GCE && numactl --membind 0 --cpunodebind=0 python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml run_name=${YOUR_JOB_NAME?}
+bash preflight.sh PLATFORM=GCE && numactl --membind 0 --cpunodebind=0 python3 -m maxtext.trainers.pre_train.train run_name=${YOUR_JOB_NAME?}
 ```
 
 For GKE,
 `numactl` should be built into your docker image from [maxtext_tpu_dependencies.Dockerfile](https://github.com/google/maxtext/blob/main/src/dependencies/dockerfiles/maxtext_tpu_dependencies.Dockerfile), so you can use it directly if you built the maxtext docker image. Here is an example
 
 ```
-bash preflight.sh PLATFORM=GKE && numactl --membind 0 --cpunodebind=0 python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml run_name=${YOUR_JOB_NAME?}
+bash preflight.sh PLATFORM=GKE && numactl --membind 0 --cpunodebind=0 python3 -m maxtext.trainers.pre_train.train run_name=${YOUR_JOB_NAME?}
 ```
 
 1. `numactl`: This is the command-line tool used for controlling NUMA policy for processes or shared memory. It's particularly useful on multi-socket systems where memory locality can impact performance.
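
Before choosing `--membind`/`--cpunodebind` values, it can help to confirm how many NUMA nodes the VM actually exposes. A minimal sketch, assuming `numactl` is already installed (for example by `preflight.sh`); the node index `0` mirrors the examples above and should be adjusted to your topology:

```bash
# Inspect the NUMA topology: node count, CPUs per node, and memory per node.
numactl --hardware

# Show the NUMA policy that would apply to child processes of this shell.
numactl --show

# Bind memory and CPUs to node 0 for the training process, as in the examples above.
numactl --membind 0 --cpunodebind=0 python3 -m maxtext.trainers.pre_train.train run_name=${YOUR_JOB_NAME?}
```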

docs/guides/checkpointing_solutions/convert_checkpoint.md

Lines changed: 3 additions & 3 deletions
@@ -70,7 +70,7 @@ Finally, run below command to complete the conversion
 # Optional: If run out of disk space when downloading HuggingFace safetensors,
 # customize your "HF_HOME" to redirect the cache to a larger or mounted disk (e.g., on a TPU VM).
 # export HF_HOME="/dev/shm/huggingface_tmp"
-python3 -m maxtext.checkpoint_conversion.to_maxtext maxtext/configs/base.yml \
+python3 -m maxtext.checkpoint_conversion.to_maxtext \
 model_name=${MODEL_NAME?} \
 hf_access_token=${HF_TOKEN?} \
 base_output_directory=${MODEL_CHECKPOINT_DIRECTORY?} \
@@ -108,7 +108,7 @@ Use the `to_huggingface.py` script to convert a MaxText checkpoint into the Hugg
 The following command converts a MaxText checkpoint and saves it locally, to GCS, or uploads it directly to the Hugging Face Hub.
 
 ```bash
-python3 -m maxtext.checkpoint_conversion.to_huggingface src/maxtext/configs/base.yml \
+python3 -m maxtext.checkpoint_conversion.to_huggingface \
 model_name=<MODEL_NAME> \
 load_parameters_path=<path-to-maxtext-checkpoint> \
 base_output_directory=<path-to-save-converted-checkpoint> \
@@ -221,7 +221,7 @@ To extend conversion support to a new model architecture, you must define its sp
 - In [`utils/param_mapping.py`](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/MaxText/checkpoint_conversion/utils/param_mapping.py), add the `hook_fn` logic (`def {MODEL}_MAXTEXT_TO_HF_PARAM_HOOK_FN`). This is the transformation needed per layer.
 
 2. **Add Hugging Face weights Shape**: In [`utils/hf_shape.py`](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/MaxText/checkpoint_conversion/utils/hf_shape.py), define the tensor shape of Hugging Face format (`def {MODEL}_HF_WEIGHTS_TO_SHAPE`). This is used to ensure the tensor shape is matched after to_huggingface conversion.
-3. **Register model key**: In [`utils/utils.py`](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/MaxText/checkpoint_conversion/utils/utils.py), add the new model key in `HF_IDS`.
+3. **Register model key**: In [`utils/utils.py`](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/utils/globals.py), add the new model key in `HF_IDS`.
 4. **Add transformer config**: In [`utils/hf_model_configs.py`](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/MaxText/checkpoint_conversion/utils/hf_model_configs.py), add the `transformers.Config` object, describing the Hugging Face model configuration (defined in [`src/maxtext/configs/models`](https://github.com/AI-Hypercomputer/maxtext/tree/main/src/maxtext/configs/models)). **Note**: This configuration must precisely match the MaxText model's architecture.
 
 Here is an example [PR to add support for gemma3 multi-modal model](https://github.com/AI-Hypercomputer/maxtext/pull/1983)

docs/guides/run_python_notebook.md

Lines changed: 2 additions & 2 deletions
@@ -103,7 +103,7 @@ To install, click the `Extensions` icon on the left sidebar (or press `Ctrl+Shif
 
 ### Step 4: Install MaxText and Dependencies
 
-To execute post-training notebooks on your TPU-VM, follow the official [MaxText installation guides](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/rl.html#create-virtual-environment-and-install-maxtext-dependencies) to install MaxText and its dependencies inside a dedicated virtual environment.
+To execute post-training notebooks on your TPU-VM, follow the official [MaxText installation guides](https://maxtext.readthedocs.io/en/latest/install_maxtext.html#from-source) and specifically follow `Option 3: Installing [tpu-post-train]`. This will ensure all post-training dependencies are installed inside your virtual environment.
 
 ### Step 5: Install the necessary library for Jupyter
 
@@ -162,7 +162,7 @@ pip3 install jupyterlab
 
 ### Step 4: Install MaxText and Dependencies
 
-To execute post-training notebooks on your TPU-VM, follow the official [MaxText installation guides](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/rl.html#create-virtual-environment-and-install-maxtext-dependencies) to install MaxText and its dependencies inside a dedicated virtual environment.
+To execute post-training notebooks on your TPU-VM, follow the official [MaxText installation guides](https://maxtext.readthedocs.io/en/latest/install_maxtext.html#from-source) and specifically follow `Option 3: Installing [tpu-post-train]`. This will ensure all post-training dependencies are installed inside your virtual environment.
 
 ### Step 5: Register virtual environment as a Jupyter Kernel
 
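
For Step 5 (registering the virtual environment as a Jupyter kernel), the pattern matches what the `run_jupyter_notebooks.yml` workflow above does in CI. A minimal sketch, assuming the venv created during installation is activated; the kernel name `maxtext_venv` is arbitrary:

```bash
# Install ipykernel inside the activated virtual environment, then register it
# as a named Jupyter kernel so notebooks can select it.
pip install ipykernel
python3 -m ipykernel install --user --name maxtext_venv
```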

docs/install_maxtext.md

Lines changed: 22 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -24,6 +24,7 @@ MaxText offers three installation modes:
2424
3. maxtext[tpu-post-train]. Used for post-training on TPUs. Currently, this option should also be used for running vllm_decode on TPUs.
2525

2626
## From PyPI (Recommended)
27+
2728
This is the easiest way to get started with the latest stable version.
2829

2930
```bash
@@ -38,24 +39,26 @@ source maxtext_venv/bin/activate
3839
# installation option from this list to fit your use case.
3940

4041
# Option 1: Installing maxtext[tpu]
41-
uv pip install maxtext[tpu] --resolution=lowest
42+
uv pip install "maxtext[tpu]>=0.2.0" --resolution=lowest
4243
install_maxtext_tpu_github_deps
4344

4445
# Option 2: Installing maxtext[cuda12]
45-
uv pip install maxtext[cuda12] --resolution=lowest
46+
uv pip install "maxtext[cuda12]>=0.2.0" --resolution=lowest
4647
install_maxtext_cuda12_github_dep
4748

4849
# Option 3: Installing maxtext[tpu-post-train]
49-
uv pip install maxtext[tpu-post-train] --resolution=lowest
50+
uv pip install "maxtext[tpu-post-train]>=0.2.0" --resolution=lowest
5051
install_maxtext_tpu_post_train_extra_deps
5152
```
53+
5254
> **Note:** The `install_maxtext_tpu_github_deps`, `install_maxtext_cuda12_github_dep`, and
53-
`install_maxtext_tpu_post_train_extra_deps` commands are temporarily required to install dependencies directly from GitHub
54-
that are not yet available on PyPI. As shown above, choose the one that corresponds to your use case.
55+
> `install_maxtext_tpu_post_train_extra_deps` commands are temporarily required to install dependencies directly from GitHub
56+
> that are not yet available on PyPI. As shown above, choose the one that corresponds to your use case.
5557
5658
> **Note:** The maxtext package contains a comprehensive list of all direct and transitive dependencies, with lower bounds, generated by [seed-env](https://github.com/google-ml-infra/actions/tree/main/python_seed_env). We highly recommend the `--resolution=lowest` flag. It instructs `uv` to install the specific, tested versions of dependencies defined by MaxText, rather than the latest available ones. This ensures a consistent and reproducible environment, which is critical for stable performance and for running benchmarks.
5759
5860
## From Source
61+
5962
If you plan to contribute to MaxText or need the latest unreleased features, install from source.
6063

6164
```bash
@@ -98,11 +101,11 @@ Please keep dependencies updated throughout development. This will allow each co
98101

99102
To update dependencies, you will follow these general steps:
100103

101-
1. **Modify Base Requirements**: Update the desired dependencies in `base_requirements/requirements.txt` or the hardware-specific files (`base_requirements/tpu-base-requirements.txt`, `base_requirements/gpu-base-requirements.txt`).
102-
2. **Generate New Files**: Run the `seed-env` CLI tool to generate new, fully-pinned requirements files based on your changes.
103-
3. **Update Project Files**: Copy the newly generated files into the `generated_requirements/` directory.
104-
4. **Handle GitHub Dependencies**: Move any dependencies that are installed directly from GitHub from the generated files to `src/install_maxtext_extra_deps/extra_deps_from_github.txt`.
105-
5. **Verify**: Test the new dependencies to ensure the project installs and runs correctly.
104+
1. **Modify Base Requirements**: Update the desired dependencies in `base_requirements/requirements.txt` or the hardware-specific files (`base_requirements/tpu-base-requirements.txt`, `base_requirements/gpu-base-requirements.txt`).
105+
2. **Generate New Files**: Run the `seed-env` CLI tool to generate new, fully-pinned requirements files based on your changes.
106+
3. **Update Project Files**: Copy the newly generated files into the `generated_requirements/` directory.
107+
4. **Handle GitHub Dependencies**: Move any dependencies that are installed directly from GitHub from the generated files to `src/install_maxtext_extra_deps/extra_deps_from_github.txt`.
108+
5. **Verify**: Test the new dependencies to ensure the project installs and runs correctly.
106109

107110
The following sections provide detailed instructions for each step.
108111

@@ -154,25 +157,26 @@ seed-env \
154157

155158
After generating the new requirements, you need to update the files in the MaxText repository.
156159

157-
1. **Copy the generated files:**
158-
- Move `generated_tpu_artifacts/tpu-requirements.txt` to `generated_requirements/tpu-requirements.txt`.
159-
- Move `generated_gpu_artifacts/cuda12-requirements.txt` to `generated_requirements/cuda12-requirements.txt`.
160+
1. **Copy the generated files:**
161+
162+
- Move `generated_tpu_artifacts/tpu-requirements.txt` to `generated_requirements/tpu-requirements.txt`.
163+
- Move `generated_gpu_artifacts/cuda12-requirements.txt` to `generated_requirements/cuda12-requirements.txt`.
160164

161-
2. **Update `extra_deps_from_github.txt` (if necessary):**
162-
Currently, MaxText uses a few dependencies, such as `mlperf-logging` and `google-jetstream`, that are installed directly from GitHub source. These are defined in `base_requirements/requirements.txt`, and the `seed-env` tool will carry them over to the generated requirements files.
165+
2. **Update `extra_deps_from_github.txt` (if necessary):**
166+
Currently, MaxText uses a few dependencies, such as `mlperf-logging` and `google-jetstream`, that are installed directly from GitHub source. These are defined in `base_requirements/requirements.txt`, and the `seed-env` tool will carry them over to the generated requirements files.
163167

164168
## Step 5: Verify the New Dependencies
165169

166170
Finally, test that the new dependencies install correctly and that MaxText runs as expected.
167171

168-
1. **Create a clean environment:** It's best to start with a fresh Python virtual environment.
172+
1. **Create a clean environment:** It's best to start with a fresh Python virtual environment.
169173

170174
```bash
171175
uv venv --python 3.12 --seed maxtext_venv
172176
source maxtext_venv/bin/activate
173177
```
174178

175-
2. **Run the setup script:** Execute `bash setup.sh` to install the new dependencies.
179+
2. **Run the setup script:** Execute `bash setup.sh` to install the new dependencies.
176180

177181
```bash
178182
pip install uv
@@ -183,4 +187,4 @@ uv pip install -e .[tpu] --resolution=lowest
183187
install_maxtext_github_deps
184188
```
185189

186-
3. **Run tests:** Run MaxText tests to ensure there are no regressions.
190+
3. **Run tests:** Run MaxText tests to ensure there are no regressions.
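
For step 3, a convenient smoke test is the short synthetic-data run documented in `run_maxtext_localhost.md` (also updated in this commit). A minimal sketch, assuming a TPU host and a writable GCS bucket; the run name and bucket are placeholders:

```bash
# Fresh environment and editable install, following the steps above (TPU extra shown).
uv venv --python 3.12 --seed maxtext_venv
source maxtext_venv/bin/activate
pip install uv
uv pip install -e .[tpu] --resolution=lowest
install_maxtext_github_deps

# Smoke test: a 10-step training run on synthetic data.
python3 -m maxtext.trainers.pre_train.train \
  run_name=${YOUR_JOB_NAME?} \
  base_output_directory=gs://<my-bucket> \
  dataset_type=synthetic \
  steps=10
```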

docs/run_maxtext/run_maxtext_localhost.md

Lines changed: 4 additions & 4 deletions
@@ -58,7 +58,7 @@ bash tools/setup/setup.sh DEVICE={tpu|gpu}
 After the installation is complete, run a short training job using synthetic data to confirm everything is working correctly. This command trains a model for just 10 steps. Remember to replace `$YOUR_JOB_NAME` with a unique name for your run and `gs://<my-bucket>` with the path to the GCS bucket you configured in the prerequisites.
 
 ```bash
-python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
+python3 -m maxtext.trainers.pre_train.train \
 run_name=${YOUR_JOB_NAME?} \
 base_output_directory=gs://<my-bucket> \
 dataset_type=synthetic \
@@ -72,7 +72,7 @@ python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
 To demonstrate model output, run the following command:
 
 ```bash
-python3 -m maxtext.inference.decode src/maxtext/configs/base.yml \
+python3 -m maxtext.inference.decode \
 run_name=${YOUR_JOB_NAME?} \
 base_output_directory=gs://<my-bucket> \
 per_device_batch_size=1
@@ -92,7 +92,7 @@ To use a pre-configured model for TPUs, you override the `model_name` parameter,
 <summary><strong>llama3-8b (TPU)</strong></summary>
 
 ```bash
-python3 -m maxtext.trainers.pre_train.train maxtext/configs/base.yml \
+python3 -m maxtext.trainers.pre_train.train \
 model_name=llama3-8b \
 run_name=${YOUR_JOB_NAME?} \
 base_output_directory=gs://<my-bucket> \
@@ -106,7 +106,7 @@ python3 -m maxtext.trainers.pre_train.train maxtext/configs/base.yml \
 <summary><strong>qwen3-4b (TPU)</strong></summary>
 
 ```bash
-python3 -m maxtext.trainers.pre_train.train maxtext/configs/base.yml \
+python3 -m maxtext.trainers.pre_train.train \
 model_name=qwen3-4b \
 run_name=${YOUR_JOB_NAME?} \
 base_output_directory=gs://<my-bucket> \

docs/run_maxtext/run_maxtext_single_host_gpu.md

Lines changed: 1 addition & 1 deletion
@@ -148,7 +148,7 @@ Hardware: GPU
 ```
 
 ```bash
-python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml run_name=gpu01 base_output_directory=/deps/output \
+python3 -m maxtext.trainers.pre_train.train run_name=gpu01 base_output_directory=/deps/output \
 dataset_type=synthetic enable_checkpointing=True steps=10 attention=cudnn_flash_te scan_layers=False \
 use_iota_embed=True hardware=gpu per_device_batch_size=12
 ```

docs/run_maxtext/run_maxtext_via_multihost_job.md

Lines changed: 1 addition & 1 deletion
@@ -68,7 +68,7 @@ The `multihost_job.py` script:
 
 ```sh
 RUN_NAME=${YOUR_JOB_NAME?} # You may set this to any unique name for a fresh run.
-python3 multihost_job.py --NUM_SLICES=${NODE_COUNT?} --RUN_NAME=${RUN_NAME?} --BUCKET_NAME=${BUCKET_NAME?} --CQR_EXTRA_ARGS="--reserved" --COMMAND="bash tools/setup/setup.sh && python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml run_name=${RUN_NAME?}"
+python3 multihost_job.py --NUM_SLICES=${NODE_COUNT?} --RUN_NAME=${RUN_NAME?} --BUCKET_NAME=${BUCKET_NAME?} --CQR_EXTRA_ARGS="--reserved" --COMMAND="bash tools/setup/setup.sh && python3 -m maxtext.trainers.pre_train.train run_name=${RUN_NAME?}"
 ```
 
 We tell `multihost_job` to target the `reserved` pool by including `--reserved` as extra arguments to the CQR request, but you may instead target the `on-demand` pool by removing the `--CQR_EXTRA_ARGS` flag (on-demand is default), or the pre-emptible pool with `--CQR_EXTRA_ARGS="--best-effort"`, which may be necessary if your reservation is full.
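
As the paragraph above describes, switching capacity pools only changes the CQR extra args. For example, the same launch against the pre-emptible pool would look like this (sketch; all other flags unchanged):

```sh
# Request pre-emptible (best-effort) capacity instead of the reserved pool.
python3 multihost_job.py --NUM_SLICES=${NODE_COUNT?} --RUN_NAME=${RUN_NAME?} --BUCKET_NAME=${BUCKET_NAME?} --CQR_EXTRA_ARGS="--best-effort" --COMMAND="bash tools/setup/setup.sh && python3 -m maxtext.trainers.pre_train.train run_name=${RUN_NAME?}"
```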

docs/run_maxtext/run_maxtext_via_multihost_runner.md

Lines changed: 1 addition & 1 deletion
@@ -106,7 +106,7 @@ Although there are several steps below, most are for the initial setup. Once set
 Set config values for `base_output_directory` and `dataset_path` in `configs/base.yml` if not set already.
 
 ```
-python3 multihost_runner.py --TPU_PREFIX=${TPU_PREFIX?} --COMMAND="python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml run_name=${RUN_NAME?}"
+python3 multihost_runner.py --TPU_PREFIX=${TPU_PREFIX?} --COMMAND="python3 -m maxtext.trainers.pre_train.train run_name=${RUN_NAME?}"
 ```
 
 If you are running the `multihost_runner.py` script from a TPUVM, you will need to set `--INTERNAL_IP=true`.

docs/run_maxtext/run_maxtext_via_pathways.md

Lines changed: 2 additions & 2 deletions
@@ -96,7 +96,7 @@ xpk workload create-pathways \
 --project=${PROJECT?} \
 --zone=${ZONE?} \
 --docker-image=${DOCKER_IMAGE?} \
---command="python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
+--command="python3 -m maxtext.trainers.pre_train.train \
 base_output_directory=gs://${BUCKET_NAME?} \
 per_device_batch_size=1 \
 enable_checkpointing=false \
@@ -154,7 +154,7 @@ export JAX_PLATFORMS=proxy
 export JAX_BACKEND_TARGET=grpc://127.0.0.1:29000
 
 # Run the training script
-python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
+python3 -m maxtext.trainers.pre_train.train \
 base_output_directory=gs://${BUCKET_NAME?} \
 per_device_batch_size=1 \
 enable_checkpointing=false \
